ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009




ECLIPSE Best Practices: Performance, Productivity, Efficiency. March 2009

ECLIPSE Performance, Productivity, Efficiency
The following research was performed under the HPC Advisory Council activities, at the HPC Advisory Council Cluster Center
Special thanks to AMD, Dell, Mellanox and Schlumberger; in particular to Joshua Mora (AMD), Jacob Liberman (Dell), Gilad Shainer and Tong Liu (Mellanox Technologies), and Owen Brazell (Schlumberger)
For more information please refer to http://hpcadvisorycouncil.mellanox.com/

The HPC Advisory Council
A distinguished HPC alliance (OEMs, IHVs, ISVs, end-users) with more than 60 members worldwide:
1st-tier OEMs (AMD, Dell, HP, Intel, Sun) and systems integrators across the world
Leading OSVs and ISVs (LSTC, Microsoft, Schlumberger, Wolfram, etc.)
Strategic end-users (Total, Fermi Lab, ORNL, OSU, VPAC, etc.)
Storage vendors (DDN, IBRIX, LSI, Panasas, etc.)
The Council mission:
Bridge the gap between high-performance computing (HPC) use and its potential
Bring the beneficial capabilities of HPC to new users for better research, education, innovation and product manufacturing
Bring users the expertise needed to operate HPC systems
Provide application designers with the tools needed to enable parallel computing
Strengthen the qualification and integration of HPC system products

HPC Advisory Council Activities
Network of Expertise
Best Practices: Oil and Gas, Automotive, Bioscience, Weather, CFD, Quantum Chemistry, and more
HPC Outreach and Education
Cluster Center: end-user applications benchmarking center
HPC Technology Demonstration

ECLIPSE Best Practices Objectives
Ways to improve performance, productivity and efficiency
Knowledge, expertise, usage models
The following presentation reviews:
ECLIPSE performance benchmarking
Interconnect performance comparisons
ECLIPSE productivity
Understanding ECLIPSE communication patterns
Power-aware simulations

Test Cluster Configuration
Dell PowerEdge SC 1435 24-node cluster
Dual-socket 1U rack server supporting x8 PCIe and 800MHz DDR2 DIMMs
Energy-efficient building block for scalable and productive high performance computing
Quad-Core AMD Opteron Model 2382 processors ("Shanghai")
Industry-leading technology that delivers performance and power efficiency
Opteron processors provide up to 21 GB/s peak bandwidth per processor
Mellanox InfiniBand ConnectX DDR HCAs and Mellanox InfiniBand DDR switch
High-performance I/O consolidation interconnect: transport offload, HW reliability, QoS
Supports up to 40Gb/s, 1usec latency, high message rate, low CPU overhead and zero scalable latency
Application: Schlumberger ECLIPSE Simulators 2008.2
ECLIPSE is the market-leading reservoir simulator and has become the de-facto standard for reservoir simulation in the oil and gas industry
Benchmark workload: 4 million cell model (248 2 1), Blackoil 3-phase model with ~8 wells

ECLIPSE Performance - Interconnect
InfiniBand enables the highest performance and scalability; performance accelerates with cluster size
Performance over GigE and 10GigE does not scale; a slowdown occurs beyond 8 nodes
[Chart: Schlumberger ECLIPSE (FOURMILL) elapsed time in seconds (lower is better) vs. number of nodes (4-24) for GigE, 10GigE and InfiniBand; single job per cluster size]
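To make the scaling claim concrete, here is a minimal sketch in C (not from the slides) of how elapsed times at each node count translate into speedup and parallel efficiency; the elapsed-time values are dummy placeholders, not the measured data behind the chart.

```c
#include <stdio.h>

int main(void)
{
    /* Node counts from the benchmark sweep above; the elapsed times are
     * dummy placeholders -- substitute the measured seconds per run. */
    const int    nodes[]   = {4, 8, 12, 16, 20, 22, 24};
    const double elapsed[] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0};
    const int n = (int)(sizeof(nodes) / sizeof(nodes[0]));

    for (int i = 0; i < n; i++) {
        double speedup    = elapsed[0] / elapsed[i];        /* vs. the 4-node run   */
        double efficiency = speedup * nodes[0] / nodes[i];  /* 1.0 = perfect scaling */
        printf("%2d nodes: speedup %.2fx, parallel efficiency %.0f%%\n",
               nodes[i], speedup, efficiency * 100.0);
    }
    return 0;
}
```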

ECLIPSE Productivity
By allowing multiple jobs to run simultaneously, ECLIPSE productivity can be increased
Maximum gain is achieved with one job per core; maximum percentage improvement with one job per two cores
Three cases are presented:
A single job over the entire system
Four jobs, each on two cores per CPU per server
Eight jobs, each on one CPU core per server
Eight jobs per node increases productivity by up to 142%
[Chart: Schlumberger ECLIPSE (FOURMILL) number of jobs (higher is better) vs. number of nodes (8-24) for 1, 4 and 8 jobs per node, over InfiniBand]
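As a rough illustration of the productivity metric (jobs completed per day), here is a minimal sketch with placeholder runtimes; these are not the measured values behind the chart.

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder inputs -- substitute measured elapsed times per case. */
    double t_single = 1.0;      /* seconds per job, one job using all cores      */
    double t_shared = 1.0;      /* seconds per job, eight jobs sharing one node  */
    int    jobs_per_node = 8;

    double jobs_per_day_single = 86400.0 / t_single;
    double jobs_per_day_shared = jobs_per_node * (86400.0 / t_shared);
    double improvement = 100.0 * (jobs_per_day_shared / jobs_per_day_single - 1.0);

    printf("single job : %.1f jobs/day\n", jobs_per_day_single);
    printf("8 per node : %.1f jobs/day (%.0f%% improvement)\n",
           jobs_per_day_shared, improvement);
    return 0;
}
```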

ECLIPSE Productivity - Interconnect Comparison
Productivity improvement depends on the cluster interconnect
As the number of jobs increases, I/O becomes the system bottleneck
InfiniBand demonstrates scalable productivity compared to Ethernet
GigE shows a performance decrease beyond 8 nodes
10GigE demonstrates no scaling beyond 16 nodes
[Chart: Schlumberger ECLIPSE (FOURMILL) number of jobs (higher is better) vs. number of nodes (4-22) for GigE, 10GigE and InfiniBand; 4 jobs on each node]

ECLIPSE Profiling - Data Transferred
MPI message profiling is used to determine the application's sensitivity points, for both send and receive
Sensitivity points:
8KB-256KB messages relate to data communications
<128B messages relate to data synchronizations
[Charts: ECLIPSE MPI profiling, number of MPI_Isend and MPI_Recv messages (millions) per message-size bin ([..128B], [128B..1KB], [1..8KB], [8..256KB], [256KB..1MB], [1MB..infinity]) for 4, 8, 12, 16, 20 and 24 nodes]
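The slides do not say which profiling tool produced these histograms. A minimal PMPI interposition sketch along the following lines could collect the same kind of per-size message counts for MPI_Isend and MPI_Recv (bin edges chosen to match the chart; MPI-3 const signatures assumed; the file name msgprof.c and build line are illustrative only).

```c
/* msgprof.c - intercept MPI_Isend/MPI_Recv via the PMPI profiling interface,
 * bucket message sizes, and print totals at MPI_Finalize.
 * Build alongside the application, e.g.: mpicc app.c msgprof.c -o app */
#include <mpi.h>
#include <stdio.h>

#define NBINS 6
static long long send_hist[NBINS], recv_hist[NBINS];

/* Bin edges matching the chart: <=128B, <=1KB, <=8KB, <=256KB, <=1MB, larger */
static int bucket(long long bytes)
{
    if (bytes <= 128)     return 0;
    if (bytes <= 1024)    return 1;
    if (bytes <= 8192)    return 2;
    if (bytes <= 262144)  return 3;
    if (bytes <= 1048576) return 4;
    return 5;
}

int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
              int tag, MPI_Comm comm, MPI_Request *req)
{
    int tsize;
    MPI_Type_size(type, &tsize);
    send_hist[bucket((long long)count * tsize)]++;
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    int tsize;
    MPI_Type_size(type, &tsize);
    recv_hist[bucket((long long)count * tsize)]++;
    return PMPI_Recv(buf, count, type, source, tag, comm, status);
}

int MPI_Finalize(void)
{
    long long send_tot[NBINS], recv_tot[NBINS];
    int rank;

    /* Sum the per-rank histograms onto rank 0 before shutting MPI down. */
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    PMPI_Reduce(send_hist, send_tot, NBINS, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    PMPI_Reduce(recv_hist, recv_tot, NBINS, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        static const char *bins[NBINS] =
            {"..128B", "128B..1KB", "1..8KB", "8..256KB", "256KB..1MB", "1MB.."};
        for (int i = 0; i < NBINS; i++)
            printf("%-10s  MPI_Isend %lld  MPI_Recv %lld\n",
                   bins[i], send_tot[i], recv_tot[i]);
    }
    return PMPI_Finalize();
}
```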

ECLIPSE Profiling - Message Distribution
The majority of MPI messages are large; their percentage slightly increases with cluster size, demonstrating the need for the highest throughput
[Chart: ECLIPSE MPI profiling, percentage of messages vs. number of nodes (4-24) for MPI_Isend/MPI_Recv messages < 128 Bytes and < 256 KB]

Interconnect Usage by ECLIPSE
Total server throughput increases rapidly with cluster size; this data is per-node based
[Charts: data sent (MB/s) over time (seconds) per node, for 4, 8, 16 and 24 nodes]
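One way to reproduce this kind of per-node send-rate trace on Linux (an assumption, not necessarily the method used for these plots) is to sample the HCA's port_xmit_data counter in sysfs, which is maintained in 4-octet units; the device name mlx4_0 below is a placeholder.

```c
/* ib_rate.c - sample /sys/class/infiniband/<dev>/ports/1/counters/port_xmit_data
 * once per interval and report outgoing throughput in MB/s. */
#include <stdio.h>
#include <unistd.h>

static long long read_counter(const char *path)
{
    long long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%lld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    const char *path =
        "/sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data";
    const int interval = 1;  /* seconds between samples */

    long long prev = read_counter(path);
    while (prev >= 0) {
        sleep(interval);
        long long cur = read_counter(path);
        if (cur < 0)
            break;
        /* Counter is in 4-byte units; convert the delta to MB/s.
         * Note: on some HCAs this counter is 32-bit and may wrap. */
        double mbps = (double)(cur - prev) * 4.0 / (1024.0 * 1024.0) / interval;
        printf("%.1f MB/s sent\n", mbps);
        prev = cur;
    }
    return 0;
}
```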

Power Consumption/Productivity Comparison
By enabling higher productivity on a given system, power per job decreases
With InfiniBand, power-per-job consumption decreases by up to 66% vs. GigE and 33% vs. 10GigE, for the productivity case of 4 jobs per node
With a single-job approach, InfiniBand reduces power-per-job consumption by more than 82% compared to 10GigE
[Chart: power per job (Wh) for GigE, 10GigE and InfiniBand; 4 jobs on each node]
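A minimal sketch of the power-per-job arithmetic implied here: energy drawn by the cluster during a run divided by the number of jobs completed. All inputs are placeholders, not measurements from the slides.

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder inputs -- substitute measured values. */
    double watts_per_node = 300.0;  /* average node power draw          */
    int    nodes          = 24;
    double elapsed_hours  = 1.0;    /* wall time for the batch of jobs  */
    int    jobs_completed = 96;     /* e.g. 4 jobs per node on 24 nodes */

    double energy_wh  = watts_per_node * nodes * elapsed_hours;
    double wh_per_job = energy_wh / jobs_completed;

    printf("total energy   : %.0f Wh\n", energy_wh);
    printf("energy per job : %.1f Wh/job\n", wh_per_job);
    return 0;
}
```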

ECLIPSE Profiling Summary - Interconnect
ECLIPSE was profiled to determine its networking dependency
The majority of data transferred between compute nodes uses 8KB-256KB messages; the amount of data transferred increases with cluster size
Most used message sizes:
<128B messages, mainly for synchronization
8KB-256KB messages, for data transfer
Message size distribution:
The percentage of smaller messages (<128B) slightly decreases with cluster size
The percentage of mid-size messages (8KB-256KB) increases with cluster size
ECLIPSE interconnect sensitivity points:
Interconnect latency and throughput for the <256KB message range
As the node count increases, interconnect throughput becomes more critical

Quad-Core AMD Shanghai - Performance and Power Consumption
AMD Opteron Shanghai versus Opteron Barcelona:
Overall 20% performance increase, in both GFLOP/s and GB/s, while sustaining similar power consumption
Memory-throughput-intensive applications (such as ECLIPSE):
Performance is driven by north-bridge and DDR2 memory frequencies
Consume less energy than cache-friendly, compute-intensive applications
Power management schemes:
Deliver the highest application performance-per-Watt ratio
Critical for low power consumption of the platform while idle; core frequency drops to as low as 800 MHz
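To verify that idle cores are in fact being clocked down by the power-management scheme, one could read the Linux cpufreq sysfs interface; this is a generic check, not something described on the slide, and it assumes the cpufreq subsystem is enabled.

```c
/* cpufreq_check.c - report the current frequency of core 0 in MHz. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    long khz = 0;
    if (!f || fscanf(f, "%ld", &khz) != 1) {
        fprintf(stderr, "cpufreq interface not available\n");
        if (f) fclose(f);
        return 1;
    }
    fclose(f);
    /* Value is reported in kHz; an idle Shanghai core should drop toward 800000. */
    printf("core 0 frequency: %ld MHz\n", khz / 1000);
    return 0;
}
```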

Conclusions
ECLIPSE, developed by Schlumberger, is widely used to perform reservoir simulation
ECLIPSE performance and productivity rely on:
Scalable HPC systems and interconnect solutions
Low-latency and high-throughput interconnect technology
NUMA-aware application design for fast access to memory
Sensible job distribution can dramatically improve productivity, increasing the number of jobs per day while maintaining fast run times
The interconnect comparison shows:
InfiniBand delivers superior performance and productivity at every cluster size
Scalability requires low latency and zero scalable latency

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.