ECLIPSE Performance Benchmarks and Profiling January 2009
Note: The following research was performed under HPC Advisory Council activities by AMD, Dell, Mellanox, and Schlumberger at the HPC Advisory Council Cluster Center. For more information please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com, http://www.slb.com/
Schlumberger ECLIPSE
- Oil and gas reservoir simulation software developed by Schlumberger
- Offers multiple choices of numerical simulation techniques for accurate and fast simulation: black-oil, compositional, thermal, streamline, and others
- ECLIPSE supports MPI to achieve high performance and scalability (see the MPI sketch below)
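ECLIPSE itself is closed source, but the MPI pattern a distributed reservoir simulator builds on is the standard one for grid codes. The following is a minimal illustrative sketch, not ECLIPSE's actual code: a 1-D domain decomposition with a halo exchange between neighboring ranks (LOCAL_CELLS and the decomposition are arbitrary choices).

```c
/* Minimal sketch of a 1-D domain decomposition with halo exchange,
 * the generic MPI pattern of distributed grid simulators.
 * Illustrative only -- not Schlumberger's implementation. */
#include <mpi.h>

#define LOCAL_CELLS 1024   /* cells owned by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double cells[LOCAL_CELLS + 2];   /* +2 halo cells from the neighbors */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < LOCAL_CELLS + 2; i++)
        cells[i] = (double)rank;     /* dummy initial state */

    /* Edge ranks talk to MPI_PROC_NULL, which completes immediately. */
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange one boundary cell with each neighbor, using the same
     * MPI_Isend / MPI_Recv pair that dominates the profiles later on. */
    MPI_Request reqs[2];
    MPI_Isend(&cells[1], 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&cells[LOCAL_CELLS], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Recv(&cells[LOCAL_CELLS + 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(&cells[0], 1, MPI_DOUBLE, left, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```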
Objectives
The presented research was done to provide best practices for:
- ECLIPSE performance benchmarking
- Interconnect performance comparisons
- Ways to increase ECLIPSE productivity
- Understanding ECLIPSE communication patterns
- Power-efficient simulations
Test Cluster Configuration (system upgrade)
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron Model 2382 processors ("Shanghai")
- Mellanox InfiniBand ConnectX DDR HCAs
- Mellanox InfiniBand DDR switch
- Memory: 16GB DDR2 800MHz per node
- OS: RHEL5U2, OFED 1.3 InfiniBand software stack
- MPI: Platform MPI 5.6.5
- Application: Schlumberger ECLIPSE Simulators 2008.2
- Benchmark workload: 4-million-cell model (2048 x 200 x 10), black-oil 3-phase model with ~800 wells (the arithmetic below confirms the cell count)
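As a quick consistency check, the grid dimensions multiply out to roughly four million cells:

$$ 2048 \times 200 \times 10 = 4{,}096{,}000 \approx 4.1 \times 10^{6} \ \text{cells} $$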
Mellanox InfiniBand Solutions
- Industry standard: hardware, software, cabling, management; designed for clustering and storage interconnect
- Price and performance: 40Gb/s node-to-node, 120Gb/s switch-to-switch, 1us application latency; most aggressive roadmap in the industry
- Reliable, with congestion management
- Efficient RDMA and transport offload: kernel bypass, so the CPU focuses on application processing
- Scalable for Petascale computing and beyond
- End-to-end quality of service
- Virtualization acceleration
- I/O consolidation, including storage
[Figure: The InfiniBand performance gap is increasing - InfiniBand roadmap (4X: 20/40/80Gb/s; 12X: 60/120/240Gb/s) vs Ethernet and Fibre Channel; InfiniBand delivers the lowest latency]
Quad-Core AMD Opteron Processor
- Performance: quad-core, enhanced CPU IPC, dual-channel registered DDR2, 4x 512KB L2 cache, 6MB L3 cache
- Direct Connect Architecture: HyperTransport technology, up to 24 GB/s peak per processor
- Floating point: 128-bit FPU per core, 4 FLOPS/clk peak per core
- Integrated memory controller: up to 12.8 GB/s, DDR2-800 MHz or DDR2-667 MHz (see the arithmetic below)
- Scalability: 48-bit physical addressing
- Compatibility: same power/thermal envelopes as 2nd/3rd generation AMD Opteron processors
[Figure: Direct Connect Architecture block diagram - two processors linked by 8 GB/s HyperTransport, with PCI-E bridges and I/O hubs]
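As a sanity check on those figures: the 12.8 GB/s memory bandwidth follows directly from the dual-channel DDR2-800 configuration (8 bytes per channel per transfer), and peak floating-point throughput follows from 4 FLOPS/clk, assuming the Opteron 2382's 2.6 GHz clock:

$$ 2 \times 8\,\mathrm{B} \times 800\,\mathrm{MT/s} = 12.8\,\mathrm{GB/s}, \qquad 4\,\mathrm{FLOPS/clk} \times 2.6\,\mathrm{GHz} = 10.4\,\mathrm{GFLOPS\ per\ core} $$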
Dell PowerEdge Servers helping Simplify IT
- System structure and sizing guidelines: 24-node cluster built with Dell PowerEdge SC 1435 servers, optimized for High Performance Computing environments; building-block foundations for best price/performance and performance/watt
- Dell HPC solutions: scalable architectures for high performance and productivity; Dell's comprehensive HPC services help manage the lifecycle requirements
- Integrated, tested and validated architectures
- Workload modeling: optimized system size, configuration and workloads; test-bed benchmarks; ISV applications characterization; best practices and usage analysis
ECLIPSE Performance Results - Interconnect
- InfiniBand enables the highest scalability: performance accelerates with cluster size
- Performance over GigE and 10GigE does not scale: slowdown occurs beyond 8 nodes
[Figure: Schlumberger ECLIPSE (FOURMILL) elapsed time in seconds vs number of nodes (4-24), GigE vs 10GigE vs InfiniBand; lower is better; single job per cluster size]
ECLIPSE Performance Results - Interconnect
- InfiniBand outperforms GigE by up to 500% and 10GigE by up to 457%
- The advantage grows as the node count increases (see the formula below)
[Figure: Schlumberger ECLIPSE performance advantage of InfiniBand vs GigE and 10GigE (%), by number of nodes (4-24)]
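The advantage percentages are presumably computed from elapsed times as

$$ \text{advantage} = \left( \frac{T_{\text{GigE}}}{T_{\text{InfiniBand}}} - 1 \right) \times 100\% $$

so a 500% advantage corresponds to a 6x shorter elapsed time.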
ECLIPSE Performance Results - Productivity
- InfiniBand increases productivity by allowing multiple jobs to run simultaneously, providing the productivity required for reservoir simulations
- Three cases are presented: a single job over the entire system; four jobs, each on two cores per CPU per server; eight jobs, each on one CPU core per server (see the core accounting below)
- Eight jobs per node increase productivity by up to 142%
[Figure: Schlumberger ECLIPSE (FOURMILL) number of jobs vs number of nodes (8-24) over InfiniBand, for 1, 4, and 8 jobs per node; higher is better]
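A quick core accounting, assuming each node's two quad-core sockets are fully populated in every case, shows how the three layouts map onto the 8 cores per node:

$$ 1 \times 8 = 4 \times 2 = 8 \times 1 = 8 \ \text{cores per node} $$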
ECLIPSE Performance Results - Productivity
- InfiniBand offers unparalleled productivity compared to Ethernet
- GigE shows a performance decrease beyond 8 nodes
- 10GigE demonstrates no scaling beyond 16 nodes
[Figure: Schlumberger ECLIPSE (FOURMILL) number of jobs vs number of nodes (4-22), GigE vs 10GigE vs InfiniBand, 4 jobs on each node; higher is better]
ECLIPSE Profiling - Data Transferred
[Figure: ECLIPSE MPI profiling, number of MPI_Isend messages (millions) by message size ([0..128B], [128B..1KB], [1..8KB], [8..256KB], [256KB..1MB], [1MB..infinity]) for 4-24 nodes]
ECLIPSE Profiling - Data Transferred
[Figure: ECLIPSE MPI profiling, number of MPI_Recv messages (millions) by message size ([0..128B], [128B..1KB], [1..8KB], [8..256KB], [256KB..1MB], [1MB..infinity]) for 4-24 nodes]
ECLIPSE Profiling - Message Distribution
- The majority of MPI messages are large, demonstrating the need for the highest throughput (a profiling sketch follows below)
[Figure: ECLIPSE MPI profiling, percentage of messages by node count (4-24): MPI_Isend and MPI_Recv under 128 bytes vs under 256 KB]
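Histograms like the ones above are typically collected through the standard PMPI profiling interface, which lets a wrapper library intercept MPI calls without modifying the application. Below is a minimal sketch that counts MPI_Isend calls per size bucket; the bucket boundaries match the charts, but this is illustrative, not the tool actually used for these results.

```c
/* Minimal PMPI interposition sketch: count MPI_Isend calls per message-size
 * bucket. Built as a library linked ahead of MPI; illustrative only. */
#include <mpi.h>
#include <stdio.h>

/* Buckets: [0,128B) [128B,1KB) [1KB,8KB) [8KB,256KB) [256KB,1MB) [1MB,inf) */
static long long bucket[6];

static int bucket_index(long long bytes)
{
    if (bytes < 128)     return 0;
    if (bytes < 1024)    return 1;
    if (bytes < 8192)    return 2;
    if (bytes < 262144)  return 3;
    if (bytes < 1048576) return 4;
    return 5;
}

/* Override the MPI entry point, record the size, forward to PMPI_Isend. */
int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
              int tag, MPI_Comm comm, MPI_Request *req)
{
    int size;
    MPI_Type_size(type, &size);
    bucket[bucket_index((long long)count * size)]++;
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}

/* Dump the per-rank histogram when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 6; i++)
        printf("rank %d bucket %d: %lld sends\n", rank, i, bucket[i]);
    return PMPI_Finalize();
}
```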
Interconnect Usage by ECLIPSE
- Total server throughput increases rapidly with cluster size
- This data is per-node based
[Figure: Data sent (MB/s) per node over run time, for 4-, 8-, 16-, and 24-node runs]
ECLIPSE Profiling Summary - Interconnect
- ECLIPSE was profiled to determine its networking dependency
- The majority of data transferred between compute nodes uses 8KB-256KB messages, and the amount of data transferred increases with cluster size
- Most used message sizes: <128B messages, mainly for synchronization; 8KB-256KB messages for data transfer
- Message size distribution: the percentage of small messages (<128B) slightly decreases with cluster size, while the percentage of mid-size messages (8KB-256KB) increases with cluster size
- ECLIPSE interconnect sensitivity points: interconnect latency and throughput in the <256KB message range; as the node count increases, interconnect throughput becomes more critical (a ping-pong sketch for measuring this range follows below)
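Those sensitivity points can be measured directly with a standard ping-pong microbenchmark between two ranks. A minimal sketch, sweeping message sizes up to the 256KB range identified above (the iteration count and size steps are arbitrary choices):

```c
/* Minimal MPI ping-pong sketch: latency and throughput across the
 * message sizes ECLIPSE is sensitive to (up to 256KB). Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    int rank;
    char *buf = malloc(262144);           /* largest message: 256KB */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 1; bytes <= 262144; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {              /* rank 0 sends, waits for echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {       /* rank 1 echoes back */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        /* One-way latency = round trip / 2; throughput counts both
         * directions of each round trip. */
        if (rank == 0)
            printf("%7d bytes: %8.2f us one-way, %8.2f MB/s\n",
                   bytes, dt / (2.0 * iters) * 1e6,
                   2.0 * iters * bytes / dt / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```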
Power Consumption/Productivity Comparison
- InfiniBand enables power-efficient simulations, reducing power consumed per job by up to 66% vs GigE and 33% vs 10GigE for the four-jobs-per-node productivity case (the metric is defined below)
- With the single-job approach, InfiniBand reduces power consumed per job by more than 82% compared to 10GigE
[Figure: Power per job (Wh) for GigE, 10GigE, and InfiniBand with 4 jobs on each node; InfiniBand saves 66% vs GigE and 33% vs 10GigE]
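The Wh-per-job metric presumably follows from average system power, elapsed time, and the number of jobs completed in the run:

$$ E_{\text{job}}\,[\mathrm{Wh}] = \frac{P_{\text{system}}\,[\mathrm{W}] \times T_{\text{elapsed}}\,[\mathrm{h}]}{N_{\text{jobs}}} $$

A faster interconnect therefore cuts Wh/job twice over: the run finishes sooner, and more jobs complete per run.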
Conclusions
- ECLIPSE, developed by Schlumberger, is widely used to perform reservoir simulation
- ECLIPSE performance and productivity rely on scalable HPC systems and interconnect solutions: low-latency, high-throughput interconnect technology and a NUMA-aware application for fast access to memory
- Reasonable job distribution can dramatically improve productivity, increasing the number of jobs per day while maintaining fast run times
- The interconnect comparison shows that InfiniBand delivers superior performance and productivity at every cluster size; scalability requires latency that is both low and stays nearly flat as the system grows
- InfiniBand enables the lowest power consumption per job, optimizing the power/job ratio
Thank You
HPC Advisory Council
HPC@mellanox.com

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.