Advancing Applications Performance With InfiniBand

Transcription:

Advancing Applications Performance With InfiniBand. Pak Lui, Application Performance Manager. September 12, 2013.

Mellanox Overview (Ticker: MLNX)
Leading provider of high-throughput, low-latency server and storage interconnect: FDR 56Gb/s InfiniBand and 10/40/56GbE. Reduces application wait time for data and dramatically increases ROI on data center infrastructure.
Company headquarters: Yokneam, Israel and Sunnyvale, California; ~1,200 employees worldwide (as of June 2013).
Solid financial position: record FY12 revenue of $500.8M, up 93% year-over-year; Q2'13 revenue of $98.2M; Q3'13 guidance of ~$104M to $109M; cash and investments of $411.3M as of 6/30/13.

Providing End-to-End Interconnect Solutions
Comprehensive end-to-end software for acceleration, management, storage, and data: MXM (Mellanox Messaging Acceleration), FCA (Fabric Collectives Acceleration), UFM (Unified Fabric Management), VSA (Storage Accelerator, iSCSI), UDA (Unstructured Data Accelerator).
Comprehensive end-to-end InfiniBand and Ethernet solutions portfolio: ICs, adapter cards, switches/gateways, long-haul systems, cables/modules.

Virtual Protocol Interconnect (VPI) Technology
VPI adapters and VPI switches run the same applications, acceleration engines, and OS layer (networking, storage, clustering, management) over either fabric: Ethernet at 10/40/56 Gb/s or InfiniBand at 10/20/40/56 Gb/s, under a Unified Fabric Manager.
Switch configurations: 64 ports 10GbE; 36 ports 40/56GbE; 48 ports 10GbE plus 12 ports 40/56GbE; 36 ports InfiniBand up to 56Gb/s; up to 8 VPI subnets. Adapter form factors: LOM, adapter card, mezzanine card. Reach extends from the data center to campus and metro connectivity.
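
Since VPI exposes the same verbs software interface regardless of the wire protocol, an application can discover at runtime which link layer each port is running. A minimal libibverbs sketch (assuming libibverbs and RDMA-capable adapters are installed; not taken from the presentation):

```c
/* Minimal libibverbs sketch: list RDMA devices and report each port's
 * link layer (InfiniBand or Ethernet). Compile with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0) {
            for (uint8_t port = 1; port <= attr.phys_port_cnt; port++) {
                struct ibv_port_attr pattr;
                if (ibv_query_port(ctx, port, &pattr) == 0)
                    printf("%s port %u: %s\n",
                           ibv_get_device_name(devs[i]), port,
                           pattr.link_layer == IBV_LINK_LAYER_ETHERNET
                               ? "Ethernet" : "InfiniBand");
            }
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}
```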

MetroX and MetroDX
MetroX and MetroDX extend InfiniBand and Ethernet RDMA reach: the fastest interconnect over 40Gb/s InfiniBand or Ethernet links, supporting multiple distances, with simple management of distant sites. A low-cost, low-power, long-haul solution delivering 40Gb/s over campus and metro distances.

Data Center Expansion Example: Disaster Recovery

Key Elements in a Data Center Interconnect
Applications, servers, and storage are linked through adapters and ICs, switches and ICs, and cables, silicon photonics, and parallel optical modules.

IPtronics and Kotura Complete Mellanox's 100Gb/s+ Technology
The recent acquisitions of Kotura and IPtronics enable Mellanox to deliver complete high-speed optical interconnect solutions for 100Gb/s and beyond.

Mellanox InfiniBand Paves the Road to Exascale

Meteo France

NASA Ames Research Center: Pleiades
20K InfiniBand nodes with Mellanox end-to-end FDR and QDR InfiniBand. Supports a variety of scientific and engineering projects: coupled atmosphere-ocean models, future space vehicle design, large-scale dark matter halos and galaxy evolution, and Asian monsoon water cycle high-resolution climate simulations.

NCAR (National Center for Atmospheric Research): Yellowstone
72,288 processor cores across 4,518 nodes. Mellanox end-to-end FDR InfiniBand in a single-plane Clos (full fat-tree) network.

Applications Performance (Courtesy of the HPC Advisory Council)
[Charts: application performance results on Intel Sandy Bridge and Ivy Bridge platforms, with improvements ranging from 173% to 1556%.]

Applications Performance (Courtesy of the HPC Advisory Council)
[Charts: further application performance results, with improvements of 392% and 135%.]

Dominant in Enterprise Back-End Storage Interconnects. SMB Direct.

Leading Interconnect, Leading Performance
[Roadmap chart, 2001-2017: bandwidth grows from 10Gb/s through 20, 40, 56, and 100Gb/s toward 200Gb/s, while latency drops from 5usec through 2.5, 1.3, 0.7, and 0.6usec toward <0.5usec, all over the same software interface.]

Connect-IB: Architectural Foundation for Exascale Computing

Connect-IB: The Exascale Foundation
World's first 100Gb/s interconnect adapter: PCIe 3.0 x16 with dual FDR 56Gb/s InfiniBand ports providing >100Gb/s.
Highest InfiniBand message rate: 137 million messages per second, 4X higher than other InfiniBand solutions; <0.7 microsecond application latency.
Supports GPUDirect RDMA for direct GPU-to-GPU communication.
Unmatchable storage performance: 8,000,000 IOPs (1 QP), 18,500,000 IOPs (32 QPs).
New innovative transport: Dynamically Connected Transport service.
Supports scalable HPC with MPI, SHMEM and PGAS/UPC offloads.
Enter the world of boundless performance.

Connect-IB Memory Scalability
[Chart: host memory consumption (MB, log scale) at 8, 2K, 10K, and 100K nodes for each transport generation: InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), and Connect-IB with DCT (2012).]
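
The reason DCT flattens this curve can be sketched with rough, commonly cited approximations (the constants below are assumptions for illustration, not measured driver values): RC needs a queue pair per remote process, XRC one per remote node, and DCT only a small fixed pool per process.

```c
/* Back-of-the-envelope sketch of queue-pair counts per node for a
 * fully connected job, under commonly cited approximations. The
 * cluster size and pool size are assumed values for illustration. */
#include <stdio.h>

int main(void) {
    const long nodes    = 10000; /* assumed cluster size */
    const long ppn      = 16;    /* assumed processes per node */
    const long dct_pool = 8;     /* assumed DC initiators/targets per process */

    long procs = nodes * ppn;
    long rc_qps_per_node  = ppn * (procs - 1);  /* RC: one QP per remote process */
    long xrc_qps_per_node = ppn * (nodes - 1);  /* XRC: one QP per remote node   */
    long dct_qps_per_node = ppn * dct_pool;     /* DCT: small constant pool      */

    printf("RC : %ld QPs per node\n", rc_qps_per_node);
    printf("XRC: %ld QPs per node\n", xrc_qps_per_node);
    printf("DCT: %ld QPs per node\n", dct_qps_per_node);
    return 0;
}
```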

Dynamically Connected Transport Advantages

FDR InfiniBand Delivers Highest Application Performance
[Chart: message rate (millions of messages per second), QDR InfiniBand vs. FDR InfiniBand.]

Scalable Communication: MXM, FCA

Mellanox ScalableHPC Accelerates Parallel Applications
MPI, SHMEM, and PGAS programming models sit on top of two acceleration libraries, both layered over the InfiniBand verbs API:
MXM: reliable messaging optimized for Mellanox HCAs, hybrid transport mechanism, efficient memory registration, receive-side tag matching.
FCA: topology-aware collective optimization, hardware multicast, a separate virtual fabric for collectives, CORE-Direct hardware offload.

MXM v2.0 Highlights
Transport library integrated with Open MPI, OpenSHMEM, BUPC (Berkeley UPC), and MVAPICH2; more integrations will be added in the future. Utilizes Mellanox offload engines.
Supported APIs (both sync/async): AM, p2p, atomics, synchronization.
Supported transports: RC, UD, DC, RoCE, SHMEM.
Supported built-in mechanisms: tag matching, progress thread, memory registration cache, fast-path send for small messages, zero copy, flow control.
Supported data transfer protocols: eager send/recv, eager RDMA, rendezvous.
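
As a schematic illustration of the eager versus rendezvous protocols listed above (this is not MXM's internal API; the helper names and the threshold are hypothetical, chosen only to show the idea), a messaging library typically picks the protocol by message size:

```c
/* Schematic sketch of an eager vs. rendezvous protocol decision.
 * Helper names and the threshold are hypothetical illustrations. */
#include <stdio.h>
#include <stdlib.h>

#define EAGER_LIMIT 4096  /* assumed threshold; real values are tunable */

/* Hypothetical stand-ins for the underlying transport operations. */
static void send_eager(const void *buf, size_t len) {
    (void)buf;
    printf("eager: copy %zu bytes into a pre-registered buffer and send now\n", len);
}

static void send_rendezvous(const void *buf, size_t len) {
    (void)buf;
    printf("rendezvous: exchange a request first, then zero-copy RDMA %zu bytes\n", len);
}

static void protocol_select_send(const void *buf, size_t len) {
    if (len <= EAGER_LIMIT)
        send_eager(buf, len);       /* lowest latency for small messages */
    else
        send_rendezvous(buf, len);  /* avoids extra copies for large messages */
}

int main(void) {
    char small[64] = {0};
    char *large = calloc(1, 1 << 20);

    protocol_select_send(small, sizeof(small));
    protocol_select_send(large, 1 << 20);
    free(large);
    return 0;
}
```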

Mellanox FCA Collective Scalability
[Charts: barrier collective latency (us), reduce collective latency (us), and 8-byte broadcast bandwidth (KB*processes) versus number of processes (up to ~2500, PPN=8), with and without FCA.]
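
A minimal sketch of the kind of micro-benchmark behind such charts, timing MPI_Barrier and an 8-byte MPI_Bcast. It is plain MPI; any FCA acceleration comes from how the MPI library is built and configured, not from the application code.

```c
/* Minimal MPI collective timing sketch: average MPI_Barrier and
 * 8-byte MPI_Bcast latency over many iterations. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int iters = 10000;
    char msg[8] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up, then time barriers. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) * 1e6 / iters;

    /* Time an 8-byte broadcast from rank 0. */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(msg, sizeof(msg), MPI_CHAR, 0, MPI_COMM_WORLD);
    double bcast_us = (MPI_Wtime() - t0) * 1e6 / iters;

    if (rank == 0)
        printf("barrier: %.2f us  8-byte bcast: %.2f us\n", barrier_us, bcast_us);

    MPI_Finalize();
    return 0;
}
```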

FDR InfiniBand Delivers Highest Application Performance

GPU Direct

GPUDirect RDMA
[Diagrams: transmit and receive data paths with GPUDirect 1.0, where data is staged through system memory and the CPU, versus GPUDirect RDMA, where the InfiniBand adapter reads and writes GPU memory directly across the chipset.]
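
At the application level, the pattern GPUDirect RDMA accelerates is simply passing GPU device pointers straight to MPI, assuming a CUDA-aware MPI build (such as the MVAPICH2-GDR results on the next slide); without GPUDirect RDMA the library stages the transfer through host memory instead. A minimal sketch:

```c
/* Minimal CUDA-aware MPI sketch: rank 0 sends a GPU-resident buffer to
 * rank 1 by passing device pointers directly to MPI. Requires an MPI
 * library built with CUDA support; run with at least 2 ranks. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const size_t count = 1 << 20;   /* 1M floats */
    float *dbuf = NULL;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&dbuf, count * sizeof(float));

    if (rank == 0) {
        cudaMemset(dbuf, 0, count * sizeof(float));
        MPI_Send(dbuf, (int)count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dbuf, (int)count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %zu floats into GPU memory\n", count);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```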

Preliminary Performance of MVAPICH2 with GPUDirect RDMA
[Charts: GPU-GPU internode MPI small-message latency (us) and bandwidth (MB/s) versus message size (1 byte to 4K), MVAPICH2-1.9 vs. MVAPICH2-1.9-GDR. GPUDirect RDMA cuts small-message latency from 19.78 us to 6.12 us (69% lower) and delivers a 3X increase in throughput. Source: Prof. DK Panda.]

Preliminary Performance of MVAPICH2 with GPUDirect RDMA
[Chart: execution time of the HSG (Heisenberg Spin Glass) application with 2 GPU nodes. Source: Prof. DK Panda.]

Thank You