Introduction to QDR-80 and Dual Plane

InfiniBand fabrics being installed today by Intel utilize the True Scale Fabric. This is a Single Plane Fabric with QDR-40 HCA (single rail) connectivity, and it provides the performance required by most applications in use today. A Single Plane Fabric has a single fabric ID and can be single- or multi-tier. This paper introduces two other concepts for constructing fabrics, both intended to provide higher bandwidth to and from each server through the fabric. QDR-80 provides up to 56 million messages per second, 80Gb/s link speed, and application bandwidth up to ~6.7 GB/s for each server, with servers utilizing the Intel Xeon E5-2680. QDR-80 can be set to two modes, which are explored in more detail below. One of these modes supports Dual Plane Fabrics. A Dual Plane Fabric is simply defined as two separate Single Plane Fabrics, each with its own unique fabric ID; a QDR-80 server attaches one HCA to one fabric and the other HCA to the other. QDR-80 can also support a Single Plane Fabric. In fact, from a cost perspective, Dual Plane Fabrics really make sense only up to the size of the largest single switch.

Key Findings

The following are the key findings of this document. All QDR-80 performance results are from servers utilizing the Intel Xeon E5-2680. More detail follows.

- Lower Deployment Cost: Dual Plane Fabrics significantly reduce the deployment cost of QDR-80 for fabrics supporting between 325 and 648 hosts at FBB (Full Bisectional Bandwidth) and between 433 and 864 hosts at 2-to-1 CBB (Constant Bisectional Bandwidth).
- Higher Message Rate: A minimum of 56M non-coalesced MPI messages per second at 16 core pairs running the OSU message bandwidth test using QDR-80.
- Low Latency: MPI latency as low as ~1.15 us with the OSU latency test (see the sketch following this list). A minimum of 1.5 us Natural Order Ring latency and less than 3 us Random Order Ring latency as measured by the HPC Challenge benchmark at 16 nodes / 256 cores of Xeon E5-2680.
- Increased Nodal Bandwidth: Nodal bandwidth doubles to 80Gb/s.
- Application Bandwidth: Up to 6.7 GB/s of bandwidth per node with QDR-80.
- Redundancy Modes: QDR-80 with Single and Dual Plane Fabrics provides added redundancy over QDR-40 installations.
- Server Utilization: QDR-80 can provide higher server utilization due to internal traffic avoidance.
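The ~1.15 us figure above is reported from the OSU latency test. As a rough illustration of what that benchmark measures, the following is a minimal MPI ping-pong sketch in the same spirit; it is not the OSU code itself, and the iteration counts and output format are illustrative choices, not part of the published methodology.

    /*
     * Minimal ping-pong latency sketch, in the spirit of the OSU latency test.
     * Illustrative only: the ~1.15 us result quoted above comes from the actual
     * benchmark on the hardware described in this paper.
     */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERATIONS 10000
    #define SKIP       1000   /* warm-up iterations excluded from timing */

    int main(int argc, char **argv)
    {
        char buf = 0;              /* 1-byte message, matching the 1-byte latency figure */
        int rank;
        double start = 0.0, end;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < ITERATIONS + SKIP; i++) {
            if (i == SKIP)
                start = MPI_Wtime();   /* start timing after warm-up */

            if (rank == 0) {
                MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        end = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (end - start) * 1e6 / (2.0 * ITERATIONS));

        MPI_Finalize();
        return 0;
    }

Run with exactly two ranks placed on different hosts (one rank per node) so the measured path crosses the fabric rather than shared memory.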

Single Plane and Dual Plane Fabrics

Single Plane and Dual Plane Fabrics are defined separately from the QDR-80 mode. Each has its own characteristics, and each, in combination with QDR-80, has its own performance profile. QDR-80 supports both Single and Dual Plane; however, Dual Plane Fabrics can only be used with PSM_MULTIRAIL enabled (see the example at the end of this section). QDR-40, or single-rail, HCAs support Single Plane Fabrics only.

Single Plane

As previously stated, Single Plane Fabrics are the configurations predominantly used in HPC deployments. They consist of either a single switch or multiple switches in any supported networking configuration. Figures 1 and 2 show sample configurations for a single switch and for a 1500-node multi-tier switching fabric. FBB and CBB are the most prevalent, although Mesh and Torus are also supported. The key determining factor is that there is a single fabric ID. Keep in mind the application performance requirements and the customer's network requirements.

Figure 1 - Single Plane, Single Switch
Figure 2 - Multi-Switch, Single Plane

Dual Plane

Dual Plane Fabrics consist of two separate Single Plane Fabrics. They are supported only by hosts utilizing QDR-80 with PSM_MULTIRAIL set; QDR-80 without the setting, and QDR-40, are not supported. With Dual Plane and QDR-80 with PSM_MULTIRAIL, each node has two HCAs, each connected to one of the two switching fabrics. In this case two SMs are required (although, using a single host SM, both independent fabrics can be managed by that host SM). The benefit of this implementation is improved MPI message rate, latency, and bandwidth to the node, and it can be a less expensive way of deploying QDR-80.

Figure 3 - Separate Single Plane Fabrics
Figure 4 - 2 Separate Switches
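PSM_MULTIRAIL is an environment variable read by the PSM messaging layer. As a hedged illustration only: the variable is normally exported in the job environment or passed through the MPI launcher, but the sketch below sets it programmatically before MPI_Init (and therefore before PSM initializes). Treat this as an illustration of the setting, not a documented procedure.

    /*
     * Illustrative only: PSM_MULTIRAIL is typically exported in the job script
     * or via the MPI launcher's environment options. Setting it here before
     * MPI_Init is just one way to make it visible when PSM opens the HCAs.
     */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* Must be set before MPI_Init so PSM sees it during initialization. */
        setenv("PSM_MULTIRAIL", "1", 1);

        MPI_Init(&argc, &argv);
        /* ... application traffic can now use both HCAs per node via PSM multi-rail ... */
        MPI_Finalize();
        return 0;
    }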

QDR-80

As with any HPC fabric, the overall system and software architecture are still the keys. These lead to overall application performance along with cost benefits. While QDR-40 already exceeds FDR-based switching in many areas, including cost and performance, QDR-80 and Dual Plane bring even higher performance while still providing a cost benefit for some configurations. Dual Plane is predominantly a cost-reduction implementation for QDR-80 deployments of 325 to 648 servers in an HPC cluster at full bisectional bandwidth, or up to 864 nodes at 2:1 oversubscription. Above these sizes, there is no cost benefit.

Figure 5 - Dual Plane Fabric with QDR-80

With QDR-80 (with PSM_MULTIRAIL) and Dual Plane, each node has two HCAs, but they are connected to two different switching fabrics (typically two SMs are required, although, using a single host SM, both independent fabrics can be managed by that host SM). The benefit of this implementation is improved MPI message rate, latency, and bandwidth to the node, and a less expensive way of deploying QDR-80. A Single Plane Fabric supports QDR-80 with or without PSM_MULTIRAIL set. Each configuration has its own performance characteristics.

Figure 6 - Single Plane Fabric with QDR-80

QDR-80, PSM_MULTIRAIL variable set:
- High single-process bandwidth.
- Economical for 325 to 648 servers at FBB or 433 to 864 servers at 2-to-1.
- 30-40% cost savings at 648 nodes.
- ~55 million msg/sec (Sandy Bridge).
- 1.16 us latency (Sandy Bridge).
- Large messages from a process are striped across both HCAs.
- Small messages alternate between HCAs.
- Up to 6.7 GB/s single-process-per-node bandwidth (Sandy Bridge).
- Targeted for bandwidth-intensive applications.

These performance parameters are the same whether the QDR-80 with PSM_MULTIRAIL servers are connected to a Single Plane or a Dual Plane Fabric. However, only with the parameter set will QDR-80 support Dual Plane.

QDR-80, PSM_MULTIRAIL variable NOT set:
- Targeted for latency-sensitive and high-message-rate applications.
- 1.15 us latency (Sandy Bridge).
- Up to 56M MPI msg/sec (Sandy Bridge).
- Processor affinity: leverages the PCIe bus within each E5 processor and improves latency by eliminating the remote hop to the first socket.
- Provides up to 5.2 GB/s application bandwidth (Sandy Bridge).

QDR-80 Dual Plane Fabric Deployment

One key thing to remember is that while Dual Plane is supported only by QDR-80 with PSM_MULTIRAIL set, Single Plane Fabrics are also supported in that mode, and the data transmission performance characteristics are identical. Dual Plane does add a higher degree of fabric redundancy: if one HCA or an entire fabric fails, all traffic can traverse the remaining switch(es). It is important to note that with Mellanox, a single dual-port HCA can be used to provide dual rail; if that single HCA fails, the entire host is out of service.

Figure 7 - 648 Node Dual Plane Fabric
Figure 8 - Single Plane 648 Host QDR-80 Fabric

Although extra HPC redundancy does provide some customer benefit, the predominant reason for Dual Plane is to reduce customer cost, and cost reduction can only be accomplished on a limited subset of fabric sizes. Figure 7 and Figure 8 both show a 648-host fabric. As you can see, Dual Plane significantly reduces hardware cost, space, and power requirements, for a 30-40% reduction in customer cost. This cost reduction is only relevant from half the size of the largest switch plus one port up to the port maximum of the switch. This means that Dual Plane is more cost effective from 325 to 648 hosts with an FBB switch and from 433 to 864 hosts with a 2:1 oversubscribed switch.

As shown in Figure 9 and Figure 10, the cost benefit disappears once a Dual Plane Fabric exceeds the port count of a single switch. While the cost benefit goes away, Dual Plane still leaves behind some additional host-to-fabric redundancy that Mellanox can't provide with their dual-port HCA. In the example shown there are 1500 hosts, each supporting QDR-80 with PSM_MULTIRAIL enabled. Roughly the same hardware/software complement is needed for each fabric; in fact, in this case the Dual Plane Fabric is slightly more costly.

Figure 9 - 1500 QDR-80 Host Single Plane Fabric
Figure 10 - 1500 QDR-80 Host Dual Plane Fabric

QDR-80 Performance

The Intel True Scale Architecture with QDR-80 provides higher overall performance to each server in the cluster. The test configuration was:

Processor: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Sockets: 2
CPU cores: 8
Memory: 32GB
MPI: MVAPICH

Message Rate: With either QDR-80 Multirail or non-Multirail, each server can source over 55M messages per second with 1-byte messages (see the sketch below).

Figure 11 - QDR-80 Message Rate
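As a rough illustration of how a multi-pair message-rate test of this kind works, the sketch below pairs the lower half of the ranks with the upper half and counts 1-byte non-blocking sends completed per second. It is in the spirit of, but not identical to, the OSU message-rate benchmark used for the figures above; the window size, iteration count, and end-of-window acknowledgment are illustrative choices.

    /*
     * Message-rate sketch: each sender posts a window of non-blocking 1-byte
     * sends to its paired receiver, then waits for a short acknowledgment
     * before the next window. Illustrative only; not the published benchmark.
     */
    #include <mpi.h>
    #include <stdio.h>

    #define WINDOW 64      /* illustrative window size */
    #define ITERS  1000    /* illustrative iteration count */

    int main(int argc, char **argv)
    {
        char sbuf[WINDOW] = {0}, rbuf[WINDOW], ack = 0;
        MPI_Request req[WINDOW];
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Pair rank r with rank r + size/2: the lower half sends, the upper half
           receives. Run with an even rank count, half the ranks on each host. */
        int half = size / 2;
        int peer = (rank < half) ? rank + half : rank - half;

        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        for (int it = 0; it < ITERS; it++) {
            if (rank < half) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Isend(&sbuf[w], 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                /* Wait for the receiver's acknowledgment before the next window. */
                MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Irecv(&rbuf[w], 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
            }
        }

        double elapsed = MPI_Wtime() - start;

        if (rank == 0) {
            double total_msgs = (double)half * ITERS * WINDOW;  /* all pairs combined */
            printf("aggregate message rate: %.2f million msg/s\n",
                   total_msgs / elapsed / 1e6);
        }

        MPI_Finalize();
        return 0;
    }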

Maximum Bandwidth: Each server can provide over 6.7 GB/s of application bandwidth with QDR-80 Multirail; standard QDR-80 provides similar results.

Figure 12 - QDR-80 Maximum Bandwidth

Latency: With either QDR-80 Multirail or non-Multirail, latency is not significantly increased over QDR-40.

Figure 13 - QDR-80 1 Byte Message Latency

Conclusion

The Intel True Scale Architecture supports a myriad of configurations. Both Single Plane and Dual Plane Fabrics have their place, depending on customer performance, cost, and reliability requirements. In any case, the True Scale Architecture provides better overall redundancy with QDR-80 than the Mellanox FDR offering, and Dual Plane also provides a significant cost advantage at certain fabric sizes. QDR-80 brings higher application bandwidth and message rate to each host with both Single and Dual Plane Fabrics. The True Scale design seeks to eliminate wasteful CPU cycles and return that power to the application. QDR-80 without PSM_MULTIRAIL provides an 80Gb/s network to each host and specifically helps overall performance by utilizing affinity balancing across dual Intel E5-2600 processors; the workload savings it delivers are listed below.

- Recognizes a dual E5-2600 and dual-rail InfiniBand configuration.
- Aligns the E5-2600 cores to the closest InfiniBand HCA.
- Works transparently with MPIs and applications.
- Allocates context memory on the local socket to avoid costly remote memory accesses.
- Takes advantage of DDIO acceleration with both HCAs.

QDR-80 with PSM_MULTIRAIL provides 80Gb/s of network bandwidth to each host and efficiently stripes the data across both HCAs to supply the highest application bandwidth. The Intel True Scale Architecture continues to show that scale and architecture matter.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Any software source code reprinted in this document is furnished for informational purposes only and may only be used or copied and no license, express or implied, by estoppel or otherwise, to any of the reprinted source code is granted by this document.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Testing was done by Intel.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others.

Copyright 2013, Intel Corporation. All rights reserved.