Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand




Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin, D. K. Panda
Network Based Computing Laboratory, The Ohio State University

COTS Clusters
- Advent of High Performance Networks
  - Ex: InfiniBand, Myrinet, Quadrics, 10-Gigabit Ethernet
  - High Performance Protocols: VAPI / IBAL, GM, EMP
  - Provide applications direct and protected access to the network
- Commodity-Off-the-Shelf (COTS) Clusters
  - Enabled through High Performance Networks
  - Built of commodity components
  - High Performance-to-Cost Ratio

InfiniBand Architecture Overview
- Industry-standard interconnect for connecting compute and I/O nodes
- Provides High Performance
  - Low latency of less than 4 us
  - Over 935 MBps uni-directional bandwidth
  - Offloaded Transport Layer; Zero-Copy data transfer
- Provides one-sided communication (RDMA, Remote Atomics)
- Becoming increasingly popular

Cluster-based Data-Centers
- Increasing adoption of the Internet
  - Primary means of electronic interaction
  - Highly Scalable and Available Web-Servers: Critical!
- Utilizing Clusters for Data-Center environments?
  - Studied and proposed by the industry and research communities
(Figure courtesy: CSP Architecture Design)
- Nodes are logically partitioned
  - Interact depending on the query
  - Provide the services requested
- Services provided are related
- Fragmentation of resources

Shared Multi-Tier Data-Centers
(Figure: clients connect over the WAN to load-balancing clusters and server pools for Website A, Website B, and Website C)
- Hosting several unrelated services on a single clustered data-center

Issues in Shared Data-Centers
- Hosting several unrelated services on a single data-center
  - Ex: A single data-center hosting multiple websites
  - Currently used by several ISPs and Web Service Providers (IBM, HP)
  - Allows differentiation in the resources provided for each service
- Fragmentation is a big concern!
- Over-provisioning of nodes for each service
  - Nodes provided to each service based on worst-case estimates
  - Widely used approach
  - Leads to severe under-utilization of resources

Dynamic Reconfigurability
(Figure: the shared data-center from the previous slide, with server nodes moving between Websites A, B, and C)
- Nodes reconfigure themselves to serve highly loaded websites at run-time

Objective
- Under-utilization of resources needs to be curbed
- Dynamically configuring the nodes allotted to each service
  - A widely studied approach for clusters
- Interesting challenges in the Data-Center environment
  - Highly loaded back-end servers
  - Compatibility with existing applications (Apache, MySQL, etc.)
- Can the advanced features provided by InfiniBand help?

Presentation Roadmap
- Introduction and Background
- Shared Data-Centers
- Designing Dynamic Reconfigurability for Shared Data-Centers
- Experimental Results
- Continuing and Future Work
- Concluding Remarks

Shared Data-Centers Overview
- Clients request services using high-level protocols such as HTTP
- Requests are distributed to the nodes using load-balancers
  - Load-balancers expose a single IP address to the clients
  - Maintain a list of several internal IP addresses to which requests are forwarded
- Several solutions for load-balancers
  - Hardware load-balancers
  - Software load-balancers
  - Cluster-based load-balancers

Cluster-based Load-Balancers
- Hardware Load-Balancers
  - Commonly used in several environments
  - Inflexible and cannot be tuned to the data-center requirements
- Software Load-Balancers
  - Easy to modify and tune to the data-center requirements
  - Potential bottlenecks for highly loaded data-center environments
- Cluster-based load-balancers
  - Proposed by several researchers as an additional Edge Tier [shah01]
  - Provide intelligent services such as load-balancing, caching, etc.
  - Use an additional hardware load-balancer or DNS aliasing to receive requests
[shah01] H. V. Shah, D. B. Minturn, A. Foong, G. L. McAlpine, R. S. Madukkarumukumana and G. J. Regnier. CSP: A Novel System Architecture for Scalable Internet and Communication Services. In USITS 2001.

Presentation Roadmap
- Introduction and Background
- Shared Data-Centers
- Designing Dynamic Reconfigurability for Shared Data-Centers
- Experimental Results
- Continuing and Future Work
- Concluding Remarks

Design Issues
- Support for Existing Applications
  - Modifying existing applications: cumbersome and impractical
  - Utilizing external helper modules (external programs running on each node)
    - Take care of load monitoring, reconfiguration, etc.
    - Reflect changes to the data-center applications using environment settings
- Load-Balancer-based vs. Server-based Reconfiguration
  - Trading network traffic for CPU overhead
  - Load-balancers convert nodes to serve their website
- Remote-Memory-Operations-based Design
  - Server node applications are typically very compute intensive
    - Execution of CGI scripts, business logic, database processing
  - Utilizing the one-sided operations provided by InfiniBand
  - Load-balancers remotely monitor and reconfigure the system (see the sketch below)
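As a rough illustration of how such a helper module could expose its state for one-sided access, here is a minimal sketch using the modern libibverbs API (the original work used the Mellanox VAPI interface). The load_record layout, its field names, and export_load_record are assumptions for exposition, not the paper's actual data structures.

```c
/* Hypothetical sketch of the server-side helper module's setup (libibverbs;
 * the original work used Mellanox VAPI).  The module pins a small "load
 * record" in registered memory and hands out its rkey so load-balancers can
 * query and update it with one-sided operations, without involving the
 * heavily loaded server CPU. */
#include <infiniband/verbs.h>
#include <stdint.h>

struct load_record {            /* layout is an assumption, not the paper's */
    uint64_t lock;              /* lock word for remote atomics (8-byte aligned) */
    uint64_t shared_update_counter;
    uint64_t load;              /* e.g. number of active connections */
    uint64_t current_website;   /* which service this node currently serves */
};

static struct load_record record;

struct ibv_mr *export_load_record(struct ibv_pd *pd)
{
    /* Allow remote reads (load queries), writes and atomics (locks, counters). */
    return ibv_reg_mr(pd, &record, sizeof(record),
                      IBV_ACCESS_LOCAL_WRITE  |
                      IBV_ACCESS_REMOTE_READ  |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_ATOMIC);
    /* The record's address and mr->rkey would then be advertised to the
     * load-balancers out of band, e.g. over a bootstrap TCP connection. */
}
```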

Implementation Details
- History-Aware Reconfiguration
  - Avoiding server thrashing by maintaining a history of the load pattern (sketched below)
- Reconfigurability Module Sensitivity
  - Time interval between two consecutive checks
- Maintaining a System-Wide Shared State
- Shared State with Concurrency Control
- Tackling Load-Balancing Delays
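A history-aware check with a sensitivity interval could look like the following minimal sketch; HISTORY, THRESHOLD, SENSITIVITY, and the two helper functions are illustrative assumptions, not values or names from the work.

```c
/* Minimal sketch of the history-aware check loop in the reconfigurability
 * module; parameters and helper names are illustrative assumptions. */
#include <stdbool.h>
#include <unistd.h>

#define HISTORY      8     /* number of recent load samples kept */
#define THRESHOLD  100     /* load above which the site counts as "loaded" */
#define SENSITIVITY  1     /* seconds between two consecutive checks */

extern unsigned query_current_load(void);      /* e.g. an RDMA Read, next slides */
extern void     attempt_reconfiguration(void); /* locking protocol, later slides */

static unsigned load_history[HISTORY];
static unsigned head;

/* Reconfigure only when the whole recent history is above the threshold,
 * so that a single transient burst does not cause server thrashing. */
static bool consistently_loaded(void)
{
    for (unsigned i = 0; i < HISTORY; i++)
        if (load_history[i] < THRESHOLD)
            return false;
    return true;
}

void monitor_loop(void)
{
    for (;;) {
        load_history[head] = query_current_load();
        head = (head + 1) % HISTORY;
        if (consistently_loaded())
            attempt_reconfiguration();
        sleep(SENSITIVITY);    /* reconfigurability-module sensitivity */
    }
}
```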

System-Wide Shared State
- Nodes in the cluster need to share control information
  - Load, current state of the node, etc.
- A sockets-based implementation has several disadvantages
  - All communication needs to be performed explicitly
  - Asynchronous requests need to be handled by the host
  - A major concern due to the high CPU overhead on the servers
- InfiniBand RDMA operations avoid these disadvantages
  - Load-balancers can read data on the servers using RDMA Read (sketched below)
  - Can update system state using RDMA Write and Atomic Operations
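A minimal sketch of the load-balancer side of such a query, assuming a connected reliable (RC) queue pair and the modern libibverbs API rather than the VAPI interface used in the original work; rdma_read_load and its parameters are hypothetical names.

```c
/* Sketch: fetch a server's load with a one-sided RDMA Read.  The server CPU
 * is never involved; its HCA services the read. */
#include <infiniband/verbs.h>
#include <stdint.h>

uint64_t rdma_read_load(struct ibv_qp *qp, struct ibv_cq *cq,
                        uint64_t *local_buf, uint32_t lkey,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = sizeof(*local_buf),
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr;
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;   /* address of the remote 'load' field */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return UINT64_MAX;                  /* post failed */

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)    /* busy-poll for completion */
        ;
    return (wc.status == IBV_WC_SUCCESS) ? *local_buf : UINT64_MAX;
}
```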

Shared State with Concurrency Control
- Load-balancers query the system load at regular intervals
- On detecting a high load, a reconfiguration is done
- Multiple concurrency issues to be dealt with:
  - Multiple simultaneous transitions are possible
    - Each node in the load-balancer cluster can attempt a reconfiguration
    - Multiple nodes might end up being converted on a single burst
  - Hot-spot effects on remote nodes
    - All load-balancers might try to get load information from the same node
    - They might try to convert the same node
- Additional logic required!

Locking Mechanism
- We propose a two-level hierarchical locking mechanism
  - Internal lock for each website cluster
    - Only one load-balancer in a cluster can attempt a reconfiguration
  - External lock for performing the reconfiguration
    - Only one website can convert any given node
- Both locks are performed remotely using InfiniBand Atomic Operations (sketched below)
(Figure: load-balancers for Websites A, B, and C acquire the internal lock within their own cluster and the external lock on the target server)
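A sketch of how either lock could be acquired with a remote compare-and-swap, again using libibverbs as a stand-in for the VAPI atomics used in the work; try_remote_lock, the lock encoding (0 = free, balancer id = held), and the polling loop are assumptions.

```c
/* Sketch: acquire a remote lock word with an InfiniBand atomic
 * compare-and-swap.  The lock word lives in the server's registered memory;
 * the same primitive serves both the internal and the external lock. */
#include <infiniband/verbs.h>
#include <stdbool.h>
#include <stdint.h>

bool try_remote_lock(struct ibv_qp *qp, struct ibv_cq *cq,
                     uint64_t *result_buf, uint32_t lkey,
                     uint64_t lock_addr, uint32_t rkey, uint64_t my_id)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)result_buf, .length = 8, .lkey = lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr;
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = lock_addr;   /* must be 8-byte aligned */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 0;           /* expect: lock is free */
    wr.wr.atomic.swap        = my_id;       /* swap in our identity */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return false;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    /* The HCA returns the previous value of the lock word in result_buf:
     * if it was 0, the swap took effect and we own the lock. */
    return wc.status == IBV_WC_SUCCESS && *result_buf == 0;
}
```

In this scheme a load-balancer would first acquire the internal lock of its own website cluster, then the external lock of the candidate node, and release each lock by writing (or swapping) 0 back into the lock word.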

Tackling Load-Balancing Delays
- After a reconfiguration, balancing of the load might take some time
- The locking mechanisms only ensure that there are no simultaneous transitions
- We also need to ensure that all load-balancers are aware of recent reconfigurations
- Dual Counters (illustrated below)
  - Shared Update Counter (SUC) on each node; Local Update Counter (LUC) on each load-balancer
  - On a reconfiguration, the LUC should be equal to the SUC
  - After a node is reconfigured, all remote SUCs are incremented
(Figure: message sequence between the Website A and Website B load-balancers and a server, showing load queries, a successful atomic lock, the SUC update, the node reconfiguration, and the atomic unlock)
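The counter check itself is ordinary local logic; the following sketch shows one plausible shape of it, with may_reconfigure and the struct layout being illustrative assumptions rather than the paper's implementation.

```c
/* Illustrative dual-counter check.  Each node exposes a Shared Update Counter
 * (SUC) in registered memory; each load-balancer keeps a Local Update Counter
 * (LUC).  A successful reconfiguration increments every remote SUC (e.g. with
 * an IBV_WR_ATOMIC_FETCH_AND_ADD of 1), so a balancer whose LUC trails the SUC
 * it reads knows a recent reconfiguration may not yet be reflected in the load
 * it observes, and defers its own attempt. */
#include <stdbool.h>
#include <stdint.h>

struct balancer {
    uint64_t luc;   /* number of reconfigurations this balancer has observed */
};

bool may_reconfigure(struct balancer *b, uint64_t remote_suc,
                     uint64_t observed_load, uint64_t threshold)
{
    if (b->luc != remote_suc) {   /* a reconfiguration happened recently:  */
        b->luc = remote_suc;      /* note it and let the load balance out  */
        return false;             /* before trusting the observed load     */
    }
    return observed_load > threshold;   /* state is current; act on the load */
}
```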

Presentation Roadmap
- Introduction and Background
- Shared Data-Centers
- Designing Dynamic Reconfigurability for Shared Data-Centers
- Experimental Results
- Continuing and Future Work
- Concluding Remarks

Experimental Test-bed
- Cluster 1: 8 SuperMicro SUPER X5DL8-GG nodes; dual Intel Xeon 3.0 GHz processors, 512 KB L2 cache, 1 GB memory; PCI-X 64-bit 133 MHz
- Cluster 2: 8 SuperMicro SUPER P4DL6 nodes; dual Intel Xeon 2.4 GHz processors, 512 KB L2 cache, 512 MB memory; PCI-X 64-bit 133 MHz
- Mellanox MT23108 dual-port 4x HCAs; MT43132 24-port switch
- Apache 2.0.50 web and PHP servers; MySQL database server

Experimental Results (Outline)
- Basic IBA Performance
- Impact of Background Computation Threads
- Impact of Request Burst Length
- Node Utilizations

Basic IBA Performance
(Figures: latency and bandwidth versus message size for RDMA Read (event and polling) and IPoIB, with sender and receiver CPU utilization)
- The RDMA Read operation on IBA outperforms TCP/IP (IPoIB)
  - IBA achieves about 12 us latency compared to 56 us for IPoIB
  - IBA achieves about 830 MBps bandwidth compared to 230 MBps for IPoIB
- More importantly, near-zero CPU requirements on the receiver side

Impact of Background Threads
(Figures: impact on latency and bandwidth of 64-byte and 32 KB RDMA Read versus IPoIB as the number of background threads grows from 1 to 64)
- Remote memory operations are not affected at all by the remote server load
- Ideal for the data-center environment

Impact of Burst Length
(Figures: transactions per second (TPS) versus burst length, from 512 to 16384, for 3 and 4 co-hosted web-sites, comparing the Rigid, Reconf, and Over-Provisioning schemes)
- Rigid has 3 nodes for each website; Over-Provisioning has 6 nodes for each website
- A large burst length allows reconfiguration of the system closer to the best case!
- Reconf performs comparably with the static scheme for small burst sizes

Node Utilization for 3 Co-hosted Web-sites
(Figures: number of busy nodes over the iterations for burst lengths of 512 and 8192, comparing Reconf node utilization with the Rigid and Over-Provisioning schemes)
- For large burst lengths, the reconfiguration time is negligible and performance is better

Presentation Roadmap
- Introduction and Background
- Shared Data-Centers
- Designing Dynamic Reconfigurability for Shared Data-Centers
- Experimental Results
- Continuing and Future Work
- Concluding Remarks

Concluding Remarks
- Growing fragmentation of resources in data-centers
  - Related services provided by Multi-Tier Data-Centers
  - Unrelated services provided by Shared Data-Centers
- Dynamically configuring the resources allotted
  - A common approach used in clusters
  - The Data-Center environment has its own challenges
    - Highly loaded back-end servers
    - Compatibility with existing applications
- Provided a novel approach utilizing the RDMA features of IBA
  - A scheme resilient to the load on the back-end servers
  - Demonstrated up to 2.5 times improvement in throughput
  - Similar performance using only half the nodes

Presentation Roadmap
- Introduction and Background
- Shared Data-Centers
- Designing Dynamic Reconfigurability for Shared Data-Centers
- Experimental Results
- Continuing and Future Work
- Concluding Remarks

Continuing and Future Work
- Multi-Stage Reconfigurations
  - The least loaded server might not be the best server to reconfigure
    - Caching constraints
    - Replicated databases
    - Hardware heterogeneity
- Utilizing Dynamic Reconfigurability for advanced services
  - QoS guarantees
  - Differentiation in the resources provided

Thank You!
For more information, please visit the NBC Home Page:
http://nowlab.cis.ohio-state.edu
Network Based Computing Laboratory, The Ohio State University

Backup Slides