Accelerating Spark with RDMA for Big Data Processing: Early Experiences
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Digital Data Explosion in the Society
[Figure: growth of digital data. Source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode]
Big Data Technology - Hadoop
- Apache Hadoop is one of the most popular Big Data technologies
- Provides frameworks for large-scale, distributed data storage and processing: MapReduce, HDFS, YARN, RPC, etc.
- Hadoop 1.x stack: MapReduce (cluster resource management & data processing) over the Hadoop Distributed File System (HDFS), on top of Hadoop Common/Core (RPC, ...)
- Hadoop 2.x stack: MapReduce and other data-processing models over YARN (cluster resource management & job scheduling) and HDFS, on top of Hadoop Common/Core (RPC, ...)
Big Data Technology - Spark
- An emerging in-memory processing framework
  - Iterative machine learning
  - Interactive data analytics
  - Scala-based implementation
  - Master-slave architecture; builds on HDFS and Zookeeper
- Scalable and communication intensive
  - Wide dependencies between Resilient Distributed Datasets (RDDs), as opposed to narrow dependencies (see the sketch below)
  - MapReduce-like shuffle operations to repartition RDDs
  - Same as Hadoop, sockets-based communication
- Architecture: a Driver holding a SparkContext talks to the Master, which schedules tasks on Workers that read from HDFS; Zookeeper coordinates the masters
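The narrow/wide distinction is easiest to see in code. Below is a minimal Scala sketch (not from the talk; application name and sizes are illustrative): a `map` keeps data local, while `groupByKey` creates the wide dependency whose shuffle this work targets.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // PairRDDFunctions implicits (Spark 0.9.x style)

object DependencyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DependencyExample"))
    // Narrow dependency: each output partition depends on exactly one input
    // partition, so no data moves across the network.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    // Wide dependency: every reducer may need data from every mapper, so
    // groupByKey triggers a MapReduce-like shuffle to repartition the RDD.
    val grouped = pairs.groupByKey()
    println(grouped.count())
    sc.stop()
  }
}
```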
Common Protocols using Open Fabrics
- Sockets interface, kernel-space TCP/IP: 1/10/40 GigE over Ethernet driver; IPoIB over InfiniBand; 10/40 GigE-TOE with hardware offload
- Sockets interface, user-space RDMA: RSockets and SDP over InfiniBand
- Verbs interface, user-space RDMA: iWARP (iWARP adapter), RoCE (RoCE adapter), and native IB verbs (InfiniBand adapter)
- Each protocol runs over its matching adapter and switch (Ethernet or InfiniBand)
Previous Studies
Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
- Hadoop acceleration with RDMA
  - N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14
  - N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC '12
  - M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS '14
  - M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC '13
  - X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP '13
- HBase acceleration with RDMA
  - J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
- Memcached acceleration with RDMA
  - J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
The High-Performance Big Data (HiBD) Project
- RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
- RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
- RDMA for Memcached (RDMA-Memcached)
- OSU HiBD-Benchmarks (OHB)
http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop
- High-performance design of Hadoop over RDMA-enabled interconnects
  - Native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  - Easily configurable for native InfiniBand, RoCE, and traditional sockets-based transports (Ethernet and InfiniBand with IPoIB)
- Current release: 0.9.9 (03/31/14)
  - Based on Apache Hadoop 1.2.1; compliant with Apache Hadoop 1.2.1 APIs and applications
  - Tested with Mellanox InfiniBand adapters (DDR, QDR, and FDR), RoCE support with Mellanox adapters, various multi-core platforms, and different file systems with disks and SSDs
- RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
http://hibd.cse.ohio-state.edu
Performance Benefits - RandomWriter & Sort on SDSC Gordon
[Figures: job execution time (sec), IPoIB (QDR) vs. OSU-IB (QDR), for 50-300 GB data sizes on clusters of 16, 32, 48, and 64 nodes]
- RandomWriter, single SSD per node: 16% improvement over IPoIB for 50 GB on a 16-node cluster; 20% improvement over IPoIB for 300 GB on a 64-node cluster
- Sort, single SSD per node: 20% improvement over IPoIB for 50 GB on a 16-node cluster; 36% improvement over IPoIB for 300 GB on a 64-node cluster
- Can RDMA also benefit Apache Spark on high-performance networks?
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Problem Statement
- Is it worth it? Is the performance improvement potential high enough if we successfully adapt RDMA to Spark - a few percentage points, or orders of magnitude?
- How difficult is it to adapt RDMA to Spark?
  - Can RDMA be adapted to suit the communication needs of Spark?
  - Is it viable to rewrite portions of the Spark code with RDMA?
- Can Spark applications benefit from an RDMA-enhanced design?
  - What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
  - Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
- How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
- Assessment methodology
  - Evaluation on primitive-level micro-benchmarks: latency and bandwidth; 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); Java/Scala-based environment
  - Evaluation on typical workloads: GroupByTest; 1GigE, 10GigE, and IPoIB (32Gbps); Spark 0.9.1
Evaluation on Primitive-level Micro-benchmarks
[Figures: latency vs. message size (1 byte - 4 MB) for small and large messages, and peak bandwidth (MBps), comparing 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]
- As expected, IPoIB and 10GigE are much better than 1GigE
- Compared to the other interconnects/protocols, RDMA significantly reduces the latency for all message sizes and improves the peak bandwidth
- Can these benefits of high-performance networks be achieved in Apache Spark? (a minimal latency benchmark in the spirit of these tests is sketched below)
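For concreteness, here is a hedged Scala sketch of a socket-level ping-pong latency test, not the actual harness used for these figures; iteration counts are illustrative. The interface under test (1GigE, 10GigE, or IPoIB) is selected simply by the address you connect to; measuring the verbs-level RDMA numbers would instead require a native benchmark (e.g., perftest's ib_send_lat).

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.{ServerSocket, Socket}

object PingPong {
  val Warmup = 1000
  val Iters  = 10000

  def main(args: Array[String]): Unit = args match {
    case Array("server", port, size)       => server(port.toInt, size.toInt)
    case Array("client", host, port, size) => client(host, port.toInt, size.toInt)
  }

  // Echo side: receive one fixed-size message and send it straight back.
  def server(port: Int, size: Int): Unit = {
    val sock = new ServerSocket(port).accept(); sock.setTcpNoDelay(true)
    val in  = new DataInputStream(sock.getInputStream)
    val out = new DataOutputStream(sock.getOutputStream)
    val buf = new Array[Byte](size)
    for (_ <- 0 until Warmup + Iters) { in.readFully(buf); out.write(buf); out.flush() }
  }

  // Timing side: half the average round-trip time approximates one-way latency.
  def client(host: String, port: Int, size: Int): Unit = {
    val sock = new Socket(host, port); sock.setTcpNoDelay(true)
    val in  = new DataInputStream(sock.getInputStream)
    val out = new DataOutputStream(sock.getOutputStream)
    val buf = new Array[Byte](size)
    for (_ <- 0 until Warmup) { out.write(buf); out.flush(); in.readFully(buf) }
    val t0 = System.nanoTime
    for (_ <- 0 until Iters)  { out.write(buf); out.flush(); in.readFully(buf) }
    println(f"$size%d bytes: ${(System.nanoTime - t0) / 1000.0 / Iters / 2}%.2f us one-way")
  }
}
```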
Evaluation on Typical Workloads
[Figures: GroupBy time (sec) for 4-10 GB data sizes (annotated improvements of 65% and 22%), and network throughput (MB/s) in send and receive over time, comparing 1GigE, 10GigE, and IPoIB (32Gbps)]
- For high-performance networks, the execution time of GroupBy is significantly improved
- The network throughputs are much higher for both send and receive
- Spark can benefit from high-performance interconnects to a greater extent!
- Can RDMA further benefit Spark performance compared with IPoIB and 10GigE? (the GroupByTest workload used here is sketched below)
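The shape of this workload matters for the rest of the talk, so here is a hedged Scala sketch modeled on the GroupByTest example that ships with Spark (parameter names and values are illustrative, not the exact configuration used in these runs): the shuffle triggered by `groupByKey` dominates the reported "GroupBy Time".

```scala
import java.util.Random
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // PairRDDFunctions implicits (Spark 0.9.x style)

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupBySketch"))
    val numMappers  = 64    // map tasks, e.g. one per core
    val numKVPairs  = 10000 // pairs generated per mapper
    val valSize     = 1000  // bytes per value; total data = mappers * pairs * valSize
    val numReducers = numMappers

    // Each mapper generates random key/value pairs locally.
    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      Array.fill(numKVPairs)((rand.nextInt(Int.MaxValue), new Array[Byte](valSize)))
    }.cache()
    pairs.count() // materialize the input before timing the shuffle

    val t0 = System.currentTimeMillis()
    println(pairs.groupByKey(numReducers).count()) // full all-to-all shuffle
    println(s"GroupBy time: ${(System.currentTimeMillis() - t0) / 1000.0} sec")
    sc.stop()
  }
}
```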
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Architecture Overview
- Design goals: high performance; keep the existing Spark architecture and interfaces intact; minimal code changes
- Server side (BlockManager): Java NIO shuffle server (default), Netty shuffle server (optional), RDMA shuffle server (plug-in)
- Fetcher side (BlockFetcherIterator): Java NIO shuffle fetcher (default), Netty shuffle fetcher (optional), RDMA shuffle fetcher (plug-in)
- RDMA-based Shuffle Engine (Java/JNI) beneath the plug-ins, running over 1/10 Gig Ethernet/IPoIB (QDR/FDR) or native InfiniBand (QDR/FDR)
- Approach: plug-in based extension of the shuffle framework in Spark (RDMA shuffle server + RDMA shuffle fetcher), with ~100 lines of code changed inside the original Spark files (a conceptual sketch of the plug-in dispatch follows below)
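A conceptual Scala sketch of the plug-in idea follows. This is not the actual OSU code and the real class names inside Spark 0.9.1 differ; it only illustrates why a configuration-driven dispatch point keeps the in-tree change small.

```scala
// Illustrative shuffle-fetcher abstraction; real signatures in Spark differ.
trait ShuffleFetcher {
  def fetchBlocks(blockIds: Seq[String]): Iterator[Array[Byte]]
}

class NioShuffleFetcher extends ShuffleFetcher {   // default Java NIO path
  def fetchBlocks(blockIds: Seq[String]) = Iterator.empty
}
class NettyShuffleFetcher extends ShuffleFetcher { // optional Netty path
  def fetchBlocks(blockIds: Seq[String]) = Iterator.empty
}
class RdmaShuffleFetcher extends ShuffleFetcher {  // plug-in: would delegate
  def fetchBlocks(blockIds: Seq[String]) = Iterator.empty // to the JNI RDMA engine
}

object ShuffleFetcherFactory {
  // A single dispatch point like this is why the change inside Spark's own
  // files can stay around ~100 lines: new transports plug in from outside.
  def create(mode: String): ShuffleFetcher = mode match {
    case "rdma"  => new RdmaShuffleFetcher
    case "netty" => new NettyShuffleFetcher
    case _       => new NioShuffleFetcher
  }
}
```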
SEDA-based Data Shuffle Plug-ins
- SEDA - Staged Event-Driven Architecture: a set of stages connected by queues
  - A dedicated thread pool is in charge of processing the events on each queue: Listener, Receiver, Handlers, Responders
  - Admission control is performed on these event queues
- High throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark
- Pipeline in the RDMA-based Shuffle Engine: the Listener accepts connection requests into the connection pool; the Receiver receives requests and enqueues them on the request queue; Handlers dequeue requests, read the data blocks, and enqueue responses; Responders dequeue responses and send them (see the sketch below)
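To make the pattern concrete, here is a minimal Scala sketch of two SEDA stages, assuming placeholder `readBlock`/`sendOverNetwork` functions and illustrative thread and queue sizes; the bounded queues provide the admission control mentioned above.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

case class ShuffleRequest(blockId: String)
case class ShuffleResponse(blockId: String, data: Array[Byte])

// One SEDA stage: a bounded event queue drained by a dedicated thread pool.
class Stage[In](threads: Int, capacity: Int)(process: In => Unit) {
  private val queue = new LinkedBlockingQueue[In](capacity)
  private val pool  = Executors.newFixedThreadPool(threads)
  for (_ <- 1 to threads) pool.submit(new Runnable {
    def run(): Unit = while (true) process(queue.take()) // drain events forever
  })
  def enqueue(event: In): Unit = queue.put(event) // blocks when the stage is saturated
}

object ShuffleServerSketch {
  def readBlock(id: String): Array[Byte] = new Array[Byte](0) // placeholder
  def sendOverNetwork(r: ShuffleResponse): Unit = ()          // placeholder

  // Responder stage: pushes completed responses back to the fetcher.
  val responder = new Stage[ShuffleResponse](4, 1024)(sendOverNetwork)
  // Handler stage: reads the requested block, then hands it to the responder;
  // a Receiver stage (not shown) would feed this queue from the network.
  val handler = new Stage[ShuffleRequest](8, 1024)(req =>
    responder.enqueue(ShuffleResponse(req.blockId, readBlock(req.blockId))))
}
```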
RDMA-based Shuffle Engine - Connection Management
- Alternative designs
  - Pre-connection: hides the overhead in the initialization stage by connecting processes to each other before any actual communication, but sacrifices resources to keep all of these connections alive
  - Dynamic connection establishment and sharing: a connection is established if and only if an actual data exchange is going to take place
- A naive dynamic-connection design is not optimal, because it allocates resources for every data block transfer
- Advanced dynamic connection scheme that reduces the number of connection establishments
  - Spark uses a multi-threading approach to run multiple tasks in a single JVM → a good chance to share connections! (a sketch follows below)
- How long should a connection be kept alive for possible re-use? A timeout mechanism destroys idle connections
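The sharing-plus-timeout idea can be sketched in a few lines of Scala. The `Connection` type and the timeout value are placeholders for the native RDMA engine's state (queue pairs, registered buffers), not the published design.

```scala
import java.util.concurrent.ConcurrentHashMap

// Placeholder for a native RDMA connection's resources.
class Connection(val remote: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = () // would release native resources via JNI
}

class ConnectionPool(idleTimeoutMs: Long = 30000L) {
  private val conns = new ConcurrentHashMap[String, Connection]()

  // Dynamic establishment: connect on first use, and let every task (thread)
  // in this JVM that targets the same remote node share the result.
  def get(remote: String): Connection = {
    var c = conns.get(remote)
    if (c == null) {
      val fresh = new Connection(remote)
      val prev  = conns.putIfAbsent(remote, fresh) // another thread may have raced us
      c = if (prev != null) prev else fresh
    }
    c.lastUsedMs = System.currentTimeMillis()
    c
  }

  // Called periodically: tear down connections idle past the timeout,
  // answering "how long should the connection be kept alive for re-use?"
  def reap(): Unit = {
    val now = System.currentTimeMillis()
    val it  = conns.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      if (now - e.getValue.lastUsedMs > idleTimeoutMs) { e.getValue.close(); it.remove() }
    }
  }
}
```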
RDMA-based Shuffle Engine - Data Transfer
- Each connection is used by multiple tasks (threads) to transfer data concurrently; packets over the same communication lane go to different entities on both the server and fetcher sides
- Alternative design: perform sequential transfers of complete blocks over a communication lane → keeps the order, but causes long wait times for tasks that are ready to transfer data over the same connection
- Non-blocking and out-of-order data transfer (sketched below)
  - Chunking of data blocks
  - Non-blocking sends over shared connections
  - Out-of-order packet communication
  - Guarantees both performance and ordering
  - Works efficiently with the dynamic connection management and sharing mechanism
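A hedged Scala sketch of the chunking-and-reassembly idea follows; the chunk size and field names are illustrative, not the engine's actual wire format. Chunks from different tasks can interleave on a shared connection, and each block completes on the receiving side once all of its sequence numbers are present, regardless of arrival order.

```scala
import scala.collection.mutable

// Each chunk carries a request id (for connection sharing) and a sequence
// number (for ordering), per the buffer-management design on the next slide.
case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object Chunker {
  val ChunkSize = 64 * 1024 // illustrative chunk size

  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val total = (block.length + ChunkSize - 1) / ChunkSize
    (0 until total).map { i =>
      val end = math.min((i + 1) * ChunkSize, block.length)
      Chunk(requestId, i, total, block.slice(i * ChunkSize, end))
    }
  }
}

// Receiver side: chunks may arrive out of order and interleaved across
// requests; a block completes when all of its sequence numbers are present.
class Reassembler {
  private val pending = mutable.Map[Long, mutable.Map[Int, Array[Byte]]]()

  /** Returns the whole block once its last chunk arrives, else None. */
  def offer(c: Chunk): Option[Array[Byte]] = {
    val parts = pending.getOrElseUpdate(c.requestId, mutable.Map())
    parts(c.seqNo) = c.payload
    if (parts.size == c.totalChunks) {
      pending.remove(c.requestId)
      Some((0 until c.totalChunks).flatMap(parts(_)).toArray)
    } else None
  }
}
```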
RDMA-based Shuffle Engine - Buffer Management
- On-JVM-heap vs. off-JVM-heap buffer management
- Off-JVM-heap: high performance through native I/O, with shadow buffers in Java/Scala registered for RDMA communication
- Flexibility for upper-layer design choices (an illustrative buffer layout is sketched below)
  - Supports the connection sharing mechanism → request ID
  - Supports in-order packet processing → sequence number
  - Supports non-blocking sends → buffer flag + callback
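The exact layout used by the engine is not published; the following Scala sketch only illustrates how a direct (off-JVM-heap) buffer can carry the three pieces of metadata the slide names, assuming a simple header of request ID, sequence number, and flag byte.

```scala
import java.nio.ByteBuffer

object RdmaBuffer {
  val HeaderBytes = 8 + 4 + 1 // requestId + seqNo + flag (illustrative layout)

  def wrap(requestId: Long, seqNo: Int, flag: Byte, payload: Array[Byte]): ByteBuffer = {
    // allocateDirect keeps the memory off the JVM heap, so the garbage
    // collector never moves it and native JNI code can register it for RDMA.
    val buf = ByteBuffer.allocateDirect(HeaderBytes + payload.length)
    buf.putLong(requestId) // request ID -> connection sharing
      .putInt(seqNo)       // sequence number -> in-order processing
      .put(flag)           // buffer flag -> non-blocking send bookkeeping
      .put(payload)
    buf.flip()
    buf
  }

  // Absolute reads let the native side and callbacks inspect the header
  // without disturbing the buffer's position.
  def requestId(buf: ByteBuffer): Long = buf.getLong(0)
  def seqNo(buf: ByteBuffer): Int      = buf.getInt(8)
  def flag(buf: ByteBuffer): Byte      = buf.get(12)
}
```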
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Experimental Setup
- Hardware
  - Intel Westmere Cluster (A): up to 9 nodes; each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs and 24 GB main memory; Mellanox QDR HCAs (32Gbps) + 10GigE
  - TACC Stampede Cluster (B): up to 17 nodes; dual octa-core Intel Sandy Bridge (E5-2680) processors running at 2.70 GHz, 32 GB main memory; Mellanox FDR HCAs (56Gbps)
- Software
  - Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  - GroupBy Test
Performance Evaluation on Cluster A
[Figures: GroupBy time (sec) for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); left: 4 worker nodes, 32 cores (32M 32R), 4-10 GB; right: 8 worker nodes, 64 cores (64M 64R), 8-20 GB]
- For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
- For 64 cores, up to 20% improvement over IPoIB (32Gbps) and up to 34% over 10GigE
Performance Evaluation on Cluster B
[Figures: GroupBy time (sec) for IPoIB (56Gbps) vs. RDMA (56Gbps); left: 8 worker nodes, 128 cores (128M 128R), 8-20 GB; right: 16 worker nodes, 256 cores (256M 256R), 16-40 GB]
- For 128 cores, up to 83% improvement over IPoIB (56Gbps)
- For 256 cores, up to 79% improvement over IPoIB (56Gbps)
Performance Analysis on Cluster B
- Java-based micro-benchmark comparison of peak bandwidth: IPoIB (56Gbps) reaches 1741.46 MBps, RDMA (56Gbps) reaches 5612.55 MBps
- Benefit of the RDMA connection sharing design
[Figure: GroupBy time (sec), RDMA without vs. with connection sharing (56Gbps), 16-40 GB on 16 nodes]
- Enabling connection sharing achieves a 55% performance benefit for the 40 GB data size on 16 nodes
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Conclusion and Future Work
- Three major conclusions
  - RDMA and high-performance interconnects can benefit Spark
  - The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity
  - Spark applications can benefit from an RDMA-enhanced design
- Future work
  - Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  - Make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/