Accelerating Spark with RDMA for Big Data Processing: Early Experiences
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Digital Data Explosion in the Society
[Figure: growth of digital data. Source: http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode]
Big Data Technology - Hadoop
- Apache Hadoop is one of the most popular Big Data technologies
- Provides frameworks for large-scale, distributed data storage and processing: MapReduce, HDFS, YARN, RPC, etc.
- Hadoop 1.x stack: MapReduce (cluster resource management & data processing) over the Hadoop Distributed File System (HDFS), on top of Hadoop Common/Core (RPC, ...)
- Hadoop 2.x stack: MapReduce and other data-processing models over YARN (cluster resource management & job scheduling) and HDFS, on top of Hadoop Common/Core (RPC, ...)
Big Data Technology - Spark
- An emerging in-memory processing framework
  - Iterative machine learning
  - Interactive data analytics
  - Scala-based implementation
  - Master-slave architecture; builds on HDFS and Zookeeper
- Scalable and communication intensive
  - Wide dependencies between Resilient Distributed Datasets (RDDs), as opposed to narrow dependencies (see the sketch below)
  - MapReduce-like shuffle operations to repartition RDDs
  - Same as Hadoop, sockets-based communication
- Architecture: a Driver holding a SparkContext talks to the Master, which schedules tasks on Workers that read from HDFS; Zookeeper coordinates the masters
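The narrow/wide distinction is easiest to see in code. Below is a minimal Scala sketch (not from the talk; application name and sizes are illustrative): a `map` keeps data local, while `groupByKey` creates the wide dependency whose shuffle this work targets.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // PairRDDFunctions implicits (Spark 0.9.x style)

object DependencyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DependencyExample"))
    // Narrow dependency: each output partition depends on exactly one input
    // partition, so no data moves across the network.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    // Wide dependency: every reducer may need data from every mapper, so
    // groupByKey triggers a MapReduce-like shuffle to repartition the RDD.
    val grouped = pairs.groupByKey()
    println(grouped.count())
    sc.stop()
  }
}
```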
Common Protocols using Open Fabrics
- Sockets interface, kernel-space TCP/IP: 1/10/40 GigE over Ethernet driver; IPoIB over InfiniBand; 10/40 GigE-TOE with hardware offload
- Sockets interface, user-space RDMA: RSockets and SDP over InfiniBand
- Verbs interface, user-space RDMA: iWARP (iWARP adapter), RoCE (RoCE adapter), and native IB verbs (InfiniBand adapter)
- Each protocol runs over its matching adapter and switch (Ethernet or InfiniBand)
Previous Studies
Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
- Hadoop acceleration with RDMA
  - N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14
  - N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC '12
  - M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS '14
  - M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC '13
  - X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP '13
- HBase acceleration with RDMA
  - J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
- Memcached acceleration with RDMA
  - J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
The High-Performance Big Data (HiBD) Project
- RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
- RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
- RDMA for Memcached (RDMA-Memcached)
- OSU HiBD-Benchmarks (OHB)
http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop
- High-performance design of Hadoop over RDMA-enabled interconnects
  - Native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  - Easily configurable for native InfiniBand, RoCE, and traditional sockets-based transports (Ethernet and InfiniBand with IPoIB)
- Current release: 0.9.9 (03/31/14)
  - Based on Apache Hadoop 1.2.1; compliant with Apache Hadoop 1.2.1 APIs and applications
  - Tested with Mellanox InfiniBand adapters (DDR, QDR, and FDR), RoCE support with Mellanox adapters, various multi-core platforms, and different file systems with disks and SSDs
- RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
http://hibd.cse.ohio-state.edu
Performance Benefits - RandomWriter & Sort on SDSC Gordon
[Figures: job execution time (sec), IPoIB (QDR) vs. OSU-IB (QDR), for 50-300 GB data sizes on clusters of 16, 32, 48, and 64 nodes]
- RandomWriter, single SSD per node: 16% improvement over IPoIB for 50 GB on a 16-node cluster; 20% improvement over IPoIB for 300 GB on a 64-node cluster
- Sort, single SSD per node: 20% improvement over IPoIB for 50 GB on a 16-node cluster; 36% improvement over IPoIB for 300 GB on a 64-node cluster
- Can RDMA also benefit Apache Spark on high-performance networks?
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Problem Statement
- Is it worth it? Is the performance improvement potential high enough if we successfully adapt RDMA to Spark - a few percentage points, or orders of magnitude?
- How difficult is it to adapt RDMA to Spark?
  - Can RDMA be adapted to suit the communication needs of Spark?
  - Is it viable to rewrite portions of the Spark code with RDMA?
- Can Spark applications benefit from an RDMA-enhanced design?
  - What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
  - Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
- How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
- Assessment methodology
  - Evaluation on primitive-level micro-benchmarks: latency and bandwidth; 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); Java/Scala-based environment
  - Evaluation on typical workloads: GroupByTest; 1GigE, 10GigE, and IPoIB (32Gbps); Spark 0.9.1
Evaluation on Primitive-level Micro-benchmarks
[Figures: latency vs. message size (1 byte - 4 MB) for small and large messages, and peak bandwidth (MBps), comparing 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]
- As expected, IPoIB and 10GigE are much better than 1GigE
- Compared to the other interconnects/protocols, RDMA significantly reduces the latency for all message sizes and improves the peak bandwidth
- Can these benefits of high-performance networks be achieved in Apache Spark? (a minimal latency benchmark in the spirit of these tests is sketched below)
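For concreteness, here is a hedged Scala sketch of a socket-level ping-pong latency test, not the actual harness used for these figures; iteration counts are illustrative. The interface under test (1GigE, 10GigE, or IPoIB) is selected simply by the address you connect to; measuring the verbs-level RDMA numbers would instead require a native benchmark (e.g., perftest's ib_send_lat).

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.{ServerSocket, Socket}

object PingPong {
  val Warmup = 1000
  val Iters  = 10000

  def main(args: Array[String]): Unit = args match {
    case Array("server", port, size)       => server(port.toInt, size.toInt)
    case Array("client", host, port, size) => client(host, port.toInt, size.toInt)
  }

  // Echo side: receive one fixed-size message and send it straight back.
  def server(port: Int, size: Int): Unit = {
    val sock = new ServerSocket(port).accept(); sock.setTcpNoDelay(true)
    val in  = new DataInputStream(sock.getInputStream)
    val out = new DataOutputStream(sock.getOutputStream)
    val buf = new Array[Byte](size)
    for (_ <- 0 until Warmup + Iters) { in.readFully(buf); out.write(buf); out.flush() }
  }

  // Timing side: half the average round-trip time approximates one-way latency.
  def client(host: String, port: Int, size: Int): Unit = {
    val sock = new Socket(host, port); sock.setTcpNoDelay(true)
    val in  = new DataInputStream(sock.getInputStream)
    val out = new DataOutputStream(sock.getOutputStream)
    val buf = new Array[Byte](size)
    for (_ <- 0 until Warmup) { out.write(buf); out.flush(); in.readFully(buf) }
    val t0 = System.nanoTime
    for (_ <- 0 until Iters)  { out.write(buf); out.flush(); in.readFully(buf) }
    println(f"$size%d bytes: ${(System.nanoTime - t0) / 1000.0 / Iters / 2}%.2f us one-way")
  }
}
```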
Evaluation on Typical Workloads
[Figures: GroupBy time (sec) for 4-10 GB data sizes (annotated improvements of 65% and 22%), and network throughput (MB/s) in send and receive over time, comparing 1GigE, 10GigE, and IPoIB (32Gbps)]
- For high-performance networks, the execution time of GroupBy is significantly improved
- The network throughputs are much higher for both send and receive
- Spark can benefit from high-performance interconnects to a greater extent!
- Can RDMA further benefit Spark performance compared with IPoIB and 10GigE? (the GroupByTest workload used here is sketched below)
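The shape of this workload matters for the rest of the talk, so here is a hedged Scala sketch modeled on the GroupByTest example that ships with Spark (parameter names and values are illustrative, not the exact configuration used in these runs): the shuffle triggered by `groupByKey` dominates the reported "GroupBy Time".

```scala
import java.util.Random
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // PairRDDFunctions implicits (Spark 0.9.x style)

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupBySketch"))
    val numMappers  = 64    // map tasks, e.g. one per core
    val numKVPairs  = 10000 // pairs generated per mapper
    val valSize     = 1000  // bytes per value; total data = mappers * pairs * valSize
    val numReducers = numMappers

    // Each mapper generates random key/value pairs locally.
    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      Array.fill(numKVPairs)((rand.nextInt(Int.MaxValue), new Array[Byte](valSize)))
    }.cache()
    pairs.count() // materialize the input before timing the shuffle

    val t0 = System.currentTimeMillis()
    println(pairs.groupByKey(numReducers).count()) // full all-to-all shuffle
    println(s"GroupBy time: ${(System.currentTimeMillis() - t0) / 1000.0} sec")
    sc.stop()
  }
}
```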
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Architecture Overview
- Design goals: high performance; keep the existing Spark architecture and interfaces intact; minimal code changes
- Server side (BlockManager): Java NIO shuffle server (default), Netty shuffle server (optional), RDMA shuffle server (plug-in)
- Fetcher side (BlockFetcherIterator): Java NIO shuffle fetcher (default), Netty shuffle fetcher (optional), RDMA shuffle fetcher (plug-in)
- RDMA-based Shuffle Engine (Java/JNI) beneath the plug-ins, running over 1/10 Gig Ethernet/IPoIB (QDR/FDR) or native InfiniBand (QDR/FDR)
- Approach: plug-in based extension of the shuffle framework in Spark (RDMA shuffle server + RDMA shuffle fetcher), with ~100 lines of code changed inside the original Spark files (a conceptual sketch of the plug-in dispatch follows below)
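A conceptual Scala sketch of the plug-in idea follows. This is not the actual OSU code and the real class names inside Spark 0.9.1 differ; it only illustrates why a configuration-driven dispatch point keeps the in-tree change small.

```scala
// Illustrative shuffle-fetcher abstraction; real signatures in Spark differ.
trait ShuffleFetcher {
  def fetchBlocks(blockIds: Seq[String]): Iterator[Array[Byte]]
}

class NioShuffleFetcher extends ShuffleFetcher {   // default Java NIO path
  def fetchBlocks(blockIds: Seq[String]) = Iterator.empty
}
class NettyShuffleFetcher extends ShuffleFetcher { // optional Netty path
  def fetchBlocks(blockIds: Seq[String]) = Iterator.empty
}
class RdmaShuffleFetcher extends ShuffleFetcher {  // plug-in: would delegate
  def fetchBlocks(blockIds: Seq[String]) = Iterator.empty // to the JNI RDMA engine
}

object ShuffleFetcherFactory {
  // A single dispatch point like this is why the change inside Spark's own
  // files can stay around ~100 lines: new transports plug in from outside.
  def create(mode: String): ShuffleFetcher = mode match {
    case "rdma"  => new RdmaShuffleFetcher
    case "netty" => new NettyShuffleFetcher
    case _       => new NioShuffleFetcher
  }
}
```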
SEDA-based Data Shuffle Plug-ins
- SEDA - Staged Event-Driven Architecture: a set of stages connected by queues
  - A dedicated thread pool is in charge of processing the events on each queue: Listener, Receiver, Handlers, Responders
  - Admission control is performed on these event queues
- High throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark
- Pipeline in the RDMA-based Shuffle Engine: the Listener accepts connection requests into the connection pool; the Receiver receives requests and enqueues them on the request queue; Handlers dequeue requests, read the data blocks, and enqueue responses; Responders dequeue responses and send them (see the sketch below)
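To make the pattern concrete, here is a minimal Scala sketch of two SEDA stages, assuming placeholder `readBlock`/`sendOverNetwork` functions and illustrative thread and queue sizes; the bounded queues provide the admission control mentioned above.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

case class ShuffleRequest(blockId: String)
case class ShuffleResponse(blockId: String, data: Array[Byte])

// One SEDA stage: a bounded event queue drained by a dedicated thread pool.
class Stage[In](threads: Int, capacity: Int)(process: In => Unit) {
  private val queue = new LinkedBlockingQueue[In](capacity)
  private val pool  = Executors.newFixedThreadPool(threads)
  for (_ <- 1 to threads) pool.submit(new Runnable {
    def run(): Unit = while (true) process(queue.take()) // drain events forever
  })
  def enqueue(event: In): Unit = queue.put(event) // blocks when the stage is saturated
}

object ShuffleServerSketch {
  def readBlock(id: String): Array[Byte] = new Array[Byte](0) // placeholder
  def sendOverNetwork(r: ShuffleResponse): Unit = ()          // placeholder

  // Responder stage: pushes completed responses back to the fetcher.
  val responder = new Stage[ShuffleResponse](4, 1024)(sendOverNetwork)
  // Handler stage: reads the requested block, then hands it to the responder;
  // a Receiver stage (not shown) would feed this queue from the network.
  val handler = new Stage[ShuffleRequest](8, 1024)(req =>
    responder.enqueue(ShuffleResponse(req.blockId, readBlock(req.blockId))))
}
```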
RDMA-based Shuffle Engine - Connection Management
- Alternative designs
  - Pre-connection: hides the overhead in the initialization stage by connecting processes to each other before any actual communication, but sacrifices resources to keep all of these connections alive
  - Dynamic connection establishment and sharing: a connection is established if and only if an actual data exchange is going to take place
- A naive dynamic-connection design is not optimal, because it allocates resources for every data block transfer
- Advanced dynamic connection scheme that reduces the number of connection establishments
  - Spark uses a multi-threading approach to run multiple tasks in a single JVM → a good chance to share connections! (a sketch follows below)
- How long should a connection be kept alive for possible re-use? A timeout mechanism destroys idle connections
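The sharing-plus-timeout idea can be sketched in a few lines of Scala. The `Connection` type and the timeout value are placeholders for the native RDMA engine's state (queue pairs, registered buffers), not the published design.

```scala
import java.util.concurrent.ConcurrentHashMap

// Placeholder for a native RDMA connection's resources.
class Connection(val remote: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def close(): Unit = () // would release native resources via JNI
}

class ConnectionPool(idleTimeoutMs: Long = 30000L) {
  private val conns = new ConcurrentHashMap[String, Connection]()

  // Dynamic establishment: connect on first use, and let every task (thread)
  // in this JVM that targets the same remote node share the result.
  def get(remote: String): Connection = {
    var c = conns.get(remote)
    if (c == null) {
      val fresh = new Connection(remote)
      val prev  = conns.putIfAbsent(remote, fresh) // another thread may have raced us
      c = if (prev != null) prev else fresh
    }
    c.lastUsedMs = System.currentTimeMillis()
    c
  }

  // Called periodically: tear down connections idle past the timeout,
  // answering "how long should the connection be kept alive for re-use?"
  def reap(): Unit = {
    val now = System.currentTimeMillis()
    val it  = conns.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      if (now - e.getValue.lastUsedMs > idleTimeoutMs) { e.getValue.close(); it.remove() }
    }
  }
}
```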
RDMA-based Shuffle Engine - Data Transfer
- Each connection is used by multiple tasks (threads) to transfer data concurrently; packets over the same communication lane go to different entities on both the server and fetcher sides
- Alternative design: perform sequential transfers of complete blocks over a communication lane → keeps the order, but causes long wait times for tasks that are ready to transfer data over the same connection
- Non-blocking and out-of-order data transfer (sketched below)
  - Chunking of data blocks
  - Non-blocking sends over shared connections
  - Out-of-order packet communication
  - Guarantees both performance and ordering
  - Works efficiently with the dynamic connection management and sharing mechanism
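A hedged Scala sketch of the chunking-and-reassembly idea follows; the chunk size and field names are illustrative, not the engine's actual wire format. Chunks from different tasks can interleave on a shared connection, and each block completes on the receiving side once all of its sequence numbers are present, regardless of arrival order.

```scala
import scala.collection.mutable

// Each chunk carries a request id (for connection sharing) and a sequence
// number (for ordering), per the buffer-management design on the next slide.
case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object Chunker {
  val ChunkSize = 64 * 1024 // illustrative chunk size

  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val total = (block.length + ChunkSize - 1) / ChunkSize
    (0 until total).map { i =>
      val end = math.min((i + 1) * ChunkSize, block.length)
      Chunk(requestId, i, total, block.slice(i * ChunkSize, end))
    }
  }
}

// Receiver side: chunks may arrive out of order and interleaved across
// requests; a block completes when all of its sequence numbers are present.
class Reassembler {
  private val pending = mutable.Map[Long, mutable.Map[Int, Array[Byte]]]()

  /** Returns the whole block once its last chunk arrives, else None. */
  def offer(c: Chunk): Option[Array[Byte]] = {
    val parts = pending.getOrElseUpdate(c.requestId, mutable.Map())
    parts(c.seqNo) = c.payload
    if (parts.size == c.totalChunks) {
      pending.remove(c.requestId)
      Some((0 until c.totalChunks).flatMap(parts(_)).toArray)
    } else None
  }
}
```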
RDMA-based Shuffle Engine - Buffer Management
- On-JVM-heap vs. off-JVM-heap buffer management
- Off-JVM-heap: high performance through native I/O, with shadow buffers in Java/Scala registered for RDMA communication
- Flexibility for upper-layer design choices (an illustrative buffer layout is sketched below)
  - Supports the connection sharing mechanism → request ID
  - Supports in-order packet processing → sequence number
  - Supports non-blocking sends → buffer flag + callback
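The exact layout used by the engine is not published; the following Scala sketch only illustrates how a direct (off-JVM-heap) buffer can carry the three pieces of metadata the slide names, assuming a simple header of request ID, sequence number, and flag byte.

```scala
import java.nio.ByteBuffer

object RdmaBuffer {
  val HeaderBytes = 8 + 4 + 1 // requestId + seqNo + flag (illustrative layout)

  def wrap(requestId: Long, seqNo: Int, flag: Byte, payload: Array[Byte]): ByteBuffer = {
    // allocateDirect keeps the memory off the JVM heap, so the garbage
    // collector never moves it and native JNI code can register it for RDMA.
    val buf = ByteBuffer.allocateDirect(HeaderBytes + payload.length)
    buf.putLong(requestId) // request ID -> connection sharing
      .putInt(seqNo)       // sequence number -> in-order processing
      .put(flag)           // buffer flag -> non-blocking send bookkeeping
      .put(payload)
    buf.flip()
    buf
  }

  // Absolute reads let the native side and callbacks inspect the header
  // without disturbing the buffer's position.
  def requestId(buf: ByteBuffer): Long = buf.getLong(0)
  def seqNo(buf: ByteBuffer): Int      = buf.getInt(8)
  def flag(buf: ByteBuffer): Byte      = buf.get(12)
}
```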
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Experimental Setup
- Hardware
  - Intel Westmere Cluster (A): up to 9 nodes; each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs and 24 GB main memory; Mellanox QDR HCAs (32Gbps) + 10GigE
  - TACC Stampede Cluster (B): up to 17 nodes; dual octa-core Intel Sandy Bridge (E5-2680) processors running at 2.70 GHz, 32 GB main memory; Mellanox FDR HCAs (56Gbps)
- Software
  - Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
  - GroupBy Test
Performance Evaluation on Cluster A
[Figures: GroupBy time (sec) for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps); left: 4 worker nodes, 32 cores (32M 32R), 4-10 GB; right: 8 worker nodes, 64 cores (64M 64R), 8-20 GB]
- For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
- For 64 cores, up to 20% improvement over IPoIB (32Gbps) and up to 34% over 10GigE
Performance Evaluation on Cluster B
[Figures: GroupBy time (sec) for IPoIB (56Gbps) vs. RDMA (56Gbps); left: 8 worker nodes, 128 cores (128M 128R), 8-20 GB; right: 16 worker nodes, 256 cores (256M 256R), 16-40 GB]
- For 128 cores, up to 83% improvement over IPoIB (56Gbps)
- For 256 cores, up to 79% improvement over IPoIB (56Gbps)
Performance Analysis on Cluster B
- Java-based micro-benchmark comparison of peak bandwidth: IPoIB (56Gbps) reaches 1741.46 MBps, RDMA (56Gbps) reaches 5612.55 MBps
- Benefit of the RDMA connection sharing design
[Figure: GroupBy time (sec), RDMA without vs. with connection sharing (56Gbps), 16-40 GB on 16 nodes]
- Enabling connection sharing achieves a 55% performance benefit for the 40 GB data size on 16 nodes
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
- Conclusion & Future Work
Conclusion and Future Work
- Three major conclusions
  - RDMA and high-performance interconnects can benefit Spark
  - The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity
  - Spark applications can benefit from an RDMA-enhanced design
- Future work
  - Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
  - Make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/