Big Data: Hadoop and Memcached. Talk at the HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
Introduction to Big Data Applications and Analytics: Big Data has become one of the most important elements of business analytics, providing groundbreaking opportunities for enterprise information management and decision making. The amount of data is exploding; companies are capturing and digitizing more information than ever, and the rate of information growth appears to be exceeding Moore's Law. Commonly accepted 3 V's of Big Data: Volume, Velocity, Variety (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/nist-stonebraker.pdf). 5 V's of Big Data: the 3 V's plus Value and Veracity.
Not Only in Internet Services: Big Data in Scientific Domains. Scientific data management, analysis, and visualization. Application examples: climate modeling, combustion, fusion, astrophysics, bioinformatics. Data-intensive tasks: run large-scale simulations on supercomputers, dump data onto parallel storage systems, collect experimental/observational data, and move experimental/observational data to analysis sites; visual analytics help understand the data visually.
Typical Solutions or Architectures for Big Data Applications and Analytics. Hadoop (http://hadoop.apache.org): the most popular framework for Big Data analytics; HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc. Spark (http://spark-project.org): provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly. Shark (http://shark.cs.berkeley.edu): a fully Apache Hive-compatible data warehousing system. Storm (http://storm-project.net): a distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc. S4 (http://incubator.apache.org/s4): a distributed system for processing continuous unbounded streams of data. Web 2.0: RDBMS + Memcached (http://memcached.org); Memcached is a high-performance, distributed memory object caching system.
Who Are Using Hadoop? Focused on large data and data analysis, the Hadoop environment (e.g. HDFS, MapReduce, HBase, RPC) is gaining a lot of momentum: http://wiki.apache.org/hadoop/poweredby
Hadoop at Yahoo! Scale: 42,000 nodes, 365 PB of HDFS storage, 1M daily slot hours (http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-17171422.html). A new Jim Gray Sort record: 1.42 TB/min over 2,100 nodes using Hadoop 0.23.7, an early branch of Hadoop 2.X (http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-927488.html, http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html).
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Big Data Processing with Hadoop: an open-source implementation of the MapReduce programming model for Big Data analytics. Major components: HDFS and MapReduce; the underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase. The model scales, but the large amount of communication during intermediate phases can be further optimized. (Hadoop framework stack: user applications on top of MapReduce and HBase, over HDFS and Hadoop Common (RPC).)
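The programming model above is easiest to see in code. The following is a minimal word-count style job written against the standard Hadoop 1.x Java MapReduce API; it is a generic textbook example rather than anything specific to this talk, and the input/output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every token in the input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (tok.isEmpty()) continue;
        word.set(tok);
        ctx.write(word, ONE); // intermediate output, shuffled to the reducers
      }
    }
  }

  // Reduce phase: sum the counts that arrive for each word after the shuffle.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

All of the map output in this example reaches the reducers through the shuffle phase, which is exactly the socket-based communication path that the designs later in the talk accelerate.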
Hadoop MapReduce Architecture and Operations: Map and Reduce tasks carry out the overall job execution, with intermediate data going through disk operations; communication in the shuffle phase uses HTTP over Java sockets.
Hadoop Distributed File System (HDFS): the primary storage of Hadoop; highly reliable and fault-tolerant; adopted by many reputed organizations, e.g. Facebook and Yahoo!. The NameNode stores the file system namespace; DataNodes store data blocks. Developed in Java for platform-independence and portability; uses sockets for communication. (HDFS architecture diagram: clients interact with the NameNode and DataNodes via RPC.)
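Since every HDFS client reaches the NameNode over RPC and streams block data to a pipeline of DataNodes over sockets, even a trivial write exercises the communication path discussed in this talk. Below is a minimal client-side sketch against the standard HDFS Java API; the path and replication factor are chosen only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);     // metadata operations go to the NameNode over RPC

    Path file = new Path("/user/demo/sample.txt"); // illustrative path

    // create() asks the NameNode to allocate blocks; the bytes themselves are
    // streamed to a pipeline of DataNodes (plain sockets in stock Hadoop).
    FSDataOutputStream out = fs.create(file, (short) 3 /* replication factor */);
    out.writeBytes("hello HDFS\n");
    out.close(); // flushes the stream and completes the block

    System.out.println("exists: " + fs.exists(file));
    fs.close();
  }
}
```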
System-Level Interaction Between Clients and DataNodes in HDFS: (diagram: HDFS clients communicate over high-performance networks with HDFS DataNodes, each backed by HDD/SSD storage).
HBase: the Apache Hadoop database (http://hbase.apache.org/); a highly scalable, semi-structured database. An integral part of many datacenter applications, e.g. the Facebook Social Inbox. Developed in Java for platform-independence and portability; uses sockets for communication. (HBase architecture: HBase clients, HMaster, HRegion servers, and ZooKeeper, with HDFS underneath.)
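A minimal HBase client interaction, written against the HBase 0.9x-era Java client API (an assumption about the version; the table, column family, and values are illustrative), shows the Put and Get operations whose communication costs are analyzed later in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml (ZooKeeper quorum, etc.)
    HTable table = new HTable(conf, "usertable");     // illustrative table name

    // Put: the client locates the owning HRegionServer via ZooKeeper/META and
    // ships the mutation over the socket-based RPC path.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
    table.put(put);

    // Get: the same region lookup, followed by a read RPC to the region server.
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

    table.close();
  }
}
```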
System-Level Interaction Among HBase Clients, Region Servers, and DataNodes: (diagram: HBase clients, HRegion servers, and DataNodes, each backed by HDD/SSD storage, connected through high-performance networks).
Memcached Architecture: (diagram: web frontend servers acting as Memcached clients reach Memcached servers and database servers over the Internet and high-performance networks; each server node has CPUs, main memory, SSD, and HDD). Memcached is an integral part of the Web 2.0 architecture: a distributed caching layer that aggregates spare memory from multiple nodes; general purpose, typically used to cache database queries and results of API calls; a scalable model, but typical usage is very network intensive.
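The cache-aside usage pattern described above looks roughly like the sketch below, written here with the spymemcached Java client purely as one representative client library; the key scheme, TTL, and database call are placeholders.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
  private final MemcachedClient cache;

  public CacheAsideExample(String host, int port) throws Exception {
    cache = new MemcachedClient(new InetSocketAddress(host, port));
  }

  // Cache-aside read: try Memcached first, fall back to the database on a miss.
  public String getUserProfile(String userId) {
    String key = "user:" + userId;          // illustrative key scheme
    Object cached = cache.get(key);         // one network round trip to a Memcached server
    if (cached != null) {
      return (String) cached;               // cache hit
    }
    String profile = queryDatabase(userId); // cache miss: the expensive RDBMS query
    cache.set(key, 3600, profile);          // populate the cache with a 1-hour TTL
    return profile;
  }

  private String queryDatabase(String userId) {
    return "profile-of-" + userId;          // placeholder for the real database lookup
  }

  public void shutdown() {
    cache.shutdown();
  }

  public static void main(String[] args) throws Exception {
    CacheAsideExample app = new CacheAsideExample("127.0.0.1", 11211);
    System.out.println(app.getUserProfile("42")); // miss on the first call
    System.out.println(app.getUserProfile("42")); // served from Memcached
    app.shutdown();
  }
}
```

Every get() and set() above is a small request/response over the network, which is why this workload is so sensitive to latency and why typical usage is described as very network intensive.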
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org): (figure: number of clusters and percentage of clusters in the Top500 plotted over time).
HPC and Advanced Interconnects: High-Performance Computing (HPC) has adopted advanced interconnects and protocols: InfiniBand, 10 Gigabit Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE). These deliver very good performance: low latency (a few microseconds), high bandwidth (100 Gb/s with dual FDR InfiniBand), and low CPU overhead (5-10%). The OpenFabrics software stack with IB, iWARP, and RoCE interfaces is driving HPC systems; many such systems are in the Top500 list.
Trends of Networking Technologies in Top500 Systems: (figure: interconnect family share of Top500 systems over time); the percentage share of InfiniBand is steadily increasing.
Use of High-Performance Networks for Scientific Computing: Message Passing Interface (MPI), the most commonly used programming model for HPC applications; parallel file systems; storage systems. Almost 13 years of research and development since InfiniBand was introduced in October 2000. Other programming models are emerging to take advantage of high-performance networks: UPC, OpenSHMEM, and hybrid MPI+PGAS (OpenSHMEM and UPC).
One-way Latency: MPI over IB with MVAPICH2: (figure: small-message and large-message latency in microseconds versus message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies are in the 1-2 us range across these configurations). Platforms: DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch.
Bandwidth: MPI over IB with MVAPICH2: (figure: unidirectional and bidirectional bandwidth in MBytes/sec versus message size for the same seven adapter configurations; peak unidirectional bandwidth exceeds 12,000 MBytes/sec and peak bidirectional bandwidth reaches about 24,700 MBytes/sec with dual-FDR ConnectIB). Platforms: DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch.
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Can High-Performance Interconnects Benefit Hadoop? Most current Hadoop environments use Ethernet infrastructure with sockets, raising concerns about performance and scalability. Usage of high-performance networks is beginning to draw interest from many companies. What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? Can HPC clusters with high-performance networks be used for Big Data applications using Hadoop?
Designing Communication and I/O Libraries for Big Data Systems: Challenges. (stack diagram: applications and benchmarks on top; Big Data middleware (HDFS, MapReduce, HBase and Memcached); programming models (sockets, and possibly other protocols); a communication and I/O library providing point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, and fault-tolerance; underneath, networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
Common Protocols using OpenFabrics: (diagram: applications/middleware can use either the sockets interface or the verbs interface; sockets-based options include plain Ethernet (1/10/40 GigE over the kernel TCP/IP stack), hardware-offloaded Ethernet (10/40 GigE-TOE), IPoIB, RSockets, and SDP, while verbs-based options include iWARP, RoCE, and native IB verbs; each option maps onto the corresponding Ethernet or InfiniBand adapters and switches, with the RDMA paths operating from user space.)
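One practical consequence of this protocol stack: unmodified sockets code already runs over several of these transports. The plain Java client below works over 1/10/40 GigE or IPoIB simply by connecting to an address bound to the appropriate interface, and on a JDK 7 / OFED setup (an assumption about the deployment) it can be mapped onto SDP with the -Dcom.sun.sdp.conf=<rules file> JVM option with no code change; only the verbs-level paths on the right of the diagram require a different programming interface. The hostname and port here are made up.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class PlainSocketClient {
  public static void main(String[] args) throws Exception {
    // The same code runs over Ethernet, IPoIB, or (via a JVM-level rules file) SDP;
    // which transport is used depends only on the address and the deployment.
    String host = args.length > 0 ? args[0] : "server-ib0"; // illustrative hostname
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 9000;

    try (Socket sock = new Socket(host, port)) {
      PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(sock.getInputStream()));
      out.println("ping"); // stream semantics: ordered bytes, with kernel copies on the host
      System.out.println("reply: " + in.readLine());
    }
  }
}
```

The convenience of this portability is exactly what the next slide questions: the sockets abstraction itself, rather than the wire underneath, becomes the performance bottleneck.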
Can Hadoop be Accelerated with High-Performance Networks and Protocols? (diagram: current design: application over sockets over a 1/10 GigE network; enhanced designs: application over accelerated sockets with verbs/hardware offload over 10 GigE or InfiniBand; our approach: application over the OSU design with a verbs interface over 10 GigE or InfiniBand.) Sockets were not designed for high performance: stream semantics often mismatch the upper layers (Hadoop and HBase), and zero-copy is not available for non-blocking sockets.
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
RDMA for Apache Hadoop Project: a high-performance design of Hadoop over RDMA-enabled interconnects, with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components. Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB). Current release: 0.9.8, based on Apache Hadoop 1.2.1 and compliant with Apache Hadoop 1.2.1 APIs and applications. Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR), RoCE, various multi-core platforms, and different file systems with disks and SSDs. http://hadoop-rdma.cse.ohio-state.edu
Design Overview of HDFS with RDMA: enables high-performance RDMA communication while supporting the traditional socket interface. Design features: RDMA-based HDFS write, RDMA-based HDFS replication, and InfiniBand/RoCE support. (diagram: HDFS Write and other applications go either through the Java socket interface over 1/10 GigE or IPoIB, or through the Java Native Interface (JNI) into the OSU design, which uses verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).) The JNI layer bridges the Java-based HDFS with a communication library written in native code. Only the communication part of HDFS Write is modified; there is no change to the HDFS architecture.
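The JNI bridging mentioned above can be pictured with a small, purely hypothetical sketch of the Java side of such a bridge; the library name, method names, and signatures below are invented for illustration and are not the actual OSU implementation, which additionally handles connection management, memory registration caching, and completion handling in native code.

```java
// Hypothetical JNI bridge: Java middleware calls into a native library that
// issues RDMA verbs operations. All names and signatures are illustrative only.
public class RdmaBridge {
  static {
    // A native shared library (e.g. librdmabridge.so) built against libibverbs.
    System.loadLibrary("rdmabridge");
  }

  // Establish a verbs-level connection to a peer (e.g. a DataNode).
  public native long connect(String host, int port);

  // Register a direct buffer with the HCA so it can be used for zero-copy transfers.
  public native long registerBuffer(java.nio.ByteBuffer directBuffer);

  // Post an RDMA write of 'length' bytes from the registered buffer.
  public native int rdmaWrite(long connHandle, long regionHandle, int length);

  // Tear down the connection and release the registered memory.
  public native void close(long connHandle);

  public static void main(String[] args) {
    RdmaBridge bridge = new RdmaBridge();
    java.nio.ByteBuffer buf = java.nio.ByteBuffer.allocateDirect(64 * 1024);
    long conn = bridge.connect("datanode1-ib0", 12345); // illustrative endpoint
    long region = bridge.registerBuffer(buf);
    buf.put("block payload".getBytes());
    bridge.rdmaWrite(conn, region, buf.position());
    bridge.close(conn);
  }
}
```

The point of this structure is that the Java-level HDFS code above the bridge stays unchanged; only the bytes of an HDFS write travel through the native verbs path.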
Communication Time in HDFS: (figure: communication time in seconds versus file size from 2 GB to 10 GB for 1GigE, IPoIB (QDR), and OSU-IB (QDR) on a cluster with HDD DataNodes). 30% improvement in communication time over IPoIB (QDR) and 56% improvement over 1GigE; similar improvements are obtained for SSD DataNodes. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.
Evaluations using TestDFSIO: (figure: total throughput in MBps versus file size from 5 GB to 20 GB; left: 1GigE, IPoIB (QDR), and OSU-IB (QDR) on a cluster with 8 HDD nodes, TestDFSIO with 8 maps; right: IPoIB (QDR) and OSU-IB (QDR) on a cluster with 4 SSD nodes, TestDFSIO with 4 maps). With 8 HDD DataNodes (single disk per node): 24% improvement over IPoIB (QDR) for a 20GB file size. With 4 SSD DataNodes (single SSD per node): 61% improvement over IPoIB (QDR) for a 20GB file size.
Evaluations using TestDFSIO: (figure: total throughput in MBps versus file size from 5 GB to 20 GB for 1GigE, IPoIB (QDR), and OSU-IB (QDR), each with 1 HDD and with 2 HDD per node). Cluster with 4 DataNodes, 1 HDD per node: 24% improvement over IPoIB (QDR) for a 20GB file size. Cluster with 4 DataNodes, 2 HDD per node: 31% improvement over IPoIB (QDR) for a 20GB file size. 2 HDD versus 1 HDD: 76% improvement for OSU-IB (QDR) and 66% improvement for IPoIB (QDR).
Evaluations using Enhanced DFSIO of Intel HiBench on SDSC Gordon: (figure: aggregated throughput in MBps and execution time in seconds for file sizes of 32, 64, and 128 GB with IPoIB (QDR) and OSU-IB (QDR)). Cluster with 32 DataNodes, single SSD per node: 47% improvement in throughput and 16% improvement in latency over IPoIB (QDR) for a 128GB file size.
Evaluations using Enhanced DFSIO of Intel HiBench on TACC Stampede: (figure: aggregated throughput in MBps and execution time in seconds for file sizes of 64, 128, and 256 GB with IPoIB (FDR) and OSU-IB (FDR)). Cluster with 64 DataNodes, single HDD per node: 64% improvement in throughput and 37% improvement in latency over IPoIB (FDR) for a 256GB file size.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Design Overview of MapReduce with RDMA: (diagram: MapReduce (JobTracker, TaskTrackers, Map and Reduce tasks) runs either over the Java socket interface on 1/10 GigE or IPoIB, or through the Java Native Interface (JNI) into the OSU design, which uses verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).) Design features for RDMA: prefetch/caching of MOF (Map Output Files), in-memory merge, and overlap of merge and reduce. Enables high-performance RDMA communication while supporting the traditional socket interface; the JNI layer bridges the Java-based MapReduce with a communication library written in native code; InfiniBand/RoCE support.
Evaluations using Sort: (figure: job execution time in seconds for data sizes of 25-40 GB with 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR); left: 8 HDD DataNodes, right: 8 SSD DataNodes). With 8 HDD DataNodes, for a 40GB sort: 41% improvement over IPoIB (QDR) and 39% improvement over Hadoop-A-IB (QDR). With 8 SSD DataNodes, for a 40GB sort: 52% improvement over IPoIB (QDR) and 44% improvement over Hadoop-A-IB (QDR). M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.
Evaluations using TeraSort: (figure: job execution time in seconds for data sizes of 60, 80, and 100 GB with 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR)). For a 100 GB TeraSort with 8 DataNodes and 2 HDD per node: 51% improvement over IPoIB (QDR) and 41% improvement over Hadoop-A-IB (QDR).
Performance Evaluation on Larger Clusters: (figure: job execution time in seconds; left: Sort on the OSU cluster with IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR) for data sizes of 60, 120, and 240 GB on cluster sizes of 16, 32, and 64; right: TeraSort on TACC Stampede with IPoIB (FDR), Hadoop-A-IB (FDR), and OSU-IB (FDR) for data sizes of 80, 160, and 320 GB on cluster sizes of 16, 32, and 64). Sort on the OSU cluster: for a 240GB sort on 64 nodes, 40% improvement over IPoIB (QDR) with HDD used for HDFS. TeraSort on TACC Stampede: for a 320GB TeraSort on 64 nodes, 38% improvement over IPoIB (FDR) with HDD used for HDFS.
Evaluations using the PUMA Workload: (figure: normalized execution time for 1GigE, IPoIB (QDR), and OSU-IB (QDR) on the AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB) benchmarks). 50% improvement in SelfJoin over IPoIB (QDR) for an 80 GB data size; 49% improvement in SeqCount over IPoIB (QDR) for a 30 GB data size.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Motivation: Detailed Analysis of HBase Put/Get: (figure: time breakdown in microseconds into communication, communication preparation, server processing, server serialization, client processing, and client serialization for 1KB HBase Put and 1KB HBase Get with IPoIB (DDR), 1GigE, and OSU-IB (DDR)). HBase 1KB Put: communication time of 8.9 us, a factor of 6X improvement over 1GigE in communication time. HBase 1KB Get: communication time of 8.9 us, a factor of 6X improvement over 1GigE in communication time. M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12.
HBase Single-Server-Multi-Client Results: (figure: latency in microseconds and throughput in ops/sec versus number of clients from 1 to 16 for IPoIB (DDR), OSU-IB (DDR), and 1GigE). HBase Get latency: 4 clients: 14.5 us; 16 clients: 296.1 us. HBase Get throughput: 4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec; 27% improvement in throughput for 16 clients over 1GigE. J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12.
HBase YCSB Read-Write Workload: (figure: read latency and write latency versus number of clients from 8 to 128 for IPoIB (QDR), 1GigE, and OSU-IB (QDR)). HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients: 2.0 ms; 128 clients: 3.5 ms; 42% improvement over IPoIB for 128 clients. HBase Put latency: 64 clients: 1.9 ms; 128 clients: 3.5 ms; 40% improvement over IPoIB for 128 clients.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Design Overview of Hadoop RPC with RDMA: (diagram: applications use either the default Java socket interface over 1/10 GigE or IPoIB, or our design through the Java Native Interface (JNI) into the OSU design, which uses verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).) Design features: JVM-bypassed buffer management, on-demand connection setup, RDMA- or send/recv-based adaptive communication, and InfiniBand/RoCE support. Enables high-performance RDMA communication while supporting the traditional socket interface.
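For context on what is being accelerated here: Hadoop 1.x exposes its RPC layer through org.apache.hadoop.ipc.RPC, where a protocol is just a Java interface served by a daemon and invoked through a dynamic proxy on the client. The sketch below is a toy protocol (the interface, port, and method are invented, and the exact ipc API shown is an assumption about the Hadoop 1.x version in use); it shows the request/response path that the RDMA-based design replaces underneath.

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.ProtocolSignature;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.ipc.VersionedProtocol;

public class RpcDemo {
  /** Toy protocol; HDFS and MapReduce daemons define their protocols the same way. */
  public interface PingProtocol extends VersionedProtocol {
    long VERSION = 1L;
    String ping(String msg);
  }

  /** Server-side implementation of the protocol. */
  public static class PingServer implements PingProtocol {
    public String ping(String msg) { return "pong: " + msg; }
    public long getProtocolVersion(String protocol, long clientVersion) {
      return VERSION;
    }
    public ProtocolSignature getProtocolSignature(String protocol,
        long clientVersion, int clientMethodsHash) throws java.io.IOException {
      return ProtocolSignature.getProtocolSignature(this, protocol,
          clientVersion, clientMethodsHash);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Start an RPC server on an illustrative port.
    Server server = RPC.getServer(new PingServer(), "0.0.0.0", 50123, conf);
    server.start();

    // Client side: obtain a dynamic proxy and make a call. Each invocation is a
    // request/response over the RPC transport (Java sockets in stock Hadoop).
    PingProtocol proxy = (PingProtocol) RPC.getProxy(PingProtocol.class,
        PingProtocol.VERSION, new InetSocketAddress("127.0.0.1", 50123), conf);
    System.out.println(proxy.ping("hello"));

    RPC.stopProxy(proxy);
    server.stop();
  }
}
```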
Hadoop RPC over IB: Gain in Latency and Throughput: (figure: latency in microseconds versus payload size from 1 byte to 4 KB, and throughput in Kops/sec versus number of clients from 8 to 64, for 1GigE, IPoIB (QDR), and OSU-IB (QDR)). Hadoop RPC over IB ping-pong latency: 1 byte: 39 us; 4 KB: 52 us; 42%-49% and 46%-50% improvements compared with the performance on 1 GigE and IPoIB (32Gbps), respectively. Hadoop RPC over IB throughput: 512 bytes and 48 clients: 135.22 Kops/sec; 82% and 64% improvements compared with the peak performance on 1 GigE and IPoIB (32Gbps), respectively.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Performance Benefits: TestDFSIO: (figure: total throughput in MBps versus file size from 5 GB to 20 GB for 1GigE, IPoIB (QDR), and OSU-IB (QDR); left: cluster with 4 HDD nodes, TestDFSIO with 4 maps in total; right: cluster with 4 SSD nodes, TestDFSIO with 4 maps in total). For 5GB: 53.4% improvement over IPoIB and 126.9% improvement over 1GigE with HDD used for HDFS; 57.8% improvement over IPoIB and 138.1% improvement over 1GigE with SSD used for HDFS. For 20GB: 1.7% improvement over IPoIB and 46.8% improvement over 1GigE with HDD used for HDFS; 73.3% improvement over IPoIB and 141.3% improvement over 1GigE with SSD used for HDFS.
Performance Benefits: Sort: (figure: job execution time in seconds for sort sizes of 48, 64, and 80 GB with 1GigE, IPoIB (QDR), and OSU-IB (QDR); left: cluster with 8 HDD nodes, right: cluster with 8 SSD nodes; Sort with 8 maps and 8 reduces per host). For an 80GB sort in a cluster of size 8: 40.7% improvement over IPoIB and 32.2% improvement over 1GigE with HDD used for HDFS; 43.6% improvement over IPoIB and 42.9% improvement over 1GigE with SSD used for HDFS.
Performance Benefits: TeraSort: (figure: job execution time in seconds for sizes of 80, 90, and 100 GB with 1GigE, IPoIB (QDR), and OSU-IB (QDR); left: cluster with 8 HDD nodes, right: cluster with 8 SSD nodes; TeraSort with 8 maps and 8 reduces per host). For a 100GB TeraSort in a cluster of size 8: 45% improvement over IPoIB and 39% improvement over 1GigE with HDD used for HDFS; 46% improvement over IPoIB and 40% improvement over 1GigE with SSD used for HDFS.
Performance Benefits: RandomWriter and Sort on SDSC Gordon: (figure: job execution time in seconds for IPoIB (QDR) and OSU-IB (QDR) with data sizes of 50, 100, 200, and 300 GB on clusters of 16, 32, 48, and 64 nodes; left: RandomWriter, right: Sort; single SSD per node). RandomWriter: 20% improvement over IPoIB (QDR) for 50GB on a cluster of 16, and 36% improvement over IPoIB (QDR) for 300GB on a cluster of 64. Sort: 16% improvement over IPoIB (QDR) for 50GB on a cluster of 16, and 20% improvement over IPoIB (QDR) for 300GB on a cluster of 64.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Memcached-RDMA Design: (diagram: sockets clients and RDMA clients connect to a master thread, which hands them off to sockets worker threads or verbs worker threads; all worker threads operate on shared data: memory slabs and items). The server and client perform a negotiation protocol, and the master thread assigns each client to the appropriate worker thread. Once a client is assigned a verbs worker thread, it can communicate with it directly and is bound to that thread. All other Memcached data structures are shared among the RDMA and sockets worker threads.
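The dispatch logic described above is essentially a transport-negotiating accept loop in front of a shared item store. The sketch below is a hypothetical, heavily simplified rendering of that pattern, written in Java only for consistency with the other examples; the actual Memcached server and the OSU design are written in C, and every name here is invented.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of the master/worker dispatch described on this slide. */
public class TransportDispatcher {
  // Shared item store, visible to both the sockets workers and the verbs workers.
  private final ConcurrentHashMap<String, byte[]> sharedItems = new ConcurrentHashMap<>();

  private final ExecutorService socketsWorkers = Executors.newFixedThreadPool(2);
  private final ExecutorService verbsWorkers = Executors.newFixedThreadPool(2);

  public void run(int port) throws Exception {
    try (ServerSocket master = new ServerSocket(port)) {
      while (true) {
        Socket client = master.accept();
        // Negotiation: the first byte announces which transport the client wants.
        int transport = client.getInputStream().read();
        if (transport == 'R') {
          // Hand the client to a verbs worker; from this point on it would talk
          // to that worker directly over RDMA (not modeled in this sketch).
          verbsWorkers.submit(() -> serve(client, "verbs"));
        } else {
          socketsWorkers.submit(() -> serve(client, "sockets"));
        }
      }
    }
  }

  private void serve(Socket client, String transport) {
    // Both worker types operate on the same shared data structures.
    sharedItems.putIfAbsent("last-transport", transport.getBytes());
    // ... per-client get/set handling would go here ...
  }

  public static void main(String[] args) throws Exception {
    new TransportDispatcher().run(11211); // Memcached's default port, used illustratively
  }
}
```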
Memcached Get Latency (Small Messages): (figure: latency in microseconds versus message size from 1 byte to 2 KB for IPoIB, 10GigE, OSU-RC-IB, and OSU-UD-IB; left: Intel Clovertown cluster (IB: DDR), right: Intel Westmere cluster (IB: QDR)). Memcached Get latency for 4 bytes: RC/UD on DDR: 6.82/7.55 us; on QDR: 4.28/4.86 us. For 2K bytes: RC/UD on DDR: 12.31/12.78 us; on QDR: 8.19/8.46 us. Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster.
Memcached Get TPS (4 bytes): (figure: thousands of transactions per second versus number of clients for IPoIB (QDR), OSU-RC-IB (QDR), and OSU-UD-IB (QDR)). Memcached Get transactions per second for 4 bytes: on IB QDR, 1.4M/s (RC) and 1.3M/s (UD) for 8 clients; significant improvement with native IB QDR compared to SDP and IPoIB.
Memcached Performance (FDR Interconnect): (figure: Memcached Get latency in microseconds versus message size, and throughput in thousands of transactions per second versus number of clients, for OSU-IB (FDR) and IPoIB (FDR)). Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR). Memcached Get latency: 4 bytes: OSU-IB: 2.84 us, IPoIB: 75.53 us; 2K bytes: OSU-IB: 4.49 us, IPoIB: 123.42 us. Memcached throughput (4 bytes): 4080 clients: OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec; nearly 2X improvement in throughput.
Application-Level Evaluation: Olio Benchmark: (figure: time in milliseconds versus number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR); left: 1-8 clients, right: 64-1024 clients). Olio benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients; 4X better than IPoIB for 8 clients; the hybrid design achieves performance comparable to that of the pure RC design.
Application-Level Evaluation: Real Application Workloads: (figure: time in milliseconds versus number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR); left: 1-8 clients, right: 64-1024 clients). Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients; 12X better than IPoIB for 8 clients; the hybrid design achieves performance comparable to that of the pure RC design. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11. J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid '12.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges. (stack diagram, as before: applications and benchmarks; Big Data middleware (HDFS, MapReduce, HBase and Memcached) with the question of upper-level changes; programming models (sockets, and other RDMA protocols?); the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance); networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
More Challenges. Multi-threading and synchronization: multi-threaded model exploration; fine-grained synchronization and lock-free design; unified helper threads for different components; multi-endpoint design to support multi-threaded communication. QoS and virtualization: network virtualization and locality-aware communication for Big Data middleware; hardware-level virtualization support for end-to-end QoS; I/O scheduling and storage virtualization; live migration.
More Challenges (Cont'd). Support for accelerators: efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs; offload computation-intensive workloads to accelerators; explore maximum overlap between communication and offloaded computation. Fault tolerance enhancements: novel data replication schemes for high-efficiency fault tolerance; exploration of light-weight fault tolerance mechanisms for Big Data processing. Support for parallel file systems: optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters.
Performance Improvement Potential of MapReduce over Lustre on HPC Clusters: An Initial Study. (figure: job execution time in seconds; left: Sort on TACC Stampede with IPoIB (FDR) and OSU-IB (FDR) for data sizes of 300-500 GB; right: Sort on SDSC Gordon with IPoIB (QDR) and OSU-IB (QDR) for data sizes of 60-100 GB). For a 500GB sort on 64 nodes: 44% improvement over IPoIB (FDR). For a 100GB sort on 16 nodes: 33% improvement over IPoIB (QDR). Can more optimizations be achieved by leveraging more features of Lustre?
Are the Current Benchmarks Sufficient for Big Data? The current benchmarks provide some performance behavior. However, they do not provide any information to the designer/developer on: what is happening at the lower layer; where the benefits are coming from; which design is leading to benefits or bottlenecks; which component in the design needs to be changed and what its impact will be; and whether performance gain/loss at the lower layer can be correlated to the performance gain/loss observed at the upper layer.
Challenges in Benchmarking of RDMA-based Designs. (stack diagram: the current benchmarks sit at the applications / Big Data middleware level (HDFS, MapReduce, HBase and Memcached) over programming models (sockets, and other RDMA protocols?), with an open question of how they correlate to lower layers; there are no benchmarks at the level of the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance) over networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
OSU MPI Micro-Benchmarks (OMB) Suite: a comprehensive suite of benchmarks to compare the performance of different MPI libraries on various networks and systems, validate low-level functionalities, and provide insights into the underlying MPI-level designs. Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth, and bi-directional bandwidth; extended later to MPI-2 one-sided, collectives, GPU-aware data movement, OpenSHMEM (point-to-point and collectives), and UPC. Has become an industry standard, extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems. Available from http://mvapich.cse.ohio-state.edu/benchmarks and in an integrated manner with the MVAPICH2 stack.
Iterative Process: Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications. (stack diagram: applications-level benchmarks target the applications and Big Data middleware (HDFS, MapReduce, HBase and Memcached) over programming models (sockets, and other RDMA protocols?), while micro-benchmarks target the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance) over networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
Future Plans of the OSU Big Data Project: upcoming RDMA for Apache Hadoop releases will support HDFS parallel replication, advanced MapReduce, HBase, Hadoop 2.0.x, and Memcached-RDMA; exploration of other components (threading models, QoS, virtualization, accelerators, etc.); advanced designs with upper-level changes and optimizations; a comprehensive set of benchmarks. More details at http://hadoop-rdma.cse.ohio-state.edu
Concluding Remarks: presented an overview of Big Data, Hadoop and Memcached; provided an overview of networking technologies; discussed challenges in accelerating Hadoop; presented initial designs that take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase and Memcached. Results are promising, and many other open issues still need to be solved. This work will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner.
Personnel Acknowledgments.
Current Students: S. Chakraborthy (Ph.D.), M. Rahman (Ph.D.), N. Islam (Ph.D.), D. Shankar (Ph.D.), J. Jose (Ph.D.), R. Shir (Ph.D.), M. Li (Ph.D.), A. Venkatesh (Ph.D.), S. Potluri (Ph.D.), J. Zhang (Ph.D.), R. Rajachandrasekhar (Ph.D.)
Past Students: P. Balaji (Ph.D.), W. Huang (Ph.D.), D. Buntinas (Ph.D.), W. Jiang (M.S.), S. Bhagvat (M.S.), S. Kini (M.S.), L. Chai (Ph.D.), M. Koop (Ph.D.), B. Chandrasekharan (M.S.), R. Kumar (M.S.), N. Dandapanthula (M.S.), S. Krishnamoorthy (M.S.), V. Dhanraj (M.S.), K. Kandalla (Ph.D.), T. Gangadharappa (M.S.), P. Lai (M.S.), K. Gopalakrishnan (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), G. Santhanaraman (Ph.D.), A. Mamidala (Ph.D.), A. Singh (Ph.D.), G. Marsh (M.S.), J. Sridhar (M.S.), V. Meshram (M.S.), S. Sur (Ph.D.), S. Naravula (Ph.D.), H. Subramoni (Ph.D.), R. Noronha (Ph.D.), K. Vaidyanathan (Ph.D.), X. Ouyang (Ph.D.), A. Vishnu (Ph.D.), S. Pai (M.S.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Senior Research Associate: H. Subramoni
Current Post-Docs: X. Lu, K. Hamidouche
Current Programmers: M. Arnold, J. Perkins
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, E. Mancini, S. Marcarelli, J. Vienne
Past Research Scientist: S. Sur
Past Programmers: D. Bureddy, M. Luo
Pointers: http://nowlab.cse.ohio-state.edu, http://mvapich.cse.ohio-state.edu, http://hadoop-rdma.cse.ohio-state.edu, panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda