Big Data: Hadoop and Memcached

Transcription

1 Big Data: Hadoop and Memcached. Talk at HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014) by Dhabaleswar K. (DK) Panda, The Ohio State University

2 Introduction to Big Data Applications and Analytics. Big Data has become one of the most important elements of business analytics. It provides groundbreaking opportunities for enterprise information management and decision making. The amount of data is exploding; companies are capturing and digitizing more information than ever. The rate of information growth appears to be exceeding Moore's Law. Commonly accepted 3V's of Big Data: Volume, Velocity, Variety (Michael Stonebraker: Big Data Means at Least Three Different Things). 5V's of Big Data: 3V + Value, Veracity.

3 Not Only in Internet Services: Big Data in Scientific Domains. Scientific data management, analysis, and visualization. Application examples: climate modeling, combustion, fusion, astrophysics, bioinformatics. Data-intensive tasks: run large-scale simulations on supercomputers; dump data on parallel storage systems; collect experimental/observational data; move experimental/observational data to analysis sites; use visual analytics to help understand data visually.

4 Typical Solutions or Architectures for Big Data Applications and Analytics. Hadoop: the most popular framework for Big Data analytics (HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.). Spark: provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly. Shark: a fully Apache Hive-compatible data warehousing system. Storm: a distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc. S4: a distributed system for processing continuous unbounded streams of data. Web 2.0: RDBMS + Memcached. Memcached: a high-performance, distributed memory object caching system.

5 Who Is Using Hadoop? Focuses on large data and data analysis. The Hadoop environment (e.g. HDFS, MapReduce, HBase, RPC) is gaining a lot of momentum.

6 Hadoop at Yahoo! Scale: 42,000 nodes, 365PB HDFS, 1M daily slot hours. A new Jim Gray Sort Record: 1.42 TB/min over 2,100 nodes, using Hadoop 0.23.7, an early branch of Hadoop 2.X.

7 Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase, RPC, combination of components) and Memcached; Open Challenges; Conclusion and Q&A

8 Big Data Processing with Hadoop. The open-source implementation of the MapReduce programming model for Big Data analytics. Major components: HDFS and MapReduce. The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase. The model scales, but the high amount of communication during intermediate phases can be further optimized. [Figure: Hadoop framework stack: User Applications; MapReduce and HBase; HDFS; Hadoop Common (RPC)]

9 Hadoop MapReduce Architecture & Operations. Disk operations: Map and Reduce tasks carry out the total job execution. Communication in the shuffle phase uses HTTP over Java sockets.
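To make the map, shuffle, and reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names, the word-count job, and the simple partitioner are illustrative assumptions, and the shuffle here is an in-memory regrouping rather than HTTP transfers between nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map task: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(map_outputs, num_reducers):
    """Shuffle: route every key to one reducer partition and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            # Hypothetical partitioner (sum of code points), chosen only so
            # the sketch gives the same partitioning on every run.
            index = sum(ord(ch) for ch in key) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reduce task: sum the grouped values of each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(records, num_reducers=2):
    """Run map, shuffle, and reduce in sequence and combine reducer outputs."""
    map_outputs = [map_phase(record) for record in records]
    partitions = shuffle_phase(map_outputs, num_reducers)
    result = {}
    for partition in partitions:
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    print(run_job(["big data hadoop", "hadoop and memcached", "big clusters"]))
```

In a real cluster the shuffle step is where map output crosses the network, which is why that phase is the natural target for a faster transport.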

10 Hadoop Distributed File System (HDFS). Primary storage of Hadoop; highly reliable and fault-tolerant. Adopted by many reputed organizations, e.g. Facebook and Yahoo!. NameNode: stores the file system namespace. DataNode: stores data blocks. Developed in Java for platform independence and portability. Uses sockets for communication. [Figure: HDFS architecture, with the client connected to the nodes via RPC]
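The split between NameNode metadata and DataNode blocks can be sketched as below. This is a toy model under stated assumptions: the class and method names are hypothetical, and round-robin replica placement is a simplification chosen for the sketch, not the placement policy HDFS uses.

```python
class NameNode:
    """Holds only metadata: which blocks form a file and where replicas live."""

    def __init__(self, datanodes, block_size, replication):
        if replication > len(datanodes):
            raise ValueError("replication factor exceeds number of DataNodes")
        self.datanodes = list(datanodes)
        self.block_size = block_size
        self.replication = replication
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> DataNodes holding a replica
        self._next = 0             # round-robin cursor (assumed policy)

    def create_file(self, path, size):
        """Split a file of `size` bytes into blocks and place the replicas."""
        num_blocks = max(1, -(-size // self.block_size))  # ceiling division
        block_ids = []
        for index in range(num_blocks):
            block_id = f"{path}#blk_{index}"
            replicas = [
                self.datanodes[(self._next + offset) % len(self.datanodes)]
                for offset in range(self.replication)
            ]
            self._next = (self._next + 1) % len(self.datanodes)
            self.block_locations[block_id] = replicas
            block_ids.append(block_id)
        self.namespace[path] = block_ids
        return block_ids


if __name__ == "__main__":
    namenode = NameNode(["dn1", "dn2", "dn3", "dn4"], block_size=64, replication=3)
    for block in namenode.create_file("/data/file", size=200):
        print(block, namenode.block_locations[block])
```

The point of the sketch is that a client asks the NameNode only for locations; the block data itself moves between the client and the DataNodes.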

11 System-Level Interaction Between Clients and Data Nodes in HDFS. [Figure: HDFS clients and HDFS DataNodes, each with HDD/SSD storage, connected through high-performance networks]

12 HBase. Apache Hadoop database: a semi-structured database that is highly scalable. Integral part of many datacenter applications, e.g. Facebook Social Inbox. Developed in Java for platform independence and portability. Uses sockets for communication. [Figure: HBase architecture: HBase Client, ZooKeeper, HMaster, HRegion Servers, HDFS]

13 System-Level Interaction Among HBase Clients, Region Servers, and Data Nodes. [Figure: HBase clients, HRegion servers, and DataNodes, with HDD/SSD storage, connected through high-performance networks]

14 Memcached Architecture. Integral part of Web 2.0 architecture. Distributed caching layer: allows spare memory from multiple nodes to be aggregated; general purpose. Typically used to cache database queries and results of API calls. Scalable model, but typical usage is very network intensive. [Figure: web frontend servers (Memcached clients), Memcached servers, and database servers, each with CPUs, main memory, SSD, and HDD, connected to the Internet and to each other through high-performance networks]
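The caching role described on this slide is commonly implemented as a cache-aside lookup spread over several servers. The sketch below uses plain dictionaries in place of Memcached servers; the class name, the CRC32-modulo server selection, and the `query_database` callback are assumptions for illustration, not the Memcached client API.

```python
import zlib


class CacheAsideClient:
    """Cache-aside lookup over several cache servers (dicts stand in for them)."""

    def __init__(self, num_servers, query_database):
        self.servers = [{} for _ in range(num_servers)]
        self.query_database = query_database  # hypothetical slow backend call
        self.hits = 0
        self.misses = 0

    def _server_for(self, key):
        # Assumed placement rule: CRC32 of the key modulo the server count.
        index = zlib.crc32(key.encode("utf-8")) % len(self.servers)
        return self.servers[index]

    def get(self, key):
        server = self._server_for(key)
        if key in server:
            self.hits += 1
            return server[key]
        # Cache miss: fall back to the database, then populate the cache.
        self.misses += 1
        value = self.query_database(key)
        server[key] = value
        return value


if __name__ == "__main__":
    client = CacheAsideClient(3, query_database=lambda key: key.upper())
    client.get("user:1")
    client.get("user:1")
    print(client.hits, client.misses)
```

Every `get` is a network round trip to a cache server in a real deployment, which is why the slide calls typical usage very network intensive.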

16 Trends for Commodity Computing Clusters in the Top500 List. [Figure: number of clusters and percentage of clusters over time]

17 HPC and Advanced Interconnects. High-Performance Computing (HPC) has adopted advanced interconnects and protocols: InfiniBand, 10 Gigabit Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE). Very good performance: low latency (a few microseconds), high bandwidth (100 Gb/s with dual FDR InfiniBand), and low CPU overhead (5-10%). The OpenFabrics software stack with IB, iWARP, and RoCE interfaces is driving HPC systems. Many such systems are in the Top500 list.

18 Trends of Networking Technologies in TOP500 Systems. The percentage share of InfiniBand is steadily increasing. [Figure: interconnect family systems share]

19 Use of High-Performance Networks for Scientific Computing. Message Passing Interface (MPI): the most commonly used programming model for HPC applications. Parallel file systems. Storage systems. Almost 13 years of research and development since InfiniBand was introduced in October 2000. Other programming models are emerging to take advantage of high-performance networks: UPC, OpenSHMEM, and hybrid MPI+PGAS (OpenSHMEM and UPC).

20 One-way Latency: MPI over IB with MVAPICH. [Figure: small-message and large-message latency (us) vs. message size (bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR] Platforms: DDR and QDR on quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR on octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR on octa-core (SandyBridge) and deca-core (IvyBridge) Intel, PCI Gen3, with IB switch.

21 Bandwidth: MPI over IB with MVAPICH. [Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR] Platforms: DDR and QDR on quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR on octa-core (SandyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR on octa-core (SandyBridge) and deca-core (IvyBridge) Intel, PCI Gen3, with IB switch.

23 Can High-Performance Interconnects Benefit Hadoop? Most current Hadoop environments use Ethernet infrastructure with sockets, which raises concerns about performance and scalability. Usage of high-performance networks is beginning to draw interest from many companies. What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? Can HPC clusters with high-performance networks be used for Big Data applications using Hadoop?

24 Designing Communication and I/O Libraries for Big Data Systems: Challenges. [Figure: layered stack, top to bottom: Applications and Benchmarks; Big Data Middleware (HDFS, MapReduce, HBase and Memcached); Programming Models (Sockets; other protocols?); Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault tolerance); Networking Technologies (InfiniBand, 1/10/40 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD and SSD)]

25 Common Protocols using OpenFabrics. [Figure: protocol stacks from the application/middleware interface (Sockets or Verbs) through the protocol layer (kernel-space TCP/IP, hardware offload, or user-space RDMA) down to adapter and switch] Adapter and switch per stack: 1/10/40 GigE: Ethernet adapter, Ethernet switch. IPoIB: InfiniBand adapter, InfiniBand switch. 10/40 GigE-TOE: Ethernet adapter, Ethernet switch. RSockets: InfiniBand adapter, InfiniBand switch. SDP: InfiniBand adapter, InfiniBand switch. iWARP: iWARP adapter, Ethernet switch. RoCE: RoCE adapter, Ethernet switch. IB Verbs: InfiniBand adapter, InfiniBand switch.

26 Can Hadoop be Accelerated with High-Performance Networks and Protocols? Current design: application over sockets over a 1/10 GigE network. Enhanced designs: application over accelerated sockets (verbs / hardware offload) over 10 GigE or InfiniBand. Our approach: application over the OSU design and the verbs interface over 10 GigE or InfiniBand. Sockets are not designed for high performance: stream semantics often mismatch the upper layers (Hadoop and HBase), and zero-copy is not available for non-blocking sockets.

28 RDMA for Apache Hadoop Project. High-performance design of Hadoop over RDMA-enabled interconnects. High-performance design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components. Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB). Current release: 0.9.8. Based on Apache Hadoop. Compliant with Apache Hadoop APIs and applications. Tested with: Mellanox InfiniBand adapters (DDR, QDR and FDR); RoCE; various multi-core platforms; different file systems with disks and SSDs.

29 Design Overview of HDFS with RDMA. Enables high-performance RDMA communication while supporting the traditional socket interface. Design features: RDMA-based HDFS write; RDMA-based HDFS replication; InfiniBand/RoCE support. A JNI layer bridges Java-based HDFS with a communication library written in native code. Only the communication part of HDFS write is modified; there is no change in the HDFS architecture. [Figure: applications over HDFS; HDFS write goes through the Java Native Interface (JNI) to the OSU design over verbs on RDMA-capable networks (IB, 10GE/iWARP, RoCE ...), while other operations use the Java socket interface over the 1/10 GigE or IPoIB network]

30 Communication Time in HDFS. [Figure: communication time (s) vs. file size (GB) for GigE, IPoIB (QDR), and OSU-IB (QDR); annotation: reduced by 3%] Cluster with HDD DataNodes: 3% improvement in communication time over IPoIB (QDR); 56% improvement in communication time over 1GigE. Similar improvements are obtained for SSD DataNodes. Reference: N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.

31 Evaluations using TestDFSIO. [Figure: total throughput (MBps) vs. file size (GB); left panel: cluster with 8 HDD nodes, TestDFSIO with 8 maps, for GigE, IPoIB (QDR), and OSU-IB (QDR), annotated "increased by 24%"; right panel: cluster with 4 SSD nodes, TestDFSIO with 4 maps, for IPoIB (QDR) and OSU-IB (QDR), annotated "increased by 61%"] Cluster with 8 HDD DataNodes, single disk per node: 24% improvement over IPoIB (QDR) for 2GB file size. Cluster with 4 SSD DataNodes, single SSD per node: 61% improvement over IPoIB (QDR) for 2GB file size.

32 Evaluations using TestDFSIO. [Figure: total throughput (MBps) vs. file size (GB) for 1GigE, IPoIB (QDR), and OSU-IB (QDR), each with 1 HDD and with 2 HDDs per node; annotation: increased by 31%] Cluster with 4 DataNodes, 1 HDD per node: 24% improvement over IPoIB (QDR) for 2GB file size. Cluster with 4 DataNodes, 2 HDDs per node: 31% improvement over IPoIB (QDR) for 2GB file size. 2 HDDs vs. 1 HDD: 76% improvement for OSU-IB (QDR); 66% improvement for IPoIB (QDR).

33 Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon. [Figure: aggregated throughput (MBps) and execution time (s) vs. file size (GB) for IPoIB (QDR) and OSU-IB (QDR); annotations: increased by 47%, reduced by 16%] Cluster with 32 DataNodes, single SSD per node: 47% improvement in throughput over IPoIB (QDR) for 128GB file size; 16% improvement in latency over IPoIB (QDR) for 128GB file size.

34 Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede. [Figure: aggregated throughput (MBps) and execution time (s) vs. file size (GB) for IPoIB (FDR) and OSU-IB (FDR); annotations: increased by 64%, reduced by 37%] Cluster with 64 DataNodes, single HDD per node: 64% improvement in throughput over IPoIB (FDR) for 256GB file size; 37% improvement in latency over IPoIB (FDR) for 256GB file size.

35 Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies: Hadoop (HDFS, MapReduce, HBase, RPC, combination of components) and Memcached; Open Challenges; Conclusion and Q&A

36 Design Overview of MapReduce with RDMA. Enables high-performance RDMA communication while supporting the traditional socket interface. Design features for RDMA: prefetch/caching of MOF; in-memory merge; overlap of merge and reduce; InfiniBand/RoCE support. A JNI layer bridges Java-based MapReduce with a communication library written in native code. [Figure: applications over MapReduce (Job Tracker; Task Tracker with Map and Reduce); one path uses the Java socket interface over the 1/10 GigE or IPoIB network, the other goes through the Java Native Interface (JNI) to the OSU design over verbs on RDMA-capable networks (IB, 10GE/iWARP, RoCE ...)]
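The idea of merging in memory while the reduce step is already consuming can be illustrated with a lazy k-way merge. This is only a conceptual sketch in Python, not the OSU design: it assumes each map output arrives as an already sorted list of (key, value) pairs held in memory, and the function names are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_map_outputs(sorted_runs):
    """Lazily k-way merge sorted (key, value) runs that are held in memory."""
    return heapq.merge(*sorted_runs, key=itemgetter(0))


def reduce_stream(merged):
    """Consume the merged stream key by key, summing the values of each key.

    Because the merge is a generator, reduction of the first keys starts
    before the later keys have been merged, so the two steps overlap.
    """
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


if __name__ == "__main__":
    runs = [
        [("a", 1), ("c", 2)],
        [("a", 3), ("b", 1)],
        [("b", 4), ("d", 1)],
    ]
    for key, total in reduce_stream(merge_map_outputs(runs)):
        print(key, total)
```

Keeping the runs in memory avoids writing and re-reading intermediate merge files, which is the disk traffic the listed design features aim to cut.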

37 Evaluations using Sort. [Figure: job execution time (sec) vs. data size (GB) for GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR); panels: 8 HDD DataNodes and 8 SSD DataNodes] With 8 HDD DataNodes for 4GB sort: 41% improvement over IPoIB (QDR); 39% improvement over Hadoop-A-IB (QDR). With 8 SSD DataNodes for 4GB sort: 52% improvement over IPoIB (QDR); 44% improvement over Hadoop-A-IB (QDR). Reference: M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May

38 Evaluations using TeraSort. [Figure: job execution time (sec) vs. data size (GB) for GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR)] 1 GB TeraSort with 8 DataNodes with 2 HDDs per node: 51% improvement over IPoIB (QDR); 41% improvement over Hadoop-A-IB (QDR).

39 Performance Evaluation on Larger Clusters. [Figure: job execution time (sec) by data size and cluster size; left panel: Sort in OSU cluster for IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR) at 6 GB/16 nodes, 12 GB/32 nodes, and 24 GB/64 nodes; right panel: TeraSort in TACC Stampede for IPoIB (FDR), Hadoop-A-IB (FDR), and OSU-IB (FDR) at 8 GB/16 nodes, 16 GB/32 nodes, and 32 GB/64 nodes] For 24GB Sort in 64 nodes: 4% improvement over IPoIB (QDR) with HDD used for HDFS. For 32GB TeraSort in 64 nodes: 38% improvement over IPoIB (FDR) with HDD used for HDFS.

40 Evaluations using PUMA Workload. [Figure: normalized execution time for GigE, IPoIB (QDR), and OSU-IB (QDR) on the benchmarks AdjList (3GB), SelfJoin (8GB), SeqCount (3GB), WordCount (3GB), and InvertIndex (3GB)] 5% improvement in Self Join over IPoIB (QDR) for 8 GB data size. 49% improvement in Sequence Count over IPoIB (QDR) for 3 GB data size.

42 Motivation: Detailed Analysis on HBase Put/Get. [Figure: time (us) breakdown of HBase Put 1KB and HBase Get 1KB on IPoIB (DDR), 1GigE, and OSU-IB (DDR) into communication, communication preparation, server processing, server serialization, client processing, and client serialization] HBase 1KB Put communication time: 8.9 us, a factor of 6X improvement over 1GE. HBase 1KB Get communication time: 8.9 us, a factor of 6X improvement over 1GE. Reference: M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12.

43 HBase Single-Server-Multi-Client Results. [Figure: latency (us) and throughput (ops/sec) vs. number of clients for IPoIB (DDR), OSU-IB (DDR), and 1GigE] HBase Get latency: 4 clients: 14.5 us; 16 clients: us. HBase Get throughput: clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec. 27% improvement in throughput for 16 clients over 1GE. Reference: J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12.

44 HBase YCSB Read-Write Workload. [Figure: read latency and write latency (us) vs. number of clients for IPoIB (QDR), 1GigE, and OSU-IB (QDR)] HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients: 2. ms; 128 clients: 3.5 ms; 42% improvement over IPoIB for 128 clients. HBase Get latency: 64 clients: 1.9 ms; 128 clients: 3.5 ms; 4% improvement over IPoIB for 128 clients.

46 Design Overview of Hadoop RPC with RDMA. Enables high-performance RDMA communication while supporting the traditional socket interface. Design features: JVM-bypassed buffer management; on-demand connection setup; RDMA or send/recv based adaptive communication; InfiniBand/RoCE support. [Figure: applications over Hadoop RPC; the default path uses the Java socket interface over the 1/10 GigE or IPoIB network; our design goes through the Java Native Interface (JNI) to the OSU design over verbs on RDMA-capable networks (IB, 10GE/iWARP, RoCE ...)]

47 Hadoop RPC over IB: Gain in Latency and Throughput. [Figure: latency (us) vs. payload size (bytes) and throughput (Kops/sec) vs. number of clients for GigE, IPoIB (QDR), and OSU-IB (QDR)] Hadoop RPC over IB ping-pong latency: 1 byte: 39 us; 4 KB: 52 us; 42%-49% and 46%-5% improvements compared with the performance on 1 GigE and IPoIB (32Gbps), respectively. Hadoop RPC over IB throughput: 512 bytes & 48 clients: Kops/sec; 82% and 64% improvements compared with the peak performance on 1 GigE and IPoIB (32Gbps), respectively.

49 Performance Benefits: TestDFSIO. [Figure: total throughput (MBps) vs. size (GB) for 1GigE, IPoIB (QDR), and OSU-IB (QDR); left panel: cluster with 4 HDD nodes, TestDFSIO with 4 maps in total; right panel: cluster with 4 SSD nodes, TestDFSIO with 4 maps in total] For 5GB: 53.4% improvement over IPoIB and 126.9% improvement over 1GigE with HDD used for HDFS; 57.8% improvement over IPoIB and 138.1% improvement over 1GigE with SSD used for HDFS. For 2GB: 1.7% improvement over IPoIB and 46.8% improvement over 1GigE with HDD used for HDFS; 73.3% improvement over IPoIB and 141.3% improvement over 1GigE with SSD used for HDFS.

50 Performance Benefits: Sort. [Figure: job execution time (sec) vs. sort size (GB) for GigE, IPoIB (QDR), and OSU-IB (QDR); left panel: cluster with 8 HDD nodes, Sort with <8 maps, 8 reduces>/host; right panel: cluster with 8 SSD nodes, Sort with <8 maps, 8 reduces>/host] For 8GB Sort in a cluster size of 8: 4.7% improvement over IPoIB and 32.2% improvement over 1GigE with HDD used for HDFS; 43.6% improvement over IPoIB and 42.9% improvement over 1GigE with SSD used for HDFS.

51 Performance Benefits: TeraSort. [Figure: job execution time (sec) vs. size (GB) for 1GigE, IPoIB (QDR), and OSU-IB (QDR); left panel: cluster with 8 HDD nodes, TeraSort with <8 maps, 8 reduces>/host; right panel: cluster with 8 SSD nodes, TeraSort with <8 maps, 8 reduces>/host] For 1GB TeraSort in a cluster size of 8: 45% improvement over IPoIB and 39% improvement over 1GigE with HDD used for HDFS; 46% improvement over IPoIB and 4% improvement over 1GigE with SSD used for HDFS.

52 Performance Benefits: RandomWriter & Sort in SDSC-Gordon. [Figure: job execution time (sec) by data size and cluster size (16, 32, 48, and 64 nodes) for IPoIB (QDR) and OSU-IB (QDR); annotations: reduced by 2%, reduced by 36%] RandomWriter in a cluster of single SSD per node: 2% improvement over IPoIB (QDR) for 5GB in a cluster of 16; 36% improvement over IPoIB (QDR) for 3GB in a cluster of 64. Sort in a cluster of single SSD per node: 16% improvement over IPoIB (QDR) for 5GB in a cluster of 16; 2% improvement over IPoIB (QDR) for 3GB in a cluster of 64.

54 Memcached-RDMA Design. Server and client perform a negotiation protocol. The master thread assigns clients to the appropriate worker thread. Once a client is assigned a verbs worker thread, it can communicate directly and is bound to that thread. All other Memcached data structures are shared among RDMA and sockets worker threads. [Figure: sockets clients and RDMA clients first contact the master thread (step 1) and then talk to a sockets worker thread or a verbs worker thread (step 2); all worker threads share the data (memory slabs, items)]
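The assignment step described here can be modeled in a few lines of Python. This is a sketch of the control flow only: the class names, the "least loaded worker" rule, and the string transport labels are assumptions made for illustration and are not taken from the Memcached-RDMA code.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerThread:
    name: str
    transport: str                       # "sockets" or "verbs"
    clients: list = field(default_factory=list)


class MasterThread:
    """Assigns each new client to a worker of the matching transport type."""

    def __init__(self, workers):
        self.workers = workers
        self.bindings = {}               # client id -> bound worker
        self.shared_items = {}           # data shared by all worker threads

    def negotiate(self, client_id, transport):
        # A client that is already bound keeps talking to the same worker.
        if client_id in self.bindings:
            return self.bindings[client_id]
        candidates = [w for w in self.workers if w.transport == transport]
        if not candidates:
            raise ValueError(f"no worker thread for transport {transport!r}")
        # Assumed policy: pick the matching worker with the fewest clients.
        worker = min(candidates, key=lambda w: len(w.clients))
        worker.clients.append(client_id)
        self.bindings[client_id] = worker
        return worker


if __name__ == "__main__":
    master = MasterThread([
        WorkerThread("sockets-0", "sockets"),
        WorkerThread("verbs-0", "verbs"),
        WorkerThread("verbs-1", "verbs"),
    ])
    for client, transport in [("c1", "verbs"), ("c2", "verbs"), ("c3", "sockets")]:
        print(client, "->", master.negotiate(client, transport).name)
```

After the binding is made, the master thread is out of the data path, which matches the slide's point that a verbs client communicates directly with its worker.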

55 Memcached Get Latency (Small Message). [Figure: time (us) vs. message size up to 2K bytes; left panel: Intel Clovertown cluster (IB: DDR) with IPoIB (DDR), 1GigE (DDR), OSU-RC-IB (DDR), and OSU-UD-IB (DDR); right panel: Intel Westmere cluster (IB: QDR) with IPoIB (QDR), 1GigE (QDR), OSU-RC-IB (QDR), and OSU-UD-IB (QDR)] Memcached Get latency: 4 bytes RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us. 2K bytes RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us. Almost a factor of four improvement over 1GE (TOE) for 2K bytes on the DDR cluster.

56 Memcached Get TPS (4 bytes). [Figure: thousands of transactions per second (TPS) vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), and OSU-UD-IB (QDR)] Memcached Get transactions per second for 4 bytes on IB QDR: 1.4M/s (RC), 1.3M/s (UD) for 8 clients. Significant improvement with native IB QDR compared to SDP and IPoIB.

57 Memcached Performance (FDR Interconnect). Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR). [Figure: Memcached GET latency (us) vs. message size and Memcached throughput (thousands of transactions per second) vs. number of clients for OSU-IB (FDR) and IPoIB (FDR); annotations: latency reduced by nearly 2X, throughput 2X] Memcached Get latency: 4 bytes: OSU-IB 2.84 us, IPoIB us; 2K bytes: OSU-IB 4.49 us, IPoIB us. Memcached throughput (4 bytes), 48 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec. Nearly 2X improvement in throughput.

58 Application Level Evaluation: Olio Benchmark. [Figure: time (ms) vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR)] Olio Benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 124 clients. 4X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to that of the pure RC design.

59 Application Level Evaluation: Real Application Workloads. [Figure: time (ms) vs. number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR)] Real application workload: RC 32 ms, UD 318 ms, Hybrid 314 ms for 124 clients. 12X better than IPoIB for 8 clients. The hybrid design achieves performance comparable to that of the pure RC design. References: J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11. J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid '12.

61 Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges. [Figure: layered stack, top to bottom: Applications and Benchmarks; Big Data Middleware (HDFS, MapReduce, HBase and Memcached; upper-level changes?); Programming Models (Sockets; other RDMA protocols?); Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault tolerance); Networking Technologies (InfiniBand, 1/10/40 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD and SSD)]

62 More Challenges. Multi-threading and synchronization: multi-threaded model exploration; fine-grained synchronization and lock-free design; unified helper threads for different components; multi-endpoint design to support multi-threaded communication. QoS and virtualization: network virtualization and locality-aware communication for Big Data middleware; hardware-level virtualization support for end-to-end QoS; I/O scheduling and storage virtualization; live migration.

63 More Challenges (Cont'd). Support of accelerators: efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs; offload computation-intensive workloads to accelerators; explore maximum overlap between communication and offloaded computation. Fault tolerance enhancements: novel data replication schemes for high-efficiency fault tolerance; exploration of light-weight fault tolerance mechanisms for Big Data processing. Support of parallel file systems: optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters.

64 Performance Improvement Potential of MapReduce over Lustre on HPC Clusters: An Initial Study. [Figure: job execution time (sec) vs. data size (GB); left panel: Sort in TACC Stampede for IPoIB (FDR) and OSU-IB (FDR); right panel: Sort in SDSC Gordon for IPoIB (QDR) and OSU-IB (QDR)] For 5GB Sort in 64 nodes on TACC Stampede: 44% improvement over IPoIB (FDR). For 1GB Sort in 16 nodes on SDSC Gordon: 33% improvement over IPoIB (QDR). Can more optimizations be achieved by leveraging more features of Lustre?

65 Are the Current Benchmarks Sufficient for Big Data? The current benchmarks provide some performance behavior. However, they do not provide any information to the designer/developer on: What is happening at the lower layer? Where are the benefits coming from? Which design is leading to benefits or bottlenecks? Which component in the design needs to be changed, and what will be its impact? Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

66 Challenges in Benchmarking of RDMA-based Designs. [Figure: the layered stack (Applications; Big Data Middleware (HDFS, MapReduce, HBase and Memcached); Programming Models (Sockets; other RDMA protocols?); Communication and I/O Library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault tolerance); Networking Technologies (InfiniBand, 1/10/40 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD and SSD)), annotated with "Current Benchmarks" at the upper layers, "No Benchmarks" at the communication and I/O library layer, and "Correlation?" between the two]

67 OSU MPI Micro-Benchmarks (OMB) Suite. A comprehensive suite of benchmarks to: compare performance of different MPI libraries on various networks and systems; validate low-level functionalities; provide insights into the underlying MPI-level designs. Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth, and bi-directional bandwidth. Extended later to: MPI-2 one-sided; collectives; GPU-aware data movement; OpenSHMEM (point-to-point and collectives); UPC. Has become an industry standard. Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems. Available in an integrated manner with the MVAPICH2 stack.

68 Iterative Process Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications. [Figure: the layered stack (Applications; Big Data Middleware (HDFS, MapReduce, HBase and Memcached); Programming Models (Sockets; other RDMA protocols?); Communication and I/O Library (point-to-point communication, I/O and file systems, threaded models and synchronization, QoS, virtualization, fault tolerance); Networking Technologies (InfiniBand, 1/10/40 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), and Storage Technologies (HDD and SSD)), with application-level benchmarks at the upper layers and micro-benchmarks at the communication and I/O library layer]

69 Future Plans of OSU Big Data Project. Upcoming RDMA for Apache Hadoop releases will support: HDFS parallel replication; advanced MapReduce; HBase; Hadoop 2.0.x. Memcached-RDMA. Exploration of other components (threading models, QoS, virtualization, accelerators, etc.). Advanced designs with upper-level changes and optimizations. Comprehensive set of benchmarks.

70 Concluding Remarks. Presented an overview of Big Data, Hadoop and Memcached. Provided an overview of networking technologies. Discussed challenges in accelerating Hadoop. Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase and Memcached. Results are promising. Many other open issues need to be solved. Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner.

71 Personnel Acknowledgments Current Students S. Chakraborthy (Ph.D.) M. Rahman (Ph.D.) N. Islam (Ph.D.) D. Shankar (Ph.D.) J. Jose (Ph.D.) R. Shir (Ph.D.) M. Li (Ph.D.) A. Venkatesh (Ph.D.) S. Potluri (Ph.D.) J. Zhang (Ph.D.) R. Rajachandrasekhar (Ph.D.) Past Students P. Balaji (Ph.D.) W. Huang (Ph.D.) D. Buntinas (Ph.D.) W. Jiang (M.S.) S. Bhagvat (M.S.) S. Kini (M.S.) L. Chai (Ph.D.) M. Koop (Ph.D.) B. Chandrasekharan (M.S.) R. Kumar (M.S.) N. Dandapanthula (M.S.) S. Krishnamoorthy (M.S.) V. Dhanraj (M.S.) K. Kandalla (Ph.D.) T. Gangadharappa (M.S.) P. Lai (M.S.) K. Gopalakrishnan (M.S.) J. Liu (Ph.D.) Current Senior Research Associate H. Subramoni Current Post-Docs Current Programmers X. Lu M. Arnold K. Hamidouche J. Perkins M. Luo (Ph.D.) G. Santhanaraman (Ph.D.) A. Mamidala (Ph.D.) A. Singh (Ph.D.) G. Marsh (M.S.) J. Sridhar (M.S.) V. Meshram (M.S.) S. Sur (Ph.D.) S. Naravula (Ph.D.) H. Subramoni (Ph.D.) R. Noronha (Ph.D.) K. Vaidyanathan (Ph.D.) X. Ouyang (Ph.D.) A. Vishnu (Ph.D.) S. Pai (M.S.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Post-Docs H. Wang X. Besseron H.-W. Jin E. Mancini S. Marcarelli J. Vienne Past Research Scientist S. Sur Past Programmers D. Bureddy M. Luo 71

72 Pointers