Big Data: Hadoop and Memcached. Talk at the HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
Introduction to Big Data Applications and Analytics: Big Data has become one of the most important elements of business analytics, providing groundbreaking opportunities for enterprise information management and decision making. The amount of data is exploding; companies are capturing and digitizing more information than ever, and the rate of information growth appears to be exceeding Moore's Law. Commonly accepted 3 V's of Big Data: Volume, Velocity, Variety (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/nist-stonebraker.pdf). 5 V's of Big Data: the 3 V's plus Value and Veracity.
Not Only in Internet Services: Big Data in Scientific Domains. Scientific data management, analysis, and visualization. Application examples: climate modeling, combustion, fusion, astrophysics, bioinformatics. Data-intensive tasks: run large-scale simulations on supercomputers, dump data onto parallel storage systems, collect experimental/observational data, and move experimental/observational data to analysis sites; visual analytics help understand the data visually.
Typical Solutions or Architectures for Big Data Applications and Analytics. Hadoop (http://hadoop.apache.org): the most popular framework for Big Data analytics; HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc. Spark (http://spark-project.org): provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly. Shark (http://shark.cs.berkeley.edu): a fully Apache Hive-compatible data warehousing system. Storm (http://storm-project.net): a distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc. S4 (http://incubator.apache.org/s4): a distributed system for processing continuous unbounded streams of data. Web 2.0: RDBMS + Memcached (http://memcached.org); Memcached is a high-performance, distributed memory object caching system.
Who Are Using Hadoop? Focused on large data and data analysis, the Hadoop environment (e.g. HDFS, MapReduce, HBase, RPC) is gaining a lot of momentum: http://wiki.apache.org/hadoop/poweredby
Hadoop at Yahoo! Scale: 42,000 nodes, 365 PB of HDFS storage, 1M daily slot hours (http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-17171422.html). A new Jim Gray Sort record: 1.42 TB/min over 2,100 nodes using Hadoop 0.23.7, an early branch of Hadoop 2.X (http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-927488.html, http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html).
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Big Data Processing with Hadoop: an open-source implementation of the MapReduce programming model for Big Data analytics. Major components: HDFS and MapReduce; the underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase. The model scales, but the large amount of communication during intermediate phases can be further optimized. (Hadoop framework stack: user applications on top of MapReduce and HBase, over HDFS and Hadoop Common (RPC).)
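The programming model above is easiest to see in code. The following is a minimal word-count style job written against the standard Hadoop 1.x Java MapReduce API; it is a generic textbook example rather than anything specific to this talk, and the input/output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every token in the input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (tok.isEmpty()) continue;
        word.set(tok);
        ctx.write(word, ONE); // intermediate output, shuffled to the reducers
      }
    }
  }

  // Reduce phase: sum the counts that arrive for each word after the shuffle.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

All of the map output in this example reaches the reducers through the shuffle phase, which is exactly the socket-based communication path that the designs later in the talk accelerate.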
Hadoop MapReduce Architecture and Operations: Map and Reduce tasks carry out the overall job execution, with intermediate data going through disk operations; communication in the shuffle phase uses HTTP over Java sockets.
Hadoop Distributed File System (HDFS): the primary storage of Hadoop; highly reliable and fault-tolerant; adopted by many reputed organizations, e.g. Facebook and Yahoo!. The NameNode stores the file system namespace; DataNodes store data blocks. Developed in Java for platform-independence and portability; uses sockets for communication. (HDFS architecture diagram: clients interact with the NameNode and DataNodes via RPC.)
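Since every HDFS client reaches the NameNode over RPC and streams block data to a pipeline of DataNodes over sockets, even a trivial write exercises the communication path discussed in this talk. Below is a minimal client-side sketch against the standard HDFS Java API; the path and replication factor are chosen only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);     // metadata operations go to the NameNode over RPC

    Path file = new Path("/user/demo/sample.txt"); // illustrative path

    // create() asks the NameNode to allocate blocks; the bytes themselves are
    // streamed to a pipeline of DataNodes (plain sockets in stock Hadoop).
    FSDataOutputStream out = fs.create(file, (short) 3 /* replication factor */);
    out.writeBytes("hello HDFS\n");
    out.close(); // flushes the stream and completes the block

    System.out.println("exists: " + fs.exists(file));
    fs.close();
  }
}
```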
System-Level Interaction Between Clients and DataNodes in HDFS: (diagram: HDFS clients communicate over high-performance networks with HDFS DataNodes, each backed by HDD/SSD storage).
HBase: the Apache Hadoop database (http://hbase.apache.org/); a highly scalable, semi-structured database. An integral part of many datacenter applications, e.g. the Facebook Social Inbox. Developed in Java for platform-independence and portability; uses sockets for communication. (HBase architecture: HBase clients, HMaster, HRegion servers, and ZooKeeper, with HDFS underneath.)
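A minimal HBase client interaction, written against the HBase 0.9x-era Java client API (an assumption about the version; the table, column family, and values are illustrative), shows the Put and Get operations whose communication costs are analyzed later in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml (ZooKeeper quorum, etc.)
    HTable table = new HTable(conf, "usertable");     // illustrative table name

    // Put: the client locates the owning HRegionServer via ZooKeeper/META and
    // ships the mutation over the socket-based RPC path.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
    table.put(put);

    // Get: the same region lookup, followed by a read RPC to the region server.
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

    table.close();
  }
}
```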
System-Level Interaction Among HBase Clients, Region Servers, and DataNodes: (diagram: HBase clients, HRegion servers, and DataNodes, each backed by HDD/SSD storage, connected through high-performance networks).
Memcached Architecture: (diagram: web frontend servers acting as Memcached clients reach Memcached servers and database servers over the Internet and high-performance networks; each server node has CPUs, main memory, SSD, and HDD). Memcached is an integral part of the Web 2.0 architecture: a distributed caching layer that aggregates spare memory from multiple nodes; general purpose, typically used to cache database queries and results of API calls; a scalable model, but typical usage is very network intensive.
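The cache-aside usage pattern described above looks roughly like the sketch below, written here with the spymemcached Java client purely as one representative client library; the key scheme, TTL, and database call are placeholders.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
  private final MemcachedClient cache;

  public CacheAsideExample(String host, int port) throws Exception {
    cache = new MemcachedClient(new InetSocketAddress(host, port));
  }

  // Cache-aside read: try Memcached first, fall back to the database on a miss.
  public String getUserProfile(String userId) {
    String key = "user:" + userId;          // illustrative key scheme
    Object cached = cache.get(key);         // one network round trip to a Memcached server
    if (cached != null) {
      return (String) cached;               // cache hit
    }
    String profile = queryDatabase(userId); // cache miss: the expensive RDBMS query
    cache.set(key, 3600, profile);          // populate the cache with a 1-hour TTL
    return profile;
  }

  private String queryDatabase(String userId) {
    return "profile-of-" + userId;          // placeholder for the real database lookup
  }

  public void shutdown() {
    cache.shutdown();
  }

  public static void main(String[] args) throws Exception {
    CacheAsideExample app = new CacheAsideExample("127.0.0.1", 11211);
    System.out.println(app.getUserProfile("42")); // miss on the first call
    System.out.println(app.getUserProfile("42")); // served from Memcached
    app.shutdown();
  }
}
```

Every get() and set() above is a small request/response over the network, which is why this workload is so sensitive to latency and why typical usage is described as very network intensive.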
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org): (figure: number of clusters and percentage of clusters in the Top500 plotted over time).
HPC and Advanced Interconnects: High-Performance Computing (HPC) has adopted advanced interconnects and protocols: InfiniBand, 10 Gigabit Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE). These deliver very good performance: low latency (a few microseconds), high bandwidth (100 Gb/s with dual FDR InfiniBand), and low CPU overhead (5-10%). The OpenFabrics software stack with IB, iWARP, and RoCE interfaces is driving HPC systems; many such systems are in the Top500 list.
Trends of Networking Technologies in Top500 Systems: (figure: interconnect family share of Top500 systems over time); the percentage share of InfiniBand is steadily increasing.
Use of High-Performance Networks for Scientific Computing: Message Passing Interface (MPI), the most commonly used programming model for HPC applications; parallel file systems; storage systems. Almost 13 years of research and development since InfiniBand was introduced in October 2000. Other programming models are emerging to take advantage of high-performance networks: UPC, OpenSHMEM, and hybrid MPI+PGAS (OpenSHMEM and UPC).
One-way Latency: MPI over IB with MVAPICH2: (figure: small-message and large-message latency in microseconds versus message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies are in the 1-2 us range across these configurations). Platforms: DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch.
Bandwidth: MPI over IB with MVAPICH2: (figure: unidirectional and bidirectional bandwidth in MBytes/sec versus message size for the same seven adapter configurations; peak unidirectional bandwidth exceeds 12,000 MBytes/sec and peak bidirectional bandwidth reaches about 24,700 MBytes/sec with dual-FDR ConnectIB). Platforms: DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-Dual FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch.
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Can High-Performance Interconnects Benefit Hadoop? Most current Hadoop environments use Ethernet infrastructure with sockets, raising concerns about performance and scalability. Usage of high-performance networks is beginning to draw interest from many companies. What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? Can HPC clusters with high-performance networks be used for Big Data applications using Hadoop?
Designing Communication and I/O Libraries for Big Data Systems: Challenges. (stack diagram: applications and benchmarks on top; Big Data middleware (HDFS, MapReduce, HBase and Memcached); programming models (sockets, and possibly other protocols); a communication and I/O library providing point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, and fault-tolerance; underneath, networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
Common Protocols using OpenFabrics: (diagram: applications/middleware can use either the sockets interface or the verbs interface; sockets-based options include plain Ethernet (1/10/40 GigE over the kernel TCP/IP stack), hardware-offloaded Ethernet (10/40 GigE-TOE), IPoIB, RSockets, and SDP, while verbs-based options include iWARP, RoCE, and native IB verbs; each option maps onto the corresponding Ethernet or InfiniBand adapters and switches, with the RDMA paths operating from user space.)
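One practical consequence of this protocol stack: unmodified sockets code already runs over several of these transports. The plain Java client below works over 1/10/40 GigE or IPoIB simply by connecting to an address bound to the appropriate interface, and on a JDK 7 / OFED setup (an assumption about the deployment) it can be mapped onto SDP with the -Dcom.sun.sdp.conf=<rules file> JVM option with no code change; only the verbs-level paths on the right of the diagram require a different programming interface. The hostname and port here are made up.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class PlainSocketClient {
  public static void main(String[] args) throws Exception {
    // The same code runs over Ethernet, IPoIB, or (via a JVM-level rules file) SDP;
    // which transport is used depends only on the address and the deployment.
    String host = args.length > 0 ? args[0] : "server-ib0"; // illustrative hostname
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 9000;

    try (Socket sock = new Socket(host, port)) {
      PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(sock.getInputStream()));
      out.println("ping"); // stream semantics: ordered bytes, with kernel copies on the host
      System.out.println("reply: " + in.readLine());
    }
  }
}
```

The convenience of this portability is exactly what the next slide questions: the sockets abstraction itself, rather than the wire underneath, becomes the performance bottleneck.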
Can Hadoop be Accelerated with High-Performance Networks and Protocols? (diagram: current design: application over sockets over a 1/10 GigE network; enhanced designs: application over accelerated sockets with verbs/hardware offload over 10 GigE or InfiniBand; our approach: application over the OSU design with a verbs interface over 10 GigE or InfiniBand.) Sockets were not designed for high performance: stream semantics often mismatch the upper layers (Hadoop and HBase), and zero-copy is not available for non-blocking sockets.
Presentation Outline: Overview of Hadoop and Memcached; Trends in Networking Technologies and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
RDMA for Apache Hadoop Project: a high-performance design of Hadoop over RDMA-enabled interconnects, with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components. Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB). Current release: 0.9.8, based on Apache Hadoop 1.2.1 and compliant with Apache Hadoop 1.2.1 APIs and applications. Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR), RoCE, various multi-core platforms, and different file systems with disks and SSDs. http://hadoop-rdma.cse.ohio-state.edu
Design Overview of HDFS with RDMA: enables high-performance RDMA communication while supporting the traditional socket interface. Design features: RDMA-based HDFS write, RDMA-based HDFS replication, and InfiniBand/RoCE support. (diagram: HDFS Write and other applications go either through the Java socket interface over 1/10 GigE or IPoIB, or through the Java Native Interface (JNI) into the OSU design, which uses verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).) The JNI layer bridges the Java-based HDFS with a communication library written in native code. Only the communication part of HDFS Write is modified; there is no change to the HDFS architecture.
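The JNI bridging mentioned above can be pictured with a small, purely hypothetical sketch of the Java side of such a bridge; the library name, method names, and signatures below are invented for illustration and are not the actual OSU implementation, which additionally handles connection management, memory registration caching, and completion handling in native code.

```java
// Hypothetical JNI bridge: Java middleware calls into a native library that
// issues RDMA verbs operations. All names and signatures are illustrative only.
public class RdmaBridge {
  static {
    // A native shared library (e.g. librdmabridge.so) built against libibverbs.
    System.loadLibrary("rdmabridge");
  }

  // Establish a verbs-level connection to a peer (e.g. a DataNode).
  public native long connect(String host, int port);

  // Register a direct buffer with the HCA so it can be used for zero-copy transfers.
  public native long registerBuffer(java.nio.ByteBuffer directBuffer);

  // Post an RDMA write of 'length' bytes from the registered buffer.
  public native int rdmaWrite(long connHandle, long regionHandle, int length);

  // Tear down the connection and release the registered memory.
  public native void close(long connHandle);

  public static void main(String[] args) {
    RdmaBridge bridge = new RdmaBridge();
    java.nio.ByteBuffer buf = java.nio.ByteBuffer.allocateDirect(64 * 1024);
    long conn = bridge.connect("datanode1-ib0", 12345); // illustrative endpoint
    long region = bridge.registerBuffer(buf);
    buf.put("block payload".getBytes());
    bridge.rdmaWrite(conn, region, buf.position());
    bridge.close(conn);
  }
}
```

The point of this structure is that the Java-level HDFS code above the bridge stays unchanged; only the bytes of an HDFS write travel through the native verbs path.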
Communication Time in HDFS: (figure: communication time in seconds versus file size from 2 GB to 10 GB for 1GigE, IPoIB (QDR), and OSU-IB (QDR) on a cluster with HDD DataNodes). 30% improvement in communication time over IPoIB (QDR) and 56% improvement over 1GigE; similar improvements are obtained for SSD DataNodes. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012.
Evaluations using TestDFSIO: (figure: total throughput in MBps versus file size from 5 GB to 20 GB; left: 1GigE, IPoIB (QDR), and OSU-IB (QDR) on a cluster with 8 HDD nodes, TestDFSIO with 8 maps; right: IPoIB (QDR) and OSU-IB (QDR) on a cluster with 4 SSD nodes, TestDFSIO with 4 maps). With 8 HDD DataNodes (single disk per node): 24% improvement over IPoIB (QDR) for a 20GB file size. With 4 SSD DataNodes (single SSD per node): 61% improvement over IPoIB (QDR) for a 20GB file size.
Evaluations using TestDFSIO: (figure: total throughput in MBps versus file size from 5 GB to 20 GB for 1GigE, IPoIB (QDR), and OSU-IB (QDR), each with 1 HDD and with 2 HDD per node). Cluster with 4 DataNodes, 1 HDD per node: 24% improvement over IPoIB (QDR) for a 20GB file size. Cluster with 4 DataNodes, 2 HDD per node: 31% improvement over IPoIB (QDR) for a 20GB file size. 2 HDD versus 1 HDD: 76% improvement for OSU-IB (QDR) and 66% improvement for IPoIB (QDR).
Evaluations using Enhanced DFSIO of Intel HiBench on SDSC Gordon: (figure: aggregated throughput in MBps and execution time in seconds for file sizes of 32, 64, and 128 GB with IPoIB (QDR) and OSU-IB (QDR)). Cluster with 32 DataNodes, single SSD per node: 47% improvement in throughput and 16% improvement in latency over IPoIB (QDR) for a 128GB file size.
Evaluations using Enhanced DFSIO of Intel HiBench on TACC Stampede: (figure: aggregated throughput in MBps and execution time in seconds for file sizes of 64, 128, and 256 GB with IPoIB (FDR) and OSU-IB (FDR)). Cluster with 64 DataNodes, single HDD per node: 64% improvement in throughput and 37% improvement in latency over IPoIB (FDR) for a 256GB file size.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Design Overview of MapReduce with RDMA: (diagram: MapReduce (JobTracker, TaskTrackers, Map and Reduce tasks) runs either over the Java socket interface on 1/10 GigE or IPoIB, or through the Java Native Interface (JNI) into the OSU design, which uses verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).) Design features for RDMA: prefetch/caching of MOF (Map Output Files), in-memory merge, and overlap of merge and reduce. Enables high-performance RDMA communication while supporting the traditional socket interface; the JNI layer bridges the Java-based MapReduce with a communication library written in native code; InfiniBand/RoCE support.
Evaluations using Sort: (figure: job execution time in seconds for data sizes of 25-40 GB with 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR); left: 8 HDD DataNodes, right: 8 SSD DataNodes). With 8 HDD DataNodes, for a 40GB sort: 41% improvement over IPoIB (QDR) and 39% improvement over Hadoop-A-IB (QDR). With 8 SSD DataNodes, for a 40GB sort: 52% improvement over IPoIB (QDR) and 44% improvement over Hadoop-A-IB (QDR). M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013.
Evaluations using TeraSort: (figure: job execution time in seconds for data sizes of 60, 80, and 100 GB with 1GigE, IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR)). For a 100 GB TeraSort with 8 DataNodes and 2 HDD per node: 51% improvement over IPoIB (QDR) and 41% improvement over Hadoop-A-IB (QDR).
Performance Evaluation on Larger Clusters: (figure: job execution time in seconds; left: Sort on the OSU cluster with IPoIB (QDR), Hadoop-A-IB (QDR), and OSU-IB (QDR) for data sizes of 60, 120, and 240 GB on cluster sizes of 16, 32, and 64; right: TeraSort on TACC Stampede with IPoIB (FDR), Hadoop-A-IB (FDR), and OSU-IB (FDR) for data sizes of 80, 160, and 320 GB on cluster sizes of 16, 32, and 64). Sort on the OSU cluster: for a 240GB sort on 64 nodes, 40% improvement over IPoIB (QDR) with HDD used for HDFS. TeraSort on TACC Stampede: for a 320GB TeraSort on 64 nodes, 38% improvement over IPoIB (FDR) with HDD used for HDFS.
Evaluations using the PUMA Workload: (figure: normalized execution time for 1GigE, IPoIB (QDR), and OSU-IB (QDR) on the AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB) benchmarks). 50% improvement in SelfJoin over IPoIB (QDR) for an 80 GB data size; 49% improvement in SeqCount over IPoIB (QDR) for a 30 GB data size.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Motivation: Detailed Analysis of HBase Put/Get: (figure: time breakdown in microseconds into communication, communication preparation, server processing, server serialization, client processing, and client serialization for 1KB HBase Put and 1KB HBase Get with IPoIB (DDR), 1GigE, and OSU-IB (DDR)). HBase 1KB Put: communication time of 8.9 us, a factor of 6X improvement over 1GigE in communication time. HBase 1KB Get: communication time of 8.9 us, a factor of 6X improvement over 1GigE in communication time. M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12.
HBase Single-Server-Multi-Client Results: (figure: latency in microseconds and throughput in ops/sec versus number of clients from 1 to 16 for IPoIB (DDR), OSU-IB (DDR), and 1GigE). HBase Get latency: 4 clients: 14.5 us; 16 clients: 296.1 us. HBase Get throughput: 4 clients: 37.1 Kops/sec; 16 clients: 53.4 Kops/sec; 27% improvement in throughput for 16 clients over 1GigE. J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12.
HBase YCSB Read-Write Workload: (figure: read latency and write latency versus number of clients from 8 to 128 for IPoIB (QDR), 1GigE, and OSU-IB (QDR)). HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients: 2.0 ms; 128 clients: 3.5 ms; 42% improvement over IPoIB for 128 clients. HBase Put latency: 64 clients: 1.9 ms; 128 clients: 3.5 ms; 40% improvement over IPoIB for 128 clients.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Design Overview of Hadoop RPC with RDMA: (diagram: applications use either the default Java socket interface over 1/10 GigE or IPoIB, or our design through the Java Native Interface (JNI) into the OSU design, which uses verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE).) Design features: JVM-bypassed buffer management, on-demand connection setup, RDMA- or send/recv-based adaptive communication, and InfiniBand/RoCE support. Enables high-performance RDMA communication while supporting the traditional socket interface.
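For context on what is being accelerated here: Hadoop 1.x exposes its RPC layer through org.apache.hadoop.ipc.RPC, where a protocol is just a Java interface served by a daemon and invoked through a dynamic proxy on the client. The sketch below is a toy protocol (the interface, port, and method are invented, and the exact ipc API shown is an assumption about the Hadoop 1.x version in use); it shows the request/response path that the RDMA-based design replaces underneath.

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.ProtocolSignature;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.ipc.VersionedProtocol;

public class RpcDemo {
  /** Toy protocol; HDFS and MapReduce daemons define their protocols the same way. */
  public interface PingProtocol extends VersionedProtocol {
    long VERSION = 1L;
    String ping(String msg);
  }

  /** Server-side implementation of the protocol. */
  public static class PingServer implements PingProtocol {
    public String ping(String msg) { return "pong: " + msg; }
    public long getProtocolVersion(String protocol, long clientVersion) {
      return VERSION;
    }
    public ProtocolSignature getProtocolSignature(String protocol,
        long clientVersion, int clientMethodsHash) throws java.io.IOException {
      return ProtocolSignature.getProtocolSignature(this, protocol,
          clientVersion, clientMethodsHash);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Start an RPC server on an illustrative port.
    Server server = RPC.getServer(new PingServer(), "0.0.0.0", 50123, conf);
    server.start();

    // Client side: obtain a dynamic proxy and make a call. Each invocation is a
    // request/response over the RPC transport (Java sockets in stock Hadoop).
    PingProtocol proxy = (PingProtocol) RPC.getProxy(PingProtocol.class,
        PingProtocol.VERSION, new InetSocketAddress("127.0.0.1", 50123), conf);
    System.out.println(proxy.ping("hello"));

    RPC.stopProxy(proxy);
    server.stop();
  }
}
```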
Hadoop RPC over IB: Gain in Latency and Throughput: (figure: latency in microseconds versus payload size from 1 byte to 4 KB, and throughput in Kops/sec versus number of clients from 8 to 64, for 1GigE, IPoIB (QDR), and OSU-IB (QDR)). Hadoop RPC over IB ping-pong latency: 1 byte: 39 us; 4 KB: 52 us; 42%-49% and 46%-50% improvements compared with the performance on 1 GigE and IPoIB (32Gbps), respectively. Hadoop RPC over IB throughput: 512 bytes and 48 clients: 135.22 Kops/sec; 82% and 64% improvements compared with the peak performance on 1 GigE and IPoIB (32Gbps), respectively.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Performance Benefits: TestDFSIO: (figure: total throughput in MBps versus file size from 5 GB to 20 GB for 1GigE, IPoIB (QDR), and OSU-IB (QDR); left: cluster with 4 HDD nodes, TestDFSIO with 4 maps in total; right: cluster with 4 SSD nodes, TestDFSIO with 4 maps in total). For 5GB: 53.4% improvement over IPoIB and 126.9% improvement over 1GigE with HDD used for HDFS; 57.8% improvement over IPoIB and 138.1% improvement over 1GigE with SSD used for HDFS. For 20GB: 1.7% improvement over IPoIB and 46.8% improvement over 1GigE with HDD used for HDFS; 73.3% improvement over IPoIB and 141.3% improvement over 1GigE with SSD used for HDFS.
Performance Benefits: Sort: (figure: job execution time in seconds for sort sizes of 48, 64, and 80 GB with 1GigE, IPoIB (QDR), and OSU-IB (QDR); left: cluster with 8 HDD nodes, right: cluster with 8 SSD nodes; Sort with 8 maps and 8 reduces per host). For an 80GB sort in a cluster of size 8: 40.7% improvement over IPoIB and 32.2% improvement over 1GigE with HDD used for HDFS; 43.6% improvement over IPoIB and 42.9% improvement over 1GigE with SSD used for HDFS.
Performance Benefits: TeraSort: (figure: job execution time in seconds for sizes of 80, 90, and 100 GB with 1GigE, IPoIB (QDR), and OSU-IB (QDR); left: cluster with 8 HDD nodes, right: cluster with 8 SSD nodes; TeraSort with 8 maps and 8 reduces per host). For a 100GB TeraSort in a cluster of size 8: 45% improvement over IPoIB and 39% improvement over 1GigE with HDD used for HDFS; 46% improvement over IPoIB and 40% improvement over 1GigE with SSD used for HDFS.
Performance Benefits: RandomWriter and Sort on SDSC Gordon: (figure: job execution time in seconds for IPoIB (QDR) and OSU-IB (QDR) with data sizes of 50, 100, 200, and 300 GB on clusters of 16, 32, 48, and 64 nodes; left: RandomWriter, right: Sort; single SSD per node). RandomWriter: 20% improvement over IPoIB (QDR) for 50GB on a cluster of 16, and 36% improvement over IPoIB (QDR) for 300GB on a cluster of 64. Sort: 16% improvement over IPoIB (QDR) for 50GB on a cluster of 16, and 20% improvement over IPoIB (QDR) for 300GB on a cluster of 64.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Memcached-RDMA Design: (diagram: sockets clients and RDMA clients connect to a master thread, which hands them off to sockets worker threads or verbs worker threads; all worker threads operate on shared data: memory slabs and items). The server and client perform a negotiation protocol, and the master thread assigns each client to the appropriate worker thread. Once a client is assigned a verbs worker thread, it can communicate with it directly and is bound to that thread. All other Memcached data structures are shared among the RDMA and sockets worker threads.
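The dispatch logic described above is essentially a transport-negotiating accept loop in front of a shared item store. The sketch below is a hypothetical, heavily simplified rendering of that pattern, written in Java only for consistency with the other examples; the actual Memcached server and the OSU design are written in C, and every name here is invented.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of the master/worker dispatch described on this slide. */
public class TransportDispatcher {
  // Shared item store, visible to both the sockets workers and the verbs workers.
  private final ConcurrentHashMap<String, byte[]> sharedItems = new ConcurrentHashMap<>();

  private final ExecutorService socketsWorkers = Executors.newFixedThreadPool(2);
  private final ExecutorService verbsWorkers = Executors.newFixedThreadPool(2);

  public void run(int port) throws Exception {
    try (ServerSocket master = new ServerSocket(port)) {
      while (true) {
        Socket client = master.accept();
        // Negotiation: the first byte announces which transport the client wants.
        int transport = client.getInputStream().read();
        if (transport == 'R') {
          // Hand the client to a verbs worker; from this point on it would talk
          // to that worker directly over RDMA (not modeled in this sketch).
          verbsWorkers.submit(() -> serve(client, "verbs"));
        } else {
          socketsWorkers.submit(() -> serve(client, "sockets"));
        }
      }
    }
  }

  private void serve(Socket client, String transport) {
    // Both worker types operate on the same shared data structures.
    sharedItems.putIfAbsent("last-transport", transport.getBytes());
    // ... per-client get/set handling would go here ...
  }

  public static void main(String[] args) throws Exception {
    new TransportDispatcher().run(11211); // Memcached's default port, used illustratively
  }
}
```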
Memcached Get Latency (Small Messages): (figure: latency in microseconds versus message size from 1 byte to 2 KB for IPoIB, 10GigE, OSU-RC-IB, and OSU-UD-IB; left: Intel Clovertown cluster (IB: DDR), right: Intel Westmere cluster (IB: QDR)). Memcached Get latency for 4 bytes: RC/UD on DDR: 6.82/7.55 us; on QDR: 4.28/4.86 us. For 2K bytes: RC/UD on DDR: 12.31/12.78 us; on QDR: 8.19/8.46 us. Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster.
Memcached Get TPS (4 bytes): (figure: thousands of transactions per second versus number of clients for IPoIB (QDR), OSU-RC-IB (QDR), and OSU-UD-IB (QDR)). Memcached Get transactions per second for 4 bytes: on IB QDR, 1.4M/s (RC) and 1.3M/s (UD) for 8 clients; significant improvement with native IB QDR compared to SDP and IPoIB.
Memcached Performance (FDR Interconnect): (figure: Memcached Get latency in microseconds versus message size, and throughput in thousands of transactions per second versus number of clients, for OSU-IB (FDR) and IPoIB (FDR)). Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR). Memcached Get latency: 4 bytes: OSU-IB: 2.84 us, IPoIB: 75.53 us; 2K bytes: OSU-IB: 4.49 us, IPoIB: 123.42 us. Memcached throughput (4 bytes): 4080 clients: OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec; nearly 2X improvement in throughput.
Application-Level Evaluation: Olio Benchmark: (figure: time in milliseconds versus number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR); left: 1-8 clients, right: 64-1024 clients). Olio benchmark: RC 1.6 sec, UD 1.9 sec, Hybrid 1.7 sec for 1024 clients; 4X better than IPoIB for 8 clients; the hybrid design achieves performance comparable to that of the pure RC design.
Application-Level Evaluation: Real Application Workloads: (figure: time in milliseconds versus number of clients for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR); left: 1-8 clients, right: 64-1024 clients). Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients; 12X better than IPoIB for 8 clients; the hybrid design achieves performance comparable to that of the pure RC design. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11. J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid '12.
Presentation Outline: Overview of Hadoop and Memcached; Trends in High Performance Networking and Protocols; Challenges in Accelerating Hadoop and Memcached; Designs and Case Studies (Hadoop: HDFS, MapReduce, HBase, RPC, combination of components; Memcached); Open Challenges; Conclusion and Q&A.
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges. (stack diagram, as before: applications and benchmarks; Big Data middleware (HDFS, MapReduce, HBase and Memcached) with the question of upper-level changes; programming models (sockets, and other RDMA protocols?); the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance); networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
More Challenges. Multi-threading and synchronization: multi-threaded model exploration; fine-grained synchronization and lock-free design; unified helper threads for different components; multi-endpoint design to support multi-threaded communication. QoS and virtualization: network virtualization and locality-aware communication for Big Data middleware; hardware-level virtualization support for end-to-end QoS; I/O scheduling and storage virtualization; live migration.
More Challenges (Cont'd). Support for accelerators: efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs; offload computation-intensive workloads to accelerators; explore maximum overlap between communication and offloaded computation. Fault tolerance enhancements: novel data replication schemes for high-efficiency fault tolerance; exploration of light-weight fault tolerance mechanisms for Big Data processing. Support for parallel file systems: optimize Big Data middleware over parallel file systems (e.g. Lustre) on modern HPC clusters.
Performance Improvement Potential of MapReduce over Lustre on HPC Clusters: An Initial Study. (figure: job execution time in seconds; left: Sort on TACC Stampede with IPoIB (FDR) and OSU-IB (FDR) for data sizes of 300-500 GB; right: Sort on SDSC Gordon with IPoIB (QDR) and OSU-IB (QDR) for data sizes of 60-100 GB). For a 500GB sort on 64 nodes: 44% improvement over IPoIB (FDR). For a 100GB sort on 16 nodes: 33% improvement over IPoIB (QDR). Can more optimizations be achieved by leveraging more features of Lustre?
Are the Current Benchmarks Sufficient for Big Data? The current benchmarks provide some performance behavior. However, they do not provide any information to the designer/developer on: what is happening at the lower layer; where the benefits are coming from; which design is leading to benefits or bottlenecks; which component in the design needs to be changed and what its impact will be; and whether performance gain/loss at the lower layer can be correlated to the performance gain/loss observed at the upper layer.
Challenges in Benchmarking of RDMA-based Designs. (stack diagram: the current benchmarks sit at the applications / Big Data middleware level (HDFS, MapReduce, HBase and Memcached) over programming models (sockets, and other RDMA protocols?), with an open question of how they correlate to lower layers; there are no benchmarks at the level of the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance) over networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
OSU MPI Micro-Benchmarks (OMB) Suite: a comprehensive suite of benchmarks to compare the performance of different MPI libraries on various networks and systems, validate low-level functionalities, and provide insights into the underlying MPI-level designs. Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth, and bi-directional bandwidth; extended later to MPI-2 one-sided, collectives, GPU-aware data movement, OpenSHMEM (point-to-point and collectives), and UPC. Has become an industry standard, extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems. Available from http://mvapich.cse.ohio-state.edu/benchmarks and in an integrated manner with the MVAPICH2 stack.
Iterative Process: Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications. (stack diagram: applications-level benchmarks target the applications and Big Data middleware (HDFS, MapReduce, HBase and Memcached) over programming models (sockets, and other RDMA protocols?), while micro-benchmarks target the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance) over networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).)
Future Plans of the OSU Big Data Project: upcoming RDMA for Apache Hadoop releases will support HDFS parallel replication, advanced MapReduce, HBase, Hadoop 2.0.x, and Memcached-RDMA; exploration of other components (threading models, QoS, virtualization, accelerators, etc.); advanced designs with upper-level changes and optimizations; a comprehensive set of benchmarks. More details at http://hadoop-rdma.cse.ohio-state.edu
Concluding Remarks: presented an overview of Big Data, Hadoop and Memcached; provided an overview of networking technologies; discussed challenges in accelerating Hadoop; presented initial designs that take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase and Memcached. Results are promising, and many other open issues still need to be solved. This work will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner.
Personnel Acknowledgments.
Current Students: S. Chakraborthy (Ph.D.), M. Rahman (Ph.D.), N. Islam (Ph.D.), D. Shankar (Ph.D.), J. Jose (Ph.D.), R. Shir (Ph.D.), M. Li (Ph.D.), A. Venkatesh (Ph.D.), S. Potluri (Ph.D.), J. Zhang (Ph.D.), R. Rajachandrasekhar (Ph.D.)
Past Students: P. Balaji (Ph.D.), W. Huang (Ph.D.), D. Buntinas (Ph.D.), W. Jiang (M.S.), S. Bhagvat (M.S.), S. Kini (M.S.), L. Chai (Ph.D.), M. Koop (Ph.D.), B. Chandrasekharan (M.S.), R. Kumar (M.S.), N. Dandapanthula (M.S.), S. Krishnamoorthy (M.S.), V. Dhanraj (M.S.), K. Kandalla (Ph.D.), T. Gangadharappa (M.S.), P. Lai (M.S.), K. Gopalakrishnan (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), G. Santhanaraman (Ph.D.), A. Mamidala (Ph.D.), A. Singh (Ph.D.), G. Marsh (M.S.), J. Sridhar (M.S.), V. Meshram (M.S.), S. Sur (Ph.D.), S. Naravula (Ph.D.), H. Subramoni (Ph.D.), R. Noronha (Ph.D.), K. Vaidyanathan (Ph.D.), X. Ouyang (Ph.D.), A. Vishnu (Ph.D.), S. Pai (M.S.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Current Senior Research Associate: H. Subramoni
Current Post-Docs: X. Lu, K. Hamidouche
Current Programmers: M. Arnold, J. Perkins
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, E. Mancini, S. Marcarelli, J. Vienne
Past Research Scientist: S. Sur
Past Programmers: D. Bureddy, M. Luo
Pointers: http://nowlab.cse.ohio-state.edu, http://mvapich.cse.ohio-state.edu, http://hadoop-rdma.cse.ohio-state.edu, panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda