Accelerating Big Data Processing with Hadoop, Spark and Memcached




Accelerating Big Data Processing with Hadoop, Spark and Memcached Talk at HPC Advisory Council Stanford Conference (Feb 15) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

Introduction to Big Data Applications and Analytics Big Data has become one of the most important elements of business analytics Provides groundbreaking opportunities for enterprise information management and decision making The amount of data is exploding; companies are capturing and digitizing more information than ever The rate of information growth appears to be exceeding Moore's Law Commonly accepted 3Vs of Big Data: Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/nist-stonebraker.pdf 5Vs of Big Data: 3V + Value, Veracity 2

Data Management and Processing on Modern Clusters Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers Front-end data accessing and serving (Online) Memcached + DB (e.g. MySQL), HBase Back-end data analytics (Offline) HDFS, MapReduce, Spark 3

Overview of Apache Hadoop Architecture Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN http://hadoop.apache.org [Figure: Hadoop 1.x stack - MapReduce (cluster resource management & data processing) over the Hadoop Distributed File System (HDFS) over Hadoop Common/Core (RPC, ...); Hadoop 2.x stack - MapReduce and other data-processing models over YARN (cluster resource management & job scheduling) over HDFS over Hadoop Common/Core (RPC, ...)] 4
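
To make the MapReduce model above concrete, the sketch below is a minimal WordCount job written against the standard Hadoop 2.x Java API. It is illustrative only (class names and input/output paths are placeholders), not code from any particular Hadoop distribution discussed later.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every token in the input split read from HDFS.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // intermediate pairs are shuffled to reducers
      }
    }
  }

  // Reduce phase: sum the counts received for each word over the shuffle.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // YARN schedules the map/reduce tasks
  }
}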

Spark Architecture Overview An in-memory data-processing framework: iterative machine learning jobs, interactive data analytics Scala-based implementation Runs standalone or on YARN and Mesos Scalable and communication intensive: wide dependencies between Resilient Distributed Datasets (RDDs), MapReduce-like shuffle operations to repartition RDDs, sockets-based communication [Figure: the Driver (SparkContext) coordinates with a Master and Zookeeper and with Worker nodes backed by HDFS] http://spark.apache.org 5
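
The "wide dependency" shuffle mentioned above is what the later GroupBy results exercise. Below is a minimal sketch using Spark's Java API; it assumes Java 8 lambdas and a Spark 1.x-era deployment, and the HDFS paths are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class GroupBySketch {
  public static void main(String[] args) {
    // The driver creates the SparkContext; the master URL supplied at submit time
    // selects standalone, YARN, or Mesos deployment.
    SparkConf conf = new SparkConf().setAppName("groupby-sketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // RDD backed by an HDFS file (placeholder path).
    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");

    // Key each record by its first tab-separated field.
    JavaPairRDD<String, String> pairs = lines.mapToPair(
        line -> new Tuple2<String, String>(line.split("\t")[0], line));

    // groupByKey() creates a wide dependency: all values for a key are shuffled to
    // the task owning that key -- the network-intensive stage discussed above.
    pairs.groupByKey().saveAsTextFile("hdfs:///user/demo/grouped");

    sc.stop();
  }
}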

Memcached Architecture [Figure: Web frontend servers (Memcached clients) connect over the Internet and high-performance networks to Memcached servers and database servers, each node with CPUs, main memory, SSDs, and HDDs] Three-layer architecture of Web 2.0: Web Servers, Memcached Servers, Database Servers Distributed caching layer allows aggregating spare memory from multiple nodes General purpose; typically used to cache database queries and results of API calls Scalable model, but typical usage is very network intensive 6
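
The caching pattern described above can be sketched as cache-aside logic in the client. The example below assumes the spymemcached Java client (the deck itself evaluates the C libmemcached client); server host names and the database lookup are placeholders.

import java.io.IOException;
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class CacheAsideSketch {
  private static final int TTL_SECONDS = 300;   // cache entries expire after 5 minutes

  public static void main(String[] args) throws IOException {
    // Pool of Memcached servers forming the distributed caching layer (placeholder hosts).
    MemcachedClient cache = new MemcachedClient(
        new InetSocketAddress("memc-server-1", 11211),
        new InetSocketAddress("memc-server-2", 11211));

    String key = "user:42:profile";

    // 1. Try the cache first: one request/response over the network to a Memcached server.
    Object value = cache.get(key);
    if (value == null) {
      // 2. Cache miss: fall back to the database tier (stubbed below), then populate
      //    the cache so later reads for this key are served from remote memory.
      value = loadFromDatabase(key);
      cache.set(key, TTL_SECONDS, value);
    }

    System.out.println("value = " + value);
    cache.shutdown();
  }

  // Stand-in for a query against the back-end database servers.
  private static Object loadFromDatabase(String key) {
    return "profile-for-" + key;
  }
}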

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 7

Drivers for Modern HPC Clusters High End Computing (HEC) is growing dramatically High Performance Computing Big Data Computing Technology Advancement Multi-core/many-core technologies and accelerators Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM) Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) [Images: Tianhe-2, Titan, Stampede, Tianhe-1A] 8

Interconnects and Protocols in the OpenFabrics Stack [Figure: protocol stacks from application/middleware through the interface (Sockets or Verbs), the protocol layer (kernel-space TCP/IP, RSockets, SDP, or user-space RDMA), the adapter, and the switch, for each option: 1/10/40 GigE, IPoIB, 10/40 GigE-TOE, RSockets, SDP, iWARP, RoCE, and native IB] 9

Wide Adoption of RDMA Technology Message Passing Interface (MPI) for HPC Parallel File Systems Lustre GPFS Delivering excellent performance: < 1.0 microsec latency 100 Gbps bandwidth 5-10% CPU utilization Delivering excellent scalability 10

Challenges in Designing Communication and I/O Libraries for Big Data Systems Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Upper level Changes? Programming Models (Sockets) Other RDMA Protocols? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) 11

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols? Current design: Application over Sockets over a 1/10 GigE network. Our approach: Application over the OSU design and the Verbs interface over 10 GigE or InfiniBand. Sockets not designed for high performance: stream semantics often mismatch for upper layers; zero-copy not available for non-blocking sockets 12

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 13

The High-Performance Big Data (HiBD) Project RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) RDMA for Memcached (RDMA-Memcached) OSU HiBD-Benchmarks (OHB) http://hibd.cse.ohio-state.edu User base: 88 organizations from 18 countries More than 2,500 downloads RDMA for Apache HBase and Spark will be available in the near future 14

RDMA for Apache Hadoop 1.x/2.x Distributions High-Performance Design of Hadoop over RDMA-enabled Interconnects High performance design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: 0.9.9/0.9.5 Based on Apache Hadoop 1.2.1/2.5.0 Compliant with Apache Hadoop 1.2.1/2.5.0 APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR) RoCE support with Mellanox adapters Various multi-core platforms Different file systems with disks and SSDs http://hibd.cse.ohio-state.edu 15

RDMA for Memcached Distribution High-Performance Design of Memcached over RDMA-enabled Interconnects High performance design with native InfiniBand and RoCE support at the verbs level for Memcached and libmemcached components Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) Current release: 0.9.1 Based on Memcached 1.4.20 and libmemcached 1.0.18 Compliant with Memcached APIs and applications Tested with Mellanox InfiniBand adapters (DDR, QDR and FDR) RoCE support with Mellanox adapters Various multi-core platforms http://hibd.cse.ohio-state.edu 16

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached Released in OHB 0.7.1 (ohb_memlat) Evaluates the performance of stand-alone Memcached Three different micro-benchmarks SET Micro-benchmark: Micro-benchmark for memcached set operations GET Micro-benchmark: Micro-benchmark for memcached get operations MIX Micro-benchmark: Micro-benchmark for a mix of memcached set/get operations (Read:Write ratio is 90:10) Calculates average latency of Memcached operations Can measure throughput in Transactions Per Second 17
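
The idea behind the GET micro-benchmark (time many small GETs against a single server and report the average) can be pictured as a simple timing loop. The snippet below is a rough Java analogue using the spymemcached client, not the released ohb_memlat code, which builds on libmemcached; the server name, message size, and iteration count are arbitrary placeholders.

import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class GetLatencySketch {
  public static void main(String[] args) throws Exception {
    MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("memc-server-1", 11211));

    final int iterations = 10000;
    byte[] payload = new byte[64];                 // one fixed message size per run
    client.set("ohb:key", 0, payload).get();       // wait until the value is stored

    // Warm-up requests, excluded from the timing.
    for (int i = 0; i < 100; i++) {
      client.get("ohb:key");
    }

    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      client.get("ohb:key");                       // blocking GET: one round trip per iteration
    }
    long elapsedNs = System.nanoTime() - start;

    System.out.printf("average GET latency: %.2f us%n",
        elapsedNs / 1000.0 / iterations);
    client.shutdown();
  }
}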

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 18

Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark 19

Design Overview of HDFS with RDMA [Figure: HDFS applications (write path and others) go through either the Java socket interface (1/10 GigE, IPoIB network) or the OSU design via the Java Native Interface (JNI) and Verbs to RDMA-capable networks (IB, 10GE/iWARP, RoCE)] Design features: RDMA-based HDFS write, RDMA-based HDFS replication, parallel replication support, on-demand connection setup, InfiniBand/RoCE support. Enables high performance RDMA communication, while supporting the traditional socket interface. JNI layer bridges Java-based HDFS with a communication library written in native code 20
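
Because the RDMA-enhanced design sits below the standard HDFS client API, an application writes exactly as it would on stock Hadoop. A minimal write sketch with the org.apache.hadoop.fs.FileSystem API is shown below; the path and sizes are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath; fs.defaultFS names the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/benchmarks/demo/output.dat");    // placeholder path
    byte[] buffer = new byte[4 * 1024 * 1024];               // 4 MB per write call

    // Each write is streamed through the DataNode replication pipeline; in the RDMA-enhanced
    // design the transport underneath this unchanged API is verbs instead of sockets.
    FSDataOutputStream out = fs.create(file, true);           // true = overwrite if present
    for (int i = 0; i < 256; i++) {                           // ~1 GB in total
      out.write(buffer);
    }
    out.close();
    fs.close();
  }
}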

Communication Times in HDFS [Figure: communication time (s) vs. file size (2GB to 10GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR) on a cluster with HDD DataNodes; communication time reduced by 30%] 30% improvement in communication time over IPoIB (QDR); 56% improvement in communication time over 10GigE; similar improvements are obtained for SSD DataNodes. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012. N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014. 21

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede [Figure: aggregated throughput (MBps) and execution time (s) vs. file size (64, 128, 256 GB) for IPoIB (FDR) and OSU-IB (FDR); throughput increased by 64%, execution time reduced by 37%] Cluster with 64 DataNodes (1K cores), single HDD per node: 64% improvement in throughput over IPoIB (FDR) for 256GB file size; 37% improvement in latency over IPoIB (FDR) for 256GB file size 22

Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark 23

Design Overview of MapReduce with RDMA [Figure: MapReduce (Job Tracker, Task Tracker, Map, Reduce) goes through either the Java socket interface (1/10 GigE, IPoIB network) or the OSU design via the Java Native Interface (JNI) and Verbs to RDMA-capable networks (IB, 10GE/iWARP, RoCE)] Design features: RDMA-based shuffle, prefetching and caching of map output, efficient shuffle algorithms, in-memory merge, on-demand shuffle adjustment, advanced overlapping of map/shuffle/merge and of shuffle/merge/reduce, on-demand connection setup, InfiniBand/RoCE support. Enables high performance RDMA communication, while supporting the traditional socket interface. JNI layer bridges Java-based MapReduce with a communication library written in native code 24

Performance Evaluation of Sort and TeraSort [Figure: job execution time (sec) for Sort on an OSU cluster (60/120/240 GB on cluster sizes 16/32/64; IPoIB, UDA-IB, and OSU-IB over QDR) and TeraSort on TACC Stampede (80/160/320 GB on cluster sizes 16/32/64; IPoIB, UDA-IB, and OSU-IB over FDR)] For 240GB Sort on 64 nodes (512 cores): 40% improvement over IPoIB (QDR) with HDD used for HDFS. For 320GB TeraSort on 64 nodes (1K cores): 38% improvement over IPoIB (FDR) with HDD used for HDFS 25

Evaluations using PUMA Workload [Figure: normalized execution time for 10GigE, IPoIB (QDR), and OSU-IB (QDR) on AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB)] 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size; 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size 26

Acceleration Case Studies and In-Depth Performance Evaluation RDMA-based Designs and Performance Evaluation HDFS MapReduce Spark 27

Design Overview of Spark with RDMA Design features: RDMA-based shuffle, SEDA-based plugins, dynamic connection management and sharing, non-blocking and out-of-order data transfer, off-JVM-heap buffer management, InfiniBand/RoCE support. Enables high performance RDMA communication, while supporting the traditional socket interface. JNI layer bridges Scala-based Spark with a communication library written in native code. X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 28

Preliminary Results of Spark-RDMA Design - GroupBy [Figure: GroupBy time (sec) vs. data size for 10GigE, IPoIB, and RDMA; left: 4 HDD nodes with 32 cores, 4-10 GB; right: 8 HDD nodes with 64 cores, 8-20 GB] Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks: 18% improvement over IPoIB (QDR) for 10GB data size. Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks: 20% improvement over IPoIB (QDR) for 20GB data size 29

Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 30

Optimize Hadoop YARN MapReduce over Parallel File Systems [Figure: compute nodes running the App Master, Map, and Reduce tasks with a Lustre client, connected to Lustre metadata servers and object storage servers] HPC Cluster Deployment: hybrid topological solution of Beowulf architecture with separate I/O nodes; lean compute nodes with a light OS, more memory space, and small local storage; sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre. MapReduce over Lustre: either the local disk or Lustre is used as the intermediate data directory 31

Design Overview of Shuffle Strategies for MapReduce over Lustre [Figure: map tasks write intermediate data to the intermediate data directory on Lustre; reduce tasks fetch map output either via Lustre read or via RDMA and perform in-memory merge/sort before the reduce phase] Design features: Two shuffle approaches - Lustre read based shuffle and RDMA based shuffle. A hybrid shuffle algorithm benefits from both approaches: it dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation. In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design. M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015. 32

Performance Improvement of MapReduce over Lustre on TACC-Stampede (local disk is used as the intermediate data directory) [Figure: job execution time (sec) for IPoIB (FDR) vs. OSU-IB (FDR); left: Sort with 300-500 GB data; right: Sort with 20 GB to 640 GB data on cluster sizes 4 to 128] For 500GB Sort on 64 nodes: 44% improvement over IPoIB (FDR). For 640GB Sort on 128 nodes: 48% improvement over IPoIB (FDR). M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014. 33

Case Study - Performance Improvement of MapReduce over Lustre on SDSC-Gordon (Lustre is used as the intermediate data directory) [Figure: job execution time (sec) for IPoIB (QDR) vs. OSU-IB (QDR); left: Sort with 40-80 GB data; right: TeraSort with 40-120 GB data] For 80GB Sort on 8 nodes: 34% improvement over IPoIB (QDR). For 120GB TeraSort on 16 nodes: 25% improvement over IPoIB (QDR) 34

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 35

Enhanced HDFS with In-memory and Heterogeneous Storage [Figure: applications over HHH (Triple-H) with data placement policies, eviction/promotion, and hybrid replication across heterogeneous storage - RAM disk, SSD, HDD, and Lustre] Design features: Two modes - Standalone and Lustre-Integrated. Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre). Eviction/promotion based on data usage pattern. Hybrid replication. Lustre-Integrated mode: Lustre-based fault-tolerance. N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015 36

Case Study - Performance Improvement of Standalone Mode on TACC Stampede [Figure: TestDFSIO total throughput (MBps) for write and read, and RandomWriter execution time (s) at cluster size:data size of 8:30, 16:60, and 32:120, comparing IPoIB (FDR) and OSU-IB (FDR)] For 160GB TestDFSIO on 32 nodes: write throughput 7x improvement over IPoIB (FDR); read throughput 2x improvement over IPoIB (FDR). For 120GB RandomWriter on 32 nodes: 3x improvement over IPoIB (QDR) 37

Case Study - Performance Improvement of Lustre-Integrated Mode on SDSC Gordon [Figure: Sort execution time (s) vs. data size (20, 40, 60 GB) for HDFS-IPoIB (QDR), Lustre-IPoIB (QDR), and OSU-IB (QDR)] For 60GB Sort on 8 nodes: 24% improvement over default HDFS; 54% improvement over Lustre; 33% storage space saving compared to default HDFS. Storage used for 60GB Sort: HDFS-IPoIB (QDR) 360 GB, Lustre-IPoIB (QDR) 120 GB, OSU-IB (QDR) 240 GB 38

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 39

Memcached-RDMA Design [Figure: sockets clients and RDMA clients connect through a master thread to sockets worker threads or verbs worker threads, which operate on shared data (memory slabs, items)] Server and client perform a negotiation protocol; the master thread assigns clients to the appropriate worker thread. Once a client is assigned a verbs worker thread, it can communicate directly and is bound to that thread. All other Memcached data structures are shared among RDMA and sockets worker threads. Native IB-verbs-level design and evaluation with Server: Memcached (http://memcached.org) and Client: libmemcached (http://libmemcached.org). Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD) 40

Memcached Performance (FDR Interconnect) [Figure: Memcached GET latency (us) vs. message size (1 byte to 4K) and throughput (thousands of transactions per second) vs. number of clients (16 to 4080) for OSU-IB (FDR) and IPoIB (FDR); latency reduced by nearly 20X, throughput improved by about 2X] Memcached GET latency, experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR): 4 bytes - OSU-IB: 2.84 us, IPoIB: 75.53 us; 2K bytes - OSU-IB: 4.49 us, IPoIB: 123.42 us. Memcached throughput (4 bytes), 4080 clients: OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec; nearly 2X improvement in throughput 41

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 42

HBase-RDMA Design Overview [Figure: HBase applications go through either the Java socket interface (1/10 GigE, IPoIB networks) or the OSU-IB design via the Java Native Interface (JNI) and IB Verbs to RDMA-capable networks (IB, 10GE/iWARP, RoCE)] JNI layer bridges Java-based HBase with a communication library written in native code. Enables high performance RDMA communication, while supporting the traditional socket interface 43
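
The Get/Put operations measured on the next slide are issued through the standard HBase client API, which the RDMA design leaves unchanged. Below is a minimal sketch against the HBase 0.9x-era Java client; the column family, row key, and values are placeholders (usertable is YCSB's default table name).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml (ZooKeeper quorum, etc.)
    HTable table = new HTable(conf, "usertable");        // YCSB's default table name

    // Put: a single-row write handled by the RegionServer that owns this row key.
    Put put = new Put(Bytes.toBytes("user1000"));
    put.add(Bytes.toBytes("family"), Bytes.toBytes("field0"), Bytes.toBytes("value0"));
    table.put(put);

    // Get: a single-row read over the same client/RegionServer path.
    Get get = new Get(Bytes.toBytes("user1000"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("family"), Bytes.toBytes("field0"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}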

HBase YCSB Read-Write Workload [Figure: read and write latency (us) vs. number of clients (8 to 128) for IPoIB (QDR), 10GigE, and OSU-IB (QDR)] HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients: 2.0 ms; 128 clients: 3.5 ms; 42% improvement over IPoIB for 128 clients. HBase Put latency: 64 clients: 1.9 ms; 128 clients: 3.5 ms; 40% improvement over IPoIB for 128 clients. J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12 44

Presentation Outline Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 45

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Upper level Changes? Programming Models (Sockets) Other RDMA Protocols? Point-to-Point Communication Communication and I/O Library Threaded Models and Synchronization Virtualization I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) 46

Are the Current Benchmarks Sufficient for Big Data Management and Processing? The current benchmarks provide some performance behavior However, they do not provide any information to the designer/developer on: What is happening at the lower layer? Where are the benefits coming from? Which design is leading to benefits or bottlenecks? Which component in the design needs to be changed and what will be its impact? Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer? 47

OSU MPI Micro-Benchmarks (OMB) Suite A comprehensive suite of benchmarks to Compare performance of different MPI libraries on various networks and systems Validate low-level functionalities Provide insights to the underlying MPI-level designs Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth Extended later to MPI-2 one-sided Collectives GPU-aware data movement OpenSHMEM (point-to-point and collectives) UPC Has become an industry standard Extensively used for design/development of MPI libraries, performance comparison of MPI libraries and even in procurement of large-scale systems Available from http://mvapich.cse.ohio-state.edu/benchmarks Available in an integrated manner with MVAPICH2 stack 48

Challenges in Benchmarking of RDMA-based Designs Applications Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Benchmarks Other RDMA Protocols? Current Benchmarks Correlation? Communication and I/O Library Point-to-Point Communication Threaded Models and Synchronization Virtualization No Benchmarks I/O and File Systems QoS Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) 49

Iterative Process Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications Applications Benchmarks Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Applications- Level Benchmarks Programming Models (Sockets) Other RDMA Protocols? Communication and I/O Library Point-to-Point Communication I/O and File Systems Threaded Models and Synchronization QoS Virtualization Fault-Tolerance Micro- Benchmarks Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD and SSD) 50

OSU HiBD Micro-Benchmark (OHB) Suite - HDFS Evaluate the performance of standalone HDFS Five different benchmarks: Sequential Write Latency (SWL), Sequential or Random Read Latency (SRL or RRL), Sequential Write Throughput (SWT), Sequential Read Throughput (SRT), Sequential Read-Write Throughput (SRWT) N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012. [Table: tunable parameters per benchmark - file name, file size, HDFS parameters, number of readers, number of writers, random vs. sequential read, and seek interval (for RRL)] 51
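
As a rough illustration of how a read-latency benchmark such as SRL can be structured, the sketch below times fixed-size sequential reads through the standard HDFS client API and reports the average per-read latency. This is an assumption-laden sketch, not the released OHB code; the file path and read size are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadLatencySketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/benchmarks/ohb/input.dat");    // pre-written test file (placeholder)
    byte[] buffer = new byte[4 * 1024 * 1024];             // nominal read size per operation

    FSDataInputStream in = fs.open(file);
    long totalNs = 0;
    long reads = 0;
    while (true) {
      long start = System.nanoTime();
      int n = in.read(buffer);                             // sequential read from DataNodes
      long elapsed = System.nanoTime() - start;
      if (n <= 0) {
        break;                                             // end of file
      }
      totalNs += elapsed;
      reads++;
    }
    in.close();

    if (reads > 0) {
      System.out.printf("average read latency: %.3f ms over %d reads%n",
          totalNs / 1e6 / reads, reads);
    }
    fs.close();
  }
}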

OSU HiBD Micro-Benchmark (OHB) Suite - RPC Two different micro-benchmarks to evaluate the performance of standalone Hadoop RPC Latency: Single Server, Single Client Throughput: Single Server, Multiple Clients A simple script framework for job launching and resource monitoring Calculates statistics like Min, Max, Average Network configuration, tunable parameters, data type, CPU utilization [Table: components and tunable parameters - lat_client/lat_server: network address, port, data type, min/max message size, number of iterations, handlers, verbose; thr_client/thr_server: the same plus number of clients] X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013. 52

OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce Evaluate the performance of stand-alone MapReduce Does not require or involve HDFS or any other distributed file system Considers various factors that influence the data shuffling phase underlying network configuration, number of map and reduce tasks, intermediate shuffle data pattern, shuffle data size etc. Three different micro-benchmarks based on intermediate shuffle data patterns MR-AVG micro-benchmark: intermediate data is evenly distributed among reduce tasks. MR-RAND micro-benchmark: intermediate data is pseudo-randomly distributed among reduce tasks. MR-SKEW micro-benchmark: intermediate data is unevenly distributed among reduce tasks. D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014). 53

Presentation Outline Overview of Modern Clusters, Interconnects and Protocols Challenges for Accelerating Big Data Processing The High-Performance Big Data (HiBD) Project RDMA-based designs for Apache Hadoop and Spark Case studies with HDFS, MapReduce, and Spark RDMA-based MapReduce on HPC Clusters with Lustre Enhanced HDFS with In-memory and Heterogeneous Storage RDMA-based designs for Memcached and HBase RDMA-based Memcached Case study with OLDP RDMA-based HBase Challenges in Designing Benchmarks for Big Data Processing OSU HiBD Benchmarks Conclusion and Q&A 54

Future Plans of OSU High Performance Big Data Project Upcoming Releases of RDMA-enhanced Packages will support Spark, HBase Plugin-based designs Lustre-based designs HHH-based HDFS designs Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support HDFS MapReduce RPC Exploration of other components (Threading models, QoS, Virtualization, Accelerators, etc.) Advanced designs with upper-level changes and optimizations 55

Concluding Remarks Presented an overview of Big Data processing middleware Provided an overview of cluster technologies Discussed challenges in accelerating Big Data middleware Presented initial designs to take advantage of InfiniBand/RDMA for Hadoop, Spark, Memcached, and HBase Presented challenges in designing benchmarks Results are promising Many other open issues need to be solved Will enable the Big Data processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner 56

Personnel Acknowledgments Current Students A. Awan (Ph.D.) M. Li (Ph.D.) A. Bhat (M.S.) M. Rahman (Ph.D.) S. Chakraborthy (Ph.D.) D. Shankar (Ph.D.) C.-H. Chu (Ph.D.) A. Venkatesh (Ph.D.) N. Islam (Ph.D.) J. Zhang (Ph.D.) Past Students P. Balaji (Ph.D.) W. Huang (Ph.D.) D. Buntinas (Ph.D.) W. Jiang (M.S.) S. Bhagvat (M.S.) J. Jose (Ph.D.) L. Chai (Ph.D.) S. Kini (M.S.) B. Chandrasekharan (M.S.) M. Koop (Ph.D.) N. Dandapanthula (M.S.) R. Kumar (M.S.) V. Dhanraj (M.S.) S. Krishnamoorthy (M.S.) T. Gangadharappa (M.S.) K. Kandalla (Ph.D.) K. Gopalakrishnan (M.S.) P. Lai (M.S.) Past Post-Docs H. Wang X. Besseron H.-W. Jin J. Liu (Ph.D.) E. Mancini S. Marcarelli J. Vienne M. Luo Current Senior Research Associates K. Hamidouche H. Subramoni X. Lu Current Post-Doc Current Programmer J. Lin J. Perkins Current Research Specialist M. Arnold M. Luo (Ph.D.) G. Santhanaraman (Ph.D.) A. Mamidala (Ph.D.) A. Singh (Ph.D.) G. Marsh (M.S.) J. Sridhar (M.S.) V. Meshram (M.S.) S. Sur (Ph.D.) S. Naravula (Ph.D.) H. Subramoni (Ph.D.) R. Noronha (Ph.D.) K. Vaidyanathan (Ph.D.) X. Ouyang (Ph.D.) A. Vishnu (Ph.D.) S. Pai (M.S.) J. Wu (Ph.D.) S. Potluri (Ph.D.) W. Yu (Ph.D.) R. Rajachandrasekar (Ph.D.) Past Research Scientist Past Programmers S. Sur D. Bureddy 57

Thank You! panda@cse.ohio-state.edu Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ The MVAPICH2/MVAPICH2-X Project http://mvapich.cse.ohio-state.edu/ The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/ 58