Hadoop Optimizations for BigData Analytics
Weikuan Yu, Auburn University
Outline (WBDB, Oct 2012)
- Background
- Network-Levitated Merge
- JVM-Bypass Shuffling
- Fair Completion Scheduler
Emerging Demand for BigData Analytics
- Big demand from many organizations in various domains for:
  - Scalable computing power without worrying about system maintenance
  - Ubiquitously accessible computing and storage resources
  - Low-cost, highly reliable, trusted computing infrastructure
- Commercial companies are gearing up resources for BigData
MapReduce
- A simple data processing model for big data, designed for commodity off-the-shelf hardware
- Strong merits for big data analytics:
  - Scalability: increase throughput by increasing the number of nodes
  - Fault tolerance: quick, low-cost recovery from task failures
- Hadoop, an open-source implementation of MapReduce, is widely deployed by many big data companies: AOL, Baidu, eBay, Facebook, IBM, NY Times, Yahoo!
High-Level Overview of Hadoop
- HDFS and the MapReduce framework
- Data processing with MapTasks and ReduceTasks
- Three main steps of data movement: (1) MapTasks read input from HDFS, (2) intermediate data is shuffled to ReduceTasks, (3) ReduceTasks write output back to HDFS
- Intermediate data shuffling in MapReduce is time-consuming
[Figure: applications submit jobs to the JobTracker; TaskTrackers run MapTasks and ReduceTasks; intermediate data is shuffled between them, with HDFS at both ends]
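The three data-movement steps above can be illustrated with a toy, purely in-memory MapReduce (a sketch for intuition only, not Hadoop's actual API; all function names here are illustrative):

```python
from collections import defaultdict

def map_phase(records):
    # Step 1 analogue: MapTasks consume input records and emit (key, value) pairs
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, num_reducers=2):
    # Step 2 analogue: partition intermediate pairs by key hash,
    # grouping each reducer's values by key
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for k, v in pairs:
        partitions[hash(k) % num_reducers][k].append(v)
    return partitions

def reduce_phase(partitions):
    # Step 3 analogue: ReduceTasks aggregate the values for each key
    out = {}
    for part in partitions:
        for k, vs in part.items():
            out[k] = sum(vs)
    return out

counts = reduce_phase(shuffle(map_phase(["big data big"])))
```

In real Hadoop the shuffle step crosses the network and the disks of every node, which is exactly why it dominates job time.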
Motivation for Network-Levitated Merge
Problem 1: Serialization between the shuffle/merge and reduce phases: reduce cannot start until the merge of all map output files (MOFs) completes.
[Figure: task timeline showing map, shuffle, merge, and reduce phases; shuffle starts at the first MOF, and reduce is serialized after merge]
Repetitive Merges and Disk Access
- Hadoop data spilling is controlled through parameters that limit the number of outstanding files
- Example with io.sort.factor=3: once more than 3 segments accumulate, Hadoop merges 3 of them into one, inserts the result back, and merges again as new segments arrive, causing repetitive merges and extra disk access
[Figure: four stages of repeated merging: (1) merge, (2) insert, (3) merge, (4) to merge soon]
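The repetitive-merge behavior can be sketched as follows (a simplification of Hadoop's spill handling, assuming all segments are available up front; in Hadoop each intermediate pass also re-reads and re-writes data on disk, which is the cost this section targets):

```python
import heapq

def multipass_merge(segments, factor=3):
    """Simulate Hadoop-style repetitive merging: while more than `factor`
    sorted segments exist, merge the `factor` smallest into one and
    insert the result back as a new segment (one extra pass each time)."""
    segs = [sorted(s) for s in segments]
    passes = 0
    while len(segs) > factor:
        segs.sort(key=len)                       # merge the smallest segments first
        merged = list(heapq.merge(*segs[:factor]))
        segs = segs[factor:] + [merged]          # insert merged result back
        passes += 1
    return list(heapq.merge(*segs)), passes

data = [[5, 1], [2, 9], [3], [8, 4], [7, 6]]
result, passes = multipass_merge(data, factor=3)
# 5 segments with io.sort.factor=3 force an intermediate pass before the final merge
```

Every intermediate pass re-touches records that were already sorted, so a small io.sort.factor multiplies disk traffic.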
Hadoop Acceleration (Hadoop-A)
- Pipelined shuffle, merge, and reduce
- Network-levitated data merge
[Figure: Hadoop-A architecture. The Java side keeps the stock JobTracker, TaskTrackers, MapTasks, and ReduceTasks; C++ plugins add a MOFSupplier (data engine + RDMA server) and a NetMerger (RDMA client, fetch manager, merging thread, merge manager) that exchange merged data over RDMA interconnects]
Pipelined Data Shuffle, Merge, and Reduce
[Figure: task timeline. After the first MOF, segment headers are fetched and the priority queue (PQ) is set up; shuffle, merge, and reduce then proceed as a pipeline until the last MOF]
Network-Levitated Merge Algorithm
[Figure: four stages over remote segments S1, S2, S3:
(a) Fetching headers: only the leading <key, value> record of each segment is fetched
(b) Priority queue setup: segments are ordered by their leading keys
(c) Concurrent fetching and merging: records are merged from the segment heads (<k1,v1><k2,v2><k3,v3>...) while later records are fetched in the background
(d) Towards completion: merging continues until all segments are exhausted]
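The four stages can be condensed into a short sketch (an illustration of the idea, not the Hadoop-A implementation: here an iterator's `next` stands in for fetching a segment's next record over the network):

```python
import heapq

def levitated_merge(segments):
    """Sketch of a network-levitated merge: build a priority queue from
    only the first record (header) of each remote segment, then repeatedly
    pop the smallest head and fetch that segment's next record on demand,
    so merging overlaps with fetching and never spills to disk."""
    iters = [iter(sorted(s)) for s in segments]
    heap = []
    for i, it in enumerate(iters):       # stages (a)+(b): fetch headers, set up PQ
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)
    merged = []
    while heap:                          # stages (c)+(d): concurrent fetch & merge
        key, i = heapq.heappop(heap)
        merged.append(key)
        nxt = next(iters[i], None)       # "fetch" the next record of segment i
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return merged
```

Because only segment heads are held locally, the memory footprint stays small no matter how large the segments are.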
Job Progression with Network-Levitated Merge
- Hadoop-A speeds up job execution time by more than 47%
- Both MapTasks and ReduceTasks are improved
[Figure: (a) map progress and (b) reduce progress (%) of TeraSort over time (sec), comparing Hadoop-A, Hadoop on IPoIB, and Hadoop on GigE]
Breakdown of ReduceTask Execution Time
- Significantly reduced the execution time of ReduceTasks
- Most of the gain comes from reduced shuffle/merge time: an improvement of 2.5x
- The time to reduce data also improves, by 15%

Time (sec):
Category     | PQ-Setup + Shuffle/Merge | Reduce or Merge/Reduce
Hadoop-GigE  | 1150.01 (65.0%)          | 599.97 (35.0%)
Hadoop-IPoIB | 1148.26 (65.9%)          | 597.09 (34.1%)
Hadoop-A     | 460.01 (47.4%)           | 509.81 (52.6%)
JVM-Dependent Intermediate Data Shuffling
- Stock Hadoop shuffling heavily relies on Java: ReduceTask MOFCopiers fetch MOFs via HTTP GET from HttpServlets inside the TaskTrackers, over TCP/IP only
[Figure: MapTasks write MOFs; each TaskTracker serves them through an HttpServlet with staging buffers; ReduceTasks fetch with MOFCopiers, then sort/merge and reduce into HDFS]
JVM-Bypass Shuffling (JBS)
- JBS removes the JVM from the critical path of intermediate data shuffling
- JBS is a portable library supporting both TCP/IP and RDMA protocols
[Figure: MapTasks and ReduceTasks hand off shuffling to C-based MOFSupplier and NetMerger components, bypassing the Java HTTP servlet path; transport runs over RDMA verbs or TCP/IP on InfiniBand/Ethernet]
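JBS itself is a native C library; the following Python socket sketch only illustrates the core idea of moving MOF bytes over a plain stream socket, with no HTTP framing (and no JVM) in the transfer path. The function names and the payload format are invented for illustration:

```python
import socket
import threading

def serve_mof(data, host="127.0.0.1"):
    # MOFSupplier-style sketch: serve map-output bytes over a raw TCP
    # socket on an ephemeral port; returns the port number
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)
    def run():
        conn, _ = srv.accept()
        conn.sendall(data)       # stream the MOF bytes directly
        conn.close()
        srv.close()
    threading.Thread(target=run, daemon=True).start()
    return srv.getsockname()[1]

def fetch_mof(port, host="127.0.0.1"):
    # NetMerger-style sketch: pull the bytes with a plain recv loop,
    # no HTTP GET request/response parsing involved
    cli = socket.create_connection((host, port))
    chunks = []
    while True:
        buf = cli.recv(4096)
        if not buf:
            break
        chunks.append(buf)
    cli.close()
    return b"".join(chunks)

port = serve_mof(b"k1:v1;k2:v2")
payload = fetch_mof(port)
```

The real library additionally swaps this socket transport for RDMA verbs when InfiniBand is available, which Python sockets cannot express.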
Benefits of JBS: 1/10 Gigabit Ethernet
- JBS is effective for intermediate data of different sizes (using the TeraSort benchmark, where intermediate data size equals input data size)
- JBS reduces execution time by 20.9% on average on 1GigE and 19.3% on average on 10GigE
[Figure: TeraSort job execution time (sec) vs. input data size (16 to 256 GB), Hadoop vs. JBS, on (a) 1 Gigabit and (b) 10 Gigabit Ethernet]
Benefits of JBS: InfiniBand Cluster
- JBS on IPoIB outperforms Hadoop on IPoIB and on SDP by 14.1% and 14.8%, respectively
- Hadoop performs similarly when using IPoIB or SDP
[Figure: TeraSort job execution time (sec) vs. input data size (16 to 256 GB): Hadoop on IPoIB, Hadoop on SDP, JBS on IPoIB]
Hadoop Fair Scheduler
- The scheduler assigns tasks to the TaskTrackers
- Tasks occupy slots until completion or failure
[Figure: Gantt chart of map slots (Slot-M1..M5) and reduce slots (Slot-R1..R3) over time for jobs J-1, J-2, J-3 arriving at times 1, 2, 3; long shuffle/reduce phases hold reduce slots and delay later jobs]
Fair Completion Scheduler
- Prioritize ReduceTasks of the job with the shortest remaining map phase
- When remaining map phases are equal, prioritize ReduceTasks of the job with the least remaining reduce data
- Track the slowdown of preempted ReduceTasks, to prevent large jobs from being preempted for too long
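The first two prioritization rules reduce to a two-level ordering, sketched below (the job records and field names are illustrative, not the scheduler's actual data structures; the slowdown-tracking rule is omitted):

```python
def pick_next_reduce(jobs):
    """Sketch of the Fair Completion Scheduler's ordering: choose the
    job with the shortest remaining map phase, breaking ties by the
    least remaining reduce data."""
    return min(jobs, key=lambda j: (j["remaining_map_time"],
                                    j["remaining_reduce_data"]))

jobs = [
    {"id": "J-1", "remaining_map_time": 40, "remaining_reduce_data": 9},
    {"id": "J-2", "remaining_map_time": 10, "remaining_reduce_data": 5},
    {"id": "J-3", "remaining_map_time": 10, "remaining_reduce_data": 2},
]
# J-2 and J-3 tie on remaining map time; J-3 has less reduce data left
choice = pick_next_reduce(jobs)
```

Favoring jobs that are closest to finishing lets small jobs release reduce slots quickly instead of waiting behind long-running jobs.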
Average ReduceTask Waiting Time
- ReduceTasks in small jobs are significantly sped up
[Figure: average ReduceTask waiting time (sec, log scale up to 10,000) for job groups 1 through 10, comparing FCS and HFS; small-job groups wait far less under FCS]
Conclusions
- Examined the design and architecture of the Hadoop MapReduce framework and revealed critical issues in the existing implementation
- Designed and implemented Hadoop-A, an extensible acceleration framework that addresses these issues
- Provided JVM-Bypass Shuffling to avoid JVM overhead, as a portable library that runs on both TCP/IP and RDMA protocols
- Designed and implemented the Fair Completion Scheduler for fast job completion and job fairness
Sponsors of our research