Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University http://www.cs.duke.edu/starfish
Practitioners of Big Data Analytics
  Practitioners: Google, Yahoo!, Facebook, eBay, physicists, biologists, economists, journalists, systems researchers
  Workloads: Counts & Aggregates; Statistical analysis, Linear Algebra; Rollups & Drilldowns; Machine learning; Text / Images / Video / Graphs
  [Figure: practitioners placed along two axes, Data Size and Workload Complexity]
6/29/2011 Starfish 2
MapReduce/Hadoop Ecosystem
  [Figure: stack of Java / Ruby / Python Client, Pig, Hive, Jaql, Oozie, Elastic MapReduce, Hadoop MapReduce Execution Engine, Distributed File System, HBase]
MADDER Principles of Big Data Analytics: Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust
  Magnetic: Easy to get data into the system
  Agile: Make change (data/requirements) easy
  Deep: Support the full spectrum of analytics. Write MapReduce programs in Java / Python / R or use interfaces like Pig / Jaql
  Data-lifecycle-aware: [Figure: data cycle at LinkedIn]
  Elastic: Adapt resources/costs to actual workloads
  Robust: Graceful degradation under undesirable events
Ease-of-Use vs. Out-of-the-Box Performance
  Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust: ease of use
  Hard to get good performance out of the box:
    Data can be opaque until run-time
    Programs are a different beast from SQL
    Policies are nontrivial
Starfish: MADDER + Self-Tuning
  Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust: ease of use
  Get good performance automatically:
    Profile
    Recursive random search
    Dynamic instrumentation
    Mix of models & simulation
    Relative what-if calls
Starfish: MADDER + Self-Tuning
  Goal: Provide good performance automatically
  [Figure: stack of Java Client, Pig, Hive, Oozie, Elastic MR; Analytics System (Starfish); Hadoop MapReduce Execution Engine; Distributed File System]
What are the Tuning Problems?
  Job-level MapReduce configuration
  Cluster sizing
  Data layout tuning
  Workflow optimization
  Workload management
Starfish Architecture
  Workload-level tuning: Workload Optimizer, Elastisizer
  Workflow-level tuning: Workflow Optimizer
  Job-level tuning: Job Optimizer, Profiler, Sampler
  What-if Engine
  Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.
MapReduce Job Execution
  job j = <program p, data d, resources r, configuration c>
  Run this program as a MapReduce job
  [Figure: Map function and Reduce function applied to the Input Splits]
MapReduce Job Execution
  job j = <program p, data d, resources r, configuration c>
  [Figure: Input Splits processed in Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2]
  How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
Optimizing MapReduce Job Execution
  job j = <program p, data d, resources r, configuration c>
  The space of configuration choices includes settings for:
    Number of map tasks
    Number of reduce tasks
    Partitioning of map outputs to reduce tasks
    Memory allocation to task-level buffers
    Multiphase external sorting in the tasks
    Whether output data from tasks should be compressed
    Whether the combine function should be used
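As a minimal sketch, the job tuple and one point in its configuration space can be written as a small Python record. The class and method names are hypothetical, the parameter names are Hadoop 0.20-era settings chosen for illustration, and every value (including the file paths) is an assumption rather than a recommendation.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict


@dataclass(frozen=True)
class Job:
    """Hypothetical record for job j = <program p, data d, resources r, configuration c>."""
    program: str
    data: str
    resources: Dict[str, Any]
    configuration: Dict[str, Any] = field(default_factory=dict)

    def with_configuration(self, overrides: Dict[str, Any]) -> "Job":
        """Return a copy of the job with some configuration settings changed."""
        merged = {**self.configuration, **overrides}
        return replace(self, configuration=merged)


# Illustrative values only; paths and settings are assumptions, not measurements.
job = Job(
    program="wordcount.jar",
    data="hdfs:///data/wikipedia",
    resources={"nodes": 16, "map_slots_per_node": 2, "reduce_slots_per_node": 2},
    configuration={
        "mapred.reduce.tasks": 20,            # number of reduce tasks
        "io.sort.mb": 100,                    # map-side sort buffer size (MB)
        "io.sort.factor": 10,                 # streams merged at once while sorting
        "mapred.compress.map.output": False,  # compress map output?
    },
)

# Same program, data, and resources; only the configuration differs.
variant = job.with_configuration({"mapred.reduce.tasks": 40, "io.sort.mb": 200})
print(variant.configuration)
```

Keeping the record immutable means each candidate configuration is a separate job description, which is convenient when many variants of the same program are compared.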
Optimizing MapReduce Job Execution
  190+ parameters in Hadoop
  [Figure: 2-dim projection of 13-dim surface]
Case for Automated Hadoop Tuning (from wiki.apache.org/pig/pigjournal)
  Hadoop has many configuration parameters that can affect the latency and scalability of a job. For different types of jobs, different configurations will yield optimal results. For example, a job with no memory-intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map.
  Adding this feature will greatly increase the utility of Pig for Hadoop users, as it will free them from needing to understand Hadoop well enough to tune it themselves for their particular jobs.
Starfish's Core Approach to Tuning
  $\mathit{perf} = F(p, d, r, c)$ for program p, data d, resources r, configuration c
  $c_{opt} = \arg\min_{c \in S} F(p, d, r, c)$
  Challenges: p is an arbitrary MapReduce program; c is high-dimensional
  Profiler: Runs MapReduce jobs to collect job profiles (concise execution summary)
  What-if Engine: Given the profile of j = <p, d, r, c>, estimates the virtual profile for job j' = <p, d, r, c'>
  Optimizer(s): Enumerates and searches through the optimization space efficiently
Profile of a MapReduce Job
  Concise representation of program execution
  Records information at the level of task phases
  [Figure: splits 0 to 3 processed by four map tasks in Two Map Waves; two reduce tasks writing out 0 and out 1 in One Reduce Wave]
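The wave counts in the figure follow from simple arithmetic. The sketch below assumes that all tasks of one kind take about the same time and that free slots are filled greedily; the function name and the slot counts are illustrative assumptions.

```python
import math


def num_waves(num_tasks: int, total_slots: int) -> int:
    """Waves needed when at most total_slots tasks can run concurrently."""
    if num_tasks < 0 or total_slots <= 0:
        raise ValueError("num_tasks must be >= 0 and total_slots must be > 0")
    return math.ceil(num_tasks / total_slots)


# The figure shows 4 map tasks in two waves and 2 reduce tasks in one wave,
# which is what 2 map slots and 2 reduce slots would give.
print(num_waves(num_tasks=4, total_slots=2))  # 2
print(num_waves(num_tasks=2, total_slots=2))  # 1
```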
Profile of a MapReduce Job
  Concise representation of program execution
  Records information at the level of task phases
  Map Task Phases: Read, Map, Collect, Spill, Merge
  [Figure: split 0 flowing through a map task. Labels: DFS, Map func, Serialize, Partition, Memory Buffer, Sort, [Combine], [Compress], Merge]
Profile of a MapReduce Job
  Concise representation of program execution
  Records information at the level of task phases
  A profile contains:
    Dataflow: Amount of data flowing through tasks & task phases
    Dataflow Statistics: Statistical info about the dataflow
    Cost: Execution time at the level of tasks & task phases
    Cost Statistics: Statistical info about the costs
Fields in a Job Profile
  Dataflow: Map output bytes; Number of map-side spills; Number of merge rounds; Number of records in buffer per spill
  Cost: Read phase time in the map task; Map phase time in the map task; Collect phase time in the map task; Spill phase time in the map task
  Dataflow Statistics: Map func's selectivity (output / input); Map output compression ratio; Combiner's selectivity; Size of records (keys and values)
  Cost Statistics: I/O cost for reading from local disk per byte; I/O cost for writing to HDFS per byte; CPU cost for executing Map func per record; CPU cost for uncompressing the input per byte
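One way to picture these four groups is as a record whose statistics are derived from the measured dataflow and cost fields. This is a hypothetical sketch: the class, the field keys (including `map_input_bytes`, which the slide does not list), the per-byte read cost derived from the read phase, and all numbers are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobProfile:
    """Hypothetical container mirroring the four groups of profile fields."""
    dataflow: Dict[str, float] = field(default_factory=dict)
    cost: Dict[str, float] = field(default_factory=dict)
    dataflow_statistics: Dict[str, float] = field(default_factory=dict)
    cost_statistics: Dict[str, float] = field(default_factory=dict)

    def derive_statistics(self) -> None:
        """Derive two example statistics from the measured fields."""
        d, c = self.dataflow, self.cost
        # Map func's selectivity (output / input), here in terms of bytes.
        self.dataflow_statistics["map_size_selectivity"] = (
            d["map_output_bytes"] / d["map_input_bytes"]
        )
        # Per-byte read cost = time spent in the read phase / bytes read.
        self.cost_statistics["read_cost_per_byte"] = (
            c["read_phase_time_s"] / d["map_input_bytes"]
        )


# Made-up measurements for a single map task.
profile = JobProfile(
    dataflow={"map_input_bytes": 64e6, "map_output_bytes": 96e6, "num_spills": 2},
    cost={"read_phase_time_s": 3.2, "map_phase_time_s": 10.0},
)
profile.derive_statistics()
print(profile.dataflow_statistics, profile.cost_statistics)
```

Separating the raw fields from the derived statistics reflects the split on the slide: dataflow and cost describe one particular run, while the statistics are ratios that can plausibly carry over to other runs of the same program.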
Generating Profiles
  Concise representation of program execution
  Records information at the level of task phases
  Generated by the Profiler through measurement or by the What-if Engine through estimation
Generating Profiles by Measurement
  Goals:
    Have zero overhead when off, low overhead when on
    Require no modifications to Hadoop
    Support unmodified MapReduce programs written in Java/Python/Ruby/C++
  Approach: Dynamic (on-demand) instrumentation
    Event-condition-action rules are specified (in Java)
    Leads to run-time instrumentation of Hadoop internals
    Monitors task phases of MapReduce job execution
    We currently use BTrace (Hadoop internals are in Java)
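The event-condition-action idea can be illustrated with a toy rule engine. This is not BTrace and not the Starfish Profiler: the classes, the event name `spill.exit`, and the context keys are all hypothetical, and the `enabled` flag only mimics on-demand behavior (real dynamic instrumentation avoids even this check when profiling is off).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


@dataclass
class Rule:
    event: str                            # which event the rule listens for
    condition: Callable[[Context], bool]  # when the rule applies
    action: Callable[[Context], None]     # what to record


class Instrumenter:
    def __init__(self) -> None:
        self.rules: List[Rule] = []
        self.enabled = False  # profiling is off by default

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def emit(self, event: str, ctx: Context) -> None:
        """Fire the actions of all matching rules, but only while enabled."""
        if not self.enabled:
            return
        for rule in self.rules:
            if rule.event == event and rule.condition(ctx):
                rule.action(ctx)


raw_data: List[Context] = []
inst = Instrumenter()
inst.add_rule(Rule(
    event="spill.exit",
    condition=lambda ctx: ctx["task_type"] == "map",
    action=lambda ctx: raw_data.append({"phase": "spill", "ms": ctx["duration_ms"]}),
))

inst.emit("spill.exit", {"task_type": "map", "duration_ms": 120})    # ignored: profiling off
inst.enabled = True
inst.emit("spill.exit", {"task_type": "map", "duration_ms": 95})     # recorded
inst.emit("spill.exit", {"task_type": "reduce", "duration_ms": 40})  # condition fails
print(raw_data)
```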
Generating Profiles by Measurement
  [Figure: with profiling enabled, ECA rules are applied to the JVMs running the map and reduce tasks; the raw data from each JVM yields a map profile and a reduce profile, which form the job profile]
  Use of Sampling:
    Profile fewer tasks
    Execute fewer tasks
  JVM = Java Virtual Machine, ECA = Event-Condition-Action
Overhead of Profile Measurement
  [Figure: Word Co-occurrence job running on a 16-node cluster of c1.medium EC2 nodes on the Amazon Cloud]
What-if Engine
  Input: Job Profile <p, d1, r1, c1>
  Input: Properties of a (possibly hypothetical) job: Input Data Properties <d2>, Cluster Resources <r2>, Configuration Settings <c2>
  Components: Job Oracle, Task Scheduler Simulator
  Output: Virtual Job Profile for <p, d2, r2, c2>
What-if Questions Starfish Can Answer
  How will job j's execution time change if the number of reduce tasks is changed from 20 to 40?
  What will the change in I/O be if map output compression is turned on, but the input data size increases by 40%?
  What will job j's new execution time be if 5 more nodes are added to the cluster, bringing the total to 20?
  How will workload execution time & dollar cost change if we move the production cluster from m1.xlarge nodes to c1.medium nodes on Amazon EC2?
Virtual Profile Estimation
  Given the profile for job j = <p, d1, r1, c1>
  Estimate the profile for job j' = <p, d2, r2, c2>
  Inputs: profile for j (Dataflow Statistics, Cost Statistics, Dataflow, Cost), Input Data d2, Resources r2, Configuration c2
  (Virtual) profile for j':
    Dataflow Statistics: Cardinality Models
    Cost Statistics: Relative Black-box Models
    Dataflow: White-box Models
    Cost: White-box Models
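To make the white-box step concrete, the sketch below estimates a few dataflow and cost fields of one map task from given statistics and a hypothetical configuration. It is a deliberately simplified model, not the one Starfish uses: the function name, dictionary keys, the 0.8 spill threshold, and all numbers are assumptions, and per-record buffer metadata is ignored.

```python
import math
from typing import Dict


def estimate_map_task(input_bytes: float,
                      dataflow_stats: Dict[str, float],
                      cost_stats: Dict[str, float],
                      sort_buffer_mb: float,
                      spill_percent: float = 0.8) -> Dict[str, float]:
    """Estimate virtual dataflow and cost fields for one map task."""
    # Dataflow: output size follows from the map function's selectivity.
    map_output_bytes = input_bytes * dataflow_stats["map_size_selectivity"]

    # Dataflow: the buffer spills to disk each time it fills to the threshold.
    spill_threshold = sort_buffer_mb * 1024 * 1024 * spill_percent
    num_spills = max(1, math.ceil(map_output_bytes / spill_threshold))

    # Cost: phase time = amount of data x per-byte cost statistic.
    read_time_s = input_bytes * cost_stats["read_cost_per_byte"]
    spill_time_s = map_output_bytes * cost_stats["local_write_cost_per_byte"]

    return {
        "map_output_bytes": map_output_bytes,
        "num_spills": num_spills,
        "read_time_s": read_time_s,
        "spill_time_s": spill_time_s,
    }


# Made-up statistics, as if taken from the profile of an earlier run.
stats = {"map_size_selectivity": 2.0}
costs = {"read_cost_per_byte": 5e-8, "local_write_cost_per_byte": 2e-8}
split_bytes = 64 * 1024 * 1024

small_buffer = estimate_map_task(split_bytes, stats, costs, sort_buffer_mb=100)
large_buffer = estimate_map_task(split_bytes, stats, costs, sort_buffer_mb=200)
print(small_buffer["num_spills"], large_buffer["num_spills"])
```

The point of the example is the direction of reasoning: the statistics stay fixed, while the configuration (here the sort buffer size) changes the derived dataflow without running the job again.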
Job Optimizer
  Input: Job Profile <p, d1, r1, c1>, Input Data Properties <d2>, Cluster Resources <r2>
  Components: Subspace Enumeration, Recursive Random Search (issues What-if calls)
  Output: Best Configuration Settings <c_opt> for <p, d2, r2>
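The search step can be sketched as a simplified recursive random search: sample configurations at random, then repeatedly shrink the sampling region around the best one found. This omits the restart and exploration phases of the full technique and treats every parameter as a continuous value. The cost function is a toy stand-in for What-if calls; its shape, the parameter names, and the bounds are assumptions.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def recursive_random_search(cost: Callable[[Config], float],
                            bounds: Bounds,
                            start: Config,
                            rounds: int = 5,
                            samples: int = 30,
                            shrink: float = 0.5,
                            seed: int = 0) -> Tuple[Config, float]:
    """Sample randomly, then re-sample in a shrinking region around the best point."""
    rng = random.Random(seed)
    best, best_cost = dict(start), cost(start)
    region = dict(bounds)
    for _ in range(rounds):
        for _ in range(samples):
            candidate = {k: rng.uniform(lo, hi) for k, (lo, hi) in region.items()}
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
        # Shrink the region around the best point, clipped to the original bounds.
        new_region = {}
        for k, (lo, hi) in region.items():
            half_width = (hi - lo) * shrink / 2
            global_lo, global_hi = bounds[k]
            new_region[k] = (max(global_lo, best[k] - half_width),
                             min(global_hi, best[k] + half_width))
        region = new_region
    return best, best_cost


def toy_cost(c: Config) -> float:
    """Hypothetical stand-in for a What-if call; minimum cost 10 at (40, 200)."""
    return 10 + (c["reducers"] - 40) ** 2 / 100 + (c["sort_mb"] - 200) ** 2 / 1000


bounds = {"reducers": (1.0, 100.0), "sort_mb": (50.0, 300.0)}
default = {"reducers": 1.0, "sort_mb": 100.0}
best, best_cost = recursive_random_search(toy_cost, bounds, start=default)
print(best, best_cost)
```

Because the default configuration is used as the starting point, the result is never worse than the default under the given cost function.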
Experimental Setup
  Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type
  2 map slots & 2 reduce slots
  300MB max memory per task
  Cost-Based Job Optimizer vs. Rule-Based Optimizer

  Abbr.  MapReduce Program    Domain              Dataset
  CO     Word Co-occurrence   NLP                 30GB, Wikipedia
  WC     WordCount            Text Analytics      30GB, Wikipedia
  TS     TeraSort             Business Analytics  30GB, Teragen
  LG     LinkGraph            Graph Processing    10GB, Wikipedia (compressed)
  JO     Join                 Business Analytics  30GB, TPC-H
Job Optimizer Evaluation
  [Figure: Speedup (y-axis, 0 to 16) per MapReduce Job (CO, WC, TS, LG, JO) for Default Settings, Rule-Based Optimizer, and Cost-Based Just-in-Time Job Optimizer]
Insights from Profiles for WordCount
  A: Rule-Based
    Few, large spills
    Combiner gave high data reduction
    Combiner made Mappers CPU bound
  B: Cost-Based (2x faster)
    Many, small spills
    Combiner gave smaller data reduction
    Better resource utilization in Mappers
Estimates from the What-if Engine
  [Figure: True surface vs. Estimated surface]
Workflow Optimization Space
  [Figure: tree of the optimization space. Labels: Logical, Physical, Vertical Packing, Partition Function Selection, Job-level Configuration, Dataset-level Configuration, Intra-job, Inter-job]
Optimizations on TF-IDF Workflow
  Legend: D = docname, f = frequency, W = word, c = count, t = TF-IDF
  Original workflow: D0 <{D},{W}> -> J1 (M1, R1) -> D1 <{D, W},{f}> -> J2 (M2, R2) -> D2 <{D},{W, f, c}> -> J3, J4 (M3, R3, M4) -> D4 <{W},{D, t}>
  Logical Optimization: D0 <{D},{W}> -> J1, J2 (M1, R1, M2, R2) -> D2 <{D},{W, f, c}> -> J3, J4 (M3, R3, M4) -> D4 <{W},{D, t}>; Partition: {D}, Sort: {D, W}
  Physical Optimization: Reducers = 50, Compress = off, Memory = 400; Reducers = 20, Compress = on, Memory = 300
Multi-objective Cluster Provisioning
  Cloud enables users to provision Hadoop clusters in minutes
  Can avoid the man (system administrator) in the middle
  Pay only for what resources are used
  [Figure: Running Time (min) and Cost ($) by EC2 Instance Type: m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge]
Multi-objective Cluster Provisioning
  [Figure: Actual vs. Predicted Running Time (min) and Cost ($) by EC2 Instance Type for Target Cluster: m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge. Instance Type for Source Cluster: m1.large]
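Given predicted running times per instance type, trading off time against cost reduces to a small calculation. The sketch below assumes billing by the full instance-hour; the instance names, prices, and running times are hypothetical placeholders, not the values from the figure, and the function names are illustrative.

```python
import math
from typing import Dict, Optional, Tuple


def cluster_cost(num_nodes: int, price_per_hour: float, running_time_min: float) -> float:
    """Dollar cost of a run, assuming every started hour is billed in full."""
    billed_hours = math.ceil(running_time_min / 60)
    return num_nodes * price_per_hour * billed_hours


def cheapest_within_deadline(predictions: Dict[str, Tuple[float, float]],
                             num_nodes: int,
                             deadline_min: float) -> Optional[str]:
    """Pick the cheapest instance type whose predicted running time meets the deadline."""
    costs = {
        name: cluster_cost(num_nodes, price, minutes)
        for name, (price, minutes) in predictions.items()
        if minutes <= deadline_min
    }
    return min(costs, key=costs.get) if costs else None


# Hypothetical (price per node-hour in $, predicted running time in minutes).
predictions = {
    "type_a": (0.10, 900.0),
    "type_b": (0.20, 300.0),
    "type_c": (0.80, 100.0),
}

print(cheapest_within_deadline(predictions, num_nodes=10, deadline_min=1000))
print(cheapest_within_deadline(predictions, num_nodes=10, deadline_min=200))
```

With these placeholder numbers, a loose deadline favors the mid-priced type, while a tight deadline leaves only the fastest and most expensive one.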
More Info: www.cs.duke.edu/starfish
  Job-level MapReduce configuration
  Cluster sizing
  Data layout tuning
  Workflow optimization
  Workload management