Herodotos Herodotou, Shivnath Babu
Duke University
Analysis in the Big Data Era
- Popular option: the Hadoop software stack
  - Programming languages: Java / C++ / R / Python
  - Higher-level systems: Pig, Hive, Jaql, Oozie, Elastic MapReduce
  - Core: Hadoop (MapReduce execution engine, HBase, distributed file system)
8/31/2011, Duke University
Analysis in the Big Data Era
- Who are the users?
  - Data analysts, statisticians, computational scientists
  - Researchers, developers, testers
  - You!
- Who performs setup and tuning?
  - The users! They usually lack the expertise to tune the system.
Problem Overview
- Goal: enable Hadoop users and applications to get good performance automatically
  - Part of the Starfish system
  - This talk: tuning individual MapReduce jobs
- Challenges:
  - Heavy use of programming languages for MapReduce programs and UDFs (e.g., Java/Python)
  - Data loaded/accessed as opaque files
  - Large space of tuning choices
MapReduce Job Execution
- job j = <program p, data d, resources r, configuration c>
- [Figure: four input splits, each consumed by a map task; the four map tasks run in two map waves, and their outputs are shuffled to two reduce tasks running in one reduce wave, producing out 0 and out 1]
Optimizing MapReduce Job Execution
- job j = <program p, data d, resources r, configuration c>
- Space of configuration choices:
  - Number of map tasks
  - Number of reduce tasks
  - Partitioning of map outputs to reduce tasks
  - Memory allocation to task-level buffers
  - Multi-phase external sorting in the tasks
  - Whether output data from tasks should be compressed
  - Whether the combine function should be used
Optimizing MapReduce Job Execution
- Use defaults or set manually (rules of thumb)
- Rules of thumb may not suffice
- [Figure: 2-dimensional projection of a 13-dimensional performance surface, with the rule-of-thumb settings marked]
Applying Cost-based Optimization
- Goal: c_opt = argmin_{c ∈ S} F(p, d, r, c)
- Just-in-Time Optimizer: searches through the space S of parameter settings
- What-if Engine: estimates perf = F(p, d, r, c) using properties of p, d, r, and c
- Challenge: how to capture the properties of an arbitrary MapReduce program p?
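The argmin above can be sketched directly: enumerate a small subspace S of configurations and pick the one with the lowest predicted cost. The cost function below is a made-up analytical stand-in for the What-if Engine's F(p, d, r, c), not Starfish's actual model; the two knobs are hypothetical.

```python
import itertools

def whatif_cost(config):
    # Hypothetical stand-in for a What-if Engine call F(p, d, r, c):
    # fewer reducers -> less parallelism; more reducers -> more overhead;
    # a larger sort buffer -> less spilling. Purely illustrative numbers.
    num_reducers, io_sort_mb = config
    return 100.0 / num_reducers + 0.5 * num_reducers + 2000.0 / io_sort_mb

# Enumerate a small subspace S and take c_opt = argmin_{c in S} F(...).
space = list(itertools.product([1, 4, 16, 64],      # candidate reducer counts
                               [50, 100, 200, 400]))  # candidate buffer sizes (MB)
c_opt = min(space, key=whatif_cost)
```

The real optimizer replaces this exhaustive scan with recursive random search, since the full space is 13-dimensional.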
Job Profile
- Concise representation of program execution as a job
- Records information at the level of task phases
- Generated by the Profiler through measurement, or by the What-if Engine through estimation
- [Figure: phases of a map task: Read the split from the DFS; Map; Collect (serialize, partition into the memory buffer); Spill (sort, [combine], [compress]); Merge]
Job Profile Fields
- Dataflow: amount of data flowing through task phases
  - Map output bytes
  - Number of map-side spills
  - Number of records in buffer per spill
- Costs: execution times at the level of task phases
  - Read phase time in the map task
  - Map phase time in the map task
  - Spill phase time in the map task
- Dataflow Statistics: statistical information about the dataflow
  - Map function's selectivity (output/input)
  - Map output compression ratio
  - Size of records (keys and values)
- Cost Statistics: statistical information about the costs
  - I/O cost for reading from local disk, per byte
  - CPU cost for executing the Map function, per record
  - CPU cost for uncompressing the input, per byte
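The four categories can be pictured as one record type per job. The field names and values below are illustrative, not Starfish's actual schema:

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    # Dataflow: bytes/records moving through the task phases.
    map_output_bytes: int
    num_spills: int
    # Costs: measured phase times (seconds).
    read_phase_time: float
    map_phase_time: float
    spill_phase_time: float
    # Dataflow statistics: properties of the data, independent of resources.
    map_selectivity: float        # output size / input size
    compression_ratio: float
    # Cost statistics: per-unit costs tied to the cluster resources.
    io_cost_per_byte: float
    cpu_cost_per_record: float

profile = JobProfile(
    map_output_bytes=64_000_000, num_spills=3,
    read_phase_time=1.2, map_phase_time=4.0, spill_phase_time=2.1,
    map_selectivity=0.8, compression_ratio=0.4,
    io_cost_per_byte=2e-9, cpu_cost_per_record=5e-7,
)
```

The key design point is the split: dataflow statistics travel with the data, cost statistics travel with the resources, which is what makes the virtual profile estimation later in the talk possible.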
Generating Profiles by Measurement
- Goals:
  - Zero overhead when profiling is turned off
  - No modifications to Hadoop
  - Support for unmodified MapReduce programs written in Java or Hadoop Streaming/Pipes (Python/Ruby/C++)
- Approach: dynamic instrumentation
  - Monitors the task phases of MapReduce job execution
  - Event-condition-action rules are specified, leading to run-time instrumentation of the Hadoop internals
  - We currently use BTrace (the Hadoop internals are in Java)
Generating Profiles by Measurement
- [Figure: profiling is enabled for a sample of the map and reduce tasks; the raw monitoring data from those tasks yield map profiles and reduce profiles, which are combined into the job profile]
- Use of sampling: profile only a fraction of the tasks; the remaining tasks execute normally
What-if Engine
- Inputs: a job profile for <p, d1, r1, c1>, input data properties <d2>, cluster resources <r2>, and configuration settings <c2>
- The Job Oracle estimates a virtual job profile for <p, d2, r2, c2>
- The Task Scheduler Simulator then derives the properties of the hypothetical job
Virtual Profile Estimation
- Given the profile for job j = <p, d1, r1, c1>, estimate the profile for job j' = <p, d2, r2, c2>
- [Figure: estimation pipeline from the profile of j to the (virtual) profile of j':]
  - Cardinality models, given the input data d2, map the dataflow statistics of j to those of j'
  - Relative black-box models, given the resources r2, map the cost statistics of j to those of j'
  - White-box models, given the configuration c2, compute the dataflow of j' from its dataflow statistics
  - White-box models then compute the costs of j' from its dataflow and cost statistics
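As a toy stand-in for the relative-model step, the sketch below rescales measured cost statistics from resources r1 to r2 using per-resource ratios. The linear scaling and the resource attributes are purely illustrative assumptions; Starfish learns these mappings as black-box models rather than assuming a formula.

```python
def scale_cost_statistics(cost_stats, r1, r2):
    # Toy "relative model": assume per-byte I/O cost scales inversely with
    # disk bandwidth, and per-record CPU cost inversely with CPU speed.
    # (Illustrative only; the real system trains black-box models.)
    return {
        "io_cost_per_byte":
            cost_stats["io_cost_per_byte"] * r1["disk_mb_s"] / r2["disk_mb_s"],
        "cpu_cost_per_record":
            cost_stats["cpu_cost_per_record"] * r1["ec2_units"] / r2["ec2_units"],
    }

measured = {"io_cost_per_byte": 2e-8, "cpu_cost_per_record": 4e-6}
r1 = {"disk_mb_s": 50.0, "ec2_units": 5}    # hypothetical source cluster
r2 = {"disk_mb_s": 100.0, "ec2_units": 8}   # hypothetical target cluster
virtual = scale_cost_statistics(measured, r1, r2)
```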
White-box Models
- Detailed set of equations for Hadoop
- Example: from the input data properties, dataflow statistics, and configuration parameters, calculate the dataflow in each phase of a map task
- [Figure: map task phases: Read the split from the DFS; Map; Collect (serialize, partition into the memory buffer); Spill (sort, [combine], [compress]); Merge]
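To give the flavor of these equations, the sketch below estimates the map output size and the number of spills from a few inputs, following the general shape of Hadoop's sort-buffer logic. It is a simplification (it ignores the record-metadata buffer, for instance), not Starfish's exact model.

```python
import math

def map_task_dataflow(input_bytes, map_selectivity,
                      io_sort_mb, io_sort_spill_percent):
    # White-box model sketch: dataflow statistics + configuration
    # -> dataflow in the Collect/Spill phases of one map task.
    map_output_bytes = input_bytes * map_selectivity
    # A spill is triggered when the in-memory sort buffer fills up to
    # its spill threshold (simplified).
    spill_trigger_bytes = io_sort_mb * 1024 * 1024 * io_sort_spill_percent
    num_spills = max(1, math.ceil(map_output_bytes / spill_trigger_bytes))
    return map_output_bytes, num_spills

out_bytes, spills = map_task_dataflow(
    input_bytes=512 * 1024 * 1024,   # a 512 MB input split
    map_selectivity=0.5,             # output/input ratio from the profile
    io_sort_mb=100,                  # Hadoop default (see parameter table)
    io_sort_spill_percent=0.8)       # Hadoop default
```

Chaining many such per-phase formulas, fed by the dataflow and cost statistics, yields the estimated phase costs of the hypothetical job.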
Just-in-Time Optimizer
- Inputs: a job profile for <p, d1, r1, c1>, input data properties <d2>, and cluster resources <r2>
- (Sub)space enumeration with recursive random search, driven by calls to the What-if Engine
- Output: the best configuration settings <c_opt> for <p, d2, r2>
Recursive Random Search
- [Figure: points sampled from the parameter space; each point (a configuration setting) is costed using the What-if Engine, and the search recursively zooms into promising regions]
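A minimal version of recursive random search over a box-shaped parameter space might look like the following. The shrink factor, sample counts, and toy cost surface are arbitrary choices of this sketch; in the real optimizer, each cost evaluation is a What-if Engine call.

```python
import random

def recursive_random_search(cost, bounds, samples=20, shrink=0.5,
                            rounds=4, seed=0):
    # bounds: one (lo, hi) pair per parameter. Each round samples
    # `samples` random points, then recenters a smaller box on the best
    # point found so far (the "recursive" step of RRS).
    rng = random.Random(seed)
    best_point, best_cost = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples):
            point = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
            c = cost(point)           # stand-in for a What-if Engine call
            if c < best_cost:
                best_point, best_cost = point, c
        # Shrink the search box around the incumbent best point.
        bounds = [(max(lo, x - shrink * (hi - lo) / 2),
                   min(hi, x + shrink * (hi - lo) / 2))
                  for (lo, hi), x in zip(bounds, best_point)]
    return best_point, best_cost

# Toy 2-parameter cost surface with its minimum at (3, 5).
point, value = recursive_random_search(
    lambda p: (p[0] - 3) ** 2 + (p[1] - 5) ** 2,
    bounds=[(0, 10), (0, 10)])
```

Randomized search like this needs only cost evaluations, no gradients, which fits a 13-dimensional space with discrete and continuous knobs.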
Experimental Methodology
- 15-30 Amazon EC2 nodes, various instance types
- Cluster-level configurations based on rules of thumb
- Data sizes: 10-180 GB
- Rule-based Optimizer vs. Cost-based Optimizer

Abbr. | MapReduce Program   | Domain                | Dataset
CO    | Word Co-occurrence  | NLP                   | Wikipedia
WC    | WordCount           | Text Analytics        | Wikipedia
TS    | TeraSort            | Business Analytics    | TeraGen
LG    | LinkGraph           | Graph Processing      | Wikipedia (compressed)
JO    | Join                | Business Analytics    | TPC-H
TF    | TF-IDF              | Information Retrieval | Wikipedia
Job Optimizer Evaluation
- Hadoop cluster: 30 nodes, m1.xlarge
- Data sizes: 60-180 GB
- [Figure: speedup of the Rule-based Optimizer over Default Settings for the MapReduce programs TS, WC, LG, JO, TF, CO]
Job Optimizer Evaluation
- Hadoop cluster: 30 nodes, m1.xlarge
- Data sizes: 60-180 GB
- [Figure: speedup over Default Settings for both the Rule-based Optimizer and the Cost-based Optimizer across TS, WC, LG, JO, TF, CO]
Estimates from the What-if Engine
- Hadoop cluster: 16 nodes, c1.medium
- MapReduce program: Word Co-occurrence
- Dataset: 10 GB Wikipedia
- [Figure: true performance surface vs. the surface estimated by the What-if Engine]
Estimates from the What-if Engine
- Profiling on the test cluster, prediction on the production cluster
- Test cluster: 10 nodes, m1.large, 60 GB
- Production cluster: 30 nodes, m1.xlarge, 180 GB
- [Figure: actual vs. predicted running times (min) for TS, WC, LG, JO, TF, CO]
Profiling Overhead vs. Benefit
- Hadoop cluster: 16 nodes, c1.medium
- MapReduce program: Word Co-occurrence
- Dataset: 10 GB Wikipedia
- [Figure: as the percentage of tasks profiled varies from 1 to 100: (left) percent overhead over the job running time with profiling turned off; (right) speedup over the job run with RBO settings]
Conclusion
- What have we achieved?
  - Perform in-depth job analysis with profiles
  - Predict the behavior of hypothetical job executions
  - Optimize arbitrary MapReduce programs
- What's next?
  - Optimize job workflows/workloads
  - Address the cluster sizing (provisioning) problem
  - Perform data layout tuning
Starfish: Self-tuning Analytics System
- Software release: Starfish v0.2.0
- Demo: Session C, Thursday, 10:30-12:00, Grand Crescent
- www.cs.duke.edu/starfish
Hadoop Configuration Parameters

Parameter                               | Default Value
io.sort.mb                              | 100
io.sort.record.percent                  | 0.05
io.sort.spill.percent                   | 0.8
io.sort.factor                          | 10
mapreduce.combine.class                 | null
min.num.spills.for.combine              | 3
mapred.compress.map.output              | false
mapred.reduce.tasks                     | 1
mapred.job.shuffle.input.buffer.percent | 0.7
mapred.job.shuffle.merge.percent        | 0.66
mapred.inmem.merge.threshold            | 1000
mapred.job.reduce.input.buffer.percent  | 0
mapred.output.compress                  | false
Amazon EC2 Node Types

Node Type  | CPU (EC2 Units) | Mem (GB) | Storage (GB) | Cost ($/hour) | Map Slots per Node | Reduce Slots per Node | Max Mem per Slot (MB)
m1.small   | 1               | 1.7      | 160          | 0.085         | 2                  | 1                     | 300
m1.large   | 4               | 7.5      | 850          | 0.34          | 3                  | 2                     | 1024
m1.xlarge  | 8               | 15       | 1690         | 0.68          | 4                  | 4                     | 1536
c1.medium  | 5               | 1.7      | 350          | 0.17          | 2                  | 2                     | 300
c1.xlarge  | 20              | 7        | 1690         | 0.68          | 8                  | 6                     | 400