Report 02 Data analytics workbench for educational data. Palak Agrawal


Report 02: Data analytics workbench for educational data
Palak Agrawal
Last Updated: May 22, 2014

Starfish: A Self-tuning System for Big Data Analytics [1]

Contents

1 Introduction
    1.1 Starfish: MADDER and Self-Tuning Hadoop
2 Overview of Starfish
    2.1 Job-level Tuning
        Starfish's just-in-time optimization
    2.2 Workflow-level tuning
    2.3 Workload-level tuning
    2.4 Lastword: Starfish's Language for Workloads and Data
3 Just-in-Time Job Optimization
4 Workflow-aware scheduling
5 Optimization and provisioning for Hadoop workloads
6 Positive Points
7 Negative Points


1 Introduction

Starfish builds on Hadoop and adapts to user needs and system workloads to provide good performance automatically, without requiring users to understand and manipulate the many tuning knobs in Hadoop.

Users expect a system for big data analytics to be MAD, which stands for Magnetism, Agility, and Depth. Hadoop is a MAD system. Hadoop itself has two primary components: a MapReduce execution engine and a distributed filesystem. Analytics with Hadoop involves loading data as files into the distributed filesystem and then running parallel MapReduce computations on the data.

The same properties that make Hadoop MAD pose new challenges on the path to self-tuning:
- Data opacity until processing
- File-based processing
- Heavy use of programming languages

1.1 Starfish: MADDER and Self-Tuning Hadoop

Three more features are becoming important in analytics:
- Data-lifecycle awareness
- Elasticity
- Robustness

An important design decision was to build Starfish on the Hadoop stack. Starfish's goal is to enable Hadoop users and applications to get good performance automatically throughout the data lifecycle in analytics.


2 Overview of Starfish

A Hadoop workload can be viewed at several levels; at the lowest level are individual MapReduce jobs. A workflow is the execution plan generated for a query. A workflow may be ad-hoc driven, time driven, or data driven. The tuning of the components in the Starfish architecture can be categorized into job-level tuning, workflow-level tuning, and workload-level tuning.

2.1 Job-level Tuning

The behavior of a MapReduce job is controlled by the settings of more than 190 configuration parameters. Good settings for these parameters depend on job, data, and cluster characteristics. Because the schema and properties of the data are unknown until the job is submitted, the system must choose at that point both how to execute the job (e.g., which technique to use for a join query) and the settings of the job configuration parameters.

Starfish's just-in-time optimization

The Just-in-Time Optimizer automatically selects an optimal execution technique when a job is submitted. It is assisted by information from the Profiler and the Sampler.

Profiler
1. Performs dynamic instrumentation using BTrace (a read-only Java tool that uses scripts to monitor running applications).
2. Generates job profiles (information captured at the task and subtask levels).

Sampler
1. Collects statistics about input, intermediate, and output data.
2. Samples the execution of MapReduce jobs.
3. Enables the Profiler to generate approximate job profiles without complete execution.

2.2 Workflow-level tuning

Unbalanced data layouts can result in dramatically degraded performance. The default HDFS replication scheme, in combination with data-locality-aware scheduling, can result in overloaded servers. The Workflow-aware Scheduler coordinates with the What-if Engine to determine the optimal data layout, and with the Just-in-Time Optimizer to determine the job execution schedule for the workflow. It performs a cost-based search over the following choices:
1. Block placement policy
2. Replication factor
3. Size of file blocks
4. Whether to compress output or not
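The cost-based search over the four layout choices above can be illustrated with a small sketch. This is not Starfish's actual search strategy or cost model: the candidate values, the names `toy_cost` and `best_layout`, and every number in the cost function are assumptions made up for illustration, and the search here is a plain exhaustive enumeration.

```python
from itertools import product

# Hypothetical candidate values for the four layout choices listed above.
PLACEMENT_POLICIES = ["local-write", "round-robin"]
REPLICATION_FACTORS = [1, 2, 3]
BLOCK_SIZES_MB = [64, 128, 256]
COMPRESS_OUTPUT = [False, True]


def toy_cost(policy, replication, block_mb, compress):
    """Made-up cost model standing in for estimates from a What-if Engine."""
    cost = 100.0
    cost += 20.0 if policy == "local-write" else 0.0  # penalty for unbalanced layout
    cost += 5.0 * replication                          # extra work to write replicas
    cost -= 8.0 * min(replication, 2)                  # benefit of spreading reads
    cost += abs(block_mb - 128) / 16.0                 # toy block-size sweet spot
    cost -= 10.0 if compress else 0.0                  # less I/O when compressed
    return cost


def best_layout(cost_fn):
    """Exhaustively enumerate all layout combinations and keep the cheapest."""
    candidates = product(
        PLACEMENT_POLICIES, REPLICATION_FACTORS, BLOCK_SIZES_MB, COMPRESS_OUTPUT
    )
    return min(candidates, key=lambda choice: cost_fn(*choice))


best = best_layout(toy_cost)
print(best, toy_cost(*best))
```

The point of the sketch is only the shape of the search: each candidate layout is scored by an estimated cost and the cheapest one is kept. In a real system the cost function would come from model-based predictions rather than hand-picked constants.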


2.3 Workload-level tuning

Starfish implements a Workload Optimizer that translates the workloads submitted to the system into an equivalent but optimized collection of workflows. It optimizes three areas:
- Data-flow sharing
- Materialization
- Reorganization

Hadoop provisioning deals with choices such as the number of nodes, node configuration, and network configuration needed to meet given workload requirements. Given multiple constraints, Starfish's Elastisizer provides an optimal configuration.

2.4 Lastword: Starfish's Language for Workloads and Data

Starfish interposes itself between Hadoop and its clients. Clients submit workloads, and Starfish uses language translators to automatically convert workloads specified in a high-level language to Lastword. An important feature of Lastword is its support for expressing metadata along with the tasks for execution.

3 Just-in-Time Job Optimization

Rules of thumb for parameter tuning: common advice suggests to
- set mapred.reduce.tasks, the number of reduce tasks in the job, to roughly 0.9 times the total number of reduce slots in the cluster;
- set io.sort.record.percent to 16/(16 + avg_record_size), based on the average size of map output records.

The results show that these rules of thumb are inefficient, so Starfish's just-in-time optimization is used instead.

Profiling Using Dynamic Instrumentation

When Hadoop runs a MapReduce job, the Starfish Profiler instruments selected Java classes in Hadoop to construct a job profile. A profile is a concise representation of the job execution that captures information at both the task and subtask levels. The execution of a MapReduce job is broken down into the Map Phase and the Reduce Phase. The Map Phase is further divided into the Reading, Map Processing, Spilling, and Merging subphases. The Reduce Phase is divided into the Shuffling, Sorting, Reduce Processing, and Writing subphases. Each subphase represents an important part of the job's overall execution in Hadoop.

Predicting job performance in Hadoop

Given a new setting S of the configuration parameters, the What-if Engine can use the job profile and a set of models developed by the Starfish authors to estimate the new profile if the job were to be run using S.
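The two rules of thumb for parameter tuning described earlier in this section are simple arithmetic, sketched below. The function name `rule_of_thumb_settings` and the example inputs are hypothetical, and the record size is assumed to be in bytes. The sketch only computes the values; it does not apply them to a Hadoop job.

```python
def rule_of_thumb_settings(total_reduce_slots: int, avg_record_size: float) -> dict:
    """Compute the two rule-of-thumb values (illustrative helper, not a Hadoop API)."""
    if total_reduce_slots <= 0 or avg_record_size <= 0:
        raise ValueError("inputs must be positive")
    return {
        # Roughly 0.9 times the reduce slots in the cluster, and at least one task.
        "mapred.reduce.tasks": max(1, round(0.9 * total_reduce_slots)),
        # 16 / (16 + average size of a map output record, assumed in bytes).
        "io.sort.record.percent": 16 / (16 + avg_record_size),
    }


# Example: a hypothetical cluster with 30 reduce slots and 64-byte map output records.
settings = rule_of_thumb_settings(total_reduce_slots=30, avg_record_size=64)
print(settings)
```

Both values depend only on cluster size and record size, not on what the job actually does, which is one way to see why such rules can miss good settings.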


The What-if Engine is given four inputs when asked to predict the performance of a MapReduce job J:
1. The job profile generated for J by the Profiler.
2. The new setting S of the job configuration parameters with which job J will be run.
3. The size, layout, and compression information of the input dataset on which job J will be run.
4. The cluster setup and resource allocation that will be used to run job J.

4 Workflow-aware scheduling

Unbalanced data layouts cause a problem for data-locality-aware schedulers (i.e., schedulers that aim to move computation to the data). Exploiting data locality can have two undesirable consequences in this context. First, performance is reduced because parallelism decreases. Second, the data layout becomes further unbalanced because new outputs go to the over-utilized nodes. On the other hand, non-data-local scheduling incurs the overhead of data movement.

For a single-rack cluster, HDFS places the second replica of each block of the partitions on a randomly chosen node. With this second replica, the time to process the partitions improves significantly because the second copy of the data is spread out over the cluster. The overhead of creating a second replica is also very small on the cluster.

A Workflow-aware Scheduler can ensure that job-level optimization and scheduling policies are coordinated with the data placement policies employed by the underlying distributed filesystem. Starfish's Workflow-aware Scheduler makes decisions by considering producer-consumer relationships among jobs in the workflow.

5 Optimization and provisioning for Hadoop workloads

Workload Optimizer: Starfish's Workload Optimizer represents the workload as a directed graph and applies the optimizations as graph-to-graph transformations.

Elastisizer: Users can now leverage pay-as-you-go resources on the cloud to meet their analytics needs. The cluster can be released when the workflow completes, and the user pays only for the resources used. One of the goals of Starfish's Elastisizer is to automatically determine the best cluster and Hadoop configurations to process a given workload subject to user-specified goals.
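One way to picture the Elastisizer's task is as a constrained choice among cluster configurations: pick the cheapest one whose predicted running time meets a user-specified deadline. The sketch below is an assumption-laden illustration, not the Elastisizer itself. `ClusterConfig`, `toy_runtime_hours`, `cheapest_within_deadline`, the prices, and the runtime formula are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ClusterConfig:
    nodes: int
    price_per_node_hour: float  # hypothetical pay-as-you-go price


def toy_runtime_hours(cfg: ClusterConfig, work_node_hours: float = 24.0) -> float:
    """Made-up predictor: perfectly parallel work plus a fixed startup overhead."""
    return work_node_hours / cfg.nodes + 0.5


def cheapest_within_deadline(
    configs: Iterable[ClusterConfig],
    deadline_hours: float,
    predict: Callable[[ClusterConfig], float] = toy_runtime_hours,
) -> Optional[ClusterConfig]:
    """Return the cheapest configuration predicted to finish by the deadline."""
    best, best_cost = None, math.inf
    for cfg in configs:
        hours = predict(cfg)
        if hours > deadline_hours:
            continue  # violates the user-specified goal
        cost = hours * cfg.nodes * cfg.price_per_node_hour
        if cost < best_cost:
            best, best_cost = cfg, cost
    return best


configs = [ClusterConfig(n, 0.10) for n in (2, 4, 8, 16)]
choice = cheapest_within_deadline(configs, deadline_hours=4.0)
print(choice)
```

With these toy numbers the 2-node and 4-node clusters miss the 4-hour deadline, and the 8-node cluster is cheaper than the 16-node one because the fixed overhead is paid on fewer nodes. A real provisioning decision would also have to cover node types and Hadoop configuration settings, which this sketch leaves out.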


6 Positive Points

The approach in the paper makes things easier for the user, who no longer needs to decide when to add nodes to or remove nodes from HDFS.

7 Negative Points

- How effective might the Sampler be?
- An analysis of the overhead introduced by Starfish is missing.
- The paper leaves out a lot of details.


References

[1] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data analytics," in CIDR, vol. 11.
