A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML



Similar documents
Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

How To Study Hadoop Efficiency

Machine Learning with MATLAB David Willingham Application Engineer

An Introduction to Data Mining

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Automating Big Data Benchmarking for Different Architectures with ALOJA

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Machine Learning Capacity and Performance Analysis and R

HiBench Introduction. Carson Wang Software & Services Group

Azure Machine Learning, SQL Data Mining and R

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

HiBench Installation. Sunil Raiyani, Jayam Modi

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

An Overview of Knowledge Discovery Database and Data mining Techniques

Introduction to Data Mining

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Welcome to the 6 th Workshop on Big Data Benchmarking

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

not possible or was possible at a high cost for collecting the data.

DATA MINING TECHNIQUES AND APPLICATIONS

Introduction to Data Mining

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Meta-learning. Synonyms. Definition. Characteristics

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Introduction. A. Bellaachia Page: 1

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

INDIAN STATISTICAL INSTITUTE announces Training Program on Statistical Techniques for Data Mining & Business Analytics

CS Data Science and Visualization Spring 2016

Chapter ML:XI. XI. Cluster Analysis

CloudRank-D:A Benchmark Suite for Private Cloud Systems

Data Mining + Business Intelligence. Integration, Design and Implementation

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

How To Predict Web Site Visits

Data Mining for Fun and Profit

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus

Maximizing Return and Minimizing Cost with the Decision Management Systems

Maschinelles Lernen mit MATLAB

Data Mining: Overview. What is Data Mining?

Data Mining Solutions for the Business Environment

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Predictive Modeling Techniques in Insurance

The Scientific Data Mining Process

BIG DATA What it is and how to use?

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Advanced Ensemble Strategies for Polynomial Models

Monitoring of Complex Industrial Processes based on Self-Organizing Maps and Watershed Transformations

1. Classification problems

A Survey on Web Research for Data Mining

Advanced analytics at your hands

HPCHadoop: MapReduce on Cray X-series

Application of Predictive Analytics for Better Alignment of Business and IT

Sunnie Chung. Cleveland State University

Pentaho Data Mining Last Modified on January 22, 2007

CUSTOMER Presentation of SAP Predictive Analytics

Fast Analytics on Big Data with H20

Knowledge Discovery from patents using KMX Text Analytics

Data Mining Techniques

MA2823: Foundations of Machine Learning

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Ensembles and PMML in KNIME

An In-Depth Look at In-Memory Predictive Analytics for Developers

locuz.com Big Data Services

COURSE RECOMMENDER SYSTEM IN E-LEARNING

A Review of Data Mining Techniques

IBM's Fraud and Abuse, Analytics and Management Solution

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

Data Domain Profiling and Data Masking for Hadoop

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

KnowledgeSEEKER Marketing Edition

TIETS34 Seminar: Data Mining on Biometric identification

EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS

Why is Internal Audit so Hard?

Distributed Framework for Data Mining As a Service on Private Cloud

Analytics on Big Data

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

MapReduce and Hadoop Distributed File System

Chapter 12 Discovering New Knowledge Data Mining

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

DBTech Pro Workshop. Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining. Georgios Evangelidis

Principles of Data Mining by Hand&Mannila&Smyth

Data UNC. Vinayak Deshpande

Business Intelligence. Data Mining and Optimization for Decision Making

Using Data Mining for Mobile Communication Clustering and Characterization

SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics

Customer and Business Analytic

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends

Contents. Dedication List of Figures List of Tables. Acknowledgments

Transcription:

www.bsc.es A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML Josep Ll. Berral, Nicolas Poggi, David Carrera Workshop on Big Data Benchmarks Toronto, Canada 2015 1

Context ALOJA: framework to interpret Hadoop benchmarks performance data and tuning to provide recommendations on cost-effectiveness of systems Challenges: 1. Need to go from expert-guided to automated analysis 2. Need to deal with huge amounts of data and find the most relevant Approach Predictive Analytics can be applied to deploy model-based methods helping such analysis 2

ALOJA and Predictive Analytics Predictive Analytics Encompasses statistical and Machine Learning (ML) techniques To make predictions of unknown events from historical events Forecast and foresight Formerly known also as applied modeling and prediction Predict behavior elements and apply them to extract knowledge Machine Learning Science and methods part of Data Mining in charge of learning (modeling) a system from some of its observations ALOJA-ML: the ALOJA predictive analytics component for modeling benchmarks Benchmark behavior Benchmark executions Predictive Analytics Knowledge about benchmark Oracles for benchmark 3

Prediction tools online 4

PA workflow in ALOJA User work-flow 1. ALOJA Web front-end 2. Data filtering 3. ML Processing 4. Processed data into ALOJA tools 5. Display / Visualization 5

Modeling Hadoop components and topology Experiments and algorithms are codified in R (except some learning algorithms called from the WEKA package, in JAVA) This allows us to run the methods locally or process them in an external Web Service 6

Modeling Hadoop jobs Methodology 3-step learning: Different split sizes tested Grant enough samples for testing ( 25%) Attempt to reduce the training set without degrading the learning (25% < training < 50%) Different learning algorithms Regression trees Nearest-neighbors learning Neural networks (FFANN) Linear and polynomial regressions 7

Learning Benchmarks Capabilities of learning Benchmarks Modeling of benchmark behaviors Examine in which degree configurations affect the execution Prediction of benchmark execution times, resource consumption, Benchmark representation Without parametrizing benchmarks Learning algorithms must treat benchmark names as discrimination categories A model must have seen a benchmark before to predict it With benchmark parametrization (numeric) [w.i.p.] Discover which elements of a system tailor a benchmark A new benchmark should have trial runs (we already do without parametrizing) 8

Dataset The ALOJA data-set Over 40.000 Hadoop benchmark executions, from the HiBench suite Selected benchs: kmeans, pagerank, sort, terasort, wordcount, dfsioe_r/w Inputs: benchmark, hardware features, cloud provider + type of deployment, software configurations Output: execution time, used resources, Features Hardware characteristics: network storage type, cluster description Software configurations: # of maps, sort factor, file buffer size, block size 9

Benchmark Modeling results Modeling Use of 50% of executions to train a model, 25% to validate, 25% to test Use of regression trees and nearest neighbor algorithms, among others General model for different classifiers 10

Generalization vs. Specialization General model vs. Specialized models One model (one training/update) vs. individual models (more specialized) General model is expected to have higher error, but not too much Comparative without parametrization General model behaves worse that specific ones (but not much) G.M.: RAE 0.184 Average S.M.: RAE 0.132 G.M. does not over-fit as specifics Benchmark RAE on Sepecific model DFSIOE read 0.29965 DFSIOE write 0.10763 k-means 0.12842 pagerank 0.11948 sort 0.12823 terasort 0.12599 wordcount 0.09702 General Model 0.184 Encouraging next work Study the parametrization of benchmarks for automatic learning so we expect to get a better general model 11

Benchmark Modeling Issues and opportunities Detection of outliers Benchmarks showing high learning errors: signal that executions are unstable or have high presence of outliers Automating learning can put apart those benchmarks or executions E.g., pagerank (having outliers) returns a RAE of 1.18 execs put to revision After cleaning the pagerank anomalous executions RAE of 0.12 Prediction of output variables Ranking of relevant configuration features 12

Use case 1: Anomaly Detection Anomaly Detection Model-based detection procedure The selected model becomes the system. Any execution not fitting into the model is supposed to be out of the system. 13

Use case 1: Anomaly Detection Anomaly and Outlier Detection Use of statistic and model-based outlier detections Highlight executions with high probability of anomaly Mark down executions with high probability of being errors 14

Use case 2: Features and Discriminators Discrimination of Features Use the models (general or specifics) Create a ranking of features, according to the estimated results Possible discrimination By information gain By ordered splits (best variable to split configurations by their outputs) Tree Descriptor: Disk=HDD Net=ETH IO.FBuf=128KB 2935s IO.FBuf=64KB 2942s Net=IB IO.FBuf=128KB 3118s IO.FBuf=64KB 3125s Disk=SSD Net=ETH IO.FBuf=128KB 1248s IO.FBuf=64KB 1256s Net=IB IO.FBuf=128KB 1233s IO.FBuf=64KB 124s1 15

Case of use 3: Knowledge Discovery Make analyzing results easier Multi-variable visualization Trees separating relevant attributes Other interesting tools HDD SSD pred_time 16

Summary I Conclusions Modeling: Specific models are a little bit more accurate than generals (but not so much) High error at automatic modeling can indicate outliers Unfolding the search space of Hadoop configurations Observe predictions Rank features by possible relevance Next steps: Characterization and parametrization of benchmarks Guided executions 17

Additional reference and publications Online repository and tools available at: http://hadoop.bsc.es Publications ALOJA project: automatic characterization of cost-effectiveness on Hadoop deployments Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Nikola Vujic, Daron Green and Jose Blakeley, et al. "ALOJA: a Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost- Effectiveness. IEEE BigData 2014 ALOJA-ML: Predictive analytics tools for benchmarking on Hadoop deployments Josep Ll. Berral, Nicolás Poggi, David Carrera, Aaron Call, Rob Reinauer, Daron Green. ALOJA-ML: A Framework for Automating Characterization and Knowledge Discovery in Hadoop Deployments. ACM SIGKDD - KDD 2015 18

www.bsc.es Thanks! Q&A Contact: hadoop@bsc.es