
Predictive Analytics
Omer Mimran, Spring 2015
Challenges in Modern Data Centers Management

Information provided in these slides is for educational purposes only.

Agenda
- Motivation: predicting the jobs' resource requirements
- Background and challenges: predictive analytics, data-stream mining (DSM)
- System overview
- DSM algorithms: regression tree, Hoeffding tree, multiple sliding windows (MSW)
- Summary & conclusions

Motivation

Reminder: RM lectures I-III
- Each job comes with resource requirements, e.g., 2 cores x 8GB
- Requirements are specified by the user submitting the job, based on their experience, etc.
- The scheduler picks the job (RM-I) and matches it with a server (RM-II): best fit, worst fit, etc.

Why do we need predictive analytics?
- What if the jobs (users) request too many resources, e.g., 8GB while in practice the job only uses 4GB of memory?
- This is a very common problem, resulting in a huge waste of resources ($$ loss), even if resource matching was done optimally (RM-II lecture).
Our goal (predictive analytics):
- Provide a prediction of the actual resource usage of the jobs (focusing on memory)
- Forward this information to the scheduler to do the matching
- More jobs fit in -> higher throughput -> $$ saving

Background and challenges

Predictive analytics
- Predictive analytics "encompasses a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events." (Nyce, Charles, 2007)
- Machine learning: "Field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel, 1959)
General methodology (CRISP-DM):
1. Divide the data into 3 sets (training, testing, validation)
2. Use the training set to create models and the testing set to measure performance
3. Use the validation set to select the best model & test model generalization
4. Use the model for prediction
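A minimal sketch of the three-way split in step 1; the 60/20/20 proportions and helper name are illustrative assumptions, not from the slides:

```python
import random

def three_way_split(data, train=0.6, test=0.2, seed=42):
    """Shuffle and split data into training, testing, and validation sets.
    The 60/20/20 proportions are illustrative, not prescribed by the slides."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, testing, validation

jobs = list(range(100))  # stand-in for historical job records
training, testing, validation = three_way_split(jobs)
```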

Data-stream mining (DSM)
- Data stream: continuous (endless) and rapidly incoming data
- Idea: apply machine-learning techniques online, on the data stream
Key challenges:
1. Performance: infeasible to store/train on all the data; each sample is processed once
2. Quality: expected to perform at least as well as non-stream (batch) models
3. Adaptability: the stream is non-stationary, so the underlying model must be altered accordingly
4. Availability: must be available for prediction at all times
(Bifet et al., 2010; Domingos & Hulten, 2001; Aggarwal, 2007; Gama & Rodrigues, 2007; Gaber et al., 2005; Babcock et al., 2002)

Adaptivity challenge: concept drift
Concept drift: scenarios in which the distribution of a certain population changes over time; hence, statistical inference is affected (Kelly et al., 1999).
Concept-drift types:
1. Sudden: easier to detect, with fewer examples
2. Gradual: harder to detect, often mistaken for random noise
3. Incremental: occurs over a long period of time
4. Recurring contexts: appears in a cyclic manner
(Tsymbal, 2004; Gama & Castillo, 2006; Zliobaite, 2009)
Possible treatments:
1. Resetting the training data (Klinkenberg, 2004; Cohen et al., 2008; Zliobaite, 2009)
2. Training a shadow model (Domingos & Hulten, 2000; Ikonomovska & Gama, 2008; Bifet & Gavaldà, 2009)
3. Using an ensemble (Tsymbal et al., 2008; Ouyang et al., 2009)

Concept drift in reality
Bursts in jobs' core and memory requirements.
Source: Ohad Shai, Edi Shmueli, and Dror G. Feitelson, "Heuristics for resource matching in Intel's compute farm." In Job Scheduling Strategies for Parallel Processing, Walfredo Cirne and Narayan Desai (eds.), Springer-Verlag, 2013.

Performance challenge: sliding windows
Using time windows is a common technique in stream mining: it gives better performance and also addresses concept drift.
Time-window types:
1. Landmark window: maintain all data starting from an identified relevant point
2. Tilted window: maintain all data within a window, at different aggregate scales
3. Sliding window: only recent examples are stored in the window
(Gama & Rodrigues, 2007)

Performance challenge: sliding windows
The problem: how to set the window size?
- Too short: lower statistical validity and stability
- Too long: slow adaptation, with a negative impact on quality
Example: the accuracy of protein-structure prediction using KNN with sliding windows of varying length (Chen, Kurgan, & Ruan, 2006).
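A minimal sketch of a fixed-size sliding window over a stream, illustrating the trade-off above; the window sizes and mean-based prediction function are illustrative assumptions:

```python
from collections import deque

class SlidingWindow:
    """Keep only the most recent `size` examples; older ones fall out."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # deque drops the oldest item automatically

    def add(self, value):
        self.window.append(value)

    def predict(self):
        # Illustrative prediction function: the mean of the window.
        return sum(self.window) / len(self.window) if self.window else None

# A short window (size 5) adapts quickly but is statistically noisy;
# a long window (size 500) is stable but slow to follow concept drift.
short_win, long_win = SlidingWindow(5), SlidingWindow(500)
for memory_gb in [4, 4, 4, 8, 8, 8]:  # stand-in stream of job memory usage
    short_win.add(memory_gb)
    long_win.add(memory_gb)
print(short_win.predict(), long_win.predict())
```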

System overview (diagram; component 1 highlighted)

System overview: input from the users (1)
Job characteristics: user, project, priority, command-line, resource requirements, etc.
- Data only known at submission time
- Categorical variables with many possible values

System overview (diagram; components 1-2 highlighted)

System overview: output of the model (2)
Prediction example:
- If command = A and project = Tablet then memory = 4GB
- If command = B and project = Mobile then memory = 6GB
- If priority = 1 and user team = uncore then memory = 2GB
- If project = ServerX then memory = 16GB

System overview (diagram; components 1-3 highlighted)

System overview: output of the scheduler (3)
- The scheduler matches the jobs with machines/servers (RM-II lecture), using the predicted values (not the original values specified by the user)
- More jobs fit in -> higher throughput -> $$ saving

System overview (diagram; components 1-4 highlighted)

System overview: input to the model (4)
- Job characteristics: user, project, priority, command-line, etc.
- Actual resources consumed by the jobs, e.g., memory

Performance measurements & objective
Measurements are calculated per job once it completes and its data is available in the DB, comparing actual runtime/memory consumption vs. the prediction.
Objective: maximum saving + a minimum of 95% accuracy, i.e., minimize resource waste while ensuring that 95% of the jobs will not be under-estimated (otherwise they might be killed by the scheduler).
Measurements (calculated over all jobs which got a memory prediction):
Accuracy = (number of jobs with memory consumed < memory prediction) / (number of jobs)
Saving = Σ over jobs of job runtime x (memory requested by user - memory prediction)
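A small sketch computing the two measurements as defined above; the field names are illustrative, and the saving formula is read as runtime-weighted memory freed relative to the user's request:

```python
def accuracy(jobs):
    """Fraction of predicted jobs whose actual memory stayed below the prediction."""
    ok = sum(1 for j in jobs if j["memory_consumed"] < j["memory_predicted"])
    return ok / len(jobs)

def saving(jobs):
    """Runtime-weighted memory freed: requested minus predicted, summed over jobs.
    One reading of the slide's formula; units here are GB-seconds."""
    return sum(j["runtime"] * (j["memory_requested"] - j["memory_predicted"])
               for j in jobs)

jobs = [  # stand-in records for completed jobs that received a prediction
    {"runtime": 3600, "memory_requested": 8, "memory_predicted": 5, "memory_consumed": 4},
    {"runtime": 1800, "memory_requested": 8, "memory_predicted": 5, "memory_consumed": 6},
]
print(accuracy(jobs), saving(jobs))  # 0.5, 16200 GB-seconds
```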

DSM algorithms

Challenge
Data available for learning:
1. Job characteristics: user, project, command-line, etc.
2. Actual resources consumed by the jobs, e.g., memory
Output: predict resource consumption for future incoming jobs.

DSM algorithms: regression tree idea
Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009).
Example: a new job with Priority = 1 and NumOfLoops = 2 is routed down the tree (Priority <= 5, then NumOfLoops > 0) to a leaf whose memory prediction is 4GB.

DSM algorithms: regression tree steps
1. Construct a tree using the Chernoff bound, comparing the standard deviation reduction (SDR) of all possible values as the split criterion.
Example: all candidate variables' values are tested; the value Priority = 5 is found to best reduce the standard deviation, so the node is split on Priority <= 5 / Priority > 5.
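A minimal sketch of standard deviation reduction for one candidate binary split; this shows the criterion itself, not the incremental FIRT-DD bookkeeping:

```python
import statistics

def sdr(values, left, right):
    """Standard deviation reduction of splitting `values` into `left` and `right`."""
    n = len(values)
    weighted = (len(left) / n) * statistics.pstdev(left) + \
               (len(right) / n) * statistics.pstdev(right)
    return statistics.pstdev(values) - weighted

# Memory usage (GB) of jobs at a node, split by the candidate test Priority <= 5.
memory = [2, 2, 3, 8, 9, 10]
left = [2, 2, 3]    # jobs with Priority <= 5
right = [8, 9, 10]  # jobs with Priority > 5
print(sdr(memory, left, right))  # the candidate split with the largest SDR wins
```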

DSM algorithms: regression tree steps
2. The sliding window size is a pre-defined parameter.
Example (window size = 5): a leaf under NumOfLoops = 0 / NumOfLoops <> 0 holds the job values [1, 4, 1, 5, 2]; a new job value 3 is added and the oldest value 1 is discarded, giving [4, 1, 5, 2, 3]; the leaf prediction is re-calculated (median = 3).

DSM algorithms: regression tree steps
3. Adaptivity:
- Track the error rate using the statistical Page-Hinkley (PH) test
- On high error, grow a shadow sub-tree and replace the original once the shadow's accuracy is better
Example: under the node Priority <= 5, a high error rate triggers a shadow sub-tree (split on CommandNum <= 10 / CommandNum > 10); the error rates of the two sub-trees are compared until the shadow wins.
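A minimal sketch of the Page-Hinkley change detector on a stream of per-job errors; the delta and lambda thresholds are illustrative assumptions:

```python
class PageHinkley:
    """Signal a change when the cumulative deviation of the error
    exceeds its running minimum by more than lambda_."""
    def __init__(self, delta=0.005, lambda_=50.0):
        self.delta, self.lambda_ = delta, lambda_
        self.mean, self.n = 0.0, 0
        self.cum, self.cum_min = 0.0, 0.0

    def update(self, error):
        self.n += 1
        self.mean += (error - self.mean) / self.n       # running mean of errors
        self.cum += error - self.mean - self.delta      # cumulative deviation
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lambda_   # True => drift detected

ph = PageHinkley()
for err in [1.0] * 200 + [5.0] * 50:  # stable errors, then a concept drift
    if ph.update(err):
        print("drift detected")
        break
```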

DSM algorithms: Hoeffding tree idea
Hoeffding Adaptive Tree (HAT) (Bifet et al., 2009).
Example: the tree splits first on Project (Project = B -> fail) and then, under Project = A, on Command Type (Y -> pass, X -> fail); a new job with Project = A and CommandType = X is therefore predicted to fail.

Entropy & information gain
Entropy(S) = -Σ_{i=1..n} p_i log2(p_i)
Beach-going example (12 days):

Weather  | Go to the beach
Sunny    | Yes, Yes, Yes, No
Overcast | Yes, Yes, No, No
Rain     | No, No, No, No

P(Beach = Yes) = 5/12, P(Beach = No) = 7/12
Entropy(Beach) = -(5/12) log2(5/12) - (7/12) log2(7/12) = 0.98
P(Beach = Yes | Sunny) = 3/4, P(Beach = No | Sunny) = 1/4
Entropy(S_sunny) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.81
Entropy(S_overcast) = 1
Entropy(S_rain) = 0

Entropy & information gain
Entropy(Beach) = 0.98; Entropy(S_sunny) = 0.81; Entropy(S_overcast) = 1; Entropy(S_rain) = 0
P(sunny) = P(overcast) = P(rain) = 4/12
Entropy(Beach | Weather) = P(sunny)*Entropy(S_sunny) + P(overcast)*Entropy(S_overcast) + P(rain)*Entropy(S_rain) = (4/12)(0.81) + (4/12)(1) + (4/12)(0) = 0.6
By knowing the weather, how much information have I gained?
Gain(X, Y) = Entropy(X) - Entropy(X | Y)
Gain = Entropy(Beach) - Entropy(Beach | Weather) = 0.98 - 0.6 = 0.38
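A small sketch reproducing the beach example's entropy and information-gain numbers:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(rows, attr, target):
    """Entropy(target) - Entropy(target | attr) over a list of dict rows."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

days = ([{"weather": "sunny", "beach": "yes"}] * 3 + [{"weather": "sunny", "beach": "no"}] +
        [{"weather": "overcast", "beach": "yes"}] * 2 + [{"weather": "overcast", "beach": "no"}] * 2 +
        [{"weather": "rain", "beach": "no"}] * 4)
print(round(entropy([d["beach"] for d in days]), 2))          # 0.98
print(round(information_gain(days, "weather", "beach"), 2))   # 0.38
```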

DSM algorithms: Hoeffding tree steps
1. Construct a tree using information gain as the split criterion, and the Hoeffding bound statistical test as the stopping condition.
Information gain is calculated for all candidate variables; if G(best attribute) - G(second best) > ε, split the leaf on the best attribute (ε = the Hoeffding bound statistic).
Example: the node is split on the Project variable (Project = A / Project = B).

DSM algorithms: Hoeffding tree steps
2. The sliding window size is dynamic (discussed later...).
Example (window size = 5): the window holds the labels [+, +, +, -, -]; a new job labeled - is added and the oldest + is discarded, giving [+, +, -, -, -]; the prediction is re-calculated.

DSM algorithms: Hoeffding tree steps
3. Adaptivity:
A. Window size changes, similar to MSW (discussed later...)
B. Alternate tree: after a concept drift in the data stream, followed by a stable period, a new alternate tree is generated; the error rate on the new concept is tracked, and if the new tree's error is smaller for time T, the trees are replaced.

DSM algorithms: MSW idea
Multiple Sliding Windows (MSW) (Mimran & Even, 2014).
Example: with the variable set [Project], [Command Type], jobs are divided into four profiles: (A, Y) -> 2GB, (A, X) -> 4GB, (B, Y) -> 4GB, (B, X) -> 10GB. A new job with Project = A and CommandType = X therefore gets a memory prediction of 4GB.

DSM algorithms: MSW
1. Before training the model, find the set of variables that impact memory consumption.
The method used is forward selection, minimizing variance and the number of profiles:

Step | Candidate variables   | Selected variable set
1    | A, B, C, D, E, F, G   | D
2    | A, B, C, E, F, G      | D, A
3    | B, C, E, F, G         | D, A, G
4    | B, C, E, F            | D, A, G, B
5    | C, E, F               | D, A, G, B (no improvement; stop)
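A minimal sketch of greedy forward selection, simplified to minimize only the profile-weighted variance of memory (the full criterion also penalizes the number of profiles; see the DDS formulas in the backup slides). Helper names are illustrative:

```python
import statistics
from collections import defaultdict

def weighted_profile_variance(jobs, variables, target="memory"):
    """Score a variable set: sum over profiles of variance(target) * N_j / N.
    A profile is one distinct combination of the selected variables' values."""
    profiles = defaultdict(list)
    for job in jobs:
        profiles[tuple(job[v] for v in variables)].append(job[target])
    n = len(jobs)
    return sum(statistics.pvariance(vals) * len(vals) / n
               for vals in profiles.values())

def forward_select(jobs, candidates):
    """Greedily add the variable that most reduces the weighted variance;
    stop when no remaining candidate improves the score."""
    selected = []
    best_score = weighted_profile_variance(jobs, selected)
    while candidates:
        score, best = min((weighted_profile_variance(jobs, selected + [c]), c)
                          for c in candidates)
        if score >= best_score:
            break
        selected.append(best)
        candidates.remove(best)
        best_score = score
    return selected
```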

DSM algorithms: MSW
Variable set selection illustration (figure).

DSM algorithms: MSW
2. Set a sliding window per profile (figure: each profile's window is used to predict, then the actual label is observed).

DSM algorithms: MSW
3. Use any given prediction function within the window.
- Objective: maximum saving + a minimum of 95% accuracy
- Chosen strategy: a linear prediction function (φ = 0.95, C = 0.1)
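The slides do not spell the function out, so this is only one plausible reading: predict the φ-quantile of the window's observed memory values, inflated by a safety margin C, which directly targets the 95%-accuracy constraint. All names and the exact formula here are assumptions, not the paper's function:

```python
def window_predict(window_values, phi=0.95, c=0.1):
    """Assumed window predictor: the phi-quantile of observed memory usage,
    inflated by a safety margin c. Not necessarily MSW's exact function."""
    ordered = sorted(window_values)
    idx = min(int(phi * len(ordered)), len(ordered) - 1)  # empirical phi-quantile
    return ordered[idx] * (1 + c)

observed_memory = [3.5, 4.0, 4.2, 3.8, 4.1, 6.0, 4.0]  # GB, one profile's window
print(window_predict(observed_memory))  # covers ~95% of jobs, plus 10% headroom
```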

DSM algorithms: MSW
4. Set the window size dynamically, using a change detector.
Example of concept-drift management for a window of 850 jobs, with a sub-window size parameter of 200 and confidence levels 97.5%, 95%, 90%, 90%:
- Divide the window into sub-windows
- 1st change-detection comparison (97.5% confidence): oldest 200 jobs vs. the remaining 650; not statistically significant, go to the next sub-windows
- 2nd comparison (95% confidence): oldest 400 vs. newest 450; statistically significant, prune the window
- The 400 older observations are discarded, and the newest 450 jobs form the new sliding window
(If the 2nd comparison had not been significant, a 3rd comparison at the 90% confidence level would follow.)

DSM algorithms: MSW
4. Set the window size dynamically, using a change detector.
Change-detector function: the Hoeffding bound (Hoeffding, 1963), also known as the additive Chernoff bound:
ε = sqrt( R² ln(1/δ) / (2n) )
where R is the range of the variable, (1 - δ) is the statistical confidence, and n is the number of examples.
Alternate function: the Kolmogorov-Smirnov test (example figure: Wikipedia, http://en.wikipedia.org/wiki/file:ks_example.png).
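A minimal sketch of the Hoeffding bound, and of using it to compare two sub-windows' means in the spirit of the pruning step above; thresholding the mean difference against the bound of the smaller sub-window is an illustrative simplification:

```python
from math import log, sqrt

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)); the true mean lies within
    epsilon of the sample mean with confidence 1 - delta."""
    return sqrt(value_range ** 2 * log(1.0 / delta) / (2.0 * n))

def windows_differ(older, newer, value_range, delta):
    """Illustrative change check: do two sub-windows' means differ by more
    than the Hoeffding bound at confidence 1 - delta?"""
    diff = abs(sum(older) / len(older) - sum(newer) / len(newer))
    n = min(len(older), len(newer))
    return diff > hoeffding_bound(value_range, delta, n)

older = [4.0] * 200          # stable memory usage in an older sub-window
newer = [6.0] * 200          # shifted usage in the newest sub-window
# 97.5% confidence => delta = 0.025; memory values assumed to range over 16GB
print(windows_differ(older, newer, value_range=16, delta=0.025))  # True => prune
```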

Summary & conclusions

Model | Model type | Sliding windows | Window size | Adaptivity | Change detector
Incremental Online Info-fuzzy Network (IOLIN) (Cohen et al., 2008) | classification | 1 window | heuristic | update network | accuracy degradation
Fast Incremental Regression Tree with Drift Detection (FIRT-DD) (Ikonomovska et al., 2009) | regression | multiple windows | user-defined | shadow model | error PH test
Concept-adapting Very Fast Decision Trees (CVFDT) (Hulten et al., 2001) | classification | 1 window | user-defined | shadow model | Hoeffding bound
Hoeffding Adaptive Tree (HAT) (Bifet et al., 2009) | classification | multiple windows | dynamic | modify window | Hoeffding bound
Multiple Sliding Windows (MSW) (Mimran & Even, 2014) | classification & regression | multiple windows | dynamic | modify window | Hoeffding bound

MSW in production: predicting jobs' memory usage
Deploying the model improved throughput by 10%, by allowing the scheduler to fit more jobs on the available resources.

Can we do the same for the jobs' runtime?
Jobs' runtime behavior is more chaotic compared to memory:
- Some jobs get killed upon startup, e.g., due to configuration issues
- Jobs sharing the same CPU create contention, impacting runtime
- Environmental issues impact runtime, e.g., file-system slowness
- Non-uniform server configurations have different CPU speeds, hyper-threading, etc.
Conclusion: the existing variables do not sufficiently reduce the variance.

Can we do the same for the jobs' runtime?
Proposed approach: predict the extremes.
MAX (improved throughput by 5%):
- 0.5% of jobs consume ~10% of resources, with a high failure rate
- Predict the outliers using large windows, and kill them
MIN (not implemented yet):
- ~50% of jobs run less than 5 minutes
- Predict whether a job's runtime is short, for better-scheduling use cases

References
- Cohen, L., Avrahami, G., Last, M., & Kandel, A. (2008, September). Info-fuzzy algorithms for mining dynamic data streams. Applied Soft Computing, 8(4), 1283-1294.
- Ikonomovska, E., Gama, J., Sebastião, R., & Gjorgjevik, D. (2009). Regression trees from data streams with drift detection. Discovery Science, Lecture Notes in Computer Science (pp. 121-135). Springer Berlin / Heidelberg.
- Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01) (pp. 97-106). New York: ACM.
- Bifet, A., & Gavaldà, R. (2009). Adaptive learning from evolving data streams. In N. Adams, C. Robardet, A. Siebes, & J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII, Lecture Notes in Computer Science (Vol. 5772, pp. 249-260). Berlin / Heidelberg: Springer.
- Mimran, O., & Even, A. (2014). Data stream mining with multiple sliding windows for continuous prediction. Proceedings of the European Conference on Information Systems (ECIS). AISeL.

Thank You

Backup

MSW feature selection
Data Dictionary Selection (DDS) criterion:
DDS = Σ_{j=1..J} σ_j² · N_j / N
Normalized DDS criterion:
V_0 = σ²; P_0 = 1
V_i = Σ_{j=1..J} σ_j² · N_j / N
DDS_i = (V_i - V_{i-1}) / (P_i - P_{i-1})
Normalized DDS criterion with minimum support α: only profiles with N_j >= α contribute, i.e.
V_i = Σ_{j=1..J} [ σ_j² · N_j / N if N_j >= α, else 0 ]
and P_i likewise counts only profiles with N_j >= α.
where:
N - the total number of observations
J - the number of profiles considered
σ_j - the standard deviation of profile j
N_j - the number of observations in profile j
P_i - the number of profiles generated in step i (i.e., distinct value combinations)
i - the step number
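A small sketch of the normalized criterion's bookkeeping across selection steps, under the reading above (V_i from the profile-weighted variances, P_i from the profile count); the helper names and the minimum-support handling are assumptions:

```python
import statistics
from collections import defaultdict

def profile_stats(jobs, variables, target="memory", alpha=0):
    """Return (V, P): weighted variance sum and profile count, skipping
    profiles with fewer than alpha observations (minimum-support variant)."""
    profiles = defaultdict(list)
    for job in jobs:
        profiles[tuple(job[v] for v in variables)].append(job[target])
    n = len(jobs)
    kept = [vals for vals in profiles.values() if len(vals) >= alpha]
    v = sum(statistics.pvariance(vals) * len(vals) / n for vals in kept)
    return v, len(kept)

def normalized_dds_steps(jobs, variable_steps, target="memory"):
    """DDS_i = (V_i - V_{i-1}) / (P_i - P_{i-1}) for each step's variable set."""
    v_prev = statistics.pvariance([j[target] for j in jobs])  # V_0 = overall variance
    p_prev = 1                                                # P_0 = 1
    out = []
    for variables in variable_steps:
        v_i, p_i = profile_stats(jobs, variables, target)
        out.append((v_i - v_prev) / (p_i - p_prev) if p_i != p_prev else float("nan"))
        v_prev, p_prev = v_i, p_i
    return out

# e.g., the steps from the forward-selection slide: D, then D+A, then D+A+G
# print(normalized_dds_steps(jobs, [["D"], ["D", "A"], ["D", "A", "G"]]))
```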

DSM algorithms: MSW
Multiple Sliding Windows (MSW) (Mimran & Even, 2014). MSW strategy:
- Find a variable set which divides the data into a minimal set of profiles (clusters) with minimal variance (done once)
- Set a sliding window per profile
- Use any given prediction function within the windows
- Set the window size dynamically, using a change detector