Data Analytics and Business Intelligence (8696/8697)

Size: px
Start display at page:

Download "Data Analytics and Business Intelligence (8696/8697)"

Transcription

1 http: // togaware. com Copyright 2014, 1/36 Data Analytics and Business Intelligence (8696/8697) Ensemble Decision Trees Data Scientist Australian Taxation Office Adjunct Professor, University of Canberra

2 Overview http: // togaware. com Copyright 2014, 2/36 Overview 1 Overview 2 Multiple Models 3 Boosting Algorithm Example 4 Random Forests Forests of Trees Introduction 5 Other Ensembles Ensembles of Different Models

3 Multiple Models http: // togaware. com Copyright 2014, 3/36 Overview 1 Overview 2 Multiple Models 3 Boosting Algorithm Example 4 Random Forests Forests of Trees Introduction 5 Other Ensembles Ensembles of Different Models

4 Multiple Models http: // togaware. com Copyright 2014, 4/36 Building Multiple Models General idea developed in Multiple Inductive Learning algorithm (Williams 1987). Ideas were developed (ACJ 1987, PhD 1990) in the context of: observe that variable selection methods don t discriminate; so build multiple decision trees; then combine into a single model. Basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation Two approaches covered: Boosting and Random Forests. Meta learners.

5 Boosting http: // togaware. com Copyright 2014, 5/36 Overview 1 Overview 2 Multiple Models 3 Boosting Algorithm Example 4 Random Forests Forests of Trees Introduction 5 Other Ensembles Ensembles of Different Models

6 Boosting Algorithm http: // togaware. com Copyright 2014, 6/36 Boosting Algorithms Basic idea: boost observations that are hard to model. Algorithm: iteratively build weak models using a poor learner: Build an initial model; Identify mis-classified cases in the training dataset; Boost (over-represent) training observations modelled incorrectly; Build a new model on the boosted training dataset; Repeat. The result is an ensemble of weighted models. Best off the shelf model builder. (Leo Brieman)

7 Boosting Algorithm http: // togaware. com Copyright 2014, 7/36 Algorithm in Pseudo Code adaboost <- function(form, data, learner) { w <- rep(1/nrows(data), nrows(data)) e <- NULL a <- NULL m <- list() i <- 0 repeat { i <- i + 1 m <- c(m, learner(form, data, w)) ms <- which(predict(m[i], data)!= data[target(form)]) e <- c(e, sum(w[ms])/sum(w)) a <- c(a, log((1-e[i])/e[i])) w[ms] <- w[ms] * exp(a[i]) if (e[i] >= 0.5) break } return(sum(a * sapply(m, predict, data))) }

8 Boosting Algorithm http: // togaware. com Copyright 2014, 8/36 Distributions Learning Rate 1e+03 1e+01 alpha e^alpha 1e Error Rate epsilon

9 Boosting Example http: // togaware. com Copyright 2014, 9/36 Example: First Iteration n <- 10 w <- rep(1/n, n) # ms <- c(7, 8, 9, 10) e <- sum(w[ms])/sum(w) # 0.4 a <- log((1-e)/e) # w[ms] <- w[ms] * exp(a) #

10 Boosting Example http: // togaware. com Copyright 2014, 10/36 Example: Second Iteration ms <- c(1, 8) # w[ms] ## [1] e <- sum(w[ms])/sum(w) # a <- log((1-e)/e) # (w[ms] <- w[ms] * exp(a)) ## [1]

11 Boosting Example http: // togaware. com Copyright 2014, 11/36 Example: Ada on Weather Data head(weather[c(1:5, 23, 24)], 3) ## Date Location MinTemp MaxTemp Rainfall RISK_MM... ## Canberra ## Canberra ## Canberra set.seed(42) train <- sample(1:nrow(weather), 0.7 * nrow(weather)) (m <- ada(raintomorrow ~., weather[train, -c(1:2, 23)])) ## Call: ## ada(raintomorrow ~., data=weather[train, -c(1:2, 23)]) ## ## Loss: exponential Method: discrete Iteration: 50...

12 Boosting Example http: // togaware. com Copyright 2014, 12/36 Example: Error Rate Notice error rate decreases quickly then flattens. plot(m) Training Error Error Train Iteration 1 to 50

13 Boosting Example http: // togaware. com Copyright 2014, 13/36 Example: Variable Importance Helps understand the knowledge captured. varplot(m) Variable Importance Plot Temp3pm Pressure9am MinTemp Humidity3pm Temp9am MaxTemp Evaporation Cloud9am Humidity9am WindSpeed3pm WindSpeed9am Cloud3pm WindGustSpeed Sunshine Pressure3pm WindGustDir WindDir3pm Rainfall WindDir9am RainToday Score

14 Boosting Example http: // togaware. com Copyright 2014, 14/36 Example: Sample Trees There are 50 trees in all. Here s the first 3. fancyrpartplot(m$model$trees[[1]]) fancyrpartplot(m$model$trees[[2]]) fancyrpartplot(m$model$trees[[3]]) % yes Cloud3pm < 7.5 no % Pressure3pm >= % Humidity3pm < % yes Pressure3pm >= 1012 no % Sunshine >= % yes Pressure3pm >= 1012 no % MaxTemp >= % % 14% % % 12% % % 9% 6% Rattle 2014 Jul 01 20:15:55 gjw Rattle 2014 Jul 01 20:15:55 gjw Rattle 2014 Jul 01 20:15:55 gjw

15 Boosting 0.1 Example http: // togaware. com Copyright 2014, 15/36 Example: Performance predicted <- predict(m, weather[-train,], type="prob")[,2] actual <- weather[-train,]$raintomorrow risks <- weather[-train,]$risk_mm riskchart(predicted, actual, risks) Risk Scores 100 Lift Performance (%) % 20 1 Recall (88%) Risk (94%) 0 Precision Caseload (%)

16 Boosting Example http: // togaware. com Copyright 2014, 16/36 Example Applications ATO Application: What life events affect compliance? First application of the technology 1995 Decision Stumps: Age > NN; Change in Marital Status Boosted Neural Networks OCR using neural networks as base learners Drucker, Schapire, Simard, 1993

17 Boosting Example http: // togaware. com Copyright 2014, 17/36 Summary 1 Boosting is implemented in R in the ada library 2 AdaBoost uses e m ; LogitBoost uses log(1 + e m ); Doom II uses 1 tanh(m) 3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost) 4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve. 5 Can be proved to converge to a perfect model if the learners are always better than chance.

18 Random Forests Forests of Trees http: // togaware. com Copyright 2014, 18/36 Overview 1 Overview 2 Multiple Models 3 Boosting Algorithm Example 4 Random Forests Forests of Trees Introduction 5 Other Ensembles Ensembles of Different Models

19 Random Forests http: // togaware. com Copyright 2014, 19/36 Random Forests Original idea from Leo Brieman and Adele Cutler. The name is Licensed to Salford Systems! Hence, R package is randomforest. Typically presented in context of decision trees. Random Multinomial Logit uses multiple multinomial logit models.

20 Random Forests http: // togaware. com Copyright 2014, 20/36 Random Forests Build many decision trees (e.g., 500). For each tree: Select a random subset of the training set (N); Choose different subsets of variables for each node of the decision tree (m << M); Build the tree without pruning (i.e., overfit) Classify a new entity using every decision tree: Each tree votes for the entity. The decision with the largest number of votes wins! The proportion of votes is the resulting score.

21 Random Forests http: // togaware. com Copyright 2014, 21/36 Example: RF on Weather Data set.seed(42) (m <- randomforest(raintomorrow ~., weather[train, -c(1:2, 23)], na.action=na.roughfix, importance=true)) ## ## Call: ## randomforest(formula=raintomorrow ~., data=weath... ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 4 ## ## OOB estimate of error rate: 13.67% ## Confusion matrix: ## No Yes class.error ## No ## Yes

22 Random Forests http: // togaware. com Copyright 2014, 22/36 Example: Error Rate Error rate decreases quickly then flattens over the 500 trees. plot(m) m Error trees

23 Random Forests http: // togaware. com Copyright 2014, 23/36 Example: Variable Importance Helps understand the knowledge captured. varimpplot(m, main="variable Importance") Variable Importance Sunshine Cloud3pm Pressure3pm Temp3pm WindGustSpeed MaxTemp Pressure9am Temp9am Humidity3pm MinTemp Cloud9am WindSpeed3pm WindSpeed9am Humidity9am WindGustDir Evaporation WindDir9am WindDir3pm Rainfall RainToday Pressure3pm Sunshine Cloud3pm Pressure9am WindGustSpeed Humidity3pm MinTemp Temp3pm Temp9am MaxTemp Humidity9am WindSpeed3pm WindSpeed9am Cloud9am Evaporation WindDir9am WindGustDir WindDir3pm Rainfall RainToday MeanDecreaseAccuracy MeanDecreaseGini

24 Random Forests http: // togaware. com Copyright 2014, 24/36 Example: Sample Trees There are 500 trees in all. Here s some rules from the first tree. ## Random Forest Model 1 ## ## ## Tree 1 Rule 1 Node 30 Decision No ## ## 1: Evaporation <= 9 ## 2: Humidity3pm <= 71 ## 3: Cloud3pm <= 2.5 ## 4: WindDir9am IN ("NNE") ## 5: Sunshine <= ## 6: Temp3pm <= ## ## Tree 1 Rule 2 Node 31 Decision Yes ## ## 1: Evaporation <= 9 ## 2: Humidity3pm <= 71...

25 Random Forests http: // togaware. com Copyright 2014, 25/ Example: Performance predicted <- predict(m, weather[-train,], type="prob")[,2] actual <- weather[-train,]$raintomorrow risks <- weather[-train,]$risk_mm riskchart(predicted, actual, risks) Risk Scores 100 Lift Performance (%) % 20 1 Recall (92%) Risk (97%) 0 Precision Caseload (%)

26 Random Forests http: // togaware. com Copyright 2014, 26/36 Features of Random Forests: By Brieman Most accurate of current algorithms. Runs efficiently on large data sets. Can handle thousands of input variables. Gives estimates of variable importance.

27 Other Ensembles http: // togaware. com Copyright 2014, 27/36 Overview 1 Overview 2 Multiple Models 3 Boosting Algorithm Example 4 Random Forests Forests of Trees Introduction 5 Other Ensembles Ensembles of Different Models

28 Other Ensembles Ensembles of Different Models http: // togaware. com Copyright 2014, 28/36 Other Ensembles Netflix Movie rental business - 100M customer movie ratings $1M for 10% improved root mean square error First annual award (Dec 07) to KorBell (AT&T) 8.43% $50K Aggregate of the best other models! Linear combination of 107 other models A lot of the different model builders deliver similar performance. So why not build one of each model and combine! In Rattle: Generate a Score file from all the models, and reload that into Rattle to explore.

29 Other Ensembles Ensembles of Different Models http: // togaware. com Copyright 2014, 29/36 Build a Model of Each Type ds <- weather[train, -c(1:2, 23)] form <- RainTomorrow ~. m.rp <- rpart(form, data=ds) m.ada <- ada(form, data=ds) m.rf <- randomforest(form, data=ds, na.action=na.roughfix, importance=true) m.svm <- ksvm(form, data=ds, kernel="rbfdot", prob.model=true) m.glm <- glm(form, data=ds, family=binomial(link="logit")) m.nn <- nnet(form, data=ds, size=10, skip=true, MaxNWts=10000, trace=false, maxit=100)

30 Other Ensembles Ensembles of Different Models http: // togaware. com Copyright 2014, 30/36 Calculate Probabilities ds <- weather[-train, -c(1:2, 23)] ds <- na.omit(ds, "na.action") pr <- data.frame( obs=row.names(ds), rp=predict(m.rp, ds)[,2], ada=predict(m.ada, ds, type="prob")[,2], rf=predict(m.rf, ds, type="prob")[,2], svm=predict(m.svm, ds, type="probabilities")[,2], glm=predict(m.glm, type="response", ds), nn=predict(m.nn, ds)) prw <- pr

31 Other Ensembles Ensembles of Different Models http: // togaware. com Copyright 2014, 31/36 Plots Weather Dataset count 50 count 50 count rp ada rf count 50 count 50 count svm glm nn

32 Other Ensembles Ensembles of Different Models http: // togaware. com Copyright 2014, 32/36 Plots Audit Dataset count count count rp ada rf 250 count count count svm glm nn

33 Other Ensembles Ensembles of Different Models http: // togaware. com Copyright 2014, 33/36 Correlation of Scores The correlations between scores obtained by the different models suggest quite an overlap in their abilities to extract the same knowledge. Weather Audit rp ada rf svm glm rp ada rf svm glm rp rp ada ada rf rf svm svm glm glm

34 Summary http: // togaware. com Copyright 2014, 34/36 Overview 1 Overview 2 Multiple Models 3 Boosting Algorithm Example 4 Random Forests Forests of Trees Introduction 5 Other Ensembles Ensembles of Different Models

35 Summary Resources http: // togaware. com Copyright 2014, 35/36 Reference Book Data Mining with Rattle and R Graham Williams 2011, Springer, Use R! ISBN: Chapters 12 and 13.

36 Summary Summary http: // togaware. com Copyright 2014, 36/36 Summary Ensemble: Multiple models working together Often better than a single model Variance and bias of the model are reduced The best available models today - accurate and robust In daily use in very many areas of application

37 Summary Summary http: // togaware. com Copyright 2014, 36/36 Summary Ensemble: Multiple models working together Often better than a single model Variance and bias of the model are reduced The best available models today - accurate and robust In daily use in very many areas of application

Data Science with R Ensemble of Decision Trees

Data Science with R Ensemble of Decision Trees Data Science with R Ensemble of Decision Trees Graham.Williams@togaware.com 3rd August 2014 Visit http://handsondatascience.com/ for more Chapters. The concept of building multiple decision trees to produce

More information

Data Science with R. Introducing Data Mining with Rattle and R. Graham.Williams@togaware.com

Data Science with R. Introducing Data Mining with Rattle and R. Graham.Williams@togaware.com http: // togaware. com Copyright 2013, Graham.Williams@togaware.com 1/35 Data Science with R Introducing Data Mining with Rattle and R Graham.Williams@togaware.com Senior Director and Chief Data Miner,

More information

How To Understand Data Mining In R And Rattle

How To Understand Data Mining In R And Rattle http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/40 Data Analytics and Business Intelligence (8696/8697) Introducing Data Science with R and Rattle Graham.Williams@togaware.com Chief

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

1 R: A Language for Data Mining. 2 Data Mining, Rattle, and R. 3 Loading, Cleaning, Exploring Data in Rattle. 4 Descriptive Data Mining

1 R: A Language for Data Mining. 2 Data Mining, Rattle, and R. 3 Loading, Cleaning, Exploring Data in Rattle. 4 Descriptive Data Mining A Data Mining Workshop Excavating Knowledge from Data Introducing Data Mining using R Graham.Williams@togaware.com Data Scientist Australian Taxation Office Adjunct Professor, Australian National University

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Data Analytics and Business Intelligence (8696/8697)

Data Analytics and Business Intelligence (8696/8697) http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/34 Data Analytics and Business Intelligence (8696/8697) Introducing and Interacting with R Graham.Williams@togaware.com Chief Data

More information

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 - Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Hands-On Data Science with R Dealing with Big Data. Graham.Williams@togaware.com. 27th November 2014 DRAFT

Hands-On Data Science with R Dealing with Big Data. Graham.Williams@togaware.com. 27th November 2014 DRAFT Hands-On Data Science with R Dealing with Big Data Graham.Williams@togaware.com 27th November 2014 Visit http://handsondatascience.com/ for more Chapters. In this module we explore how to load larger datasets

More information

Classification and Regression by randomforest

Classification and Regression by randomforest Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

More information

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

Didacticiel Études de cas

Didacticiel Études de cas 1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

More information

Decompose Error Rate into components, some of which can be measured on unlabeled data

Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

More information

How Boosting the Margin Can Also Boost Classifier Complexity

How Boosting the Margin Can Also Boost Classifier Complexity Lev Reyzin lev.reyzin@yale.edu Yale University, Department of Computer Science, 51 Prospect Street, New Haven, CT 652, USA Robert E. Schapire schapire@cs.princeton.edu Princeton University, Department

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De

More information

Delivery of an Analytics Capability

Delivery of an Analytics Capability Delivery of an Analytics Capability Ensembles and Model Delivery for Tax Compliance Graham Williams Senior Director and Chief Data Miner, Analytics Office of the Chief Knowledge Officer Australian Taxation

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

How To Do Data Mining In R

How To Do Data Mining In R Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how

More information

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008 Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Package acrm. R topics documented: February 19, 2015

Package acrm. R topics documented: February 19, 2015 Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

Package trimtrees. February 20, 2015

Package trimtrees. February 20, 2015 Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,

More information

Package missforest. February 20, 2015

Package missforest. February 20, 2015 Type Package Package missforest February 20, 2015 Title Nonparametric Missing Value Imputation using Random Forest Version 1.4 Date 2013-12-31 Author Daniel J. Stekhoven Maintainer

More information

Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

More information

How To Choose A Churn Prediction

How To Choose A Churn Prediction Assessing classification methods for churn prediction by composite indicators M. Clemente*, V. Giner-Bosch, S. San Matías Department of Applied Statistics, Operations Research and Quality, Universitat

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Data Science Using Open Souce Tools Decision Trees and Random Forest Using R

Data Science Using Open Souce Tools Decision Trees and Random Forest Using R Data Science Using Open Souce Tools Decision Trees and Random Forest Using R Jennifer Evans Clickfox jennifer.evans@clickfox.com January 14, 2014 Jennifer Evans (Clickfox) Twitter: JenniferE CF January

More information

Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

More information

Boosting. riedmiller@informatik.uni-freiburg.de

Boosting. riedmiller@informatik.uni-freiburg.de . Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Active Learning with Boosting for Spam Detection

Active Learning with Boosting for Spam Detection Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Ensemble Approach for the Classification of Imbalanced Data

Ensemble Approach for the Classification of Imbalanced Data Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au

More information

Training Methods for Adaptive Boosting of Neural Networks for Character Recognition

Training Methods for Adaptive Boosting of Neural Networks for Character Recognition Submission to NIPS*97, Category: Algorithms & Architectures, Preferred: Oral Training Methods for Adaptive Boosting of Neural Networks for Character Recognition Holger Schwenk Dept. IRO Université de Montréal

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Predictive Modeling and Big Data

Predictive Modeling and Big Data Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

Why do statisticians "hate" us?

Why do statisticians hate us? Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Data Analytics and Business Intelligence (8696/8697)

Data Analytics and Business Intelligence (8696/8697) http: // togaware. com Copyright 2014, Graham.Williams@togaware.com 1/39 Data Analytics and Business Intelligence (8696/8697) Introducing Data Mining Graham.Williams@togaware.com Chief Data Scientist Australian

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

More information

DATA MINING SPECIES DISTRIBUTION AND LANDCOVER. Dawn Magness Kenai National Wildife Refuge

DATA MINING SPECIES DISTRIBUTION AND LANDCOVER. Dawn Magness Kenai National Wildife Refuge DATA MINING SPECIES DISTRIBUTION AND LANDCOVER Dawn Magness Kenai National Wildife Refuge Why Data Mining Random Forest Algorithm Examples from the Kenai Species Distribution Model Pattern Landcover Model

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

II. RELATED WORK. Sentiment Mining

II. RELATED WORK. Sentiment Mining Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

Stock Market Forecasting Using Machine Learning Algorithms

Stock Market Forecasting Using Machine Learning Algorithms Stock Market Forecasting Using Machine Learning Algorithms Shunrong Shen, Haomiao Jiang Department of Electrical Engineering Stanford University {conank,hjiang36}@stanford.edu Tongda Zhang Department of

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Open-Source Machine Learning: R Meets Weka

Open-Source Machine Learning: R Meets Weka Open-Source Machine Learning: R Meets Weka Kurt Hornik Christian Buchta Achim Zeileis Weka? Weka is not only a flightless endemic bird of New Zealand (Gallirallus australis, picture from Wekapedia) but

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

FilterBoost: Regression and Classification on Large Datasets

FilterBoost: Regression and Classification on Large Datasets FilterBoost: Regression and Classification on Large Datasets Joseph K. Bradley Machine Learning Department Carnegie Mellon University Pittsburgh, PA 523 jkbradle@cs.cmu.edu Robert E. Schapire Department

More information

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati

More information

On the application of multi-class classification in physical therapy recommendation

On the application of multi-class classification in physical therapy recommendation RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation

More information