Data Mining Practical Machine Learning Tools and Techniques



Ensemble learning
Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

- Combining multiple models
- Bagging: the basic idea; bias-variance decomposition; bagging with costs
- Randomization: random forests, rotation forests
- Boosting: AdaBoost, the power of boosting
- Additive regression: numeric prediction, additive logistic regression
- Interpretable ensembles: option trees, alternating decision trees, logistic model trees
- Stacking

Combining multiple models
- Basic idea: build different "experts" and let them vote
- Advantage: often improves predictive performance
- Disadvantage: usually produces output that is very hard to analyze
  - But: there are approaches that aim to produce a single comprehensible structure

Bagging
- Combining predictions by voting/averaging
  - Simplest way; each model receives equal weight
- "Idealized" version:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- If the learning scheme is unstable, this almost always improves performance
  - A small change in the training data can make a big change in the model (e.g. decision trees)
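To make equal-weight voting and averaging concrete, here is a minimal sketch (not from the slides) that combines already-trained classifiers; the function names are mine, and it assumes scikit-learn-style models with integer-coded labels and, for averaging, a shared class ordering.

```python
# Minimal sketch: equal-weight combination of already-trained classifiers.
# Assumes integer-coded class labels and, for averaging, that every model
# reports probabilities over the same class ordering (illustrative only).
import numpy as np

def combine_by_voting(models, X):
    """Each model gets one equal-weight vote; the most frequent label wins."""
    preds = np.array([m.predict(X) for m in models])          # shape: (t, n_instances)
    return np.array([np.bincount(col.astype(int)).argmax() for col in preds.T])

def combine_by_averaging(models, X):
    """Average the class-probability estimates and pick the most probable class."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)
```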

Bias-variance decomposition
- Used to analyze how much the selection of any specific training set affects performance
- Assume infinitely many classifiers, built from different training sets of size n
- For any learning scheme:
  - Bias = expected error of the combined classifier on new data
  - Variance = expected error due to the particular training set used
- Total expected error = bias + variance

More on bagging
- Bagging works because it reduces variance by voting/averaging
  - Note: in some pathological hypothetical situations the overall error might increase
  - Usually, the more classifiers the better
- Problem: we only have one dataset!
- Solution: generate new ones of size n by sampling from it with replacement
- Can help a lot if the data is noisy
- Can also be applied to numeric prediction
  - Aside: the bias-variance decomposition was originally only known for numeric prediction

Bagging classifiers
Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances from the training set (with replacement).
    Apply the learning algorithm to the sample.
    Store the resulting model.
Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that is predicted most often.

Bagging with costs
- Bagging unpruned decision trees is known to produce good probability estimates
  - Here, instead of voting, the individual classifiers' probability estimates are averaged
  - Note: this can also improve the success rate
- Can use this with the minimum-expected-cost approach for learning problems with costs
- Problem: not interpretable
  - MetaCost re-labels the training data using bagging with costs and then builds a single tree
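A minimal sketch of the bagging pseudocode above, using scikit-learn decision trees as the unstable base learner; the function names and parameter defaults are my own.

```python
# Minimal sketch of bagging as in the pseudocode above (bootstrap sampling
# plus voting), with decision trees as the unstable base learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=50, seed=0):
    """Model generation: t trees, each trained on a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)     # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: return the class predicted most often by the t models."""
    preds = np.array([m.predict(X) for m in models])
    return np.array([np.bincount(col.astype(int)).argmax() for col in preds.T])
```

For cost-sensitive problems, the classification step could instead average the trees' probability estimates and pick the class with minimum expected cost, along the lines of the "Bagging with costs" slide.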

Randomization
- Can randomize the learning algorithm instead of the input
- Some algorithms already have a random component: e.g. initial weights in a neural net
- Most algorithms can be randomized, e.g. greedy algorithms:
  - Pick from the N best options at random instead of always picking the best option
  - E.g.: attribute selection in decision trees
- More generally applicable than bagging: e.g. random subsets in a nearest-neighbor scheme
- Can be combined with bagging

Random forests, rotation forests
- Bagging creates ensembles of accurate classifiers with relatively low diversity
  - Bootstrap sampling creates training sets with a distribution that resembles the original data
- Randomness in the learning algorithm increases diversity but sacrifices accuracy of individual ensemble members
- Random forests and rotation forests have the goal of creating accurate and diverse ensemble members

Random forests
- Combine random attribute sets and bagging to generate an ensemble of decision trees
- An iteration involves:
  - Generating a bootstrap sample
  - When constructing the decision tree, randomly choosing one of the k best attributes

Rotation forests
- Combine random attribute sets, bagging and principal components to generate an ensemble of decision trees
- An iteration involves:
  - Randomly dividing the input attributes into k disjoint subsets
  - Applying PCA to each of the k subsets in turn
  - Learning a decision tree from the k sets of PCA directions
- Further increases in diversity can be achieved by creating a bootstrap sample in each iteration before applying PCA
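As one way to picture how bagging and randomized attribute selection combine, here is a hedged random-forest-style sketch; it uses scikit-learn's max_features option to restrict the attributes considered at each split, which is one common reading of the random attribute choice above, and all names and parameter values are mine.

```python
# Minimal random-forest-style sketch: bootstrap sampling (bagging) plus a
# random subset of attributes considered at every split (via max_features).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, t=100, k="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample
        tree = DecisionTreeClassifier(
            max_features=k,                               # random attribute subset per split
            random_state=int(rng.integers(2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    preds = np.array([t.predict(X) for t in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in preds.T])
```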

Boosting
- Also uses voting/averaging, but weights models according to their performance
- Iterative: new models are influenced by the performance of previously built ones
  - Encourage the new model to become an expert for instances misclassified by earlier models
  - Intuitive justification: models should be experts that complement each other
- Several variants exist

AdaBoost.M1
Model generation:
  Assign equal weight to each training instance.
  For t iterations:
    Apply the learning algorithm to the weighted dataset, store the resulting model.
    Compute the model's error e on the weighted dataset.
    If e = 0 or e >= 0.5: terminate model generation.
    For each instance in the dataset:
      If classified correctly by the model: multiply the instance's weight by e/(1-e).
    Normalize the weights of all instances.
Classification:
  Assign weight = 0 to all classes.
  For each of the t (or fewer) models:
    Add -log(e/(1-e)) to the weight of the class this model predicts.
  Return the class with the highest weight.

More on boosting I
- Boosting needs weights... but:
  - Can adapt the learning algorithm, or
  - Can apply boosting without weights: resample with probability determined by the weights
    - Disadvantage: not all instances are used
    - Advantage: if error > 0.5, can resample again
- Stems from computational learning theory
- Theoretical result: training error decreases exponentially
- Also works if the base classifiers are not too complex and their error doesn't become too large too quickly

More on boosting II
- Continue boosting after training error = 0?
- Puzzling fact: generalization error continues to decrease!
  - Seems to contradict Occam's Razor
- Explanation: consider the margin (confidence), not the error
  - Difference between the estimated probability for the true class and the nearest other class (between -1 and 1)
- Boosting works with weak learners; only condition: error doesn't exceed 0.5
- In practice, boosting sometimes overfits (in contrast to bagging)
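The following is a minimal sketch of the AdaBoost.M1 pseudocode above, with decision stumps as the weak learner; the helper names and the choice of stump are illustrative assumptions, not part of the slides.

```python
# Minimal AdaBoost.M1 sketch: reweight instances, stop when e = 0 or
# e >= 0.5, and vote with model weight -log(e/(1-e)).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=20):
    w = np.full(len(X), 1.0 / len(X))                 # equal initial weights
    models, model_weights = [], []
    for _ in range(t):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = w[pred != y].sum() / w.sum()              # error on the weighted dataset
        if e == 0 or e >= 0.5:                        # terminate model generation
            break
        w[pred == y] *= e / (1 - e)                   # shrink weights of correct instances
        w /= w.sum()                                  # normalize weights
        models.append(stump)
        model_weights.append(-np.log(e / (1 - e)))
    return models, model_weights

def adaboost_m1_predict(models, model_weights, X, classes):
    votes = np.zeros((len(X), len(classes)))
    for m, a in zip(models, model_weights):
        pred = m.predict(X)
        for j, c in enumerate(classes):
            votes[pred == c, j] += a                  # add model weight to predicted class
    return np.asarray(classes)[votes.argmax(axis=1)]
```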

Additive regression I
- It turns out that boosting is a greedy algorithm for fitting additive models
  - More specifically, it implements forward stagewise additive modeling
- The same kind of algorithm can be used for numeric prediction:
  1. Build a standard regression model (e.g. a tree)
  2. Gather the residuals, learn a model predicting the residuals (e.g. a tree), and repeat
- To predict, simply sum up the individual predictions from all models

Additive regression II
- Minimizes the squared error of the ensemble if the base learner minimizes squared error
- Doesn't make sense to use it with standard multiple linear regression, why?
  (The sum of linear models is itself a linear model, and the first model already minimizes squared error.)
- Can use it with simple linear regression to build a multiple linear regression model
- Use cross-validation to decide when to stop
- Another trick: shrink the predictions of the base models by multiplying with a positive constant < 1
  - Caveat: need to start with model 0 that predicts the mean

Additive logistic regression
- Can use the logit transformation to get an algorithm for classification
  - More precisely: class probability estimation
  - The probability estimation problem is transformed into a regression problem
  - A regression scheme is used as the base learner (e.g. a regression tree learner)
- Can use a forward stagewise algorithm: at each stage, add the model that maximizes the probability of the data
- If f_j is the jth regression model, the ensemble predicts the probability
  p(1 | a) = 1 / (1 + exp(-sum_j f_j(a)))
  for the first class

LogitBoost
Model generation:
  For j = 1 to t iterations:
    For each instance a[i]:
      Set the target value for the regression to
        z[i] = (y[i] - p(1 | a[i])) / [p(1 | a[i]) * (1 - p(1 | a[i]))]
      Set the weight of instance a[i] to w[i] = p(1 | a[i]) * (1 - p(1 | a[i]))
    Fit a regression model f[j] to the data with class values z[i] and weights w[i].
Classification:
  Predict the first class if p(1 | a) > 0.5, otherwise predict the second class.

- Maximizes the probability if the base learner minimizes squared error
- Difference to AdaBoost: optimizes probability/likelihood instead of exponential loss
- Can be adapted to multi-class problems
- Shrinking and cross-validation-based selection apply here too
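To illustrate the numeric forward stagewise procedure (start from the mean, then repeatedly fit the residuals and shrink), here is a minimal sketch with regression trees as the base learner; the names, depth, and shrinkage value are illustrative assumptions.

```python
# Minimal sketch of forward stagewise additive regression: model 0 predicts
# the mean, then each regression tree is fit to the current residuals and
# its contribution is shrunk by a constant < 1.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def additive_regression_fit(X, y, t=50, shrinkage=0.5, max_depth=3):
    mean = float(np.mean(y))                                # model 0: predict the mean
    residual = y - mean
    models = []
    for _ in range(t):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - shrinkage * tree.predict(X)   # what is still unexplained
        models.append(tree)
    return mean, models

def additive_regression_predict(mean, models, X, shrinkage=0.5):
    return mean + shrinkage * sum(m.predict(X) for m in models)
```

In practice the number of iterations t would be chosen by cross-validation, as the slide recommends.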

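And a matching sketch of the two-class LogitBoost pseudocode above, fitting regression trees to the working responses z[i] with weights w[i]; clipping the weights away from zero is my own safeguard, and y is assumed to be coded 1 for the first class and 0 for the second.

```python
# Minimal two-class LogitBoost sketch following the pseudocode above:
# p(1|a) = 1 / (1 + exp(-sum_j f_j(a))), z = (y - p) / (p*(1-p)), w = p*(1-p).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost_fit(X, y, t=30, max_depth=2):
    F = np.zeros(len(X))                                   # additive model, initially 0
    models = []
    for _ in range(t):
        p = 1.0 / (1.0 + np.exp(-F))                       # current p(1 | a[i])
        w = np.clip(p * (1 - p), 1e-6, None)               # instance weights (clipped)
        z = (y - p) / w                                    # regression targets
        f = DecisionTreeRegressor(max_depth=max_depth).fit(X, z, sample_weight=w)
        F += f.predict(X)
        models.append(f)
    return models

def logitboost_predict(models, X):
    F = sum(m.predict(X) for m in models)
    p = 1.0 / (1.0 + np.exp(-F))
    return (p > 0.5).astype(int)                           # 1 = first class, 0 = second
```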
Option trees
- Ensembles are not interpretable
- Can we generate a single model?
  - One possibility: "cloning" the ensemble by using lots of artificial data that is labeled by the ensemble
  - Another possibility: generating a single structure that represents the ensemble in a compact fashion
- Option tree: a decision tree with option nodes
  - Idea: follow all possible branches at an option node
  - Predictions from the different branches are merged using voting or by averaging probability estimates

Example (option tree figure not included in the transcription)
- Can be learned by modifying a tree learner:
  - Create an option node if there are several equally promising splits (within a user-specified interval)
  - When pruning, the error at an option node is the average error of the options

Alternating decision trees
- Can also grow an option tree by incrementally adding nodes to it
- The structure is called an alternating decision tree, with splitter nodes and prediction nodes
  - Prediction nodes are leaves if no splitter nodes have been added to them (yet)
  - The standard alternating tree applies to 2-class problems
- To obtain a prediction, filter the instance down all applicable branches and sum the predictions
  - Predict one class or the other depending on whether the sum is positive or negative

Example (alternating tree figure not included in the transcription)
- E.g. for the instance (overcast, high, false): -0.255 + 0.213 + 0.486 - 0.331 = 0.113
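To show how the summed prediction in the example above comes about, here is a minimal sketch of filtering an instance down an alternating decision tree and adding up the prediction-node scores; the toy tree structure, attribute names, and the assignment of scores to particular branches are my own assumptions, and only the four numbers come from the slide.

```python
# Minimal sketch of prediction in an alternating decision tree: splitter
# nodes route the instance, prediction nodes contribute additive scores,
# and all applicable branches are followed and summed.

def adtree_score(node, instance):
    """Recursively sum prediction-node scores along all applicable branches."""
    score = node.get("score", 0.0)
    for splitter in node.get("splitters", []):
        branch = "yes" if splitter["test"](instance[splitter["attribute"]]) else "no"
        score += adtree_score(splitter[branch], instance)
    return score

# Toy alternating tree (hypothetical attributes, thresholds, and branch scores):
tree = {
    "score": -0.255,                       # root prediction node
    "splitters": [
        {"attribute": "outlook", "test": lambda v: v == "overcast",
         "yes": {"score": 0.486}, "no": {"score": -0.1}},
        {"attribute": "humidity", "test": lambda v: v == "high",
         "yes": {"score": 0.213}, "no": {"score": 0.05}},
        {"attribute": "windy", "test": lambda v: v is False,
         "yes": {"score": -0.331}, "no": {"score": 0.2}},
    ],
}

x = {"outlook": "overcast", "humidity": "high", "windy": False}
total = adtree_score(tree, x)              # -0.255 + 0.486 + 0.213 - 0.331 = 0.113
print("class 1" if total > 0 else "class 2", round(total, 3))
```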

Growing alternating trees
- The tree is grown using a boosting algorithm, e.g. LogitBoost as described earlier
- Assume that the base learner produces a single conjunctive rule in each boosting iteration (note: a rule for regression)
- Each rule could simply be added into the tree, including the numeric prediction obtained from the rule
  - Problem: the tree would grow very large very quickly
  - Solution: the base learner should only consider candidate rules that extend existing branches
- An extension adds a splitter node and two prediction nodes (assuming binary splits)
- The standard algorithm chooses the best extension among all possible extensions applicable to the tree
  - More efficient heuristics can be employed instead

Logistic model trees
- Option trees may still be difficult to interpret
- Can also use boosting to build decision trees with linear models at the leaves (i.e. trees without options)
- Algorithm for building logistic model trees:
  - Run LogitBoost with simple linear regression as the base learner (choosing the best attribute in each iteration)
  - Interrupt boosting when the cross-validated performance of the additive model no longer increases
  - Split the data (e.g. as in C4.5) and resume boosting in the subsets of the data
  - Prune the tree using the cross-validation-based pruning strategy from the CART tree learner

Stacking
- To combine the predictions of base learners, don't vote: use a meta learner
  - Base learners: level-0 models
  - Meta learner: level-1 model
  - Predictions of the base learners are the input to the meta learner
- Base learners are usually different schemes
- Can't use predictions on the training data to generate data for the level-1 model!
  - Instead use a cross-validation-like scheme
- Hard to analyze theoretically: "black magic"

More on stacking
- If the base learners can output probabilities, use those as input to the meta learner instead
- Which algorithm to use for the meta learner?
  - In principle, any learning scheme
  - Prefer a relatively global, smooth model
    - Base learners do most of the work
    - Reduces the risk of overfitting
- Stacking can be applied to numeric prediction too
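Finally, a minimal sketch of stacking with a cross-validation-like scheme: the level-0 models' out-of-fold probability estimates form the level-1 training data, and a relatively global, smooth model (logistic regression) serves as the meta learner; the specific level-0 schemes, fold count, and function names are illustrative assumptions.

```python
# Minimal stacking sketch: level-0 models' cross-validated probability
# estimates become the inputs of a level-1 meta learner.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, level0=None, folds=10):
    level0 = level0 or [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
    # Level-1 training data comes from predictions made by models that did
    # not see the instance during training (cross-validation-like scheme).
    meta_X = np.hstack([cross_val_predict(m, X, y, cv=folds, method="predict_proba")
                        for m in level0])
    level1 = LogisticRegression(max_iter=1000).fit(meta_X, y)  # global, smooth meta learner
    level0 = [m.fit(X, y) for m in level0]                     # refit level-0 on all data
    return level0, level1

def stacking_predict(level0, level1, X):
    meta_X = np.hstack([m.predict_proba(X) for m in level0])
    return level1.predict(meta_X)
```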