Ensemble Methods. Adapted from slides by Todd Holloway. http://abeautifulwww.com/2007/11/23/ensemble-machine-learning-tutorial/



Outline
- The Netflix Prize
- Success of ensemble methods in the Netflix Prize
- Why ensemble methods work
- Algorithms: bagging, random forests, AdaBoost

Definition. Ensemble classification: the aggregation of predictions from multiple classifiers with the goal of improving accuracy.

Teaser: How good are ensemble methods? Let's look at the Netflix Prize competition.

Began October 2006. A supervised learning task: the training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies. Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as 1, 2, 3, 4, or 5 stars. A $1 million prize was offered for a 10% improvement over Netflix's own movie recommender/classifier (RMSE = 0.9514).

Just three weeks after it began, at least 40 teams had bested the Netflix classifier. Top teams showed about 5% improvement.

However, improvement slowed. (Progress chart from http://www.research.att.com/~volinsky/netflix/)

Today, the top team has posted an 8.5% improvement. Ensemble methods are the best performers.

Rookies: "Thanks to Paul Harrison's collaboration, a simple mix of our solutions improved our result from 6.31 to 6.75."

Arek Paterek: "My approach is to combine the results of many methods (also two-way interactions between them) using linear regression on the test set. The best method in my ensemble is regularized SVD with biases, post-processed with kernel ridge regression." http://rainbow.mimuw.edu.pl/~ap/ap_kdd.pdf

U of Toronto: "When the predictions of multiple RBM models and multiple SVD models are linearly combined, we achieve an error rate that is well over 6% better than the score of Netflix's own system." http://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf

Gravity: home.mit.bme.hu/~gtakacs/download/gravity.pdf

When Gravity and Dinosaurs Unite: "Our common team blends the result of team Gravity and team Dinosaur Planet." You might have guessed that from the name.

BellKor / KorBell: and yes, the top team, which is from AT&T: "Our final solution (RMSE = 0.8712) consists of blending 107 individual results."

The winner was an ensemble of ensembles (including BellKor).

Some Intuitions on Why Ensemble Methods Work

Intuitions
- Utility of combining diverse, independent opinions in human decision-making
- Protective mechanism (e.g. stock portfolio diversity)
- Violation of Ockham's Razor: identifying the best model requires identifying the proper "model complexity" (see Domingos, P., "Occam's two razors: the sharp and the blunt," KDD 1998)

Intuitions: majority vote. Suppose we have 5 completely independent classifiers, each with 70% accuracy. The majority vote is then correct with probability 10(.7^3)(.3^2) + 5(.7^4)(.3) + (.7^5) ≈ 83.7%. With 101 such classifiers, majority vote accuracy reaches 99.9%.
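The arithmetic above is just a binomial tail sum, and it can be checked in a few lines of Python (a sketch assuming, as the slide does, fully independent classifiers with equal accuracy):

```python
from math import comb

def majority_vote_accuracy(n, p):
    """P(a majority of n independent classifiers is correct),
    each classifier being correct independently with probability p."""
    k_min = n // 2 + 1  # smallest winning majority (n odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ≈ 0.837, matching the slide
print(majority_vote_accuracy(101, 0.7))  # > 0.999
```

Note how quickly the vote sharpens as n grows; the catch, of course, is that real classifiers trained on the same data are far from independent.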

Strategies
- Bagging: use different samples of the observations and/or predictors (features) to generate diverse classifiers, then aggregate them (average in regression, majority vote in classification).
- Boosting: make examples currently misclassified more important (or less, in some cases).

Bagging (Constructing for Diversity)
1. Use random samples of the examples to construct the classifiers (bagging: bootstrap aggregation, due to Leo Breiman).
2. Use random feature sets to construct the classifiers (random decision forests).

Bootstrap: consider the following situation. We have a random sample x = (x_1, ..., x_N) from an unknown probability distribution F. We wish to estimate a parameter θ = t(F), and we build the estimate θ̂ = s(x). What is the standard deviation of θ̂? Examples: (1) estimate the mean and s.d. of the expected prediction error; (2) estimate point-wise confidence bands in smoothing.

Bootstrap: it is completely automatic, requires no theoretical calculations, is not based on asymptotic results, and is available regardless of how complicated the estimator θ̂ is.

Bootstrap algorithm:
1. Select B independent bootstrap samples x*_1, ..., x*_B, each consisting of N data values drawn with replacement from x.
2. Evaluate the bootstrap replication corresponding to each bootstrap sample: θ̂*(b) = s(x*_b), b = 1, ..., B.
3. Estimate the standard error of θ̂ by the sample standard error of the B replications.
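The three steps can be sketched in a few lines of NumPy (a toy illustration; the function name `bootstrap_se` and the standard-normal sample are invented for this example):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(x, stat, B=1000):
    """Steps 1-3: draw B bootstrap samples with replacement, evaluate the
    statistic on each, and return the sample s.d. of the B replications."""
    n = len(x)
    reps = np.array([stat(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    return reps.std(ddof=1)

x = rng.normal(size=100)          # toy data with a known answer
se_hat = bootstrap_se(x, np.mean)
# For the mean of n = 100 standard-normal draws, theory gives s.e. ≈ 1/sqrt(100) = 0.1,
# and the bootstrap recovers roughly that value with no calculation at all.
```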

Bagging: use the bootstrap to improve predictions.
1. Create bootstrap samples and estimate the model on each bootstrap sample.
2. Aggregate the predictions (average if regression, majority vote if classification).
This works best when perturbing the training set can cause significant changes in the estimated model. For instance, for least squares one can show that the variance is decreased while the bias is unchanged.
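As an illustration only (the 1-D decision-stump base learner and all function names here are invented for the sketch), the two steps look like this with bootstrap sampling plus majority vote:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Best 1-D decision stump: predict 1 when sign * (x - t) > 0, else 0."""
    best_err, best = np.inf, (0.0, 1)
    for t in np.unique(X):
        for sign in (1, -1):
            err = np.mean(((sign * (X - t) > 0).astype(int)) != y)
            if err < best_err:
                best_err, best = err, (t, sign)
    return best

def predict_stump(stump, X):
    t, sign = stump
    return (sign * (X - t) > 0).astype(int)

def bagged_fit(X, y, B=25):
    """Step 1: fit one stump per bootstrap sample."""
    n = len(X)
    stumps = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample n cases with replacement
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def bagged_predict(stumps, X):
    """Step 2: aggregate the individual predictions by majority vote."""
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)

# Noisy toy problem: the true rule is x > 0, but 15% of training labels are flipped.
X_train = rng.uniform(-1, 1, 200)
y_train = (X_train > 0).astype(int)
flip = rng.random(200) < 0.15
y_train[flip] = 1 - y_train[flip]

stumps = bagged_fit(X_train, y_train)
X_test = np.linspace(-1, 1, 101)
acc = float(np.mean(bagged_predict(stumps, X_test) == (X_test > 0).astype(int)))
```

The label noise perturbs each bootstrap fit differently, so the individual stump thresholds scatter around 0; the vote averages that scatter away, which is exactly the variance reduction the slide describes.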

Random forests: at every level, choose a random subset of the variables (predictors, not examples) and choose the best split among those attributes.

Random forests. Let the number of training points be M, and the number of variables in the classifier be N. For each tree:
1. Choose a training set by sampling M times with replacement from the M available training cases.
2. At each node, randomly choose m of the N variables on which to base the decision at that node, and calculate the best split based on these.

Random forests: grow each tree as deep as possible (no pruning!). Out-of-bag data can be used to estimate cross-validation error: for each training point, get a prediction by averaging the trees whose bootstrap samples do not include that point. Variable importance measures are easy to calculate.
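The out-of-bag bookkeeping can be sketched with plain NumPy (the sizes `n` and `B` are arbitrary for the demo): each bootstrap sample misses roughly a third of the training points, and those points act as held-out test cases for that tree.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 100, 200  # hypothetical: 100 training points, 200 trees

oob_count = np.zeros(n)
for _ in range(B):
    in_bag = rng.integers(0, n, size=n)       # bootstrap sample for one tree
    oob = np.setdiff1d(np.arange(n), in_bag)  # points this tree never saw
    oob_count[oob] += 1

oob_fraction = oob_count.mean() / B
# Each point is out-of-bag for (1 - 1/n)^n ≈ e^(-1) ≈ 37% of the trees,
# so every training point gets "free" test predictions from about a third of the forest.
```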

Boosting
1. Create a sequence of classifiers, giving higher influence to more accurate classifiers.
2. At each iteration, make the currently misclassified examples more important (they get larger weight in the construction of the next classifier).
3. Combine the classifiers by weighted vote (each weight given by classifier accuracy).

AdaBoost Algorithm
1. Initialize weights: each case gets the same weight, w_i = 1/N, i = 1, ..., N.
2. Construct a classifier g_m using the current weights, and compute its weighted error:
   ε_m = Σ_i w_i I{y_i ≠ g_m(x_i)} / Σ_i w_i
3. Compute the classifier's influence α_m = log((1 − ε_m)/ε_m), and update the example weights:
   w_i ← w_i exp{α_m I{y_i ≠ g_m(x_i)}}
4. Go to step 2.
The final prediction is a weighted vote, with weights α_m: G(x) = sign(Σ_m α_m g_m(x)).
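A minimal sketch of the algorithm above, using 1-D decision stumps as the base classifiers g_m (the stump learner, the toy interval dataset, and all function names are invented for this example):

```python
import numpy as np

def weighted_stump(X, y, w):
    """Best stump under weights w; labels y in {-1, +1}."""
    best_err, best = np.inf, (0.0, 1)
    for t in np.unique(X):
        for sign in (1, -1):
            pred = np.where(sign * (X - t) > 0, 1, -1)
            err = np.sum(w * (pred != y)) / np.sum(w)
            if err < best_err:
                best_err, best = err, (t, sign)
    return best_err, best

def adaboost(X, y, M=40):
    N = len(X)
    w = np.full(N, 1.0 / N)                       # step 1: equal weights
    ensemble = []
    for _ in range(M):
        err, (t, sign) = weighted_stump(X, y, w)  # step 2: fit with current weights
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = np.log((1 - err) / err)           # step 3: classifier influence
        pred = np.where(sign * (X - t) > 0, 1, -1)
        w = w * np.exp(alpha * (pred != y))       # upweight misclassified cases
        ensemble.append((alpha, t, sign))
    return ensemble

def adaboost_predict(ensemble, X):
    F = sum(a * np.where(s * (X - t) > 0, 1, -1) for a, t, s in ensemble)
    return np.where(F > 0, 1, -1)                 # final weighted vote

# An interval concept no single stump can represent: y = +1 iff |x| < 1.
X = np.linspace(-2, 2, 81)
y = np.where(np.abs(X) < 1, 1, -1)
model = adaboost(X, y)
train_acc = float(np.mean(adaboost_predict(model, X) == y))
```

A single stump tops out around 75% accuracy on this data, but the weighted vote of boosted stumps fits the interval exactly; the reweighting step forces each new stump to concentrate on the region its predecessors got wrong.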

[Figure: classifications (colors) and weights (size) after 1, 3, and 20 iterations of AdaBoost. From Elder, John, "From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods," 2007.]

AdaBoost. Advantages: very little code; reduces variance. Disadvantages: sensitive to noise and outliers. Why?

How was the Netflix Prize won? Gradient boosted decision trees. Details: http://www.netflixprize.com/assets/grandprize2009_bpc_bellkor.pdf
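As a hedged illustration of the idea only (not BellKor's actual model), here is a minimal least-squares gradient boosting loop with one-split regression stumps: each round fits a stump to the current residuals, the negative gradient of the squared loss, and adds a shrunken copy of it to the ensemble.

```python
import numpy as np

def fit_stump_reg(X, r):
    """One-split regression stump fit to residuals r by least squares."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for t in np.unique(X)[:-1]:
        left, right = r[X <= t], r[X > t]
        c1, c2 = left.mean(), right.mean()
        sse = ((left - c1) ** 2).sum() + ((right - c2) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, c1, c2)
    return best[1:]

def gradient_boost(X, y, M=300, lr=0.1):
    """Least-squares gradient boosting with stumps."""
    f0 = float(y.mean())
    F = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(M):
        r = y - F                             # residuals = negative gradient of squared loss
        t, c1, c2 = fit_stump_reg(X, r)
        stumps.append((t, lr * c1, lr * c2))  # store the shrunken step
        F += np.where(X <= t, lr * c1, lr * c2)
    return f0, stumps

def gb_predict(model, X):
    f0, stumps = model
    F = np.full(len(X), f0)
    for t, c1, c2 in stumps:
        F += np.where(X <= t, c1, c2)
    return F

# Toy 1-D regression: the sum of many small stump corrections tracks a smooth curve.
X = np.linspace(0, 6, 120)
y = np.sin(X)
model = gradient_boost(X, y)
mse = float(np.mean((gb_predict(model, X) - y) ** 2))
```

Unlike bagging, the trees here are built sequentially and are anything but independent: each one exists only to correct what the ensemble so far still gets wrong.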

Sources
- David Mease. Statistical Aspects of Data Mining. Lecture. http://video.google.com/videoplay?docid=-4669216290304603251&q=stats+202+engEDU&total=13&start=0&num=10&so=0&type=search&plindex=8
- Dietterich, T. G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. http://www.cs.orst.edu/~tgd/publications/hbtnn-ensemble-learning.ps.gz
- Elder, John and Seni, Giovanni. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. KDD 2007. http://tutorial.videolectures.net/kdd07_elder_afr/
- Netflix Prize. http://www.netflixprize.com/
- Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.