Why Ensembles Win Data Mining Competitions
A Predictive Analytics Center of Excellence (PACE) Tech Talk, November 14, 2012
Dean Abbott, Abbott Analytics, Inc.
Blog: http://abbottanalytics.blogspot.com | URL: http://www.abbottanalytics.com | Twitter: @deanabb | Email: dean@abbottanalytics.com
Outline
- Motivation for Ensembles
- How Ensembles Are Built
- Do Ensembles Violate Occam's Razor?
- Why Do Ensembles Win?
PAKDD Cup 2007 Results: Score Metric Changes the Winner

| Modeling Technique | Implementation | Affiliation Location | Affiliation Type | AUC-ROC (Trapezoidal Rule) | Rank | Top Decile Response Rate | Rank |
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |
Netflix Prize 2006
- Netflix's state of the art (Cinematch): RMSE = 0.9525. Prize: reduce this RMSE by 10%, to 0.8572
- 2007: the Korbell team won the Progress Prize with a 107-algorithm ensemble
- Top algorithm: SVD, with RMSE = 0.8914
- 2nd algorithm: Restricted Boltzmann Machine, with RMSE = 0.8990
- Mini-ensemble (SVD + RBM): RMSE = 0.88
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
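The SVD + RBM mini-ensemble above is just a blend of two models' predictions. A minimal sketch of the idea (the data, the 50/50 weighting, and the function names are illustrative, not Netflix's actual method): when two models make errors in different directions on different records, their blend can score better than either component.

```python
import math

def rmse(pred, actual):
    """Root mean squared error, the Netflix Prize metric."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def blend(pred_a, pred_b, w=0.5):
    """Linear blend of two models' predictions (w on model A, 1-w on model B)."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

# Toy ratings: model A overshoots where model B undershoots, and vice versa,
# so the blend cancels much of both models' error.
actual = [3.0, 4.0, 5.0]
model_a = [2.5, 4.5, 5.5]
model_b = [3.5, 3.5, 4.5]
blended = blend(model_a, model_b)
```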
Common Kinds of Ensembles vs. Single Models
[Diagram: taxonomy of ensemble methods alongside single classifiers]
From Zhuowen Tu, Ensemble Classification Methods: Bagging, Boosting, and Random Forests
What Are Model Ensembles?
- Combining the outputs from multiple models into a single decision
- The component models can be created using the same algorithm, or several different algorithms
[Diagram: component models feed decision logic, which produces the ensemble prediction]
Creating Model Ensembles, Step 1: Generate Component Models
From a single data set, create multiple models and predictions by varying the data or the model parameters:
- Case (record) weights: bootstrapping, sampling
- Data values: add noise, recode data
- Learning parameters: vary learning rates, pruning severity, random seeds
- Variable subsets: vary candidate inputs (features)
Creating Model Ensembles, Step 2: Combine the Models
Combining methods:
- Estimation: average the outputs
- Classification: average the probabilities, or vote (best M of N)
Variance reduction: build complex, overfit models, all built in the same manner.
Bias reduction: build simple models; subsequent models weight the records with errors more heavily (or model the errors directly).
The multiple models and predictions are combined into a single decision or prediction value.
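The two combining methods above can be sketched in a few lines of Python (the function names are illustrative; each inner list is one component model's predictions over the same records):

```python
from collections import Counter

def average_combine(predictions):
    """Estimation: average the component models' numeric outputs per record."""
    return [sum(rec) / len(rec) for rec in zip(*predictions)]

def vote_combine(predictions):
    """Classification: majority vote across the component models' labels per record."""
    return [Counter(rec).most_common(1)[0][0] for rec in zip(*predictions)]

# Three models, two records each:
avg = average_combine([[1.0, 2.0], [3.0, 4.0]])        # two models -> [2.0, 3.0]
votes = vote_combine([[1, 0], [1, 1], [0, 1]])         # three models -> [1, 1]
```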
How Model Complexity Affects Errors
[Figure: training and validation error vs. model complexity]
Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
Commonly Used Information-Theoretic Complexity Penalties
- BIC: Bayesian Information Criterion
- AIC: Akaike Information Criterion
- MDL: Minimum Description Length
For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
Four Keys to Effective Ensembling
- Diversity of opinion
- Independence
- Decentralization
- Aggregation
From The Wisdom of Crowds, James Surowiecki
Bagging
Method:
- Create many data sets by bootstrapping (can also be done with cross-validation)
- Create one decision tree for each data set
- Combine the decision trees by averaging (or voting on) their final decisions
- Primarily reduces model variance rather than bias
Results: on average, better than any individual tree. Final answer: the average.
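The bagging recipe above can be sketched in pure Python. This is a minimal illustration, not a production implementation: the base learner here is a one-split "stump" regressor (the slide uses full decision trees), and all names are made up for the example.

```python
import random

def bootstrap(data, rng):
    """Sample the data with replacement, same size as the original."""
    return [rng.choice(data) for _ in data]

def bag(data, fit, n_models=25, seed=0):
    """Fit one model per bootstrap replicate; predict by averaging."""
    rng = random.Random(seed)
    models = [fit(bootstrap(data, rng)) for _ in range(n_models)]
    def predict(x):
        return sum(m(x) for m in models) / len(models)
    return predict

def fit_stump(data):
    """Toy base learner on (x, y) pairs: split at the mean x, predict the
    mean y on each side."""
    split = sum(x for x, _ in data) / len(data)
    left = [y for x, y in data if x <= split] or [0.0]
    right = [y for x, y in data if x > split] or [0.0]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x <= split else rmean

data = [(0, 0.0), (1, 0.0), (2, 1.0), (3, 1.0)]
bagged = bag(data, fit_stump)
```

Each stump is noisy (it sees a different bootstrap sample), but averaging 25 of them gives a stable prediction, which is the variance-reduction effect the slide describes.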
Boosting (AdaBoost)
Method:
- Create a tree using the training data set
- Score each data point, flagging each incorrect decision (error)
- Retrain, giving the rows with incorrect decisions more weight; repeat
- The final prediction is a weighted average of all the models, a form of model regularization
- Best to create weak, simple models (just a few splits for a decision tree) and let the boosting iterations find the complexity
- Often used with trees or Naive Bayes
Results: usually better than an individual tree or Bagging. Reweight the examples classified incorrectly; combine the models via a weighted sum.
Random Forest Ensembles
Random Forest (RF) Method:
- Exactly the same methodology as Bagging, but with a twist
- At each split, rather than using the entire set of candidate inputs, use a random subset of the candidate inputs
- Generates diversity in both the samples and the inputs (splits)
Results: on average, better than any individual tree, Bagging, or even Boosting. Final answer: the average.
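The "twist" above is one line of code: the split search only sees a random subset of the candidate inputs. A pure-Python sketch with one-split trees (a real RF grows deep trees and applies the subset at every split; names and the sqrt(p) default are illustrative conventions, not the slide's implementation):

```python
import random

def fit_random_stump(X, y, rng, n_sub=None):
    """One-split tree on +/-1 labels; the split only considers a random
    subset of the inputs -- the Random Forest twist."""
    n_feat = len(X[0])
    n_sub = n_sub or max(1, int(n_feat ** 0.5))   # common default: sqrt(p)
    best = None
    for j in rng.sample(range(n_feat), n_sub):    # <-- random input subset
        for t in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(1 for x, yi in zip(X, y)
                          if (sign if x[j] <= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda x: sign if x[j] <= t else -sign

def random_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        boot = [rng.randrange(len(X)) for _ in X]  # bootstrap rows, as in bagging
        Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
        trees.append(fit_random_stump(Xb, yb, rng))
    return lambda x: 1 if sum(t(x) for t in trees) >= 0 else -1
```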
Stochastic Gradient Boosting
Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems).
Algorithm:
- Begin with a simple model: a constant value
- Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible prediction levels, whereas before there was one
- Score the model and compute the errors; the score is the sum of all previous trees, weighted by a learning rate
- Build a new tree with the errors as the target variable, and repeat
Results: TreeNet has won 2 KDD-Cup competitions and numerous others. It is less prone to outliers and overfit than AdaBoost. Each new tree predicts the errors of the ensemble so far; the models combine via a weighted sum into the final additive model.
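The algorithm above (constant model, then trees fit to the residual errors, summed with a learning rate) can be sketched directly. This is a bare-bones illustration with one-split regression stumps and squared-error residuals, not MART or TreeNet themselves; the function names and the 0.3 learning rate are arbitrary example choices.

```python
def fit_stump(X, y):
    """One-split regression tree: mean of y on each side of the best split."""
    best = None
    for t in sorted({x[0] for x in X}):
        left = [yi for x, yi in zip(X, y) if x[0] <= t]
        right = [yi for x, yi in zip(X, y) if x[0] > t] or left
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((yi - (lm if x[0] <= t else rm)) ** 2 for x, yi in zip(X, y))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x[0] <= t else rm

def gradient_boost(X, y, n_trees=50, rate=0.3):
    base = sum(y) / len(y)                 # step 1: start from a constant model
    stumps = []
    def predict(x):                        # score = constant + rate * sum of trees
        return base + rate * sum(s(x) for s in stumps)
    for _ in range(n_trees):
        resid = [yi - predict(x) for x, yi in zip(X, y)]  # compute the errors
        stumps.append(fit_stump(X, resid)) # new tree targets the errors
    return predict
```

With each round the residuals shrink by roughly a factor of (1 - rate), so the small learning rate trades more trees for a smoother, less overfit fit, which is the regularizing behavior the slide credits TreeNet with.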
Ensembles of Trees: Smoothers
Ensembles smooth jagged decision boundaries.
Pictures from T.G. Dietterich, "Ensemble Methods in Machine Learning," in Multiple Classifier Systems, Cagliari, Italy, 2000.
Heterogeneous Model Ensembles on the Glass Data
[Chart: percent classification error (max, min, and average) vs. number of models combined, 1 through 6]
- Model prediction diversity was obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN
- Combining 3-5 models is, on average, better than the best single model
- Combining all 6 models is not best (the best is a 3- or 4-model combination), but it is close
- This is an example of reducing model variance through ensembles, but not model bias
Direct Marketing Example: Considerations for I-Miner
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Steps:
1. Join by record: all models applied to the same data, in the same row order
2. Change the probability field names
3. Average the probabilities; the decision is avg_prob > threshold
4. Decile the probability ranks
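The core of the steps above (join the models' probability columns by record, average, then threshold) is a short function. A sketch, assuming each inner list is one model's probability column over the same records in the same row order; the names and the 0.5 threshold are illustrative:

```python
def ensemble_decision(prob_columns, threshold=0.5):
    """Average each record's probabilities across models, then decide.

    prob_columns: one list of probabilities per model, row-aligned.
    Returns (avg_prob per record, decision per record).
    """
    avg = [sum(rec) / len(rec) for rec in zip(*prob_columns)]
    return avg, [p > threshold for p in avg]

# Three models scoring the same two records:
avg, decisions = ensemble_decision([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]])
```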
Direct Marketing Example: Variable Inclusion in Model Ensembles
- Twenty-five different variables were represented across the ten models
- Only five were represented in seven or more models
- Twelve were represented in only one or two models
[Chart: number of models sharing common variables vs. number of variables]
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Fraud Detection Example: Deployment Stream
Model scoring picks up the scores from each model, combines them in an ensemble, and pushes the scores back to the database.
Fraud Detection Example: Overall Model Score on Validation Data
[Chart: normalized total score (from the validation population) for 11 individual models, the averages of all models, the 5 best, and the 5 worst, and the ensemble; the ensemble scores 9.5, higher than any individual model]
The score weights false alarms and sensitivity. Overall, the ensemble is clearly best, and much better than the best model on the testing data.
From Abbott, D., and Tom Konchan, Advanced Fraud Detection Techniques for Vendor Payments, Predictive Analytics Summit, San Diego, CA, February 24, 2011.
Are Ensembles Better?
Accuracy? Yes. Interpretability? No.
Do ensembles contradict Occam's Razor?
- The principle: simpler models generalize better; avoid overfit!
- Ensembles are more complex than single models (an RF may have hundreds of trees)
- Yet these more complex models perform better on held-out data
- But are they really more complex?
Generalized Degrees of Freedom
- In linear regression, a degree of freedom in the model is simply a parameter
- This does not extrapolate to non-linear methods: the number of parameters in a non-linear method can produce more complexity, or less
- Enter Generalized Degrees of Freedom (GDF)
- GDF (Ye 1998) randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates (for the same number of parameters)
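The perturbation procedure above can be sketched as a small Monte Carlo experiment in the spirit of Ye (1998). This is an illustrative implementation, not Ye's exact estimator: `fit_predict` is any modeling procedure that refits on the perturbed target and returns fitted values, and GDF is estimated as the summed sensitivity of each fitted value to the noise added at that record.

```python
import random

def gdf(fit_predict, X, y, sigma=0.25, n_rep=20, seed=0):
    """Monte Carlo GDF estimate: perturb the target, refit, and measure how
    much the fitted values chase the noise: sum_i cov(y_hat_i, delta_i) / sigma^2."""
    rng = random.Random(seed)
    n = len(y)
    deltas, fits = [], []
    for _ in range(n_rep):
        d = [rng.gauss(0.0, sigma) for _ in range(n)]
        yp = [yi + di for yi, di in zip(y, d)]
        deltas.append(d)
        fits.append(fit_predict(X, yp))   # refit on the perturbed target
    total = 0.0
    for i in range(n):
        db = sum(d[i] for d in deltas) / n_rep
        fb = sum(f[i] for f in fits) / n_rep
        cov = sum((d[i] - db) * (f[i] - fb)
                  for d, f in zip(deltas, fits)) / (n_rep - 1)
        total += cov / sigma ** 2
    return total
```

Sanity check on the two extremes: a model that always predicts the overall mean barely responds to the noise (GDF near 1), while a model that interpolates every point chases it completely (GDF near n), so flexibility, not parameter count, is what GDF measures.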
The Math of GDF
[Slide shows the GDF formulas]
From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
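The formula itself did not survive this text version of the slide; the standard definition, reconstructed from Ye (1998) rather than copied verbatim, is

```latex
\mathrm{GDF} \;=\; \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
\;\approx\; \sum_{i=1}^{n} \frac{\widehat{\operatorname{cov}}\!\left(\hat{y}_i,\; \delta_i\right)}{\sigma^2},
\qquad \delta_i \sim N(0, \sigma^2)
```

where the approximation is the perturbation-based estimate: add noise $\delta_i$ to the target, refit, and measure the covariance of each fitted value with its own perturbation. For linear regression with hat matrix $H$ (so $\hat{y} = Hy$), GDF reduces to $\operatorname{tr}(H) = p$, the usual parameter count, which is why GDF is a natural generalization.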
The Effect of GDF
[Figure: the effect of GDF]
From Elder, J.F., IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, pages 853-864
Why Ensembles Win
Performance, performance, performance.
- Most competitions care only about performance, not about interpretation or the ability to deploy
- Single models sometimes provide insufficient accuracy: neural networks become stuck in local minima; decision trees run out of data and are greedy, so they can be fooled early
- Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of thinking outside their box
- Different algorithms, or algorithms built on resampled data, achieve the same level of accuracy but on different cases; they identify different ways to get to the same level of accuracy
Conclusion
- Ensembles can achieve significant model performance improvements
- The key to good ensembles is diversity, in sampling and in variable selection
- They can be applied to a single algorithm, or across multiple algorithms
- Just do it!
References
- Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
- Elder, J.F., IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, pages 853-864. DOI: 10.1198/1061860032733
- Abbott, D.W., "The Benefits of Creating Ensembles of Classifiers," Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
- Abbott, D.W., "A Comparison of Algorithms at PAKDD2007," blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
References (continued)
- Tu, Zhuowen, Ensemble Classification Methods: Bagging, Boosting, and Random Forests, http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
- Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120-131