Why Ensembles Win Data Mining Competitions
A Predictive Analytics Center of Excellence (PACE) Tech Talk, November 14, 2012
Dean Abbott, Abbott Analytics, Inc.
Blog: http://abbottanalytics.blogspot.com | URL: http://www.abbottanalytics.com | Twitter: @deanabb | Email: dean@abbottanalytics.com
Outline
- Motivation for Ensembles
- How Ensembles Are Built
- Do Ensembles Violate Occam's Razor?
- Why Do Ensembles Win?
PAKDD Cup 2007 Results: Score Metric Changes the Winner

| Modeling Technique | Implementation | Affiliation Location | Affiliation Type | AUC-ROC (Trapezoidal Rule) | Rank | Top Decile Response Rate | Rank |
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |
Netflix Prize 2006
- Netflix's state of the art (Cinematch): RMSE = 0.9525. Prize: reduce this RMSE by 10%, to 0.8572
- 2007: the Korbell team won the Progress Prize with a 107-algorithm ensemble
- Top algorithm: SVD, with RMSE = 0.8914
- 2nd algorithm: Restricted Boltzmann Machine, with RMSE = 0.8990
- Mini-ensemble (SVD + RBM): RMSE = 0.88
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
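The SVD + RBM mini-ensemble above is just a blend of two models' predictions. A minimal sketch of the idea (the data, the 50/50 weighting, and the function names are illustrative, not Netflix's actual method): when two models make errors in different directions on different records, their blend can score better than either component.

```python
import math

def rmse(pred, actual):
    """Root mean squared error, the Netflix Prize metric."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def blend(pred_a, pred_b, w=0.5):
    """Linear blend of two models' predictions (w on model A, 1-w on model B)."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

# Toy ratings: model A overshoots where model B undershoots, and vice versa,
# so the blend cancels much of both models' error.
actual = [3.0, 4.0, 5.0]
model_a = [2.5, 4.5, 5.5]
model_b = [3.5, 3.5, 4.5]
blended = blend(model_a, model_b)
```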
Common Kinds of Ensembles vs. Single Models
[Diagram: taxonomy of ensemble methods alongside single classifiers]
From Zhuowen Tu, Ensemble Classification Methods: Bagging, Boosting, and Random Forests
What Are Model Ensembles?
- Combining the outputs from multiple models into a single decision
- The component models can be created using the same algorithm, or several different algorithms
[Diagram: component models feed decision logic, which produces the ensemble prediction]
Creating Model Ensembles, Step 1: Generate Component Models
From a single data set, create multiple models and predictions by varying the data or the model parameters:
- Case (record) weights: bootstrapping, sampling
- Data values: add noise, recode data
- Learning parameters: vary learning rates, pruning severity, random seeds
- Variable subsets: vary candidate inputs (features)
Creating Model Ensembles, Step 2: Combine the Models
Combining methods:
- Estimation: average the outputs
- Classification: average the probabilities, or vote (best M of N)
Variance reduction: build complex, overfit models, all built in the same manner.
Bias reduction: build simple models; subsequent models weight the records with errors more heavily (or model the errors directly).
The multiple models and predictions are combined into a single decision or prediction value.
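The two combining methods above can be sketched in a few lines of Python (the function names are illustrative; each inner list is one component model's predictions over the same records):

```python
from collections import Counter

def average_combine(predictions):
    """Estimation: average the component models' numeric outputs per record."""
    return [sum(rec) / len(rec) for rec in zip(*predictions)]

def vote_combine(predictions):
    """Classification: majority vote across the component models' labels per record."""
    return [Counter(rec).most_common(1)[0][0] for rec in zip(*predictions)]

# Three models, two records each:
avg = average_combine([[1.0, 2.0], [3.0, 4.0]])        # two models -> [2.0, 3.0]
votes = vote_combine([[1, 0], [1, 1], [0, 1]])         # three models -> [1, 1]
```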
How Model Complexity Affects Errors
[Figure: training and validation error vs. model complexity]
Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
Commonly Used Information-Theoretic Complexity Penalties
- BIC: Bayesian Information Criterion
- AIC: Akaike Information Criterion
- MDL: Minimum Description Length
For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
Four Keys to Effective Ensembling
- Diversity of opinion
- Independence
- Decentralization
- Aggregation
From The Wisdom of Crowds, James Surowiecki
Bagging
Method:
- Create many data sets by bootstrapping (can also be done with cross-validation)
- Create one decision tree for each data set
- Combine the decision trees by averaging (or voting on) their final decisions
- Primarily reduces model variance rather than bias
Results: on average, better than any individual tree. Final answer: the average.
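The bagging recipe above can be sketched in pure Python. This is a minimal illustration, not a production implementation: the base learner here is a one-split "stump" regressor (the slide uses full decision trees), and all names are made up for the example.

```python
import random

def bootstrap(data, rng):
    """Sample the data with replacement, same size as the original."""
    return [rng.choice(data) for _ in data]

def bag(data, fit, n_models=25, seed=0):
    """Fit one model per bootstrap replicate; predict by averaging."""
    rng = random.Random(seed)
    models = [fit(bootstrap(data, rng)) for _ in range(n_models)]
    def predict(x):
        return sum(m(x) for m in models) / len(models)
    return predict

def fit_stump(data):
    """Toy base learner on (x, y) pairs: split at the mean x, predict the
    mean y on each side."""
    split = sum(x for x, _ in data) / len(data)
    left = [y for x, y in data if x <= split] or [0.0]
    right = [y for x, y in data if x > split] or [0.0]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x <= split else rmean

data = [(0, 0.0), (1, 0.0), (2, 1.0), (3, 1.0)]
bagged = bag(data, fit_stump)
```

Each stump is noisy (it sees a different bootstrap sample), but averaging 25 of them gives a stable prediction, which is the variance-reduction effect the slide describes.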
Boosting (AdaBoost)
Method:
- Create a tree using the training data set
- Score each data point, flagging each incorrect decision (error)
- Retrain, giving the rows with incorrect decisions more weight; repeat
- The final prediction is a weighted average of all the models, a form of model regularization
- Best to create weak, simple models (just a few splits for a decision tree) and let the boosting iterations find the complexity
- Often used with trees or Naive Bayes
Results: usually better than an individual tree or Bagging. Reweight the examples classified incorrectly; combine the models via a weighted sum.
Random Forest Ensembles
Random Forest (RF) Method:
- Exactly the same methodology as Bagging, but with a twist
- At each split, rather than using the entire set of candidate inputs, use a random subset of the candidate inputs
- Generates diversity in both the samples and the inputs (splits)
Results: on average, better than any individual tree, Bagging, or even Boosting. Final answer: the average.
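The "twist" above is one line of code: the split search only sees a random subset of the candidate inputs. A pure-Python sketch with one-split trees (a real RF grows deep trees and applies the subset at every split; names and the sqrt(p) default are illustrative conventions, not the slide's implementation):

```python
import random

def fit_random_stump(X, y, rng, n_sub=None):
    """One-split tree on +/-1 labels; the split only considers a random
    subset of the inputs -- the Random Forest twist."""
    n_feat = len(X[0])
    n_sub = n_sub or max(1, int(n_feat ** 0.5))   # common default: sqrt(p)
    best = None
    for j in rng.sample(range(n_feat), n_sub):    # <-- random input subset
        for t in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(1 for x, yi in zip(X, y)
                          if (sign if x[j] <= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda x: sign if x[j] <= t else -sign

def random_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        boot = [rng.randrange(len(X)) for _ in X]  # bootstrap rows, as in bagging
        Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
        trees.append(fit_random_stump(Xb, yb, rng))
    return lambda x: 1 if sum(t(x) for t in trees) >= 0 else -1
```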
Stochastic Gradient Boosting
Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems).
Algorithm:
- Begin with a simple model: a constant value
- Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible prediction levels, whereas before there was one
- Score the model and compute the errors; the score is the sum of all previous trees, weighted by a learning rate
- Build a new tree with the errors as the target variable, and repeat
Results: TreeNet has won 2 KDD-Cup competitions and numerous others. It is less prone to outliers and overfit than AdaBoost. Each new tree predicts the errors of the ensemble so far; the models combine via a weighted sum into the final additive model.
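The algorithm above (constant model, then trees fit to the residual errors, summed with a learning rate) can be sketched directly. This is a bare-bones illustration with one-split regression stumps and squared-error residuals, not MART or TreeNet themselves; the function names and the 0.3 learning rate are arbitrary example choices.

```python
def fit_stump(X, y):
    """One-split regression tree: mean of y on each side of the best split."""
    best = None
    for t in sorted({x[0] for x in X}):
        left = [yi for x, yi in zip(X, y) if x[0] <= t]
        right = [yi for x, yi in zip(X, y) if x[0] > t] or left
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((yi - (lm if x[0] <= t else rm)) ** 2 for x, yi in zip(X, y))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x[0] <= t else rm

def gradient_boost(X, y, n_trees=50, rate=0.3):
    base = sum(y) / len(y)                 # step 1: start from a constant model
    stumps = []
    def predict(x):                        # score = constant + rate * sum of trees
        return base + rate * sum(s(x) for s in stumps)
    for _ in range(n_trees):
        resid = [yi - predict(x) for x, yi in zip(X, y)]  # compute the errors
        stumps.append(fit_stump(X, resid)) # new tree targets the errors
    return predict
```

With each round the residuals shrink by roughly a factor of (1 - rate), so the small learning rate trades more trees for a smoother, less overfit fit, which is the regularizing behavior the slide credits TreeNet with.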
Ensembles of Trees: Smoothers
Ensembles smooth jagged decision boundaries.
Pictures from T.G. Dietterich, "Ensemble Methods in Machine Learning," in Multiple Classifier Systems, Cagliari, Italy, 2000.
Heterogeneous Model Ensembles on the Glass Data
[Chart: percent classification error (max, min, and average) vs. number of models combined, 1 through 6]
- Model prediction diversity was obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN
- Combining 3-5 models is, on average, better than the best single model
- Combining all 6 models is not best (the best is a 3- or 4-model combination), but it is close
- This is an example of reducing model variance through ensembles, but not model bias
Direct Marketing Example: Considerations for I-Miner
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Steps:
1. Join by record: all models applied to the same data, in the same row order
2. Change the probability field names
3. Average the probabilities; the decision is avg_prob > threshold
4. Decile the probability ranks
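The core of the steps above (join the models' probability columns by record, average, then threshold) is a short function. A sketch, assuming each inner list is one model's probability column over the same records in the same row order; the names and the 0.5 threshold are illustrative:

```python
def ensemble_decision(prob_columns, threshold=0.5):
    """Average each record's probabilities across models, then decide.

    prob_columns: one list of probabilities per model, row-aligned.
    Returns (avg_prob per record, decision per record).
    """
    avg = [sum(rec) / len(rec) for rec in zip(*prob_columns)]
    return avg, [p > threshold for p in avg]

# Three models scoring the same two records:
avg, decisions = ensemble_decision([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]])
```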
Direct Marketing Example: Variable Inclusion in Model Ensembles
- Twenty-five different variables were represented across the ten models
- Only five were represented in seven or more models
- Twelve were represented in only one or two models
[Chart: number of models sharing common variables vs. number of variables]
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Fraud Detection Example: Deployment Stream
Model scoring picks up the scores from each model, combines them in an ensemble, and pushes the scores back to the database.
Fraud Detection Example: Overall Model Score on Validation Data
[Chart: normalized total score (from the validation population) for 11 individual models, the averages of all models, the 5 best, and the 5 worst, and the ensemble; the ensemble scores 9.5, higher than any individual model]
The score weights false alarms and sensitivity. Overall, the ensemble is clearly best, and much better than the best model on the testing data.
From Abbott, D., and Tom Konchan, Advanced Fraud Detection Techniques for Vendor Payments, Predictive Analytics Summit, San Diego, CA, February 24, 2011.
Are Ensembles Better?
Accuracy? Yes. Interpretability? No.
Do ensembles contradict Occam's Razor?
- The principle: simpler models generalize better; avoid overfit!
- Ensembles are more complex than single models (an RF may have hundreds of trees)
- Yet these more complex models perform better on held-out data
- But are they really more complex?
Generalized Degrees of Freedom
- In linear regression, a degree of freedom in the model is simply a parameter
- This does not extrapolate to non-linear methods: the number of parameters in a non-linear method can produce more complexity, or less
- Enter Generalized Degrees of Freedom (GDF)
- GDF (Ye 1998) randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates (for the same number of parameters)
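The perturbation procedure above can be sketched as a small Monte Carlo experiment in the spirit of Ye (1998). This is an illustrative implementation, not Ye's exact estimator: `fit_predict` is any modeling procedure that refits on the perturbed target and returns fitted values, and GDF is estimated as the summed sensitivity of each fitted value to the noise added at that record.

```python
import random

def gdf(fit_predict, X, y, sigma=0.25, n_rep=20, seed=0):
    """Monte Carlo GDF estimate: perturb the target, refit, and measure how
    much the fitted values chase the noise: sum_i cov(y_hat_i, delta_i) / sigma^2."""
    rng = random.Random(seed)
    n = len(y)
    deltas, fits = [], []
    for _ in range(n_rep):
        d = [rng.gauss(0.0, sigma) for _ in range(n)]
        yp = [yi + di for yi, di in zip(y, d)]
        deltas.append(d)
        fits.append(fit_predict(X, yp))   # refit on the perturbed target
    total = 0.0
    for i in range(n):
        db = sum(d[i] for d in deltas) / n_rep
        fb = sum(f[i] for f in fits) / n_rep
        cov = sum((d[i] - db) * (f[i] - fb)
                  for d, f in zip(deltas, fits)) / (n_rep - 1)
        total += cov / sigma ** 2
    return total
```

Sanity check on the two extremes: a model that always predicts the overall mean barely responds to the noise (GDF near 1), while a model that interpolates every point chases it completely (GDF near n), so flexibility, not parameter count, is what GDF measures.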
The Math of GDF
[Slide shows the GDF formulas]
From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
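The formula itself did not survive this text version of the slide; the standard definition, reconstructed from Ye (1998) rather than copied verbatim, is

```latex
\mathrm{GDF} \;=\; \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
\;\approx\; \sum_{i=1}^{n} \frac{\widehat{\operatorname{cov}}\!\left(\hat{y}_i,\; \delta_i\right)}{\sigma^2},
\qquad \delta_i \sim N(0, \sigma^2)
```

where the approximation is the perturbation-based estimate: add noise $\delta_i$ to the target, refit, and measure the covariance of each fitted value with its own perturbation. For linear regression with hat matrix $H$ (so $\hat{y} = Hy$), GDF reduces to $\operatorname{tr}(H) = p$, the usual parameter count, which is why GDF is a natural generalization.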
The Effect of GDF
[Figure: the effect of GDF]
From Elder, J.F., IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, pages 853-864
Why Ensembles Win
Performance, performance, performance.
- Most competitions care only about performance, not about interpretation or the ability to deploy
- Single models sometimes provide insufficient accuracy: neural networks become stuck in local minima; decision trees run out of data and are greedy, so they can be fooled early
- Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of thinking outside their box
- Different algorithms, or algorithms built on resampled data, achieve the same level of accuracy but on different cases; they identify different ways to get to the same level of accuracy
Conclusion
- Ensembles can achieve significant model performance improvements
- The key to good ensembles is diversity, in sampling and in variable selection
- They can be applied to a single algorithm, or across multiple algorithms
- Just do it!
References
- Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
- Elder, J.F., IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, pages 853-864. DOI: 10.1198/1061860032733
- Abbott, D.W., "The Benefits of Creating Ensembles of Classifiers," Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
- Abbott, D.W., "A Comparison of Algorithms at PAKDD2007," blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
References (continued)
- Tu, Zhuowen, Ensemble Classification Methods: Bagging, Boosting, and Random Forests, http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
- Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120-131