Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions



Synthesis Lectures on Data Mining and Knowledge Discovery

Editor: Robert Grossman, University of Illinois, Chicago

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
Giovanni Seni and John F. Elder, 2010

Modeling and Data Mining in Blogosphere
Nitin Agarwal and Huan Liu, 2009

Copyright 2010 by Morgan & Claypool. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in printed reviews, without the prior permission of the publisher.

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
Giovanni Seni and John F. Elder

ISBN: (paperback)
ISBN: (ebook)
DOI /S00240ED1V01Y200912DMK002

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY, Lecture #2
Series Editor: Robert Grossman, University of Illinois, Chicago
Series ISSN: Print, Electronic

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions

Giovanni Seni
Elder Research, Inc. and Santa Clara University

John F. Elder
Elder Research, Inc. and University of Virginia

SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY #2

Morgan & Claypool Publishers

ABSTRACT

Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges, from investment timing to drug discovery, and fraud detection to recommendation systems, where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization, today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods (bagging, random forests, and boosting) to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.

This book is aimed at novice and advanced analytic researchers and practitioners, especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the techniques.¹

The authors are industry experts in data mining and machine learning who are also adjunct professors and popular speakers. Although early pioneers in discovering and using ensembles, they here distill and clarify the recent groundbreaking work of leading academics (such as Jerome Friedman) to bring the benefits of ensembles to practitioners. The authors would appreciate hearing of errors in or suggested improvements to this book, and may be emailed at seni@datamininglab.com and elder@datamininglab.com. Errata and updates will be available from …

KEYWORDS

ensemble methods, rule ensembles, importance sampling, boosting, random forest, bagging, regularization, decision trees, data mining, machine learning, pattern recognition, model interpretation, model complexity, generalized degrees of freedom

¹ R is an open-source language and environment for data analysis and statistical modeling available through the Comprehensive R Archive Network (CRAN). The R system's library packages offer extensive functionality and can be downloaded from cran.r-project.org for many computing platforms. The CRAN web site also has pointers to tutorials and comprehensive documentation. A variety of excellent introductory books are also available; we particularly like Introductory Statistics with R by Peter Dalgaard and Modern Applied Statistics with S by W.N. Venables and B.D. Ripley.

To the loving memory of our fathers, Tito and Fletcher


Contents

Acknowledgments
Foreword by Jaffray Woodriff
Foreword by Tin Kam Ho

1 Ensembles Discovered
  1.1 Building Ensembles
  1.2 Regularization
  1.3 Real-World Examples: Credit Scoring + the Netflix Challenge
  1.4 Organization of This Book

2 Predictive Learning and Decision Trees
  2.1 Decision Tree Induction Overview
  2.2 Decision Tree Properties
  2.3 Decision Tree Limitations

3 Model Complexity, Model Selection and Regularization
  3.1 What is the Right Size of a Tree?
  3.2 Bias-Variance Decomposition
  3.3 Regularization
    Regularization and Cost-Complexity Tree Pruning
    Cross-Validation
    Regularization via Shrinkage
    Regularization via Incremental Model Building
    Example
  3.4 Regularization Summary

4 Importance Sampling and the Classic Ensemble Methods
  4.1 Importance Sampling
    Parameter Importance Measure
    Perturbation Sampling
  4.2 Generic Ensemble Generation
  4.3 Bagging
    Example
    Why it Helps?
  4.4 Random Forest
  4.5 AdaBoost
    Example
    Why the Exponential Loss?
    AdaBoost's Population Minimizer
  4.6 Gradient Boosting
  4.7 MART
  4.8 Parallel vs. Sequential Ensembles

5 Rule Ensembles and Interpretation Statistics
  5.1 Rule Ensembles
  5.2 Interpretation
    Simulated Data Example
    Variable Importance
    Partial Dependences
    Interaction Statistic
  5.3 Manufacturing Data Example
  5.4 Summary

6 Ensemble Complexity
  6.1 Complexity
  6.2 Generalized Degrees of Freedom
  6.3 Examples: Decision Tree Surface with Noise
  6.4 R Code for GDF and Example
  6.5 Summary and Discussion

A AdaBoost Equivalence to FSF Procedure
B Gradient Boosting and Robust Loss Functions
Bibliography
Authors' Biographies


Acknowledgments

We would like to thank the many people who contributed to the conception and completion of this project. Giovanni had the privilege of meeting with Jerry Friedman regularly to discuss many of the statistical concepts behind ensembles. Prof. Friedman's influence is deep. Bart Goethals and the organizers of ACM-KDD07 first welcomed our tutorial proposal on the topic. Tin Kam Ho favorably reviewed the book idea, Keith Bettinger offered many helpful suggestions on the manuscript, and Matt Strampe assisted with R code. The staff at Morgan & Claypool, especially executive editor Diane Cerra, were diligent and patient in turning the manuscript into a book. Finally, we would like to thank our families for their love and support.

Giovanni Seni and John F. Elder
January 2010


Foreword by Jaffray Woodriff

John Elder is a well-known expert in the field of statistical prediction. He is also a good friend who has mentored me about many techniques for mining complex data for useful information. I have been quite fortunate to collaborate with John on a variety of projects, and there must be a good reason that ensembles played the primary role each time.

I need to explain how we met, as ensembles are responsible! I spent my four years at the University of Virginia investigating the markets. My plan was to become an investment manager after I graduated. All I needed was a profitable technical style that fit my skills and personality (that is all!). After I graduated in 1991, I followed where the data led me during one particular caffeine-fueled, double all-nighter. In a fit of crazed trial-and-error brainstorming, I stumbled upon the winning concept of creating one super-model from a large and diverse group of base predictive models.

After ten years of combining models for investment management, I decided to investigate where my ideas fit in the general academic body of work. I had moved back to Charlottesville after a stint as a proprietary trader on Wall Street, and I sought out a local expert in the field. I found John's firm, Elder Research, on the web and hoped that they'd have the time to talk to a data mining novice. I quickly realized that John was not only a leading expert on statistical learning, but a very accomplished speaker popularizing these methods. Fortunately for me, he was curious to talk about prediction and my ideas. Early on, he pointed out that my multiple-model method for investing was described by the statistical prediction term, ensemble.

John and I have worked together on interesting projects over the past decade. I teamed with Elder Research to compete in the KDD Cup. We wrote an extensive proposal for a government grant to fund the creation of ensemble-based research and software. In 2007 we joined up to compete against thousands of other teams on the Netflix Prize, achieving a third-place ranking at one point (thanks partly to simple ensembles). We even pulled a brainstorming all-nighter coding up our user rating model, which brought back fond memories of that initial breakthrough so many years before.

The practical implementations of ensemble methods are enormous. Most current implementations of them are quite primitive and this book will definitely raise the state of the art. Giovanni Seni's thorough mastery of the cutting-edge research and John Elder's practical experience have combined to make an extremely readable and useful book.

Looking forward, I can imagine software that allows users to seamlessly build ensembles in the manner, say, that skilled architects use CAD software to create design images. I expect that

Giovanni and John will be at the forefront of developments in this area, and, if I am lucky, I will be involved as well.

Jaffray Woodriff
CEO, Quantitative Investment Management
Charlottesville, Virginia
January 2010

[Editor's note: Mr. Woodriff's investment firm has experienced consistently positive results, and has grown to be the largest hedge fund manager in the South-East U.S.]

Foreword by Tin Kam Ho

Fruitful solutions to a challenging task have often been found to come from combining an ensemble of experts. Yet for algorithmic solutions to a complex classification task, the utilities of ensembles were first witnessed only in the late 1980s, when the computing power began to support the exploration and deployment of a rich set of classification methods simultaneously. The next two decades saw more and more such approaches come into the research arena, and the development of several consistently successful strategies for ensemble generation and combination. Today, while a complete explanation of all the elements remains elusive, the ensemble methodology has become an indispensable tool for statistical learning. Every researcher and practitioner involved in predictive classification problems can benefit from a good understanding of what is available in this methodology.

This book by Seni and Elder provides a timely, concise introduction to this topic. After an intuitive, highly accessible sketch of the key concerns in predictive learning, the book takes the readers through a shortcut into the heart of the popular tree-based ensemble creation strategies, and follows that with a compact yet clear presentation of the developments in the frontiers of statistics, where active attempts are being made to explain and exploit the mysteries of ensembles through conventional statistical theory and methods. Throughout the book, the methodology is illustrated with varied real-life examples, and augmented with implementations in R-code for the readers to obtain first-hand experience.

For practitioners, this handy reference opens the door to a good understanding of this rich set of tools that holds high promises for the challenging tasks they face. For researchers and students, it provides a succinct outline of the critically relevant pieces of the vast literature, and serves as an excellent summary for this important topic.

The development of ensemble methods is by no means complete. Among the most interesting open challenges are a more thorough understanding of the mathematical structures, mapping of the detailed conditions of applicability, finding scalable and interpretable implementations, dealing with incomplete or imbalanced training samples, and evolving models to adapt to environmental changes. It will be exciting to see this monograph encourage talented individuals to tackle these problems in the coming decades.

Tin Kam Ho
Bell Labs, Alcatel-Lucent
January 2010


CHAPTER 1

Ensembles Discovered

    ...and in a multitude of counselors there is safety.
    Proverbs 24:6b

A wide variety of competing methods are available for inducing models from data, and their relative strengths are of keen interest. The comparative accuracy of popular algorithms depends strongly on the details of the problems addressed, as shown in Figure 1.1 (from Elder and Lee (1997)), which plots the relative out-of-sample error of five algorithms for six public-domain problems. Overall, neural network models did the best on this set of problems, but note that every algorithm scored best or next-to-best on at least two of the six data sets.

Figure 1.1: Relative out-of-sample error (lower is better) of five algorithms (neural network, logistic regression, linear vector quantization, projection pursuit regression, decision tree) on six public-domain problems: Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment (based on Elder and Lee (1997); John Elder, Elder Research, and Stephen Lee, U. Idaho).

How can we tell, ahead of time, which algorithm will excel for a given problem? Michie et al. (1994) addressed this question by executing a similar but larger study (23 algorithms on 22 data sets) and building a decision tree to predict the best algorithm to use given the properties of a data set.¹ Though the study was skewed toward trees (they were 9 of the 23 algorithms, and several of the (academic) data sets had unrealistic thresholds amenable to trees), the study did reveal useful lessons for algorithm selection (as highlighted in Elder, J. (1996a)).

Still, there is a way to improve model accuracy that is easier and more powerful than judicious algorithm selection: one can gather models into ensembles. Figure 1.2 reveals the out-of-sample accuracy of the models of Figure 1.1 when they are combined four different ways, including averaging, voting, and advisor perceptrons (Elder and Lee, 1997). While the ensemble technique of advisor perceptrons beats simple averaging on every problem, the difference is small compared to the difference between ensembles and the single models. Every ensemble method competes well here against the best of the individual algorithms.

Figure 1.2: Relative out-of-sample error (lower is better) of four ensemble methods (advisor perceptron, AP weighted average, vote, average) on the problems of Figure 1.1 (based on Elder and Lee (1997)); ensemble methods all improve performance.

This phenomenon was discovered by a handful of researchers, separately and simultaneously, to improve classification whether using decision trees (Ho, Hull, and Srihari, 1990), neural networks (Hansen and Salamon, 1990), or math theory (Kleinberg, E., 1990). The most influential early developments were by Breiman, L. (1996) with Bagging, and Freund and Schapire (1996) with AdaBoost (both described in Chapter 4).

One of us stumbled across the marvel of ensembling (which we called "model fusion" or "bundling") while striving to predict the species of bats from features of their echo-location signals (Elder, J., 1996b).² We built the best model we could with each of several very different algorithms, such as decision trees, neural networks, polynomial networks, and nearest neighbors (see Nisbet et al. (2009) for algorithm descriptions). These methods employ different basis functions and training procedures, which causes their diverse surface forms, as shown in Figure 1.3, and often leads to surprisingly different prediction vectors, even when the aggregate performance is very similar.

The project goal was to classify a bat's species noninvasively, by using only its chirps. University of Illinois Urbana-Champaign biologists captured 19 bats, labeled each as one of 6 species, then recorded 98 signals, from which UIUC engineers calculated 35 time-frequency features.³ Figure 1.4 illustrates a two-dimensional projection of the data where each class is represented by a different color and symbol. The data displays useful clustering but also much class overlap to contend with.

Figure 1.4: Sample projection of signals for 6 different bat species.

Each bat contributed 3 to 8 signals, and we realized that the set of signals from a given bat had to be kept together (in either training or evaluation data) to fairly test the model's ability to predict a species of an unknown bat. That is, any bat with a signal in the evaluation data must have no other signals from it in training.

¹ The researchers (Michie et al., 1994, Section 10.6) examined the results of one algorithm at a time and built a C4.5 decision tree (Quinlan, J., 1992) to separate those datasets where the algorithm was "applicable" (where it was within a tolerance of the best algorithm) from those where it was not. They also extracted rules from the tree models and used an expert system to adjudicate between conflicting rules to maximize net information score. The book is online at ac.uk/charles/statlog/whole.pdf
² Thanks to collaboration with Doug Jones and his EE students at the University of Illinois, Urbana-Champaign.
³ Features such as low frequency at the 3-decibel level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.

So, evaluating the performance of a model type consisted of building and cross-validating 19 models and accumulating the out-of-sample results (a "leave-one-bat-out" method).

On evaluation, the baseline accuracy (always choosing the plurality class) was 27%. Decision trees got 46%, and a tree algorithm that was improved to look two steps ahead to choose splits (Elder, J., 1996b) got 58%. Polynomial networks got 64%. The first neural networks tried achieved only 52%. However, unlike the other methods, neural networks don't select variables; when the inputs were then pruned in half to reduce redundancy and collinearity, neural networks improved to 63% accuracy. When the inputs were pruned further to be only the 8 variables the trees employed, neural networks improved to 69% accuracy out-of-sample. (This result is a clear demonstration of the need for regularization, as described in Chapter 3, to avoid overfit.) Lastly, nearest neighbors, using those same 8 variables for dimensions, matched the neural network score of 69%.

Despite their overall scores being identical, the two best models (neural network and nearest neighbor) disagreed a third of the time; that is, they made errors on very different regions of the data. We observed that the more confident of the two methods was right more often than not.

(Their estimates were between 0 and 1 for a given class; the estimate closer to an extreme was usually more correct.) Thus, we tried averaging together the estimates of four of the methods (two-step decision tree, polynomial network, neural network, and nearest neighbor) and achieved 74% accuracy, the best of all. Further study of the lessons of each algorithm (such as when to ignore an estimate due to its inputs clearly being outside the algorithm's training domain) led to improvement reaching 80%. In short, it was discovered to be possible to break through the asymptotic performance ceiling of an individual algorithm by employing the estimates of multiple algorithms. Our fascination with what came to be known as ensembling began.

1.1 BUILDING ENSEMBLES

Building an ensemble consists of two steps: (1) constructing varied models and (2) combining their estimates (see Section 4.2). One may generate component models by, for instance, varying case weights, data values, guidance parameters, variable subsets, or partitions of the input space. Combination can be accomplished by voting, but is primarily done through model estimate weights, with gating and advisor perceptrons as special cases.

Figure 1.3: Example estimation surfaces for five modeling algorithms. Clockwise from top left: decision tree, Delaunay planes (based on Elder, J. (1993)), nearest neighbor, polynomial network (or neural network), kernel.
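These two steps can be made concrete in a few lines of R. The sketch below is illustrative only: the data frames train and test and the two-class factor response y are hypothetical names, and the varied component models are trees fit to bootstrap samples (one way of varying the data values), combined by averaging their class-probability estimates.

    # A minimal sketch of the two ensemble-building steps (assumed data:
    # data frames `train` and `test` with a two-class factor response `y`).
    library(rpart)
    set.seed(42)

    # Step (1): construct varied models, here trees fit to bootstrap samples
    models <- lapply(1:5, function(i) {
      boot <- train[sample(nrow(train), replace = TRUE), ]
      rpart(y ~ ., data = boot, method = "class")
    })

    # Step (2): combine their estimates by averaging the class probabilities
    probs <- sapply(models, function(m) predict(m, newdata = test)[, 2])
    ensemble_estimate <- rowMeans(probs)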

For example, Bayesian model averaging sums estimates of possible models, weighted by their posterior evidence. Bagging (bootstrap aggregating; Breiman, L. (1996)) bootstraps the training data set (usually to build varied decision trees) and takes the majority vote or the average of their estimates (see Section 4.3). Random Forest (Ho, T., 1995; Breiman, L., 2001) adds a stochastic component to create more "diversity" among the trees being combined (see Section 4.4). AdaBoost (Freund and Schapire, 1996) and ARCing (Breiman, L., 1996) iteratively build models by varying case weights (up-weighting cases with large current errors and down-weighting those accurately estimated) and employ the weighted sum of the estimates of the sequence of models (see Section 4.5). Gradient Boosting (Friedman, J., 1999, 2001) extended the AdaBoost algorithm to a variety of error functions for regression and classification (see Section 4.6).

The Group Method of Data Handling (GMDH) (Ivakhnenko, A., 1968) and its descendent, Polynomial Networks (Barron et al., 1984; Elder and Brown, 2000), can be thought of as early ensemble techniques. They build multiple layers of moderate-order polynomials, fit by linear regression,

where variety arises from different variable sets being employed by each node. Their combination is nonlinear since the outputs of interior nodes are inputs to polynomial nodes in subsequent layers. Network construction is stopped by a simple cross-validation test (GMDH) or a complexity penalty. An early popular method, Stacking (Wolpert, D., 1992), employs neural networks as components (whose variety can stem from simply using different guidance parameters, such as initialization weights), combined in a linear regression trained on leave-1-out estimates from the networks.

Models have to be individually good to contribute to ensembling, and that requires knowing when to stop; that is, how to avoid overfit, the chief danger in model induction, as discussed next.

1.2 REGULARIZATION

A widely held principle in Statistical and Machine Learning model inference is that accuracy and simplicity are both desirable. But there is a tradeoff between the two: a flexible (more complex) model is often needed to achieve higher accuracy, but it is more susceptible to overfitting and less likely to generalize well. Regularization techniques "damp down" the flexibility of a model fitting procedure by augmenting the error function with a term that penalizes model complexity. Minimizing the augmented error criterion requires a certain increase in accuracy to "pay" for the increase in model complexity (e.g., adding another term to the model). Regularization is today understood to be one of the key reasons for the superior performance of modern ensembling algorithms.

An influential paper was Tibshirani's introduction of the Lasso regularization technique for linear models (Tibshirani, R., 1996). The Lasso uses the sum of the absolute values of the coefficients in the model as the penalty function, and had roots in work done by Breiman on a coefficient post-processing technique which he had termed Garotte (Breiman et al., 1993). Another important development came with the LARS algorithm by Efron et al. (2004), which allows for an efficient iterative calculation of the Lasso solution. More recently, Friedman published a technique called Path Seeker (PS) that allows combining the Lasso penalty with a variety of loss (error) functions (Friedman and Popescu, 2004), extending the original Lasso paper, which was limited to the least-squares loss.

Careful comparison of the Lasso penalty with alternative penalty functions (e.g., using the sum of the squares of the coefficients) led to an understanding that the penalty function has two roles: controlling the "sparseness" of the solution (the number of coefficients that are non-zero) and controlling the magnitude of the non-zero coefficients ("shrinkage"). This led to development of the Elastic Net (Zou and Hastie, 2005) family of penalty functions, which allows searching for the best shrinkage/sparseness tradeoff according to characteristics of the problem at hand (e.g., data size, number of input variables, correlation among these variables, etc.). The Coordinate Descent algorithm of Friedman et al. (2008) provides fast solutions for the Elastic Net. Finally, an extension of the Elastic Net family to non-convex members, producing sparser solutions (desirable when the number of variables is much larger than the number of observations), is now possible with the Generalized Path Seeker algorithm (Friedman, J., 2008).
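In R, the Lasso and Elastic Net penalties are conveniently available through the glmnet package, which implements the coordinate descent approach. A brief sketch; the simulated x and y here are placeholders rather than data from any example in this book:

    # Lasso and Elastic Net fits via coordinate descent (glmnet package).
    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)   # 100 cases, 20 inputs (simulated)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)   # sparse linear target plus noise

    fit_lasso <- glmnet(x, y, alpha = 1)    # alpha = 1: pure Lasso penalty
    fit_enet  <- glmnet(x, y, alpha = 0.5)  # 0 < alpha < 1: Elastic Net mix

    cv <- cv.glmnet(x, y, alpha = 0.5)      # cross-validate penalty strength
    coef(cv, s = "lambda.min")              # sparse coefficient vector

The alpha parameter trades off the sparseness (Lasso) and shrinkage (squared) components of the penalty, mirroring the two roles of the penalty function described above.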

1.3 REAL-WORLD EXAMPLES: CREDIT SCORING + THE NETFLIX CHALLENGE

Many of the examples we show are academic; they are either curiosities (bats) or kept very simple to best illustrate principles. We close Chapter 1 by illustrating that even simple ensembles can work in very challenging industrial applications. Figure 1.5 reveals the out-of-sample results of ensembling up to five different types of models on a credit scoring application. (The output of each model is ranked, those ranks are averaged and re-ranked, and the credit defaulters in a top percentage are counted. Thus, lower is better.) The combinations are ordered on the horizontal axis by the number of models used, and Figure 1.6 highlights the finding that the mean error reduces with increasing degree of combination. Note that the final model with all five component models does better than the best of the single models.

Figure 1.5: Out-of-sample errors (number of defaulters missed; fewer is better) on a credit scoring application when combining one to five different types of models into ensembles. T represents bagged trees; S, stepwise regression; P, polynomial networks; N, neural networks; M, MARS.

The best model, MPN, thus averages the models built by MARS, a polynomial network, and a neural network algorithm. Each model in the collection represents a great deal of work, and it was constructed by advocates of that modeling algorithm competing to beat the other methods. Here, MARS was the best and bagged trees was the worst of the five methods (though a considerable improvement over single trees, as also shown in many examples in Chapter 4).
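The rank-averaging combination described in the parenthetical above takes only a few lines of R. A sketch, assuming a hypothetical list preds holding one numeric score vector per component model:

    # Rank-average combination of model outputs (`preds` is a hypothetical
    # list of score vectors, one vector per component model).
    rank_average <- function(preds) {
      ranks <- sapply(preds, rank)   # rank each model's scores per case
      rank(rowMeans(ranks))          # average the ranks, then re-rank
    }

    set.seed(7)
    preds <- list(runif(10), runif(10), runif(10))  # three dummy models
    rank_average(preds)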

Figure 1.6: Box plot of the number of defaulters missed vs. the number of models in the combination, for Figure 1.5; median (and mean) error decreased as more models are combined.

Most of the ensembling being done in research and applications uses variations of one kind of modeling method, particularly decision trees (as described in Chapter 2 and throughout this book). But one great example of heterogeneous ensembling captured the imagination of the "geek" community recently. The Netflix Prize was a contest that ran for two years, in which the first team to submit a model improving on Netflix's internal recommendation system by 10% would win $1,000,000. Contestants were supplied with entries from a huge movie/user matrix (only 2% non-missing) and asked to predict the ranking (from 1 to 5) of a set of the blank cells. A team one of us was on, Ensemble Experts, peaked at 3rd place at a time when over 20,000 teams had submitted. Moving that high in the rankings using ensembles may have inspired other leading competitors, since near the end of the contest, when the two top teams were extremely close to each other and to winning the prize, the final edge was obtained by weighing contributions from the models of up to 30 competitors. Note that the ensembling techniques explained in this book are even more advanced than those employed in the final stages of the Netflix Prize.

1.4 ORGANIZATION OF THIS BOOK

Chapter 2 presents the formal problem of predictive learning and details the most popular nonlinear method, decision trees, which are used throughout the book to illustrate concepts. Chapter 3 discusses model complexity and how regularizing complexity helps model selection. Regularization techniques play an essential role in modern ensembling. Chapters 4 and 5 are the heart of the book; there, the useful new concepts of Importance Sampling Learning Ensembles (ISLE) and Rule Ensembles developed by J. Friedman and colleagues are explained clearly. The ISLE framework

allows us to view the classic ensemble methods of Bagging, Random Forest, AdaBoost, and Gradient Boosting as special cases of a single algorithm. This unified view clarifies the properties of these methods and suggests ways to improve their accuracy and speed. Rule Ensembles is a new ISLE-based model built by combining simple, readable rules. While maintaining (and often improving) the accuracy of the classic tree ensemble, the rule-based model is much more interpretable. Chapter 5 also illustrates recently proposed interpretation statistics, which are applicable to Rule Ensembles as well as to most other ensemble types. Chapter 6 concludes by explaining why ensembles generalize much better than their apparent complexity would seem to allow. Throughout, snippets of code in R are provided to illustrate the algorithms described.


CHAPTER 2

Predictive Learning and Decision Trees

In this chapter, we provide an overview of predictive learning and decision trees. Before introducing formal notation, consider a very simple data set represented by the following data matrix:

Table 2.1: A simple data set. Each row represents a data point and each column corresponds to an attribute. Sometimes, attribute values could be unknown or missing (denoted by a "?" below).

    TI    PE    Response
    1.0   M2    good
    2.0   M1    bad
    4.5   M5    ?

Each row in the matrix represents an observation or data point. Each column corresponds to an attribute of the observations: TI, PE, and Response, in this example. TI is a numeric attribute, PE is an ordinal attribute, and Response is a categorical attribute. A categorical attribute is one that has two or more values, but there is no intrinsic ordering to the values; e.g., either good or bad in Table 2.1. An ordinal attribute is similar to a categorical one but with a clear ordering of the attribute values. Thus, in this example, M1 comes before M2, M2 comes before M3, etc.

Graphically, this data set can be represented by a simple two-dimensional plot, with the numeric attribute TI rendered on the horizontal axis and the ordinal attribute PE rendered on the vertical axis (Figure 2.1).

When presented with a data set such as the one above, there are two possible modeling tasks:

1. Describe: Summarize existing data in an understandable and actionable way
2. Predict: What is the Response (e.g., class) of a new point? See (Hastie et al., 2009).

More formally, we say we are given "training" data D = {yᵢ, xᵢ₁, xᵢ₂, ..., xᵢₙ}₁ᴺ = {yᵢ, xᵢ}₁ᴺ where

- yᵢ, xᵢⱼ are measured values of attributes (properties, characteristics) of an object
- yᵢ is the "response" (or output) variable

Figure 2.1: A graphical rendering of the data set from Table 2.1 (TI on the horizontal axis, PE levels M1 through M9 on the vertical axis). Numeric and ordinal attributes make appropriate axes because they are ordered, while categorical attributes require color coding the points. The diagonal line represents the best linear boundary separating the blue cases from the green cases.

- xᵢⱼ are the "predictor" (or input) variables
- xᵢ is the input vector made of all the attribute values for the i-th observation
- n is the number of attributes; thus, we also say that the size of x is n
- N is the number of observations
- D is a random sample from some unknown (joint) distribution p(x, y); i.e., it is assumed there is a true underlying distribution out there, and that through a data collection effort, we've drawn a random sample from it.

Predictive Learning is the problem of using D to build a functional model

$$\hat{y} = \hat{F}(x_1, x_2, \ldots, x_n) = \hat{F}(\mathbf{x})$$

which is the best predictor of y given input x. It is also often desirable for the model to offer an interpretable description of how the inputs affect the outputs. When y is categorical, the problem is termed a classification problem; when y is numeric, the problem is termed a regression problem.

The simplest model, or estimator, is a linear model, with functional form

$$\hat{F}(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j$$

i.e., a weighted linear combination of the predictors. The coefficients {aⱼ}₀ⁿ are to be determined via a model fitting process such as ordinary linear regression (after assigning numeric labels to the points; i.e., +1 to the blue cases and -1 to the green cases). We use the notation F̂(x) to refer

to the output of the fitting process, an approximation to the true but unknown function F*(x) linking the inputs to the output. The decision boundary for this model, the points where F̂(x) = 0, is a line (see Figure 2.1), or a plane if n > 2. The classification rule simply checks which side of the boundary a given point is at; i.e.,

$$\hat{F}(\mathbf{x}) \geq 0 \Rightarrow \text{blue}, \quad \text{else green}$$

In Figure 2.1, the linear model isn't very good, with several blue points on the (mostly) green side of the boundary.

Decision trees (Breiman et al., 1993; Quinlan, J., 1992) instead create a decision boundary by asking a sequence of nested yes/no questions. Figure 2.2 shows a decision tree for classifying the data of Table 2.1. The first, or root, node splits on variable TI: cases for which TI ≥ 5 follow the left branch and are all classified as blue; cases for which TI < 5 go to the right daughter of the root node, where they are subject to additional split tests.

Figure 2.2: Decision tree example for the data of Table 2.1 (root split TI ≥ 5, followed by splits on PE ∈ {M1, M2, M3} and TI ≥ 2). There are two types of nodes: split and terminal. Terminal nodes are given a class label. When reading the tree, we follow the left branch when a split test condition is met and the right branch otherwise.

At every new node the splitting algorithm takes a fresh look at the data that has arrived at it, and at all the variables and all the splits that are possible. When the data arriving at a given node is mostly of a single class, then the node is no longer split and is assigned a class label corresponding to the majority class within it; these nodes become "terminal" nodes. To classify a new observation, such as the white dot in Figure 2.1, one simply navigates the tree starting at the top (root), following the left branch when a split test condition is met and the right branch otherwise, until arriving at a terminal node. The class label of the terminal node is returned as the tree prediction.
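This navigation is literally a chain of nested yes/no questions, which a few lines of R can make concrete. The sketch below hand-codes the splits of Figure 2.2 as reconstructed above; it is an illustration, not code from the book:

    # The tree of Figure 2.2 as nested yes/no questions.
    classify <- function(TI, PE) {
      if (TI >= 5) return("blue")                    # root split
      if (PE %in% c("M1", "M2", "M3") && TI >= 2)    # right-branch splits
        return("green")
      "blue"
    }

    classify(TI = 1.0, PE = "M2")  # "blue"  (the "good" case of Table 2.1)
    classify(TI = 2.0, PE = "M1")  # "green" (the "bad" case of Table 2.1)

Real induction algorithms, of course, learn these questions from the data, as described in Section 2.1.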

The tree of Figure 2.2 can also be expressed by the following "expert system" rule (assuming green = "bad" and blue = "good"):

    TI ∈ [2, 5] AND PE ∈ {M1, M2, M3} → bad; ELSE → good

which offers an understandable summary of the data (a "descriptive" model). Imagine this data came from a manufacturing process, where M1, M2, M3, etc., were the equipment names of machines used at some processing step, and the TI values represented tracking times for the machines. Then, the model also offers an actionable summary: certain machines used at certain times lead to bad outcomes (e.g., defects). The ability of decision trees to generate interpretable models like this is an important reason for their popularity.

In summary, the predictive learning problem has the following components:

- Data: D = {yᵢ, xᵢ}₁ᴺ

- Model: the underlying functional form sought from the data; e.g., a linear model, a decision tree model, etc. We say the model represents a family F of functions, each indexed by a parameter vector p:

$$\hat{F}(\mathbf{x}) = \hat{F}(\mathbf{x}; \mathbf{p}) \in \mathcal{F}$$

In the case where F are decision trees, for example, the parameter vector p represents the splits defining each possible tree.

- Score criterion: judges the quality of a fitted model. This has two parts:

  Loss function: penalizes individual errors in prediction. Examples for regression tasks include the squared-error loss, L(y, ŷ) = (y − ŷ)², and the absolute-error loss, L(y, ŷ) = |y − ŷ|. Examples for 2-class classification include the exponential loss, L(y, ŷ) = exp(−yŷ), and the (negative) binomial log-likelihood, L(y, ŷ) = log(1 + e^(−yŷ)).

  Risk: the expected loss over all predictions, R(p) = E_{y,x} L(y, F(x; p)), which we often approximate by the average loss over the training data:

$$\hat{R}(\mathbf{p}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{F}(\mathbf{x}_i; \mathbf{p})) \qquad (2.1)$$

In the case of ordinary linear regression (OLR), for instance, which uses squared-error loss, we have

$$\hat{R}(\mathbf{p}) = \hat{R}(\mathbf{a}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - a_0 - \sum_{j=1}^{n} a_j x_{ij} \right)^2$$

- Search strategy: the procedure used to minimize the risk criterion; i.e., the means by which we solve

$$\hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \hat{R}(\mathbf{p})$$

In the case of OLR, the search strategy corresponds to direct matrix algebra. In the case of trees, or neural networks, the search strategy is a heuristic iterative algorithm.

It should be pointed out that no model family is universally better; each has a class of target functions, sample size, signal-to-noise ratio, etc., for which it is best. For instance, trees work well when 100s of variables are available but the output vector only depends on a few of them (say < 10); the opposite is true for Neural Networks (Bishop, C., 1995) and Support Vector Machines (Scholkopf et al., 1999). How to choose the right model family then? We can do the following:

- Match the assumptions for a particular model to what is known about the problem, or
- Try several models and choose the one that performs the best, or
- Use several models and allow each subresult to contribute to the final result (the ensemble method).

2.1 DECISION TREE INDUCTION OVERVIEW

In this section, we look more closely at the algorithm for building decision trees. Figure 2.3 shows an example surface built by a regression tree. It is a piece-wise constant surface: there is a region R̂ₘ in input space for each terminal node in the tree, i.e., the (hyper) rectangles induced by the tree cuts. There is a constant associated with each region, which represents the estimated prediction ŷ = ĉₘ that the tree is making at each terminal node. Formally, an M-terminal node tree model is expressed by:

$$\hat{y} = T(\mathbf{x}) = \sum_{m=1}^{M} \hat{c}_m I_{\hat{R}_m}(\mathbf{x})$$

where I_A(x) is 1 if x ∈ A and 0 otherwise. Because the regions are disjoint, every possible input x belongs in a single one, and the tree model can be thought of as the sum over all these regions.

Trees allow for different loss functions fairly easily. The two most used for regression problems are squared-error, where the optimal constant ĉₘ is the mean, and absolute-error, where the optimal constant is the median of the data points within region Rₘ (Breiman et al., 1993).
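A tiny R sketch makes the piecewise-constant form, and the empirical risk of Equation (2.1), concrete; the one-dimensional regions and constants below are invented for illustration:

    # T(x) = sum_m c_m * I(x in R_m): a 3-region, one-dimensional tree model.
    regions <- list(c(-Inf, 2), c(2, 5), c(5, Inf))  # disjoint, cover the line
    c_m     <- c(1.0, 0.4, 1.6)                      # per-region constants

    tree_predict <- function(x)
      sapply(x, function(xi)
        sum(c_m * sapply(regions, function(r) xi >= r[1] && xi < r[2])))

    # Empirical risk (2.1) under squared-error loss
    risk <- function(y, y_hat) mean((y - y_hat)^2)

    x <- c(1.0, 2.0, 4.5, 7.0)   # hypothetical inputs
    y <- c(1.1, 0.5, 0.2, 1.5)   # hypothetical responses
    risk(y, tree_predict(x))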

Figure 2.3: Sample regression tree and corresponding surface in input (x) space (adapted from (Hastie et al., 2001)).

If we choose to use squared-error loss, then the search problem, finding the tree T(x) with lowest prediction risk, is stated:

$$\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{\{c_m, R_m\}_1^M} \sum_{i=1}^{N} \left[ y_i - T(\mathbf{x}_i) \right]^2 = \arg\min_{\{c_m, R_m\}_1^M} \sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} c_m I_{R_m}(\mathbf{x}_i) \right]^2$$

To solve, one searches over the space of all possible constants and regions to minimize average loss. Unrestricted optimization with respect to {Rₘ}₁ᴹ is very difficult, so one universal technique is to restrict the shape of the regions (see Figure 2.4). Joint optimization with respect to {Rₘ}₁ᴹ and {cₘ}₁ᴹ simultaneously is also extremely difficult, so a greedy iterative procedure is adopted (see Figure 2.5). The procedure starts with all the data points being in a single region R and computing a score for it; in the case of squared-error loss this is simply:

$$\hat{e}(R) = \frac{1}{N} \sum_{\mathbf{x}_i \in R} \left( y_i - \text{mean}(\{y_i\}_1^N) \right)^2$$

Then each input variable xⱼ, and each possible test sⱼ on that particular variable for splitting R into R_l (left region) and R_r (right region), is considered, and scores ê(R_l) and ê(R_r) computed. The

Figure 2.4: Examples of invalid and valid regions induced by decision trees. To make the problem of building a tree computationally fast, the region boundaries are restricted to be rectangles parallel to the axes. Resulting regions are simple, disjoint, and cover the input space (adapted from (Hastie et al., 2001)).

Figure 2.5: Forward stagewise additive procedure for building decision trees: starting with a single region (i.e., all given data), at the m-th iteration the best available split is found and applied, and the process repeats.

quality, or improvement, score of the split sⱼ is deemed to be

$$\hat{I}(x_j, s_j) = \hat{e}(R) - \hat{e}(R_l) - \hat{e}(R_r)$$

i.e., the reduction in overall error as a result of the split. The algorithm chooses the variable and the split that improve the fit the most, with no regard to what is going to happen subsequently. And then the original region is replaced with the two new regions, and the splitting process continues iteratively (recursively). Note the data is consumed exponentially: each split leads to solving two smaller subsequent problems. So, when should the algorithm stop? Clearly, if all the elements of the set {xᵢ : xᵢ ∈ R} have the same value of y, then no split is going to improve the score, i.e., reduce the risk; in this case,

we say the region R is "pure." One could also specify a maximum number of desired terminal nodes, maximum tree depth, or minimum node size. In the next chapter, we will discuss a more principled way of deciding the optimal tree size.

This simple algorithm can be coded in a few lines (see the sketch at the end of this chapter). But, of course, to handle real and categorical variables, missing values, and various loss functions takes thousands of lines of code. In R, decision trees for regression and classification are available in the rpart package (rpart).

2.2 DECISION TREE PROPERTIES

As recently as 2007, a KDnuggets poll (Data Mining Methods, 2007) concluded that trees were the method most frequently used by practitioners. This is so because they have many desirable data mining properties. These are as follows:

1. Ability to deal with irrelevant inputs. Since at every node we scan all the variables and pick the best, trees naturally do variable selection. And, thus, anything you can measure, you can allow as a candidate without worrying that they will unduly skew your results. Trees also provide a variable importance score based on the contribution to error (risk) reduction across all the splits in the tree (see Chapter 5).

2. No data preprocessing needed. Trees naturally handle numeric, binary, and categorical variables. Numeric attributes have splits of the form xⱼ < cut_value; categorical attributes have splits of the form xⱼ ∈ {value1, value2, ...}. Monotonic transformations won't affect the splits, so you don't have problems with input outliers. If cut_value = 3 and a value xⱼ is 3.14 or 3,100, it is greater than 3, so it goes to the same side. Output outliers can still be influential, especially with squared-error as the loss.

3. Scalable computation. Trees are very fast to build and run compared to other iterative techniques. Building a tree has approximate time complexity of O(nN log N).

4. Missing value tolerant. Trees do not suffer much loss of accuracy due to missing values. Some tree algorithms treat missing values as a separate categorical value. CART handles them via a clever mechanism termed "surrogate splits" (Breiman et al., 1993); these are substitute splits, in case the first variable is unknown, which are selected based on their ability to approximate the splitting of the originally intended variable. One may alternatively create a new binary variable xⱼ_is_na (not available) when one believes that there may be information in xⱼ's being missing; i.e., that it may not be missing at random.

5. Off-the-shelf procedure: there are only a few tunable parameters. One can typically use them within minutes of learning about them.

6. Interpretable model representation. The binary tree graphic is very interpretable, at least to a few levels.

2.3 DECISION TREE LIMITATIONS

Despite their many desirable properties, trees also suffer from some severe limitations:

1. Discontinuous piecewise-constant model. If one is trying to fit a trend, piecewise constants are a very poor way to do that (see Figure 2.6). In order to approximate a trend well, many splits would be needed, and in order to have many splits, a large data set is required.

Figure 2.6: A 2-terminal node tree approximation to a linear function F*(x): a single split at x ≤ cut_value yields a two-level step function with constants c₁ and c₂.

2. Data fragmentation. Each split reduces training data for subsequent splits. This is especially problematic in high dimensions where the data is already very sparse, and can lead to overfit (as discussed in Chapter 6).

3. Not good for low-interaction target functions F*(x). This is related to point 1 above. Consider that we can equivalently express a linear target as a sum of single-variable functions:

$$F^*(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j = \sum_{j=1}^{n} f_j(x_j)$$

i.e., no interactions, an additive model. In order for xⱼ to enter the model, the tree must split on it, but once the root split variable is selected, additional variables enter as products of indicator functions. For instance, R̂₁ in Figure 2.3 is defined by the product of I(x₁ > 22) and I(x₂ > 27).

4. Not good for target functions F*(x) that have dependence on many variables. This is related to point 2 above. Many variables imply that many splits are needed, but then we will run into the data fragmentation problem.

5. High variance caused by the greedy search strategy (local optima); i.e., small changes in the data (say, due to sampling fluctuations) can cause big changes in the resulting tree. Furthermore, errors in upper splits are propagated down to affect all splits below them. As a result, very deep trees might be questionable. Sometimes, the second tree following a data change may have very similar performance to the first; this happens because, typically in real data, some variables are very correlated. So the final estimated values might not differ as much as the apparent difference suggested by looking at the variables in the two trees.

Ensemble methods, discussed in Chapter 4, maintain tree advantages (except perhaps interpretability) while dramatically increasing their accuracy. Techniques to improve the interpretability of ensemble methods are discussed in Chapter 5.
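As promised in Section 2.1, here is a sketch of the core of the tree-growing algorithm: the squared-error split-improvement score, plus the rpart call a practitioner would actually use. The vectors x and y and the data frame my_data are hypothetical:

    # Split-improvement I(x_j, s_j) = e(R) - e(R_l) - e(R_r) under
    # squared-error loss (sums of squares; dividing by N gives e-hat).
    region_score <- function(y) {
      if (length(y) == 0) return(0)
      sum((y - mean(y))^2)
    }

    split_improvement <- function(x, y, cut) {
      region_score(y) -
        region_score(y[x < cut]) -
        region_score(y[x >= cut])
    }

    # Greedy search over all candidate cuts of one variable
    best_cut <- function(x, y) {
      cuts <- sort(unique(x))[-1]
      cuts[which.max(sapply(cuts, split_improvement, x = x, y = y))]
    }

    # In practice, rpart implements the full recursive algorithm:
    library(rpart)
    # fit <- rpart(Response ~ TI + PE, data = my_data, method = "class")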

CHAPTER 3

Model Complexity, Model Selection and Regularization

This chapter provides an overview of model complexity, model selection, and regularization. It is intended to help the reader develop an intuition for what bias and variance are; this is important because ensemble methods succeed by reducing bias, reducing variance, or finding a good tradeoff between the two. We will present a definition for regularization and see three different implementations of it. Regularization is a variance control technique which plays an essential role in modern ensembling. We will also review cross-validation, which is used to estimate the meta-parameters introduced by the regularization process. We will see that finding the optimal value of these meta-parameters is equivalent to selecting the optimal model.

3.1 WHAT IS THE RIGHT SIZE OF A TREE?

We start by revisiting the question of how big to grow a tree: what is its right size? As illustrated in Figure 3.1, the dilemma is this: if the number of regions (terminal nodes) is too small, then the piecewise constant approximation is too crude. That intuitively leads to what is called bias, and it creates error.

Figure 3.1: Representation of a tree model fit for simple 1-dimensional data. From left to right, a linear target function, a 2-terminal node tree approximation to this target function, and a 3-terminal node tree approximation. As the number of nodes in the tree grows, the approximation is less crude but overfitting can occur.

If, on the other hand, the tree is too large, with many terminal nodes, overfitting occurs. A tree can be grown all the way to having one terminal node for every single data point in the training

data.¹ Such a tree will have zero error on the training data; however, if we were to obtain a second batch of data (test data), it is very unlikely that the original tree would perform as well on the new data. The tree will have fitted the noise as well as the signal in the training data, analogous to a child memorizing some particular examples without grasping the underlying concept.

With very flexible fitting procedures such as trees, we also have the situation where the variation among trees, fitted to different data samples from a single phenomenon, can be large. Consider a semiconductor manufacturing plant where, for several consecutive days, it is possible to collect a data sample characterizing the devices being made. Imagine that a decision tree is fit to each sample to classify the defect-free vs. failed devices. It is the same process day to day, so one would expect the data distribution to be very similar. If, however, the trees are not very similar to each other, that is known as variance.

3.2 BIAS-VARIANCE DECOMPOSITION

More formally, suppose that the data we have comes from the "additive error" model:

$$y = F^*(\mathbf{x}) + \varepsilon \qquad (3.1)$$

where F*(x) is the target function that we are trying to learn. We don't really know F*, and because either we are not measuring everything that is relevant, or we have problems with our measurement equipment, or what we measure has noise in it, the response variable we have contains the truth plus some error. We assume that these errors are independent and identically distributed. Specifically, we assume ε is normally distributed, i.e., ε ~ N(0, σ²) (although this is not strictly necessary).

Now consider the idealized aggregate estimator

$$\bar{F}(\mathbf{x}) = E\,\hat{F}_D(\mathbf{x}) \qquad (3.2)$$

which is the average fit over all possible data sets. One can think of the expectation operator as an averaging operator. Going back to the manufacturing example, each F̂ represents the model fit to the data set from a given day. And assuming many such data sets can be collected, F̄ can be created as the average of all those F̂'s.

Now, let's look at what the error of one of these F̂'s is on one particular data point, say x₀, under one particular loss function, the squared-error loss, which allows easy analytical manipulation. The error, known as the Mean Square Error (MSE) in this case, at that particular point is the expectation of the squared difference between the target y and F̂:

$$\begin{aligned}
\text{Err}(\mathbf{x}_0) &= E\left[ \left( y - \hat{F}(\mathbf{x}) \right)^2 \mid \mathbf{x} = \mathbf{x}_0 \right] \\
&= E\left[ \left( F^*(\mathbf{x}_0) - \hat{F}(\mathbf{x}_0) \right)^2 \right] + \sigma^2 \\
&= E\left[ \left( F^*(\mathbf{x}_0) - \bar{F}(\mathbf{x}_0) + \bar{F}(\mathbf{x}_0) - \hat{F}(\mathbf{x}_0) \right)^2 \right] + \sigma^2
\end{aligned}$$

¹ Unless two cases have identical input values and different output values.

The derivation above follows from Equations (3.1) and (3.2) and properties of the expectation operator. Continuing, we arrive at:

$$\begin{aligned}
&= E\left[ \left( \bar{F}(\mathbf{x}_0) - F^*(\mathbf{x}_0) \right)^2 \right] + E\left[ \left( \hat{F}(\mathbf{x}_0) - \bar{F}(\mathbf{x}_0) \right)^2 \right] + \sigma^2 \\
&= \left[ \bar{F}(\mathbf{x}_0) - F^*(\mathbf{x}_0) \right]^2 + E\left[ \left( \hat{F}(\mathbf{x}_0) - \bar{F}(\mathbf{x}_0) \right)^2 \right] + \sigma^2 \\
&= \text{Bias}^2(\hat{F}(\mathbf{x}_0)) + \text{Var}(\hat{F}(\mathbf{x}_0)) + \sigma^2
\end{aligned}$$

The final expression says that the error is made of three components:

- [F̄(x₀) − F*(x₀)]²: known as squared-bias, is the amount by which the average estimator F̄ differs from the truth F*. In practice, squared-bias can't be computed, but it is a useful theoretical concept.

- E[(F̂(x₀) − F̄(x₀))²]: known as variance, is the "spread" of the F̂'s around their mean F̄.

- σ²: the irreducible error, the error that was present in the original data, and cannot be reduced unless the data is expanded with new, more relevant attributes, or the measurement equipment is improved, etc.

Figure 3.2 depicts the notions of squared-bias and variance graphically. The blue shaded area is indicative of the σ of the error. Each data set collected represents different realizations of the truth F*, each resulting in a different y; the spread of these y's around F* is represented by the blue circle. The model family F, or "model space," is represented by the region to the right of the red curve. For a given target realization y, one F̂ is fit, which is the member from the model space F that is closest to y. After repeating the fitting process many times, the average F̄ can be computed. Thus, the orange circle represents variance, the spread of the F̂'s around their mean F̄. Similarly, the distance between the average estimator F̄ and the truth F* represents model bias, the amount by which the average estimator differs from the truth.

Because bias and variance add up to MSE, they act as two opposing forces. If bias is reduced, variance will often increase, and vice versa. Figure 3.3 illustrates another aspect of this tradeoff between bias and variance. The horizontal axis corresponds to model complexity. In the case of trees, for example, model complexity can be measured by the size of the tree. At the origin, minimum complexity, there would be a tree of size one, namely a stump. At the other extreme of the complexity axis, there would be a tree that has been grown all the way to having one terminal node per observation in the data (maximum complexity). For the complex tree, the training error can be zero (it is only non-zero if cases have different response y with all inputs xⱼ the same). Thus, training error is not a useful measurement of model quality, and a different dataset, the test data set, is needed to assess performance. Assuming a test set is available, if for each tree size performance is measured on it, then the error curve is typically U-shaped, as shown. That is, somewhere on the x-axis there is an M where the test error is at its minimum, which corresponds to the optimal tree size for the given problem.
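The decomposition can also be checked numerically. The following R sketch is illustrative only: it assumes a made-up target F*(x) and uses a deliberately crude 1-nearest-neighbor rule to stand in for the fitted model F̂_D, estimating squared-bias and variance at a single point x₀ over many simulated training sets:

    # Monte Carlo check of MSE = Bias^2 + Var + sigma^2 at a point x0,
    # under the additive-error model y = F*(x) + eps of Equation (3.1).
    set.seed(123)
    f_star <- function(x) sin(2 * pi * x)   # hypothetical target F*(x)
    sigma  <- 0.3
    x0     <- 0.25

    fits <- replicate(2000, {
      x <- runif(50)                        # one training sample D
      y <- f_star(x) + rnorm(50, sd = sigma)
      y[which.min(abs(x - x0))]             # crude 1-NN estimate of F*(x0)
    })

    bias2    <- (mean(fits) - f_star(x0))^2
    variance <- var(fits)
    c(bias2 = bias2, variance = variance, irreducible = sigma^2)

With enough replications, the sum bias2 + variance + irreducible closely matches the average squared error of these fits on fresh noisy observations at x₀, as the derivation above requires.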


More information

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble

More information

DEFINING %COMPLETE IN MICROSOFT PROJECT

DEFINING %COMPLETE IN MICROSOFT PROJECT CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,

More information

CHAPTER 14 MORE ABOUT REGRESSION

CHAPTER 14 MORE ABOUT REGRESSION CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp

More information

Credit Limit Optimization (CLO) for Credit Cards

Credit Limit Optimization (CLO) for Credit Cards Credt Lmt Optmzaton (CLO) for Credt Cards Vay S. Desa CSCC IX, Ednburgh September 8, 2005 Copyrght 2003, SAS Insttute Inc. All rghts reserved. SAS Propretary Agenda Background Tradtonal approaches to credt

More information

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ). REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or

More information

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model

More information

Statistical Methods to Develop Rating Models

Statistical Methods to Develop Rating Models Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and

More information

Realistic Image Synthesis

Realistic Image Synthesis Realstc Image Synthess - Combned Samplng and Path Tracng - Phlpp Slusallek Karol Myszkowsk Vncent Pegoraro Overvew: Today Combned Samplng (Multple Importance Samplng) Renderng and Measurng Equaton Random

More information

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features On-Lne Fault Detecton n Wnd Turbne Transmsson System usng Adaptve Flter and Robust Statstcal Features Ruoyu L Remote Dagnostcs Center SKF USA Inc. 3443 N. Sam Houston Pkwy., Houston TX 77086 Emal: ruoyu.l@skf.com

More information

Recurrence. 1 Definitions and main statements

Recurrence. 1 Definitions and main statements Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.

More information

Extending Probabilistic Dynamic Epistemic Logic

Extending Probabilistic Dynamic Epistemic Logic Extendng Probablstc Dynamc Epstemc Logc Joshua Sack May 29, 2008 Probablty Space Defnton A probablty space s a tuple (S, A, µ), where 1 S s a set called the sample space. 2 A P(S) s a σ-algebra: a set

More information

Calculating the high frequency transmission line parameters of power cables

Calculating the high frequency transmission line parameters of power cables < ' Calculatng the hgh frequency transmsson lne parameters of power cables Authors: Dr. John Dcknson, Laboratory Servces Manager, N 0 RW E B Communcatons Mr. Peter J. Ncholson, Project Assgnment Manager,

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy Fnancal Tme Seres Analyss Patrck McSharry patrck@mcsharry.net www.mcsharry.net Trnty Term 2014 Mathematcal Insttute Unversty of Oxford Course outlne 1. Data analyss, probablty, correlatons, vsualsaton

More information

Single and multiple stage classifiers implementing logistic discrimination

Single and multiple stage classifiers implementing logistic discrimination Sngle and multple stage classfers mplementng logstc dscrmnaton Hélo Radke Bttencourt 1 Dens Alter de Olvera Moraes 2 Vctor Haertel 2 1 Pontfíca Unversdade Católca do Ro Grande do Sul - PUCRS Av. Ipranga,

More information

An Empirical Study of Search Engine Advertising Effectiveness

An Empirical Study of Search Engine Advertising Effectiveness An Emprcal Study of Search Engne Advertsng Effectveness Sanjog Msra, Smon School of Busness Unversty of Rochester Edeal Pnker, Smon School of Busness Unversty of Rochester Alan Rmm-Kaufman, Rmm-Kaufman

More information

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy 4.02 Quz Solutons Fall 2004 Multple-Choce Questons (30/00 ponts) Please, crcle the correct answer for each of the followng 0 multple-choce questons. For each queston, only one of the answers s correct.

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

Logistic Regression. Steve Kroon

Logistic Regression. Steve Kroon Logstc Regresson Steve Kroon Course notes sectons: 24.3-24.4 Dsclamer: these notes do not explctly ndcate whether values are vectors or scalars, but expects the reader to dscern ths from the context. Scenaro

More information

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange

More information

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

How To Understand The Results Of The German Meris Cloud And Water Vapour Product Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

Activity Scheduling for Cost-Time Investment Optimization in Project Management

Activity Scheduling for Cost-Time Investment Optimization in Project Management PROJECT MANAGEMENT 4 th Internatonal Conference on Industral Engneerng and Industral Management XIV Congreso de Ingenería de Organzacón Donosta- San Sebastán, September 8 th -10 th 010 Actvty Schedulng

More information

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM BARRIOT Jean-Perre, SARRAILH Mchel BGI/CNES 18.av.E.Beln 31401 TOULOUSE Cedex 4 (France) Emal: jean-perre.barrot@cnes.fr 1/Introducton The

More information

Luby s Alg. for Maximal Independent Sets using Pairwise Independence

Luby s Alg. for Maximal Independent Sets using Pairwise Independence Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent

More information

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000 Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

More information

Intra-year Cash Flow Patterns: A Simple Solution for an Unnecessary Appraisal Error

Intra-year Cash Flow Patterns: A Simple Solution for an Unnecessary Appraisal Error Intra-year Cash Flow Patterns: A Smple Soluton for an Unnecessary Apprasal Error By C. Donald Wggns (Professor of Accountng and Fnance, the Unversty of North Florda), B. Perry Woodsde (Assocate Professor

More information

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by 6 CHAPTER 8 COMPLEX VECTOR SPACES 5. Fnd the kernel of the lnear transformaton gven n Exercse 5. In Exercses 55 and 56, fnd the mage of v, for the ndcated composton, where and are gven by the followng

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal peter.vortsch@ptv.de Peter Möhl, PTV AG,

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

Sketching Sampled Data Streams

Sketching Sampled Data Streams Sketchng Sampled Data Streams Florn Rusu, Aln Dobra CISE Department Unversty of Florda Ganesvlle, FL, USA frusu@cse.ufl.edu adobra@cse.ufl.edu Abstract Samplng s used as a unversal method to reduce the

More information

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School Robust Desgn of Publc Storage Warehouses Yemng (Yale) Gong EMLYON Busness School Rene de Koster Rotterdam school of management, Erasmus Unversty Abstract We apply robust optmzaton and revenue management

More information

Lecture 5,6 Linear Methods for Classification. Summary

Lecture 5,6 Linear Methods for Classification. Summary Lecture 5,6 Lnear Methods for Classfcaton Rce ELEC 697 Farnaz Koushanfar Fall 2006 Summary Bayes Classfers Lnear Classfers Lnear regresson of an ndcator matrx Lnear dscrmnant analyss (LDA) Logstc regresson

More information

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Internatonal Journal of Electronc Busness Management, Vol. 3, No. 4, pp. 30-30 (2005) 30 THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Yu-Mn Chang *, Yu-Cheh

More information

1.1 The University may award Higher Doctorate degrees as specified from time-to-time in UPR AS11 1.

1.1 The University may award Higher Doctorate degrees as specified from time-to-time in UPR AS11 1. HIGHER DOCTORATE DEGREES SUMMARY OF PRINCIPAL CHANGES General changes None Secton 3.2 Refer to text (Amendments to verson 03.0, UPR AS02 are shown n talcs.) 1 INTRODUCTION 1.1 The Unversty may award Hgher

More information

8 Algorithm for Binary Searching in Trees

8 Algorithm for Binary Searching in Trees 8 Algorthm for Bnary Searchng n Trees In ths secton we present our algorthm for bnary searchng n trees. A crucal observaton employed by the algorthm s that ths problem can be effcently solved when the

More information

How To Calculate The Accountng Perod Of Nequalty

How To Calculate The Accountng Perod Of Nequalty Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background: SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and

More information

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Research Note APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES * Iranan Journal of Scence & Technology, Transacton B, Engneerng, ol. 30, No. B6, 789-794 rnted n The Islamc Republc of Iran, 006 Shraz Unversty "Research Note" ALICATION OF CHARGE SIMULATION METHOD TO ELECTRIC

More information

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING Matthew J. Lberatore, Department of Management and Operatons, Vllanova Unversty, Vllanova, PA 19085, 610-519-4390,

More information

Fixed income risk attribution

Fixed income risk attribution 5 Fxed ncome rsk attrbuton Chthra Krshnamurth RskMetrcs Group chthra.krshnamurth@rskmetrcs.com We compare the rsk of the actve portfolo wth that of the benchmark and segment the dfference between the two

More information

Section 5.4 Annuities, Present Value, and Amortization

Section 5.4 Annuities, Present Value, and Amortization Secton 5.4 Annutes, Present Value, and Amortzaton Present Value In Secton 5.2, we saw that the present value of A dollars at nterest rate per perod for n perods s the amount that must be deposted today

More information

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS 21 22 September 2007, BULGARIA 119 Proceedngs of the Internatonal Conference on Informaton Technologes (InfoTech-2007) 21 st 22 nd September 2007, Bulgara vol. 2 INVESTIGATION OF VEHICULAR USERS FAIRNESS

More information

+ + + - - This circuit than can be reduced to a planar circuit

+ + + - - This circuit than can be reduced to a planar circuit MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to

More information

Regression Models for a Binary Response Using EXCEL and JMP

Regression Models for a Binary Response Using EXCEL and JMP SEMATECH 997 Statstcal Methods Symposum Austn Regresson Models for a Bnary Response Usng EXCEL and JMP Davd C. Trndade, Ph.D. STAT-TECH Consultng and Tranng n Appled Statstcs San Jose, CA Topcs Practcal

More information

Chapter 6. Classification and Prediction

Chapter 6. Classification and Prediction Chapter 6. Classfcaton and Predcton What s classfcaton? What s Lazy learners (or learnng from predcton? your neghbors) Issues regardng classfcaton and Frequent-pattern-based predcton classfcaton Classfcaton

More information

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits Lnear Crcuts Analyss. Superposton, Theenn /Norton Equalent crcuts So far we hae explored tmendependent (resste) elements that are also lnear. A tmendependent elements s one for whch we can plot an / cure.

More information

APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT

APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT Toshhko Oda (1), Kochro Iwaoka (2) (1), (2) Infrastructure Systems Busness Unt, Panasonc System Networks Co., Ltd. Saedo-cho

More information

Abstract. Clustering ensembles have emerged as a powerful method for improving both the

Abstract. Clustering ensembles have emerged as a powerful method for improving both the Clusterng Ensembles: {topchyal, Models jan, of punch}@cse.msu.edu Consensus and Weak Parttons * Alexander Topchy, Anl K. Jan, and Wllam Punch Department of Computer Scence and Engneerng, Mchgan State Unversty

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsn-yng Wu b a Professor (Management Scence), Natonal Chao

More information

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm Document Clusterng Analyss Based on Hybrd PSO+K-means Algorthm Xaohu Cu, Thomas E. Potok Appled Software Engneerng Research Group, Computatonal Scences and Engneerng Dvson, Oak Rdge Natonal Laboratory,

More information

Portfolio Loss Distribution

Portfolio Loss Distribution Portfolo Loss Dstrbuton Rsky assets n loan ortfolo hghly llqud assets hold-to-maturty n the bank s balance sheet Outstandngs The orton of the bank asset that has already been extended to borrowers. Commtment

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

Trade Adjustment and Productivity in Large Crises. Online Appendix May 2013. Appendix A: Derivation of Equations for Productivity

Trade Adjustment and Productivity in Large Crises. Online Appendix May 2013. Appendix A: Derivation of Equations for Productivity Trade Adjustment Productvty n Large Crses Gta Gopnath Department of Economcs Harvard Unversty NBER Brent Neman Booth School of Busness Unversty of Chcago NBER Onlne Appendx May 2013 Appendx A: Dervaton

More information

Financial Mathemetics

Financial Mathemetics Fnancal Mathemetcs 15 Mathematcs Grade 12 Teacher Gude Fnancal Maths Seres Overvew In ths seres we am to show how Mathematcs can be used to support personal fnancal decsons. In ths seres we jon Tebogo,

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance Calbraton Method Instances of the Cell class (one nstance for each FMS cell) contan ADC raw data and methods assocated wth each partcular FMS cell. The calbraton method ncludes event selecton (Class Cell

More information

Improved SVM in Cloud Computing Information Mining

Improved SVM in Cloud Computing Information Mining Internatonal Journal of Grd Dstrbuton Computng Vol.8, No.1 (015), pp.33-40 http://dx.do.org/10.1457/jgdc.015.8.1.04 Improved n Cloud Computng Informaton Mnng Lvshuhong (ZhengDe polytechnc college JangSu

More information

Performance Analysis and Coding Strategy of ECOC SVMs

Performance Analysis and Coding Strategy of ECOC SVMs Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School

More information

Loop Parallelization

Loop Parallelization - - Loop Parallelzaton C-52 Complaton steps: nested loops operatng on arrays, sequentell executon of teraton space DECLARE B[..,..+] FOR I :=.. FOR J :=.. I B[I,J] := B[I-,J]+B[I-,J-] ED FOR ED FOR analyze

More information

The Application of Fractional Brownian Motion in Option Pricing

The Application of Fractional Brownian Motion in Option Pricing Vol. 0, No. (05), pp. 73-8 http://dx.do.org/0.457/jmue.05.0..6 The Applcaton of Fractonal Brownan Moton n Opton Prcng Qng-xn Zhou School of Basc Scence,arbn Unversty of Commerce,arbn zhouqngxn98@6.com

More information

Return decomposing of absolute-performance multi-asset class portfolios. Working Paper - Nummer: 16

Return decomposing of absolute-performance multi-asset class portfolios. Working Paper - Nummer: 16 Return decomposng of absolute-performance mult-asset class portfolos Workng Paper - Nummer: 16 2007 by Dr. Stefan J. Illmer und Wolfgang Marty; n: Fnancal Markets and Portfolo Management; March 2007; Volume

More information

Implementation of Deutsch's Algorithm Using Mathcad

Implementation of Deutsch's Algorithm Using Mathcad Implementaton of Deutsch's Algorthm Usng Mathcad Frank Roux The followng s a Mathcad mplementaton of Davd Deutsch's quantum computer prototype as presented on pages - n "Machnes, Logc and Quantum Physcs"

More information

Efficient Project Portfolio as a tool for Enterprise Risk Management

Efficient Project Portfolio as a tool for Enterprise Risk Management Effcent Proect Portfolo as a tool for Enterprse Rsk Management Valentn O. Nkonov Ural State Techncal Unversty Growth Traectory Consultng Company January 5, 27 Effcent Proect Portfolo as a tool for Enterprse

More information

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008 Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

Frequency Selective IQ Phase and IQ Amplitude Imbalance Adjustments for OFDM Direct Conversion Transmitters

Frequency Selective IQ Phase and IQ Amplitude Imbalance Adjustments for OFDM Direct Conversion Transmitters Frequency Selectve IQ Phase and IQ Ampltude Imbalance Adjustments for OFDM Drect Converson ransmtters Edmund Coersmeer, Ernst Zelnsk Noka, Meesmannstrasse 103, 44807 Bochum, Germany edmund.coersmeer@noka.com,

More information

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6 PAR TESTS If a WEIGHT varable s specfed, t s used to replcate a case as many tmes as ndcated by the weght value rounded to the nearest nteger. If the workspace requrements are exceeded and samplng has

More information