# The Elements of Statistical Learning



Springer Series in Statistics

Trevor Hastie, Robert Tibshirani, Jerome Friedman

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition

To our parents:

Valerie and Patrick Hastie
Vera and Sami Tibshirani
Florence and Harry Friedman

and to our families:

Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl
Melanie, Dora, Monika, and Ildiko


## Preface to the Second Edition

*In God we trust, all others bring data.*
William Edwards Deming (1900-1993)

We have been gratified by the popularity of the first edition of The Elements of Statistical Learning. This, along with the fast pace of research in the statistical learning field, motivated us to update our book with a second edition. We have added four new chapters and updated some of the existing chapters. Because many readers are familiar with the layout of the first edition, we have tried to change it as little as possible. Here is a summary of the main changes:

(On the Web, this quote has been widely attributed to both Deming and Robert W. Hayden; however Professor Hayden told us that he can claim no credit for this quote, and ironically we could find no data confirming that Deming actually said this.)

| Chapter | What's new |
|---|---|
| 1. Introduction | |
| 2. Overview of Supervised Learning | |
| 3. Linear Methods for Regression | LAR algorithm and generalizations of the lasso |
| 4. Linear Methods for Classification | Lasso path for logistic regression |
| 5. Basis Expansions and Regularization | Additional illustrations of RKHS |
| 6. Kernel Smoothing Methods | |
| 7. Model Assessment and Selection | Strengths and pitfalls of cross-validation |
| 8. Model Inference and Averaging | |
| 9. Additive Models, Trees, and Related Methods | |
| 10. Boosting and Additive Trees | New example from ecology; some material split off to Chapter 16 |
| 11. Neural Networks | Bayesian neural nets and the NIPS 2003 challenge |
| 12. Support Vector Machines and Flexible Discriminants | Path algorithm for SVM classifier |
| 13. Prototype Methods and Nearest-Neighbors | |
| 14. Unsupervised Learning | Spectral clustering, kernel PCA, sparse PCA, non-negative matrix factorization, archetypal analysis, nonlinear dimension reduction, Google page rank algorithm, a direct approach to ICA |
| 15. Random Forests | New |
| 16. Ensemble Learning | New |
| 17. Undirected Graphical Models | New |
| 18. High-Dimensional Problems | New |

Some further notes:

- Our first edition was unfriendly to colorblind readers; in particular, we tended to favor red/green contrasts which are particularly troublesome. We have changed the color palette in this edition to a large extent, replacing the above with an orange/blue contrast.
- We have changed the name of Chapter 6 from "Kernel Methods" to "Kernel Smoothing Methods," to avoid confusion with the machine-learning kernel method that is discussed in the context of support vector machines (Chapter 12) and more generally in Chapters 5 and 14.
- In the first edition, the discussion of error-rate estimation in Chapter 7 was sloppy, as we did not clearly differentiate the notions of conditional error rates (conditional on the training set) and unconditional rates. We have fixed this in the new edition.

- Chapters 15 and 16 follow naturally from Chapter 10, and the chapters are probably best read in that order.
- In Chapter 17, we have not attempted a comprehensive treatment of graphical models, and discuss only undirected models and some new methods for their estimation. Due to a lack of space, we have specifically omitted coverage of directed graphical models.
- Chapter 18 explores the p ≫ N problem, which is learning in high-dimensional feature spaces. These problems arise in many areas, including genomic and proteomic studies, and document classification.

We thank the many readers who have found the (too numerous) errors in the first edition. We apologize for those and have done our best to avoid errors in this new edition. We thank Mark Segal, Bala Rajaratnam, and Larry Wasserman for comments on some of the new chapters, and many Stanford graduate and post-doctoral students who offered comments, in particular Mohammed AlQuraishi, John Boik, Holger Hoefling, Arian Maleki, Donal McMahon, Saharon Rosset, Babak Shababa, Daniela Witten, Ji Zhu and Hui Zou. We thank John Kimmel for his patience in guiding us through this new edition. RT dedicates this edition to the memory of Anna McPhee.

Trevor Hastie, Robert Tibshirani, Jerome Friedman
Stanford, California
August 2008


## Preface to the First Edition

*We are drowning in information and starving for knowledge.*
Rutherford D. Roger

The field of Statistics is constantly challenged by the problems that science and industry brings to its door. In the early days, these problems often came from agricultural and industrial experiments and were relatively small in scope. With the advent of computers and the information age, statistical problems have exploded both in size and complexity. Challenges in the areas of data storage, organization and searching have led to the new field of "data mining"; statistical and computational problems in biology and medicine have created "bioinformatics." Vast amounts of data are being generated in many fields, and the statistician's job is to make sense of it all: to extract important patterns and trends, and understand "what the data says." We call this learning from data.

The challenges in learning from data have led to a revolution in the statistical sciences. Since computation plays such a key role, it is not surprising that much of this new development has been done by researchers in other fields such as computer science and engineering.

The learning problems that we consider can be roughly categorized as either supervised or unsupervised. In supervised learning, the goal is to predict the value of an outcome measure based on a number of input measures; in unsupervised learning, there is no outcome measure, and the goal is to describe the associations and patterns among a set of input measures.

This book is our attempt to bring together many of the important new ideas in learning, and explain them in a statistical framework. While some mathematical details are needed, we emphasize the methods and their conceptual underpinnings rather than their theoretical properties. As a result, we hope that this book will appeal not just to statisticians but also to researchers and practitioners in a wide variety of fields.

Just as we have learned a great deal from researchers outside of the field of statistics, our statistical viewpoint may help others to better understand different aspects of learning:

*There is no true interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about an idea.*
Andreas Buja

We would like to acknowledge the contribution of many people to the conception and completion of this book. David Andrews, Leo Breiman, Andreas Buja, John Chambers, Bradley Efron, Geoffrey Hinton, Werner Stuetzle, and John Tukey have greatly influenced our careers. Balasubramanian Narasimhan gave us advice and help on many computational problems, and maintained an excellent computing environment. Shin-Ho Bang helped in the production of a number of the figures. Lee Wilkinson gave valuable tips on color production. Ilana Belitskaya, Eva Cantoni, Maya Gupta, Michael Jordan, Shanti Gopatam, Radford Neal, Jorge Picazo, Bogdan Popescu, Olivier Renaud, Saharon Rosset, John Storey, Ji Zhu, Mu Zhu, two reviewers and many students read parts of the manuscript and offered helpful suggestions. John Kimmel was supportive, patient and helpful at every phase; MaryAnn Brickner and Frank Ganz headed a superb production team at Springer. Trevor Hastie would like to thank the statistics department at the University of Cape Town for their hospitality during the final stages of this book. We gratefully acknowledge NSF and NIH for their support of this work. Finally, we would like to thank our families and our parents for their love and support.
Trevor Hastie, Robert Tibshirani, Jerome Friedman
Stanford, California
May 2001

*The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions...*
Ian Hacking

## Contents

Preface to the Second Edition
Preface to the First Edition

1. Introduction

2. Overview of Supervised Learning
    - Introduction
    - Variable Types and Terminology
    - Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors
        - Linear Models and Least Squares
        - Nearest-Neighbor Methods
        - From Least Squares to Nearest Neighbors
    - Statistical Decision Theory
    - Local Methods in High Dimensions
    - Statistical Models, Supervised Learning and Function Approximation
        - A Statistical Model for the Joint Distribution Pr(X, Y)
        - Supervised Learning
        - Function Approximation
    - Structured Regression Models
        - Difficulty of the Problem

    - Classes of Restricted Estimators
        - Roughness Penalty and Bayesian Methods
        - Kernel Methods and Local Regression
        - Basis Functions and Dictionary Methods
    - Model Selection and the Bias-Variance Tradeoff
    - Bibliographic Notes
    - Exercises

3. Linear Methods for Regression
    - Introduction
    - Linear Regression Models and Least Squares
        - Example: Prostate Cancer
        - The Gauss-Markov Theorem
        - Multiple Regression from Simple Univariate Regression
        - Multiple Outputs
    - Subset Selection
        - Best-Subset Selection
        - Forward- and Backward-Stepwise Selection
        - Forward-Stagewise Regression
        - Prostate Cancer Data Example (Continued)
    - Shrinkage Methods
        - Ridge Regression
        - The Lasso
        - Discussion: Subset Selection, Ridge Regression and the Lasso
        - Least Angle Regression
    - Methods Using Derived Input Directions
        - Principal Components Regression
        - Partial Least Squares
    - Discussion: A Comparison of the Selection and Shrinkage Methods
    - Multiple Outcome Shrinkage and Selection
    - More on the Lasso and Related Path Algorithms
        - Incremental Forward Stagewise Regression
        - Piecewise-Linear Path Algorithms
        - The Dantzig Selector
        - The Grouped Lasso
        - Further Properties of the Lasso
        - Pathwise Coordinate Optimization
    - Computational Considerations
    - Bibliographic Notes
    - Exercises

4. Linear Methods for Classification
    - Introduction
    - Linear Regression of an Indicator Matrix
    - Linear Discriminant Analysis
        - Regularized Discriminant Analysis
        - Computations for LDA
        - Reduced-Rank Linear Discriminant Analysis
    - Logistic Regression
        - Fitting Logistic Regression Models
        - Example: South African Heart Disease
        - Quadratic Approximations and Inference
        - L1 Regularized Logistic Regression
        - Logistic Regression or LDA?
    - Separating Hyperplanes
        - Rosenblatt's Perceptron Learning Algorithm
        - Optimal Separating Hyperplanes
    - Bibliographic Notes
    - Exercises

5. Basis Expansions and Regularization
    - Introduction
    - Piecewise Polynomials and Splines
        - Natural Cubic Splines
        - Example: South African Heart Disease (Continued)
        - Example: Phoneme Recognition
    - Filtering and Feature Extraction
    - Smoothing Splines
        - Degrees of Freedom and Smoother Matrices
    - Automatic Selection of the Smoothing Parameters
        - Fixing the Degrees of Freedom
        - The Bias-Variance Tradeoff
    - Nonparametric Logistic Regression
    - Multidimensional Splines
    - Regularization and Reproducing Kernel Hilbert Spaces
        - Spaces of Functions Generated by Kernels
        - Examples of RKHS
    - Wavelet Smoothing
        - Wavelet Bases and the Wavelet Transform
        - Adaptive Wavelet Filtering
    - Bibliographic Notes
    - Exercises
    - Appendix: Computational Considerations for Splines
        - Appendix: B-splines
        - Appendix: Computations for Smoothing Splines

6. Kernel Smoothing Methods
    - One-Dimensional Kernel Smoothers
        - Local Linear Regression
        - Local Polynomial Regression
    - Selecting the Width of the Kernel
    - Local Regression in IR^p
    - Structured Local Regression Models in IR^p
        - Structured Kernels
        - Structured Regression Functions
    - Local Likelihood and Other Models
    - Kernel Density Estimation and Classification
        - Kernel Density Estimation
        - Kernel Density Classification
        - The Naive Bayes Classifier
    - Radial Basis Functions and Kernels
    - Mixture Models for Density Estimation and Classification
    - Computational Considerations
    - Bibliographic Notes
    - Exercises

7. Model Assessment and Selection
    - Introduction
    - Bias, Variance and Model Complexity
    - The Bias-Variance Decomposition
        - Example: Bias-Variance Tradeoff
    - Optimism of the Training Error Rate
    - Estimates of In-Sample Prediction Error
    - The Effective Number of Parameters
    - The Bayesian Approach and BIC
    - Minimum Description Length
    - Vapnik-Chervonenkis Dimension
        - Example (Continued)
    - Cross-Validation
        - K-Fold Cross-Validation
        - The Wrong and Right Way to Do Cross-validation
        - Does Cross-Validation Really Work?
    - Bootstrap Methods
        - Example (Continued)
    - Conditional or Expected Test Error?
    - Bibliographic Notes
    - Exercises

8. Model Inference and Averaging
    - Introduction

    - The Bootstrap and Maximum Likelihood Methods
        - A Smoothing Example
        - Maximum Likelihood Inference
        - Bootstrap versus Maximum Likelihood
    - Bayesian Methods
    - Relationship Between the Bootstrap and Bayesian Inference
    - The EM Algorithm
        - Two-Component Mixture Model
        - The EM Algorithm in General
        - EM as a Maximization-Maximization Procedure
    - MCMC for Sampling from the Posterior
    - Bagging
        - Example: Trees with Simulated Data
    - Model Averaging and Stacking
    - Stochastic Search: Bumping
    - Bibliographic Notes
    - Exercises

9. Additive Models, Trees, and Related Methods
    - Generalized Additive Models
        - Fitting Additive Models
        - Example: Additive Logistic Regression
        - Summary
    - Tree-Based Methods
        - Background
        - Regression Trees
        - Classification Trees
        - Other Issues
        - Spam Example (Continued)
    - PRIM: Bump Hunting
        - Spam Example (Continued)
    - MARS: Multivariate Adaptive Regression Splines
        - Spam Example (Continued)
        - Example (Simulated Data)
        - Other Issues
    - Hierarchical Mixtures of Experts
    - Missing Data
    - Computational Considerations
    - Bibliographic Notes
    - Exercises

10. Boosting and Additive Trees
    - Boosting Methods
        - Outline of This Chapter

    - Boosting Fits an Additive Model
    - Forward Stagewise Additive Modeling
    - Exponential Loss and AdaBoost
    - Why Exponential Loss?
    - Loss Functions and Robustness
    - "Off-the-Shelf" Procedures for Data Mining
    - Example: Spam Data
    - Boosting Trees
    - Numerical Optimization via Gradient Boosting
        - Steepest Descent
        - Gradient Boosting
        - Implementations of Gradient Boosting
    - Right-Sized Trees for Boosting
    - Regularization
        - Shrinkage
        - Subsampling
    - Interpretation
        - Relative Importance of Predictor Variables
        - Partial Dependence Plots
    - Illustrations
        - California Housing
        - New Zealand Fish
        - Demographics Data
    - Bibliographic Notes
    - Exercises

11. Neural Networks
    - Introduction
    - Projection Pursuit Regression
    - Neural Networks
    - Fitting Neural Networks
    - Some Issues in Training Neural Networks
        - Starting Values
        - Overfitting
        - Scaling of the Inputs
        - Number of Hidden Units and Layers
        - Multiple Minima
    - Example: Simulated Data
    - Example: ZIP Code Data
    - Discussion
    - Bayesian Neural Nets and the NIPS 2003 Challenge
        - Bayes, Boosting and Bagging
        - Performance Comparisons
    - Computational Considerations
    - Bibliographic Notes

    - Exercises

12. Support Vector Machines and Flexible Discriminants
    - Introduction
    - The Support Vector Classifier
        - Computing the Support Vector Classifier
        - Mixture Example (Continued)
    - Support Vector Machines and Kernels
        - Computing the SVM for Classification
        - The SVM as a Penalization Method
        - Function Estimation and Reproducing Kernels
        - SVMs and the Curse of Dimensionality
        - A Path Algorithm for the SVM Classifier
        - Support Vector Machines for Regression
        - Regression and Kernels
        - Discussion
    - Generalizing Linear Discriminant Analysis
    - Flexible Discriminant Analysis
        - Computing the FDA Estimates
    - Penalized Discriminant Analysis
    - Mixture Discriminant Analysis
        - Example: Waveform Data
    - Bibliographic Notes
    - Exercises

13. Prototype Methods and Nearest-Neighbors
    - Introduction
    - Prototype Methods
        - K-means Clustering
        - Learning Vector Quantization
        - Gaussian Mixtures
    - k-Nearest-Neighbor Classifiers
        - Example: A Comparative Study
        - Example: k-Nearest-Neighbors and Image Scene Classification
        - Invariant Metrics and Tangent Distance
    - Adaptive Nearest-Neighbor Methods
        - Example
        - Global Dimension Reduction for Nearest-Neighbors
    - Computational Considerations
    - Bibliographic Notes
    - Exercises

14. Unsupervised Learning
    - Introduction
    - Association Rules
        - Market Basket Analysis
        - The Apriori Algorithm
        - Example: Market Basket Analysis
        - Unsupervised as Supervised Learning
        - Generalized Association Rules
        - Choice of Supervised Learning Method
        - Example: Market Basket Analysis (Continued)
    - Cluster Analysis
        - Proximity Matrices
        - Dissimilarities Based on Attributes
        - Object Dissimilarity
        - Clustering Algorithms
        - Combinatorial Algorithms
        - K-means
        - Gaussian Mixtures as Soft K-means Clustering
        - Example: Human Tumor Microarray Data
        - Vector Quantization
        - K-medoids
        - Practical Issues
        - Hierarchical Clustering
    - Self-Organizing Maps
    - Principal Components, Curves and Surfaces
        - Principal Components
        - Principal Curves and Surfaces
        - Spectral Clustering
        - Kernel Principal Components
        - Sparse Principal Components
    - Non-negative Matrix Factorization
        - Archetypal Analysis
    - Independent Component Analysis and Exploratory Projection Pursuit
        - Latent Variables and Factor Analysis
        - Independent Component Analysis
        - Exploratory Projection Pursuit
        - A Direct Approach to ICA
    - Multidimensional Scaling
    - Nonlinear Dimension Reduction and Local Multidimensional Scaling
    - The Google PageRank Algorithm
    - Bibliographic Notes
    - Exercises

15. Random Forests
    - Introduction
    - Definition of Random Forests
    - Details of Random Forests
        - Out-of-Bag Samples
        - Variable Importance
        - Proximity Plots
        - Random Forests and Overfitting
    - Analysis of Random Forests
        - Variance and the De-Correlation Effect
        - Bias
        - Adaptive Nearest Neighbors
    - Bibliographic Notes
    - Exercises

16. Ensemble Learning
    - Introduction
    - Boosting and Regularization Paths
        - Penalized Regression
        - The "Bet on Sparsity" Principle
        - Regularization Paths, Over-fitting and Margins
    - Learning Ensembles
        - Learning a Good Ensemble
        - Rule Ensembles
    - Bibliographic Notes
    - Exercises

17. Undirected Graphical Models
    - Introduction
    - Markov Graphs and Their Properties
    - Undirected Graphical Models for Continuous Variables
        - Estimation of the Parameters when the Graph Structure is Known
        - Estimation of the Graph Structure
    - Undirected Graphical Models for Discrete Variables
        - Estimation of the Parameters when the Graph Structure is Known
        - Hidden Nodes
        - Estimation of the Graph Structure
    - Restricted Boltzmann Machines
    - Exercises

18. High-Dimensional Problems: p ≫ N
    - When p is Much Bigger than N

    - Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids
    - Linear Classifiers with Quadratic Regularization
        - Regularized Discriminant Analysis
        - Logistic Regression with Quadratic Regularization
        - The Support Vector Classifier
        - Feature Selection
        - Computational Shortcuts When p ≫ N
    - Linear Classifiers with L1 Regularization
        - Application of Lasso to Protein Mass Spectroscopy
        - The Fused Lasso for Functional Data
    - Classification When Features are Unavailable
        - Example: String Kernels and Protein Classification
        - Classification and Other Models Using Inner-Product Kernels and Pairwise Distances
        - Example: Abstracts Classification
    - High-Dimensional Regression: Supervised Principal Components
        - Connection to Latent-Variable Modeling
        - Relationship with Partial Least Squares
        - Pre-Conditioning for Feature Selection
    - Feature Assessment and the Multiple-Testing Problem
        - The False Discovery Rate
        - Asymmetric Cutpoints and the SAM Procedure
        - A Bayesian Interpretation of the FDR
    - Bibliographic Notes
    - Exercises

References

Author Index

Index

## 1. Introduction

Statistical learning plays a key role in many areas of science, finance and industry. Here are some examples of learning problems:

- Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.
- Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
- Identify the numbers in a handwritten ZIP code, from a digitized image.
- Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.
- Identify the risk factors for prostate cancer, based on clinical and demographic variables.

The science of learning plays a key role in the fields of statistics, data mining and artificial intelligence, intersecting with areas of engineering and other disciplines.

This book is about learning from data. In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature

TABLE 1.1. Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email. [Columns: george, you, your, hp, free, hpl, !, our, re, edu, remove; rows: spam and email. Numerical entries omitted.]

measurements for a set of objects (such as people). Using this data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects. A good learner is one that accurately predicts such an outcome.

The examples above describe what is called the supervised learning problem. It is called "supervised" because of the presence of the outcome variable to guide the learning process. In the unsupervised learning problem, we observe only the features and have no measurements of the outcome. Our task is rather to describe how the data are organized or clustered. We devote most of this book to supervised learning; the unsupervised problem is less developed in the literature, and is the focus of Chapter 14.

Here are some examples of real learning problems that are discussed in this book.

### Example 1: Email Spam

The data for this example consists of information from 4601 email messages, in a study to try to predict whether the email was junk email, or "spam." The objective was to design an automatic spam detector that could filter out spam before clogging the users' mailboxes. For all 4601 email messages, the true outcome (email type) email or spam is available, along with the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message. This is a supervised learning problem, with the outcome the class variable email/spam. It is also called a classification problem. Table 1.1 lists the words and characters showing the largest average difference between spam and email.

Our learning method has to decide which features to use and how: for example, we might use a rule such as

if (%george < 0.6) & (%you > 1.5) then spam else email.

Another form of a rule might be:

if (0.2 · %you - 0.3 · %george) > 0 then spam else email.
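Rules of this kind are simple threshold classifiers. As a minimal sketch (in Python, rather than the R/S-PLUS used for the book's courses; the message frequencies below are invented for illustration), the first rule can be written as:

```python
def classify_email(pct_george: float, pct_you: float) -> str:
    """Toy spam rule from the text:
    if (%george < 0.6) & (%you > 1.5) then spam else email."""
    if pct_george < 0.6 and pct_you > 1.5:
        return "spam"
    return "email"

# Hypothetical messages, summarized by the percentage of words
# equal to "george" and to "you":
print(classify_email(pct_george=0.0, pct_you=2.3))  # -> spam
print(classify_email(pct_george=1.3, pct_you=1.0))  # -> email
```

In practice the thresholds would of course be learned from the 4601 training messages rather than set by hand, which is precisely the job of the learning methods discussed in the book.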

FIGURE 1.1. Scatterplot matrix of the prostate cancer data. The first row shows the response against each of the predictors in turn. Two of the predictors, svi and gleason, are categorical. [Variables shown: lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.]

For this problem not all errors are equal; we want to avoid filtering out good email, while letting spam get through is not desirable but less serious in its consequences. We discuss a number of different methods for tackling this learning problem in the book.

### Example 2: Prostate Cancer

The data for this example, displayed in Figure 1.1, come from a study by Stamey et al. (1989) that examined the correlation between the level of

(There was an error in these data in the first edition of this book. Subject 32 had a value of 6.1 for lweight, which translates to a 449 gm prostate! The correct value is 44.9 gm. We are grateful to Prof. Stephen W. Link for alerting us to this error.)
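Since lweight is the natural log of the prostate weight in grams, the footnote's arithmetic is easy to verify; this small check is an illustration added here, not part of the book:

```python
import math

# The first edition recorded lweight near 6.1, i.e. a prostate of about 449 g:
print(round(math.log(449), 2))   # -> 6.11 (the implausible recorded value)

# The corrected weight of 44.9 g gives the plausible log-weight:
print(round(math.log(44.9), 2))  # -> 3.8
```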

FIGURE 1.2. Examples of handwritten digits from U.S. postal envelopes.

prostate specific antigen (PSA) and a number of clinical measures, in 97 men who were about to receive a radical prostatectomy. The goal is to predict the log of PSA (lpsa) from a number of measurements including log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). Figure 1.1 is a scatterplot matrix of the variables. Some correlations with lpsa are evident, but a good predictive model is difficult to construct by eye. This is a supervised learning problem, known as a regression problem, because the outcome measurement is quantitative.

### Example 3: Handwritten Digit Recognition

The data from this example come from the handwritten ZIP codes on envelopes from U.S. postal mail. Each image is a segment from a five digit ZIP code, isolating a single digit. The images are 16×16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255. Some sample images are shown in Figure 1.2. The images have been normalized to have approximately the same size and orientation. The task is to predict, from the 16×16 matrix of pixel intensities, the identity of each image (0, 1, ..., 9) quickly and accurately. If it is accurate enough, the resulting algorithm would be used as part of an automatic sorting procedure for envelopes. This is a classification problem for which the error rate needs to be kept very low to avoid misdirection of

mail. In order to achieve this low error rate, some objects can be assigned to a "don't know" category, and sorted instead by hand.

### Example 4: DNA Expression Microarrays

DNA stands for deoxyribonucleic acid, and is the basic material that makes up human chromosomes. DNA microarrays measure the expression of a gene in a cell by measuring the amount of mRNA (messenger ribonucleic acid) present for that gene. Microarrays are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells.

Here is how a DNA microarray works. The nucleotide sequences for a few thousand genes are printed on a glass slide. A target sample and a reference sample are labeled with red and green dyes, and each are hybridized with the DNA on the slide. Through fluoroscopy, the log (red/green) intensities of RNA hybridizing at each site is measured. The result is a few thousand numbers, typically ranging from say −6 to 6, measuring the expression level of each gene in the target relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values.

A gene expression dataset collects together the expression values from a series of DNA microarray experiments, with each column representing an experiment. There are therefore several thousand rows representing individual genes, and tens of columns representing samples: in the particular example of Figure 1.3 there are 6830 genes (rows) and 64 samples (columns), although for clarity only a random sample of 100 rows are shown. The figure displays the data set as a heat map, ranging from green (negative) to red (positive). The samples are 64 cancer tumors from different patients.

The challenge here is to understand how the genes and samples are organized. Typical questions include the following:

(a) which samples are most similar to each other, in terms of their expression profiles across genes?
(b) which genes are most similar to each other, in terms of their expression profiles across samples?

(c) do certain genes show very high (or low) expression for certain cancer samples?

We could view this task as a regression problem, with two categorical predictor variables (genes and samples), with the response variable being the level of expression. However, it is probably more useful to view it as an unsupervised learning problem. For example, for question (a) above, we think of the samples as points in 6830-dimensional space, which we want to cluster together in some way.
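For question (a), a clustering of the samples starts from distances between their expression profiles, i.e. the columns of the matrix. Here is a minimal sketch (an illustration added here, using an invented 3-gene, 4-sample matrix instead of the real 6830 × 64 data) that finds each sample's nearest neighbor by Euclidean distance:

```python
import math

# Hypothetical expression matrix: rows = genes, columns = samples,
# entries are log(red/green) expression values.
expr = [
    [ 2.1,  1.9, -1.0, -1.2],   # gene 1
    [ 0.5,  0.4,  3.0,  2.8],   # gene 2
    [-1.5, -1.3,  0.2,  0.1],   # gene 3
]
n_samples = len(expr[0])

def profile(j):
    """Expression profile of sample j across all genes (column j)."""
    return [row[j] for row in expr]

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# For each sample, report the most similar other sample.
for j in range(n_samples):
    others = [(dist(profile(j), profile(k)), k) for k in range(n_samples) if k != j]
    _, nearest = min(others)
    print(f"sample {j} is closest to sample {nearest}")
```

On this toy matrix, samples 0 and 1 pair up, as do samples 2 and 3; a clustering method such as the K-means algorithm of Chapter 14 builds on exactly this kind of distance computation.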

FIGURE 1.3. DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data. Only a random sample of 100 rows are shown. The display is a heat map, ranging from bright green (negative, under expressed) to bright red (positive, over expressed). Missing values are gray. The rows and columns are displayed in a randomly chosen order. [Row labels are gene identifiers; column labels are tumor types such as BREAST, RENAL, MELANOMA, COLON, CNS, NSCLC, LEUKEMIA, OVARIAN, PROSTATE.]

### Who Should Read this Book

This book is designed for researchers and students in a broad variety of fields: statistics, artificial intelligence, engineering, finance and others. We expect that the reader will have had at least one elementary course in statistics, covering basic topics including linear regression.

We have not attempted to write a comprehensive catalog of learning methods, but rather to describe some of the most important techniques. Equally notable, we describe the underlying concepts and considerations by which a researcher can judge a learning method. We have tried to write this book in an intuitive fashion, emphasizing concepts rather than mathematical details.

As statisticians, our exposition will naturally reflect our backgrounds and areas of expertise. However in the past eight years we have been attending conferences in neural networks, data mining and machine learning, and our thinking has been heavily influenced by these exciting fields. This influence is evident in our current research, and in this book.

### How This Book is Organized

Our view is that one must understand simple methods before trying to grasp more complex ones. Hence, after giving an overview of the supervised learning problem in Chapter 2, we discuss linear methods for regression and classification in Chapters 3 and 4. In Chapter 5 we describe splines, wavelets and regularization/penalization methods for a single predictor, while Chapter 6 covers kernel methods and local regression. Both of these sets of methods are important building blocks for high-dimensional learning techniques. Model assessment and selection is the topic of Chapter 7, covering the concepts of bias and variance, overfitting and methods such as cross-validation for choosing models. Chapter 8 discusses model inference and averaging, including an overview of maximum likelihood, Bayesian inference and the bootstrap, the EM algorithm, Gibbs sampling and bagging. A related procedure called boosting is the focus of Chapter 10.
In Chapters 9-13 we describe a series of structured methods for supervised learning, with Chapters 9 and 11 covering regression and Chapters 12 and 13 focusing on classification. Chapter 14 describes methods for unsupervised learning. Two recently proposed techniques, random forests and ensemble learning, are discussed in Chapters 15 and 16. We describe undirected graphical models in Chapter 17 and finally we study high-dimensional problems in Chapter 18.

At the end of each chapter we discuss computational considerations important for data mining applications, including how the computations scale with the number of observations and predictors. Each chapter ends with Bibliographic Notes giving background references for the material.

We recommend that Chapters 1-4 be first read in sequence. Chapter 7 should also be considered mandatory, as it covers central concepts that pertain to all learning methods. With this in mind, the rest of the book can be read sequentially, or sampled, depending on the reader's interest. A special symbol indicates a technically difficult section, one that can be skipped without interrupting the flow of the discussion.

### Book Website

The website for this book contains a number of resources, including many of the datasets used in this book.

### Note for Instructors

We have successfully used the first edition of this book as the basis for a two-quarter course, and with the additional materials in this second edition, it could even be used for a three-quarter sequence. Exercises are provided at the end of each chapter. It is important for students to have access to good software tools for these topics. We used the R and S-PLUS programming languages in our courses.

## 2. Overview of Supervised Learning

### 2.1 Introduction

The first three examples described in Chapter 1 have several components in common. For each there is a set of variables that might be denoted as inputs, which are measured or preset. These have some influence on one or more outputs. For each example the goal is to use the inputs to predict the values of the outputs. This exercise is called supervised learning.

We have used the more modern language of machine learning. In the statistical literature the inputs are often called the predictors, a term we will use interchangeably with inputs, and more classically the independent variables. In the pattern recognition literature the term features is preferred, which we use as well. The outputs are called the responses, or classically the dependent variables.

### 2.2 Variable Types and Terminology

The outputs vary in nature among the examples. In the glucose prediction example, the output is a quantitative measurement, where some measurements are bigger than others, and measurements close in value are close in nature. In the famous Iris discrimination example due to R. A. Fisher, the output is qualitative (species of Iris) and assumes values in a finite set G = {Virginica, Setosa and Versicolor}. In the handwritten digit example the output is one of 10 different digit classes: G = {0, 1, ..., 9}. In both of

these there is no explicit ordering in the classes, and in fact often descriptive labels rather than numbers are used to denote the classes. Qualitative variables are also referred to as categorical or discrete variables, as well as factors.

For both types of outputs it makes sense to think of using the inputs to predict the output. Given some specific atmospheric measurements today and yesterday, we want to predict the ozone level tomorrow. Given the grayscale values for the pixels of the digitized image of the handwritten digit, we want to predict its class label.

This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs. We will see that these two tasks have a lot in common, and in particular both can be viewed as a task in function approximation.

Inputs also vary in measurement type; we can have some of each of qualitative and quantitative input variables. These have also led to distinctions in the types of methods that are used for prediction: some methods are defined most naturally for quantitative inputs, some most naturally for qualitative and some for both.

A third variable type is ordered categorical, such as small, medium and large, where there is an ordering between the values, but no metric notion is appropriate (the difference between medium and small need not be the same as that between large and medium). These are discussed further in Chapter 4.

Qualitative variables are typically represented numerically by codes. The easiest case is when there are only two classes or categories, such as success or failure, survived or died. These are often represented by a single binary digit or bit as 0 or 1, or else by −1 and 1. For reasons that will become apparent, such numeric codes are sometimes referred to as targets. When there are more than two categories, several alternatives are available. The most useful and commonly used coding is via dummy variables.
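This coding can be sketched in a few lines; the following is an illustrative sketch in Python (the book itself uses R and S-PLUS), and the helper name `dummy_code` is invented here:

```python
import numpy as np

def dummy_code(levels, values):
    """Encode a K-level qualitative variable as K binary indicator columns,
    exactly one of which is 'on' per observation."""
    levels = list(levels)
    out = np.zeros((len(values), len(levels)), dtype=int)
    for i, v in enumerate(values):
        out[i, levels.index(v)] = 1
    return out

# the Iris species example: K = 3 levels
codes = dummy_code(["Setosa", "Versicolor", "Virginica"],
                   ["Virginica", "Setosa", "Virginica"])
```

Each row has exactly one nonzero entry, so the coding is symmetric in the levels of the factor.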
Here a K-level qualitative variable is represented by a vector of K binary variables or bits, only one of which is "on" at a time. Although more compact coding schemes are possible, dummy variables are symmetric in the levels of the factor.

We will typically denote an input variable by the symbol X. If X is a vector, its components can be accessed by subscripts X_j. Quantitative outputs will be denoted by Y, and qualitative outputs by G (for group). We use uppercase letters such as X, Y or G when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the ith observed value of X is written as x_i (where x_i is again a scalar or vector). Matrices are represented by bold uppercase letters; for example, a set of N input p-vectors x_i, i = 1, ..., N, would be represented by the N × p matrix X. In general, vectors will not be bold, except when they have N components; this convention distinguishes a p-vector of inputs x_i for the

ith observation from the N-vector x_j consisting of all the observations on variable X_j. Since all vectors are assumed to be column vectors, the ith row of X is x_i^T, the vector transpose of x_i.

For the moment we can loosely state the learning task as follows: given the value of an input vector X, make a good prediction of the output Y, denoted by Ŷ (pronounced "y-hat"). If Y takes values in ℝ then so should Ŷ; likewise for categorical outputs, Ĝ should take values in the same set G associated with G. For a two-class G, one approach is to denote the binary coded target as Y, and then treat it as a quantitative output. The predictions Ŷ will typically lie in [0, 1], and we can assign to Ĝ the class label according to whether ŷ > 0.5. This approach generalizes to K-level qualitative outputs as well.

We need data to construct prediction rules, often a lot of it. We thus suppose we have available a set of measurements (x_i, y_i) or (x_i, g_i), i = 1, ..., N, known as the training data, with which to construct our prediction rule.

2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

In this section we develop two simple but powerful prediction methods: the linear model fit by least squares, and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of k-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

2.3.1 Linear Models and Least Squares

The linear model has been a mainstay of statistics for the past 30 years and remains one of our most important tools. Given a vector of inputs X^T = (X_1, X_2, ..., X_p), we predict the output Y via the model

    Ŷ = β̂_0 + Σ_{j=1}^p X_j β̂_j.    (2.1)

The term β̂_0 is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in X, include β̂_0 in the vector of coefficients β̂, and then write the linear model in vector form as an inner product

    Ŷ = X^T β̂,    (2.2)

where X^T denotes vector or matrix transpose (X being a column vector). Here we are modeling a single output, so Ŷ is a scalar; in general Ŷ can be a K-vector, in which case β would be a p × K matrix of coefficients. In the (p + 1)-dimensional input–output space, (X, Ŷ) represents a hyperplane. If the constant is included in X, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the Y-axis at the point (0, β̂_0). From now on we assume that the intercept is included in β̂.

Viewed as a function over the p-dimensional input space, f(X) = X^T β is linear, and the gradient f′(X) = β is a vector in input space that points in the steepest uphill direction.

How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients β to minimize the residual sum of squares

    RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)².    (2.3)

RSS(β) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation. We can write

    RSS(β) = (y − Xβ)^T (y − Xβ),    (2.4)

where X is an N × p matrix with each row an input vector, and y is an N-vector of the outputs in the training set. Differentiating w.r.t. β we get the normal equations

    X^T (y − Xβ) = 0.    (2.5)

If X^T X is nonsingular, then the unique solution is given by

    β̂ = (X^T X)^{−1} X^T y,    (2.6)

and the fitted value at the ith input x_i is ŷ_i = ŷ(x_i) = x_i^T β̂. At an arbitrary input x_0 the prediction is ŷ(x_0) = x_0^T β̂. The entire fitted surface is characterized by the p parameters β̂. Intuitively, it seems that we do not need a very large data set to fit such a model.

Let's look at an example of the linear model in a classification context. Figure 2.1 shows a scatterplot of training data on a pair of inputs X_1 and X_2. The data are simulated, and for the present the simulation model is not important.
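The least squares computation just derived can be sketched numerically; this is a minimal sketch with invented, noise-free data, so solving the normal equations recovers the generating coefficients exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
# include the constant variable 1 as the first column of X
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true                       # no noise, so the fit is exact

# solve the normal equations X^T X beta = X^T y (assumes X^T X nonsingular)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With noisy data the recovered `beta_hat` would only approximate `beta_true`, but the computation is identical.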
The output class variable G has the values BLUE or ORANGE, and is represented as such in the scatterplot. There are 100 points in each of the two classes. The linear regression model was fit to these data, with the response Y coded as 0 for BLUE and 1 for ORANGE. The fitted values Ŷ are converted to a fitted class variable Ĝ according to the rule

    Ĝ = ORANGE if Ŷ > 0.5,  BLUE if Ŷ ≤ 0.5.    (2.7)
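This fit-then-threshold procedure can be sketched end to end; the following is an illustrative simulation with invented, well-separated class means, not the book's actual data:

```python
import numpy as np

rng = np.random.default_rng(1)
# two invented classes; Y coded 0 (BLUE) and 1 (ORANGE)
X0 = rng.normal(loc=(-2.0, -2.0), size=(100, 2))
X1 = rng.normal(loc=(2.0, 2.0), size=(100, 2))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(100), np.ones(100)]

Xa = np.column_stack([np.ones(200), X])        # prepend the constant 1
beta = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)    # least squares fit
Ghat = (Xa @ beta > 0.5).astype(int)           # threshold fitted values at 0.5
train_err = float(np.mean(Ghat != y))
```

Because these invented classes are far apart, the linear decision boundary {x : x^T β̂ = 0.5} separates them almost perfectly; on overlapping data some training misclassifications are unavoidable.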

FIGURE 2.1. A classification example in two dimensions (linear regression of a 0/1 response). The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by x^T β̂ = 0.5. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.

The set of points in ℝ² classified as ORANGE corresponds to {x : x^T β̂ > 0.5}, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary {x : x^T β̂ = 0.5}, which is linear in this case. We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from. Consider the two possible scenarios:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of

the component Gaussians to use, and then generates an observation from the chosen density. In the case of one Gaussian per class, we will see in Chapter 4 that a linear decision boundary is the best one can do, and that our estimate is almost optimal. The region of overlap is inevitable, and future data to be predicted will be plagued by this overlap as well.

In the case of mixtures of tightly clustered Gaussians the story is different. A linear decision boundary is unlikely to be optimal, and in fact is not. The optimal decision boundary is nonlinear and disjoint, and as such will be much more difficult to obtain.

We now look at another classification and regression procedure that is in some sense at the opposite end of the spectrum to the linear model, and far better suited to the second scenario.

2.3.2 Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set T closest in input space to x to form Ŷ. Specifically, the k-nearest-neighbor fit for Ŷ is defined as follows:

    Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i,    (2.8)

where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the k observations with x_i closest to x in input space, and average their responses.

In Figure 2.2 we use the same training data as in Figure 2.1, and use 5-nearest-neighbor averaging of the binary coded response as the method of fitting. Thus Ŷ is the proportion of ORANGEs in the neighborhood, and so assigning class ORANGE to Ĝ if Ŷ > 0.5 amounts to a majority vote in the neighborhood. The colored regions indicate all those points in input space classified as BLUE or ORANGE by such a rule, in this case found by evaluating the procedure on a fine grid in input space. We see that the decision boundaries that separate the BLUE from the ORANGE regions are far more irregular, and respond to local clusters where one class dominates. Figure 2.3
shows the results for 1-nearest-neighbor classification: Ŷ is assigned the value y_ℓ of the closest point x_ℓ to x in the training data. In this case the regions of classification can be computed relatively easily, and correspond to a Voronoi tessellation of the training data. Each point x_i has an associated tile bounding the region for which it is the closest input point. For all points x in the tile, Ĝ(x) = g_i. The decision boundary is even more irregular than before.

The method of k-nearest-neighbor averaging is defined in exactly the same way for regression of a quantitative output Y, although k = 1 would be an unlikely choice.
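The k-nearest-neighbor fit (2.8) takes only a few lines; the tiny training set and the helper name `knn_predict` below are invented for illustration:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k):
    """k-nearest-neighbor average of the responses, Euclidean distance."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]          # indices of the k closest points
    return y_train[nearest].mean()

# tiny illustrative training set: three points of class 0, one of class 1
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([0.0, 0.0, 0.0, 1.0])
```

For a query near the origin, the 3-nearest-neighbor average is 0.0 (all three neighbors are class 0), while the 4-nearest-neighbor average is 0.25; thresholding at 0.5 gives the majority-vote classification.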

FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1 (5-nearest-neighbor classifier). The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and then fit by 5-nearest-neighbor averaging as in (2.8). The predicted class is hence chosen by majority vote amongst the 5 nearest neighbors.

In Figure 2.2 we see that far fewer training observations are misclassified than in Figure 2.1. This should not give us too much comfort, though, since in Figure 2.3 none of the training data are misclassified. A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1. An independent test set would give us a more satisfactory means for comparing the different methods.

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking k, since we would always pick k = 1! It would seem that k-nearest-neighbor methods would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of k-nearest neighbors would be unnecessarily noisy.

FIGURE 2.3. The same classification example in two dimensions as in Figure 2.1 (1-nearest-neighbor classifier). The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then predicted by 1-nearest-neighbor classification.

2.3.3 From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has low variance and potentially high bias.

On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable: high variance and low bias.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle! The data in fact were simulated from a model somewhere between the two, but closer to Scenario 2. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and

FIGURE 2.4. Misclassification curves for the simulation example used in Figures 2.1, 2.2 and 2.3, plotted against the number of nearest neighbors k (equivalently, the degrees of freedom N/k). A single training sample of size 200 was used, and a test sample of size 10,000. The orange curves are test and the blue are training error for k-nearest-neighbor classification. The results for linear regression are the bigger orange and blue squares at three degrees of freedom. The purple line is the optimal Bayes error rate.

then generated a N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class. Figure 2.4 shows the results of classifying 10,000 new observations generated from the model. We compare the results for least squares and those for k-nearest neighbors for a range of values of k.

A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. The following list describes some ways in which these simple procedures have been enhanced:

Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.

In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.

Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.

Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.

Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

2.4 Statistical Decision Theory

In this section we develop a small amount of theory that provides a framework for developing models such as those discussed informally so far. We first consider the case of a quantitative output, and place ourselves in the world of random variables and probability spaces. Let X ∈ ℝ^p denote a real valued random input vector, and Y ∈ ℝ a real valued random output variable, with joint distribution Pr(X, Y). We seek a function f(X) for predicting Y given values of the input X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y − f(X))². This leads us to a criterion for choosing f,

    EPE(f) = E(Y − f(X))²    (2.9)
           = ∫ [y − f(x)]² Pr(dx, dy),    (2.10)

the expected (squared) prediction error. By conditioning on X, we can write EPE as

    EPE(f) = E_X E_{Y|X}([Y − f(X)]² | X),    (2.11)

and we see that it suffices to minimize EPE pointwise:

    f(x) = argmin_c E_{Y|X}([Y − c]² | X = x).    (2.12)

The solution is

    f(x) = E(Y | X = x),    (2.13)

the conditional expectation, also known as the regression function. Thus the best prediction of Y at any point X = x is the conditional mean, when best is measured by average squared error.

The nearest-neighbor methods attempt to directly implement this recipe using the training data. At each point x, we might ask for the average of all

Conditioning here amounts to factoring the joint density Pr(X, Y) = Pr(Y|X)Pr(X), where Pr(Y|X) = Pr(Y, X)/Pr(X), and splitting up the bivariate integral accordingly.

those y_i's with input x_i = x. Since there is typically at most one observation at any point x, we settle for

    f̂(x) = Ave(y_i | x_i ∈ N_k(x)),    (2.14)

where "Ave" denotes average, and N_k(x) is the neighborhood containing the k points in T closest to x. Two approximations are happening here: expectation is approximated by averaging over sample data; and conditioning at a point is relaxed to conditioning on some region "close" to the target point.

For large training sample size N, the points in the neighborhood are likely to be close to x, and as k gets large the average will get more stable. In fact, under mild regularity conditions on the joint probability distribution Pr(X, Y), one can show that as N, k → ∞ such that k/N → 0, f̂(x) → E(Y | X = x). In light of this, why look further, since it seems we have a universal approximator? We often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than k-nearest neighbors, although such knowledge has to be learned from the data as well. There are other problems though, sometimes disastrous. In Section 2.5 we see that as the dimension p gets large, so does the metric size of the k-nearest neighborhood. So settling for nearest neighborhood as a surrogate for conditioning will fail us miserably. The convergence above still holds, but the rate of convergence decreases as the dimension increases.

How does linear regression fit into this framework? The simplest explanation is that one assumes that the regression function f(x) is approximately linear in its arguments:

    f(x) ≈ x^T β.    (2.15)

This is a model-based approach: we specify a model for the regression function. Plugging this linear model for f(x) into EPE (2.9) and differentiating, we can solve for β theoretically:

    β = [E(XX^T)]^{−1} E(XY).    (2.16)

Note we have not conditioned on X; rather we have used our knowledge of the functional relationship to pool over values of X. The least squares solution (2.6) amounts to replacing the expectation in (2.16) by averages over the training data.
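That last observation can be checked directly: solving the moment equation β = [E(XXᵀ)]⁻¹E(XY) with expectations replaced by training averages gives exactly the least squares solution, since the 1/N factors cancel. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 2000, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=N)      # linear truth plus noise

# moment form: sample averages standing in for E(XX^T) and E(XY)
beta_moment = np.linalg.solve((X.T @ X) / N, (X.T @ y) / N)

# ordinary least squares, normal equations
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
```

The two solutions agree to machine precision, and both approach the generating coefficients as N grows.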
So both k-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

Least squares assumes f(x) is well approximated by a globally linear function.

k-nearest neighbors assumes f(x) is well approximated by a locally constant function.

Although the latter seems more palatable, we have already seen that we may pay a price for this flexibility.

Many of the more modern techniques described in this book are model based, although far more flexible than the rigid linear model. For example, additive models assume that

    f(X) = Σ_{j=1}^p f_j(X_j).    (2.17)

This retains the additivity of the linear model, but each coordinate function f_j is arbitrary. It turns out that the optimal estimate for the additive model uses techniques such as k-nearest neighbors to approximate univariate conditional expectations simultaneously for each of the coordinate functions. Thus the problems of estimating a conditional expectation in high dimensions are swept away in this case by imposing some (often unrealistic) model assumptions, in this case additivity.

Are we happy with the criterion (2.11)? What happens if we replace the L₂ loss function with the L₁: E|Y − f(X)|? The solution in this case is the conditional median,

    f̂(x) = median(Y | X = x),    (2.18)

which is a different measure of location, and its estimates are more robust than those for the conditional mean. L₁ criteria have discontinuities in their derivatives, which have hindered their widespread use. Other more resistant loss functions will be mentioned in later chapters, but squared error is analytically convenient and the most popular.

What do we do when the output is a categorical variable G? The same paradigm works here, except we need a different loss function for penalizing prediction errors. An estimate Ĝ will assume values in G, the set of possible classes. Our loss function can be represented by a K × K matrix L, where K = card(G). L will be zero on the diagonal and nonnegative elsewhere, where L(k, ℓ) is the price paid for classifying an observation belonging to class G_k as G_ℓ. Most often we use the zero–one loss function, where all misclassifications are charged a single unit.
The expected prediction error is

    EPE = E[L(G, Ĝ(X))],    (2.19)

where again the expectation is taken with respect to the joint distribution Pr(G, X). Again we condition, and can write EPE as

    EPE = E_X Σ_{k=1}^K L[G_k, Ĝ(X)] Pr(G_k | X),    (2.20)

FIGURE 2.5. The optimal Bayes decision boundary for the simulation example of Figures 2.1, 2.2 and 2.3. Since the generating density is known for each class, this boundary can be calculated exactly (Exercise 2.2).

and again it suffices to minimize EPE pointwise:

    Ĝ(x) = argmin_{g ∈ G} Σ_{k=1}^K L(G_k, g) Pr(G_k | X = x).    (2.21)

With the 0–1 loss function this simplifies to

    Ĝ(x) = argmin_{g ∈ G} [1 − Pr(g | X = x)],    (2.22)

or simply

    Ĝ(x) = G_k if Pr(G_k | X = x) = max_{g ∈ G} Pr(g | X = x).    (2.23)

This reasonable solution is known as the Bayes classifier, and says that we classify to the most probable class, using the conditional (discrete) distribution Pr(G | X). Figure 2.5 shows the Bayes-optimal decision boundary for our simulation example. The error rate of the Bayes classifier is called the Bayes rate.
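When the generating densities are known, the Bayes classifier can be computed directly. Below is an illustrative sketch: the cluster means are fixed, invented values standing in for the book's randomly drawn m_k, the classes have equal priors, and the helper names are hypothetical:

```python
import numpy as np

# invented, fixed cluster means (the book draws these at random)
means_blue = np.array([[1.0, 0.0], [2.0, 0.0]])
means_orange = np.array([[0.0, 1.0], [0.0, 2.0]])

def mixture_density(x, means, var=1.0 / 5.0):
    """Equal-weight Gaussian mixture density with covariance var * I."""
    d2 = np.sum((np.asarray(means) - np.asarray(x)) ** 2, axis=1)
    return np.mean(np.exp(-0.5 * d2 / var)) / (2.0 * np.pi * var)

def bayes_classify(x):
    """Bayes rule: with equal priors, classify to the class whose known
    generating density is larger at x."""
    if mixture_density(x, means_orange) > mixture_density(x, means_blue):
        return "ORANGE"
    return "BLUE"
```

A point sitting on a blue cluster mean is classified BLUE, and symmetrically for orange; the set of points where the two densities are equal traces out the (generally nonlinear) Bayes decision boundary.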

Again we see that the k-nearest neighbor classifier directly approximates this solution: a majority vote in a nearest neighborhood amounts to exactly this, except that conditional probability at a point is relaxed to conditional probability within a neighborhood of a point, and probabilities are estimated by training-sample proportions.

Suppose for a two-class problem we had taken the dummy-variable approach and coded G via a binary Y, followed by squared error loss estimation. Then f̂(X) = E(Y | X) = Pr(G = G_1 | X) if G_1 corresponded to Y = 1. Likewise for a K-class problem, E(Y_k | X) = Pr(G = G_k | X). This shows that our dummy-variable regression procedure, followed by classification to the largest fitted value, is another way of representing the Bayes classifier. Although this theory is exact, in practice problems can occur, depending on the regression model used. For example, when linear regression is used, f̂(X) need not be positive, and we might be suspicious about using it as an estimate of a probability. We will discuss a variety of approaches to modeling Pr(G | X) in Chapter 4.

2.5 Local Methods in High Dimensions

We have examined two learning techniques for prediction so far: the stable but biased linear model and the less stable but apparently less biased class of k-nearest-neighbor estimates. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any x and average them. This approach and our intuition break down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961). There are many manifestations of this problem, and we will examine a few here.

Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube, as in Figure 2.6. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations.
Since this corresponds to a fraction r of the unit volume, the expected edge length will be e_p(r) = r^{1/p}. In ten dimensions e_10(0.01) = 0.63 and e_10(0.1) = 0.80, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer "local." Reducing r dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.

Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median

FIGURE 2.6. The curse of dimensionality is well illustrated by a subcubical neighborhood for uniform data in a unit cube. The figure on the right shows the side-length of the subcube needed to capture a fraction r of the volume of the data, for different dimensions p. In ten dimensions we need to cover 80% of the range of each coordinate to capture 10% of the data.

distance from the origin to the closest data point is given by the expression

    d(p, N) = (1 − (1/2)^{1/N})^{1/p}    (2.24)

(Exercise 2.3). A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.

Another manifestation of the curse is that the sampling density is proportional to N^{1/p}, where p is the dimension of the input space and N is the sample size. Thus, if N_1 = 100 represents a dense sample for a single input problem, then N_10 = 100^10 is the sample size required for the same sampling density with 10 inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.

Let us construct another uniform example. Suppose we have 1000 training examples x_i generated uniformly on [−1, 1]^p. Assume that the true relationship between X and Y is

    Y = f(X) = e^{−8 ‖X‖²},

without any measurement error. We use the 1-nearest-neighbor rule to predict y_0 at the test point x_0 = 0. Denote the training set by T. We can

43 4. verview f Supervised Learning cmpute the expected predictin errr at x 0 fr ur prcedure, averaging ver all such samples f size 000. Since the prblem is deterministic, this is the mean squared errr (MSE) fr estimating f(0): MSE(x 0 ) = E T [f(x 0 ) ŷ 0 ] = E T [ŷ 0 E T (ŷ 0 )] + [E T (ŷ 0 ) f(x 0 )] = Var T (ŷ 0 ) + Bias (ŷ 0 ). (.5) Figure.7 illustrates the setup. We have brken dwn the MSE int tw cmpnents that will becme familiar as we prceed: variance and squared bias. Such a decmpsitin is always pssible and ften useful, and is knwn as the bias variance decmpsitin. Unless the nearest neighbr is at 0, ŷ 0 will be smaller than f(0) in this example, and s the average estimate will be biased dwnward. The variance is due t the sampling variance f the -nearest neighbr. In lw dimensins and with N = 000, the nearest neighbr is very clse t 0, and s bth the bias and variance are small. As the dimensin increases, the nearest neighbr tends t stray further frm the target pint, and bth bias and variance are incurred. By p = 0, fr mre than 99% f the samples the nearest neighbr is a distance greater than 0.5 frm the rigin. Thus as p increases, the estimate tends t be 0 mre ften than nt, and hence the MSE levels ff at.0, as des the bias, and the variance starts drpping (an artifact f this example). Althugh this is a highly cntrived example, similar phenmena ccur mre generally. The cmplexity f functins f many variables can grw expnentially with the dimensin, and if we wish t be able t estimate such functins with the same accuracy as functin in lw dimensins, then we need the size f ur training set t grw expnentially as well. In this example, the functin is a cmplex interactin f all p variables invlved. The dependence f the bias term n distance depends n the truth, and it need nt always dminate with -nearest neighbr. Fr example, if the functin always invlves nly a few dimensins as in Figure.8, then the variance can dminate instead. 
Suppose, on the other hand, that we know that the relationship between Y and X is linear,

    Y = X^T β + ε,    (2.26)

where ε ~ N(0, σ²), and we fit the model by least squares to the training data. For an arbitrary test point x_0, we have ŷ_0 = x_0^T β̂, which can be written as ŷ_0 = x_0^T β + Σ_{i=1}^N ℓ_i(x_0) ε_i, where ℓ_i(x_0) is the ith element of X(X^T X)^{−1} x_0. Since under this model the least squares estimates are

FIGURE 2.7. A simulation example, demonstrating the curse of dimensionality and its effect on MSE, bias and variance. The input features are uniformly distributed in [−1, 1]^p for p = 1, ..., 10. The top left panel shows the target function (no noise) in ℝ: f(X) = e^{−8 ‖X‖²}, and demonstrates the error that 1-nearest neighbor makes in estimating f(0). The training point is indicated by the blue tick mark. The top right panel illustrates why the radius of the 1-nearest neighborhood increases with dimension p. The lower left panel shows the average radius of the 1-nearest neighborhoods. The lower right panel shows the MSE, squared bias and variance curves as a function of dimension p.

FIGURE 2.8. A simulation example with the same setup as in Figure 2.7. Here the function is constant in all but one dimension: f(X) = (1/2)(X_1 + 1)³. The variance dominates.

unbiased, we find that

    EPE(x_0) = E_{y_0|x_0} E_T (y_0 − ŷ_0)²
             = Var(y_0 | x_0) + E_T [ŷ_0 − E_T ŷ_0]² + [E_T ŷ_0 − x_0^T β]²
             = Var(y_0 | x_0) + Var_T(ŷ_0) + Bias²(ŷ_0)
             = σ² + E_T x_0^T (X^T X)^{−1} x_0 σ² + 0².    (2.27)

Here we have incurred an additional variance σ² in the prediction error, since our target is not deterministic. There is no bias, and the variance depends on x_0. If N is large and T were selected at random, and assuming E(X) = 0, then X^T X → N Cov(X) and

    E_{x_0} EPE(x_0) ≈ E_{x_0} x_0^T Cov(X)^{−1} x_0 σ²/N + σ²
                    = trace[Cov(X)^{−1} Cov(x_0)] σ²/N + σ²
                    = σ² (p/N) + σ².    (2.28)

Here we see that the expected EPE increases linearly as a function of p, with slope σ²/N. If N is large and/or σ² is small, this growth in variance is negligible (0 in the deterministic case). By imposing some heavy restrictions on the class of models being fitted, we have avoided the curse of dimensionality. Some of the technical details in (2.27) and (2.28) are derived in Exercise 2.5.

Figure 2.9 compares 1-nearest neighbor vs. least squares in two situations, both of which have the form Y = f(X) + ε, X uniform as before, and ε ~ N(0, 1). The sample size is N = 500. For the orange curve, f(x)

FIGURE 2.9. The curves show the expected prediction error (at x_0 = 0) for 1-nearest neighbor relative to least squares for the model Y = f(X) + ε. For the orange curve, f(x) = x_1, while for the blue curve f(x) = (1/2)(x_1 + 1)³.

is linear in the first coordinate; for the blue curve, cubic as in Figure 2.8. Shown is the relative EPE of 1-nearest neighbor to least squares, which appears to start at around 2 for the linear case. Least squares is unbiased in this case, and as discussed above the EPE is slightly above σ² = 1. The EPE for 1-nearest neighbor is always above 2, since the variance of f̂(x_0) in this case is at least σ², and the ratio increases with dimension as the nearest neighbor strays from the target point. For the cubic case, least squares is biased, which moderates the ratio. Clearly we could manufacture examples where the bias of least squares would dominate the variance, and the 1-nearest neighbor would come out the winner.

By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger. However, if the assumptions are wrong, all bets are off and the 1-nearest neighbor may dominate. We will see that there is a whole spectrum of models between the rigid linear models and the extremely flexible 1-nearest-neighbor models, each with their own assumptions and biases, which have been proposed specifically to avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions.

Before we delve more deeply, let us elaborate a bit on the concept of statistical models and see how they fit into the prediction framework.
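The formula σ²(p/N) + σ² for the expected prediction error of least squares derived above can be checked numerically. This is a sketch under invented simulation settings: standard Gaussian inputs so that Cov(X) = I, and the additional variance term E x_0ᵀ(XᵀX)⁻¹x_0 σ² estimated by averaging over fresh test points:

```python
import numpy as np

rng = np.random.default_rng(11)
N, p, sigma2 = 500, 10, 1.0
X = rng.normal(size=(N, p))                 # E(X) = 0, Cov(X) = I
M = np.linalg.inv(X.T @ X)                  # (X^T X)^{-1}

x0 = rng.normal(size=(5000, p))             # test points from the same distribution
# Monte Carlo estimate of the extra variance E_{x0}[x0^T (X^T X)^{-1} x0] * sigma^2
extra_var = sigma2 * np.einsum('ij,jk,ik->i', x0, M, x0).mean()

predicted = sigma2 * p / N                  # the theory: sigma^2 * (p/N) = 0.02
```

The simulated extra variance matches σ²(p/N) closely, confirming that the curse-of-dimensionality cost is only linear in p once the linear-model restriction is imposed.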

2.6 Statistical Models, Supervised Learning and Function Approximation

Our goal is to find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs. In the theoretical setting of Section 2.4, we saw that squared error loss led us to the regression function f(x) = E(Y|X = x) for a quantitative response. The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways:

- if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors;
- if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.

We anticipate using other classes of models for f(x), in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem.

2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)

Suppose in fact that our data arose from a statistical model

  Y = f(X) + ε,     (2.29)

where the random error ε has E(ε) = 0 and is independent of X. Note that for this model, f(x) = E(Y|X = x), and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).

The additive error model is a useful approximation to the truth. For most systems the input–output pairs (X, Y) will not have a deterministic relationship Y = f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error ε.

For some problems a deterministic relationship does hold. Many of the classification problems studied in machine learning are of this form, where the response surface can be thought of as a colored map defined in IR^p. The training data consist of colored examples from the map {x_i, g_i}, and the goal is to be able to color any point.
Here the function is deterministic, and the randomness enters through the x location of the training points. For the moment we will not pursue such problems, but will see that they can be handled by techniques appropriate for the error-based models.

The assumption in (2.29) that the errors are independent and identically distributed is not strictly necessary, but seems to be at the back of our mind

when we average squared errors uniformly in our EPE criterion. With such a model it becomes natural to use least squares as a data criterion for model estimation as in (2.1). Simple modifications can be made to avoid the independence assumption; for example, we can have Var(Y|X = x) = σ²(x), and now both the mean and variance depend on X. In general the conditional distribution Pr(Y|X) can depend on X in complicated ways, but the additive error model precludes these.

So far we have concentrated on the quantitative response. Additive error models are typically not used for qualitative outputs G; in this case the target function p(X) is the conditional density Pr(G|X), and this is modeled directly. For example, for two-class data, it is often reasonable to assume that the data arise from independent binary trials, with the probability of one particular outcome being p(X), and the other 1 − p(X). Thus if Y is the 0–1 coded version of G, then E(Y|X = x) = p(x), but the variance depends on x as well: Var(Y|X = x) = p(x)[1 − p(x)].

2.6.2 Supervised Learning

Before we launch into more statistically oriented jargon, we present the function-fitting paradigm from a machine learning point of view. Suppose for simplicity that the errors are additive and that the model Y = f(X) + ε is a reasonable assumption. Supervised learning attempts to learn f by example through a teacher. One observes the system under study, both the inputs and outputs, and assembles a training set of observations T = (x_i, y_i), i = 1, ..., N. The observed input values to the system x_i are also fed into an artificial system, known as a learning algorithm (usually a computer program), which also produces outputs f̂(x_i) in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship f̂ in response to differences y_i − f̂(x_i) between the original and generated outputs. This process is known as learning by example.
Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.

2.6.3 Function Approximation

The learning paradigm of the previous section has been the motivation for research into the supervised learning problem in the fields of machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain). The approach taken in applied mathematics and statistics has been from the perspective of function approximation and estimation. Here the data pairs {x_i, y_i} are viewed as points in a (p + 1)-dimensional Euclidean space. The function f(x) has domain equal to the p-dimensional input subspace, and is related to the data via a model

such as y_i = f(x_i) + ε_i. For convenience in this chapter we will assume the domain is IR^p, a p-dimensional Euclidean space, although in general the inputs can be of mixed type. The goal is to obtain a useful approximation to f(x) for all x in some region of IR^p, given the representations in T. Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Many of the approximations we will encounter have associated a set of parameters θ that can be modified to suit the data at hand. For example, the linear model f(x) = x^T β has θ = β. Another class of useful approximators can be expressed as linear basis expansions

  f_θ(x) = Σ_{k=1}^K h_k(x) θ_k,     (2.30)

where the h_k are a suitable set of functions or transformations of the input vector x. Traditional examples are polynomial and trigonometric expansions, where for example h_k might be x_1², x_1 x_2, cos(x_1) and so on. We also encounter nonlinear expansions, such as the sigmoid transformation common to neural network models,

  h_k(x) = 1 / (1 + exp(−x^T β_k)).     (2.31)

We can use least squares to estimate the parameters θ in f_θ as we did for the linear model, by minimizing the residual sum-of-squares

  RSS(θ) = Σ_{i=1}^N (y_i − f_θ(x_i))²     (2.32)

as a function of θ. This seems a reasonable criterion for an additive error model. In terms of function approximation, we imagine our parameterized function as a surface in p + 1 space, and what we observe are noisy realizations from it. This is easy to visualize when p = 2 and the vertical coordinate is the output y, as in Figure 2.10. The noise is in the output coordinate, so we find the set of parameters such that the fitted surface gets as close to the observed points as possible, where close is measured by the sum of squared vertical errors in RSS(θ). For the linear model we get a simple closed form solution to the minimization problem.
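As a minimal sketch (the basis (1, x, cos x), the generating coefficients and the data are illustrative choices, not an example from the text): when the h_k are fixed, minimizing RSS(θ) in (2.32) for the expansion (2.30) is itself an ordinary linear least squares problem in θ.

```python
import numpy as np

# Fit f_theta(x) = sum_k h_k(x) theta_k by least squares, with fixed basis
# functions h = (1, x, cos x). Since f_theta is linear in theta, the RSS
# minimizer has a closed form.
rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=200)
y = 2.0 + 0.5 * x + 1.5 * np.cos(x) + 0.1 * rng.normal(size=200)

H = np.column_stack([np.ones_like(x), x, np.cos(x)])  # N x K basis matrix
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)     # closed-form minimizer
```

Up to noise, theta_hat recovers the coefficients (2.0, 0.5, 1.5) used to generate the data.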
This is also true for the basis function methods, if the basis functions themselves do not have any hidden parameters. Otherwise the solution requires either iterative methods or numerical optimization.

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general

FIGURE 2.10. Least squares fitting of a function of two inputs. The parameters of f_θ(x) are chosen so as to minimize the sum-of-squared vertical errors.

principle for estimation is maximum likelihood estimation. Suppose we have a random sample y_i, i = 1, ..., N from a density Pr_θ(y) indexed by some parameters θ. The log-probability of the observed sample is

  L(θ) = Σ_{i=1}^N log Pr_θ(y_i).     (2.33)

The principle of maximum likelihood assumes that the most reasonable values for θ are those for which the probability of the observed sample is largest. Least squares for the additive error model Y = f_θ(X) + ε, with ε ∼ N(0, σ²), is equivalent to maximum likelihood using the conditional likelihood

  Pr(Y|X, θ) = N(f_θ(X), σ²).     (2.34)

So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is

  L(θ) = −(N/2) log(2π) − N log σ − (1/(2σ²)) Σ_{i=1}^N (y_i − f_θ(x_i))²,     (2.35)

and the only term involving θ is the last, which is RSS(θ) up to a scalar negative multiplier.

A more interesting example is the multinomial likelihood for the regression function Pr(G|X) for a qualitative output G. Suppose we have a model Pr(G = G_k | X = x) = p_{k,θ}(x), k = 1, ..., K for the conditional probability of each class given X, indexed by the parameter vector θ. Then the

log-likelihood (also referred to as the cross-entropy) is

  L(θ) = Σ_{i=1}^N log p_{g_i,θ}(x_i),     (2.36)

and when maximized it delivers values of θ that best conform with the data in this likelihood sense.

2.7 Structured Regression Models

We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data. This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes.

2.7.1 Difficulty of the Problem

Consider the RSS criterion for an arbitrary function f,

  RSS(f) = Σ_{i=1}^N (y_i − f(x_i))².     (2.37)

Minimizing (2.37) leads to infinitely many solutions: any function f̂ passing through the training points (x_i, y_i) is a solution. Any particular solution chosen might be a poor predictor at test points different from the training points. If there are multiple observation pairs x_i, y_{il}, l = 1, ..., N_i at each value of x_i, the risk is limited. In this case, the solutions pass through the average values of the y_{il} at each x_i; see Exercise 2.6. The situation is similar to the one we have already visited in Section 2.4; indeed, (2.37) is the finite sample version of (2.11) on page 18. If the sample size N were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.

In order to obtain useful results for finite N, we must restrict the eligible solutions to (2.37) to a smaller set of functions. How to decide on the nature of the restrictions is based on considerations outside of the data. These restrictions are sometimes encoded via the parametric representation of f_θ, or may be built into the learning method itself, either implicitly or explicitly. These restricted classes of solutions are the major topic of this book. One thing should be clear, though.
Any restrictions imposed on f that lead to a unique solution to (2.37) do not really remove the ambiguity

caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

In general the constraints imposed by most learning methods can be described as complexity restrictions of one kind or another. This usually means some kind of regular behavior in small neighborhoods of the input space. That is, for all input points x sufficiently close to each other in some metric, f̂ exhibits some special structure such as nearly constant, linear or low-order polynomial behavior. The estimator is then obtained by averaging or polynomial fitting in that neighborhood.

The strength of the constraint is dictated by the neighborhood size. The larger the size of the neighborhood, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint. For example, local constant fits in infinitesimally small neighborhoods is no constraint at all; local linear fits in very large neighborhoods is almost a globally linear model, and is very restrictive.

The nature of the constraint depends on the metric used. Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The nearest-neighbor methods discussed so far are based on the assumption that locally the function is constant; close to a target input x_0, the function does not change much, and so close outputs can be averaged to produce f̂(x_0). Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In Section 5.4.1 we discuss the concept of an equivalent kernel (see Figure 5.8 on page 157), which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above: peaked at the target point and falling away smoothly from it.

One fact should be clear by now.
Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions: again, the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated (and often implicit or adaptive) metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

2.8 Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Here we give a brief summary, since detailed descriptions

weights to points x in a region around x_0 (see Figure 6.1 on page 192). For example, the Gaussian kernel has a weight function based on the Gaussian density function

  K_λ(x_0, x) = (1/λ) exp[−||x − x_0||² / (2λ)]     (2.40)

and assigns weights to points that die exponentially with their squared Euclidean distance from x_0. The parameter λ corresponds to the variance of the Gaussian density, and controls the width of the neighborhood. The simplest form of kernel estimate is the Nadaraya–Watson weighted average

  f̂(x_0) = Σ_{i=1}^N K_λ(x_0, x_i) y_i / Σ_{i=1}^N K_λ(x_0, x_i).     (2.41)

In general we can define a local regression estimate of f(x_0) as f_θ̂(x_0), where θ̂ minimizes

  RSS(f_θ, x_0) = Σ_{i=1}^N K_λ(x_0, x_i) (y_i − f_θ(x_i))²,     (2.42)

and f_θ is some parameterized function, such as a low-order polynomial. Some examples are:

- f_θ(x) = θ_0, the constant function; this results in the Nadaraya–Watson estimate in (2.41) above.
- f_θ(x) = θ_0 + θ_1 x gives the popular local linear regression model.

Nearest-neighbor methods can be thought of as kernel methods having a more data-dependent metric. Indeed, the metric for k-nearest neighbors is

  K_k(x, x_0) = I(||x − x_0|| ≤ ||x_(k) − x_0||),

where x_(k) is the training observation ranked kth in distance from x_0, and I(S) is the indicator of the set S. These methods of course need to be modified in high dimensions, to avoid the curse of dimensionality. Various adaptations are discussed in Chapter 13.

2.8.3 Basis Functions and Dictionary Methods

This class of methods includes the familiar linear and polynomial expansions, but more importantly a wide variety of more flexible models. The model for f is a linear expansion of basis functions

  f_θ(x) = Σ_{m=1}^M θ_m h_m(x),     (2.43)

where each of the h_m is a function of the input x, and the term linear here refers to the action of the parameters θ. This class covers a wide variety of methods. In some cases the sequence of basis functions is prescribed, such as a basis for polynomials in x of total degree M.

For one-dimensional x, polynomial splines of degree K can be represented by an appropriate sequence of M spline basis functions, determined in turn by M − K − 1 knots. These produce functions that are piecewise polynomials of degree K between the knots, and joined up with continuity of degree K − 1 at the knots. As an example consider linear splines, or piecewise linear functions. One intuitively satisfying basis consists of the functions b_1(x) = 1, b_2(x) = x, and b_{m+2}(x) = (x − t_m)_+, m = 1, ..., M − 2, where t_m is the mth knot, and z_+ denotes positive part. Tensor products of spline bases can be used for inputs with dimensions larger than one (see Section 5.2, and the CART and MARS models in Chapter 9). The parameter θ can be the total degree of the polynomial or the number of knots in the case of splines.

Radial basis functions are symmetric p-dimensional kernels located at particular centroids,

  f_θ(x) = Σ_{m=1}^M K_{λ_m}(μ_m, x) θ_m;     (2.44)

for example, the Gaussian kernel K_λ(μ, x) = e^{−||x − μ||²/(2λ)} is popular. Radial basis functions have centroids μ_m and scales λ_m that have to be determined. The spline basis functions have knots. In general we would like the data to dictate them as well. Including these as parameters changes the regression problem from a straightforward linear problem to a combinatorially hard nonlinear problem. In practice, shortcuts such as greedy algorithms or two-stage processes are used. Section 6.7 describes some such approaches.

A single-layer feed-forward neural network model with linear output weights can be thought of as an adaptive basis function method. The model has the form

  f_θ(x) = Σ_{m=1}^M β_m σ(α_m^T x + b_m),     (2.45)

where σ(x) = 1/(1 + e^{−x}) is known as the activation function.
Here, as in the projection pursuit model, the directions α_m and the bias terms b_m have to be determined, and their estimation is the meat of the computation. Details are given in Chapter 11.

These adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary D of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.
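To make the kernel-weighting idea above concrete, here is a minimal sketch of the Nadaraya–Watson weighted average of (2.41) with a Gaussian kernel. The bandwidth value and the one-dimensional data are illustrative assumptions, not taken from the text:

```python
import numpy as np

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average at x0: weights decay with squared distance."""
    w = np.exp(-((x - x0) ** 2) / (2.0 * lam))  # Gaussian kernel, width lam
    return np.sum(w * y) / np.sum(w)

# illustrative one-dimensional data: a smooth curve plus noise
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=300)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=300)

fhat = nadaraya_watson(0.25, x, y, lam=0.002)   # estimate near the peak of sin
```

The estimate near x_0 = 0.25 sits close to sin(π/2) = 1, slightly shrunk toward the neighboring (lower) values of the curve; the bias grows and the variance shrinks as lam is increased, which is the neighborhood-size tradeoff discussed above.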

2.9 Model Selection and the Bias–Variance Tradeoff

All the models described above and many others discussed in later chapters have a smoothing or complexity parameter that has to be determined:

- the multiplier of the penalty term;
- the width of the kernel;
- or the number of basis functions.

In the case of the smoothing spline, the parameter λ indexes models ranging from a straight line fit to the interpolating model. Similarly a local degree-m polynomial model ranges between a degree-m global polynomial when the window size is infinitely large, to an interpolating fit when the window size shrinks to zero. This means that we cannot use residual sum-of-squares on the training data to determine these parameters as well, since we would always pick those that gave interpolating fits and hence zero residuals. Such a model is unlikely to predict future data well at all.

The k-nearest-neighbor regression fit f̂_k(x_0) usefully illustrates the competing forces that affect the predictive ability of such approximations. Suppose the data arise from a model Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ². For simplicity here we assume that the values of x_i in the sample are fixed in advance (nonrandom). The expected prediction error at x_0, also known as test or generalization error, can be decomposed:

  EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
             = σ² + [Bias²(f̂_k(x_0)) + Var_T(f̂_k(x_0))]     (2.46)
             = σ² + [f(x_0) − (1/k) Σ_{l=1}^k f(x_(l))]² + σ²/k.     (2.47)

The subscripts in parentheses (l) indicate the sequence of nearest neighbors to x_0.

There are three terms in this expression. The first term σ² is the irreducible error (the variance of the new test target) and is beyond our control, even if we know the true f(x_0). The second and third terms are under our control, and make up the mean squared error of f̂_k(x_0) in estimating f(x_0), which is broken down into a bias component and a variance component.
The bias term is the squared difference between the true mean f(x_0) and the expected value of the estimate, [E_T(f̂_k(x_0)) − f(x_0)]², where the expectation averages the randomness in the training data. This term will most likely increase with k, if the true function is reasonably smooth. For small k the few closest neighbors will have values f(x_(l)) close to f(x_0), so their average should

[Figure 2.11 appeared here: Prediction Error versus Model Complexity (Low to High) for a Test Sample and a Training Sample; high bias/low variance at the left, low bias/high variance at the right.]
FIGURE 2.11. Test and training error as a function of model complexity.

be close to f(x_0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) Σ_i (y_i − ŷ_i)². Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x_0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
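For a fixed design, the decomposition (2.47) can be evaluated exactly, since the bias and variance terms depend only on f, σ² and the neighbor locations. A small sketch (the true function, noise level and design grid are illustrative choices, not from the book):

```python
import numpy as np

# Bias-variance decomposition (2.47) for k-NN regression at x0, fixed x_i:
# EPE_k(x0) = sigma^2 + [f(x0) - mean of f over the k nearest x_i]^2 + sigma^2/k
f = lambda t: t ** 2                 # illustrative true function
sigma = 0.5
x = np.linspace(-1.0, 1.0, 101)      # fixed (nonrandom) design
x0 = 0.3

terms = {}
for k in (1, 5, 25):
    nn = np.argsort(np.abs(x - x0))[:k]          # k nearest neighbors of x0
    bias_sq = (f(x0) - f(x[nn]).mean()) ** 2     # squared bias term
    var = sigma ** 2 / k                         # variance of an average
    terms[k] = (bias_sq, var)
```

As k grows the variance term σ²/k shrinks while the squared bias grows, which is exactly the tradeoff described above.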

Bibliographic Notes

Some good general books on the learning problem are Duda et al. (2000), Bishop (1995), Bishop (2006), Ripley (1996), Cherkassky and Mulier (2007) and Vapnik (1996). Parts of this chapter are based on Friedman (1994b).

Exercises

Ex. 2.1 Suppose each of K-classes has an associated target t_k, which is a vector of all zeros, except a one in the kth position. Show that classifying to the largest element of ŷ amounts to choosing the closest target, min_k ||t_k − ŷ||, if the elements of ŷ sum to one.

Ex. 2.2 Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.

Ex. 2.3 Derive equation (2.24).

Ex. 2.4 The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. Consider inputs drawn from a spherical multinormal distribution X ∼ N(0, I_p). The squared distance from any sample point to the origin has a χ²_p distribution with mean p. Consider a prediction point x_0 drawn from this distribution, and let a = x_0/||x_0|| be an associated unit vector. Let z_i = a^T x_i be the projection of each of the training points on this direction. Show that the z_i are distributed N(0, 1) with expected squared distance from the origin 1, while the target point has expected squared distance p from the origin. Hence for p = 10, a randomly drawn test point is about 3.1 standard deviations from the origin, while all the training points are on average one standard deviation along direction a. So most prediction points see themselves as lying on the edge of the training set.

Ex. 2.5
(a) Derive equation (2.27). The last line makes use of (3.8) through a conditioning argument.
(b) Derive equation (2.28), making use of the cyclic property of the trace operator [trace(AB) = trace(BA)], and its linearity (which allows us to interchange the order of trace and expectation).

Ex. 2.6 Consider a regression problem with inputs x_i and outputs y_i, and a parameterized model f_θ(x) to be fit by least squares. Show that if there are observations with tied or identical values of x, then the fit can be obtained from a reduced weighted least squares problem.

Ex. 2.7 Suppose we have a sample of N pairs x_i, y_i drawn i.i.d. from the distribution characterized as follows:

  x_i ∼ h(x), the design density
  y_i = f(x_i) + ε_i, f is the regression function
  ε_i ∼ (0, σ²) (mean zero, variance σ²)

We construct an estimator for f linear in the y_i,

  f̂(x_0) = Σ_{i=1}^N l_i(x_0; X) y_i,

where the weights l_i(x_0; X) do not depend on the y_i, but do depend on the entire training sequence of x_i, denoted here by X.

(a) Show that linear regression and k-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights l_i(x_0; X) in each of these cases.

(b) Decompose the conditional mean-squared error

  E_{Y|X} (f(x_0) − f̂(x_0))²

into a conditional squared bias and a conditional variance component. Like X, Y represents the entire training sequence of y_i.

(c) Decompose the (unconditional) mean-squared error

  E_{Y,X} (f(x_0) − f̂(x_0))²

into a squared bias and a variance component.

(d) Establish a relationship between the squared biases and variances in the above two cases.

Ex. 2.8 Compare the classification performance of linear regression and k-nearest neighbor classification on the zipcode data. In particular, consider only the 2's and 3's, and k = 1, 3, 5, 7 and 15. Show both the training and test error for each choice. The zipcode data are available from the book website www-stat.stanfrd.edu/elemstatlearn.

Ex. 2.9 Consider a linear regression model with p parameters, fit by least squares to a set of training data (x_1, y_1), ..., (x_N, y_N) drawn at random from a population. Let β̂ be the least squares estimate. Suppose we have some test data (x̃_1, ỹ_1), ..., (x̃_M, ỹ_M) drawn at random from the same population as the training data. If

  R_tr(β) = (1/N) Σ_{i=1}^N (y_i − β^T x_i)²  and  R_te(β) = (1/M) Σ_{i=1}^M (ỹ_i − β^T x̃_i)²,

prove that

  E[R_tr(β̂)] ≤ E[R_te(β̂)],

where the expectations are over all that is random in each expression. [This exercise was brought to our attention by Ryan Tibshirani, from a homework assignment given by Andrew Ng.]


3 Linear Methods for Regression

3.1 Introduction

A linear regression model assumes that the regression function E(Y|X) is linear in the inputs X_1, ..., X_p. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.

In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification. On some topics we go into considerable detail, as it is our firm belief that an understanding of linear methods is essential for understanding nonlinear ones. In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here.

3.2 Linear Regression Models and Least Squares

As introduced in Chapter 2, we have an input vector X^T = (X_1, X_2, ..., X_p), and want to predict a real-valued output Y. The linear regression model has the form

  f(X) = β_0 + Σ_{j=1}^p X_j β_j.     (3.1)

The linear model either assumes that the regression function E(Y|X) is linear, or that the linear model is a reasonable approximation. Here the β_j's are unknown parameters or coefficients, and the variables X_j can come from different sources:

- quantitative inputs;
- transformations of quantitative inputs, such as log, square-root or square;
- basis expansions, such as X_2 = X_1², X_3 = X_1³, leading to a polynomial representation;
- numeric or dummy coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create X_j, j = 1, ..., 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants, since in Σ_{j=1}^5 X_j β_j, one of the X_j's is one, and the others are zero.
- interactions between variables, for example, X_3 = X_1 · X_2.

No matter the source of the X_j, the model is linear in the parameters.

Typically we have a set of training data (x_1, y_1) ... (x_N, y_N) from which to estimate the parameters β. Each x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T is a vector of feature measurements for the ith case. The most popular estimation method is least squares, in which we pick the coefficients β = (β_0, β_1, ..., β_p)^T to minimize the residual sum of squares

  RSS(β) = Σ_{i=1}^N (y_i − f(x_i))²
         = Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )².     (3.2)

From a statistical point of view, this criterion is reasonable if the training observations (x_i, y_i) represent independent random draws from their population. Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i. Figure 3.1 illustrates the geometry of least-squares fitting in the (p + 1)-dimensional

FIGURE 3.1. Linear least squares fitting with X ∈ IR². We seek the linear function of X that minimizes the sum of squared residuals from Y.

space occupied by the pairs (X, Y). Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

How do we minimize (3.2)? Denote by X the N × (p + 1) matrix with each row an input vector (with a 1 in the first position), and similarly let y be the N-vector of outputs in the training set. Then we can write the residual sum-of-squares as

  RSS(β) = (y − Xβ)^T (y − Xβ).     (3.3)

This is a quadratic function in the p + 1 parameters. Differentiating with respect to β we obtain

  ∂RSS/∂β = −2 X^T (y − Xβ)
  ∂²RSS/∂β∂β^T = 2 X^T X.     (3.4)

Assuming (for the moment) that X has full column rank, and hence X^T X is positive definite, we set the first derivative to zero,

  X^T (y − Xβ) = 0,     (3.5)

to obtain the unique solution

  β̂ = (X^T X)^{-1} X^T y.     (3.6)
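A minimal numerical sketch of (3.5) and (3.6), on simulated data with illustrative coefficients: solve the normal equations for β̂ and check that the residual vector is orthogonal to the columns of X.

```python
import numpy as np

# Solve the normal equations X^T (y - X beta) = 0, i.e.
# beta_hat = (X^T X)^{-1} X^T y, for a full-column-rank design.
rng = np.random.default_rng(3)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])  # 1 in first position
beta = np.array([1.0, 2.0, -1.0, 0.5])                      # illustrative truth
y = X @ beta + 0.1 * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # unique solution (3.6)
resid = y - X @ beta_hat
orth = np.abs(X.T @ resid).max()               # (3.5): residual orthogonal to X
```

In production code one would use a QR decomposition (e.g. np.linalg.lstsq) rather than forming X^T X explicitly, since the latter squares the condition number of the problem.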

FIGURE 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x_1 and x_2. The projection ŷ represents the vector of the least squares predictions.

The predicted values at an input vector x_0 are given by f̂(x_0) = (1 : x_0)^T β̂; the fitted values at the training inputs are

  ŷ = Xβ̂ = X(X^T X)^{-1} X^T y,     (3.7)

where ŷ_i = f̂(x_i). The matrix H = X(X^T X)^{-1} X^T appearing in equation (3.7) is sometimes called the "hat" matrix because it puts the hat on y.

Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in IR^N. We denote the column vectors of X by x_0, x_1, ..., x_p, with x_0 ≡ 1. For much of what follows, this first column is treated like any other. These vectors span a subspace of IR^N, also referred to as the column space of X. We minimize RSS(β) = ||y − Xβ||² by choosing β̂ so that the residual vector y − ŷ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate ŷ is hence the orthogonal projection of y onto this subspace. The hat matrix H computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of X are not linearly independent, so that X is not of full rank. This would occur, for example, if two of the inputs were perfectly correlated (e.g., x_2 = 3x_1). Then X^T X is singular and the least squares coefficients β̂ are not uniquely defined. However, the fitted values ŷ = Xβ̂ are still the projection of y onto the column space of X; there is just more than one way to express that projection in terms of the column vectors of X. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X. Most regression software packages detect these redundancies and automatically implement

some strategy for removing them. Rank deficiencies can also occur in signal and image analysis, where the number of inputs p can exceed the number of training cases N. In this case, the features are typically reduced by filtering or else the fitting is controlled by regularization (Section 5.2.3 and Chapter 18).

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of β̂, we now assume that the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non random). The variance–covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

  Var(β̂) = (X^T X)^{-1} σ².     (3.8)

Typically one estimates the variance σ² by

  σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)².

The N − p − 1 rather than N in the denominator makes σ̂² an unbiased estimate of σ²: E(σ̂²) = σ².

To draw inferences about the parameters and the model, additional assumptions are needed. We now assume that (3.1) is the correct model for the mean; that is, the conditional expectation of Y is linear in X_1, ..., X_p. We also assume that the deviations of Y around its expectation are additive and Gaussian. Hence

  Y = E(Y|X_1, ..., X_p) + ε
    = β_0 + Σ_{j=1}^p X_j β_j + ε,     (3.9)

where the error ε is a Gaussian random variable with expectation zero and variance σ², written ε ∼ N(0, σ²). Under (3.9), it is easy to show that

  β̂ ∼ N(β, (X^T X)^{-1} σ²).     (3.10)

This is a multivariate normal distribution with mean vector and variance–covariance matrix as shown. Also

  (N − p − 1) σ̂² ∼ σ² χ²_{N−p−1},     (3.11)

a chi-squared distribution with N − p − 1 degrees of freedom. In addition β̂ and σ̂² are statistically independent. We use these distributional properties to form tests of hypothesis and confidence intervals for the parameters β_j.
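The sampling properties (3.8) and the unbiasedness of σ̂² can be checked by simulation, holding X fixed and redrawing y many times (an illustrative sketch with an arbitrary design and coefficient vector, not an example from the text):

```python
import numpy as np

# With X fixed, draw y repeatedly, refit, and compare: (a) the empirical
# variance of beta_hat with the diagonal of (X^T X)^{-1} sigma^2, and
# (b) the mean of sigma_hat^2 with sigma^2.
rng = np.random.default_rng(4)
N, p, sigma = 60, 2, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # fixed design
beta = np.array([0.5, 1.0, -2.0])
XtX_inv = np.linalg.inv(X.T @ X)

betas, s2s = [], []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=N)
    b = XtX_inv @ X.T @ y
    r = y - X @ b
    betas.append(b)
    s2s.append(r @ r / (N - p - 1))   # unbiased estimate of sigma^2

emp_var = np.var(np.array(betas), axis=0)     # empirical Var(beta_hat_j)
theory_var = sigma ** 2 * np.diag(XtX_inv)    # diagonal of (3.8)
mean_s2 = np.mean(s2s)                        # should sit near sigma^2 = 1
```

Over 2000 replications the empirical variances match the diagonal of (X^T X)^{-1}σ² to within Monte Carlo error, and the average of σ̂² sits near σ².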

FIGURE 3.3. The tail probabilities Pr(|Z| > z) for three distributions: t₃₀, t₁₀₀, and standard normal. Shown are the appropriate quantiles for testing significance at the p = 0.05 and 0.01 levels. The difference between t and the standard normal becomes negligible for N bigger than about 100.

To test the hypothesis that a particular coefficient βⱼ = 0, we form the standardized coefficient or Z-score

zⱼ = β̂ⱼ / (σ̂ √vⱼ),  (3.12)

where vⱼ is the jth diagonal element of (XᵀX)⁻¹. Under the null hypothesis that βⱼ = 0, zⱼ is distributed as t₍N−p−1₎ (a t distribution with N − p − 1 degrees of freedom), and hence a large (absolute) value of zⱼ will lead to rejection of this null hypothesis. If σ̂ is replaced by a known value σ, then zⱼ would have a standard normal distribution. The difference between the tail quantiles of a t-distribution and a standard normal becomes negligible as the sample size increases, and so we typically use the normal quantiles (see Figure 3.3).

Often we need to test for the significance of groups of coefficients simultaneously. For example, to test if a categorical variable with k levels can be excluded from a model, we need to test whether the coefficients of the dummy variables used to represent the levels can all be set to zero. Here we use the F statistic,

F = [(RSS₀ − RSS₁)/(p₁ − p₀)] / [RSS₁/(N − p₁ − 1)],  (3.13)

where RSS₁ is the residual sum-of-squares for the least squares fit of the bigger model with p₁ + 1 parameters, and RSS₀ the same for the nested smaller model with p₀ + 1 parameters, having p₁ − p₀ parameters constrained to be

zero. The F statistic measures the change in residual sum-of-squares per additional parameter in the bigger model, and it is normalized by an estimate of σ². Under the Gaussian assumptions, and the null hypothesis that the smaller model is correct, the F statistic will have an F₍p₁−p₀, N−p₁−1₎ distribution. It can be shown (Exercise 3.1) that the zⱼ in (3.12) are equivalent to the F statistic for dropping the single coefficient βⱼ from the model. For large N, the quantiles of F₍p₁−p₀, N−p₁−1₎ approach those of χ²₍p₁−p₀₎/(p₁ − p₀).

Similarly, we can isolate βⱼ in (3.10) to obtain a 1 − 2α confidence interval for βⱼ:

(β̂ⱼ − z⁽¹⁻ᵅ⁾ √vⱼ σ̂,  β̂ⱼ + z⁽¹⁻ᵅ⁾ √vⱼ σ̂).  (3.14)

Here z⁽¹⁻ᵅ⁾ is the 1 − α percentile of the normal distribution: z⁽¹⁻⁰·⁰²⁵⁾ = 1.96, z⁽¹⁻⁰·⁰⁵⁾ = 1.645, etc. Hence the standard practice of reporting β̂ ± 2·se(β̂) amounts to an approximate 95% confidence interval. Even if the Gaussian error assumption does not hold, this interval will be approximately correct, with its coverage approaching 1 − 2α as the sample size N → ∞.

In a similar fashion we can obtain an approximate confidence set for the entire parameter vector β, namely

C_β = {β : (β̂ − β)ᵀXᵀX(β̂ − β) ≤ σ̂² χ²₍p+1₎⁽¹⁻ᵅ⁾},  (3.15)

where χ²ₗ⁽¹⁻ᵅ⁾ is the 1 − α percentile of the chi-squared distribution on l degrees of freedom: for example, χ²₅⁽¹⁻⁰·⁰⁵⁾ = 11.1, χ²₅⁽¹⁻⁰·¹⁾ = 9.2. This confidence set for β generates a corresponding confidence set for the true function f(x) = xᵀβ, namely {xᵀβ : β ∈ C_β} (Exercise 3.2; see also Figure 5.4 in Section 5.2.2 for examples of confidence bands for functions).

3.2.1 Example: Prostate Cancer

The data for this example come from a study by Stamey et al. (1989). They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
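The claimed equivalence between the Z-score (3.12) and the F statistic (3.13) for dropping a single coefficient can be checked numerically. A minimal sketch in Python with numpy, on simulated data of our own (not the prostate data):

```python
import numpy as np

def rss(X, y):
    """Residual sum-of-squares of the least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(1)
N = 40
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=N)

# Z-score (3.12) for the last coefficient
p1 = 3
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = rss(X, y) / (N - p1 - 1)          # unbiased estimate of sigma^2
z_last = beta_hat[-1] / np.sqrt(sigma2_hat * XtX_inv[-1, -1])

# F statistic (3.13) for dropping that single coefficient (p0 = 2)
p0 = 2
F = ((rss(X[:, :-1], y) - rss(X, y)) / (p1 - p0)) / (rss(X, y) / (N - p1 - 1))
```

For a single dropped coefficient, z² = F exactly (this is Exercise 3.1 in the text).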
The correlation matrix of the predictors given in Table 3.1 shows many strong correlations. Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix showing every pairwise plot between the variables. We see that svi is a binary variable, and gleason is an ordered categorical variable. We see, for

TABLE 3.1. Correlations of predictors in the prostate cancer data (rows and columns: lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45).

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the 0.025 tail quantiles of the t₆₇₋₉ distribution are ±2.002!)

The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age,

lcp, gleason, and pgg45. We get

F = [(32.81 − 29.43)/(9 − 5)] / [29.43/(67 − 9)] = 1.67,  (3.16)

which has a p-value of 0.17 (Pr(F₄,₅₈ > 1.67) = 0.17), and hence is not significant.

The mean prediction error on the test data is 0.521. In contrast, prediction using the mean training value of lpsa has a test error of 1.057, which is called the "base error rate." Hence the linear model reduces the base error rate by about 50%. We will return to this example later to compare various selection and shrinkage methods.

3.2.2 The Gauss–Markov Theorem

One of the most famous results in statistics asserts that the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates. We will make this precise here, and also make clear that the restriction to unbiased estimates is not necessarily a wise one. This observation will lead us to consider biased estimates such as ridge regression later in the chapter. We focus on estimation of any linear combination of the parameters θ = aᵀβ; for example, predictions f(x₀) = x₀ᵀβ are of this form. The least squares estimate of aᵀβ is

θ̂ = aᵀβ̂ = aᵀ(XᵀX)⁻¹Xᵀy.  (3.17)

Considering X to be fixed, this is a linear function c₀ᵀy of the response vector y. If we assume that the linear model is correct, aᵀβ̂ is unbiased since

E(aᵀβ̂) = E(aᵀ(XᵀX)⁻¹Xᵀy) = aᵀ(XᵀX)⁻¹XᵀXβ = aᵀβ.  (3.18)

The Gauss–Markov theorem states that if we have any other linear estimator θ̃ = cᵀy that is unbiased for aᵀβ, that is, E(cᵀy) = aᵀβ, then

Var(aᵀβ̂) ≤ Var(cᵀy).  (3.19)

The proof (Exercise 3.3) uses the triangle inequality. For simplicity we have stated the result in terms of estimation of a single parameter aᵀβ, but with a few more definitions one can state it in terms of the entire parameter vector β (Exercise 3.3). Consider the mean squared error of an estimator θ̃ in estimating θ:

MSE(θ̃) = E(θ̃ − θ)² = Var(θ̃) + [E(θ̃) − θ]².  (3.20)
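The decomposition (3.20) also holds exactly for empirical moments, provided the variance uses the matching 1/n normalization. A minimal sketch in Python with numpy, using a deliberately biased estimator of our own construction:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0
# Draws of a (deliberately) biased estimator: bias 0.5, standard deviation 0.3
theta_tilde = theta + 0.5 + 0.3 * rng.normal(size=10000)

# Empirical mean squared error around the true theta
mse = np.mean((theta_tilde - theta) ** 2)
# Variance plus squared bias, as in (3.20); np.var defaults to the 1/n normalization
decomp = np.var(theta_tilde) + (np.mean(theta_tilde) - theta) ** 2
```

With matching normalizations the two quantities agree to machine precision, which makes this a convenient unit test when implementing estimators.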

The first term is the variance, while the second term is the squared bias. The Gauss–Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimator with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. Biased estimates are commonly used. Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7.

Mean squared error is intimately related to prediction accuracy, as discussed in Chapter 2. Consider the prediction of the new response at input x₀,

Y₀ = f(x₀) + ε₀.  (3.21)

Then the expected prediction error of an estimate f̃(x₀) = x₀ᵀβ̃ is

E(Y₀ − f̃(x₀))² = σ² + E(x₀ᵀβ̃ − f(x₀))² = σ² + MSE(f̃(x₀)).  (3.22)

Therefore, expected prediction error and mean squared error differ only by the constant σ², representing the variance of the new observation y₀.

3.2.3 Multiple Regression from Simple Univariate Regression

The linear model (3.1) with p > 1 inputs is called the multiple linear regression model. The least squares estimates (3.6) for this model are best understood in terms of the estimates for the univariate (p = 1) linear model, as we indicate in this section. Suppose first that we have a univariate model with no intercept, that is,

Y = Xβ + ε.  (3.23)

The least squares estimate and residuals are

β̂ = Σᵢ₌₁ᴺ xᵢyᵢ / Σᵢ₌₁ᴺ xᵢ²,   rᵢ = yᵢ − xᵢβ̂.  (3.24)

In convenient vector notation, we let y = (y₁, ..., y_N)ᵀ, x = (x₁, ..., x_N)ᵀ and define

⟨x, y⟩ = Σᵢ₌₁ᴺ xᵢyᵢ = xᵀy,  (3.25)

the inner product between x and y. Then we can write

β̂ = ⟨x, y⟩/⟨x, x⟩,   r = y − xβ̂.  (3.26)

As we will see, this simple univariate regression provides the building block for multiple linear regression. Suppose next that the inputs x₁, x₂, ..., x_p (the columns of the data matrix X) are orthogonal; that is, ⟨xⱼ, x_k⟩ = 0 for all j ≠ k. Then it is easy to check that the multiple least squares estimates β̂ⱼ are equal to ⟨xⱼ, y⟩/⟨xⱼ, xⱼ⟩, the univariate estimates. In other words, when the inputs are orthogonal, they have no effect on each other's parameter estimates in the model.

Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data. Hence we will have to orthogonalize them in order to carry this idea further. Suppose next that we have an intercept and a single input x. Then the least squares coefficient of x has the form

β̂₁ = ⟨x − x̄1, y⟩ / ⟨x − x̄1, x − x̄1⟩,  (3.27)

where x̄ = Σᵢ xᵢ/N, and 1 = x₀, the vector of N ones. We can view the estimate (3.27) as the result of two applications of the simple regression (3.26). The steps are:

1. regress x on 1 to produce the residual z = x − x̄1;
2. regress y on the residual z to give the coefficient β̂₁.

In this procedure, "regress b on a" means a simple univariate regression of b on a with no intercept, producing coefficient γ̂ = ⟨a, b⟩/⟨a, a⟩ and residual vector b − γ̂a. We say that b is adjusted for a, or is "orthogonalized" with respect to a.

Step 1 orthogonalizes x with respect to x₀ = 1. Step 2 is just a simple univariate regression, using the orthogonal predictors 1 and z. Figure 3.4 shows this process for two general inputs x₁ and x₂. The orthogonalization does not change the subspace spanned by x₁ and x₂; it simply produces an orthogonal basis for representing it. This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1. Note that the inputs z₀, ..., z_{j−1} in step 2 are orthogonal, hence the simple regression coefficients computed there are in fact also the multiple regression coefficients.
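The two-step recipe for (3.27) can be sketched directly. A minimal example in Python with numpy, on simulated data of our own:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Step 1: regress x on 1 (no intercept) -> residual z = x - x_bar * 1
z = x - x.mean()
# Step 2: regress y on the residual z -> coefficient beta_hat_1, as in (3.27)
beta1 = (z @ y) / (z @ z)

# The same coefficient from the full least squares fit with an intercept
X = np.column_stack([np.ones(N), x])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The slope obtained by the two-step orthogonalization equals the slope of the intercept-plus-x least squares fit exactly.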
The inner-product notation is suggestive of generalizations of linear regression to different metric spaces, as well as to probability spaces.

FIGURE 3.4. Least squares regression by orthogonalization of the inputs. The vector x₂ is regressed on the vector x₁, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x₂. Adding together the projections of y on each of x₁ and z gives the least squares fit ŷ.

Algorithm 3.1 Regression by Successive Orthogonalization.
1. Initialize z₀ = x₀ = 1.
2. For j = 1, 2, ..., p
   Regress xⱼ on z₀, z₁, ..., z_{j−1} to produce coefficients γ̂ₗⱼ = ⟨zₗ, xⱼ⟩/⟨zₗ, zₗ⟩, l = 0, ..., j − 1, and residual vector zⱼ = xⱼ − Σₖ₌₀^{j−1} γ̂ₖⱼ zₖ.
3. Regress y on the residual z_p to give the estimate β̂_p.

The result of this algorithm is

β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩.  (3.28)

Re-arranging the residual in step 2, we can see that each of the xⱼ is a linear combination of the z_k, k ≤ j. Since the zⱼ are all orthogonal, they form a basis for the column space of X, and hence the least squares projection onto this subspace is ŷ. Since z_p alone involves x_p (with coefficient 1), we see that the coefficient (3.28) is indeed the multiple regression coefficient of y on x_p. This key result exposes the effect of correlated inputs in multiple regression. Note also that by rearranging the xⱼ, any one of them could be in the last position, and a similar result holds. Hence stated more generally, we have shown that the jth multiple regression coefficient is the univariate regression coefficient of y on x_{j·012...(j−1)(j+1)...p}, the residual after regressing xⱼ on x₀, x₁, ..., x_{j−1}, x_{j+1}, ..., x_p:

The multiple regression coefficient β̂ⱼ represents the additional contribution of xⱼ on y, after xⱼ has been adjusted for x₀, x₁, ..., x_{j−1}, x_{j+1}, ..., x_p.

If x_p is highly correlated with some of the other x_k's, the residual vector z_p will be close to zero, and from (3.28) the coefficient β̂_p will be very unstable. This will be true for all the variables in the correlated set. In such situations, we might have all the Z-scores (as in Table 3.2) be small (any one of the set can be deleted), yet we cannot delete them all. From (3.28) we also obtain an alternate formula for the variance estimates (3.8),

Var(β̂_p) = σ²/⟨z_p, z_p⟩ = σ²/‖z_p‖².  (3.29)

In other words, the precision with which we can estimate β̂_p depends on the length of the residual vector z_p; this represents how much of x_p is unexplained by the other x_k's.

Algorithm 3.1 is known as the Gram–Schmidt procedure for multiple regression, and is also a useful numerical strategy for computing the estimates. We can obtain from it not just β̂_p, but also the entire multiple least squares fit, as shown in Exercise 3.4.

We can represent step 2 of Algorithm 3.1 in matrix form:

X = ZΓ,  (3.30)

where Z has as columns the zⱼ (in order), and Γ is the upper triangular matrix with entries γ̂ₖⱼ. Introducing the diagonal matrix D with jth diagonal entry Dⱼⱼ = ‖zⱼ‖, we get

X = ZD⁻¹DΓ = QR,  (3.31)

the so-called QR decomposition of X. Here Q is an N × (p + 1) orthogonal matrix, QᵀQ = I, and R is a (p + 1) × (p + 1) upper triangular matrix. The QR decomposition represents a convenient orthogonal basis for the column space of X. It is easy to see, for example, that the least squares solution is given by

β̂ = R⁻¹Qᵀy,  (3.32)
ŷ = QQᵀy.  (3.33)

Equation (3.32) is easy to solve because R is upper triangular (Exercise 3.4).
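Both routes to the least squares fit, Algorithm 3.1 and the QR decomposition (3.31)–(3.33), can be sketched and cross-checked. A minimal implementation in Python with numpy; the function name and simulated data are ours:

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Algorithm 3.1: Gram-Schmidt the columns of X in order, then regress y
    on the last residual z_p to obtain the multiple regression coefficient (3.28)."""
    N, cols = X.shape
    Z = np.empty((N, cols))
    Z[:, 0] = X[:, 0]                           # z_0 = x_0 = 1
    for j in range(1, cols):
        zj = X[:, j].astype(float).copy()
        for l in range(j):
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            zj -= gamma * Z[:, l]               # adjust x_j for z_l
        Z[:, j] = zj
    zp = Z[:, -1]
    return (zp @ y) / (zp @ zp)

rng = np.random.default_rng(4)
N = 30
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 3))])
y = rng.normal(size=N)

beta_p = successive_orthogonalization(X, y)

# The QR route: beta = R^{-1} Q^T y (3.32), y_hat = Q Q^T y (3.33)
Q, R = np.linalg.qr(X)                          # thin QR: Q is N x (p+1)
beta_qr = np.linalg.solve(R, Q.T @ y)
y_hat = Q @ (Q.T @ y)
```

The last coefficient from the successive-orthogonalization route agrees exactly with the QR solution, and the QR fitted values agree with Xβ̂.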

3.2.4 Multiple Outputs

Suppose we have multiple outputs Y₁, Y₂, ..., Y_K that we wish to predict from our inputs X₀, X₁, X₂, ..., X_p. We assume a linear model for each output:

Y_k = β₀ₖ + Σⱼ₌₁ᵖ Xⱼβⱼₖ + εₖ = fₖ(X) + εₖ.  (3.34)–(3.35)

With N training cases we can write the model in matrix notation:

Y = XB + E.  (3.36)

Here Y is the N × K response matrix, with ik entry y_{ik}, X is the N × (p + 1) input matrix, B is the (p + 1) × K matrix of parameters and E is the N × K matrix of errors. A straightforward generalization of the univariate loss function (3.2) is

RSS(B) = Σₖ₌₁ᴷ Σᵢ₌₁ᴺ (y_{ik} − fₖ(xᵢ))² = tr[(Y − XB)ᵀ(Y − XB)].  (3.37)–(3.38)

The least squares estimates have exactly the same form as before:

B̂ = (XᵀX)⁻¹XᵀY.  (3.39)

Hence the coefficients for the kth outcome are just the least squares estimates in the regression of y_k on x₀, x₁, ..., x_p. Multiple outputs do not affect one another's least squares estimates.

If the errors ε = (ε₁, ..., ε_K) in (3.34) are correlated, then it might seem appropriate to modify (3.37) in favor of a multivariate version. Specifically, suppose Cov(ε) = Σ; then the multivariate weighted criterion

RSS(B; Σ) = Σᵢ₌₁ᴺ (yᵢ − f(xᵢ))ᵀ Σ⁻¹ (yᵢ − f(xᵢ))  (3.40)

arises naturally from multivariate Gaussian theory. Here f(x) is the vector function (f₁(x), ..., f_K(x))ᵀ, and yᵢ the vector of K responses for observation i. However, it can be shown that again the solution is given by (3.39): K separate regressions that ignore the correlations (Exercise 3.11). If the Σᵢ vary among observations, then this is no longer the case, and the solution for B no longer decouples. In Section 3.7 we pursue the multiple outcome problem, and consider situations where it does pay to combine the regressions.
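The decoupling claim (that the multi-output solution (3.39) is just K separate univariate regressions) can be checked directly. A minimal sketch in Python with numpy, on simulated data of our own:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, K = 40, 3, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
Y = rng.normal(size=(N, K))                 # N x K response matrix

# All K outputs at once: B_hat = (X^T X)^{-1} X^T Y, as in (3.39)
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# K separate single-output regressions, one per column of Y
b_cols = [np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(K)]
```

Column k of B̂ coincides exactly with the kth separate fit, whatever the correlations among the columns of Y.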

3.3 Subset Selection

There are two reasons why we are often not satisfied with the least squares estimates (3.6). The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy. The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the "big picture," we are willing to sacrifice some of the small details.

In this section we describe a number of approaches to variable subset selection with linear regression. In later sections we discuss shrinkage and hybrid approaches for controlling variance, as well as other dimension-reduction strategies. These all fall under the general heading model selection. Model selection is not restricted to linear models; Chapter 7 covers this topic in some detail.

With subset selection we retain only a subset of the variables, and eliminate the rest from the model. Least squares regression is used to estimate the coefficients of the inputs that are retained. There are a number of different strategies for choosing the subset.

3.3.1 Best-Subset Selection

Best subset regression finds for each k ∈ {0, 1, 2, ..., p} the subset of size k that gives smallest residual sum of squares (3.2). An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for p as large as 30 or 40. Figure 3.5 shows all the subset models for the prostate cancer example. The lower boundary represents the models that are eligible for selection by the best-subsets approach. Note that the best subset of size 2, for example, need not include the variable that was in the best subset of size 1 (for this example all the subsets are nested).
The best-subset curve (red lower boundary in Figure 3.5) is necessarily decreasing, so it cannot be used to select the subset size k. The question of how to choose k involves the tradeoff between bias and variance, along with the more subjective desire for parsimony. There are a number of criteria that one may use; typically we choose the smallest model that minimizes an estimate of the expected prediction error.

Many of the other approaches that we discuss in this chapter are similar, in that they use the training data to produce a sequence of models varying in complexity and indexed by a single parameter. In the next section we use

FIGURE 3.5. All possible subset models for the prostate cancer example. At each subset size is shown the residual sum-of-squares for each model of that size.

cross-validation to estimate prediction error and select k; the AIC criterion is a popular alternative. We defer more detailed discussion of these and other approaches to Chapter 7.

3.3.2 Forward- and Backward-Stepwise Selection

Rather than search through all possible subsets (which becomes infeasible for p much larger than 40), we can seek a good path through them. Forward-stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9). Like best-subset regression, forward stepwise produces a sequence of models indexed by k, the subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested sequence of models. In this sense it might seem sub-optimal compared to best-subset selection. However, there are several reasons why it might be preferred:

• Computational: for large p we cannot compute the best subset sequence, but we can always compute the forward stepwise sequence (even when p ≫ N).
• Statistical: a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.

FIGURE 3.6. Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = Xᵀβ + ε. There are N = 300 observations on p = 31 standard Gaussian variables, with pairwise correlations all equal to 0.85. For 10 of the variables, the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero. The noise ε ∼ N(0, 6.25), resulting in a signal-to-noise ratio of 0.64. Results are averaged over 50 simulations. Shown is the mean-squared error E‖β̂(k) − β‖² of the estimated coefficient β̂(k) at each step from the true β, plotted against subset size k.

Backward-stepwise selection starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score (Exercise 3.10). Backward selection can only be used when N > p, while forward stepwise can always be used.

Figure 3.6 shows the results of a small simulation study to compare best-subset regression with the simpler alternatives forward and backward selection. Their performance is very similar, as is often the case. Included in the figure is forward stagewise regression (next section), which takes longer to reach minimum error.
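Forward-stepwise selection is simple enough to sketch directly. A minimal (and deliberately naive, refitting from scratch at each step rather than using QR updates) implementation in Python with numpy; the function name and simulated data are ours:

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy forward-stepwise selection: starting from the intercept,
    add one at a time the predictor whose inclusion most reduces the RSS.
    Returns the list of selected column indices, intercept first."""
    N, p = X.shape
    active = [0]                              # column 0 is the intercept
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = y - X[:, cols] @ beta
            if r @ r < best_rss:
                best_j, best_rss = j, r @ r
        active.append(best_j)
    return active

rng = np.random.default_rng(6)
N = 80
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 4))])
y = 3.0 * X[:, 2] + rng.normal(size=N)        # only predictor in column 2 matters

selected = forward_stepwise(X, y, k=1)
```

With a strong signal on one predictor and independent noise variables, the first step picks out that predictor.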

On the prostate cancer example, best-subset, forward and backward selection all gave exactly the same sequence of terms.

Some software packages implement hybrid stepwise-selection strategies that consider both forward and backward moves at each step, and select the "best" of the two. For example, in the R package the step function uses the AIC criterion for weighing the choices, which takes proper account of the number of parameters fit; at each step an add or drop will be performed that minimizes the AIC score. Other more traditional packages base the selection on F-statistics, adding "significant" terms, and dropping "non-significant" terms. These are out of fashion, since they do not take proper account of the multiple testing issues. It is also tempting after a model search to print out a summary of the chosen model, such as in Table 3.2; however, the standard errors are not valid, since they do not account for the search process. The bootstrap (Section 8.2) can be useful in such settings.

Finally, we note that often variables come in groups (such as the dummy variables that code a multi-level categorical predictor). Smart stepwise procedures (such as step in R) will add or drop whole groups at a time, taking proper account of their degrees-of-freedom.

3.3.3 Forward-Stagewise Regression

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression. It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0. At each step the algorithm identifies the variable most correlated with the current residual. It then computes the simple linear regression coefficient of the residual on this chosen variable, and then adds it to the current coefficient for that variable. This is continued till none of the variables have correlation with the residuals, i.e., the least-squares fit when N > p. Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model.
As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient. It turns out that this slow fitting can pay dividends in high-dimensional problems. We see in Section 3.8.1 that both forward stagewise and a variant which is slowed down even further are quite competitive, especially in very high-dimensional problems.

Forward-stagewise regression is included in Figure 3.6. In this example it takes over 1000 steps to get all the correlations below 10⁻⁴. For subset size k, we plotted the error for the last step for which there were k nonzero coefficients. Although it catches up with the best fit, it takes longer to do so.
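The stagewise procedure described above can be sketched in a few lines. A minimal implementation in Python with numpy; the function name, step budget, and simulated data are ours, and the optional eps argument corresponds to the "slowed down" variant mentioned in the text:

```python
import numpy as np

def forward_stagewise(X, y, eps=None, n_steps=5000):
    """Forward-stagewise regression on centered predictors X.

    At each step, find the predictor most correlated with the current residual
    and add its simple regression coefficient (or a small fixed step eps)
    to that predictor's coefficient; no other coefficients are adjusted."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y - y.mean()                          # intercept is y-bar; work with residuals
    for _ in range(n_steps):
        corr = X.T @ r
        j = np.argmax(np.abs(corr))           # most correlated predictor
        delta = corr[j] / (X[:, j] @ X[:, j]) # simple regression coefficient
        if eps is not None:
            delta = np.sign(delta) * eps      # the slowed-down variant
        beta[j] += delta
        r = r - delta * X[:, j]               # update the residual only
    return beta

rng = np.random.default_rng(7)
N, p = 100, 5
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                           # centered predictors
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=N)

beta_fs = forward_stagewise(X, y)
beta_ls, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
```

With enough steps the stagewise coefficients approach the least squares fit, illustrating the "many more than p steps" behavior noted in the text.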

3.3.4 Prostate Cancer Data Example (Continued)

Table 3.3 shows the coefficients from a number of different selection and shrinkage methods. They are best-subset selection using an all-subsets search, ridge regression, the lasso, principal components regression and partial least squares. Each method has a complexity parameter, and this was chosen to minimize an estimate of prediction error based on tenfold cross-validation; full details are given in Section 7.10. Briefly, cross-validation works by dividing the training data randomly into ten equal parts. The learning method is fit, for a range of values of the complexity parameter, to nine-tenths of the data, and the prediction error is computed on the remaining one-tenth. This is done in turn for each one-tenth of the data, and the ten prediction error estimates are averaged. From this we obtain an estimated prediction error curve as a function of the complexity parameter.

Note that we have already divided these data into a training set of size 67 and a test set of size 30. Cross-validation is applied to the training set, since selecting the shrinkage parameter is part of the training process. The test set is there to judge the performance of the selected model.

The estimated prediction error curves are shown in Figure 3.7. Many of the curves are very flat over large ranges near their minimum. Included are estimated standard error bands for each estimated error rate, based on the ten error estimates computed by cross-validation. We have used the "one-standard-error" rule: we pick the most parsimonious model within one standard error of the minimum (Section 7.10, page 244). Such a rule acknowledges the fact that the tradeoff curve is estimated with error, and hence takes a conservative approach. Best-subset selection chose to use the two predictors lcavol and lweight.
The last two lines of the table give the average prediction error (and its estimated standard error) over the test set.

3.4 Shrinkage Methods

By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and has possibly lower prediction error than the full model. However, because it is a discrete process (variables are either retained or discarded) it often exhibits high variance, and so doesn't reduce the prediction error of the full model. Shrinkage methods are more continuous, and don't suffer as much from high variability.

3.4.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of

FIGURE 3.7. Estimated prediction error curves and their standard errors for the various selection and shrinkage methods: all subsets (versus subset size), ridge regression (versus degrees of freedom), the lasso (versus shrinkage factor s), and principal components regression and partial least squares (each versus number of directions). Each curve is plotted as a function of the corresponding complexity parameter for that method. The horizontal axis has been chosen so that the model complexity increases as we move from left to right. The estimates of prediction error and their standard errors were obtained by tenfold cross-validation; full details are given in Section 7.10. The least complex model within one standard error of the best is chosen, indicated by the purple vertical broken lines.

TABLE 3.3. Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted (rows: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, Test Error, Std Error; columns: LS, Best Subset, Ridge, Lasso, PCR, PLS).

squares,

β̂ʳⁱᵈᵍᵉ = argmin_β { Σᵢ₌₁ᴺ (yᵢ − β₀ − Σⱼ₌₁ᵖ xᵢⱼβⱼ)² + λ Σⱼ₌₁ᵖ βⱼ² }.  (3.41)

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and each other). The idea of penalizing by the sum-of-squares of the parameters is also used in neural networks, where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

β̂ʳⁱᵈᵍᵉ = argmin_β Σᵢ₌₁ᴺ (yᵢ − β₀ − Σⱼ₌₁ᵖ xᵢⱼβⱼ)²  subject to  Σⱼ₌₁ᵖ βⱼ² ≤ t,  (3.42)

which makes explicit the size constraint on the parameters. There is a one-to-one correspondence between the parameters λ in (3.41) and t in (3.42). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin. By imposing a size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving (3.41). In addition,

notice that the intercept β₀ has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for Y; that is, adding a constant c to each of the targets yᵢ would not simply result in a shift of the predictions by the same amount c. It can be shown (Exercise 3.5) that the solution to (3.41) can be separated into two parts, after reparametrization using centered inputs: each xᵢⱼ gets replaced by xᵢⱼ − x̄ⱼ. We estimate β₀ by ȳ = (1/N) Σᵢ₌₁ᴺ yᵢ. The remaining coefficients get estimated by a ridge regression without intercept, using the centered xᵢⱼ. Henceforth we assume that this centering has been done, so that the input matrix X has p (rather than p + 1) columns.

Writing the criterion in (3.41) in matrix form,

RSS(λ) = (y − Xβ)ᵀ(y − Xβ) + λβᵀβ,  (3.43)

the ridge regression solutions are easily seen to be

β̂ʳⁱᵈᵍᵉ = (XᵀX + λI)⁻¹Xᵀy,  (3.44)

where I is the p × p identity matrix. Notice that with the choice of quadratic penalty βᵀβ, the ridge regression solution is again a linear function of y. The solution adds a positive constant to the diagonal of XᵀX before inversion. This makes the problem nonsingular, even if XᵀX is not of full rank, and was the main motivation for ridge regression when it was first introduced in statistics (Hoerl and Kennard, 1970). Traditional descriptions of ridge regression start with definition (3.44). We choose to motivate it via (3.41) and (3.42), as these provide insight into how it works.

Figure 3.8 shows the ridge coefficient estimates for the prostate cancer example, plotted as functions of df(λ), the effective degrees of freedom implied by the penalty λ (defined in (3.50) on page 68). In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, that is, β̂ʳⁱᵈᵍᵉ = β̂/(1 + λ).

Ridge regression can also be derived as the mean or mode of a posterior distribution, with a suitably chosen prior distribution.
In detail, suppose yᵢ ∼ N(β₀ + xᵢᵀβ, σ²), and the parameters βⱼ are each distributed as N(0, τ²), independently of one another. Then the (negative) log-posterior density of β, with τ² and σ² assumed known, is equal to the expression in curly braces in (3.41), with λ = σ²/τ² (Exercise 3.6). Thus the ridge estimate is the mode of the posterior distribution; since the distribution is Gaussian, it is also the posterior mean.

The singular value decomposition (SVD) of the centered input matrix X gives us some additional insight into the nature of ridge regression. This decomposition is extremely useful in the analysis of many statistical methods. The SVD of the N × p matrix X has the form

X = UDVᵀ.  (3.45)
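The closed-form ridge solution (3.44) and its shrinkage behavior can be sketched directly. A minimal example in Python with numpy; the function name and simulated data are ours:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution (3.44) on centered inputs: (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(8)
N, p = 50, 4
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                  # center the inputs, as the text assumes
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=N)
y = y - y.mean()                     # beta_0 is estimated by y-bar separately

beta_ols = ridge(X, y, 0.0)          # lambda = 0 recovers least squares
beta_r = ridge(X, y, 10.0)           # lambda > 0 shrinks the coefficients
```

At λ = 0 the solution coincides with ordinary least squares; for λ > 0 the coefficient vector is strictly shorter, since every SVD coordinate is scaled by dⱼ²/(dⱼ² + λ) < 1.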

FIGURE 3.8. Profiles of ridge coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.

Here U and V are N × p and p × p orthogonal matrices, with the columns of U spanning the column space of X, and the columns of V spanning the row space. D is a p × p diagonal matrix, with diagonal entries d₁ ≥ d₂ ≥ ··· ≥ d_p ≥ 0 called the singular values of X. If one or more values dⱼ = 0, X is singular.

Using the singular value decomposition we can write the least squares fitted vector as

Xβ̂ˡˢ = X(XᵀX)⁻¹Xᵀy = UUᵀy,  (3.46)

after some simplification. Note that Uᵀy are the coordinates of y with respect to the orthonormal basis U. Note also the similarity with (3.33); Q and U are generally different orthogonal bases for the column space of X (Exercise 3.8).

Now the ridge solutions are

Xβ̂ʳⁱᵈᵍᵉ = X(XᵀX + λI)⁻¹Xᵀy = UD(D² + λI)⁻¹DUᵀy = Σⱼ₌₁ᵖ uⱼ [dⱼ²/(dⱼ² + λ)] uⱼᵀy,  (3.47)

where the uⱼ are the columns of U. Note that since λ ≥ 0, we have dⱼ²/(dⱼ² + λ) ≤ 1. Like linear regression, ridge regression computes the coordinates of y with respect to the orthonormal basis U. It then shrinks these coordinates by the factors dⱼ²/(dⱼ² + λ). This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller dⱼ².

What does a small value of dⱼ mean? The SVD of the centered matrix X is another way of expressing the principal components of the variables in X. The sample covariance matrix is given by S = XᵀX/N, and from (3.45) we have

XᵀX = VD²Vᵀ,  (3.48)

which is the eigen decomposition of XᵀX (and of S, up to a factor N). The eigenvectors vⱼ (columns of V) are also called the principal components (or Karhunen–Loeve) directions of X. The first principal component direction v₁ has the property that z₁ = Xv₁ has the largest sample variance amongst all normalized linear combinations of the columns of X. This sample variance is easily seen to be

Var(z₁) = Var(Xv₁) = d₁²/N,  (3.49)

and in fact z₁ = Xv₁ = u₁d₁. The derived variable z₁ is called the first principal component of X, and hence u₁ is the normalized first principal

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

component. Subsequent principal components z_j have maximum variance d_j^2/N, subject to being orthogonal to the earlier ones. Conversely the last principal component has minimum variance. Hence the small singular values d_j correspond to directions in the column space of X having small variance, and ridge regression shrinks these directions the most.

Figure 3.9 illustrates the principal components of some data points in two dimensions. If we consider fitting a linear surface over this domain (the Y-axis is sticking out of the page), the configuration of the data allows us to determine its gradient more accurately in the long direction than the short. Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but it need not hold in general.
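The SVD view of ridge regression above translates directly into a few lines of code. The sketch below (the function name `ridge_fit_svd` is ours, not from the book) assumes a centered X: it shrinks the coordinates of y in the basis U by the factors d_j^2/(d_j^2 + λ), and sums those same factors to obtain the effective degrees of freedom df(λ) developed in the text below.

```python
import numpy as np

def ridge_fit_svd(X, y, lam):
    """Ridge fitted values via the SVD (a sketch, not a reference implementation).

    Assumes X is centered (intercept removed a priori) and lam >= 0.
    Returns the fitted vector and the effective degrees of freedom.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)        # shrinkage factor per component
    fitted = U @ (shrink * (U.T @ y))   # shrink the coordinates of y in basis U
    df = shrink.sum()                   # df(lambda) = sum_j d_j^2 / (d_j^2 + lam)
    return fitted, df
```

At λ = 0 this reproduces the least squares fit U U^T y with df = p; as λ grows, each coordinate is damped smoothly and df decreases toward 0.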

In Figure 3.7 we have plotted the estimated prediction error versus the quantity

  df(λ) = tr[ X(X^T X + λI)^{-1} X^T ]
        = tr(H_λ)
        = Σ_{j=1}^p d_j^2 / (d_j^2 + λ).   (3.50)

This monotone decreasing function of λ is the effective degrees of freedom of the ridge regression fit. Usually in a linear-regression fit with p variables, the degrees of freedom of the fit is p, the number of free parameters. The idea is that although all p coefficients in a ridge fit will be non-zero, they are fit in a restricted fashion controlled by λ. Note that df(λ) = p when λ = 0 (no regularization) and df(λ) → 0 as λ → ∞. Of course there is always an additional one degree of freedom for the intercept, which was removed a priori. This definition is motivated in more detail in Section 3.4.4 and Sections 7.4–7.6. In Figure 3.7 the minimum occurs at df(λ) = 5.0. Table 3.3 shows that ridge regression reduces the test error of the full least squares estimates by a small amount.

3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important differences. The lasso estimate is defined by

  β̂^lasso = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_ij β_j )^2
  subject to Σ_{j=1}^p |β_j| ≤ t.   (3.51)

Just as in ridge regression, we can re-parametrize the constant β_0 by standardizing the predictors; the solution for β̂_0 is ȳ, and thereafter we fit a model without an intercept (Exercise 3.5). In the signal processing literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form

  β̂^lasso = argmin_β { (1/2) Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_ij β_j )^2 + λ Σ_{j=1}^p |β_j| }.   (3.52)

Notice the similarity to the ridge regression problem (3.41) or (3.42): the L2 ridge penalty Σ β_j^2 is replaced by the L1 lasso penalty Σ |β_j|. This latter constraint makes the solutions nonlinear in the y_i, and there is no closed form expression as in ridge regression. Computing the lasso solution

is a quadratic programming problem, although we see in Section 3.4.4 that efficient algorithms are available for computing the entire path of solutions as λ is varied, with the same computational cost as for ridge regression.

Because of the nature of the constraint, making t sufficiently small will cause some of the coefficients to be exactly zero. Thus the lasso does a kind of continuous subset selection. If t is chosen larger than t_0 = Σ_1^p |β̂_j| (where β̂_j = β̂_j^ls, the least squares estimates), then the lasso estimates are the β̂_j's. On the other hand, for t = t_0/2, say, the least squares coefficients are shrunk by about 50% on average. However, the nature of the shrinkage is not obvious, and we investigate it further in Section 3.4.4 below. Like the subset size in variable subset selection, or the penalty parameter in ridge regression, t should be adaptively chosen to minimize an estimate of expected prediction error.

In Figure 3.7, for ease of interpretation, we have plotted the lasso prediction error estimates versus the standardized parameter s = t/Σ_1^p |β̂_j|. A value ŝ ≈ 0.36 was chosen by 10-fold cross-validation; this caused four coefficients to be set to zero (fifth column of Table 3.3). The resulting model has the second lowest test error, slightly lower than the full least squares model, but the standard errors of the test error estimates (last line of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning parameter s = t/Σ_1^p |β̂_j| is varied. At s = 1.0 these are the least squares estimates; they decrease to 0 as s → 0. This decrease is not always strictly monotonic, although it is in this example. A vertical line is drawn at s = 0.36, the value chosen by cross-validation.

3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso

In this section we discuss and compare the three approaches discussed so far for restricting the linear regression model: subset selection, ridge regression and the lasso. In the case of an orthonormal input matrix X the three procedures have explicit solutions.
Each method applies a simple transformation to the least squares estimate β̂_j, as detailed in Table 3.4. Ridge regression does a proportional shrinkage. Lasso translates each coefficient by a constant factor λ, truncating at zero. This is called "soft thresholding," and is used in the context of wavelet-based smoothing in Section 5.9. Best-subset selection drops all variables with coefficients smaller than the Mth largest; this is a form of "hard thresholding."

Back to the nonorthogonal case; some pictures help understand their relationship. Figure 3.11 depicts the lasso (left) and ridge regression (right) when there are only two parameters. The residual sum of squares has elliptical contours, centered at the full least squares estimate. The constraint

FIGURE 3.10. Profiles of lasso coefficients, as the tuning parameter t is varied. Coefficients are plotted versus s = t/Σ_1^p |β̂_j|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piecewise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.

TABLE 3.4. Estimators of β_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x_+ denotes the positive part of x. Below the table, estimators are shown by broken red lines. The 45° line in gray shows the unrestricted estimate for reference.

  Estimator               Formula
  Best subset (size M)    β̂_j · I( |β̂_j| ≥ |β̂_(M)| )
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j) ( |β̂_j| − λ )_+

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β_1| + |β_2| ≤ t and β_1^2 + β_2^2 ≤ t^2, respectively, while the red ellipses are the contours of the least squares error function.
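The three transformations in Table 3.4 are one-liners in code. The sketch below (function names are ours) applies them to a vector of least squares estimates; remember that these closed forms are exact only in the orthonormal-X case.

```python
import numpy as np

def best_subset(beta, M):
    """Hard thresholding: keep the M largest |beta_j|, zero out the rest."""
    cutoff = np.sort(np.abs(beta))[-M]
    return np.where(np.abs(beta) >= cutoff, beta, 0.0)

def ridge_shrink(beta, lam):
    """Proportional shrinkage: beta_j / (1 + lambda)."""
    return beta / (1.0 + lam)

def lasso_soft(beta, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lambda)_+."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```

For beta = (3, −1, 0.5) and λ = 1, for example, soft thresholding gives (2, 0, 0): every coefficient is translated toward zero by λ, and those that cross zero are truncated, whereas ridge merely halves each one.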

region for ridge regression is the disk β_1^2 + β_2^2 ≤ t^2, while that for lasso is the diamond |β_1| + |β_2| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter β_j equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

  β̃ = argmin_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_ij β_j )^2 + λ Σ_{j=1}^p |β_j|^q }   (3.53)

for q ≥ 0. The contours of constant value of Σ_j |β_j|^q are shown in Figure 3.12, for the case of two inputs. Thinking of |β_j|^q as the log-prior density for β_j, these are also the equicontours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2τ) exp(−|β|/τ) and τ = 1/λ. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred.
Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |β_j|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for

FIGURE 3.12. Contours of constant value of Σ_j |β_j|^q for given values of q (from left to right: q = 4, 2, 1, 0.5, 0.1).

FIGURE 3.13. Contours of constant value of Σ_j |β_j|^q for q = 1.2 (left plot), and the elastic-net penalty Σ_j (α β_j^2 + (1−α)|β_j|) for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.

setting coefficients exactly to zero. Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

  λ Σ_{j=1}^p ( α β_j^2 + (1 − α) |β_j| ),   (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the L_q penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the L_q penalties. We discuss the elastic-net further in Section 18.4.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables. Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused.
The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued

until all the variables are in the model, and ends at the full least-squares fit. Algorithm 3.2 provides the details. The termination condition in step 5 requires some explanation. If p > N − 1, the LAR algorithm reaches a zero residual solution after N − 1 steps (the −1 is because we have centered the data).

Algorithm 3.2 Least Angle Regression.
1. Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ, and β_1, β_2, ..., β_p = 0.
2. Find the predictor x_j most correlated with r.
3. Move β_j from 0 towards its least-squares coefficient ⟨x_j, r⟩, until some other competitor x_k has as much correlation with the current residual as does x_j.
4. Move β_j and β_k in the direction defined by their joint least squares coefficient of the current residual on (x_j, x_k), until some other competitor x_l has as much correlation with the current residual.
5. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.

Suppose A_k is the active set of variables at the beginning of the kth step, and let β_{A_k} be the coefficient vector for these variables at this step; there will be k − 1 nonzero values, and the one just entered will be zero. If r_k = y − X_{A_k} β_{A_k} is the current residual, then the direction for this step is

  δ_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k.   (3.55)

The coefficient profile then evolves as β_{A_k}(α) = β_{A_k} + α · δ_k. Exercise 3.23 verifies that the directions chosen in this fashion do what is claimed: keep the correlations tied and decreasing. If the fit vector at the beginning of this step is f̂_k, then it evolves as f̂_k(α) = f̂_k + α · u_k, where u_k = X_{A_k} δ_k is the new fit direction. The name "least angle" arises from a geometrical interpretation of this process; u_k makes the smallest (and equal) angle with each of the predictors in A_k (Exercise 3.24). Figure 3.14 shows the absolute correlations decreasing and joining ranks with each step of the LAR algorithm, using simulated data.

By construction the coefficients in LAR change in a piecewise linear fashion.
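Algorithm 3.2 can be put into code quite directly. The sketch below (the function name `lar_path` and the numerical tolerances are ours, not from the book) assumes standardized predictors and a centered response; at each step it moves the active coefficients along the joint least-squares direction (3.55) and solves a small linear equation for the exact step length at which the next variable's correlation catches up.

```python
import numpy as np

def lar_path(X, y):
    """Least angle regression (a minimal sketch of Algorithm 3.2).

    Assumes columns of X are standardized (mean zero, unit norm) and y is
    centered. Returns the coefficient vector at the end of each step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    active = [int(np.argmax(np.abs(X.T @ y)))]    # step 2: most correlated
    for _ in range(p):
        r = y - X @ beta
        c = X.T @ r                               # current inner products
        XA = X[:, active]
        delta = np.linalg.solve(XA.T @ XA, XA.T @ r)   # direction, eq. (3.55)
        a = X.T @ (XA @ delta)                    # inner products with fit direction
        gamma = abs(c[active[0]])                 # common absolute correlation
        alpha, newcomer = 1.0, None               # alpha = 1 reaches the joint LS fit
        for k in set(range(p)) - set(active):
            # step lengths at which |c_k(alpha)| ties the shrinking active correlation
            for num, den in ((gamma - c[k], gamma - a[k]),
                             (gamma + c[k], gamma + a[k])):
                if abs(den) > 1e-12 and 1e-12 < num / den < alpha:
                    alpha, newcomer = num / den, k
        for i, j in enumerate(active):
            beta[j] += alpha * delta[i]
        if newcomer is not None:
            active.append(newcomer)
        path.append(beta.copy())
    return path
```

After the final step no competitor remains, the full step α = 1 is taken, and the procedure arrives at the ordinary least squares solution, as the text claims.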
Figure 3.15 [left panel] shows the LAR coefficient profile evolving as a function of its L1 arc length.¹ Note that we do not need to take small

¹ The L1 arc-length of a differentiable curve β(s) for s ∈ [0, S] is given by TV(β, S) = ∫_0^S ||β̇(s)||_1 ds, where β̇(s) = ∂β(s)/∂s. For the piecewise-linear LAR coefficient profile, this amounts to summing the L1 norms of the changes in coefficients from step to step.

FIGURE 3.14. Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.

FIGURE 3.15. Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length. The right panel shows the lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.

steps and recheck the correlations in step 3; using knowledge of the covariance of the predictors and the piecewise linearity of the algorithm, we can work out the exact step length at the beginning of each step (Exercise 3.25).

The right panel of Figure 3.15 shows the lasso coefficient profiles on the same data. They are almost identical to those in the left panel, and differ for the first time when the blue coefficient passes back through zero. For the prostate data, the LAR coefficient profile turns out to be identical to the lasso profile in Figure 3.10, which never crosses zero. These observations lead to a simple modification of the LAR algorithm that gives the entire lasso path, which is also piecewise-linear.

Algorithm 3.2a Least Angle Regression: Lasso Modification.
4a. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.

The LAR(lasso) algorithm is extremely efficient, requiring the same order of computation as that of a single least squares fit using the p predictors. Least angle regression always takes p steps to get to the full least squares estimates. The lasso path can have more than p steps, although the two are often quite similar. Algorithm 3.2 with the lasso modification 3.2a is an efficient way of computing the solution to any lasso problem, especially when p ≫ N. Osborne et al. (2000a) also discovered a piecewise-linear path for computing the lasso, which they called a homotopy algorithm.

We now give a heuristic argument for why these procedures are so similar. Although the LAR algorithm is stated in terms of correlations, if the input features are standardized, it is equivalent and easier to work with inner-products. Suppose A is the active set of variables at some stage in the algorithm, tied in their absolute inner-product with the current residuals y − Xβ. We can express this as

  x_j^T (y − Xβ) = γ · s_j, ∀ j ∈ A,   (3.56)

where s_j ∈ {−1, 1} indicates the sign of the inner-product, and γ is the common value. Also |x_k^T (y − Xβ)| ≤ γ ∀ k ∉ A.
Now consider the lasso criterion (3.52), which we write in vector form

  R(β) = (1/2) ||y − Xβ||_2^2 + λ ||β||_1.   (3.57)

Let B be the active set of variables in the solution for a given value of λ. For these variables R(β) is differentiable, and the stationarity conditions give

  x_j^T (y − Xβ) = λ · sign(β_j), ∀ j ∈ B.   (3.58)

Comparing (3.58) with (3.56), we see that they are identical only if the sign of β_j matches the sign of the inner product. That is why the LAR

algorithm and lasso start to differ when an active coefficient passes through zero; condition (3.58) is violated for that variable, and it is kicked out of the active set B. Exercise 3.27 shows that these equations imply a piecewise-linear coefficient profile as λ decreases. The stationarity conditions for the non-active variables require that

  |x_k^T (y − Xβ)| ≤ λ, ∀ k ∉ B,   (3.59)

which again agrees with the LAR algorithm.

Figure 3.16 compares LAR and lasso to forward stepwise and stagewise regression. The setup is the same as in Figure 3.6 on page 59, except N = 100 here rather than 300, so the problem is more difficult. We see that the more aggressive forward stepwise starts to overfit quite early (well before the 10 true variables can enter the model), and ultimately performs worse than the slower forward stagewise regression. The behavior of LAR and lasso is similar to that of forward stagewise regression. Incremental forward stagewise is similar to LAR and lasso, and is described in Section 3.8.1.

Degrees-of-Freedom Formula for LAR and Lasso

Suppose that we fit a linear model via the least angle regression procedure, stopping at some number of steps k < p, or equivalently using a lasso bound t that produces a constrained version of the full least squares fit. How many parameters, or "degrees of freedom," have we used?

Consider first a linear regression using a subset of k features. If this subset is prespecified in advance without reference to the training data, then the degrees of freedom used in the fitted model is defined to be k. Indeed, in classical statistics, the number of linearly independent parameters is what is meant by "degrees of freedom." Alternatively, suppose that we carry out a best subset selection to determine the "optimal" set of k predictors. Then the resulting model has k parameters, but in some sense we have used up more than k degrees of freedom.

We need a more general definition for the effective degrees of freedom of an adaptively fitted model. We define the degrees of freedom of the fitted vector ŷ = (ŷ_1, ŷ_2, ..., ŷ_N) as

  df(ŷ) = (1/σ²) Σ_{i=1}^N Cov(ŷ_i, y_i).   (3.60)

Here Cov(ŷ_i, y_i) refers to the sampling covariance between the predicted value ŷ_i and its corresponding outcome value y_i. This makes intuitive sense: the harder that we fit to the data, the larger this covariance and hence df(ŷ). Expression (3.60) is a useful notion of degrees of freedom, one that can be applied to any model prediction ŷ. This includes models that are

adaptively fitted to the training data. This definition is motivated and discussed further in Sections 7.4–7.6.

Now for a linear regression with k fixed predictors, it is easy to show that df(ŷ) = k. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: df(ŷ) = tr(H_λ). In both these cases, (3.60) is simple to evaluate because the fit ŷ = H_λ y is linear in y. If we think about definition (3.60) in the context of a best subset selection of size k, it seems clear that df(ŷ) will be larger than k, and this can be verified by estimating Cov(ŷ_i, y_i)/σ² directly by simulation. However there is no closed form method for estimating df(ŷ) for best subset selection.

For LAR and lasso, something magical happens. These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the kth step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly k. Now for the lasso, the (modified) LAR procedure

often takes more than p steps, since predictors can drop out. Hence the definition is a little different; for the lasso, at any stage df(ŷ) approximately equals the number of predictors in the model. While this approximation works reasonably well anywhere in the lasso path, for each k it works best at the last model in the sequence that contains k predictors. A detailed study of the degrees of freedom for the lasso may be found in Zou et al. (2007).

3.5 Methods Using Derived Input Directions

In many situations we have a large number of inputs, often very correlated. The methods in this section produce a small number of linear combinations Z_m, m = 1, ..., M of the original inputs X_j, and the Z_m are then used in place of the X_j as inputs in the regression. The methods differ in how the linear combinations are constructed.

3.5.1 Principal Components Regression

In this approach the linear combinations Z_m used are the principal components as defined in Section 3.4.1 above. Principal component regression forms the derived input columns z_m = X v_m, and then regresses y on z_1, z_2, ..., z_M for some M ≤ p. Since the z_m are orthogonal, this regression is just a sum of univariate regressions:

  ŷ^pcr(M) = ȳ·1 + Σ_{m=1}^M θ̂_m z_m,   (3.61)

where θ̂_m = ⟨z_m, y⟩/⟨z_m, z_m⟩. Since the z_m are each linear combinations of the original x_j, we can express the solution (3.61) in terms of coefficients of the x_j (Exercise 3.13):

  β̂^pcr(M) = Σ_{m=1}^M θ̂_m v_m.   (3.62)

As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them. Note that if M = p, we would just get back the usual least squares estimates, since the columns of Z = UD span the column space of X. For M < p we get a reduced regression. We see that principal components regression is very similar to ridge regression: both operate via the principal components of the input matrix.
Ridge regression shrinks the coefficients of the principal components (Figure 3.17), shrinking more depending on the size of the corresponding eigenvalue; principal components regression discards the p − M smallest eigenvalue components. Figure 3.17 illustrates this.
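The contrast between smooth ridge shrinkage and hard PCR truncation can be made concrete with a short sketch (the function name is ours; X and y are assumed centered). Both estimators act on the same per-component coordinates; only the multiplying factors differ.

```python
import numpy as np

def pcr_and_ridge_coefs(X, y, lam, M):
    """Ridge vs. principal components regression coefficients (a sketch).

    Ridge scales the j-th component coordinate by d_j^2/(d_j^2 + lam), as
    in (3.47); PCR keeps the first M components and discards the rest.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    theta = (U.T @ y) / d                    # univariate coefficient on each z_m
    ridge_f = d**2 / (d**2 + lam)            # smooth shrinkage factors
    pcr_f = (np.arange(d.size) < M).astype(float)  # hard truncation pattern
    beta_ridge = Vt.T @ (ridge_f * theta)
    beta_pcr = Vt.T @ (pcr_f * theta)
    return beta_ridge, beta_pcr
```

With lam = 0 and M = p both reduce to the ordinary least squares coefficients, recovering the remark in the text that M = p gives back the usual estimates.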

FIGURE 3.17. Ridge regression shrinks the regression coefficients of the principal components, using shrinkage factors d_j^2/(d_j^2 + λ) as in (3.47). Principal components regression truncates them. Shown are the shrinkage and truncation patterns corresponding to Figure 3.7, as a function of the principal component index.

In Figure 3.7 we see that cross-validation suggests seven terms; the resulting model has the lowest test error in Table 3.3.

3.5.2 Partial Least Squares

This technique also constructs a set of linear combinations of the inputs for regression, but unlike principal components regression it uses y (in addition to X) for this construction. Like principal components regression, partial least squares (PLS) is not scale invariant, so we assume that each x_j is standardized to have mean 0 and variance 1. PLS begins by computing φ̂_1j = ⟨x_j, y⟩ for each j. From this we construct the derived input z_1 = Σ_j φ̂_1j x_j, which is the first partial least squares direction. Hence in the construction of each z_m, the inputs are weighted by the strength of their univariate effect on y. The outcome y is regressed on z_1 giving coefficient θ̂_1, and then we orthogonalize x_1, ..., x_p with respect to z_1. We continue this process, until M ≤ p directions have been obtained. In this manner, partial least squares produces a sequence of derived, orthogonal inputs or directions z_1, z_2, ..., z_M. As with principal-components regression, if we were to construct all M = p directions, we would get a solution equivalent to the usual least squares estimates; using M < p directions produces a reduced regression. The procedure is described fully in Algorithm 3.3.

Since the x_j are standardized, the first directions φ̂_1j are the univariate regression coefficients (up to an irrelevant constant); this is not the case for subsequent directions.
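The iterative construction just described (weight each input by its univariate effect on y, regress y on the derived direction, then orthogonalize) can be sketched directly in code; the function name `pls` is ours, and standardized columns of X are assumed.

```python
import numpy as np

def pls(X, y, M):
    """Partial least squares fits (a sketch of the procedure described above).

    Assumes columns of X are standardized (mean 0, variance 1).
    Returns the fitted vectors after 0, 1, ..., M directions.
    """
    N, p = X.shape
    Xm = X.copy()
    yhat = np.full(N, y.mean())              # start from the intercept-only fit
    fits = [yhat.copy()]
    for m in range(M):
        phi = Xm.T @ y                       # inner products <x_j^(m-1), y>
        z = Xm @ phi                         # m-th partial least squares direction
        theta = (z @ y) / (z @ z)            # univariate coefficient on z
        yhat = yhat + theta * z              # update the fit
        Xm = Xm - np.outer(z, (z @ Xm) / (z @ z))  # orthogonalize inputs w.r.t. z
        fits.append(yhat.copy())
    return fits
```

Because the orthogonalization removes the z-component from every remaining input, the derived directions are mutually orthogonal; with M = p (full rank X) the final fit coincides with the ordinary least squares fit, as the text notes.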

Algorithm 3.3 Partial Least Squares.
1. Standardize each x_j to have mean zero and variance one. Set ŷ^(0) = ȳ·1, and x_j^(0) = x_j, j = 1, ..., p.
2. For m = 1, 2, ..., p
   (a) z_m = Σ_{j=1}^p φ̂_mj x_j^(m−1), where φ̂_mj = ⟨x_j^(m−1), y⟩.
   (b) θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩.
   (c) ŷ^(m) = ŷ^(m−1) + θ̂_m z_m.
   (d) Orthogonalize each x_j^(m−1) with respect to z_m: x_j^(m) = x_j^(m−1) − [⟨z_m, x_j^(m−1)⟩/⟨z_m, z_m⟩] z_m, j = 1, 2, ..., p.
3. Output the sequence of fitted vectors {ŷ^(m)}_1^p. Since the {z_l}_1^m are linear in the original x_j, so is ŷ^(m) = X β̂^pls(m). These linear coefficients can be recovered from the sequence of PLS transformations.

In the prostate cancer example, cross-validation chose M = 2 PLS directions in Figure 3.7. This produced the model given in the rightmost column of Table 3.3.

What optimization problem is partial least squares solving? Since it uses the response y to construct its directions, its solution path is a nonlinear function of y. It can be shown (Exercise 3.15) that partial least squares seeks directions that have high variance and have high correlation with the response, in contrast to principal components regression which keys only on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993). In particular, the mth principal component direction v_m solves:

  max_α Var(Xα)   (3.63)
  subject to ||α|| = 1, α^T S v_l = 0, l = 1, ..., m − 1,

where S is the sample covariance matrix of the x_j. The conditions α^T S v_l = 0 ensure that z_m = Xα is uncorrelated with all the previous linear combinations z_l = X v_l. The mth PLS direction φ̂_m solves:

  max_α Corr²(y, Xα) Var(Xα)   (3.64)
  subject to ||α|| = 1, α^T S φ̂_l = 0, l = 1, ..., m − 1.

Further analysis reveals that the variance aspect tends to dominate, and so partial least squares behaves much like ridge regression and principal components regression. We discuss this further in the next section.

If the input matrix X is orthogonal, then partial least squares finds the least squares estimates after m = 1 step. Subsequent steps have no effect

since the φ̂_mj are zero for m > 1 (Exercise 3.14). It can also be shown that the sequence of PLS coefficients for m = 1, 2, ..., p represents the conjugate gradient sequence for computing the least squares solutions (Exercise 3.18).

3.6 Discussion: A Comparison of the Selection and Shrinkage Methods

There are some simple settings where we can understand better the relationship between the different methods described above. Consider an example with two correlated inputs X_1 and X_2, with correlation ρ. We assume that the true regression coefficients are β_1 = 4 and β_2 = 2. Figure 3.18 shows the coefficient profiles for the different methods, as their tuning parameters are varied. The top panel has ρ = 0.5, the bottom panel ρ = −0.5. The tuning parameters for ridge and lasso vary over a continuous range, while best subset, PLS and PCR take just two discrete steps to the least squares solution.

In the top panel, starting at the origin, ridge regression shrinks the coefficients together until it finally converges to least squares. PLS and PCR show similar behavior to ridge, although they are discrete and more extreme. Best subset overshoots the solution and then backtracks. The behavior of the lasso is intermediate to the other methods. When the correlation is negative (lower panel), again PLS and PCR roughly track the ridge path, while all of the methods are more similar to one another.

It is interesting to compare the shrinkage behavior of these different methods. Recall that ridge regression shrinks all directions, but shrinks low-variance directions more. Principal components regression leaves M high-variance directions alone, and discards the rest. Interestingly, it can be shown that partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression. A full study is given in Frank and Friedman (1993).
These authors conclude that for minimizing prediction error, ridge regression is generally preferable to variable subset selection, principal components regression and partial least squares. However the improvement over the latter two methods was only slight.

To summarize, PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps. Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.

FIGURE 3.18. Coefficient profiles from different methods for a simple problem: two inputs with correlation ±0.5, and the true regression coefficients β = (4, 2). The top panel shows the ρ = 0.5 case, the bottom panel ρ = −0.5; in each, the paths for ridge, lasso, PLS, PCR and best subset are traced toward the least squares solution.

3.7 Multiple Outcome Shrinkage and Selection

As noted in Section 3.2.4, the least squares estimates in a multiple-output linear model are simply the individual least squares estimates for each of the outputs. To apply selection and shrinkage methods in the multiple output case, one could apply a univariate technique individually to each outcome or simultaneously to all outcomes. With ridge regression, for example, we could apply formula (3.44) to each of the K columns of the outcome matrix Y, using possibly different parameters λ, or apply it to all columns using the same value of λ. The former strategy would allow different amounts of regularization to be applied to different outcomes but require estimation of k separate regularization parameters λ_1, ..., λ_k, while the latter would permit all k outputs to be used in estimating the sole regularization parameter λ.

Other more sophisticated shrinkage and selection strategies that exploit correlations in the different responses can be helpful in the multiple output case. Suppose for example that among the outputs we have

  Y_k = f(X) + ε_k   (3.65)
  Y_l = f(X) + ε_l;   (3.66)

i.e., (3.65) and (3.66) share the same structural part f(X) in their models. It is clear in this case that we should pool our observations on Y_k and Y_l to estimate the common f.

Combining responses is at the heart of canonical correlation analysis (CCA), a data reduction technique developed for the multiple output case. Similar to PCA, CCA finds a sequence of uncorrelated linear combinations X v_m, m = 1, ..., M of the x_j, and a corresponding sequence of uncorrelated linear combinations Y u_m of the responses y_k, such that the correlations

  Corr²(Y u_m, X v_m)   (3.67)

are successively maximized. Note that at most M = min(K, p) directions can be found. The leading canonical response variates are those linear combinations (derived responses) best predicted by the x_j; in contrast, the trailing canonical variates can be poorly predicted by the x_j, and are candidates for being dropped.
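One concrete way to obtain the canonical correlations and variates is to whiten both blocks and take an SVD of the whitened cross-covariance; this is a sketch under that approach (the function name and the eigendecomposition-based matrix square root are our choices), assuming Y and X are column-centered and their sample covariance matrices are nonsingular.

```python
import numpy as np

def canonical_correlations(Y, X):
    """Canonical correlations via a whitened cross-covariance SVD (a sketch).

    Y (N x K) and X (N x p) are assumed column-centered with full-rank
    covariances. Returns the correlations c_1 >= c_2 >= ... and the
    canonical vectors u (for Y) and v (for X) as matrix columns.
    """
    N = Y.shape[0]
    def inv_sqrt(A):
        # symmetric inverse square root via the eigendecomposition
        w, V = np.linalg.eigh(A)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Wy = inv_sqrt(Y.T @ Y / N)
    Wx = inv_sqrt(X.T @ X / N)
    U, c, Vt = np.linalg.svd(Wy @ (Y.T @ X / N) @ Wx)
    return c, Wy @ U, Wx @ Vt.T
```

The singular values c_m lie in [0, 1] and equal Corr(Y u_m, X v_m); there are at most min(K, p) of them, matching the count of directions noted in the text.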
The CCA solution is computed using a generalized SVD of the sample cross-covariance matrix Y^T X/N (assuming Y and X are centered; Exercise 3.20). Reduced-rank regression (Izenman, 1975; van der Merwe and Zidek, 1980) formalizes this approach in terms of a regression model that explicitly pools information. Given an error covariance Cov(ε) = Σ, we solve the following

restricted multivariate regression problem:

  B̂^rr(m) = argmin_{rank(B)=m} Σ_{i=1}^N (y_i − B^T x_i)^T Σ^{-1} (y_i − B^T x_i).   (3.68)

With Σ replaced by the estimate Y^T Y/N, one can show (Exercise 3.21) that the solution is given by a CCA of Y and X:

  B̂^rr(m) = B̂ U_m U_m^−,   (3.69)

where U_m is the K×m sub-matrix of U consisting of the first m columns, and U is the K×M matrix of left canonical vectors u_1, u_2, ..., u_M. U_m^− is its generalized inverse. Writing the solution as

  B̂^rr(m) = (X^T X)^{-1} X^T (Y U_m) U_m^−,   (3.70)

we see that reduced-rank regression performs a linear regression on the pooled response matrix Y U_m, and then maps the coefficients (and hence the fits as well) back to the original response space. The reduced-rank fits are given by

  Ŷ^rr(m) = X(X^T X)^{-1} X^T Y U_m U_m^−
          = H Y P_m,   (3.71)

where H is the usual linear regression projection operator, and P_m is the rank-m CCA response projection operator. Although a better estimate of Σ would be (Y − XB̂)^T (Y − XB̂)/(N − pK), one can show that the solution remains the same (Exercise 3.22).

Reduced-rank regression borrows strength among responses by truncating the CCA. Breiman and Friedman (1997) explored with some success shrinkage of the canonical variates between X and Y, a smooth version of reduced-rank regression. Their proposal has the form (compare (3.69))

  B̂^c+w = B̂ U Λ U^{-1},   (3.72)

where Λ is a diagonal shrinkage matrix (the "c+w" stands for "Curds and Whey," the name they gave to their procedure). Based on optimal prediction in the population setting, they show that Λ has diagonal entries

  λ_m = c_m^2 / ( c_m^2 + (p/N)(1 − c_m^2) ),  m = 1, ..., M,   (3.73)

where c_m is the mth canonical correlation coefficient. Note that as the ratio of the number of input variables to sample size p/N gets small, the shrinkage factors approach 1. Breiman and Friedman (1997) proposed modified versions of Λ based on training data and cross-validation, but the general form is the same. Here the fitted response has the form

  Ŷ^c+w = H Y S^c+w,   (3.74)

where S^{c+w} = U Λ U^{−1} is the response shrinkage operator. Breiman and Friedman (1997) also suggested shrinking in both the Y space and X space. This leads to hybrid shrinkage models of the form

    Ŷ^{ridge,c+w} = A_λ Y S^{c+w},    (3.75)

where A_λ = X(X^T X + λI)^{−1} X^T is the ridge regression shrinkage operator, as in (3.46) on page 66. Their paper and the discussions thereof contain many more details.

3.8 More on the Lasso and Related Path Algorithms

Since the publication of the LAR algorithm (Efron et al., 2004) there has been a lot of activity in developing algorithms for fitting regularization paths for a variety of different problems. In addition, L1 regularization has taken on a life of its own, leading to the development of the field of compressed sensing in the signal-processing literature (Donoho, 2006a; Candes, 2006). In this section we discuss some related proposals and other path algorithms, starting off with a precursor to the LAR algorithm.

3.8.1 Incremental Forward Stagewise Regression

Here we present another LAR-like algorithm, this time focused on forward stagewise regression. Interestingly, efforts to understand a flexible nonlinear regression procedure (boosting) led to a new algorithm for linear models (LAR).

Algorithm 3.4 Incremental Forward Stagewise Regression, FS_ε.

1. Start with the residual r equal to y and β_1, β_2, ..., β_p = 0. All the predictors are standardized to have mean zero and unit norm.
2. Find the predictor x_j most correlated with r.
3. Update β_j ← β_j + δ_j, where δ_j = ε · sign[⟨x_j, r⟩] and ε > 0 is a small step size, and set r ← r − δ_j x_j.
4. Repeat steps 2 and 3 many times, until the residuals are uncorrelated with all the predictors.

In reading the first edition of this book and the forward stagewise Algorithm 16.1 of Chapter 16 (in the first edition, this was Algorithm 10.4 in Chapter 10), our colleague Brad Efron realized that with

FIGURE 3.19. Coefficient profiles for the prostate data. The left panel shows incremental forward stagewise regression with step size ε = 0.01. The right panel shows the infinitesimal version FS_0 obtained letting ε → 0. This profile was fit by the modification 3.2b to the LAR Algorithm 3.2. In this example the FS_0 profiles are monotone, and hence identical to those of lasso and LAR. [Left panel x-axis: Iteration; right panel x-axis: L1 Arc-length of Coefficients; variables shown: lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp.]

linear models, one could explicitly construct the piecewise-linear lasso paths of Figure 3.10. This led him to propose the LAR procedure of Section 3.4.4, as well as the incremental version of forward-stagewise regression presented here.

Consider the linear-regression version of the forward-stagewise boosting algorithm 16.1 proposed in Section 16.2.1 (page 608). It generates a coefficient profile by repeatedly updating (by a small amount ε) the coefficient of the variable most correlated with the current residuals. Algorithm 3.4 gives the details. Figure 3.19 (left panel) shows the progress of the algorithm on the prostate data with step size ε = 0.01. If δ_j = ⟨x_j, r⟩ (the least-squares coefficient of the residual on the jth predictor), then this is exactly the usual forward stagewise procedure (FS) outlined in Section 3.3.3.

Here we are mainly interested in small values of ε. Letting ε → 0 gives the right panel of Figure 3.19, which in this case is identical to the lasso path in Figure 3.10. We call this limiting procedure infinitesimal forward stagewise regression or FS_0. This procedure plays an important role in non-linear, adaptive methods like boosting (Chapters 10 and 16) and is the version of incremental forward stagewise regression that is most amenable to theoretical analysis. Bühlmann and Hothorn (2007) refer to the same procedure as "L2boost", because of its connections to boosting.
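Algorithm 3.4 translates almost line for line into code. The following numpy sketch is illustrative rather than the book's own implementation (the function name and the stopping tolerance are ours); it assumes the columns of X are standardized to mean zero and unit norm, and y is centered:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=20000):
    """Incremental forward stagewise regression (Algorithm 3.4), FS_eps.

    Assumes standardized columns (mean zero, unit norm) and centered y."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                       # step 1: residual r = y, beta = 0
    for _ in range(n_steps):
        corr = X.T @ r                 # inner products <x_j, r>
        j = np.argmax(np.abs(corr))    # step 2: predictor most correlated with r
        if np.abs(corr[j]) < 1e-8:     # residuals ~ uncorrelated with all predictors
            break
        delta = eps * np.sign(corr[j])
        beta[j] += delta               # step 3: small step on beta_j ...
        r -= delta * X[:, j]           # ... and matching residual update
    return beta
```

Recording beta after each step would trace out a coefficient profile like the one in the left panel of Figure 3.19.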

Efron originally thought that the LAR Algorithm 3.2 was an implementation of FS_0, allowing each tied predictor a chance to update its coefficients in a balanced way, while remaining tied in correlation. However, he then realized that the LAR least-squares fit amongst the tied predictors can result in coefficients moving in the opposite direction to their correlation, which cannot happen in Algorithm 3.4. The following modification of the LAR algorithm implements FS_0:

Algorithm 3.2b Least Angle Regression: FS_0 Modification.

4. Find the new direction by solving the constrained least squares problem

    min_b ||r − X_A b||²  subject to  b_j s_j ≥ 0, j ∈ A,

where s_j is the sign of ⟨x_j, r⟩.

The modification amounts to a non-negative least squares fit, keeping the signs of the coefficients the same as those of the correlations. One can show that this achieves the optimal "balancing" of infinitesimal update "turns" for the variables tied for maximal correlation (Hastie et al., 2007). Like lasso, the entire FS_0 path can be computed very efficiently via the LAR algorithm.

As a consequence of these results, if the LAR profiles are monotone non-increasing or non-decreasing, as they are in Figure 3.19, then all three methods (LAR, lasso, and FS_0) give identical profiles. If the profiles are not monotone but do not cross the zero axis, then LAR and lasso are identical.

Since FS_0 is different from the lasso, it is natural to ask if it optimizes a criterion. The answer is more complex than for the lasso; the FS_0 coefficient profile is the solution to a differential equation. While the lasso makes optimal progress in terms of reducing the residual sum-of-squares per unit increase in the L1-norm of the coefficient vector β, FS_0 is optimal per unit increase in L1 arc-length traveled along the coefficient path. Hence its coefficient path is discouraged from changing directions too often.

FS_0 is more constrained than the lasso, and in fact can be viewed as a monotone version of the lasso; see Figure 16.3 on page 614 for a dramatic example.
FS_0 may be useful in p ≫ N situations, where its coefficient profiles are much smoother and hence have less variance than those of the lasso. More details on FS_0 are given in Section 16.2.3 and Hastie et al. (2007). Figure 3.16 includes FS_0, where its performance is very similar to that of the lasso.

3.8.2 Piecewise-Linear Path Algorithms

The least angle regression procedure exploits the piecewise linear nature of the lasso solution paths. It has led to similar path algorithms for other regularized problems. Suppose we solve

    β̂(λ) = argmin_β [R(β) + λJ(β)],    (3.76)

with

    R(β) = Σ_{i=1}^N L(y_i, β_0 + Σ_{j=1}^p x_ij β_j),    (3.77)

where both the loss function L and the penalty function J are convex. Then the following are sufficient conditions for the solution path β̂(λ) to be piecewise linear (Rosset and Zhu, 2007):

1. R is quadratic or piecewise-quadratic as a function of β, and
2. J is piecewise linear in β.

This also implies (in principle) that the solution path can be efficiently computed. Examples include squared- and absolute-error loss, "Huberized" losses, and the L1 and L∞ penalties on β. Another example is the "hinge loss" function used in the support vector machine. There the loss is piecewise linear, and the penalty is quadratic. Interestingly, this leads to a piecewise-linear path algorithm in the dual space; more details are given in Section 12.3.5.

3.8.3 The Dantzig Selector

Candes and Tao (2007) proposed the following criterion:

    min_β ||β||_1  subject to  ||X^T(y − Xβ)||_∞ ≤ s.    (3.78)

They call the solution the Dantzig selector (DS). It can be written equivalently as

    min_β ||X^T(y − Xβ)||_∞  subject to  ||β||_1 ≤ t.    (3.79)

Here ||·||_∞ denotes the L∞ norm, the maximum absolute value of the components of the vector. In this form it resembles the lasso, replacing squared error loss by the maximum absolute value of its gradient. Note that as t gets large, both procedures yield the least squares solution if N > p. If p ≥ N, they both yield the least squares solution with minimum L1 norm. However, for smaller values of t, the DS procedure produces a different path of solutions than the lasso.

Candes and Tao (2007) show that the solution to DS is a linear programming problem; hence the name Dantzig selector, in honor of the late

George Dantzig, the inventor of the simplex method for linear programming. They also prove a number of interesting mathematical properties for the method, related to its ability to recover an underlying sparse coefficient vector. These same properties also hold for the lasso, as shown later by Bickel et al. (2008).

Unfortunately the operating properties of the DS method are somewhat unsatisfactory. The method seems similar in spirit to the lasso, especially when we look at the lasso's stationary conditions (3.58). Like the LAR algorithm, the lasso maintains the same inner product (and correlation) with the current residual for all variables in the active set, and moves their coefficients to optimally decrease the residual sum of squares. In the process, this common correlation is decreased monotonically (Exercise 3.23), and at all times this correlation is larger than that for non-active variables. The Dantzig selector instead tries to minimize the maximum inner product of the current residual with all the predictors. Hence it can achieve a smaller maximum than the lasso, but in the process a curious phenomenon can occur. If the size of the active set is m, there will be m variables tied with maximum correlation. However, these need not coincide with the active set! Hence it can include a variable in the model that has smaller correlation with the current residual than some of the excluded variables (Efron et al., 2007). This seems unreasonable and may be responsible for its sometimes inferior prediction accuracy. Efron et al. (2007) also show that DS can yield extremely erratic coefficient paths as the regularization parameter s is varied.

3.8.4 The Grouped Lasso

In some problems, the predictors belong to pre-defined groups; for example genes that belong to the same biological pathway, or collections of indicator (dummy) variables for representing the levels of a categorical predictor. In this situation it may be desirable to shrink and select the members of a group together. The grouped lasso is one way to achieve this.
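Before writing down the criterion, it may help to see how a group penalty operates computationally. The numpy sketch below shows the block soft-thresholding update used by common grouped-lasso solvers, under simplifying assumptions that are ours, not the book's: a 1/2 factor on the squared-error term, and orthonormal columns within the group (X_l^T X_l = I), in which case the update is exact for that group's subproblem. Here p_l is the number of coefficients in group l:

```python
import numpy as np

def group_soft_threshold(X_l, r, lam, p_l):
    """One grouped-lasso update for a single group l.

    Solves min_b 0.5*||r - X_l b||^2 + lam*sqrt(p_l)*||b||_2 exactly,
    ASSUMING the columns of X_l are orthonormal (X_l^T X_l = I)."""
    b = X_l.T @ r                        # least squares coefficients of the group on r
    norm = np.linalg.norm(b)             # Euclidean (not squared) norm
    if norm == 0.0:
        return b
    shrink = max(0.0, 1.0 - lam * np.sqrt(p_l) / norm)
    return shrink * b                    # the whole group is zeroed when shrink == 0
```

Because the shrinkage factor hits zero all at once, the entire group of coefficients drops out of the model together, which is exactly the group-level sparsity described above.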
Suppose that the p predictors are divided into L groups, with p_l the number in group l. For ease of notation, we use a matrix X_l to represent the predictors corresponding to the lth group, with corresponding coefficient vector β_l. The grouped lasso minimizes the convex criterion

    min_{β∈R^p} ( ||y − β_0 1 − Σ_{l=1}^L X_l β_l||_2² + λ Σ_{l=1}^L √p_l ||β_l||_2 ),    (3.80)

where the √p_l terms account for the varying group sizes, and ||·||_2 is the Euclidean norm (not squared). Since the Euclidean norm of a vector β_l is zero only if all of its components are zero, this procedure encourages sparsity at both the group and individual levels. That is, for some values of λ, an entire group of predictors may drop out of the model. This procedure

was proposed by Bakin (1999) and Lin and Zhang (2006), and studied and generalized by Yuan and Lin (2007). Generalizations include more general L2 norms ||η||_K = (η^T Kη)^{1/2}, as well as allowing overlapping groups of predictors (Zhao et al., 2008). There are also connections to methods for fitting sparse additive models (Lin and Zhang, 2006; Ravikumar et al., 2008).

3.8.5 Further Properties of the Lasso

A number of authors have studied the ability of the lasso and related procedures to recover the correct model, as N and p grow. Examples of this work include Knight and Fu (2000), Greenshtein and Ritov (2004), Tropp (2004), Donoho (2006b), Meinshausen (2007), Meinshausen and Bühlmann (2006), Tropp (2006), Zhao and Yu (2006), Wainwright (2006), and Bunea et al. (2007). For example Donoho (2006b) focuses on the p > N case and considers the lasso solution as the bound t gets large. In the limit this gives the solution with minimum L1 norm among all models with zero training error. He shows that under certain assumptions on the model matrix X, if the true model is sparse, this solution identifies the correct predictors with high probability.

Many of the results in this area assume a condition on the model matrix of the form

    ||(X_S^T X_S)^{−1} X_S^T X_{S^c}||_∞ ≤ (1 − ε)  for some ε ∈ (0, 1].    (3.81)

Here S indexes the subset of features with non-zero coefficients in the true underlying model, and X_S are the columns of X corresponding to those features. Similarly S^c are the features with true coefficients equal to zero, and X_{S^c} the corresponding columns. This says that the least squares coefficients for the columns of X_{S^c} on X_S are not too large, that is, the "good" variables S are not too highly correlated with the nuisance variables S^c.

Regarding the coefficients themselves, the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent [5]. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features.
This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors.

[5] Statistical consistency means that as the sample size grows, the estimates converge to the true values.

Since

the variables in the second step have less competition from noise variables, cross-validation will tend to pick a smaller value for λ, and hence their coefficients will be shrunken less than those in the initial estimate.

Alternatively, one can modify the lasso penalty function so that larger coefficients are shrunken less severely; the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2005) replaces λ|β| by J_a(β, λ), where

    dJ_a(β, λ)/dβ = λ sign(β) [ I(|β| ≤ λ) + ((aλ − |β|)_+ / ((a − 1)λ)) I(|β| > λ) ]    (3.82)

for some a ≥ 2. The second term in square braces reduces the amount of shrinkage in the lasso for larger values of β, with ultimately no shrinkage as a → ∞. Figure 3.20 shows the SCAD penalty, along with the lasso and |β|^{1−ν}.

FIGURE 3.20. The lasso and two alternative non-convex penalties designed to penalize large coefficients less. For SCAD we use λ = 1 and a = 4, and ν = 1/2 in the last panel. [Panels show |β|, SCAD, and |β|^{1−ν}.]

However the |β|^{1−ν} criterion is non-convex, which is a drawback since it makes the computation much more difficult. The adaptive lasso (Zou, 2006) uses a weighted penalty of the form Σ_{j=1}^p w_j |β_j|, where w_j = 1/|β̂_j|^ν, β̂_j is the ordinary least squares estimate and ν > 0. This is a practical approximation to the |β|^q penalties (q = 1 − ν here) discussed in Section 3.4.3. The adaptive lasso yields consistent estimates of the parameters while retaining the attractive convexity property of the lasso.

3.8.6 Pathwise Coordinate Optimization

An alternate approach to the LARS algorithm for computing the lasso solution is simple coordinate descent. This idea was proposed by Fu (1998) and Daubechies et al. (2004), and later studied and generalized by Friedman et al. (2007), Wu and Lange (2008) and others. The idea is to fix the penalty parameter λ in the Lagrangian form (3.52) and optimize successively over each parameter, holding the others fixed at their current values.

Suppose the predictors are all standardized to have mean zero and unit norm. Denote by β̃_k(λ) the current estimate for β_k at penalty parameter

λ. We can rearrange (3.52) to isolate β_j:

    R(β̃(λ), β_j) = (1/2) Σ_{i=1}^N ( y_i − Σ_{k≠j} x_ik β̃_k(λ) − x_ij β_j )² + λ Σ_{k≠j} |β̃_k(λ)| + λ|β_j|,    (3.83)

where we have suppressed the intercept and introduced a factor 1/2 for convenience. This can be viewed as a univariate lasso problem with response variable the partial residual y_i − ỹ_i^{(j)} = y_i − Σ_{k≠j} x_ik β̃_k(λ). This has an explicit solution, resulting in the update

    β̃_j(λ) ← S( Σ_{i=1}^N x_ij (y_i − ỹ_i^{(j)}), λ ).    (3.84)

Here S(t, λ) = sign(t)(|t| − λ)_+ is the soft-thresholding operator in Table 3.4 on page 71. The first argument to S(·) is the simple least-squares coefficient of the partial residual on the standardized variable x_ij. Repeated iteration of (3.84), cycling through each variable in turn until convergence, yields the lasso estimate β̂(λ).

We can also use this simple algorithm to efficiently compute the lasso solutions at a grid of values of λ. We start with the smallest value λ_max for which β̂(λ_max) = 0, decrease it a little and cycle through the variables until convergence. Then λ is decreased again and the process is repeated, using the previous solution as a "warm start" for the new value of λ. This can be faster than the LARS algorithm, especially in large problems. A key to its speed is the fact that the quantities in (3.84) can be updated quickly as j varies, and often the update is to leave β̃_j = 0. On the other hand, it delivers solutions over a grid of λ values, rather than the entire solution path. The same kind of algorithm can be applied to the elastic net, the grouped lasso and many other models in which the penalty is a sum of functions of the individual parameters (Friedman et al., 2010). It can also be applied, with some substantial modifications, to the fused lasso (Section 18.4.2); details are in Friedman et al. (2007).

3.9 Computational Considerations

Least squares fitting is usually done via the Cholesky decomposition of the matrix X^T X or a QR decomposition of X.
With N observations and p features, the Cholesky decomposition requires p³ + Np²/2 operations, while the QR decomposition requires Np² operations. Depending on the relative size of N and p, the Cholesky can sometimes be faster; on the other hand, it can be less numerically stable (Lawson and Hansen, 1974). Computation of the lasso via the LAR algorithm has the same order of computation as a least squares fit.
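Returning to Section 3.8.6, the cyclic update (3.84) takes only a few lines of code. This numpy sketch is illustrative rather than the book's implementation; it assumes standardized (mean-zero, unit-norm) columns and a centered response, and uses the 1/2 factor of (3.83) on the squared-error term:

```python
import numpy as np

def soft_threshold(t, lam):
    """S(t, lam) = sign(t)(|t| - lam)_+, the soft-thresholding operator."""
    return np.sign(t) * max(abs(t) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for the lasso at a fixed penalty lam.

    Columns of X are assumed standardized to mean zero and unit norm, so
    each update is a single soft-threshold of the least-squares coefficient
    of the partial residual on x_j."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                        # full residual, maintained incrementally
    for _ in range(n_sweeps):
        for j in range(p):
            # least-squares coefficient of the partial residual on x_j
            b = beta[j] + X[:, j] @ r
            new = soft_threshold(b, lam)
            r += (beta[j] - new) * X[:, j]  # O(N) residual update as j varies
            beta[j] = new
    return beta
```

Maintaining the residual r incrementally is what makes the per-coordinate cost cheap, which is the speed advantage noted above; warm starts over a decreasing grid of lam values would simply reuse the returned beta as the next starting point.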

Bibliographic Notes

Linear regression is discussed in many statistics books, for example, Seber (1984), Weisberg (1980) and Mardia et al. (1979). Ridge regression was introduced by Hoerl and Kennard (1970), while the lasso was proposed by Tibshirani (1996). Around the same time, lasso-type penalties were proposed in the basis pursuit method for signal processing (Chen et al., 1998). The least angle regression procedure was proposed in Efron et al. (2004); related to this is the earlier homotopy procedure of Osborne et al. (2000a) and Osborne et al. (2000b). Their algorithm also exploits the piecewise linearity used in the LAR/lasso algorithm, but lacks its transparency. The criterion for the forward stagewise procedure is discussed in Hastie et al. (2007). Park and Hastie (2007) develop a path algorithm similar to least angle regression for generalized regression models. Partial least squares was introduced by Wold (1975). Comparisons of shrinkage methods may be found in Copas (1983) and Frank and Friedman (1993).

Exercises

Ex. 3.1 Show that the F statistic (3.13) for dropping a single coefficient from a model is equal to the square of the corresponding z-score (3.12).

Ex. 3.2 Given data on two variables X and Y, consider fitting a cubic polynomial regression model f(X) = Σ_{j=0}^3 β_j X^j. In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:

1. At each point x_0, form a 95% confidence interval for the linear function a^T β = Σ_{j=0}^3 β_j x_0^j.
2. Form a 95% confidence set for β as in (3.15), which in turn generates confidence intervals for f(x_0).

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.

Ex. 3.3 Gauss–Markov theorem:

(a) Prove the Gauss–Markov theorem: the least squares estimate of a parameter a^T β has variance no bigger than that of any other linear unbiased estimate of a^T β (Section 3.2.2).

(b) The matrix inequality B ⪯ A holds if A − B is positive semidefinite.
Show that if V̂ is the variance-covariance matrix of the least squares estimate of β and Ṽ is the variance-covariance matrix of any other linear unbiased estimate, then V̂ ⪯ Ṽ.

Ex. 3.4 Show how the vector of least squares coefficients can be obtained from a single pass of the Gram–Schmidt procedure (Algorithm 3.1). Represent your solution in terms of the QR decomposition of X.

Ex. 3.5 Consider the ridge regression problem (3.41). Show that this problem is equivalent to the problem

    β̂^c = argmin_{β^c} { Σ_{i=1}^N [ y_i − β_0^c − Σ_{j=1}^p (x_ij − x̄_j) β_j^c ]² + λ Σ_{j=1}^p (β_j^c)² }.    (3.85)

Give the correspondence between β^c and the original β in (3.41). Characterize the solution to this modified criterion. Show that a similar result holds for the lasso.

Ex. 3.6 Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior β ∼ N(0, τI), and Gaussian sampling model y ∼ N(Xβ, σ²I). Find the relationship between the regularization parameter λ in the ridge formula, and the variances τ and σ².

Ex. 3.7 Assume y_i ∼ N(β_0 + x_i^T β, σ²), i = 1, 2, ..., N, and the parameters β_j are each distributed as N(0, τ²), independently of one another. Assuming σ² and τ² are known, show that the (minus) log-posterior density of β is proportional to Σ_{i=1}^N (y_i − β_0 − Σ_j x_ij β_j)² + λ Σ_{j=1}^p β_j², where λ = σ²/τ².

Ex. 3.8 Consider the QR decomposition of the uncentered N × (p + 1) matrix X (whose first column is all ones), and the SVD of the N × p centered matrix X̃. Show that Q₂ and U span the same subspace, where Q₂ is the sub-matrix of Q with the first column removed. Under what circumstances will they be the same, up to sign flips?

Ex. 3.9 Forward stepwise regression. Suppose we have the QR decomposition for the N × q matrix X₁ in a multiple regression problem with response y, and we have an additional p − q predictors in the matrix X₂. Denote the current residual by r. We wish to establish which one of these additional variables will reduce the residual-sum-of-squares the most when included with those in X₁. Describe an efficient procedure for doing this.

Ex. 3.10 Backward stepwise regression. Suppose we have the multiple regression fit of y on X, along with the standard errors and Z-scores as in Table 3.2.
We wish to establish which variable, when dropped, will increase the residual sum-of-squares the least. How would you do this?

Ex. 3.11 Show that the solution to the multivariate linear regression problem (3.40) is given by (3.39). What happens if the covariance matrices Σ_i are different for each observation?

Ex. 3.12 Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set. We augment the centered matrix X with p additional rows √λ I, and augment y with p zeros. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero. This is related to the idea of hints due to Abu-Mostafa (1995), where model constraints are implemented by adding artificial data examples that satisfy them.

Ex. 3.13 Derive the expression (3.62), and show that β̂^pcr(p) = β̂^ls.

Ex. 3.14 Show that in the orthogonal case, PLS stops after m = 1 steps, because subsequent φ̂_mj in step 2 in Algorithm 3.3 are zero.

Ex. 3.15 Verify expression (3.64), and hence show that the partial least squares directions are a compromise between the ordinary regression coefficient and the principal component directions.

Ex. 3.16 Derive the entries in Table 3.4, the explicit forms for estimators in the orthogonal case.

Ex. 3.17 Repeat the analysis of Table 3.3 on the spam data discussed in Chapter 1.

Ex. 3.18 Read about conjugate gradient algorithms (Murray et al., 1981, for example), and establish a connection between these algorithms and partial least squares.

Ex. 3.19 Show that ||β̂^ridge|| increases as its tuning parameter λ → 0. Does the same property hold for the lasso and partial least squares estimates? For the latter, consider the tuning parameter to be the successive steps in the algorithm.

Ex. 3.20 Consider the canonical-correlation problem (3.67). Show that the leading pair of canonical variates u_1 and v_1 solve the problem

    max_{u^T(Y^T Y)u = 1, v^T(X^T X)v = 1} u^T (Y^T X) v,    (3.86)

a generalized SVD problem. Show that the solution is given by u_1 = (Y^T Y)^{−1/2} u_1*, and v_1 = (X^T X)^{−1/2} v_1*, where u_1* and v_1* are the leading left and right singular vectors in

    (Y^T Y)^{−1/2} (Y^T X) (X^T X)^{−1/2} = U* D V*^T.    (3.87)

Show that the entire sequence u_m, v_m, m = 1, ..., min(K, p) is also given by (3.87).

Ex. 3.21 Show that the solution to the reduced-rank regression problem (3.68), with Σ estimated by Y^T Y/N, is given by (3.69). Hint: Transform

Y to Y* = YΣ^{−1/2}, and solve in terms of the canonical vectors u_m*. Show that U_m = Σ^{−1/2} U_m*, and that a generalized inverse is U_m^− = U_m*^T Σ^{1/2}.

Ex. 3.22 Show that the solution in Exercise 3.21 does not change if Σ is estimated by the more natural quantity (Y − XB̂)^T (Y − XB̂)/(N − pK).

Ex. 3.23 Consider a regression problem with all variables and response having mean zero and standard deviation one. Suppose also that each variable has identical absolute correlation with the response:

    (1/N) |⟨x_j, y⟩| = λ, j = 1, ..., p.

Let β̂ be the least-squares coefficient of y on X, and let u(α) = αXβ̂ for α ∈ [0, 1] be the vector that moves a fraction α toward the least squares fit u = Xβ̂. Let RSS be the residual sum-of-squares from the full least squares fit.

(a) Show that

    (1/N) |⟨x_j, y − u(α)⟩| = (1 − α)λ, j = 1, ..., p,

and hence the correlations of each x_j with the residuals remain equal in magnitude as we progress toward u.

(b) Show that these correlations are all equal to

    λ(α) = ((1 − α) / √((1 − α)² + (α(2 − α)/N) · RSS)) · λ,

and hence they decrease monotonically to zero.

(c) Use these results to show that the LAR algorithm in Section 3.4.4 keeps the correlations tied and monotonically decreasing, as claimed in (3.55).

Ex. 3.24 LAR directions. Using the notation around equation (3.55) on page 74, show that the LAR direction makes an equal angle with each of the predictors in A_k.

Ex. 3.25 LAR look-ahead (Efron et al., 2004, Sec. 2). Starting at the beginning of the kth step of the LAR algorithm, derive expressions to identify the next variable to enter the active set at step k + 1, and the value of α at which this occurs (using the notation around equation (3.55) on page 74).

Ex. 3.26 Forward stepwise regression enters the variable at each step that most reduces the residual sum-of-squares. LAR adjusts variables that have the most (absolute) correlation with the current residuals. Show that these two entry criteria are not necessarily the same. [Hint: let x_{j·A} be the jth

variable, linearly adjusted for all the variables currently in the model. Show that the first criterion amounts to identifying the j for which Cor(x_{j·A}, r) is largest in magnitude.]

Ex. 3.27 Lasso and LAR: Consider the lasso problem in Lagrange multiplier form: with L(β) = (1/2) Σ_i (y_i − Σ_j x_ij β_j)², we minimize

    L(β) + λ Σ_j |β_j|    (3.88)

for fixed λ > 0.

(a) Setting β_j = β_j^+ − β_j^− with β_j^+, β_j^− ≥ 0, expression (3.88) becomes L(β) + λ Σ_j (β_j^+ + β_j^−). Show that the Lagrange dual function is

    L(β) + λ Σ_j (β_j^+ + β_j^−) − Σ_j λ_j^+ β_j^+ − Σ_j λ_j^− β_j^−    (3.89)

and the Karush–Kuhn–Tucker optimality conditions are

    ∇L(β)_j + λ − λ_j^+ = 0
    −∇L(β)_j + λ − λ_j^− = 0
    λ_j^+ β_j^+ = 0
    λ_j^− β_j^− = 0,

along with the non-negativity constraints on the parameters and all the Lagrange multipliers.

(b) Show that |∇L(β)_j| ≤ λ for all j, and that the KKT conditions imply one of the following three scenarios:

    λ = 0 ⟹ ∇L(β)_j = 0 for all j
    β_j^+ > 0, λ > 0 ⟹ λ_j^+ = 0, ∇L(β)_j = −λ < 0, β_j^− = 0
    β_j^− > 0, λ > 0 ⟹ λ_j^− = 0, ∇L(β)_j = λ > 0, β_j^+ = 0.

Hence show that for any "active" predictor having β_j ≠ 0, we must have ∇L(β)_j = −λ if β_j > 0, and ∇L(β)_j = λ if β_j < 0. Assuming the predictors are standardized, relate λ to the correlation between the jth predictor and the current residuals.

(c) Suppose that the set of active predictors is unchanged for λ_0 ≥ λ ≥ λ_1. Show that there is a vector γ_0 such that

    β̂(λ) = β̂(λ_0) − (λ − λ_0)γ_0.    (3.90)

Thus the lasso solution path is linear as λ ranges from λ_0 to λ_1 (Efron et al., 2004; Rosset and Zhu, 2007).

Ex. 3.28 Suppose for a given t in (3.51), the fitted lasso coefficient for variable X_j is β̂_j = a. Suppose we augment our set of variables with an identical copy X_j* = X_j. Characterize the effect of this exact collinearity by describing the set of solutions for β̂_j and β̂_j*, using the same value of t.

Ex. 3.29 Suppose we run a ridge regression with parameter λ on a single variable X, and get coefficient a. We now include an exact copy X* = X, and refit our ridge regression. Show that both coefficients are identical, and derive their value. Show in general that if m copies of a variable X_j are included in a ridge regression, their coefficients are all the same.

Ex. 3.30 Consider the elastic-net optimization problem:

    min_β ||y − Xβ||² + λ[α||β||₂² + (1 − α)||β||₁].    (3.91)

Show how one can turn this into a lasso problem, using an augmented version of X and y.
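The duplicated-variable behavior in Ex. 3.29 is easy to check numerically with the closed-form ridge solution; the data below are made up purely for the check:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients: (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
x = rng.standard_normal(30)
y = 2.0 * x + 0.1 * rng.standard_normal(30)
lam = 1.0

a = ridge(x[:, None], y, lam)[0]                 # single variable: coefficient a
b = ridge(np.column_stack([x, x]), y, lam)       # exact copy included

# The two coefficients are identical (by symmetry), and each is shrunken
# below the single-variable coefficient a:
print(np.allclose(b[0], b[1]), b[0] < a)         # prints: True True
```

Working out the symmetric normal equations shows each duplicated coefficient equals x^T y / (2 x^T x + λ), versus a = x^T y / (x^T x + λ) for the single variable, consistent with the exercise.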


4 Linear Methods for Classification

4.1 Introduction

In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor G(x) takes values in a discrete set G, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we will mean by linear methods for classification.

There are several different ways in which linear decision boundaries can be found. In Chapter 2 we fit linear regression models to the class indicator variables, and classify to the largest fit. Suppose there are K classes, for convenience labeled 1, 2, ..., K, and the fitted linear model for the kth indicator response variable is f̂_k(x) = β̂_k0 + β̂_k^T x. The decision boundary between class k and l is that set of points for which f̂_k(x) = f̂_l(x), that is, the set {x : (β̂_k0 − β̂_l0) + (β̂_k − β̂_l)^T x = 0}, an affine set or hyperplane [1]. Since the same is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries. This regression approach is a member of a class of methods that model discriminant functions δ_k(x) for each class, and then classify x to the class with the largest value for its discriminant function. Methods

[1] Strictly speaking, a hyperplane passes through the origin, while an affine set need not. We sometimes ignore the distinction and refer in general to hyperplanes.

that model the posterior probabilities Pr(G = k | X = x) are also in this class. Clearly, if either the δ_k(x) or Pr(G = k | X = x) are linear in x, then the decision boundaries will be linear.

Actually, all we require is that some monotone transformation of δ_k or Pr(G = k | X = x) be linear for the decision boundaries to be linear. For example, if there are two classes, a popular model for the posterior probabilities is

    Pr(G = 1 | X = x) = exp(β_0 + β^T x) / (1 + exp(β_0 + β^T x)),
    Pr(G = 2 | X = x) = 1 / (1 + exp(β_0 + β^T x)).    (4.1)

Here the monotone transformation is the logit transformation: log[p/(1 − p)], and in fact we see that

    log( Pr(G = 1 | X = x) / Pr(G = 2 | X = x) ) = β_0 + β^T x.    (4.2)

The decision boundary is the set of points for which the log-odds are zero, and this is a hyperplane defined by {x : β_0 + β^T x = 0}. We discuss two very popular but different methods that result in linear log-odds or logits: linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the linear function is fit to the training data.

A more direct approach is to explicitly model the boundaries between the classes as linear. For a two-class problem in a p-dimensional input space, this amounts to modeling the decision boundary as a hyperplane; in other words, a normal vector and a cut-point. We will look at two methods that explicitly look for "separating hyperplanes". The first is the well-known perceptron model of Rosenblatt (1958), with an algorithm that finds a separating hyperplane in the training data, if one exists. The second method, due to Vapnik (1996), finds an optimally separating hyperplane if one exists, else finds a hyperplane that minimizes some measure of overlap in the training data. We treat the separable case here, and defer treatment of the nonseparable case to Chapter 12.

While this entire chapter is devoted to linear decision boundaries, there is considerable scope for generalization.
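The two-class model (4.1) and its linear log-odds (4.2) can be verified numerically in a few lines; the coefficient values below are arbitrary, chosen only for illustration:

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Two-class posterior probabilities of model (4.1)."""
    eta = beta0 + beta @ x                  # the linear function beta_0 + beta^T x
    p1 = np.exp(eta) / (1.0 + np.exp(eta))  # Pr(G = 1 | X = x)
    p2 = 1.0 / (1.0 + np.exp(eta))          # Pr(G = 2 | X = x)
    return p1, p2

beta0, beta = -1.0, np.array([2.0, -0.5])   # illustrative coefficients
x = np.array([1.0, 0.5])
p1, p2 = posteriors(x, beta0, beta)

# The posteriors sum to one, and the log-odds recover the linear function (4.2):
print(np.isclose(p1 + p2, 1.0), np.isclose(np.log(p1 / p2), beta0 + beta @ x))
# prints: True True
```

Points with β_0 + β^T x = 0 get p1 = p2 = 1/2, which is exactly the hyperplane decision boundary described above.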
As one generalization, for example, we can expand our variable set X_1, ..., X_p by including their squares and cross-products X_1², X_2², ..., X_1X_2, ..., thereby adding p(p + 1)/2 additional variables. Linear functions in the augmented space map down to quadratic functions in the original space; hence linear decision boundaries become quadratic decision boundaries. Figure 4.1 illustrates the idea. The data are the same: the left plot uses linear decision boundaries in the two-dimensional space shown, while the right plot uses linear decision boundaries in the augmented five-dimensional space described above. This approach can be used with any basis transformation h(X) where h : R^p → R^q with q > p, and will be explored in later chapters.

FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space X_1, X_2, X_1X_2, X_1², X_2². Linear inequalities in this space are quadratic inequalities in the original space.

4.2 Linear Regression of an Indicator Matrix

Here each of the response categories is coded via an indicator variable. Thus if G has K classes, there will be K such indicators Y_k, k = 1, ..., K, with Y_k = 1 if G = k else 0. These are collected together in a vector Y = (Y_1, ..., Y_K), and the N training instances of these form an N × K indicator response matrix Y. Y is a matrix of 0's and 1's, with each row having a single 1. We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by

    Ŷ = X(X^T X)^{−1} X^T Y.    (4.3)

Chapter 3 has more details on linear regression. Note that we have a coefficient vector for each response column y_k, and hence a (p + 1) × K coefficient matrix B̂ = (X^T X)^{−1} X^T Y. Here X is the model matrix with p + 1 columns corresponding to the p inputs, and a leading column of 1's for the intercept.

A new observation with input x is classified as follows:

- compute the fitted output f̂(x)^T = (1, x^T)B̂, a K vector;
- identify the largest component and classify accordingly:

    Ĝ(x) = argmax_{k∈G} f̂_k(x).    (4.4)

What is the rationale for this approach? One rather formal justification is to view the regression as an estimate of conditional expectation. For the random variable Yk, E(Yk|X = x) = Pr(G = k|X = x), so conditional expectation of each of the Yk seems a sensible goal. The real issue is: how good an approximation to conditional expectation is the rather rigid linear regression model? Alternatively, are the f̂k(x) reasonable estimates of the posterior probabilities Pr(G = k|X = x), and more importantly, does this matter?

It is quite straightforward to verify that Σ_{k in G} f̂k(x) = 1 for any x, as long as there is an intercept in the model (column of 1's in X). However, the f̂k(x) can be negative or greater than 1, and typically some are. This is a consequence of the rigid nature of linear regression, especially if we make predictions outside the hull of the training data. These violations in themselves do not guarantee that this approach will not work, and in fact on many problems it gives similar results to more standard linear methods for classification. If we allow linear regression onto basis expansions h(X) of the inputs, this approach can lead to consistent estimates of the probabilities. As the size of the training set N grows bigger, we adaptively include more basis elements so that linear regression onto these basis functions approaches conditional expectation. We discuss such approaches in Chapter 5.

A more simplistic viewpoint is to construct targets tk for each class, where tk is the kth column of the K x K identity matrix. Our prediction problem is to try and reproduce the appropriate target for an observation. With the same coding as before, the response vector yi (ith row of Y) for observation i has the value yi = tk if gi = k. We might then fit the linear model by least squares:

min_B Σ_{i=1}^N ||yi - [(1, xi^T)B]^T||^2.   (4.5)

The criterion is a sum-of-squared Euclidean distances of the fitted vectors from their targets.
A new observation is classified by computing its fitted vector f̂(x) and classifying to the closest target:

Ĝ(x) = argmin_k ||f̂(x) - tk||^2.   (4.6)

This is exactly the same as the previous approach:

- The sum-of-squared-norm criterion is exactly the criterion for multiple response linear regression, just viewed slightly differently. Since a squared norm is itself a sum of squares, the components decouple and can be rearranged as a separate linear model for each element. Note that this is only possible because there is nothing in the model that binds the different responses together.

FIGURE 4.2. The data come from three classes in IR^2 and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).

- The closest target classification rule (4.6) is easily seen to be exactly the same as the maximum fitted component criterion (4.4), but does require that the fitted values sum to 1.

There is a serious problem with the regression approach when the number of classes K >= 3, especially prevalent when K is large. Because of the rigid nature of the regression model, classes can be masked by others. Figure 4.2 illustrates an extreme situation when K = 3. The three classes are perfectly separated by linear decision boundaries, yet linear regression misses the middle class completely.

In Figure 4.3 we have projected the data onto the line joining the three centroids (there is no information in the orthogonal direction in this case), and we have included and coded the three response variables Y1, Y2 and Y3. The three regression lines (left panel) are included, and we see that the line corresponding to the middle class is horizontal and its fitted values are never dominant! Thus, observations from class 2 are classified either as class 1 or class 3. The right panel uses quadratic regression rather than linear regression. For this simple example a quadratic rather than linear fit (for the middle class at least) would solve the problem. However, it can be seen that if there were four rather than three classes lined up like this, a quadratic would not come down fast enough, and a cubic would be needed as well. A loose but general rule is that if K >= 3 classes are lined up, polynomial terms up to degree K - 1 might be needed to resolve them. Note also that these are polynomials along the derived direction

FIGURE 4.3. The effects of masking on linear regression in IR for a three-class problem. [Panel titles: Degree = 1 and Degree = 2, each annotated with its training error rate.] The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the blue class, y_blue is 1 for the blue observations, and 0 for the green and orange. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.

passing through the centroids, which can have arbitrary orientation. So in p-dimensional input space, one would need general polynomial terms and cross-products of total degree K - 1, O(p^{K-1}) terms in all, to resolve such worst-case scenarios.

The example is extreme, but for large K and small p such maskings naturally occur. As a more realistic illustration, Figure 4.4 is a projection of the training data for a vowel recognition problem onto an informative two-dimensional subspace. There are K = 11 classes in p = 10 dimensions. This is a difficult classification problem, and the best methods achieve around 40% errors on the test data. The main point here is summarized in Table 4.1; linear regression has an error rate of 67%, while a close relative, linear discriminant analysis, has an error rate of 56%. It seems that masking has hurt in this case. While all the other methods in this chapter are based on linear functions of x as well, they use them in such a way that avoids this masking problem.

4.3 Linear Discriminant Analysis

Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors Pr(G|X) for optimal classification. Suppose fk(x) is the class-conditional density of X in class G = k, and let πk be the prior probability of class k, with Σ_{k=1}^K πk = 1. A simple application of Bayes

FIGURE 4.4. A two-dimensional plot of the vowel training data. There are eleven classes with X in IR^10, and this is the best view in terms of a LDA model (Section 4.3.3). The heavy circles are the projected mean vectors for each class. The class overlap is considerable. [Axes: Coordinate 1 and Coordinate 2 for Training Data.]

TABLE 4.1. Training and test error rates using a variety of linear techniques on the vowel data. There are eleven classes in ten dimensions, of which three account for 90% of the variance (via a principal components analysis). We see that linear regression is hurt by masking, increasing the test and training error by over 10%.

Technique                          Training   Test
Linear regression                    0.48     0.67
Linear discriminant analysis         0.32     0.56
Quadratic discriminant analysis      0.01     0.53
Logistic regression                  0.22     0.51

theorem gives us

Pr(G = k|X = x) = fk(x)πk / Σ_{l=1}^K fl(x)πl.   (4.7)

We see that in terms of ability to classify, having the fk(x) is almost equivalent to having the quantity Pr(G = k|X = x). Many techniques are based on models for the class densities:

- linear and quadratic discriminant analysis use Gaussian densities;
- more flexible mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8);
- general nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2);
- Naive Bayes models are a variant of the previous case, and assume that each of the class densities are products of marginal densities; that is, they assume that the inputs are conditionally independent in each class (Section 6.6.3).

Suppose that we model each class density as multivariate Gaussian

fk(x) = (2π)^{-p/2} |Σk|^{-1/2} exp(-(1/2)(x - µk)^T Σk^{-1} (x - µk)).   (4.8)

Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix Σk = Σ for all k. In comparing two classes k and l, it is sufficient to look at the log-ratio, and we see that

log [Pr(G = k|X = x) / Pr(G = l|X = x)]
  = log [fk(x)/fl(x)] + log [πk/πl]
  = log [πk/πl] - (1/2)(µk + µl)^T Σ^{-1} (µk - µl) + x^T Σ^{-1} (µk - µl),   (4.9)

an equation linear in x. The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic part in the exponents. This linear log-odds function implies that the decision boundary between classes k and l (the set where Pr(G = k|X = x) = Pr(G = l|X = x)) is linear in x; in p dimensions a hyperplane. This is of course true for any pair of classes, so all the decision boundaries are linear. If we divide IR^p into regions that are classified as class 1, class 2, etc., these regions will be separated by hyperplanes. Figure 4.5 (left panel) shows an idealized example with three classes and p = 2. Here the data do arise from three Gaussian distributions with a common covariance matrix. We have included in

FIGURE 4.5. The left panel shows three Gaussian distributions, with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right we see a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.

the figure the contours corresponding to 95% highest probability density, as well as the class centroids. Notice that the decision boundaries are not the perpendicular bisectors of the line segments joining the centroids. This would be the case if the covariance Σ were spherical σ^2 I, and the class priors were equal. From (4.9) we see that the linear discriminant functions

δk(x) = x^T Σ^{-1} µk - (1/2) µk^T Σ^{-1} µk + log πk   (4.10)

are an equivalent description of the decision rule, with G(x) = argmax_k δk(x).

In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them using our training data:

- π̂k = Nk/N, where Nk is the number of class-k observations;
- µ̂k = Σ_{gi=k} xi / Nk;
- Σ̂ = Σ_{k=1}^K Σ_{gi=k} (xi - µ̂k)(xi - µ̂k)^T / (N - K).

Figure 4.5 (right panel) shows the estimated decision boundaries based on a sample of size 30 each from three Gaussian distributions. Figure 4.1 is another example, but here the classes are not Gaussian.

With two classes there is a simple correspondence between linear discriminant analysis and classification by linear least squares, as in (4.5). The LDA rule classifies to class 2 if

x^T Σ̂^{-1} (µ̂2 - µ̂1) > (1/2) µ̂2^T Σ̂^{-1} µ̂2 - (1/2) µ̂1^T Σ̂^{-1} µ̂1 + log(N1/N) - log(N2/N)   (4.11)
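The plug-in estimates above and the discriminant functions (4.10) translate directly into numpy. This is a minimal sketch with our own function names, not code from the book.

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in estimates for LDA: class priors, class means, and the
    pooled covariance with the usual N - K divisor (Section 4.3)."""
    N, p = X.shape
    pi = np.array([(g == k).mean() for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum(((X[g == k] - mu[k]).T @ (X[g == k] - mu[k]))
                for k in range(K)) / (N - K)
    return pi, mu, Sigma

def lda_predict(X, pi, mu, Sigma):
    """Classify to argmax_k of the linear discriminants (4.10):
    delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('kp,pq,kq->k', mu, Sinv, mu)   # mu_k^T Sinv mu_k, per class
    delta = X @ Sinv @ mu.T - 0.5 * quad + np.log(pi)
    return delta.argmax(axis=1)
```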

and class 1 otherwise. Suppose we code the targets in the two classes as +1 and -1, respectively. It is easy to show that the coefficient vector from least squares is proportional to the LDA direction given in (4.11) (Exercise 4.2). [In fact, this correspondence occurs for any (distinct) coding of the targets; see Exercise 4.2.] However unless N1 = N2 the intercepts are different and hence the resulting decision rules are different.

Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data. However the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen mentioned in the literature.

With more than two classes, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems associated with that approach (Hastie et al., 1994). A correspondence between regression and LDA can be established through the notion of optimal scoring, discussed in Section 12.5.

Getting back to the general discriminant problem (4.8), if the Σk are not assumed to be equal, then the convenient cancellations in (4.9) do not occur; in particular the pieces quadratic in x remain. We then get quadratic discriminant functions (QDA),

δk(x) = -(1/2) log|Σk| - (1/2)(x - µk)^T Σk^{-1} (x - µk) + log πk.   (4.12)

The decision boundary between each pair of classes k and l is described by a quadratic equation {x : δk(x) = δl(x)}.

Figure 4.6 shows an example (from Figure 4.1) where the three classes are Gaussian mixtures (Section 6.8) and the decision boundaries are approximated by quadratic equations in x. Here we illustrate two popular ways of fitting these quadratic boundaries.
The right plot uses QDA as described here, while the left plot uses LDA in the enlarged five-dimensional quadratic polynomial space. The differences are generally small; QDA is the preferred approach, with the LDA method a convenient substitute.

The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class. When p is large this can mean a dramatic increase in parameters. Since the decision boundaries are functions of the parameters of the densities, counting the number of parameters must be done with care. For LDA, it seems there are (K - 1) x (p + 1) parameters, since we only need the differences δk(x) - δK(x)

[For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive contouring method: we compute the decision rule on a fine lattice of points, and then use contouring algorithms to compute the boundaries.]
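The QDA estimates and the discriminant functions (4.12) can be sketched as follows. This is a minimal illustration with our own function names; note it uses the per-class sample covariances with the conventional Nk - 1 divisor, a simplifying assumption rather than the book's prescription.

```python
import numpy as np

def qda_fit(X, g, K):
    """Separate mean and covariance per class, as QDA requires."""
    pi = np.array([(g == k).mean() for k in range(K)])
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigmas = [np.cov(X[g == k].T) for k in range(K)]   # per-class covariances
    return pi, mu, Sigmas

def qda_predict(X, pi, mu, Sigmas):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k)
    + log pi_k, classified by argmax over k (eq. 4.12)."""
    K = len(pi)
    deltas = np.empty((X.shape[0], K))
    for k in range(K):
        Sinv = np.linalg.inv(Sigmas[k])
        _, logdet = np.linalg.slogdet(Sigmas[k])
        d = X - mu[k]
        deltas[:, k] = (-0.5 * logdet
                        - 0.5 * np.einsum('np,pq,nq->n', d, Sinv, d)
                        + np.log(pi[k]))
    return deltas.argmax(axis=1)
```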

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1^2, X2^2). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

between the discriminant functions, where K is some pre-chosen class (here we have chosen the last), and each difference requires p + 1 parameters. Likewise for QDA there will be (K - 1) x {p(p + 3)/2 + 1} parameters. Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks. For example, in the STATLOG project (Michie et al., 1994) LDA was among the top three classifiers for 7 of the 22 datasets, QDA among the top three for four datasets, and one of the pair were in the top three for 10 datasets. Both techniques are widely used, and entire books are devoted to LDA. It seems that whatever exotic tools are the rage of the day, we should always have available these two simple tools. The question arises why LDA and QDA have such a good track record. The reason is not likely to be that the data are approximately Gaussian, nor, in addition for LDA, that the covariances are approximately equal. More likely a reason is that the data can only support simple decision boundaries such as linear or quadratic, and the estimates provided via the Gaussian models are stable. This is a bias-variance tradeoff: we can put up with the bias of a linear decision boundary because it can be estimated with much lower variance than more exotic alternatives. This argument is less believable for QDA, since it can have many parameters itself, although perhaps fewer than the non-parametric alternatives. [Although we fit the covariance matrix Σ̂ to compute the LDA discriminant functions, a much reduced function of it is all that is required to estimate the O(p) parameters needed to compute the decision boundaries.]

FIGURE 4.7. Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of α in [0, 1]. The optimum for the test data occurs around α = 0.9, close to quadratic discriminant analysis.

4.3.1 Regularized Discriminant Analysis

Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. These methods are very similar in flavor to ridge regression. The regularized covariance matrices have the form

Σ̂k(α) = α Σ̂k + (1 - α) Σ̂,   (4.13)

where Σ̂ is the pooled covariance matrix as used in LDA. Here α in [0, 1] allows a continuum of models between LDA and QDA, and needs to be specified. In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.

Figure 4.7 shows the results of RDA applied to the vowel data. Both the training and test error improve with increasing α, although the test error increases sharply after α = 0.9. The large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals, different in the training and test set.

Similar modifications allow Σ̂ itself to be shrunk toward the scalar covariance,

Σ̂(γ) = γ Σ̂ + (1 - γ) σ̂^2 I   (4.14)

for γ in [0, 1]. Replacing Σ̂ in (4.13) by Σ̂(γ) leads to a more general family of covariances Σ̂(α, γ) indexed by a pair of parameters.

In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images.
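The covariance blends (4.13) and (4.14) are one-liners. Below is a minimal sketch; taking σ̂^2 to be the average diagonal element of Σ̂ is our assumption, as the text does not pin down the scalar estimate.

```python
import numpy as np

def rda_covariances(Sigmas, Sigma_pooled, alpha):
    """Eq. (4.13): Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma."""
    return [alpha * S + (1 - alpha) * Sigma_pooled for S in Sigmas]

def shrink_to_scalar(Sigma, gamma):
    """Eq. (4.14): shrink the pooled covariance toward a scalar covariance.
    sigma2 is taken here as the mean variance (an assumption)."""
    p = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / p
    return gamma * Sigma + (1 - gamma) * sigma2 * np.eye(p)
```

Setting α = 1 recovers QDA's separate covariances, α = 0 recovers the pooled LDA covariance, and γ = 0 gives the spherical covariance used in Chapter 18.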

In these situations the features are high-dimensional and correlated, and the LDA coefficients can be regularized to be smooth or sparse in the original domain of the signal. This leads to better generalization and allows for easier interpretation of the coefficients. In Chapter 18 we also deal with very high-dimensional problems, where for example the features are gene-expression measurements in microarray studies. There the methods focus on the case γ = 0 in (4.14), and other severely regularized versions of LDA.

4.3.2 Computations for LDA

As a lead-in to the next topic, we briefly digress on the computations required for LDA and especially QDA. Their computations are simplified by diagonalizing Σ̂ or Σ̂k. For the latter, suppose we compute the eigen-decomposition for each Σ̂k = Uk Dk Uk^T, where Uk is p x p orthonormal, and Dk a diagonal matrix of positive eigenvalues dkl. Then the ingredients for δk(x) (4.12) are

- (x - µ̂k)^T Σ̂k^{-1} (x - µ̂k) = [Uk^T (x - µ̂k)]^T Dk^{-1} [Uk^T (x - µ̂k)];
- log|Σ̂k| = Σ_l log dkl.

In light of the computational steps outlined above, the LDA classifier can be implemented by the following pair of steps:

- Sphere the data with respect to the common covariance estimate Σ̂: X* <- D^{-1/2} U^T X, where Σ̂ = U D U^T. The common covariance estimate of X* will now be the identity.
- Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities πk.

4.3.3 Reduced-Rank Linear Discriminant Analysis

So far we have discussed LDA as a restricted Gaussian classifier. Part of its popularity is due to an additional restriction that allows us to view informative low-dimensional projections of the data. The K centroids in p-dimensional input space lie in an affine subspace of dimension at most K - 1, and if p is much larger than K, this will be a considerable drop in dimension. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they will contribute equally to each class.
Thus we might just as well project the X* onto this centroid-spanning subspace H_{K-1}, and make distance comparisons there. Thus there is a fundamental dimension reduction in LDA, namely, that we need only consider the data in a subspace of dimension at most K - 1.
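The sphering step of Section 4.3.2 can be sketched as follows; this is a minimal numpy illustration, not code from the book.

```python
import numpy as np

def sphere(X, Sigma):
    """Whiten X with respect to a common covariance estimate:
    x* = D^{-1/2} U^T x, where Sigma = U D U^T (Section 4.3.2).
    After this transformation the covariance estimate is the identity."""
    d, U = np.linalg.eigh(Sigma)       # eigen-decomposition of Sigma
    return X @ U / np.sqrt(d)          # each row x_i -> D^{-1/2} U^T x_i
```

After sphering, nearest-centroid classification (adjusted by log πk) reproduces the LDA rule.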

If K = 3, for instance, this could allow us to view the data in a two-dimensional plot, color-coding the classes. In doing so we would not have relinquished any of the information needed for LDA classification.

What if K > 3? We might then ask for a L < K - 1 dimensional subspace H_L ⊆ H_{K-1} optimal for LDA in some sense. Fisher defined optimal to mean that the projected centroids were spread out as much as possible in terms of variance. This amounts to finding principal component subspaces of the centroids themselves (principal components are described briefly in Section 3.5.1, and in more detail in Section 14.5.1). Figure 4.4 shows such an optimal two-dimensional subspace for the vowel data. Here there are eleven classes, each a different vowel sound, in a ten-dimensional input space. The centroids require the full space in this case, since K - 1 = p, but we have shown an optimal two-dimensional subspace. The dimensions are ordered, so we can compute additional dimensions in sequence. Figure 4.8 shows four additional pairs of coordinates, also known as canonical or discriminant variables. In summary then, finding the sequences of optimal subspaces for LDA involves the following steps:

- compute the K x p matrix of class centroids M and the common covariance matrix W (for within-class covariance);
- compute M* = M W^{-1/2} using the eigen-decomposition of W;
- compute B*, the covariance matrix of M* (B for between-class covariance), and its eigen-decomposition B* = V* D_B V*^T. The columns v*_l of V* in sequence from first to last define the coordinates of the optimal subspaces.

Combining all these operations, the lth discriminant variable is given by Zl = vl^T X with vl = W^{-1/2} v*_l.

Fisher arrived at this decomposition via a different route, without referring to Gaussian distributions at all. He posed the problem:

Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance.

Again, the between-class variance is the variance of the class means of Z, and the within-class variance is the pooled variance about the means.
Figure 4.9 shows why this criterion makes sense. Although the direction joining the centroids separates the means as much as possible (i.e., maximizes the between-class variance), there is considerable overlap between the projected classes due to the nature of the covariances. By taking the covariance into account as well, a direction with minimum overlap can be found.

The between-class variance of Z is a^T B a and the within-class variance a^T W a, where W is defined earlier, and B is the covariance matrix of the class centroid matrix M. Note that B + W = T, where T is the total covariance matrix of X, ignoring class information.

FIGURE 4.8. Four projections onto pairs of canonical variates. Notice that as the rank of the canonical variates increases, the centroids become less spread out. In the lower right panel they appear to be superimposed, and the classes most confused.

FIGURE 4.9. Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).

Fisher's problem therefore amounts to maximizing the Rayleigh quotient,

max_a (a^T B a) / (a^T W a),   (4.15)

or equivalently

max_a a^T B a subject to a^T W a = 1.   (4.16)

This is a generalized eigenvalue problem, with a given by the largest eigenvalue of W^{-1} B. It is not hard to show (Exercise 4.1) that the optimal a1 is identical to v1 defined above. Similarly one can find the next direction a2, orthogonal in W to a1, such that (a2^T B a2)/(a2^T W a2) is maximized; the solution is a2 = v2, and so on. The al are referred to as discriminant coordinates, not to be confused with discriminant functions. They are also referred to as canonical variates, since an alternative derivation of these results is through a canonical correlation analysis of the indicator response matrix Y on the predictor matrix X. This line is pursued in Section 12.5.

To summarize the developments so far:

- Gaussian classification with common covariances leads to linear decision boundaries. Classification can be achieved by sphering the data with respect to W, and classifying to the closest centroid (modulo log πk) in the sphered space.
- Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
- This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.
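The three-step recipe above can be sketched in numpy. This is a minimal illustration under simplifying assumptions: the between-class covariance B* is computed as the plain (unweighted) covariance of the sphered centroids, which matches the text's recipe exactly only when class sizes are equal.

```python
import numpy as np

def discriminant_coords(X, g, K, L):
    """First L discriminant (canonical) coordinates Z_l = v_l^T X,
    following the steps of Section 4.3.3."""
    N, p = X.shape
    mu = np.array([X[g == k].mean(axis=0) for k in range(K)])   # centroid matrix M
    W = sum(((X[g == k] - mu[k]).T @ (X[g == k] - mu[k]))
            for k in range(K)) / (N - K)                        # within-class cov
    dW, UW = np.linalg.eigh(W)
    Wm12 = UW @ np.diag(1.0 / np.sqrt(dW)) @ UW.T               # W^{-1/2}
    Mstar = mu @ Wm12                                           # M* = M W^{-1/2}
    Bstar = np.cov(Mstar.T)                                     # between-class cov B*
    dB, V = np.linalg.eigh(Bstar)
    V = V[:, np.argsort(dB)[::-1]]                              # descending eigenvalues
    A = Wm12 @ V[:, :L]                                         # v_l = W^{-1/2} v*_l
    return X @ A                                                # N x L coordinates
```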

FIGURE 4.10. Training and test error rates for the vowel data, as a function of the dimension of the discriminant subspace. In this case the best error rate is for dimension 2. Figure 4.11 shows the decision boundaries in this space.

The reduced subspaces have been motivated as a data reduction (for viewing) tool. Can they also be used for classification, and what is the rationale? Clearly they can, as in our original derivation; we simply limit the distance-to-centroid calculations to the chosen subspace. One can show that this is a Gaussian classification rule with the additional restriction that the centroids of the Gaussians lie in a L-dimensional subspace of IR^p. Fitting such a model by maximum likelihood, and then constructing the posterior probabilities using Bayes' theorem amounts to the classification rule described above (Exercise 4.8).

Gaussian classification dictates the log πk correction factor in the distance calculation. The reason for this correction can be seen in Figure 4.9. The misclassification rate is based on the area of overlap between the two densities. If the πk are equal (implicit in that figure), then the optimal cut-point is midway between the projected means. If the πk are not equal, moving the cut-point toward the smaller class will improve the error rate. As mentioned earlier for two classes, one can derive the linear rule using LDA (or any other method), and then choose the cut-point to minimize misclassification error over the training data.

As an example of the benefit of the reduced-rank restriction, we return to the vowel data. There are 11 classes and 10 variables, and hence 10 possible dimensions for the classifier. We can compute the training and test error in each of these hierarchical subspaces; Figure 4.10 shows the results. Figure 4.11 shows the decision boundaries for the classifier based on the two-dimensional LDA solution.
There is a close connection between Fisher's reduced-rank discriminant analysis and regression of an indicator response matrix. It turns out that

FIGURE 4.11. Decision boundaries for the vowel training data, in the two-dimensional subspace spanned by the first two canonical variates. Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.

LDA amounts to the regression followed by an eigen-decomposition of Ŷ^T Y. In the case of two classes, there is a single discriminant variable that is identical up to a scalar multiplication to either of the columns of Ŷ. These connections are developed in Chapter 12. A related fact is that if one transforms the original predictors X to Ŷ, then LDA using Ŷ is identical to LDA in the original space (Exercise 4.3).

4.4 Logistic Regression

The logistic regression model arises from the desire to model the posterior probabilities of the K classes via linear functions in x, while at the same time ensuring that they sum to one and remain in [0, 1]. The model has the form

log [Pr(G = 1|X = x) / Pr(G = K|X = x)] = β10 + β1^T x
log [Pr(G = 2|X = x) / Pr(G = K|X = x)] = β20 + β2^T x
  ...
log [Pr(G = K-1|X = x) / Pr(G = K|X = x)] = β(K-1)0 + β(K-1)^T x.   (4.17)

The model is specified in terms of K - 1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one). Although the model uses the last class as the denominator in the odds-ratios, the choice of denominator is arbitrary in that the estimates are equivariant under this choice. A simple calculation shows that

Pr(G = k|X = x) = exp(βk0 + βk^T x) / (1 + Σ_{l=1}^{K-1} exp(βl0 + βl^T x)),  k = 1, ..., K-1,
Pr(G = K|X = x) = 1 / (1 + Σ_{l=1}^{K-1} exp(βl0 + βl^T x)),   (4.18)

and they clearly sum to one. To emphasize the dependence on the entire parameter set θ = {β10, β1^T, ..., β(K-1)0, β(K-1)^T}, we denote the probabilities Pr(G = k|X = x) = pk(x; θ).

When K = 2, this model is especially simple, since there is only a single linear function. It is widely used in biostatistical applications where binary responses (two classes) occur quite frequently. For example, patients survive or die, have heart disease or not, or a condition is present or absent.

4.4.1 Fitting Logistic Regression Models

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of G given X. Since Pr(G|X) completely specifies the conditional distribution, the multinomial distribution is appropriate. The log-likelihood for N observations is

l(θ) = Σ_{i=1}^N log p_{gi}(xi; θ),   (4.19)

where pk(xi; θ) = Pr(G = k|X = xi; θ).

We discuss in detail the two-class case, since the algorithms simplify considerably. It is convenient to code the two-class gi via a 0/1 response yi, where yi = 1 when gi = 1, and yi = 0 when gi = 2. Let p1(x; θ) = p(x; θ), and p2(x; θ) = 1 - p(x; θ). The log-likelihood can be written

l(β) = Σ_{i=1}^N { yi log p(xi; β) + (1 - yi) log(1 - p(xi; β)) }
     = Σ_{i=1}^N { yi β^T xi - log(1 + e^{β^T xi}) }.   (4.20)

Here β = {β10, β1}, and we assume that the vector of inputs xi includes the constant term 1 to accommodate the intercept.

To maximize the log-likelihood, we set its derivatives to zero. These score equations are

∂l(β)/∂β = Σ_{i=1}^N xi (yi - p(xi; β)) = 0,   (4.21)

which are p + 1 equations nonlinear in β. Notice that since the first component of xi is 1, the first score equation specifies that Σ_{i=1}^N yi = Σ_{i=1}^N p(xi; β); the expected number of class ones matches the observed number (and hence also class twos).

To solve the score equations (4.21), we use the Newton-Raphson algorithm, which requires the second-derivative or Hessian matrix

∂²l(β)/∂β∂β^T = -Σ_{i=1}^N xi xi^T p(xi; β)(1 - p(xi; β)).   (4.22)

Starting with β_old, a single Newton update is

β_new = β_old - (∂²l(β)/∂β∂β^T)^{-1} ∂l(β)/∂β,   (4.23)

where the derivatives are evaluated at β_old.

It is convenient to write the score and Hessian in matrix notation. Let y denote the vector of yi values, X the N x (p+1) matrix of xi values, p the vector of fitted probabilities with ith element p(xi; β_old) and W a N x N diagonal matrix of weights with ith diagonal element p(xi; β_old)(1 - p(xi; β_old)). Then we have

∂l(β)/∂β = X^T (y - p)   (4.24)
∂²l(β)/∂β∂β^T = -X^T W X.   (4.25)

The Newton step is thus

β_new = β_old + (X^T W X)^{-1} X^T (y - p)
      = (X^T W X)^{-1} X^T W (Xβ_old + W^{-1}(y - p))
      = (X^T W X)^{-1} X^T W z.   (4.26)

In the second and third line we have re-expressed the Newton step as a weighted least squares step, with the response

z = Xβ_old + W^{-1}(y - p),   (4.27)

sometimes known as the adjusted response. These equations get solved repeatedly, since at each iteration p changes, and hence so do W and z. This algorithm is referred to as iteratively reweighted least squares or IRLS, since each iteration solves the weighted least squares problem

β_new <- argmin_β (z - Xβ)^T W (z - Xβ).   (4.28)

It seems that β = 0 is a good starting value for the iterative procedure, although convergence is never guaranteed. Typically the algorithm does converge, since the log-likelihood is concave, but overshooting can occur. In the rare cases that the log-likelihood decreases, step size halving will guarantee convergence.

For the multiclass case (K ≥ 3) the Newton algorithm can also be expressed as an iteratively reweighted least squares algorithm, but with a vector of K - 1 responses and a nondiagonal weight matrix per observation. The latter precludes any simplified algorithms, and in this case it is numerically more convenient to work with the expanded vector θ directly (Exercise 4.4). Alternatively coordinate-descent methods (Section 3.8.6) can be used to maximize the log-likelihood efficiently. The R package glmnet (Friedman et al., 2010) can fit very large logistic regression problems efficiently, both in N and p. Although designed to fit regularized models, options allow for unregularized fits.
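The two-class IRLS iteration (4.26)-(4.28) can be sketched as follows. This is a minimal numpy illustration with a fixed iteration count and no step-size halving, so it assumes the data are not linearly separable.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Two-class logistic regression by iteratively reweighted least squares.

    X : (N, p+1) model matrix with a leading column of 1's; y : (N,) 0/1 response.
    Starts from beta = 0, as the text suggests.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted probabilities
        w = p * (1.0 - p)                       # diagonal of W
        z = X @ beta + (y - p) / w              # adjusted response (4.27)
        # weighted least squares step (4.28): solve (X^T W X) beta = X^T W z
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ z)
    return beta
```

At convergence the score equations (4.21) hold, so X^T(y - p̂) is (numerically) zero.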
Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables

TABLE 4.2. Results from a logistic regression fit to the South African heart disease data.

              Coefficient   Std. Error   Z Score
(Intercept)      -4.130        0.964      -4.285
sbp               0.006        0.006       1.023
tobacco           0.080        0.026       3.034
ldl               0.185        0.057       3.219
famhist           0.939        0.225       4.178
obesity          -0.035        0.029      -1.187
alcohol           0.001        0.004       0.136
age               0.043        0.010       4.184

in explaining the outcome. Typically many models are fit in a search for a parsimonious model involving a subset of the variables, possibly with some interaction terms. The following example illustrates some of the issues involved.

4.4.2 Example: South African Heart Disease

Here we present an analysis of binary data to illustrate the traditional statistical use of the logistic regression model. The data in Figure 4.12 are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).

We fit a logistic-regression model by maximum likelihood, giving the results shown in Table 4.2. This summary includes Z scores for each of the coefficients in the model (coefficients divided by their standard errors); a nonsignificant Z score suggests a coefficient can be dropped from the model. Each of these correspond formally to a test of the null hypothesis that the coefficient in question is zero, while all the others are not (also known as the Wald test). A Z score greater than approximately 2 in absolute value is significant at the 5% level.

There are some surprises in this table of coefficients, which must be interpreted with caution. Systolic blood pressure (sbp) is not significant! Nor is obesity, and its sign is negative. This confusion is a result of the correlation between the set of predictors.
On their own, both sbp and obesity are significant, and with positive sign. However, in the presence of many

FIGURE 4.12. A scatterplot matrix of the South African heart disease data. Each plot shows a pair of risk factors (sbp, tobacco, ldl, famhist, obesity, alcohol, age), and the cases and controls are color coded (red is a case). The variable family history of heart disease (famhist) is binary (yes or no).

other correlated variables, they are no longer needed (and can even get a negative sign).

At this stage the analyst might do some model selection; find a subset of the variables that are sufficient for explaining their joint effect on the prevalence of chd. One way to proceed is to drop the least significant coefficient, and refit the model. This is done repeatedly until no further terms can be dropped from the model. This gave the model shown in Table 4.3.

TABLE 4.3. Results from stepwise logistic regression fit to the South African heart disease data.

                Coefficient   Std. Error   Z score
  (Intercept)      -4.204        0.498      -8.45
  tobacco           0.081        0.026       3.16
  ldl               0.168        0.054       3.09
  famhist           0.924        0.223       4.14
  age               0.044        0.010       4.52

A better but more time-consuming strategy is to refit each of the models with one variable removed, and then perform an analysis of deviance to decide which variable to exclude. The residual deviance of a fitted model is minus twice its log-likelihood, and the deviance between two models is the difference of their individual residual deviances (in analogy to sums-of-squares). This strategy gave the same final model as above.

How does one interpret a coefficient of 0.081 (Std. Error = 0.026) for tobacco, for example? Tobacco is measured in total lifetime usage in kilograms, with a median of 1.0 kg for the controls and 4.1 kg for the cases. Thus an increase of 1 kg in lifetime tobacco usage accounts for an increase in the odds of coronary heart disease of exp(0.081) = 1.084 or 8.4%. Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).

We return to these data in Chapter 5, where we see that some of the variables have nonlinear effects, and when modeled appropriately, are not excluded from the model.

4.4.3 Quadratic Approximations and Inference

The maximum-likelihood parameter estimates β̂ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are

    z_i = x_i^T β̂ + (y_i − p̂_i) / (p̂_i (1 − p̂_i)),    (4.29)

and the weights are w_i = p̂_i (1 − p̂_i), both depending on β̂ itself. Apart from providing a convenient algorithm, this connection with least squares has more to offer:

The weighted residual sum-of-squares is the familiar Pearson chi-square statistic

    ∑_{i=1}^N (y_i − p̂_i)² / (p̂_i (1 − p̂_i)),    (4.30)

a quadratic approximation to the deviance.

Asymptotic likelihood theory says that if the model is correct, then β̂ is consistent (i.e., converges to the true β). A central limit theorem then shows that the distribution of β̂ converges to N(β, (X^T W X)^{-1}). This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference.

Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test which tests for inclusion of a term, and the Wald test which can be used to test for exclusion of a term. Neither of these require iterative fitting, and are based on the maximum-likelihood fit of the current model. It turns out that both of these amount to adding or dropping a term from the weighted least squares fit, using the same weights. Such computations can be done efficiently, without recomputing the entire weighted least squares fit.

Software implementations can take advantage of these connections. For example, the generalized linear modeling software in R (which includes logistic regression as part of the binomial family of models) exploits them fully. GLM (generalized linear model) objects can be treated as linear model objects, and all the tools available for linear models can be applied automatically.

4.4.4 L1 Regularized Logistic Regression

The L1 penalty used in the lasso (Section 3.4.2) can be used for variable selection and shrinkage with any linear regression model. For logistic regression, we would maximize a penalized version of (4.20):

    max_{β_0, β}  ∑_{i=1}^N [ y_i(β_0 + β^T x_i) − log(1 + e^{β_0 + β^T x_i}) ] − λ ∑_{j=1}^p |β_j|.    (4.31)

As with the lasso, we typically do not penalize the intercept term, and standardize the predictors for the penalty to be meaningful.
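One simple (if slow) way to solve a penalized problem of this form is proximal gradient ascent: take a gradient step on the log-likelihood, then soft-threshold the coefficients. This is only a sketch under our own choices of step size and toy data; coordinate descent, as in glmnet, is the practical tool. The sketch also lets us check numerically that the active variables' generalized correlations with the residuals are tied at λ.

```python
import numpy as np

def l1_logistic(X, y, lam, lr=0.01, n_iter=4000):
    """Illustrative proximal-gradient fit of L1-penalized logistic
    regression: gradient ascent step, then soft-thresholding."""
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(beta0 + X @ beta)))
        beta0 += lr * np.sum(y - p)              # intercept is unpenalized
        b = beta + lr * (X.T @ (y - p))          # gradient ascent step
        beta = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)
    return beta0, beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X = (X - X.mean(0)) / X.std(0)                   # standardize the predictors
eta = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-eta))).astype(float)

lam = 10.0
beta0, beta = l1_logistic(X, y, lam)
p = 1 / (1 + np.exp(-(beta0 + X @ beta)))
score = X.T @ (y - p)
active = np.abs(beta) > 1e-8
# at the optimum, |x_j^T (y - p)| = lambda for every active variable
```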
Criterion (4.31) is

concave, and a solution can be found using nonlinear programming methods (Koh et al., 2007, for example). Alternatively, using the same quadratic approximations that were used in the Newton algorithm in Section 4.4.1, we can solve (4.31) by repeated application of a weighted lasso algorithm. Interestingly, the score equations [see (4.24)] for the variables with non-zero coefficients have the form

    x_j^T (y − p) = λ · sign(β_j),    (4.32)

which generalizes (3.58) in Section 3.4.4: the active variables are tied in their generalized correlation with the residuals. Path algorithms such as LAR for the lasso are more difficult, because the coefficient profiles are piecewise smooth rather than linear. Nevertheless, progress can be made using quadratic approximations.

FIGURE 4.13. L1 regularized logistic regression coefficients β_j(λ) for the South African heart disease data, plotted as a function of the L1 norm ‖β(λ)‖_1. The variables were all standardized to have unit variance. The profiles are computed exactly at each of the plotted points.

Figure 4.13 shows the L1 regularization path for the South African heart disease data of Section 4.4.2. This was produced using the R package glmpath (Park and Hastie, 2007), which uses predictor-corrector methods of convex optimization to identify the exact values of λ at which the active set of non-zero coefficients changes (vertical lines in the figure). Here the profiles look almost linear; in other examples the curvature will be more visible.

Coordinate descent methods (Section 3.8.6) are very efficient for computing the coefficient profiles on a grid of values for λ. The R package glmnet

(Friedman et al., 2010) can fit coefficient paths for very large logistic regression problems efficiently (large in N or p). Their algorithms can exploit sparsity in the predictor matrix X, which allows for even larger problems. See Section 18.4 for more details, and a discussion of L1-regularized multinomial models.

4.4.5 Logistic Regression or LDA?

In Section 4.3 we find that the log-posterior odds between class k and K are linear functions of x (4.9):

    log [ Pr(G = k|X = x) / Pr(G = K|X = x) ]
        = log(π_k/π_K) − ½(μ_k + μ_K)^T Σ^{-1}(μ_k − μ_K) + x^T Σ^{-1}(μ_k − μ_K)
        = α_{k0} + α_k^T x.    (4.33)

This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix. The linear logistic model (4.17) by construction has linear logits:

    log [ Pr(G = k|X = x) / Pr(G = K|X = x) ] = β_{k0} + β_k^T x.    (4.34)

It seems that the models are the same. Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions. We can write the joint density of X and G as

    Pr(X, G = k) = Pr(X) Pr(G = k|X),    (4.35)

where Pr(X) denotes the marginal density of the inputs X. For both LDA and logistic regression, the second term on the right has the logit-linear form

    Pr(G = k|X = x) = e^{β_{k0} + β_k^T x} / (1 + ∑_{l=1}^{K−1} e^{β_{l0} + β_l^T x}),    (4.36)

where we have again arbitrarily chosen the last class as the reference.

The logistic regression model leaves the marginal density of X as an arbitrary density function Pr(X), and fits the parameters of Pr(G|X) by maximizing the conditional likelihood, the multinomial likelihood with probabilities Pr(G = k|X). Although Pr(X) is totally ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion, using the empirical distribution function which places mass 1/N at each observation.

With LDA we fit the parameters by maximizing the full log-likelihood, based on the joint density

    Pr(X, G = k) = φ(X; μ_k, Σ) π_k,    (4.37)

where φ is the Gaussian density function. Standard normal theory leads easily to the estimates μ̂_k, Σ̂, and π̂_k given in Section 4.3. Since the linear parameters of the logistic form (4.33) are functions of the Gaussian parameters, we get their maximum-likelihood estimates by plugging in the corresponding estimates. However, unlike in the conditional case, the marginal density Pr(X) does play a role here. It is a mixture density

    Pr(X) = ∑_{k=1}^K π_k φ(X; μ_k, Σ),    (4.38)

which also involves the parameters.

What role can this additional component/restriction play? By relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (lower variance). If in fact the true f_k(x) are Gaussian, then in the worst case ignoring this marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate (Efron, 1975). Paraphrasing: with 30% more data, the conditional likelihood will do as well.

For example, observations far from the decision boundary (which are down-weighted by logistic regression) play a role in estimating the common covariance matrix. This is not all good news, because it also means that LDA is not robust to gross outliers.

From the mixture formulation, it is clear that even observations without class labels have information about the parameters. Often it is expensive to generate class labels, but unclassified observations come cheaply. By relying on strong model assumptions, such as here, we can use both types of information.

The marginal likelihood can be thought of as a regularizer, requiring in some sense that class densities be visible from this marginal view. For example, if the data in a two-class logistic regression model can be perfectly separated by a hyperplane, the maximum likelihood estimates of the parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA coefficients for the same data will be well defined, since the marginal likelihood will not permit these degeneracies.
In practice these assumptions are never correct, and often some of the components of X are qualitative variables. It is generally felt that logistic regression is a safer, more robust bet than the LDA model, relying on fewer assumptions. It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors.
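The full-likelihood (LDA) route can be made concrete: estimate priors, class means and a pooled covariance, and plug them into the linear log-posterior odds. A minimal numpy sketch on simulated Gaussian data (the toy setup and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
# two Gaussian classes sharing an identity covariance: the LDA model
N = 2000
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
g2 = rng.uniform(size=N) < 0.5                 # class-2 indicator
X = np.where(g2[:, None], mu2, mu1) + rng.normal(size=(N, 2))

# full (joint) likelihood estimates: priors, means, pooled covariance
pi2 = g2.mean()
m1, m2 = X[~g2].mean(axis=0), X[g2].mean(axis=0)
R = np.vstack([X[~g2] - m1, X[g2] - m2])
Sigma = R.T @ R / (N - 2)

# plug in to get the linear coefficients of the log-posterior odds
Sinv = np.linalg.inv(Sigma)
alpha = Sinv @ (m2 - m1)
alpha0 = np.log(pi2 / (1 - pi2)) - 0.5 * (m2 + m1) @ Sinv @ (m2 - m1)
```

With the true covariance equal to the identity, the population coefficients are Σ^{-1}(μ_2 − μ_1) = (2, 1), and the estimates above should be close to that; a conditional-likelihood (logistic regression) fit on the same data would estimate the same linear form without ever modeling Pr(X).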

FIGURE 4.14. A toy example with two classes separable by a hyperplane. The orange line is the least squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.

4.5 Separating Hyperplanes

We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. For the rest of this chapter we describe separating hyperplane classifiers. These procedures construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. They provide the basis for support vector classifiers, discussed in Chapter 12. The mathematical level of this section is somewhat higher than that of the previous sections.

Figure 4.14 shows 20 data points in two classes in IR². These data can be separated by a linear boundary. Included in the figure (blue lines) are two of the infinitely many possible separating hyperplanes. The orange line is the least squares solution to the problem, obtained by regressing the −1/1 response Y on X (with intercept); the line is given by

    {x : β̂_0 + β̂_1 x_1 + β̂_2 x_2 = 0}.    (4.39)

This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by LDA, in light of its equivalence with linear regression in the two-class case (Section 4.3 and Exercise 4.2).

Classifiers such as (4.39), that compute a linear combination of the input features and return the sign, were called perceptrons in the engineering literature in the late 1950s (Rosenblatt, 1958). Perceptrons set the foundations for the neural network models of the 1980s and 1990s.

Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set L defined by the equation f(x) = β_0 + β^T x = 0; since we are in IR² this is a line. Here we list some properties:

1. For any two points x_1 and x_2 lying in L, β^T(x_1 − x_2) = 0, and hence β* = β/‖β‖ is the vector normal to the surface of L.

2. For any point x_0 in L, β^T x_0 = −β_0.

3. The signed distance of any point x to L is given by

    β*^T (x − x_0) = (1/‖β‖)(β^T x + β_0) = (1/‖f′(x)‖) f(x).    (4.40)

Hence f(x) is proportional to the signed distance from x to the hyperplane defined by f(x) = 0.

FIGURE 4.15. The linear algebra of a hyperplane (affine set).

4.5.1 Rosenblatt's Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If

a response with y_i = 1 is misclassified, then x_i^T β + β_0 < 0, and the opposite for a misclassified response with y_i = −1. The goal is to minimize

    D(β, β_0) = − ∑_{i∈M} y_i (x_i^T β + β_0),    (4.41)

where M indexes the set of misclassified points. The quantity is non-negative and proportional to the distance of the misclassified points to the decision boundary defined by β^T x + β_0 = 0. The gradient (assuming M is fixed) is given by

    ∂D(β, β_0)/∂β = − ∑_{i∈M} y_i x_i,    (4.42)

    ∂D(β, β_0)/∂β_0 = − ∑_{i∈M} y_i.    (4.43)

The algorithm in fact uses stochastic gradient descent to minimize this piecewise linear criterion. This means that rather than computing the sum of the gradient contributions of each observation followed by a step in the negative gradient direction, a step is taken after each observation is visited. Hence the misclassified observations are visited in some sequence, and the parameters β are updated via

    (β, β_0) ← (β, β_0) + ρ (y_i x_i, y_i).    (4.44)

Here ρ is the learning rate, which in this case can be taken to be 1 without loss of generality. If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps (Exercise 4.6). Figure 4.14 shows two solutions to a toy problem, each started at a different random guess.

There are a number of problems with this algorithm, summarized in Ripley (1996):

- When the data are separable, there are many solutions, and which one is found depends on the starting values.
- The "finite" number of steps can be very large. The smaller the gap, the longer the time to find it.
- When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.

The second problem can often be eliminated by seeking a hyperplane not in the original space, but in a much enlarged space obtained by creating

many basis-function transformations of the original variables. This is analogous to driving the residuals in a polynomial regression problem down to zero by making the degree sufficiently large. Perfect separation cannot always be achieved: for example, if observations from two different classes share the same input. It may not be desirable either, since the resulting model is likely to be overfit and will not generalize well. We return to this point at the end of the next section.

A rather elegant solution to the first problem is to add additional constraints to the separating hyperplane.

4.5.2 Optimal Separating Hyperplanes

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996). Not only does this provide a unique solution to the separating hyperplane problem, but by maximizing the margin between the two classes on the training data, this leads to better classification performance on test data.

We need to generalize criterion (4.41). Consider the optimization problem

    max_{β, β_0, ‖β‖=1} M
    subject to y_i (x_i^T β + β_0) ≥ M, i = 1, ..., N.    (4.45)

The set of conditions ensure that all the points are at least a signed distance M from the decision boundary defined by β and β_0, and we seek the largest such M and associated parameters. We can get rid of the ‖β‖ = 1 constraint by replacing the conditions with

    (1/‖β‖) y_i (x_i^T β + β_0) ≥ M,    (4.46)

(which redefines β_0) or equivalently

    y_i (x_i^T β + β_0) ≥ M‖β‖.    (4.47)

Since for any β and β_0 satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set ‖β‖ = 1/M. Thus (4.45) is equivalent to

    min_{β, β_0} ½‖β‖²
    subject to y_i (x_i^T β + β_0) ≥ 1, i = 1, ..., N.    (4.48)

In light of (4.40), the constraints define an empty slab or margin around the linear decision boundary of thickness 1/‖β‖. Hence we choose β and β_0 to maximize its thickness. This is a convex optimization problem (quadratic

criterion with linear inequality constraints). The Lagrange (primal) function, to be minimized w.r.t. β and β_0, is

    L_P = ½‖β‖² − ∑_{i=1}^N α_i [ y_i (x_i^T β + β_0) − 1 ].    (4.49)

Setting the derivatives to zero, we obtain:

    β = ∑_{i=1}^N α_i y_i x_i,    (4.50)

    0 = ∑_{i=1}^N α_i y_i,    (4.51)

and substituting these in (4.49) we obtain the so-called Wolfe dual

    L_D = ∑_{i=1}^N α_i − ½ ∑_{i=1}^N ∑_{k=1}^N α_i α_k y_i y_k x_i^T x_k
    subject to α_i ≥ 0.    (4.52)

The solution is obtained by maximizing L_D in the positive orthant, a simpler convex optimization problem, for which standard software can be used. In addition the solution must satisfy the Karush-Kuhn-Tucker conditions, which include (4.50), (4.51), (4.52) and

    α_i [ y_i (x_i^T β + β_0) − 1 ] = 0 for all i.    (4.53)

From these we can see that

- if α_i > 0, then y_i (x_i^T β + β_0) = 1, or in other words, x_i is on the boundary of the slab;
- if y_i (x_i^T β + β_0) > 1, x_i is not on the boundary of the slab, and α_i = 0.

From (4.50) we see that the solution vector β is defined in terms of a linear combination of the support points x_i, those points defined to be on the boundary of the slab via α_i > 0. Figure 4.16 shows the optimal separating hyperplane for our toy example; there are three support points. Likewise, β_0 is obtained by solving (4.53) for any of the support points.

The optimal separating hyperplane produces a function f̂(x) = x^T β̂ + β̂_0 for classifying new observations:

    Ĝ(x) = sign f̂(x).    (4.54)

Although none of the training observations fall in the margin (by construction), this will not necessarily be the case for test observations. The

FIGURE 4.16. The same data as in Figure 4.14. The shaded region delineates the maximum margin separating the two classes. There are three support points indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Included in the figure is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane (see Section 12.3.3).

intuition is that a large margin on the training data will lead to good separation on the test data.

The description of the solution in terms of support points seems to suggest that the optimal hyperplane focuses more on the points that count, and is more robust to model misspecification. The LDA solution, on the other hand, depends on all of the data, even points far away from the decision boundary. Note, however, that the identification of these support points required the use of all the data.

Of course, if the classes are really Gaussian, then LDA is optimal, and separating hyperplanes will pay a price for focusing on the (noisier) data at the boundaries of the classes.

Included in Figure 4.16 is the logistic regression solution to this problem, fit by maximum likelihood. Both solutions are similar in this case. When a separating hyperplane exists, logistic regression will always find it, since the log-likelihood can be driven to 0 in this case (Exercise 4.5). The logistic regression solution shares some other qualitative features with the separating hyperplane solution. The coefficient vector is defined by a weighted least squares fit of a zero-mean linearized response on the input features, and the weights are larger for points near the decision boundary than for those further away.

When the data are not separable, there will be no feasible solution to this problem, and an alternative formulation is needed. Again one can enlarge the space using basis transformations, but this can lead to artificial

separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the support vector machine, which allows for overlap, but minimizes a measure of the extent of this overlap.

Bibliographic Notes

Good general texts on classification include Duda et al. (2000), Hand (1981), McLachlan (1992) and Ripley (1996). Mardia et al. (1979) have a concise discussion of linear discriminant analysis. Michie et al. (1994) compare a large number of popular classifiers on benchmark datasets. Linear separating hyperplanes are discussed in Vapnik (1996). Our account of the perceptron learning algorithm follows Ripley (1996).

Exercises

Ex. 4.1 Show how to solve the generalized eigenvalue problem max a^T B a subject to a^T W a = 1 by transforming to a standard eigenvalue problem.

Ex. 4.2 Suppose we have features x ∈ IR^p, a two-class response, with class sizes N_1, N_2, and the target coded as −N/N_1, N/N_2.

(a) Show that the LDA rule classifies to class 2 if

    x^T Σ̂^{-1}(μ̂_2 − μ̂_1) > ½ μ̂_2^T Σ̂^{-1} μ̂_2 − ½ μ̂_1^T Σ̂^{-1} μ̂_1 + log(N_1/N) − log(N_2/N),

and class 1 otherwise.

(b) Consider minimization of the least squares criterion

    ∑_{i=1}^N (y_i − β_0 − β^T x_i)².    (4.55)

Show that the solution β̂ satisfies

    [ (N − 2) Σ̂ + (N_1 N_2 / N) Σ̂_B ] β = N (μ̂_2 − μ̂_1)    (4.56)

(after simplification), where Σ̂_B = (μ̂_2 − μ̂_1)(μ̂_2 − μ̂_1)^T.

(c) Hence show that Σ̂_B β is in the direction (μ̂_2 − μ̂_1) and thus

    β̂ ∝ Σ̂^{-1}(μ̂_2 − μ̂_1).    (4.57)

Therefore the least squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

(d) Show that this result holds for any (distinct) coding of the two classes.

(e) Find the solution β̂_0, and hence the predicted values f̂ = β̂_0 + β̂^T x. Consider the following rule: classify to class 2 if ŷ_i > 0 and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.

(Fisher, 1936; Ripley, 1996)

Ex. 4.3 Suppose we transform the original predictors X to Ŷ via linear regression. In detail, let Ŷ = X(X^T X)^{-1} X^T Y = X B̂, where Y is the indicator response matrix. Similarly for any input x ∈ IR^p, we get a transformed vector ŷ = B̂^T x ∈ IR^K. Show that LDA using Ŷ is identical to LDA in the original space.

Ex. 4.4 Consider the multilogit model with K classes (4.17). Let β be the (p + 1)(K − 1)-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector x to accommodate this vectorized coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the multinomial log-likelihood, and describe how you would implement this algorithm.

Ex. 4.5 Consider a two-class logistic regression problem with x ∈ IR. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample x_i for the two classes are separated by a point x_0 ∈ IR. Generalize this result to (a) x ∈ IR^p (see Figure 4.16), and (b) more than two classes.

Ex. 4.6 Suppose we have N points x_i in IR^p in general position, with class labels y_i ∈ {−1, 1}. Prove that the perceptron learning algorithm converges to a separating hyperplane in a finite number of steps:

(a) Denote a hyperplane by f(x) = β_1^T x + β_0 = 0, or in more compact notation β^T x* = 0, where x* = (x, 1) and β = (β_1, β_0). Let z_i = x*_i / ‖x*_i‖. Show that separability implies the existence of a β_sep such that y_i β_sep^T z_i ≥ 1 for all i.

(b) Given a current β_old, the perceptron algorithm identifies a point z_i that is misclassified, and produces the update β_new ← β_old + y_i z_i.

155 6 4. Linear Methds fr Classificatin (d) Shw that this result hlds fr any (distinct) cding f the tw classes. (e) Find the slutin ˆβ 0, and hence the predicted values ˆf = ˆβ 0 + ˆβ T x. Cnsider the fllwing rule: classify t class if ŷ i > 0 and class therwise. Shw this is nt the same as the LDA rule unless the classes have equal numbers f bservatins. (Fisher, 96; Ripley, 996) Ex. 4. Suppse we transfrm the riginal predictrs X t Ŷ via linear regressin. In detail, let Ŷ = X(XT X) X T Y = XˆB, where Y is the indicatr respnse matrix. Similarly fr any input x IR p, we get a transfrmed vectr ŷ = ˆB T x IR K. Shw that LDA using Ŷ is identical t LDA in the riginal space. Ex. 4.4 Cnsider the multilgit mdel with K classes (4.7). Let β be the (p + )(K )-vectr cnsisting f all the cefficients. Define a suitably enlarged versin f the input vectr x t accmmdate this vectrized cefficient matrix. Derive the Newtn-Raphsn algrithm fr maximizing the multinmial lg-likelihd, and describe hw yu wuld implement this algrithm. Ex. 4.5 Cnsider a tw-class lgistic regressin prblem with x IR. Characterize the maximum-likelihd estimates f the slpe and intercept parameter if the sample x i fr the tw classes are separated by a pint x 0 IR. Generalize this result t (a) x IR p (see Figure 4.6), and (b) mre than tw classes. Ex. 4.6 Suppse we have N pints x i in IR p in general psitin, with class labels y i {,}. Prve that the perceptrn learning algrithm cnverges t a separating hyperplane in a finite number f steps: (a) Dente a hyperplane by f(x) = β T x + β 0 = 0, r in mre cmpact ntatin β T x = 0, where x = (x,) and β = (β,β 0 ). Let z i = x i / x i. Shw that separability implies the existence f a β sep such that y i β T sepz i i (b) Given a current β ld, the perceptrn algrithm identifies a pint z i that is misclassified, and prduces the update β new β ld + y i z i. 
Shw that β new β sep β ld β sep, and hence that the algrithm cnverges t a separating hyperplane in n mre than β start β sep steps (Ripley, 996). Ex. 4.7 Cnsider the criterin D (β,β 0 ) = N y i (x T i β + β 0 ), (4.58) i=

a generalization of (4.41) where we sum over all the observations. Consider minimizing D* subject to ‖β‖ = 1. Describe this criterion in words. Does it solve the optimal separating hyperplane problem?

Ex. 4.8 Consider the multivariate Gaussian model X|G = k ~ N(μ_k, Σ), with the additional restriction that rank{μ_k}_{k=1}^K = L < max(K − 1, p). Derive the constrained MLEs for the μ_k and Σ. Show that the Bayes classification rule is equivalent to classifying in the reduced subspace computed by LDA (Hastie and Tibshirani, 1996b).

Ex. 4.9 Write a computer program to perform a quadratic discriminant analysis by fitting a separate Gaussian model per class. Try it out on the vowel data, and compute the misclassification error for the test data. The data can be found in the book website www-stat.stanford.edu/ElemStatLearn.
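As a companion to Section 4.5.1 and Exercise 4.6, here is a minimal sketch of the perceptron learning algorithm on separable toy data in the spirit of Figure 4.14. The data and function name are our own; by the convergence result of Exercise 4.6, the loop terminates on separable data.

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Rosenblatt's algorithm: a stochastic-gradient step on each
    misclassified point, repeated until no mistakes remain."""
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(y)):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # misclassified
                beta = beta + rho * y[i] * X[i]
                beta0 = beta0 + rho * y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return beta, beta0

# separable two-class toy data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(20, 2)) + 3.0, rng.normal(size=(20, 2)) - 3.0])
y = np.r_[np.ones(20), -np.ones(20)]
beta, beta0 = perceptron(X, y)
```

Different random starts (or different visit orders) produce different separating hyperplanes, which is exactly the non-uniqueness that the optimal separating hyperplane of Section 4.5.2 removes.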


5 Basis Expansions and Regularization

5.1 Introduction

We have already made use of models linear in the input features, both for regression and classification. Linear regression, linear discriminant analysis, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely that the true function f(X) is actually linear in X. In regression problems, f(X) = E(Y|X) will typically be nonlinear and nonadditive in X, and representing f(X) by a linear model is usually a convenient, and sometimes a necessary, approximation. Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to f(X). Sometimes necessary, because with N small and/or p large, a linear model might be all we are able to fit to the data without overfitting. Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of Pr(Y = 1|X) is linear in X. This is inevitably an approximation.

In this chapter and the next we discuss popular methods for moving beyond linearity. The core idea in this chapter is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features.

Denote by h_m(X) : IR^p → IR the mth transformation of X, m = 1, ..., M. We then model

    f(X) = ∑_{m=1}^M β_m h_m(X),    (5.1)

a linear basis expansion in X. The beauty of this approach is that once the basis functions h_m have been determined, the models are linear in these new variables, and the fitting proceeds as before.

Some simple and widely used examples of the h_m are the following:

- h_m(X) = X_m, m = 1, ..., p recovers the original linear model.

- h_m(X) = X_j² or h_m(X) = X_j X_k allows us to augment the inputs with polynomial terms to achieve higher-order Taylor expansions. Note, however, that the number of variables grows exponentially in the degree of the polynomial. A full quadratic model in p variables requires O(p²) square and cross-product terms, or more generally O(p^d) for a degree-d polynomial.

- h_m(X) = log(X_j), √X_j, ... permits other nonlinear transformations of single inputs. More generally one can use similar functions involving several inputs, such as h_m(X) = ‖X‖.

- h_m(X) = I(L_m ≤ X_k < U_m), an indicator for a region of X_k. Breaking the range of X_k up into M_k such nonoverlapping regions results in a model with a piecewise constant contribution for X_k.

Sometimes the problem at hand will call for particular basis functions h_m, such as logarithms or power functions. More often, however, we use the basis expansions as a device to achieve more flexible representations for f(X). Polynomials are an example of the latter, although they are limited by their global nature: tweaking the coefficients to achieve a functional form in one region can cause the function to flap about madly in remote regions. In this chapter we consider more useful families of piecewise-polynomials and splines that allow for local polynomial representations. We also discuss the wavelet bases, especially useful for modeling signals and images. These methods produce a dictionary D consisting of typically a very large number |D| of basis functions, far more than we can afford to fit to our data. Along with the dictionary we require a method for controlling the complexity of our model, using basis functions from the dictionary.
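The mechanics of (5.1) are worth seeing once: choose transformations h_m, assemble them into a design matrix, and fit by ordinary least squares exactly as for a linear model. A sketch on toy data of our own, with a cubic-polynomial basis:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=200)

# basis expansion: h_1(X) = 1, h_2(X) = X, h_3(X) = X^2, h_4(X) = X^3
H = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
f_hat = H @ beta             # the fitted f(X) = sum_m beta_m h_m(X)
mse = np.mean((y - f_hat) ** 2)
```

The fitting machinery never changes; only the columns of the design matrix do, which is why everything developed for linear models carries over to basis expansions.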
There are three common approaches:

Restriction methods, where we decide beforehand to limit the class of functions. Additivity is an example, where we assume that our model has the form

    f(X) = ∑_{j=1}^p f_j(X_j) = ∑_{j=1}^p ∑_{m=1}^{M_j} β_{jm} h_{jm}(X_j).    (5.2)

The size of the model is limited by the number of basis functions M_j used for each component function f_j.

Selection methods, which adaptively scan the dictionary and include only those basis functions h_m that contribute significantly to the fit of the model. Here the variable selection techniques discussed in Chapter 3 are useful. The stagewise greedy approaches such as CART, MARS and boosting fall into this category as well.

Regularization methods where we use the entire dictionary but restrict the coefficients. Ridge regression is a simple example of a regularization approach, while the lasso is both a regularization and selection method. Here we discuss these and more sophisticated methods for regularization.

5.2 Piecewise Polynomials and Splines

We assume until Section 5.7 that X is one-dimensional. A piecewise polynomial function f(X) is obtained by dividing the domain of X into contiguous intervals, and representing f by a separate polynomial in each interval. Figure 5.1 shows two simple piecewise polynomials. The first is piecewise constant, with three basis functions:

    h_1(X) = I(X < ξ_1),  h_2(X) = I(ξ_1 ≤ X < ξ_2),  h_3(X) = I(ξ_2 ≤ X).

Since these are positive over disjoint regions, the least squares estimate of the model f(X) = ∑_{m=1}^3 β_m h_m(X) amounts to β̂_m = Ȳ_m, the mean of Y in the mth region.

The top right panel shows a piecewise linear fit. Three additional basis functions are needed: h_{m+3} = h_m(X)·X, m = 1, ..., 3. Except in special cases, we would typically prefer the third panel, which is also piecewise linear, but restricted to be continuous at the two knots. These continuity restrictions lead to linear constraints on the parameters; for example, f(ξ_1⁻) = f(ξ_1⁺) implies that β_1 + ξ_1 β_4 = β_2 + ξ_1 β_5. In this case, since there are two restrictions, we expect to get back two parameters, leaving four free parameters.

A more direct way to proceed in this case is to use a basis that incorporates the constraints:

    h_1(X) = 1,  h_2(X) = X,  h_3(X) = (X − ξ_1)_+,  h_4(X) = (X − ξ_2)_+,

where t_+ denotes the positive part.
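A quick numerical check, with made-up knots and coefficients, that the basis h_1, ..., h_4 above builds the continuity constraint in directly, rather than imposing it afterwards on a richer basis:

```python
import numpy as np

pos = lambda t: np.maximum(t, 0.0)
xi1, xi2 = 0.3, 0.7                            # illustrative knot locations

def basis(x):
    # h1 = 1, h2 = X, h3 = (X - xi1)_+, h4 = (X - xi2)_+
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, pos(x - xi1), pos(x - xi2)])

beta = np.array([0.5, 1.0, -2.0, 1.5])         # any coefficients will do
f = lambda x: basis(x) @ beta

eps = 1e-9
left, right = f([xi1 - eps])[0], f([xi1 + eps])[0]
# f is continuous at the knot for every choice of beta; only the slope
# changes there (by beta_3), so all four parameters remain free
```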
The function h_3 is shown in the lower right panel of Figure 5.1. We often prefer smoother functions, and these can be achieved by increasing the order of the local polynomial. Figure 5.2 shows a series of piecewise-cubic polynomials fit to the same data, with

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots ξ_1 and ξ_2. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, h_3(X) = (X − ξ_1)_+, continuous at ξ_1. The black points indicate the sample evaluations h_3(x_i), i = 1, ..., N.

FIGURE 5.2. A series of piecewise-cubic polynomials, with increasing orders of continuity at the knots ξ_1 and ξ_2 (discontinuous; continuous; continuous first derivative; continuous second derivative).

increasing orders of continuity at the knots. The function in the lower right panel is continuous, and has continuous first and second derivatives at the knots. It is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial. It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at ξ_1 and ξ_2:

    h_1(X) = 1,    h_3(X) = X²,    h_5(X) = (X − ξ_1)³_+,
    h_2(X) = X,    h_4(X) = X³,    h_6(X) = (X − ξ_2)³_+.    (5.3)

There are six basis functions corresponding to a six-dimensional linear space of functions. A quick check confirms the parameter count: (3 regions) × (4 parameters per region) − (2 knots) × (3 constraints per knot) = 6.
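The six-function basis above is easy to assemble, and the dimension count can be verified directly. A sketch; the knot values 0.33 and 0.66 are just for illustration:

```python
import numpy as np

def cubic_spline_basis(x, xi1, xi2):
    """Truncated power basis for a cubic spline with two knots."""
    x = np.asarray(x, dtype=float)
    pos = lambda t: np.maximum(t, 0.0)
    return np.column_stack([
        np.ones_like(x), x, x ** 2, x ** 3,      # global cubic part
        pos(x - xi1) ** 3, pos(x - xi2) ** 3,    # one truncated cubic per knot
    ])

x = np.linspace(0.0, 1.0, 201)
H = cubic_spline_basis(x, 0.33, 0.66)
# dimension check: 3 regions x 4 parameters - 2 knots x 3 constraints = 6
```

Each truncated cubic (X − ξ)³_+ has continuous value, first and second derivatives at its knot, which is exactly why adding one of them per knot to the global cubic produces a cubic spline.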

More generally, an order-M spline with knots ξ_j, j = 1, ..., K is a piecewise polynomial of order M, and has continuous derivatives up to order M − 2. A cubic spline has M = 4. In fact the piecewise-constant function in Figure 5.1 is an order-1 spline, while the continuous piecewise linear function is an order-2 spline. Likewise the general form for the truncated-power basis set would be

    h_j(X) = X^{j−1}, j = 1, ..., M,
    h_{M+l}(X) = (X − ξ_l)^{M−1}_+, l = 1, ..., K.

It is claimed that cubic splines are the lowest-order spline for which the knot-discontinuity is not visible to the human eye. There is seldom any good reason to go beyond cubic splines, unless one is interested in smooth derivatives. In practice the most widely used orders are M = 1, 2 and 4.

These fixed-knot splines are also known as regression splines. One needs to select the order of the spline, the number of knots and their placement. One simple approach is to parameterize a family of splines by the number of basis functions or degrees of freedom, and have the observations x_i determine the positions of the knots. For example, the expression bs(x, df=7) in R generates a basis matrix of cubic-spline functions evaluated at the N observations in x, with the 7 − 3 = 4 interior knots at the appropriate percentiles of x (20, 40, 60 and 80th). (A cubic spline with four knots is eight-dimensional; the bs() function omits by default the constant term in the basis, since terms like this are typically included with other terms in the model.) One can be more explicit, however: bs(x, degree=1, knots = c(0.2, 0.4, 0.6)) generates a basis for linear splines, with three interior knots, and returns an N × 4 matrix.

Since the space of spline functions of a particular order and knot sequence is a vector space, there are many equivalent bases for representing them (just as there are for ordinary polynomials). While the truncated power basis is conceptually simple, it is not too attractive numerically: powers of large numbers can lead to severe rounding problems. The B-spline basis, described in the Appendix to this chapter, allows for efficient computations even when the number of knots K is large.
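The general truncated-power construction can be written in a few lines. With M = 2 it spans the same space as the linear-spline bs() call above, up to the constant column; this is only a sketch, not the numerically stabler B-spline basis that R actually uses:

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Order-M regression-spline basis: X^(j-1) for j = 1..M, plus one
    truncated power (X - xi)_+^(M-1) per knot; dimension is M + K."""
    x = np.asarray(x, dtype=float)
    cols = [x ** j for j in range(M)]
    cols += [np.maximum(x - xi, 0.0) ** (M - 1) for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 50)
B = truncated_power_basis(x, knots=[0.2, 0.4, 0.6], M=2)   # linear splines
# M + K = 2 + 3 = 5 columns; bs(x, degree=1, knots=c(0.2,0.4,0.6)) drops
# the constant column, hence its N x 4 matrix
```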
Natural Cubic Splines

We know that the behavior of polynomials fit to data tends to be erratic near the boundaries, and extrapolation can be dangerous. These problems are exacerbated with splines. The polynomials fit beyond the boundary knots behave even more wildly than the corresponding global polynomials in that region. This can be conveniently summarized in terms of the pointwise variance of spline functions fit by least squares (see the example in the next section for details on these variance calculations). Figure 5.3 compares

A cubic spline with four knots is eight-dimensional. The bs() function omits by default the constant term in the basis, since terms like this are typically included with other terms in the model.

5.2 Piecewise Polynomials and Splines 145

FIGURE 5.3. Pointwise variance curves for four different models, with X consisting of 50 points drawn at random from U[0,1], and an assumed error model with constant variance. The linear and cubic polynomial fits have two and four degrees of freedom, respectively, while the cubic spline and natural cubic spline each have six degrees of freedom. The cubic spline has two knots at 0.33 and 0.66, while the natural spline has boundary knots at 0.1 and 0.9, and four interior knots uniformly spaced between them. [Curves: global linear; global cubic polynomial; cubic spline, 2 knots; natural cubic spline, 6 knots.]

the pointwise variances for a variety of different models. The explosion of the variance near the boundaries is clear, and inevitably is worst for cubic splines.

A natural cubic spline adds additional constraints, namely that the function is linear beyond the boundary knots. This frees up four degrees of freedom (two constraints each in both boundary regions), which can be spent more profitably by sprinkling more knots in the interior region. This tradeoff is illustrated in terms of variance in Figure 5.3. There will be a price paid in bias near the boundaries, but assuming the function is linear near the boundaries (where we have less information anyway) is often considered reasonable.

A natural cubic spline with K knots is represented by K basis functions. One can start from a basis for cubic splines, and derive the reduced basis by imposing the boundary constraints. For example, starting from the truncated power series basis described in Section 5.2, we arrive at (Exercise 5.4):

N_1(X) = 1,   N_2(X) = X,   N_{k+2}(X) = d_k(X) − d_{K−1}(X),    (5.4)

146 5. Basis Expansions and Regularization

where

d_k(X) = [(X − ξ_k)³_+ − (X − ξ_K)³_+] / (ξ_K − ξ_k).    (5.5)

Each of these basis functions can be seen to have zero second and third derivative for X ≥ ξ_K.

5.2.2 Example: South African Heart Disease (Continued)

In Section 4.4.2 we fit linear logistic regression models to the South African heart disease data. Here we explore nonlinearities in the functions using natural splines. The functional form of the model is

logit[Pr(chd|X)] = θ_0 + h_1(X_1)ᵀθ_1 + h_2(X_2)ᵀθ_2 + ... + h_p(X_p)ᵀθ_p,    (5.6)

where each of the θ_j are vectors of coefficients multiplying their associated vector of natural spline basis functions h_j.

We use four natural spline bases for each term in the model. For example, with X_1 representing sbp, h_1(X_1) is a basis consisting of four basis functions. This actually implies three rather than two interior knots (chosen at uniform quantiles of sbp), plus two boundary knots at the extremes of the data, since we exclude the constant term from each of the h_j. Since famhist is a two-level factor, it is coded by a simple binary or dummy variable, and is associated with a single coefficient in the fit of the model.

More compactly we can combine all p vectors of basis functions (and the constant term) into one big vector h(X), and then the model is simply h(X)ᵀθ, with total number of parameters df = 1 + Σ_{j=1}^p df_j, the sum of the parameters in each component term. Each basis function is evaluated at each of the N samples, resulting in an N × df basis matrix H. At this point the model is like any other linear logistic model, and the algorithms described in Section 4.4.1 apply.

We carried out a backward stepwise deletion process, dropping terms from this model while preserving the group structure of each term, rather than dropping one coefficient at a time. The AIC statistic (Section 7.5) was used to drop terms, and all the terms remaining in the final model would cause AIC to increase if deleted from the model (see Table 5.1). Figure 5.4 shows a plot of the final model selected by the stepwise regression.
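The defining property of the basis (5.4)-(5.5) above can be checked numerically. In the Python sketch below (knot locations are made up for illustration) each of the K natural-spline basis functions has vanishing second derivative beyond the right boundary knot, i.e., it is linear there:

```python
# Natural cubic spline basis of (5.4)-(5.5). XI is an illustrative knot set.

def pos(u):
    return u if u > 0 else 0.0

XI = [0.1, 0.35, 0.65, 0.9]          # example knots, K = 4
K = len(XI)

def d(k, x):
    # d_k(X) of (5.5); k is 0-based here, XI[K-1] plays the role of xi_K.
    return (pos(x - XI[k]) ** 3 - pos(x - XI[K - 1]) ** 3) / (XI[K - 1] - XI[k])

def natural_basis(x):
    # N_1 = 1, N_2 = X, N_{k+2} = d_k - d_{K-1}: K functions in total.
    return [1.0, x] + [d(k, x) - d(K - 2, x) for k in range(K - 2)]

def second_diff(f, x, eps=1e-4):
    # Central finite-difference estimate of f''(x).
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

# Beyond the right boundary knot every basis function is linear.
for j in range(K):
    g = lambda x, j=j: natural_basis(x)[j]
    assert abs(second_diff(g, XI[-1] + 0.3)) < 1e-6
```

Note the count: a natural cubic spline with K knots really is spanned by K functions, against K + 4 for an unconstrained cubic spline with the same knots.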
The functions displayed are f̂_j(X_j) = h_j(X_j)ᵀθ̂_j for each variable X_j. The covariance matrix Cov(θ̂) = Σ is estimated by Σ̂ = (HᵀWH)⁻¹, where W is the diagonal weight matrix from the logistic regression. Hence v_j(X_j) = Var[f̂_j(X_j)] = h_j(X_j)ᵀΣ̂_jj h_j(X_j) is the pointwise variance function of f̂_j, where Cov(θ̂_j) = Σ̂_jj is the appropriate sub-matrix of Σ̂. The shaded region in each panel is defined by f̂_j(X_j) ± 2√v_j(X_j).

The AIC statistic is slightly more generous than the likelihood-ratio test (deviance test). Both sbp and obesity are included in this model, while

5.2 Piecewise Polynomials and Splines 147

FIGURE 5.4. Fitted natural-spline functions for each of the terms in the final model selected by the stepwise procedure. Included are pointwise standard-error bands. The rug plot at the base of each figure indicates the location of each of the sample values for that variable (jittered to break ties). [Panels: f̂(sbp), f̂(tobacco), f̂(ldl), f̂(famhist) (Absent/Present), f̂(obesity), f̂(age).]

148 5. Basis Expansions and Regularization

TABLE 5.1. Final logistic regression model, after stepwise deletion of natural spline terms. The column labeled "LRT" is the likelihood-ratio test statistic when that term is deleted from the model, and is the change in deviance from the full model (labeled "none").

Terms      Df   Deviance   AIC   LRT   P-value
none
sbp
tobacco
ldl
famhist
obesity
age

they were not in the linear model. The figure explains why, since their contributions are inherently nonlinear. These effects at first may come as a surprise, but an explanation lies in the nature of the retrospective data. These measurements were made sometime after the patients suffered a heart attack, and in many cases they had already benefited from a healthier diet and lifestyle, hence the apparent increase in risk at low values for obesity and sbp. Table 5.1 shows a summary of the selected model.

5.2.3 Example: Phoneme Recognition

In this example we use splines to reduce flexibility rather than increase it; the application comes under the general heading of functional modeling. In the top panel of Figure 5.5 are displayed a sample of 15 log-periodograms for each of the two phonemes "aa" and "ao" measured at 256 frequencies. The goal is to use such data to classify a spoken phoneme. These two phonemes were chosen because they are difficult to separate.

The input feature is a vector x of length 256, which we can think of as a vector of evaluations of a function X(f) over a grid of frequencies f. In reality there is a continuous analog signal which is a function of frequency, and we have a sampled version of it.

The gray lines in the lower panel of Figure 5.5 show the coefficients of a linear logistic regression model fit by maximum likelihood to a training sample of 1000 drawn from the total of 695 aa's and 1022 ao's. The coefficients are also plotted as a function of frequency, and in fact we can think of the model in terms of its continuous counterpart

log [Pr(aa|X) / Pr(ao|X)] = ∫ X(f)β(f) df,    (5.7)

5.2 Piecewise Polynomials and Splines 149

FIGURE 5.5. The top panel displays the log-periodogram as a function of frequency for 15 examples each of the phonemes "aa" and "ao", sampled from a total of 695 aa's and 1022 ao's. Each log-periodogram is measured at 256 uniformly spaced frequencies. The lower panel shows the coefficients (as a function of frequency) of a logistic regression fit to the data by maximum likelihood, using the 256 log-periodogram values as inputs. The coefficients are restricted to be smooth in the red curve, and are unrestricted in the jagged gray curve.

150 5. Basis Expansions and Regularization

which we approximate by

Σ_{j=1}^{256} X(f_j)β(f_j) = Σ_{j=1}^{256} x_j β_j.    (5.8)

The coefficients compute a contrast functional, and will have appreciable values in regions of frequency where the log-periodograms differ between the two classes.

The gray curves are very rough. Since the input signals have fairly strong positive autocorrelation, this results in negative autocorrelation in the coefficients. In addition the sample size effectively provides only four observations per coefficient.

Applications such as this permit a natural regularization. We force the coefficients to vary smoothly as a function of frequency. The red curve in the lower panel of Figure 5.5 shows such a smooth coefficient curve fit to these data. We see that the lower frequencies offer the most discriminatory power. Not only does the smoothing allow easier interpretation of the contrast, it also produces a more accurate classifier:

                 Raw    Regularized
Training error
Test error

The smooth red curve was obtained through a very simple use of natural cubic splines. We can represent the coefficient function as an expansion of splines β(f) = Σ_{m=1}^M h_m(f)θ_m. In practice this means that β = Hθ where H is a p × M basis matrix of natural cubic splines, defined on the set of frequencies. Here we used M = 12 basis functions, with knots uniformly placed over the integers 1, 2, ..., 256 representing the frequencies. Since xᵀβ = xᵀHθ, we can simply replace the input features x by their filtered versions x* = Hᵀx, and fit θ by linear logistic regression on the x*. The red curve is thus β̂(f) = h(f)ᵀθ̂.

5.3 Filtering and Feature Extraction

In the previous example, we constructed a p × M basis matrix H, and then transformed our features x into new features x* = Hᵀx. These filtered versions of the features were then used as inputs into a learning procedure: in the previous example, this was linear logistic regression.

Preprocessing of high-dimensional features is a very general and powerful method for improving the performance of a learning algorithm.
The preprocessing need not be linear as it was above, but can be a general

5.4 Smoothing Splines 151

(nonlinear) function of the form x* = g(x). The derived features x* can then be used as inputs into any (linear or nonlinear) learning procedure.

For example, for signal or image recognition a popular approach is to first transform the raw features via a wavelet transform x* = Hᵀx (Section 5.9) and then use the features x* as inputs into a neural network (Chapter 11). Wavelets are effective in capturing discrete jumps or edges, and the neural network is a powerful tool for constructing nonlinear functions of these features for predicting the target variable. By using domain knowledge to construct appropriate features, one can often improve upon a learning method that has only the raw features x at its disposal.

5.4 Smoothing Splines

Here we discuss a spline basis method that avoids the knot selection problem completely by using a maximal set of knots. The complexity of the fit is controlled by regularization. Consider the following problem: among all functions f(x) with two continuous derivatives, find one that minimizes the penalized residual sum of squares

RSS(f, λ) = Σ_{i=1}^N {y_i − f(x_i)}² + λ ∫ {f''(t)}² dt,    (5.9)

where λ is a fixed smoothing parameter. The first term measures closeness to the data, while the second term penalizes curvature in the function, and λ establishes a tradeoff between the two. Two special cases are:

λ = 0: f can be any function that interpolates the data.
λ = ∞: the simple least squares line fit, since no second derivative can be tolerated.

These vary from very rough to very smooth, and the hope is that λ ∈ (0, ∞) indexes an interesting class of functions in between.

The criterion (5.9) is defined on an infinite-dimensional function space: in fact, a Sobolev space of functions for which the second term is defined. Remarkably, it can be shown that (5.9) has an explicit, finite-dimensional, unique minimizer which is a natural cubic spline with knots at the unique values of the x_i, i = 1,...,N (Exercise 5.7).
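Once a basis for the solution is chosen, the penalized criterion becomes a finite-dimensional generalized ridge regression (this is spelled out in (5.11)-(5.12)). The Python sketch below shows the mechanics on a deliberately tiny stand-in: a three-function polynomial basis on [0,1] rather than the natural-spline basis, with the penalty matrix Ω obtained by integrating products of second derivatives. As λ grows, the only curvature-carrying coefficient is shrunk away, leaving the least squares line:

```python
# Generalized ridge regression theta = (B'B + lam*Omega)^{-1} B'y on a toy
# polynomial basis {1, x, x^2}. Data and basis are illustrative, not from the text.

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(xs, ys, lam):
    B = [[1.0, x, x * x] for x in xs]
    # Omega_jk = integral of N_j'' N_k'' over [0,1]; only the quadratic
    # basis function has curvature, so a single nonzero entry remains.
    Omega = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
    n = len(xs)
    A = [[sum(B[i][j] * B[i][k] for i in range(n)) + lam * Omega[j][k]
          for k in range(3)] for j in range(3)]
    rhs = [sum(B[i][j] * ys[i] for i in range(n)) for j in range(3)]
    return solve(A, rhs)

xs = [i / 19 for i in range(20)]
ys = [x + 0.5 * x * x for x in xs]       # noiseless quadratic target
theta0 = fit(xs, ys, 0.0)                # lam = 0: interpolating-quality fit
theta_big = fit(xs, ys, 1e8)             # lam large: curvature suppressed
```

With λ = 0 the quadratic is recovered exactly; with a huge λ the x² coefficient is driven to (essentially) zero, mirroring the λ = ∞ special case above.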
At face value it seems that the family is still over-parametrized, since there are as many as N knots, which implies N degrees of freedom. However, the penalty term translates to a penalty on the spline coefficients, which are shrunk some of the way toward the linear fit. Since the solution is a natural spline, we can write it as

f(x) = Σ_{j=1}^N N_j(x) θ_j,    (5.10)

152 5. Basis Expansions and Regularization

FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.

where the N_j(x) are an N-dimensional set of basis functions for representing this family of natural splines (Section 5.2.1 and Exercise 5.4). The criterion thus reduces to

RSS(θ, λ) = (y − Nθ)ᵀ(y − Nθ) + λθᵀΩ_N θ,    (5.11)

where {N}_ij = N_j(x_i) and {Ω_N}_jk = ∫ N_j''(t)N_k''(t) dt. The solution is easily seen to be

θ̂ = (NᵀN + λΩ_N)⁻¹Nᵀy,    (5.12)

a generalized ridge regression. The fitted smoothing spline is given by

f̂(x) = Σ_{j=1}^N N_j(x) θ̂_j.    (5.13)

Efficient computational techniques for smoothing splines are discussed in the Appendix to this chapter.

Figure 5.6 shows a smoothing spline fit to some data on bone mineral density (BMD) in adolescents. The response is relative change in spinal BMD over two consecutive visits, typically about one year apart. The data are color coded by gender, and two separate curves were fit. This simple

5.4 Smoothing Splines 153

summary reinforces the evidence in the data that the growth spurt for females precedes that for males by about two years. In both cases the smoothing parameter λ was approximately 0.00022; this choice is discussed in the next section.

5.4.1 Degrees of Freedom and Smoother Matrices

We have not yet indicated how λ is chosen for the smoothing spline. Later in this chapter we describe automatic methods using techniques such as cross-validation. In this section we discuss intuitive ways of prespecifying the amount of smoothing.

A smoothing spline with prechosen λ is an example of a linear smoother (as in linear operator). This is because the estimated parameters in (5.12) are a linear combination of the y_i. Denote by f̂ the N-vector of fitted values f̂(x_i) at the training predictors x_i. Then

f̂ = N(NᵀN + λΩ_N)⁻¹Nᵀy = S_λ y.    (5.14)

Again the fit is linear in y, and the finite linear operator S_λ is known as the smoother matrix. One consequence of this linearity is that the recipe for producing f̂ from y does not depend on y itself; S_λ depends only on the x_i and λ.

Linear operators are familiar in more traditional least squares fitting as well. Suppose B_ξ is an N × M matrix of M cubic-spline basis functions evaluated at the N training points x_i, with knot sequence ξ, and M ≪ N. Then the vector of fitted spline values is given by

f̂ = B_ξ(B_ξᵀB_ξ)⁻¹B_ξᵀy = H_ξ y.    (5.15)

Here the linear operator H_ξ is a projection operator, also known as the hat matrix in statistics. There are some important similarities and differences between H_ξ and S_λ:

Both are symmetric, positive semidefinite matrices.

H_ξ H_ξ = H_ξ (idempotent), while S_λ S_λ ⪯ S_λ, meaning that the right-hand side exceeds the left-hand side by a positive semidefinite matrix. This is a consequence of the shrinking nature of S_λ, which we discuss further below.

H_ξ has rank M, while S_λ has rank N.

The expression M = trace(H_ξ) gives the dimension of the projection space, which is also the number of basis functions, and hence the number of parameters involved in the fit.
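The bookkeeping for projection smoothers can be checked numerically: trace(H_ξ) recovers M exactly, and H_ξ is idempotent. A Python sketch with a toy three-function polynomial basis (a hand-rolled Gaussian elimination stands in for a proper linear-algebra library; the data points are illustrative):

```python
# Build the hat matrix H = B (B'B)^{-1} B' for a small least squares fit
# and verify trace(H) = M and H H = H.

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

xs = [i / 9 for i in range(10)]
B = [[1.0, x, x * x] for x in xs]        # N = 10 points, M = 3 basis functions
N, Mb = len(B), len(B[0])

BtB = [[sum(B[i][j] * B[i][k] for i in range(N)) for k in range(Mb)]
       for j in range(Mb)]
# Invert B'B column by column, then assemble H = B (B'B)^{-1} B'.
inv_cols = [solve(BtB, [1.0 if r == c else 0.0 for r in range(Mb)])
            for c in range(Mb)]
BtB_inv = [[inv_cols[c][r] for c in range(Mb)] for r in range(Mb)]

H = [[sum(B[i][j] * BtB_inv[j][l] * B[k][l]
          for j in range(Mb) for l in range(Mb))
      for k in range(N)] for i in range(N)]

trace_H = sum(H[i][i] for i in range(N))
idem_dev = max(abs(sum(H[i][k] * H[k][j] for k in range(N)) - H[i][j])
               for i in range(N) for j in range(N))
```

A smoother matrix S_λ built from (5.14) would pass neither test: its trace falls strictly between 2 and N, and S_λ S_λ sits strictly below S_λ.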
By analogy we define the effective degrees of

154 5. Basis Expansions and Regularization

freedom of a smoothing spline to be

df_λ = trace(S_λ),    (5.16)

the sum of the diagonal elements of S_λ. This very useful definition allows us a more intuitive way to parameterize the smoothing spline, and indeed many other smoothers as well, in a consistent fashion. For example, in Figure 5.6 we specified df_λ = 12 for each of the curves, and the corresponding λ was derived numerically by solving trace(S_λ) = 12. There are many arguments supporting this definition of degrees of freedom, and we cover some of them here.

Since S_λ is symmetric (and positive semidefinite), it has a real eigen-decomposition. Before we proceed, it is convenient to rewrite S_λ in the Reinsch form

S_λ = (I + λK)⁻¹,    (5.17)

where K does not depend on λ (Exercise 5.9). Since f̂ = S_λ y solves

min_f (y − f)ᵀ(y − f) + λfᵀKf,    (5.18)

K is known as the penalty matrix, and indeed a quadratic form in K has a representation in terms of a weighted sum of squared (divided) second differences. The eigen-decomposition of S_λ is

S_λ = Σ_{k=1}^N ρ_k(λ) u_k u_kᵀ    (5.19)

with

ρ_k(λ) = 1 / (1 + λd_k),    (5.20)

and d_k the corresponding eigenvalue of K. Figure 5.7 (top) shows the results of applying a cubic smoothing spline to some air pollution data (128 observations). Two fits are given: a smoother fit corresponding to a larger penalty λ and a rougher fit for a smaller penalty. The lower panels represent the eigenvalues (lower left) and some eigenvectors (lower right) of the corresponding smoother matrices. Some of the highlights of the eigenrepresentation are the following:

The eigenvectors are not affected by changes in λ, and hence the whole family of smoothing splines (for a particular sequence x) indexed by λ have the same eigenvectors.

S_λ y = Σ_{k=1}^N u_k ρ_k(λ) ⟨u_k, y⟩, and hence the smoothing spline operates by decomposing y w.r.t. the (complete) basis {u_k}, and differentially shrinking the contributions using ρ_k(λ). This is to be contrasted with a basis-regression method, where the components are

5.4 Smoothing Splines 155

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by df_λ = trace(S_λ). (Lower left:) First 25 eigenvalues for the two smoothing-spline matrices. The first two are exactly 1, and all are ≥ 0. (Lower right:) Third to sixth eigenvectors of the spline smoother matrices. In each case, u_k is plotted against x, and as such is viewed as a function of x. The rug at the base of the plots indicate the occurrence of data points. The damped functions represent the smoothed versions of these functions (using the 5 df smoother).

156 5. Basis Expansions and Regularization

either left alone, or shrunk to zero; that is, a projection matrix such as H_ξ above has M eigenvalues equal to 1, and the rest are 0. For this reason smoothing splines are referred to as shrinking smoothers, while regression splines are projection smoothers (see Figure 3.17 on page 80).

The sequence of u_k, ordered by decreasing ρ_k(λ), appear to increase in complexity. Indeed, they have the zero-crossing behavior of polynomials of increasing degree. Since S_λ u_k = ρ_k(λ)u_k, we see how each of the eigenvectors themselves are shrunk by the smoothing spline: the higher the complexity, the more they are shrunk. If the domain of X is periodic, then the u_k are sines and cosines at different frequencies.

The first two eigenvalues are always one, and they correspond to the two-dimensional eigenspace of functions linear in x (Exercise 5.11), which are never shrunk.

The eigenvalues ρ_k(λ) = 1/(1 + λd_k) are an inverse function of the eigenvalues d_k of the penalty matrix K, moderated by λ; λ controls the rate at which the ρ_k(λ) decrease to zero. d_1 = d_2 = 0 and again linear functions are not penalized.

One can reparametrize the smoothing spline using the basis vectors u_k (the Demmler-Reinsch basis). In this case the smoothing spline solves

min_θ ||y − Uθ||² + λθᵀDθ,    (5.21)

where U has columns u_k and D is a diagonal matrix with elements d_k.

df_λ = trace(S_λ) = Σ_{k=1}^N ρ_k(λ). For projection smoothers, all the eigenvalues are 1, each one corresponding to a dimension of the projection subspace.

Figure 5.8 depicts a smoothing spline matrix, with the rows ordered with x. The banded nature of this representation suggests that a smoothing spline is a local fitting method, much like the locally weighted regression procedures in Chapter 6. The right panel shows in detail selected rows of S, which we call the equivalent kernels.

As λ → 0, df_λ → N, and S_λ → I, the N-dimensional identity matrix. As λ → ∞, df_λ → 2, and S_λ → H, the hat matrix for linear regression on x.
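The shrinking behavior above is easy to see directly from ρ_k(λ) = 1/(1 + λd_k). In the Python sketch below the penalty eigenvalues d_k are invented for illustration; the first two are zero, so the linear components are never shrunk, and df_λ = Σ_k ρ_k(λ) falls from N toward 2 as λ grows:

```python
# Shrinkage factors rho_k(lambda) = 1/(1 + lambda*d_k) for illustrative
# penalty eigenvalues d_k (the first two are 0: linear functions unpenalized).
d = [0.0, 0.0, 1.0, 4.0, 9.0, 25.0, 60.0]

def rho(lam):
    return [1.0 / (1.0 + lam * dk) for dk in d]

def df(lam):
    # Effective degrees of freedom: trace(S_lambda) = sum of the rho_k.
    return sum(rho(lam))

shrink = rho(0.5)
```

Checking the limits: df(0) equals N (the identity smoother), while for huge λ only the two unpenalized linear components survive.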
5.5 Automatic Selection of the Smoothing Parameters

The smoothing parameters for regression splines encompass the degree of the splines, and the number and placement of the knots. For smoothing

5.5 Automatic Selection of the Smoothing Parameters 157

FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel represents the elements of S as an image. The right panel shows the equivalent kernel or weighting function in detail for the indicated rows. [Rows shown: 12, 25, 50, 75, 100, 115.]

158 5. Basis Expansions and Regularization

splines, we have only the penalty parameter λ to select, since the knots are at all the unique training X's, and cubic degree is almost always used in practice. Selecting the placement and number of knots for regression splines can be a combinatorially complex task, unless some simplifications are enforced. The MARS procedure in Chapter 9 uses a greedy algorithm with some additional approximations to achieve a practical compromise. We will not discuss this further here.

5.5.1 Fixing the Degrees of Freedom

Since df_λ = trace(S_λ) is monotone in λ for smoothing splines, we can invert the relationship and specify λ by fixing df. In practice this can be achieved by simple numerical methods. So, for example, in R one can use smooth.spline(x,y,df=6) to specify the amount of smoothing. This encourages a more traditional mode of model selection, where we might try a couple of different values of df, and select one based on approximate F-tests, residual plots and other more subjective criteria. Using df in this way provides a uniform approach to compare many different smoothing methods. It is particularly useful in generalized additive models (Chapter 9), where several smoothing methods can be simultaneously used in one model.

5.5.2 The Bias-Variance Tradeoff

Figure 5.9 shows the effect of the choice of df_λ when using a smoothing spline on a simple example:

Y = f(X) + ε,   f(X) = sin(12(X + 0.2)) / (X + 0.2),    (5.22)

with X ~ U[0, 1] and ε ~ N(0, 1). Our training sample consists of N = 100 pairs x_i, y_i drawn independently from this model.

The fitted splines for three different values of df_λ are shown. The yellow shaded region in the figure represents the pointwise standard error of f̂_λ, that is, we have shaded the region between f̂_λ(x) ± 2·se(f̂_λ(x)). Since f̂ = S_λ y,

Cov(f̂) = S_λ Cov(y) S_λᵀ = S_λ S_λᵀ.    (5.23)

The diagonal contains the pointwise variances at the training x_i. The bias is given by

Bias(f̂) = f − E(f̂) = f − S_λ f,    (5.24)

5.5 Automatic Selection of the Smoothing Parameters 159

FIGURE 5.9. The top left panel shows the EPE(λ) and CV(λ) curves for a realization from a nonlinear additive error model (5.22). The remaining panels show the data, the true functions (in purple), and the fitted curves (in green) with yellow shaded ±2 standard error bands, for three different values of df_λ: 5, 9 and 15.

160 5. Basis Expansions and Regularization

where f is the (unknown) vector of evaluations of the true f at the training X's. The expectations and variances are with respect to repeated draws of samples of size N = 100 from the model (5.22). In a similar fashion Var(f̂_λ(x_0)) and Bias(f̂_λ(x_0)) can be computed at any point x_0 (Exercise 5.10). The three fits displayed in the figure give a visual demonstration of the bias-variance tradeoff associated with selecting the smoothing parameter.

df_λ = 5: The spline under fits, and clearly trims down the hills and fills in the valleys. This leads to a bias that is most dramatic in regions of high curvature. The standard error band is very narrow, so we estimate a badly biased version of the true function with great reliability!

df_λ = 9: Here the fitted function is close to the true function, although a slight amount of bias seems evident. The variance has not increased appreciably.

df_λ = 15: The fitted function is somewhat wiggly, but close to the true function. The wiggliness also accounts for the increased width of the standard error bands: the curve is starting to follow some individual points too closely.

Note that in these figures we are seeing a single realization of data and hence fitted spline f̂ in each case, while the bias involves an expectation E(f̂). We leave it as an exercise (5.10) to compute similar figures where the bias is shown as well. The middle curve seems "just right," in that it has achieved a good compromise between bias and variance.

The integrated squared prediction error (EPE) combines both bias and variance in a single summary:

EPE(f̂_λ) = E(Y − f̂_λ(X))²
          = Var(Y) + E[Bias²(f̂_λ(X)) + Var(f̂_λ(X))]
          = σ² + MSE(f̂_λ).    (5.25)

Note that this is averaged both over the training sample (giving rise to f̂_λ), and the values of the (independently chosen) prediction points (X, Y). EPE is a natural quantity of interest, and does create a tradeoff between bias and variance. The blue points in the top left panel of Figure 5.9 suggest that df_λ = 9 is spot on!
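The recipe of Section 5.5.1, inverting the monotone map λ ↦ df_λ = trace(S_λ) to hit a chosen df such as the values 5, 9 and 15 above, can be sketched with simple bisection. The penalty eigenvalues d_k in this Python sketch are invented for illustration:

```python
# Find lambda giving a target effective degrees of freedom by bisection.
# df(lambda) = sum_k 1/(1 + lambda*d_k) decreases monotonically from
# len(d) at lambda -> 0 down to 2 (the two zero eigenvalues) at lambda -> inf.
d = [0.0, 0.0, 0.8, 3.0, 7.5, 18.0, 40.0, 95.0]   # illustrative eigenvalues

def df(lam):
    return sum(1.0 / (1.0 + lam * dk) for dk in d)

def lambda_for_df(target, lo=1e-8, hi=1e8):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if df(mid) > target:
            lo = mid          # fit too rough: need more smoothing
        else:
            hi = mid          # fit too smooth: back off
    return 0.5 * (lo + hi)

lam6 = lambda_for_df(6.0)
```

This mirrors what a call like smooth.spline(x, y, df=6) in R has to do internally: a one-dimensional root-find on the trace.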
Since we don't know the true function, we do not have access to EPE, and need an estimate. This topic is discussed in some detail in Chapter 7, and techniques such as K-fold cross-validation, GCV and C_p are all in common use. In Figure 5.9 we include the N-fold (leave-one-out) cross-validation curve:

5.6 Nonparametric Logistic Regression 161

CV(f̂_λ) = (1/N) Σ_{i=1}^N (y_i − f̂_λ^{(−i)}(x_i))²    (5.26)
         = (1/N) Σ_{i=1}^N [ (y_i − f̂_λ(x_i)) / (1 − S_λ(i,i)) ]²,    (5.27)

which can (remarkably) be computed for each value of λ from the original fitted values and the diagonal elements S_λ(i,i) of S_λ (Exercise 5.13).

The EPE and CV curves have a similar shape, but the entire CV curve is above the EPE curve. For some realizations this is reversed, and overall the CV curve is approximately unbiased as an estimate of the EPE curve.

5.6 Nonparametric Logistic Regression

The smoothing spline problem (5.9) in Section 5.4 is posed in a regression setting. It is typically straightforward to transfer this technology to other domains. Here we consider logistic regression with a single quantitative input X. The model is

log [Pr(Y = 1|X = x) / Pr(Y = 0|X = x)] = f(x),    (5.28)

which implies

Pr(Y = 1|X = x) = e^{f(x)} / (1 + e^{f(x)}).    (5.29)

Fitting f(x) in a smooth fashion leads to a smooth estimate of the conditional probability Pr(Y = 1|x), which can be used for classification or risk scoring.

We construct the penalized log-likelihood criterion

l(f; λ) = Σ_{i=1}^N [y_i log p(x_i) + (1 − y_i) log(1 − p(x_i))] − (1/2)λ ∫ {f''(t)}² dt
        = Σ_{i=1}^N [y_i f(x_i) − log(1 + e^{f(x_i)})] − (1/2)λ ∫ {f''(t)}² dt,    (5.30)

where we have abbreviated p(x) = Pr(Y = 1|x). The first term in this expression is the log-likelihood based on the binomial distribution (c.f. Chapter 4, page 120). Arguments similar to those used in Section 5.4 show that the optimal f is a finite-dimensional natural spline with knots at the unique

162 5. Basis Expansions and Regularization

values of x. This means that we can represent f(x) = Σ_{j=1}^N N_j(x)θ_j. We compute the first and second derivatives

∂l(θ)/∂θ = Nᵀ(y − p) − λΩθ,    (5.31)
∂²l(θ)/∂θ∂θᵀ = −NᵀWN − λΩ,    (5.32)

where p is the N-vector with elements p(x_i), and W is a diagonal matrix of weights p(x_i)(1 − p(x_i)). The first derivative (5.31) is nonlinear in θ, so we need to use an iterative algorithm as in Section 4.4.1. Using Newton-Raphson as in (4.23) and (4.26) for linear logistic regression, the update equation can be written

θ^new = (NᵀWN + λΩ)⁻¹NᵀW(Nθ^old + W⁻¹(y − p))
      = (NᵀWN + λΩ)⁻¹NᵀWz.    (5.33)

We can also express this update in terms of the fitted values

f^new = N(NᵀWN + λΩ)⁻¹NᵀW(f^old + W⁻¹(y − p))
      = S_{λ,w} z.    (5.34)

Referring back to (5.12) and (5.14), we see that the update fits a weighted smoothing spline to the working response z (Exercise 5.12).

The form of (5.34) is suggestive. It is tempting to replace S_{λ,w} by any nonparametric (weighted) regression operator, and obtain general families of nonparametric logistic regression models. Although here x is one-dimensional, this procedure generalizes naturally to higher-dimensional x. These extensions are at the heart of generalized additive models, which we pursue in Chapter 9.

5.7 Multidimensional Splines

So far we have focused on one-dimensional spline models. Each of the approaches have multidimensional analogs. Suppose X ∈ IR², and we have a basis of functions h_{1k}(X_1), k = 1,...,M_1 for representing functions of coordinate X_1, and likewise a set of M_2 functions h_{2k}(X_2) for coordinate X_2. Then the M_1 × M_2 dimensional tensor product basis defined by

g_{jk}(X) = h_{1j}(X_1) h_{2k}(X_2),   j = 1,...,M_1,  k = 1,...,M_2    (5.35)

can be used for representing a two-dimensional function:

g(X) = Σ_{j=1}^{M_1} Σ_{k=1}^{M_2} θ_{jk} g_{jk}(X).    (5.36)
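The construction (5.35) is entirely mechanical: every pairwise product of the two marginal bases. A Python sketch, using toy linear-spline marginals with one interior knot each (knot locations invented for illustration):

```python
# Tensor product basis g_jk(X) = h_{1j}(X1) * h_{2k}(X2), as in (5.35).
# The marginal bases here are truncated-power linear splines, one knot each.

def pos(u):
    return u if u > 0 else 0.0

h1 = [lambda x: 1.0, lambda x: x, lambda x: pos(x - 0.5)]   # M1 = 3
h2 = [lambda x: 1.0, lambda x: x, lambda x: pos(x - 0.3)]   # M2 = 3

def tensor_basis(x1, x2):
    # All M1*M2 products; a 2-D function is a linear combination of these.
    return [hj(x1) * hk(x2) for hj in h1 for hk in h2]

g = tensor_basis(0.8, 0.6)   # evaluated at an arbitrary point
```

The dimension M_1·M_2 already hints at the exponential growth in d dimensions discussed next.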

5.7 Multidimensional Splines 163

FIGURE 5.10. A tensor product basis of B-splines, showing some selected pairs. Each two-dimensional function is the tensor product of the corresponding one-dimensional marginals.

Figure 5.10 illustrates a tensor product basis using B-splines. The coefficients can be fit by least squares, as before. This can be generalized to d dimensions, but note that the dimension of the basis grows exponentially fast, yet another manifestation of the curse of dimensionality. The MARS procedure discussed in Chapter 9 is a greedy forward algorithm for including only those tensor products that are deemed necessary by least squares.

Figure 5.11 illustrates the difference between additive and tensor product (natural) splines on the simulated classification example from Chapter 2. A logistic regression model logit[Pr(T|x)] = h(x)ᵀθ is fit to the binary response T, and the estimated decision boundary is the contour h(x)ᵀθ̂ = 0. The tensor product basis can achieve more flexibility at the decision boundary, but introduces some spurious structure along the way.

164 5. Basis Expansions and Regularization

FIGURE 5.11. The simulation example of Figure 2.1. The upper panel (Additive Natural Cubic Splines - 4 df each) shows the decision boundary of an additive logistic regression model, using natural splines in each of the two coordinates (total df = 1 + (4 − 1) + (4 − 1) = 7). Training Error: 0.23; Test Error: 0.28; Bayes Error: 0.21. The lower panel (Natural Cubic Splines - Tensor Product - 4 df each) shows the results of using a tensor product of natural spline bases in each coordinate (total df = 4 × 4 = 16). Training Error: 0.230; Test Error: 0.282; Bayes Error: 0.210. The broken purple boundary is the Bayes decision boundary for this problem.

5.7 Multidimensional Splines 165

One-dimensional smoothing splines (via regularization) generalize to higher dimensions as well. Suppose we have pairs y_i, x_i with x_i ∈ IR^d, and we seek a d-dimensional regression function f(x). The idea is to set up the problem

min_f Σ_{i=1}^N {y_i − f(x_i)}² + λJ[f],    (5.37)

where J is an appropriate penalty functional for stabilizing a function f in IR^d. For example, a natural generalization of the one-dimensional roughness penalty (5.9) for functions on IR² is

J[f] = ∫∫_{IR²} [ (∂²f(x)/∂x_1²)² + 2(∂²f(x)/∂x_1∂x_2)² + (∂²f(x)/∂x_2²)² ] dx_1 dx_2.    (5.38)

Optimizing (5.37) with this penalty leads to a smooth two-dimensional surface, known as a thin-plate spline. It shares many properties with the one-dimensional cubic smoothing spline:

as λ → 0, the solution approaches an interpolating function [the one with smallest penalty (5.38)];

as λ → ∞, the solution approaches the least squares plane;

for intermediate values of λ, the solution can be represented as a linear expansion of basis functions, whose coefficients are obtained by a form of generalized ridge regression.

The solution has the form

f(x) = β_0 + βᵀx + Σ_{j=1}^N α_j h_j(x),    (5.39)

where h_j(x) = ||x − x_j||² log||x − x_j||. These h_j are examples of radial basis functions, which are discussed in more detail in the next section. The coefficients are found by plugging (5.39) into (5.37), which reduces to a finite-dimensional penalized least squares problem. For the penalty to be finite, the coefficients α_j have to satisfy a set of linear constraints; see Exercise 5.14.

Thin-plate splines are defined more generally for arbitrary dimension d, for which an appropriately more general J is used.

There are a number of hybrid approaches that are popular in practice, both for computational and conceptual simplicity. Unlike one-dimensional smoothing splines, the computational complexity for thin-plate splines is O(N³), since there is not in general any sparse structure that can be exploited. However, as with univariate smoothing splines, we can get away with substantially less than the N knots prescribed by the solution (5.39).
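The radial basis function in (5.39) is simple to implement; the only care needed is the removable singularity at r = 0, where r² log r → 0. A minimal Python sketch (the evaluation points are arbitrary):

```python
import math

# Thin-plate spline radial basis h_j(x) = ||x - x_j||^2 log ||x - x_j||,
# with the conventional value 0 at r = 0 (the singularity is removable).

def tps_basis(x, knot):
    r2 = sum((a - b) ** 2 for a, b in zip(x, knot))
    if r2 == 0.0:
        return 0.0
    # r^2 log r = (1/2) r^2 log r^2, which avoids a square root.
    return 0.5 * r2 * math.log(r2)

val = tps_basis((1.0, 1.0), (1.0, 1.0 + math.e))   # distance r = e
```

Writing the function in terms of r² is a common trick: it sidesteps the square root and keeps the expression smooth away from the knot.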

166 5. Basis Expansions and Regularization

FIGURE 5.12. A thin-plate spline fit to the heart disease data, displayed as a contour plot. The response is systolic blood pressure, modeled as a function of age and obesity. The data points are indicated, as well as the lattice of points used as knots. Care should be taken to use knots from the lattice inside the convex hull of the data (red), and ignore those outside (green).

In practice, it is usually sufficient to work with a lattice of knots covering the domain. The penalty is computed for the reduced expansion just as before. Using K knots reduces the computations to O(NK² + K³). Figure 5.12 shows the result of fitting a thin-plate spline to some heart disease risk factors, representing the surface as a contour plot. Indicated are the location of the input features, as well as the knots used in the fit. Note that λ was specified via df_λ = trace(S_λ) = 15.

More generally one can represent f ∈ IR^d as an expansion in any arbitrarily large collection of basis functions, and control the complexity by applying a regularizer such as (5.38). For example, we could construct a basis by forming the tensor products of all pairs of univariate smoothing-spline basis functions as in (5.35), using, for example, the univariate B-splines recommended in Section 5.9 as ingredients. This leads to an exponential

growth in basis functions as the dimension increases, and typically we have to reduce the number of functions per coordinate accordingly.

The additive spline models discussed in Chapter 9 are a restricted class of multidimensional splines. They can be represented in this general formulation as well; that is, there exists a penalty J[f] that guarantees that the solution has the form f(X) = α + f₁(X₁) + ⋯ + f_d(X_d) and that each of the functions f_j are univariate splines. In this case the penalty is somewhat degenerate, and it is more natural to assume that f is additive, and then simply impose an additional penalty on each of the component functions:

J[f] = J(f₁ + f₂ + ⋯ + f_d)
     = Σ_{j=1}^d ∫ f_j″(t_j)² dt_j.   (5.40)

These are naturally extended to ANOVA spline decompositions,

f(X) = α + Σ_j f_j(X_j) + Σ_{j<k} f_{jk}(X_j, X_k) + ⋯,   (5.41)

where each of the components are splines of the required dimension. There are many choices to be made:

- The maximum order of interaction: we have shown up to order 2 above.
- Which terms to include: not all main effects and interactions are necessarily needed.
- What representation to use. Some choices are: regression splines with a relatively small number of basis functions per coordinate, and their tensor products for interactions; or a complete basis as in smoothing splines, including appropriate regularizers for each term in the expansion.

In many cases when the number of potential dimensions (features) is large, automatic methods are more desirable. The MARS and MART procedures (Chapters 9 and 10, respectively) both fall into this category.

5.8 Regularization and Reproducing Kernel Hilbert Spaces

In this section we cast splines into the larger context of regularization methods and reproducing kernel Hilbert spaces. This section is quite technical and can be skipped by the disinterested or intimidated reader.

A general class of regularization problems has the form

min_{f∈H} [ Σ_{i=1}^N L(y_i, f(x_i)) + λJ(f) ],   (5.42)

where L(y, f(x)) is a loss function, J(f) is a penalty functional, and H is a space of functions on which J(f) is defined. Girosi et al. (1995) describe quite general penalty functionals of the form

J(f) = ∫_{IR^d} |f̃(s)|² / G̃(s) ds,   (5.43)

where f̃ denotes the Fourier transform of f, and G̃ is some positive function that falls off to zero as ‖s‖ → ∞. The idea is that 1/G̃ increases the penalty for high-frequency components of f. Under some additional assumptions they show that the solutions have the form

f(X) = Σ_{k=1}^K α_k φ_k(X) + Σ_{i=1}^N θ_i G(X − x_i),   (5.44)

where the φ_k span the null space of the penalty functional J, and G is the inverse Fourier transform of G̃. Smoothing splines and thin-plate splines fall into this framework. The remarkable feature of this solution is that while the criterion (5.42) is defined over an infinite-dimensional space, the solution is finite-dimensional. In the next sections we look at some specific examples.

5.8.1 Spaces of Functions Generated by Kernels

An important subclass of problems of the form (5.42) are generated by a positive definite kernel K(x, y), and the corresponding space of functions H_K is called a reproducing kernel Hilbert space (RKHS). The penalty functional J is defined in terms of the kernel as well. We give a brief and simplified introduction to this class of models, adapted from Wahba (1990) and Girosi et al. (1995), and nicely summarized in Evgeniou et al. (2000).

Let x, y ∈ IR^p. We consider the space of functions generated by the linear span of {K(·, y), y ∈ IR^p}; i.e., arbitrary linear combinations of the form f(x) = Σ_m α_m K(x, y_m), where each kernel term is viewed as a function of the first argument, and indexed by the second. Suppose that K has an eigen-expansion

K(x, y) = Σ_{i=1}^∞ γ_i φ_i(x) φ_i(y)   (5.45)

with γ_i ≥ 0, Σ_{i=1}^∞ γ_i² < ∞. Elements of H_K have an expansion in terms of these eigen-functions,

f(x) = Σ_{i=1}^∞ c_i φ_i(x),   (5.46)

with the constraint that

‖f‖²_{H_K} := Σ_{i=1}^∞ c_i²/γ_i < ∞,   (5.47)

where ‖f‖_{H_K} is the norm induced by K. The penalty functional in (5.42) for the space H_K is defined to be the squared norm J(f) = ‖f‖²_{H_K}. The quantity J(f) can be interpreted as a generalized ridge penalty, where functions with large eigenvalues in the expansion (5.45) get penalized less, and vice versa.

Rewriting (5.42) we have

min_{f∈H_K} [ Σ_{i=1}^N L(y_i, f(x_i)) + λ‖f‖²_{H_K} ]   (5.48)

or equivalently

min_{c_j} Σ_{i=1}^N L(y_i, Σ_{j=1}^∞ c_j φ_j(x_i)) + λ Σ_{j=1}^∞ c_j²/γ_j.   (5.49)

It can be shown (Wahba, 1990; see also Exercise 5.15) that the solution to (5.48) is finite-dimensional, and has the form

f(x) = Σ_{i=1}^N α_i K(x, x_i).   (5.50)

The basis function h_i(x) = K(x, x_i) (as a function of the first argument) is known as the representer of evaluation at x_i in H_K, since for f ∈ H_K, it is easily seen that ⟨K(·, x_i), f⟩_{H_K} = f(x_i). Similarly ⟨K(·, x_i), K(·, x_j)⟩_{H_K} = K(x_i, x_j) (the reproducing property of H_K), and hence

J(f) = Σ_{i=1}^N Σ_{j=1}^N K(x_i, x_j) α_i α_j   (5.51)

for f(x) = Σ_{i=1}^N α_i K(x, x_i).

In light of (5.50) and (5.51), (5.48) reduces to a finite-dimensional criterion

min_α L(y, Kα) + λαᵀKα.   (5.52)

We are using a vector notation, in which K is the N × N matrix with ijth entry K(x_i, x_j), and so on. Simple numerical algorithms can be used to optimize (5.52). This phenomenon, whereby the infinite-dimensional problem (5.48) or (5.49) reduces to a finite-dimensional optimization problem, has been dubbed the kernel property in the literature on support-vector machines (see Chapter 12).

There is a Bayesian interpretation of this class of models, in which f is interpreted as a realization of a zero-mean stationary Gaussian process, with prior covariance function K. The eigen-decomposition produces a series of orthogonal eigen-functions φ_j(x) with associated variances γ_j. The typical scenario is that smooth functions φ_j have large prior variance, while rough φ_j have small prior variances. The penalty in (5.48) is the contribution of the prior to the joint likelihood, and penalizes more those components with smaller prior variance (compare with (5.43)).

For simplicity we have dealt with the case here where all members of H are penalized, as in (5.48). More generally, there may be some components in H that we wish to leave alone, such as the linear functions for cubic smoothing splines in Section 5.4. The multidimensional thin-plate splines of Section 5.7 and tensor product splines fall into this category as well. In these cases there is a more convenient representation H = H₀ ⊕ H₁, with the null space H₀ consisting of, for example, low degree polynomials in x that do not get penalized. The penalty becomes J(f) = ‖P₁f‖, where P₁ is the orthogonal projection of f onto H₁. The solution has the form f(x) = Σ_{j=1}^M β_j h_j(x) + Σ_{i=1}^N α_i K(x, x_i), where the first term represents an expansion in H₀. From a Bayesian perspective, the coefficients of components in H₀ have improper priors, with infinite variance.

5.8.2 Examples of RKHS

The machinery above is driven by the choice of the kernel K and the loss function L. We consider first regression using squared-error loss. In this case (5.48) specializes to penalized least squares, and the solution can be characterized in two equivalent ways corresponding to (5.49) or (5.52):

min_{c_j} Σ_{i=1}^N ( y_i − Σ_{j=1}^∞ c_j φ_j(x_i) )² + λ Σ_{j=1}^∞ c_j²/γ_j,   (5.53)

an infinite-dimensional, generalized ridge regression problem, or

min_α (y − Kα)ᵀ(y − Kα) + λαᵀKα.   (5.54)

The solution for α is obtained simply as

α̂ = (K + λI)⁻¹ y,   (5.55)

and

f̂(x) = Σ_{j=1}^N α̂_j K(x, x_j).   (5.56)
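As a concrete sketch (our own, not from the book), the finite-dimensional solution (5.55) is a single linear solve; here with a Gaussian kernel in one dimension, a choice anticipating the radial-basis example later in this section:

```python
import numpy as np

def gauss_kernel(X, Z, nu=1.0):
    """Gram matrix K[i, j] = exp(-nu * (x_i - z_j)^2) for 1-D inputs."""
    return np.exp(-nu * (X[:, None] - Z[None, :]) ** 2)

def kernel_ridge(x, y, lam, nu=1.0):
    """Solve the finite-dimensional criterion (5.54):
    alpha_hat = (K + lam I)^{-1} y, per (5.55)."""
    K = gauss_kernel(x, x, nu)
    alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

    def fhat(x0):
        # f_hat(x0) = sum_j alpha_hat_j K(x0, x_j), per (5.56)
        return gauss_kernel(np.atleast_1d(np.asarray(x0, float)), x, nu) @ alpha

    return alpha, fhat, K
```

Since K is positive semi-definite, the criterion (5.54) is convex and α̂ is its global minimizer; evaluating fhat at the training points reproduces the fitted vector Kα̂ of (5.57).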

The vector of N fitted values is given by

f̂ = Kα̂ = K(K + λI)⁻¹ y   (5.57)
       = (I + λK⁻¹)⁻¹ y.   (5.58)

The estimate (5.57) also arises as the kriging estimate of a Gaussian random field in spatial statistics (Cressie, 1993). Compare also (5.58) with the smoothing spline fit (5.17) on page 154.

Penalized Polynomial Regression

The kernel K(x, y) = (⟨x, y⟩ + 1)^d (Vapnik, 1996), for x, y ∈ IR^p, has M = (p+d choose d) eigen-functions that span the space of polynomials in IR^p of total degree d. For example, with p = 2 and d = 2, M = 6 and

K(x, y) = 1 + 2x₁y₁ + 2x₂y₂ + x₁²y₁² + x₂²y₂² + 2x₁x₂y₁y₂   (5.59)
        = Σ_{m=1}^M h_m(x) h_m(y)   (5.60)

with

h(x)ᵀ = (1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂).   (5.61)

One can represent h in terms of the M orthogonal eigen-functions and eigenvalues of K,

h(x) = V D_γ^{1/2} φ(x),   (5.62)

where D_γ = diag(γ₁, γ₂, ..., γ_M), and V is M × M and orthogonal.

Suppose we wish to solve the penalized polynomial regression problem

min_{β_m} Σ_{i=1}^N ( y_i − Σ_{m=1}^M β_m h_m(x_i) )² + λ Σ_{m=1}^M β_m².   (5.63)

Substituting (5.62) into (5.63), we get an expression of the form (5.53) to optimize (Exercise 5.16).

The number of basis functions M = (p+d choose d) can be very large, often much larger than N. Equation (5.55) tells us that if we use the kernel representation for the solution function, we have only to evaluate the kernel N² times, and can compute the solution in O(N³) operations.

This simplicity is not without implications. Each of the polynomials h_m in (5.61) inherits a scaling factor from the particular form of K, which has a bearing on the impact of the penalty in (5.63). We elaborate on this in the next section.
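The kernel identity (5.59)-(5.61) is easy to verify numerically; a small sketch (ours) comparing the kernel evaluation with the explicit inner product of feature vectors:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """K(x, y) = (<x, y> + 1)^d, the polynomial kernel of (5.59)."""
    return (x @ y + 1.0) ** d

def h(x):
    """The explicit feature vector (5.61) for p = 2, d = 2 (M = 6)."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1],
                     x[0] ** 2, x[1] ** 2, s * x[0] * x[1]])
```

For any x, y ∈ IR², poly_kernel(x, y) and h(x) @ h(y) agree, which is the point of (5.60): the kernel computes an M-dimensional inner product without ever forming the M features.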

FIGURE 5.13. Radial kernels k_m(x) for the mixture data, with scale parameter ν = 1. The kernels are centered at five points x_m chosen at random from the 200.

Gaussian Radial Basis Functions

In the preceding example, the kernel is chosen because it represents an expansion of polynomials and can conveniently compute high-dimensional inner products. In this example the kernel is chosen because of its functional form in the representation (5.50). The Gaussian kernel K(x, y) = e^{−ν‖x−y‖²} along with squared-error loss, for example, leads to a regression model that is an expansion in Gaussian radial basis functions,

k_m(x) = e^{−ν‖x−x_m‖²},  m = 1, ..., N,   (5.64)

each one centered at one of the training feature vectors x_m. The coefficients are estimated using (5.54).

Figure 5.13 illustrates radial kernels in IR¹ using the first coordinate of the mixture example from Chapter 2. We show five of the 200 kernel basis functions k_m(x) = K(x, x_m).

Figure 5.14 illustrates the implicit feature space for the radial kernel with x ∈ IR¹. We computed the kernel matrix K, and its eigen-decomposition ΦD_γΦᵀ. We can think of the columns of Φ and the corresponding eigenvalues in D_γ as empirical estimates of the eigen expansion (5.45). (The ℓth column of Φ is an estimate of φ_ℓ, evaluated at each of the N observations. Alternatively, the ith row of Φ is the estimated vector of basis functions φ(x_i), evaluated at the point x_i. Although in principle there can be infinitely many elements in φ, our estimate has at most N elements.) Although the eigenvectors are discrete, we can represent them as functions on IR (Exercise 5.17). Figure 5.15 shows the largest 50 eigenvalues of K. The leading eigenfunctions are smooth, and they are successively more wiggly as the order increases. This brings to life the penalty in (5.49), where we see the coefficients of higher-order functions get penalized more than lower-order ones.

FIGURE 5.14. (Left panel) The first 16 normalized eigenvectors of K, the kernel matrix for the first coordinate of the mixture data. These are viewed as estimates φ̂_ℓ of the eigenfunctions in (5.45), and are represented as functions in IR¹ with the observed values superimposed in color. They are arranged in rows, starting at the top left. (Right panel) Rescaled versions h_ℓ = √γ̂_ℓ φ̂_ℓ of the functions in the left panel, for which the kernel computes the inner product.

FIGURE 5.15. The largest 50 eigenvalues of K; all those beyond the 30th are effectively zero.

The right panel in Figure 5.14 shows the corresponding feature space representation of the eigenfunctions

h_ℓ(x) = √(γ̂_ℓ) φ̂_ℓ(x),  ℓ = 1, ..., N.   (5.65)

Note that ⟨h(x_i), h(x_{i′})⟩ = K(x_i, x_{i′}). The scaling by the eigenvalues quickly shrinks most of the functions down to zero, leaving an effective dimension of about 12 in this case. The corresponding optimization problem is a standard ridge regression, as in (5.63). So although in principle the implicit feature space is infinite-dimensional, the effective dimension is dramatically lower because of the relative amounts of shrinkage applied to each basis function. The kernel scale parameter ν plays a role here as well; larger ν implies more local k_m functions, and increases the effective dimension of the feature space. See Hastie and Zhu (2006) for more details.

It is also known (Girosi et al., 1995) that a thin-plate spline (Section 5.7) is an expansion in radial basis functions, generated by the kernel

K(x, y) = ‖x − y‖² log(‖x − y‖).   (5.66)

Radial basis functions are discussed in more detail in Section 6.7.

Support Vector Classifiers

The support vector machines of Chapter 12 for a two-class classification problem have the form f(x) = α₀ + Σ_{i=1}^N α_i K(x, x_i), where the parameters are chosen to minimize

min_{α₀,α} { Σ_{i=1}^N [1 − y_i f(x_i)]₊ + (λ/2) αᵀKα },   (5.67)

where y_i ∈ {−1, 1}, and [z]₊ denotes the positive part of z. This can be viewed as a quadratic optimization problem with linear constraints, and requires a quadratic programming algorithm for its solution. The name support vector arises from the fact that typically many of the α̂_i = 0 [due to the piecewise-zero nature of the loss function in (5.67)], and so f̂ is an expansion in a subset of the K(·, x_i). See Section 12.3.3 for more details.

5.9 Wavelet Smoothing

We have seen two different modes of operation with dictionaries of basis functions. With regression splines, we select a subset of the bases, using either subject-matter knowledge, or else automatically. The more adaptive procedures such as MARS (Chapter 9) can capture both smooth and nonsmooth behavior.
With smoothing splines, we use a complete basis, but then shrink the coefficients toward smoothness.

FIGURE 5.16. Some selected wavelets ψ_{j,k} at different translations and dilations for the Haar and symmlet families. The functions have been scaled to suit the display.

Wavelets typically use a complete orthonormal basis to represent functions, but then shrink and select the coefficients toward a sparse representation. Just as a smooth function can be represented by a few spline basis functions, a mostly flat function with a few isolated bumps can be represented with a few (bumpy) basis functions. Wavelet bases are very popular in signal processing and compression, since they are able to represent both smooth and/or locally bumpy functions in an efficient way, a phenomenon dubbed time and frequency localization. In contrast, the traditional Fourier basis allows only frequency localization.

Before we give details, let's look at the Haar wavelets in the left panel of Figure 5.16 to get an intuitive idea of how wavelet smoothing works. The vertical axis indicates the scale (frequency) of the wavelets, from low scale at the bottom to high scale at the top. At each scale the wavelets are packed in side-by-side to completely fill the time axis: we have only shown

a selected subset. Wavelet smoothing fits the coefficients for this basis by least squares, and then thresholds (discards, filters) the smaller coefficients. Since there are many basis functions at each scale, it can use bases where it needs them and discard the ones it does not need, to achieve time and frequency localization. The Haar wavelets are simple to understand, but not smooth enough for most purposes. The symmlet wavelets in the right panel of Figure 5.16 have the same orthonormal properties, but are smoother.

Figure 5.17 displays an NMR (nuclear magnetic resonance) signal, which appears to be composed of smooth components and isolated spikes, plus some noise. The wavelet transform, using a symmlet basis, is shown in the lower left panel. The wavelet coefficients are arranged in rows, from lowest scale at the bottom, to highest scale at the top. The length of each line segment indicates the size of the coefficient. The bottom right panel shows the wavelet coefficients after they have been thresholded. The threshold procedure, given below in equation (5.69), is the same soft-thresholding rule that arises in the lasso procedure for linear regression (Section 3.4.2). Notice that many of the smaller coefficients have been set to zero. The green curve in the top panel shows the back-transform of the thresholded coefficients: this is the smoothed version of the original signal. In the next section we give the details of this process, including the construction of wavelets and the thresholding rule.

5.9.1 Wavelet Bases and the Wavelet Transform

In this section we give details on the construction and filtering of wavelets. Wavelet bases are generated by translations and dilations of a single scaling function φ(x) (also known as the father). The red curves in Figure 5.18 are the Haar and symmlet-8 scaling functions. The Haar basis is particularly easy to understand, especially for anyone with experience in analysis of variance or trees, since it produces a piecewise-constant representation.
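As a concrete companion to the construction that follows (our own illustration, not from the book), the discrete Haar basis on N = 2^J points can be written as an orthonormal matrix: one constant column spanning V₀, plus 2^j wavelet columns at each level j = 0, ..., J−1:

```python
import numpy as np

def haar_basis(J):
    """N x N orthonormal Haar basis matrix, N = 2^J.
    Columns: one constant (V_0), then 2^j wavelets psi_{j,k} per level j."""
    N = 2 ** J
    cols = [np.ones(N) / np.sqrt(N)]
    for j in range(J):
        block = N // 2 ** j          # support length of psi_{j,k}
        half = block // 2
        for k in range(2 ** j):
            psi = np.zeros(N)
            psi[k * block: k * block + half] = 1.0    # +1 on first half
            psi[k * block + half: (k + 1) * block] = -1.0
            cols.append(psi / np.sqrt(block))         # unit norm
    return np.column_stack(cols)
```

Counting columns gives 1 + (1 + 2 + ⋯ + 2^{J−1}) = 2^J = N, and WᵀW = I, the orthonormality noted below.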
Thus if φ(x) = I(x ∈ [0, 1]), then φ_{0,k}(x) = φ(x − k), k an integer, generates an orthonormal basis for functions with jumps at the integers. Call this reference space V₀. The dilations φ_{1,k}(x) = √2 φ(2x − k) form an orthonormal basis for a space V₁ ⊃ V₀ of functions piecewise constant on intervals of length 1/2. In fact, more generally we have ⋯ ⊃ V₁ ⊃ V₀ ⊃ V₋₁ ⊃ ⋯, where each V_j is spanned by φ_{j,k} = 2^{j/2} φ(2^j x − k).

Now to the definition of wavelets. In analysis of variance, we often represent a pair of means μ₁ and μ₂ by their grand mean μ = ½(μ₁ + μ₂), and then a contrast α = ½(μ₁ − μ₂). A simplification occurs if the contrast α is very small, because then we can set it to zero. In a similar manner we might represent a function in V_{j+1} by a component in V_j plus the component in the orthogonal complement W_j of V_j to V_{j+1}, written as V_{j+1} = V_j ⊕ W_j. The component in W_j represents detail, and we might wish to set some elements of this component to zero. It is easy to see that the functions ψ(x − k)

FIGURE 5.17. The top panel shows an NMR signal, with the wavelet-shrunk version superimposed in green. The lower left panel represents the wavelet transform of the original signal, down to V₄, using the symmlet-8 basis. Each coefficient is represented by the height (positive or negative) of the vertical bar. The lower right panel represents the wavelet coefficients after being shrunken using the waveshrink function in S-PLUS, which implements the SureShrink method of wavelet adaptation of Donoho and Johnstone.

FIGURE 5.18. The Haar and symmlet father (scaling) wavelet φ(x) and mother wavelet ψ(x).

generated by the mother wavelet ψ(x) = φ(2x) − φ(2x − 1) form an orthonormal basis for W₀ for the Haar family. Likewise ψ_{j,k} = 2^{j/2} ψ(2^j x − k) form a basis for W_j.

Now V_{j+1} = V_j ⊕ W_j = V_{j−1} ⊕ W_{j−1} ⊕ W_j, so besides representing a function by its level-j detail and level-j rough components, the latter can be broken down to level-(j−1) detail and rough, and so on. Finally we get a representation of the form V_J = V₀ ⊕ W₀ ⊕ W₁ ⊕ ⋯ ⊕ W_{J−1}. Figure 5.16 on page 175 shows particular wavelets ψ_{j,k}(x).

Notice that since these spaces are orthogonal, all the basis functions are orthonormal. In fact, if the domain is discrete with N = 2^J (time) points, this is as far as we can go. There are 2^j basis elements at level j, and adding up, we have a total of 2^J − 1 elements in the W_j, and one in V₀. This structured orthonormal basis allows for a multiresolution analysis, which we illustrate in the next section.

While helpful for understanding the construction above, the Haar basis is often too coarse for practical purposes. Fortunately, many clever wavelet bases have been invented. Figures 5.16 and 5.18 include the Daubechies symmlet-8 basis. This basis has smoother elements than the corresponding Haar basis, but there is a tradeoff: Each wavelet has a support covering 15 consecutive time intervals, rather than one for the Haar basis. More generally, the symmlet-p family has a support of 2p − 1 consecutive intervals. The wider the support, the more time the wavelet has to die to zero, and so it can

achieve this more smoothly. Note that the effective support seems to be much narrower. The symmlet-p wavelet ψ(x) has p vanishing moments; that is,

∫ ψ(x) x^j dx = 0,  j = 0, ..., p − 1.

One implication is that any order-p polynomial over the N = 2^J time points is reproduced exactly in V₀ (Exercise 5.18). In this sense V₀ is equivalent to the null space of the smoothing-spline penalty. The Haar wavelets have one vanishing moment, and V₀ can reproduce any constant function.

The symmlet-p scaling functions are one of many families of wavelet generators. The operations are similar to those for the Haar basis: If V₀ is spanned by φ(x − k), then V₁ ⊃ V₀ is spanned by φ_{1,k}(x) = √2 φ(2x − k), and φ(x) = Σ_{k∈Z} h(k) φ_{1,k}(x), for some filter coefficients h(k). W₀ is spanned by ψ(x) = Σ_{k∈Z} g(k) φ_{1,k}(x), with filter coefficients g(k) = (−1)^k h(1 − k).

5.9.2 Adaptive Wavelet Filtering

Wavelets are particularly useful when the data are measured on a uniform lattice, such as a discretized signal, image, or a time series. We will focus on the one-dimensional case, and having N = 2^J lattice-points is convenient. Suppose y is the response vector, and W is the N × N orthonormal wavelet basis matrix evaluated at the N uniformly spaced observations. Then y* = Wᵀy is called the wavelet transform of y (and is the full least squares regression coefficient). A popular method for adaptive wavelet fitting is known as SURE shrinkage (Stein Unbiased Risk Estimation, Donoho and Johnstone (1994)). We start with the criterion

min_θ ‖y − Wθ‖²₂ + 2λ‖θ‖₁,   (5.68)

which is the same as the lasso criterion in Chapter 3. Because W is orthonormal, this leads to the simple solution:

θ̂_j = sign(y*_j)(|y*_j| − λ)₊.   (5.69)

The least squares coefficients are translated toward zero, and truncated at zero. The fitted function (vector) is then given by the inverse wavelet transform f̂ = Wθ̂.
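A minimal end-to-end sketch (our own illustration, not the book's S-PLUS code): for the Haar basis, y* can be computed by the O(N) pyramid scheme mentioned at the end of this section (Exercise 5.19), and (5.69) then shrinks the coefficients:

```python
import numpy as np

def haar_transform(y):
    """y* = W^T y for the Haar basis via the O(N) pyramid (Exercise 5.19).
    Returns [V_0 coefficient, W_0, W_1, ..., W_{J-1} blocks] for N = 2^J."""
    y = np.asarray(y, dtype=float)
    s = np.sqrt(2.0)
    details = []
    while len(y) > 1:
        details.append((y[0::2] - y[1::2]) / s)   # detail (W_j) coefficients
        y = (y[0::2] + y[1::2]) / s               # smooth part, passed up
    return np.concatenate([y] + details[::-1])

def inverse_haar(t):
    """Inverse transform W theta, also O(N)."""
    t = np.asarray(t, dtype=float)
    s = np.sqrt(2.0)
    smooth, pos = t[:1], 1
    while pos < len(t):
        detail = t[pos: 2 * pos]
        out = np.empty(2 * pos)
        out[0::2] = (smooth + detail) / s
        out[1::2] = (smooth - detail) / s
        smooth, pos = out, 2 * pos
    return smooth

def wavelet_smooth(y, sigma):
    """Soft-threshold fit: theta_hat_j = sign(y*_j)(|y*_j| - lam)_+ (5.69),
    with the simple choice lam = sigma * sqrt(2 log N) discussed next."""
    ystar = haar_transform(y)
    lam = sigma * np.sqrt(2.0 * np.log(len(y)))
    theta = np.sign(ystar) * np.maximum(np.abs(ystar) - lam, 0.0)
    return inverse_haar(theta)
```

Each pyramid pass halves the signal length, so the total work is N + N/2 + ⋯ = O(N); since each pass is an orthogonal rotation of pairs, ‖y*‖ = ‖y‖ and the transform inverts exactly.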

A simple choice for λ is λ = σ√(2 log N), where σ is an estimate of the standard deviation of the noise. We can give some motivation for this choice. Since W is an orthonormal transformation, if the elements of y are white noise (independent Gaussian variates with mean 0 and variance σ²), then so are y*. Furthermore if random variables Z₁, Z₂, ..., Z_N are white noise, the expected maximum of |Z_j|, j = 1, ..., N is approximately σ√(2 log N). Hence all coefficients below σ√(2 log N) are likely to be noise and are set to zero.

The space W could be any basis of orthonormal functions: polynomials, natural splines or cosinusoids. What makes wavelets special is the particular form of basis functions used, which allows for a representation localized in time and in frequency.

Let's look again at the NMR signal of Figure 5.17. The wavelet transform was computed using a symmlet-8 basis. Notice that the coefficients do not descend all the way to V₀, but stop at V₄, which has 16 basis functions. As we ascend to each level of detail, the coefficients get smaller, except in locations where spiky behavior is present. The wavelet coefficients represent characteristics of the signal localized in time (the basis functions at each level are translations of each other) and localized in frequency. Each dilation increases the detail by a factor of two, and in this sense corresponds to doubling the frequency in a traditional Fourier representation. In fact, a more mathematical understanding of wavelets reveals that the wavelets at a particular scale have a Fourier transform that is restricted to a limited range or octave of frequencies.

The shrinking/truncation in the right panel was achieved using the SURE approach described in the introduction to this section. The orthonormal N × N basis matrix W has columns which are the wavelet basis functions evaluated at the N time points. In particular, in this case there will be 16 columns corresponding to the φ_{4,k}(x), and the remainder devoted to the ψ_{j,k}(x), j = 4, ..., 9.
In practice λ depends on the noise variance, and has to be estimated from the data (such as from the variance of the coefficients at the highest level).

Notice the similarity between the SURE criterion (5.68) on page 179, and the smoothing spline criterion (5.21) on page 156: Both are hierarchically structured from coarse to fine detail, although wavelets are also localized in time within each resolution level. The splines build in a bias toward smooth functions by imposing differential shrinking constants d_k. Early versions of SURE shrinkage treated all scales equally. The S+wavelets function waveshrink() has many options, some of which allow for differential shrinkage. The spline L₂ penalty causes pure shrinkage, while the SURE L₁ penalty does shrinkage and selection.
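The shrinkage-versus-selection contrast is easy to see numerically; a tiny sketch with hypothetical coefficients and shrinking constants of our own choosing:

```python
import numpy as np

coef = np.array([5.0, 2.0, 0.8, 0.3, 0.1])   # hypothetical coefficients, coarse to fine
d = np.array([0.0, 1.0, 4.0, 16.0, 64.0])     # hypothetical spline shrinking constants d_k
lam = 0.5

# L2 (spline-style): every coefficient is shrunk, none set exactly to zero
spline = coef / (1.0 + lam * d)

# L1 (SURE-style soft threshold): small coefficients are set exactly to zero
sure = np.sign(coef) * np.maximum(np.abs(coef) - lam, 0.0)
```

The L₂ rule damps the fine-scale coefficients heavily but never zeroes them; the L₁ rule zeroes every coefficient below λ and translates the rest toward zero, which is the selection behavior exploited by wavelet shrinkage.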

More generally, smoothing splines achieve compression of the original signal by imposing smoothness, while wavelets impose sparsity. Figure 5.19 compares a wavelet fit (using SURE shrinkage) to a smoothing spline fit (using cross-validation) on two examples different in nature. For the NMR data in the upper panel, the smoothing spline introduces detail everywhere in order to capture the detail in the isolated spikes; the wavelet fit nicely localizes the spikes. In the lower panel, the true function is smooth, and the noise is relatively high. The wavelet fit has let in some additional and unnecessary wiggles, a price it pays in variance for the additional adaptivity.

The wavelet transform is not performed by matrix multiplication as in y* = Wᵀy. In fact, using clever pyramidal schemes, y* can be obtained in O(N) computations, which is even faster than the N log(N) of the fast Fourier transform (FFT). While the general construction is beyond the scope of this book, it is easy to see for the Haar basis (Exercise 5.19). Likewise, the inverse wavelet transform Wθ̂ is also O(N).

This has been a very brief glimpse of this vast and growing field. There is a very large mathematical and computational base built on wavelets. Modern image compression is often performed using two-dimensional wavelet representations.

Bibliographic Notes

Splines and B-splines are discussed in detail in de Boor (1978). Green and Silverman (1994) and Wahba (1990) give a thorough treatment of smoothing splines and thin-plate splines; the latter also covers reproducing kernel Hilbert spaces. See also Girosi et al. (1995) and Evgeniou et al. (2000) for connections between many nonparametric regression techniques using RKHS approaches. Modeling functional data, as in Section 5.2.3, is covered in detail in Ramsay and Silverman (1997). Daubechies (1992) is a classic and mathematical treatment of wavelets. Other useful sources are Chui (1992) and Wickerhauser (1994). Donoho and Johnstone (1994) developed the SURE shrinkage and selection technology from a statistical estimation framework; see also Vidakovic (1999).
Bruce and Gao (1996) is a useful applied introduction, which also describes the wavelet software in S-PLUS.

Exercises

Ex. 5.1 Show that the truncated power basis functions in (5.3) represent a basis for a cubic spline with the two knots as indicated.

FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples (NMR signal; smooth simulated function). Each panel compares the SURE-shrunk wavelet fit to the cross-validated smoothing spline fit.

Ex. 5.2 Suppose that B_{i,M}(x) is an order-M B-spline defined in the Appendix on page 186 through the sequence (5.77)-(5.78).

(a) Show by induction that B_{i,M}(x) = 0 for x ∉ [τ_i, τ_{i+M}]. This shows, for example, that the support of cubic B-splines is at most 5 knots.

(b) Show by induction that B_{i,M}(x) > 0 for x ∈ (τ_i, τ_{i+M}). The B-splines are positive in the interior of their support.

(c) Show by induction that Σ_{i=1}^{K+M} B_{i,M}(x) = 1 for all x ∈ [ξ₀, ξ_{K+1}].

(d) Show that B_{i,M} is a piecewise polynomial of order M (degree M − 1) on [ξ₀, ξ_{K+1}], with breaks only at the knots ξ₁, ..., ξ_K.

(e) Show that an order-M B-spline basis function is the density function of a convolution of M uniform random variables.

Ex. 5.3 Write a program to reproduce Figure 5.3 on page 145.

Ex. 5.4 Consider the truncated power series representation for cubic splines with K interior knots. Let

f(X) = Σ_{j=0}^{3} β_j X^j + Σ_{k=1}^{K} θ_k (X − ξ_k)³₊.   (5.70)

Prove that the natural boundary conditions for natural cubic splines (Section 5.2.1) imply the following linear constraints on the coefficients:

β₂ = 0,  Σ_{k=1}^K θ_k = 0,
β₃ = 0,  Σ_{k=1}^K ξ_k θ_k = 0.   (5.71)

Hence derive the basis (5.4) and (5.5).

Ex. 5.5 Write a program to classify the phoneme data using a quadratic discriminant analysis (Section 4.3). Since there are many correlated features, you should filter them using a smooth basis of natural cubic splines (Section 5.2.3). Decide beforehand on a series of five different choices for the number and position of the knots, and use tenfold cross-validation to make the final selection. The phoneme data are available from the book website www-stat.stanford.edu/ElemStatLearn.

Ex. 5.6 Suppose you wish to fit a periodic function, with a known period T. Describe how you could modify the truncated power series basis to achieve this goal.

Ex. 5.7 Derivation of smoothing splines (Green and Silverman, 1994). Suppose that N ≥ 2, and that g is the natural cubic spline interpolant to the pairs {x_i, z_i}₁^N, with a < x₁ < ⋯ < x_N < b. This is a natural spline

with a knot at every x_i; being an N-dimensional space of functions, we can determine the coefficients such that it interpolates the sequence z_i exactly. Let g̃ be any other differentiable function on [a, b] that interpolates the N pairs.

(a) Let h(x) = g̃(x) − g(x). Use integration by parts and the fact that g is a natural cubic spline to show that

∫_a^b g″(x) h″(x) dx = −Σ_{j=1}^{N−1} g‴(x_j⁺){h(x_{j+1}) − h(x_j)}   (5.72)
                     = 0.

(b) Hence show that

∫_a^b g̃″(t)² dt ≥ ∫_a^b g″(t)² dt,

and that equality can only hold if h is identically zero in [a, b].

(c) Consider the penalized least squares problem

min_f [ Σ_{i=1}^N (y_i − f(x_i))² + λ ∫_a^b f″(t)² dt ].

Use (b) to argue that the minimizer must be a cubic spline with knots at each of the x_i.

Ex. 5.8 In the appendix to this chapter we show how the smoothing spline computations could be more efficiently carried out using a (N + 4)-dimensional basis of B-splines. Describe a slightly simpler scheme using a (N + 2)-dimensional B-spline basis defined on the N − 2 interior knots.

Ex. 5.9 Derive the Reinsch form S_λ = (I + λK)⁻¹ for the smoothing spline.

Ex. 5.10 Derive an expression for Var(f̂_λ(x₀)) and bias(f̂_λ(x₀)). Using the example (5.22), create a version of Figure 5.9 where the mean and several (pointwise) quantiles of f̂_λ(x) are shown.

Ex. 5.11 Prove that for a smoothing spline the null space of K is spanned by functions linear in X.

Ex. 5.12 Characterize the solution to the following problem,

min_f RSS(f, λ) = Σ_{i=1}^N w_i {y_i − f(x_i)}² + λ ∫ {f″(t)}² dt,   (5.73)

where the w_i ≥ 0 are observation weights.

Characterize the solution to the smoothing spline problem (5.9) when the training data have ties in X.

Ex. 5.13 You have fitted a smoothing spline f̂_λ to a sample of N pairs (x_i, y_i). Suppose you augment your original sample with the pair x₀, f̂_λ(x₀), and refit; describe the result. Use this to derive the N-fold cross-validation formula (5.26).

Ex. 5.14 Derive the constraints on the α_j in the thin-plate spline expansion (5.39) to guarantee that the penalty J(f) is finite. How else could one ensure that the penalty was finite?

Ex. 5.15 This exercise derives some of the results quoted in Section 5.8.1. Suppose that K(x, y) satisfies the conditions (5.45), and let f(x) ∈ H_K. Show that

(a) ⟨K(·, x_i), f⟩_{H_K} = f(x_i).

(b) ⟨K(·, x_i), K(·, x_j)⟩_{H_K} = K(x_i, x_j).

(c) If g(x) = Σ_{i=1}^N α_i K(x, x_i), then

J(g) = Σ_{i=1}^N Σ_{j=1}^N K(x_i, x_j) α_i α_j.

Suppose that g̃(x) = g(x) + ρ(x), with ρ(x) ∈ H_K, and orthogonal in H_K to each of K(x, x_i), i = 1, ..., N. Show that

(d) Σ_{i=1}^N L(y_i, g̃(x_i)) + λJ(g̃) ≥ Σ_{i=1}^N L(y_i, g(x_i)) + λJ(g)   (5.74)

with equality iff ρ(x) = 0.

Ex. 5.16 Consider the ridge regression problem (5.53), and assume M ≥ N. Assume you have a kernel K that computes the inner product K(x, y) = Σ_{m=1}^M h_m(x) h_m(y).

(a) Derive (5.62) on page 171 in the text. How would you compute the matrices V and D_γ, given K? Hence show that (5.63) is equivalent to (5.53).

(b) Show that

f̂ = Hβ̂ = K(K + λI)⁻¹ y,   (5.75)

where H is the N × M matrix of evaluations h_m(x_i), and K = HHᵀ the N × N matrix of inner-products h(x_i)ᵀh(x_j).

(c) Show that

f̂(x) = h(x)ᵀβ̂ = Σ_{i=1}^N K(x, x_i) α̂_i   (5.76)

and α̂ = (K + λI)⁻¹ y.

(d) How would you modify your solution if M < N?

Ex. 5.17 Show how to convert the discrete eigen-decomposition of K in Section 5.8.2 to estimates of the eigenfunctions of K.

Ex. 5.18 The wavelet function ψ(x) of the symmlet-p wavelet basis has vanishing moments up to order p − 1. Show that this implies that polynomials of order p are represented exactly in V₀, defined on page 176.

Ex. 5.19 Show that the Haar wavelet transform of a signal of length N = 2^J can be computed in O(N) computations.

Appendix: Computations for Splines

In this Appendix, we describe the B-spline basis for representing polynomial splines. We also discuss their use in the computations of smoothing splines.

B-splines

Before we can get started, we need to augment the knot sequence defined in Section 5.2. Let ξ₀ < ξ₁ and ξ_K < ξ_{K+1} be two boundary knots, which typically define the domain over which we wish to evaluate our spline. We now define the augmented knot sequence τ such that

τ₁ ≤ τ₂ ≤ ⋯ ≤ τ_M ≤ ξ₀;
τ_{j+M} = ξ_j, j = 1, ..., K;
ξ_{K+1} ≤ τ_{K+M+1} ≤ τ_{K+M+2} ≤ ⋯ ≤ τ_{K+2M}.

The actual values of these additional knots beyond the boundary are arbitrary, and it is customary to make them all the same and equal to ξ₀ and ξ_{K+1}, respectively.

Denote by B_{i,m}(x) the ith B-spline basis function of order m for the knot-sequence τ, 1 ≤ m ≤ M. They are defined recursively in terms of divided

differences as follows:

B_{i,1}(x) = 1 if τ_i ≤ x < τ_{i+1}; 0 otherwise,   (5.77)

for i = 1, ..., K + 2M − 1. These are also known as Haar basis functions.

B_{i,m}(x) = ((x − τ_i) / (τ_{i+m−1} − τ_i)) B_{i,m−1}(x) + ((τ_{i+m} − x) / (τ_{i+m} − τ_{i+1})) B_{i+1,m−1}(x)

for i = 1, ..., K + 2M − m.   (5.78)

Thus with M = 4, B_{i,4}, i = 1, ..., K + 4 are the K + 4 cubic B-spline basis functions for the knot sequence ξ. This recursion can be continued and will generate the B-spline basis for any order spline. Figure 5.20 shows the sequence of B-splines up to order four with knots at the points 0.0, 0.1, ..., 1.0. Since we have created some duplicate knots, some care has to be taken to avoid division by zero. If we adopt the convention that B_{i,1} = 0 if τ_i = τ_{i+1}, then by induction B_{i,m} = 0 if τ_i = τ_{i+1} = ··· = τ_{i+m}. Note also that in the construction above, only the subset B_{i,m}, i = M − m + 1, ..., M + K are required for the B-spline basis of order m < M with knots ξ.

To fully understand the properties of these functions, and to show that they do indeed span the space of cubic splines for the knot sequence, requires additional mathematical machinery, including the properties of divided differences. Exercise 5.2 explores these issues.

The scope of B-splines is in fact bigger than advertised here, and has to do with knot duplication. If we duplicate an interior knot in the construction of the τ sequence above, and then generate the B-spline sequence as before, the resulting basis spans the space of piecewise polynomials with one less continuous derivative at the duplicated knot. In general, if in addition to the repeated boundary knots, we include the interior knot ξ_j 1 ≤ r_j ≤ M times, then the lowest-order derivative to be discontinuous at x = ξ_j will be order M − r_j. Thus for cubic splines with no repeats, r_j = 1, j = 1, ..., K, and at each interior knot the third derivatives (4 − 1 = 3) are discontinuous. Repeating the jth knot three times leads to a discontinuous first derivative; repeating it four times leads to a discontinuous zeroth derivative, i.e., the function is discontinuous at x = ξ_j.
This is exactly what happens at the boundary knots; we repeat the knots M times, so the spline becomes discontinuous at the boundary knots (i.e., undefined beyond the boundary).

The local support of B-splines has important computational implications, especially when the number of knots K is large. Least squares computations with N observations and K + M variables (basis functions) take O(N(K + M)² + (K + M)³) flops (floating point operations). If K is some appreciable fraction of N, this leads to O(N³) algorithms, which becomes

FIGURE 5.20. The sequence of B-splines up to order four (panels: B-splines of order one, two, three and four) with ten knots evenly spaced from 0 to 1. The B-splines have local support; they are nonzero on an interval spanned by M + 1 knots.

unacceptable for large N. If the N observations are sorted, the N × (K + M) regression matrix consisting of the K + M B-spline basis functions evaluated at the N points has many zeros, which can be exploited to reduce the computational complexity back to O(N). We take this up further in the next section.

Computations for Smoothing Splines

Although natural splines (Section 5.2.1) provide a basis for smoothing splines, it is computationally more convenient to operate in the larger space of unconstrained B-splines. We write f(x) = Σ_{j=1}^{N+4} γ_j B_j(x), where γ_j are coefficients and the B_j are the cubic B-spline basis functions. The solution looks the same as before,

γ̂ = (B^T B + λΩ_B)^{-1} B^T y,   (5.79)

except now the N × N matrix N is replaced by the N × (N + 4) matrix B, and similarly the (N + 4) × (N + 4) penalty matrix Ω_B replaces the N × N dimensional Ω_N. Although at face value it seems that there are no boundary derivative constraints, it turns out that the penalty term automatically imposes them by giving effectively infinite weight to any nonzero derivative beyond the boundary. In practice, γ̂ is restricted to a linear subspace for which the penalty is always finite.

Since the columns of B are the evaluated B-splines, in order from left to right and evaluated at the sorted values of X, and the cubic B-splines have local support, B is lower 4-banded. Consequently the matrix M = (B^T B + λΩ) is 4-banded and hence its Cholesky decomposition M = LL^T can be computed easily. One then solves LL^T γ = B^T y by back-substitution to give γ and hence the solution ˆf in O(N) operations.

In practice, when N is large, it is unnecessary to use all N interior knots, and any reasonable thinning strategy will save in computations and have negligible effect on the fit. For example, the smooth.spline function in S-PLUS uses an approximately logarithmic strategy: if N < 50 all knots are included, but even at N = 5,000 only 204 knots are used.
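The recursion (5.77)–(5.78), together with the zero-division convention for duplicated knots, is short to implement. Below is a hypothetical Python sketch (names our own, not code from the book), checked via the partition-of-unity property: at any x in the domain the cubic B-splines are nonnegative, at most M = 4 of them are nonzero, and they sum to 1.

```python
def bspline(x, tau, i, m):
    """Order-m B-spline B_{i,m}(x) for knot sequence tau (here 0-indexed i),
    with the convention that terms over zero-width knot spans vanish."""
    if m == 1:
        return 1.0 if tau[i] <= x < tau[i + 1] else 0.0
    left = right = 0.0
    if tau[i + m - 1] > tau[i]:                  # guard against duplicated knots
        left = (x - tau[i]) / (tau[i + m - 1] - tau[i]) * bspline(x, tau, i, m - 1)
    if tau[i + m] > tau[i + 1]:
        right = (tau[i + m] - x) / (tau[i + m] - tau[i + 1]) * bspline(x, tau, i + 1, m - 1)
    return left + right

# Augmented knots for cubic splines (M = 4): K = 9 interior knots at 0.1,...,0.9,
# with each boundary knot (0 and 1) repeated M times.
tau = [0.0] * 4 + [j / 10 for j in range(1, 10)] + [1.0] * 4
cubics = [bspline(0.37, tau, i, 4) for i in range(9 + 4)]   # the K + M = 13 cubic basis functions
```

In production one would evaluate the whole basis at once with the banded structure exploited (e.g., scipy.interpolate.BSpline), but the recursion above is the underlying definition.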


6 Kernel Smoothing Methods

In this chapter we describe a class of regression techniques that achieve flexibility in estimating the regression function f(X) over the domain IR^p by fitting a different but simple model separately at each query point x_0. This is done by using only those observations close to the target point x_0 to fit the simple model, and in such a way that the resulting estimated function ˆf(X) is smooth in IR^p. This localization is achieved via a weighting function or kernel K_λ(x_0, x_i), which assigns a weight to x_i based on its distance from x_0. The kernels K_λ are typically indexed by a parameter λ that dictates the width of the neighborhood. These memory-based methods require in principle little or no training; all the work gets done at evaluation time. The only parameter that needs to be determined from the training data is λ. The model, however, is the entire training data set.

We also discuss more general classes of kernel-based techniques, which tie in with structured methods in other chapters, and are useful for density estimation and classification.

The techniques in this chapter should not be confused with those associated with the more recent usage of the phrase "kernel methods." In this chapter kernels are mostly used as a device for localization. We discuss kernel methods in Sections 5.8, 14.5.4, 18.5 and Chapter 12; in those contexts the kernel computes an inner product in a high-dimensional (implicit) feature space, and is used for regularized nonlinear modeling. We make some connections to the methodology in this chapter at the end of Section 6.7.
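The memory-based flavor is easy to see in code. The hypothetical sketch below (names our own; the kernel is the Epanechnikov kernel defined later in (6.4)) stores nothing but the sample itself, and does all the work at the query point.

```python
import numpy as np

def kernel_average(x0, x, y, lam):
    """Kernel-weighted average at a single query point x0.
    Uses the Epanechnikov kernel D(t) = 3/4 (1 - t^2) on |t| <= 1;
    assumes at least one x_i falls within lam of x0."""
    t = np.abs(x - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)
    return np.sum(w * y) / np.sum(w)

# Noise-free illustration: the smooth of y = sin(4x) should track the curve.
x = np.linspace(0, 1, 201)
y = np.sin(4 * x)
fhat = kernel_average(0.5, x, y, lam=0.05)    # close to sin(2)
```

Note that the "training data" x, y must be kept around: each new query repeats the weighted average from scratch, and λ is the only quantity tuned from the data.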

FIGURE 6.1. In each panel 100 pairs x_i, y_i are generated at random from the blue curve with Gaussian errors: Y = sin(4X) + ε, X ∼ U[0, 1], ε ∼ N(0, 1/3). In the left panel the green curve is the result of a 30-nearest-neighbor running-mean smoother. The red point is the fitted constant ˆf(x_0), and the red circles indicate those observations contributing to the fit at x_0. The solid yellow region indicates the weights assigned to observations. In the right panel, the green curve is the kernel-weighted average, using an Epanechnikov kernel with (half) window width λ = 0.2.

6.1 One-Dimensional Kernel Smoothers

In Chapter 2, we motivated the k-nearest-neighbor average

ˆf(x) = Ave(y_i | x_i ∈ N_k(x))   (6.1)

as an estimate of the regression function E(Y | X = x). Here N_k(x) is the set of k points nearest to x in squared distance, and Ave denotes the average (mean). The idea is to relax the definition of conditional expectation, as illustrated in the left panel of Figure 6.1, and compute an average in a neighborhood of the target point. In this case we have used the 30-nearest neighborhood: the fit at x_0 is the average of the 30 pairs whose x_i values are closest to x_0. The green curve is traced out as we apply this definition at different values x_0. The green curve is bumpy, since ˆf(x) is discontinuous in x. As we move x_0 from left to right, the k-nearest neighborhood remains constant, until a point x_i to the right of x_0 becomes closer than the furthest point x_{i′} in the neighborhood to the left of x_0, at which time x_i replaces x_{i′}. The average in (6.1) changes in a discrete way, leading to a discontinuous ˆf(x).

This discontinuity is ugly and unnecessary. Rather than give all the points in the neighborhood equal weight, we can assign weights that die off smoothly with distance from the target point. The right panel shows an example of this, using the so-called Nadaraya-Watson kernel-weighted

212 6. ne-dimensinal Kernel Smthers 9 average ˆf(x 0 ) = N i= K λ(x 0,x i )y i N i= K λ(x 0,x i ), (6.) with the Epanechnikv quadratic kernel ( ) x x0 K λ (x 0,x) = D, (6.) λ with D(t) = { 4 ( t ) if t ; 0 therwise. (6.4) The fitted functin is nw cntinuus, and quite smth in the right panel f Figure 6.. As we mve the target frm left t right, pints enter the neighbrhd initially with weight zer, and then their cntributin slwly increases (see Exercise 6.). In the right panel we used a metric windw size λ = 0. fr the kernel fit, which des nt change as we mve the target pint x 0, while the size f the 0-nearest-neighbr smthing windw adapts t the lcal density f the x i. ne can, hwever, als use such adaptive neighbrhds with kernels, but we need t use a mre general ntatin. Let h λ (x 0 ) be a width functin (indexed by λ) that determines the width f the neighbrhd at x 0. Then mre generally we have ( ) x x0 K λ (x 0,x) = D. (6.5) h λ (x 0 ) In (6.), h λ (x 0 ) = λ is cnstant. Fr k-nearest neighbrhds, the neighbrhd size k replaces λ, and we have h k (x 0 ) = x 0 x [k] where x [k] is the kth clsest x i t x 0. There are a number f details that ne has t attend t in practice: The smthing parameter λ, which determines the width f the lcal neighbrhd, has t be determined. Large λ implies lwer variance (averages ver mre bservatins) but higher bias (we essentially assume the true functin is cnstant within the windw). Metric windw widths (cnstant h λ (x)) tend t keep the bias f the estimate cnstant, but the variance is inversely prprtinal t the lcal density. Nearest-neighbr windw widths exhibit the ppsite behavir; the variance stays cnstant and the abslute bias varies inversely with lcal density. Issues arise with nearest-neighbrs when there are ties in the x i. 
With most smoothing techniques one can simply reduce the data set by averaging the y_i at tied values of X, and supplementing these new observations at the unique values of x_i with an additional weight w_i (which multiplies the kernel weight).

FIGURE 6.2. A comparison of three popular kernels K_λ(x_0, x) for local smoothing: Epanechnikov, tri-cube and Gaussian. Each has been calibrated to integrate to 1. The tri-cube kernel is compact and has two continuous derivatives at the boundary of its support, while the Epanechnikov kernel has none. The Gaussian kernel is continuously differentiable, but has infinite support.

- This leaves a more general problem to deal with: observation weights w_i. Operationally we simply multiply them by the kernel weights before computing the weighted average. With nearest neighborhoods, it is now natural to insist on neighborhoods with a total weight content k (relative to Σ w_i). In the event of overflow (the last observation needed in a neighborhood has a weight w_j which causes the sum of weights to exceed the budget k), then fractional parts can be used.

- Boundary issues arise. The metric neighborhoods tend to contain fewer points on the boundaries, while the nearest-neighborhoods get wider.

- The Epanechnikov kernel has compact support (needed when used with nearest-neighbor window size). Another popular compact kernel is based on the tri-cube function

D(t) = (1 − |t|³)³ if |t| ≤ 1; 0 otherwise.   (6.6)

This is flatter on the top (like the nearest-neighbor box) and is differentiable at the boundary of its support. The Gaussian density function D(t) = φ(t) is a popular noncompact kernel, with the standard deviation playing the role of the window size. Figure 6.2 compares the three.

6.1.1 Local Linear Regression

We have progressed from the raw moving average to a smoothly varying locally weighted average by using kernel weighting. The smooth kernel fit still has problems, however, as exhibited in Figure 6.3 (left panel). Locally weighted averages can be badly biased on the boundaries of the domain,

214 6. ne-dimensinal Kernel Smthers 95 N-W Kernel at Bundary Lcal Linear Regressin at Bundary ˆf(x 0 ) ˆf(x 0 ) x x FIGURE 6.. The lcally weighted average has bias prblems at r near the bundaries f the dmain. The true functin is apprximately linear here, but mst f the bservatins in the neighbrhd have a higher mean than the target pint, s despite weighting, their mean will be biased upwards. By fitting a lcally weighted linear regressin (right panel), this bias is remved t first rder because f the asymmetry f the kernel in that regin. By fitting straight lines rather than cnstants lcally, we can remve this bias exactly t first rder; see Figure 6. (right panel). Actually, this bias can be present in the interir f the dmain as well, if the X values are nt equally spaced (fr the same reasns, but usually less severe). Again lcally weighted linear regressin will make a first-rder crrectin. Lcally weighted regressin slves a separate weighted least squares prblem at each target pint x 0 : min α(x 0),β(x 0) i= N K λ (x 0,x i )[y i α(x 0 ) β(x 0 )x i ]. (6.7) The estimate is then ˆf(x 0 ) = ˆα(x 0 ) + ˆβ(x 0 )x 0. Ntice that althugh we fit an entire linear mdel t the data in the regin, we nly use it t evaluate the fit at the single pint x 0. Define the vectr-valued functin b(x) T = (,x). Let B be the N regressin matrix with ith rw b(x i ) T, and W(x 0 ) the N N diagnal matrix with ith diagnal element K λ (x 0,x i ). Then ˆf(x 0 ) = b(x 0 ) T (B T W(x 0 )B) B T W(x 0 )y (6.8) N = l i (x 0 )y i. (6.9) i= Equatin (6.8) gives an explicit expressin fr the lcal linear regressin estimate, and (6.9) highlights the fact that the estimate is linear in the

FIGURE 6.4. The green points show the equivalent kernel l_i(x_0) for local regression (left panel: at the boundary; right panel: in the interior). These are the weights in ˆf(x_0) = Σ_{i=1}^N l_i(x_0) y_i, plotted against their corresponding x_i. For display purposes, these have been rescaled, since in fact they sum to 1. Since the yellow shaded region is the (rescaled) equivalent kernel for the Nadaraya-Watson local average, we see how local regression automatically modifies the weighting kernel to correct for biases due to asymmetry in the smoothing window.

y_i (the l_i(x_0) do not involve y). These weights l_i(x_0) combine the weighting kernel K_λ(x_0, ·) and the least squares operations, and are sometimes referred to as the equivalent kernel. Figure 6.4 illustrates the effect of local linear regression on the equivalent kernel.

Historically, the bias in the Nadaraya-Watson and other local average kernel methods was corrected by modifying the kernel. These modifications were based on theoretical asymptotic mean-square-error considerations, and besides being tedious to implement, are only approximate for finite sample sizes. Local linear regression automatically modifies the kernel to correct the bias exactly to first order, a phenomenon dubbed automatic kernel carpentry. Consider the following expansion for E ˆf(x_0), using the linearity of local regression and a series expansion of the true function f around x_0:

E ˆf(x_0) = Σ_{i=1}^N l_i(x_0) f(x_i)
         = f(x_0) Σ_{i=1}^N l_i(x_0) + f′(x_0) Σ_{i=1}^N (x_i − x_0) l_i(x_0)
           + (f″(x_0)/2) Σ_{i=1}^N (x_i − x_0)² l_i(x_0) + R,   (6.10)

where the remainder term R involves third- and higher-order derivatives of f, and is typically small under suitable smoothness assumptions. It can be

216 6. ne-dimensinal Kernel Smthers 97 Lcal Linear in Interir Lcal Quadratic in Interir ˆf(x 0 ) ˆf(x 0 ) FIGURE 6.5. Lcal linear fits exhibit bias in regins f curvature f the true functin. Lcal quadratic fits tend t eliminate this bias. shwn (Exercise 6.) that fr lcal linear regressin, N i= l i(x 0 ) = and N i= (x i x 0 )l i (x 0 ) = 0. Hence the middle term equals f(x 0 ), and since the bias is E ˆf(x 0 ) f(x 0 ), we see that it depends nly n quadratic and higher rder terms in the expansin f f. 6.. Lcal Plynmial Regressin Why stp at lcal linear fits? We can fit lcal plynmial fits f any degree d, N d min K λ (x 0,x i ) y i α(x 0 ) β j (x 0 )x j i (6.) α(x 0),β j(x 0), j=,...,d i= with slutin ˆf(x 0 ) = ˆα(x 0 )+ d ˆβ j= j (x 0 )x j 0. In fact, an expansin such as (6.0) will tell us that the bias will nly have cmpnents f degree d+ and higher (Exercise 6.). Figure 6.5 illustrates lcal quadratic regressin. Lcal linear fits tend t be biased in regins f curvature f the true functin, a phenmenn referred t as trimming the hills and filling the valleys. Lcal quadratic regressin is generally able t crrect this bias. There is f curse a price t be paid fr this bias reductin, and that is increased variance. The fit in the right panel f Figure 6.5 is slightly mre wiggly, especially in the tails. Assuming the mdel y i = f(x i ) + ε i, with ε i independent and identically distributed with mean zer and variance σ, Var( ˆf(x 0 )) = σ l(x 0 ), where l(x 0 ) is the vectr f equivalent kernel weights at x 0. It can be shwn (Exercise 6.) that l(x 0 ) increases with d, and s there is a bias variance tradeff in selecting the plynmial degree. Figure 6.6 illustrates these variance curves fr degree zer, ne and tw j=

FIGURE 6.6. The variance functions ||l(x)||² for local constant, linear and quadratic regression, for a metric bandwidth (λ = 0.2) tri-cube kernel.

local polynomials. To summarize some collected wisdom on this issue:

- Local linear fits can help bias dramatically at the boundaries at a modest cost in variance. Local quadratic fits do little at the boundaries for bias, but increase the variance a lot.

- Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.

- Asymptotic analysis suggests that local polynomials of odd degree dominate those of even degree. This is largely due to the fact that asymptotically the MSE is dominated by boundary effects.

While it may be helpful to tinker, and move from local linear fits at the boundary to local quadratic fits in the interior, we do not recommend such strategies. Usually the application will dictate the degree of the fit. For example, if we are interested in extrapolation, then the boundary is of more interest, and local linear fits are probably more reliable.

6.2 Selecting the Width of the Kernel

In each of the kernels K_λ, λ is a parameter that controls its width:

- For the Epanechnikov or tri-cube kernel with metric width, λ is the radius of the support region.

- For the Gaussian kernel, λ is the standard deviation.

- λ is the number k of nearest neighbors in k-nearest neighborhoods, often expressed as a fraction or span k/N of the total training sample.
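The boundary claim in the first point of the collected wisdom above is easy to verify numerically. The sketch below is a hypothetical implementation of the local polynomial fit (6.11) with tri-cube weights (names our own): on exactly linear data the local linear fit (d = 1) is exact even at the left boundary, while the local constant fit (d = 0, the Nadaraya-Watson average) is biased upward there.

```python
import numpy as np

def local_poly(x0, x, y, lam, degree=1):
    """Local polynomial fit (6.11) of the given degree at x0, tri-cube weights.
    Assumes enough points fall inside the window for the solve to be well posed."""
    t = np.abs(x - x0) / lam
    w = np.where(t < 1, (1 - t ** 3) ** 3, 0.0)
    B = np.vander(x, degree + 1, increasing=True)      # columns 1, x, x^2, ...
    WB = B * w[:, None]
    beta = np.linalg.solve(B.T @ WB, WB.T @ y)
    return np.vander([x0], degree + 1, increasing=True)[0] @ beta

x = np.linspace(0, 1, 101)
y = 1.0 + 2.0 * x                                       # exactly linear, no noise
at_boundary_lin = local_poly(0.0, x, y, lam=0.3, degree=1)    # recovers f(0) = 1 exactly
at_boundary_const = local_poly(0.0, x, y, lam=0.3, degree=0)  # biased upward
```

The one-sided window at x_0 = 0 has all its mass to the right, so the weighted average pulls the constant fit toward larger y; the linear fit absorbs this asymmetry into its slope.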

FIGURE 6.7. Equivalent kernels for a local linear regression smoother (tri-cube kernel; orange) and a smoothing spline (blue), with matching degrees of freedom. The vertical spikes indicate the target points.

There is a natural bias-variance tradeoff as we change the width of the averaging window, which is most explicit for local averages:

- If the window is narrow, ˆf(x_0) is an average of a small number of y_i close to x_0, and its variance will be relatively large, close to that of an individual y_i. The bias will tend to be small, again because each of the E(y_i) = f(x_i) should be close to f(x_0).

- If the window is wide, the variance of ˆf(x_0) will be small relative to the variance of any y_i, because of the effects of averaging. The bias will be higher, because we are now using observations x_i further from x_0, and there is no guarantee that f(x_i) will be close to f(x_0).

Similar arguments apply to local regression estimates, say local linear: as the width goes to zero, the estimates approach a piecewise-linear function that interpolates the training data¹; as the width gets infinitely large, the fit approaches the global linear least-squares fit to the data.

The discussion in Chapter 5 on selecting the regularization parameter for smoothing splines applies here, and will not be repeated. Local regression smoothers are linear estimators; the smoother matrix in ˆf = S_λ y is built up from the equivalent kernels (6.8), and has ijth entry {S_λ}_{ij} = l_i(x_j). Leave-one-out cross-validation is particularly simple (Exercise 6.7), as is generalized cross-validation, C_p (Exercise 6.10), and k-fold cross-validation. The effective degrees of freedom is again defined as trace(S_λ), and can be used to calibrate the amount of smoothing. Figure 6.7 compares the equivalent kernels for a smoothing spline and local linear regression.
The local regression smoother has a span of 40%, which fixes df = trace(S_λ). The smoothing spline was calibrated to have the same df, and their equivalent kernels are qualitatively quite similar.

¹With uniformly spaced x_i; with irregularly spaced x_i, the behavior can deteriorate.
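Because the smoother is linear, S_λ can be assembled one row at a time from the equivalent kernels (6.8)–(6.9), and trace(S_λ) read off directly. A hypothetical sketch for local linear regression with a tri-cube kernel (names our own):

```python
import numpy as np

def smoother_matrix(x, lam):
    """Smoother matrix S_lam for local linear regression with tri-cube weights:
    row j holds the equivalent-kernel weights l_i(x_j)."""
    N = len(x)
    B = np.column_stack([np.ones(N), x])               # b(x)^T = (1, x)
    S = np.zeros((N, N))
    for j, x0 in enumerate(x):
        t = np.abs(x - x0) / lam
        w = np.where(t < 1, (1 - t ** 3) ** 3, 0.0)
        WB = B * w[:, None]
        # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W, as in (6.8)
        S[j] = np.array([1.0, x0]) @ np.linalg.solve(B.T @ WB, WB.T)
    return S

x = np.linspace(0, 1, 50)
S = smoother_matrix(x, lam=0.4)
df = np.trace(S)                                       # effective degrees of freedom
```

The rows of S sum to 1 and S reproduces linear functions exactly, the two equivalent-kernel identities used in the bias expansion (6.10); df shrinks toward 2 (the global linear fit) as λ grows.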

6.3 Local Regression in IR^p

Kernel smoothing and local regression generalize very naturally to two or more dimensions. The Nadaraya-Watson kernel smoother fits a constant locally with weights supplied by a p-dimensional kernel. Local linear regression will fit a hyperplane locally in X, by weighted least squares, with weights supplied by a p-dimensional kernel. It is simple to implement and is generally preferred to the local constant fit for its superior performance on the boundaries.

Let b(X) be a vector of polynomial terms in X of maximum degree d. For example, with d = 1 and p = 2 we get b(X) = (1, X_1, X_2); with d = 2 we get b(X) = (1, X_1, X_2, X_1², X_2², X_1 X_2); and trivially with d = 0 we get b(X) = 1. At each x_0 ∈ IR^p solve

min_{β(x_0)} Σ_{i=1}^N K_λ(x_0, x_i) (y_i − b(x_i)^T β(x_0))²   (6.12)

to produce the fit ˆf(x_0) = b(x_0)^T β̂(x_0). Typically the kernel will be a radial function, such as the radial Epanechnikov or tri-cube kernel

K_λ(x_0, x) = D(||x − x_0|| / λ),   (6.13)

where ||·|| is the Euclidean norm. Since the Euclidean norm depends on the units in each coordinate, it makes most sense to standardize each predictor, for example, to unit standard deviation, prior to smoothing.

While boundary effects are a problem in one-dimensional smoothing, they are a much bigger problem in two or higher dimensions, since the fraction of points on the boundary is larger. In fact, one of the manifestations of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows. Directly modifying the kernel to accommodate two-dimensional boundaries becomes very messy, especially for irregular boundaries. Local polynomial regression seamlessly performs boundary correction to the desired order in any dimensions. Figure 6.8 illustrates local linear regression on some measurements from an astronomical study with an unusual predictor design (star-shaped). Here the boundary is extremely irregular, and the fitted surface must also interpolate over regions of increasing data sparsity as we approach the boundary.
Local regression becomes less useful in dimensions much higher than two or three. We have discussed in some detail the problems of dimensionality, for example, in Chapter 2. It is impossible to simultaneously maintain localness (⇒ low bias) and a sizable sample in the neighborhood (⇒ low variance) as the dimension increases, without the total sample size increasing exponentially in p. Visualization of ˆf(X) also becomes difficult in higher dimensions, and this is often one of the primary goals of smoothing.

FIGURE 6.8. The left panel shows three-dimensional data (velocity versus East-West and South-North position), where the response is the velocity measurements on a galaxy, and the two predictors record positions on the celestial sphere. The unusual "star"-shaped design indicates the way the measurements were made, and results in an extremely irregular boundary. The right panel shows the results of local linear regression smoothing in IR², using a nearest-neighbor window with 15% of the data.

Although the scatter-cloud and wire-frame pictures in Figure 6.8 look attractive, it is quite difficult to interpret the results except at a gross level. From a data analysis perspective, conditional plots are far more useful.

Figure 6.9 shows an analysis of some environmental data with three predictors. The trellis display here shows ozone as a function of radiation, conditioned on the other two variables, temperature and wind speed. However, conditioning on the value of a variable really implies local to that value (as in local regression). Above each of the panels in Figure 6.9 is an indication of the range of values present in that panel for each of the conditioning values. In the panel itself the data subsets are displayed (response versus remaining variable), and a one-dimensional local linear regression is fit to the data. Although this is not quite the same as looking at slices of a fitted three-dimensional surface, it is probably more useful in terms of understanding the joint behavior of the data.

6.4 Structured Local Regression Models in IR^p

When the dimension to sample-size ratio is unfavorable, local regression does not help us much, unless we are willing to make some structural assumptions about the model. Much of this book is about structured regression and classification models. Here we focus on some approaches directly related to kernel methods.

FIGURE 6.9. Three-dimensional smoothing example. The response is (cube root of) ozone concentration (cube root ppb), and the three predictors are temperature, wind speed and solar radiation (langleys). The trellis display shows ozone as a function of radiation, conditioned on intervals of temperature and wind speed (indicated by darker green or orange shaded bars above each panel). Each panel contains about 40% of the range of each of the conditioned variables. The curve in each panel is a univariate local linear regression, fit to the data in the panel.

6.4.1 Structured Kernels

One line of approach is to modify the kernel. The default spherical kernel (6.13) gives equal weight to each coordinate, and so a natural default strategy is to standardize each variable to unit standard deviation. A more general approach is to use a positive semidefinite matrix A to weigh the different coordinates:

K_{λ,A}(x_0, x) = D( √((x − x_0)^T A (x − x_0)) / λ ).   (6.14)

Entire coordinates or directions can be downgraded or omitted by imposing appropriate restrictions on A. For example, if A is diagonal, then we can increase or decrease the influence of individual predictors X_j by increasing or decreasing A_{jj}. Often the predictors are many and highly correlated, such as those arising from digitized analog signals or images. The covariance function of the predictors can be used to tailor a metric A that focuses less, say, on high-frequency contrasts (Exercise 6.4). Proposals have been made for learning the parameters for multidimensional kernels. For example, the projection-pursuit regression model discussed in Chapter 11 is of this flavor, where low-rank versions of A imply ridge functions for ˆf(X). More general models for A are cumbersome, and we favor instead the structured forms for the regression function discussed next.

6.4.2 Structured Regression Functions

We are trying to fit a regression function E(Y | X) = f(X_1, X_2, ..., X_p) in IR^p, in which every level of interaction is potentially present. It is natural to consider analysis-of-variance (ANOVA) decompositions of the form

f(X_1, X_2, ..., X_p) = α + Σ_j g_j(X_j) + Σ_{k<l} g_{kl}(X_k, X_l) + ···   (6.15)

and then introduce structure by eliminating some of the higher-order terms. Additive models assume only main effect terms: f(X) = α + Σ_{j=1}^p g_j(X_j); second-order models will have terms with interactions of order at most two, and so on. In Chapter 9, we describe iterative backfitting algorithms for fitting such low-order interaction models. In the additive model, for example, if all but the kth term is assumed known, then we can estimate g_k by local regression of Y − Σ_{j≠k} g_j(X_j) on X_k.
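The backfitting update just described, smoothing the partial residual Y − Σ_{j≠k} g_j(X_j) against X_k, can be sketched as follows. This is hypothetical code (names our own, not the book's algorithm listing); for simplicity the one-dimensional smoother is a Nadaraya-Watson average with a Gaussian kernel, and the data are simulated from a noise-free additive model.

```python
import numpy as np

def smooth1d(x, r, lam=0.15):
    # Nadaraya-Watson smooth of the partial residual r on x,
    # evaluated at the sample points themselves (Gaussian kernel).
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / lam) ** 2)
    return K @ r / K.sum(axis=1)

def backfit_additive(X, y, iters=20):
    N, p = X.shape
    alpha = y.mean()
    g = np.zeros((p, N))                         # g[k] holds g_k evaluated at the data
    for _ in range(iters):
        for k in range(p):
            partial = y - alpha - g.sum(axis=0) + g[k]   # Y - sum_{j != k} g_j(X_j)
            g[k] = smooth1d(X[:, k], partial)
            g[k] -= g[k].mean()                  # center each component for identifiability
    return alpha, g

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = 1.0 + X[:, 0] ** 2 + np.sin(np.pi * X[:, 1])   # additive truth, no noise
alpha, g = backfit_additive(X, y)
fitted = alpha + g.sum(axis=0)
```

The key point of the text survives in the code: at every stage only a one-dimensional smooth is ever computed.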
This is done for each function in turn, repeatedly, until convergence. The important detail is that at any stage, one-dimensional local regression is all that is needed. The same ideas can be used to fit low-dimensional ANOVA decompositions.

An important special case of these structured models is the class of varying coefficient models. Suppose, for example, that we divide the p predictors in X into a set (X_1, X_2, ..., X_q) with q < p, and the remainder of

FIGURE 6.10. Aortic diameter vs age. In each panel the aorta diameter is modeled as a linear function of age. The coefficients of this model vary with gender (male panels above, female below) and depth down the aorta (left is near the top, right is low down). There is a clear trend in the coefficients of the linear model.

the variables we collect in the vector Z. We then assume the conditionally linear model

f(X) = α(Z) + β_1(Z) X_1 + ··· + β_q(Z) X_q.   (6.16)

For given Z, this is a linear model, but each of the coefficients can vary with Z. It is natural to fit such a model by locally weighted least squares:

min_{α(z_0), β(z_0)} Σ_{i=1}^N K_λ(z_0, z_i) (y_i − α(z_0) − x_{1i} β_1(z_0) − ··· − x_{qi} β_q(z_0))².   (6.17)

Figure 6.10 illustrates the idea on measurements of the human aorta. A longstanding claim has been that the aorta thickens with age. Here we model the diameter of the aorta as a linear function of age, but allow the coefficients to vary with gender and depth down the aorta. We used a local regression model separately for males and females. While the aorta clearly does thicken with age at the higher regions of the aorta, the relationship fades with distance down the aorta. Figure 6.11 shows the intercept and slope as a function of depth.

FIGURE 6.11. The intercept and slope of age as a function of distance down the aorta, separately for males and females. The yellow bands indicate one standard error.

6.5 Local Likelihood and Other Models

The concept of local regression and varying coefficient models is extremely broad: any parametric model can be made local if the fitting method accommodates observation weights. Here are some examples:

- Associated with each observation y_i is a parameter θ_i = θ(x_i) = x_i^T β linear in the covariate(s) x_i, and inference for β is based on the log-likelihood l(β) = Σ_{i=1}^N l(y_i, x_i^T β). We can model θ(x) more flexibly by using the likelihood local to x_0 for inference of θ(x_0) = x_0^T β(x_0):

l(β(x_0)) = Σ_{i=1}^N K_λ(x_0, x_i) l(y_i, x_i^T β(x_0)).

Many likelihood models, in particular the family of generalized linear models including logistic and log-linear models, involve the covariates in a linear fashion. Local likelihood allows a relaxation from a globally linear model to one that is locally linear.

- As above, except different variables are associated with θ from those used for defining the local likelihood:

l(θ(z_0)) = Σ_{i=1}^N K_λ(z_0, z_i) l(y_i, η(x_i, θ(z_0))).

For example, η(x, θ) = x^T θ could be a linear model in x. This will fit a varying coefficient model θ(z) by maximizing the local likelihood.

- Autoregressive time series models of order k have the form y_t = β_0 + β_1 y_{t−1} + β_2 y_{t−2} + ··· + β_k y_{t−k} + ε_t. Denoting the lag set by z_t = (y_{t−1}, y_{t−2}, ..., y_{t−k}), the model looks like a standard linear model y_t = z_t^T β + ε_t, and is typically fit by least squares. Fitting by local least squares with a kernel K(z_0, z_t) allows the model to vary according to the short-term history of the series. This is to be distinguished from the more traditional dynamic linear models that vary by windowing time.

As an illustration of local likelihood, we consider the local version of the multiclass linear logistic regression model of Chapter 4. The data consist of features x_i and an associated categorical response g_i ∈ {1, 2, ..., J}, and the linear model has the form

Pr(G = j | X = x) = e^{β_{j0} + β_j^T x} / (1 + Σ_{k=1}^{J−1} e^{β_{k0} + β_k^T x}).   (6.18)

The local log-likelihood for this J-class model can be written

Σ_{i=1}^N K_λ(x_0, x_i) { β_{g_i 0}(x_0) + β_{g_i}(x_0)^T (x_i − x_0)
  − log [ 1 + Σ_{k=1}^{J−1} exp( β_{k0}(x_0) + β_k(x_0)^T (x_i − x_0) ) ] }.   (6.19)

Notice that:

- we have used g_i as a subscript in the first line to pick out the appropriate numerator;

- β_{J0} = 0 and β_J = 0 by the definition of the model;

- we have centered the local regressions at x_0, so that the fitted posterior probabilities at x_0 are simply

ˆPr(G = j | X = x_0) = e^{β̂_{j0}(x_0)} / (1 + Σ_{k=1}^{J−1} e^{β̂_{k0}(x_0)}).   (6.20)
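For J = 2 classes, the local log-likelihood (6.19) can be maximized by weighted Newton (IRLS) steps; because the local regression is centered at x_0, the fitted probability at x_0 is just the intercept passed through the logistic transform, as in (6.20). The following is a hypothetical sketch on simulated data (tri-cube weights; names our own), assuming the weighted Newton iteration converges, as it typically does away from separation.

```python
import numpy as np

def local_logistic_prob(x0, x, y, lam, newton_steps=15):
    """Fitted Pr(Y=1 | x0) from a local linear logistic fit, two-class (6.19)."""
    t = np.abs(x - x0) / lam
    w = np.where(t < 1, (1 - t ** 3) ** 3, 0.0)        # tri-cube kernel weights
    B = np.column_stack([np.ones_like(x), x - x0])     # local regression centered at x0
    beta = np.zeros(2)
    for _ in range(newton_steps):
        p = 1.0 / (1.0 + np.exp(-(B @ beta)))
        grad = B.T @ (w * (y - p))                     # kernel-weighted score
        H = (B * (w * p * (1 - p))[:, None]).T @ B     # kernel-weighted information
        beta += np.linalg.solve(H, grad)
    return 1.0 / (1.0 + np.exp(-beta[0]))              # (6.20) with J = 2

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 1000)
p_true = 1.0 / (1.0 + np.exp(-(4 * x - 2)))            # true Pr(Y=1|x), increasing in x
y = (rng.uniform(size=1000) < p_true).astype(float)
p_mid = local_logistic_prob(0.5, x, y, lam=0.3)        # estimates a value near 0.5
```

Standard errors can be read off the same Newton machinery (the inverse of the weighted information), which is how pointwise bands like those in Figure 6.12 are produced.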

FIGURE 6.12. Each plot shows the binary response CHD (coronary heart disease) as a function of a risk factor for the South African heart disease data (panels: prevalence of CHD versus systolic blood pressure, and versus obesity). For each plot we have computed the fitted prevalence of CHD using a local linear logistic regression model. The unexpected increase in the prevalence of CHD at the lower ends of the ranges is because these are retrospective data, and some of the subjects had already undergone treatment to reduce their blood pressure and weight. The shaded region in the plot indicates an estimated pointwise standard error band.

This model can be used for flexible multiclass classification in moderately low dimensions, although successes have been reported with the high-dimensional ZIP-code classification problem. Generalized additive models (Chapter 9) using kernel smoothing methods are closely related, and avoid dimensionality problems by assuming an additive structure for the regression function.

As a simple illustration we fit a two-class local linear logistic model to the heart disease data of Chapter 4. Figure 6.12 shows the univariate local logistic models fit to two of the risk factors (separately). This is a useful screening device for detecting nonlinearities, when the data themselves have little visual information to offer. In this case an unexpected anomaly is uncovered in the data, which may have gone unnoticed with traditional methods.

Since CHD is a binary indicator, we could estimate the conditional prevalence Pr(G = j | x_0) by simply smoothing this binary response directly without resorting to a likelihood formulation. This amounts to fitting a locally constant logistic regression model (Exercise 6.5). In order to enjoy the bias correction of local-linear smoothing, it is more natural to operate on the unrestricted logit scale.

Typically with logistic regression, we compute parameter estimates as well as their standard errors. This can be done locally as well, and so

FIGURE 6.13. [Panel plotting a Density Estimate against Systolic Blood Pressure (for the CHD group).] A kernel density estimate for systolic blood pressure (for the CHD group). The density estimate at each point is the average contribution from each of the kernels at that point. We have scaled the kernels down by a factor of 10 to make the graph readable.

we can produce, as shown in the plot, estimated pointwise standard-error bands about our fitted prevalence.

6.6 Kernel Density Estimation and Classification

Kernel density estimation is an unsupervised learning procedure, which historically precedes kernel regression. It also leads naturally to a simple family of procedures for nonparametric classification.

6.6.1 Kernel Density Estimation

Suppose we have a random sample x_1, ..., x_N drawn from a probability density f_X(x), and we wish to estimate f_X at a point x_0. For simplicity we assume for now that X ∈ ℝ. Arguing as before, a natural local estimate has the form

$$\hat f_X(x_0) = \frac{\#\{x_i \in \mathcal N(x_0)\}}{N\lambda}, \qquad (6.21)$$

where N(x_0) is a small metric neighborhood around x_0 of width λ. This estimate is bumpy, and the smooth Parzen estimate is preferred,

$$\hat f_X(x_0) = \frac{1}{N\lambda}\sum_{i=1}^N K_\lambda(x_0, x_i), \qquad (6.22)$$

FIGURE 6.14. [Left panel: Density Estimates for the CHD and no-CHD groups against Systolic Blood Pressure; right panel: Posterior Estimate against Systolic Blood Pressure.] The left panel shows the two separate density estimates for systolic blood pressure in the CHD versus no-CHD groups, using a Gaussian kernel density estimate in each. The right panel shows the estimated posterior probabilities for CHD, using (6.25).

because it counts observations close to x_0 with weights that decrease with distance from x_0. In this case a popular choice for K_λ is the Gaussian kernel K_λ(x_0, x) = φ(|x − x_0|/λ). Figure 6.13 shows a Gaussian kernel density fit to the sample values for systolic blood pressure for the CHD group. Letting φ_λ denote the Gaussian density with mean zero and standard deviation λ, then (6.22) has the form

$$\hat f_X(x) = \frac{1}{N}\sum_{i=1}^N \varphi_\lambda(x - x_i) = (\hat F \star \varphi_\lambda)(x), \qquad (6.23)$$

the convolution of the sample empirical distribution $\hat F$ with φ_λ. The distribution $\hat F(x)$ puts mass 1/N at each of the observed x_i, and is jumpy; in $\hat f_X(x)$ we have smoothed $\hat F$ by adding independent Gaussian noise to each observation x_i.

The Parzen density estimate is the equivalent of the local average, and improvements have been proposed along the lines of local regression [on the log scale for densities; see Loader (1999)]. We will not pursue these here. In ℝ^p the natural generalization of the Gaussian density estimate amounts to using the Gaussian product kernel in (6.22),

$$\hat f_X(x_0) = \frac{1}{N(2\lambda^2\pi)^{p/2}} \sum_{i=1}^N e^{-\frac{1}{2}\left(\|x_i - x_0\|/\lambda\right)^2}. \qquad (6.24)$$
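The one-dimensional Parzen estimate above is short enough to sketch directly. The following is an illustrative implementation assuming a Gaussian kernel; the sample here is synthetic (blood-pressure-like numbers of our own invention, not the South African data), and the bandwidth `lam` is an arbitrary choice.

```python
import numpy as np

def parzen_gaussian(x0, x, lam):
    """Smooth Parzen estimate: f_hat(x0) = (1/N) * sum_i phi_lam(x0 - x_i),
    with phi_lam the Gaussian density with mean zero and sd lam."""
    z = (x0[:, None] - x[None, :]) / lam
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * lam * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
x = rng.normal(130.0, 15.0, 500)          # synthetic SBP-like sample
grid = np.linspace(60.0, 200.0, 561)
fhat = parzen_gaussian(grid, x, lam=5.0)

# trapezoid rule: a proper density estimate should integrate to ~1
mass = np.sum((fhat[1:] + fhat[:-1]) / 2 * np.diff(grid))
```

The p-dimensional product-kernel version (6.24) is the same computation with `z` replaced by the scaled Euclidean distances ‖x_i − x_0‖/λ and the normalizing constant raised to the p-th power.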

FIGURE 6.15. The population class densities may have interesting structure (left) that disappears when the posterior probabilities are formed (right).

6.6.2 Kernel Density Classification

One can use nonparametric density estimates for classification in a straightforward fashion using Bayes' theorem. Suppose for a J-class problem we fit nonparametric density estimates $\hat f_j(X)$, j = 1, ..., J separately in each of the classes, and we also have estimates of the class priors $\hat\pi_j$ (usually the sample proportions). Then

$$\widehat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{k=1}^J \hat\pi_k\, \hat f_k(x_0)}. \qquad (6.25)$$

Figure 6.14 uses this method to estimate the prevalence of CHD for the heart risk factor study, and should be compared with the left panel of Figure 6.12. The main difference occurs in the region of high SBP in the right panel of Figure 6.14. In this region the data are sparse for both classes, and since the Gaussian kernel density estimates use metric kernels, the density estimates are low and of poor quality (high variance) in these regions. The local logistic regression method (6.20) uses the tri-cube kernel with k-NN bandwidth; this effectively widens the kernel in this region, and makes use of the local linear assumption to smooth out the estimate (on the logit scale).

If classification is the ultimate goal, then learning the separate class densities well may be unnecessary, and can in fact be misleading. Figure 6.15 shows an example where the densities are both multimodal, but the posterior ratio is quite smooth. In learning the separate densities from data, one might decide to settle for a rougher, high-variance fit to capture these features, which are irrelevant for the purposes of estimating the posterior probabilities. In fact, if classification is the ultimate goal, then we need only to estimate the posterior well near the decision boundary (for two classes, this is the set {x | Pr(G = 1 | X = x) = 1/2}).

6.6.3 The Naive Bayes Classifier

This is a technique that has remained popular over the years, despite its name (also known as "Idiot's Bayes"!). It is especially appropriate when

the dimension p of the feature space is high, making density estimation unattractive. The naive Bayes model assumes that given a class G = j, the features X_k are independent:

$$f_j(X) = \prod_{k=1}^p f_{jk}(X_k). \qquad (6.26)$$

While this assumption is generally not true, it does simplify the estimation dramatically:

- The individual class-conditional marginal densities f_jk can each be estimated separately using one-dimensional kernel density estimates. This is in fact a generalization of the original naive Bayes procedures, which used univariate Gaussians to represent these marginals.
- If a component X_j of X is discrete, then an appropriate histogram estimate can be used. This provides a seamless way of mixing variable types in a feature vector.

Despite these rather optimistic assumptions, naive Bayes classifiers often outperform far more sophisticated alternatives. The reasons are related to Figure 6.15: although the individual class density estimates may be biased, this bias might not hurt the posterior probabilities as much, especially near the decision regions. In fact, the problem may be able to withstand considerable bias for the savings in variance such a "naive" assumption earns.

Starting from (6.26) we can derive the logit-transform (using class J as the base):

$$\log\frac{\Pr(G = \ell \mid X)}{\Pr(G = J \mid X)} = \log\frac{\pi_\ell\, f_\ell(X)}{\pi_J\, f_J(X)} = \log\frac{\pi_\ell \prod_{k=1}^p f_{\ell k}(X_k)}{\pi_J \prod_{k=1}^p f_{Jk}(X_k)} = \log\frac{\pi_\ell}{\pi_J} + \sum_{k=1}^p \log\frac{f_{\ell k}(X_k)}{f_{Jk}(X_k)} = \alpha_\ell + \sum_{k=1}^p g_{\ell k}(X_k). \qquad (6.27)$$

This has the form of a generalized additive model, which is described in more detail in Chapter 9. The models are fit in quite different ways though; their differences are explored in Exercise 6.9. The relationship between naive Bayes and generalized additive models is analogous to that between linear discriminant analysis and logistic regression (Section 4.4.5).
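The two ingredients just described — the Bayes-theorem posterior and the naive product of per-coordinate kernel density estimates — combine into a few lines of code. The sketch below is illustrative only: the two-class, two-feature Gaussian data, the fixed bandwidth `lam`, and all function names are our own assumptions.

```python
import numpy as np

def kde_1d(x0, x, lam):
    """One-dimensional Gaussian kernel density estimate at points x0."""
    z = (np.asarray(x0)[:, None] - x[None, :]) / lam
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * lam * np.sqrt(2 * np.pi))

def naive_bayes_posterior(X0, class_samples, priors, lam=0.5):
    """Posterior via Bayes' theorem with the naive factorization:
    each class density is a product of per-coordinate KDEs."""
    dens = []
    for Xj in class_samples:            # one training matrix per class
        per_dim = [kde_1d(X0[:, k], Xj[:, k], lam) for k in range(X0.shape[1])]
        dens.append(np.prod(per_dim, axis=0))
    num = np.array(priors)[:, None] * np.array(dens)
    return num / num.sum(axis=0)        # normalize: columns sum to one over classes

rng = np.random.default_rng(2)
X1 = rng.normal([-1.5, -1.5], 1.0, size=(300, 2))   # class 1 sample
X2 = rng.normal([+1.5, +1.5], 1.0, size=(300, 2))   # class 2 sample
test_pts = np.array([[-1.5, -1.5], [1.5, 1.5]])
post = naive_bayes_posterior(test_pts, [X1, X2], priors=[0.5, 0.5])
```

Even with a crude common bandwidth, the posterior at each class center is close to one for that class, illustrating the point made above: biased marginal density estimates can still give usable posteriors near the decision regions.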

6.7 Radial Basis Functions and Kernels

In Chapter 5, functions are represented as expansions in basis functions: f(x) = Σ_{j=1}^M β_j h_j(x). The art of flexible modeling using basis expansions consists of picking an appropriate family of basis functions, and then controlling the complexity of the representation by selection, regularization, or both. Some of the families of basis functions have elements that are defined locally; for example, B-splines are defined locally in ℝ. If more flexibility is desired in a particular region, then that region needs to be represented by more basis functions (which in the case of B-splines translates to more knots). Tensor products of ℝ-local basis functions deliver basis functions local in ℝ^p. Not all basis functions are local — for example, the truncated power bases for splines, or the sigmoidal basis functions σ(α_0 + αx) used in neural networks (see Chapter 11). The composed function f(x) can nevertheless show local behavior, because of the particular signs and values of the coefficients causing cancellations of global effects. For example, the truncated power basis has an equivalent B-spline basis for the same space of functions; the cancellation is exact in this case.

Kernel methods achieve flexibility by fitting simple models in a region local to the target point x_0. Localization is achieved via a weighting kernel K_λ, and individual observations receive weights K_λ(x_0, x_i).

Radial basis functions combine these ideas, by treating the kernel functions K_λ(ξ, x) as basis functions. This leads to the model

$$f(x) = \sum_{j=1}^M K_{\lambda_j}(\xi_j, x)\,\beta_j = \sum_{j=1}^M D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right)\beta_j, \qquad (6.28)$$

where each basis element is indexed by a location or prototype parameter ξ_j and a scale parameter λ_j. A popular choice for D is the standard Gaussian density function. There are several approaches to learning the parameters {λ_j, ξ_j, β_j}, j = 1, ..., M. For simplicity we will focus on least squares methods for regression, and use the Gaussian kernel.
- Optimize the sum-of-squares with respect to all the parameters:

$$\min_{\{\lambda_j, \xi_j, \beta_j\}_1^M} \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{j=1}^M \beta_j \exp\!\left\{ -\frac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2} \right\} \right)^2. \qquad (6.29)$$

This model is commonly referred to as an RBF network, an alternative to the sigmoidal neural network discussed in Chapter 11; the ξ_j and λ_j playing the role of the weights. This criterion is nonconvex

FIGURE 6.16. Gaussian radial basis functions in ℝ with fixed width can leave holes (top panel). Renormalized Gaussian radial basis functions avoid this problem, and produce basis functions similar in some respects to B-splines.

with multiple local minima, and the algorithms for optimization are similar to those used for neural networks.

- Estimate the {λ_j, ξ_j} separately from the β_j. Given the former, the estimation of the latter is a simple least squares problem. Often the kernel parameters λ_j and ξ_j are chosen in an unsupervised way using the X distribution alone. One of the methods is to fit a Gaussian mixture density model to the training x_i, which provides both the centers ξ_j and the scales λ_j. Other even more ad hoc approaches use clustering methods to locate the prototypes ξ_j, and treat λ_j = λ as a hyper-parameter. The obvious drawback of these approaches is that the conditional distribution Pr(Y|X), and in particular E(Y|X), has no say in where the action is concentrated. On the positive side, they are much simpler to implement.

While it would seem attractive to reduce the parameter set and assume a constant value for λ_j = λ, this can have an undesirable side effect of creating holes — regions of ℝ^p where none of the kernels has appreciable support, as illustrated in Figure 6.16 (upper panel). Renormalized radial basis functions,

$$h_j(x) = \frac{D(\|x - \xi_j\|/\lambda)}{\sum_{k=1}^M D(\|x - \xi_k\|/\lambda)}, \qquad (6.30)$$

avoid this problem (lower panel).

The Nadaraya–Watson kernel regression estimator (6.2) in ℝ^p can be viewed as an expansion in renormalized radial basis functions,

$$\hat f(x_0) = \frac{\sum_{i=1}^N y_i\, K_\lambda(x_0, x_i)}{\sum_{i=1}^N K_\lambda(x_0, x_i)} = \sum_{i=1}^N y_i\, h_i(x_0), \qquad (6.31)$$

with a basis function h_i located at every observation and coefficients y_i; that is, ξ_i = x_i, $\hat\beta_i = y_i$, i = 1, ..., N. Note the similarity between the expansion (6.31) and the solution (5.50) on page 169 to the regularization problem induced by the kernel K. Radial basis functions form the bridge between the modern "kernel methods" and local fitting technology.

6.8 Mixture Models for Density Estimation and Classification

The mixture model is a useful tool for density estimation, and can be viewed as a kind of kernel method. The Gaussian mixture model has the form

$$f(x) = \sum_{m=1}^M \alpha_m\, \varphi(x; \mu_m, \Sigma_m) \qquad (6.32)$$

with mixing proportions α_m, Σ_m α_m = 1, and each Gaussian density has a mean μ_m and covariance matrix Σ_m. In general, mixture models can use any component densities in place of the Gaussian in (6.32): the Gaussian mixture model is by far the most popular.

The parameters are usually fit by maximum likelihood, using the EM algorithm as described in Chapter 8. Some special cases arise:

- If the covariance matrices are constrained to be scalar: Σ_m = σ_m I, then (6.32) has the form of a radial basis expansion.
- If in addition σ_m = σ > 0 is fixed, and M = N, then the maximum likelihood estimate for (6.32) approaches the kernel density estimate (6.22), where $\hat\alpha_m = 1/N$ and $\hat\mu_m = x_m$.

Using Bayes' theorem, separate mixture densities in each class lead to flexible models for Pr(G|X); this is taken up in some detail in Chapter 12.

Figure 6.17 shows an application of mixtures to the heart disease risk-factor study. In the top row are histograms of Age for the no-CHD and CHD groups separately, and then combined on the right. Using the combined data, we fit a two-component mixture of the form (6.32) with the (scalars) Σ_1 and Σ_2 not constrained to be equal. Fitting was done via the EM algorithm (Chapter 8): note that the procedure does not use knowledge of the CHD labels. The resulting estimates were

μ̂_1 = 36.4,  Σ̂_1 = 157.7,  α̂_1 = 0.7,
μ̂_2 = 58.0,  Σ̂_2 = 15.6,  α̂_2 = 0.3.

The component densities φ(μ̂_1, Σ̂_1) and φ(μ̂_2, Σ̂_2) are shown in the lower-left and middle panels.
The lower-right panel shows these component densities (orange and blue) along with the estimated mixture density (green).

FIGURE 6.17. [Top row: three histograms of Count against Age, titled No CHD, CHD and Combined; bottom row: three Mixture Estimate panels against Age.] Application of mixtures to the heart disease risk-factor study. (Top row:) Histograms of Age for the no-CHD and CHD groups separately, and combined. (Bottom row:) Estimated component densities from a Gaussian mixture model (bottom left, bottom middle); (bottom right:) estimated component densities (blue and orange) along with the estimated mixture density (green). The orange density has a very large standard deviation, and approximates a uniform density.

The mixture model also provides an estimate of the probability that observation i belongs to component m,

$$\hat r_{im} = \frac{\hat\alpha_m\, \varphi(x_i; \hat\mu_m, \hat\Sigma_m)}{\sum_{k=1}^M \hat\alpha_k\, \varphi(x_i; \hat\mu_k, \hat\Sigma_k)}, \qquad (6.33)$$

where x_i is Age in our example. Suppose we threshold each value $\hat r_{i2}$ and hence define $\hat\delta_i = I(\hat r_{i2} > 0.5)$. Then we can compare the classification of each observation by CHD and the mixture model:

              Mixture model
              δ̂ = 0    δ̂ = 1
CHD   No        232       70
      Yes        76       84

Although the mixture model did not use the CHD labels, it has done a fair job in discovering the two CHD subpopulations. Linear logistic regression, using the CHD as a response, achieves the same error rate (32%) when fit to these data using maximum likelihood (Section 4.4).
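The EM fit described above, and the responsibilities used for the thresholding, can be sketched for a two-component univariate mixture. This is a minimal illustration on synthetic age-like data of our own making (not the actual study data); the crude percentile initialization and iteration count are arbitrary choices.

```python
import numpy as np

def em_two_gauss(x, n_iter=200):
    """EM for a two-component univariate Gaussian mixture.
    Returns means, variances, mixing proportions and responsibilities r_im."""
    mu = np.percentile(x, [25, 75]).astype(float)   # crude starting split
    var = np.array([x.var(), x.var()])
    alpha = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: r_im = alpha_m * phi_m(x_i) / sum_k alpha_k * phi_k(x_i)
        phi = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = alpha * phi
        r = r / r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, variances, mixing proportions
        n_m = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n_m
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_m
        alpha = n_m / len(x)
    return mu, var, alpha, r

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(36.0, 7.0, 350), rng.normal(58.0, 5.0, 150)])
mu, var, alpha, r = em_two_gauss(x)
```

As in the text, the procedure sees no class labels: the two subpopulations are recovered purely from the shape of the combined histogram, and thresholding a column of `r` at 0.5 would assign each observation to a component.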

6.9 Computational Considerations

Kernel and local regression and density estimation are memory-based methods: the model is the entire training data set, and the fitting is done at evaluation or prediction time. For many real-time applications, this can make this class of methods infeasible.

The computational cost to fit at a single observation x_0 is O(N) flops, except in oversimplified cases (such as square kernels). By comparison, an expansion in M basis functions costs O(M) for one evaluation, and typically M ∼ O(log N). Basis function methods have an initial cost of at least O(NM² + M³).

The smoothing parameter(s) λ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of O(N²) flops.

Popular implementations of local regression, such as the loess function in S-PLUS and R and the locfit procedure (Loader, 1999), use triangulation schemes to reduce the computations. They compute the fit exactly at M carefully chosen locations (O(NM)), and then use blending techniques to interpolate the fit elsewhere (O(M) per evaluation).

Bibliographic Notes

There is a vast literature on kernel methods which we will not attempt to summarize. Rather we will point to a few good references that themselves have extensive bibliographies. Loader (1999) gives excellent coverage of local regression and likelihood, and also describes state-of-the-art software for fitting these models. Fan and Gijbels (1996) cover these models from a more theoretical aspect. Hastie and Tibshirani (1990) discuss local regression in the context of additive modeling. Silverman (1986) gives a good overview of density estimation, as does Scott (1992).

Exercises

Ex. 6.1 Show that the Nadaraya–Watson kernel smooth with fixed metric bandwidth λ and a Gaussian kernel is differentiable. What can be said for the Epanechnikov kernel? What can be said for the Epanechnikov kernel with adaptive nearest-neighbor bandwidth λ(x_0)?

Ex. 6.2 Show that $\sum_{i=1}^N (x_i - x_0)\, l_i(x_0) = 0$ for local linear regression. Define $b_j(x_0) = \sum_{i=1}^N (x_i - x_0)^j\, l_i(x_0)$.
Show that b_0(x_0) = 1 for local polynomial regression of any degree (including local constants). Show that b_j(x_0) = 0 for all j ∈ {1, 2, ..., k} for local polynomial regression of degree k. What are the implications of this on the bias?
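The weight identities in this exercise are easy to check numerically, since the local linear fit at x_0 is linear in y with equivalent-kernel weights l_i(x_0). The sketch below computes those weights explicitly (Gaussian weighting kernel and all names are our own choices) and evaluates b_0 and b_1; it is a numerical check, not a proof.

```python
import numpy as np

def local_linear_weights(x0, x, lam):
    """Equivalent-kernel weights l_i(x0) for local linear regression:
    f_hat(x0) = [1, x0] (B^T W B)^{-1} B^T W y = sum_i l_i(x0) y_i."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)          # kernel weights
    B = np.column_stack([np.ones_like(x), x])         # design with intercept
    M = np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)  # shape (2, N)
    return np.array([1.0, x0]) @ M

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 200)
l = local_linear_weights(0.4, x, lam=0.1)
b0 = l.sum()                        # b_0(x0): should equal 1
b1 = ((x - 0.4) * l).sum()          # b_1(x0): should vanish for local linear
```

Both identities hold to machine precision regardless of the sample, which is the mechanism behind the automatic first-order bias correction of local linear fits.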

Ex. 6.3 Show that ‖l(x)‖ (Section 6.1.2) increases with the degree of the local polynomial.

Ex. 6.4 Suppose that the p predictors X arise from sampling relatively smooth analog curves at p uniformly spaced abscissa values. Denote by Cov(X|Y) = Σ the conditional covariance matrix of the predictors, and assume this does not change much with Y. Discuss the nature of the Mahalanobis choice A = Σ^{−1} for the metric in (6.14). How does this compare with A = I? How might you construct a kernel A that (a) downweights high-frequency components in the distance metric; (b) ignores them completely?

Ex. 6.5 Show that fitting a locally constant multinomial logit model of the form (6.19) amounts to smoothing the binary response indicators for each class separately using a Nadaraya–Watson kernel smoother with kernel weights K_λ(x_0, x_i).

Ex. 6.6 Suppose that all you have is software for fitting local regression, but you can specify exactly which monomials are included in the fit. How could you use this software to fit a varying-coefficient model in some of the variables?

Ex. 6.7 Derive an expression for the leave-one-out cross-validated residual sum-of-squares for local polynomial regression.

Ex. 6.8 Suppose that for continuous response Y and predictor X, we model the joint density of X, Y using a multivariate Gaussian kernel estimator. Note that the kernel in this case would be the product kernel φ_λ(X) φ_λ(Y). Show that the conditional mean E(Y|X) derived from this estimate is a Nadaraya–Watson estimator. Extend this result to classification by providing a suitable kernel for the estimation of the joint distribution of a continuous X and discrete Y.

Ex. 6.9 Explore the differences between the naive Bayes model (6.27) and a generalized additive logistic regression model, in terms of (a) model assumptions and (b) estimation. If all the variables X_k are discrete, what can you say about the corresponding GAM?

Ex. 6.10 Suppose we have N samples generated from the model y_i = f(x_i) + ε_i, with ε_i independent and identically distributed with mean zero and variance σ², the x_i assumed fixed (non-random).
We estimate f using a linear smoother (local regression, smoothing spline, etc.) with smoothing parameter λ. Thus the vector of fitted values is given by $\hat{\mathbf f} = \mathbf S_\lambda \mathbf y$. Consider the in-sample prediction error

$$\mathrm{PE}(\lambda) = \mathrm{E}\,\frac{1}{N}\sum_{i=1}^N \bigl(y_i^* - \hat f_\lambda(x_i)\bigr)^2 \qquad (6.34)$$

237 8 6. Kernel Smthing Methds fr predicting new respnses at the N input values. Shw that the average squared residual n the training data, ASR(λ), is a biased estimate (ptimistic) fr PE(λ), while is unbiased. C λ = ASR(λ) + σ N trace(s λ) (6.5) Ex. 6. Shw that fr the Gaussian mixture mdel (6.) the likelihd is maximized at +, and describe hw. Ex. 6. Write a cmputer prgram t perfrm a lcal linear discriminant analysis. At each query pint x 0, the training data receive weights K λ (x 0,x i ) frm a weighting kernel, and the ingredients fr the linear decisin bundaries (see Sectin 4.) are cmputed by weighted averages. Try ut yur prgram n the zipcde data, and shw the training and test errrs fr a series f five pre-chsen values f λ. The zipcde data are available frm the bk website www-stat.stanfrd.edu/elemstatlearn.

7 Model Assessment and Selection

7.1 Introduction

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model. In this chapter we describe and illustrate the key methods for performance assessment, and show how they are used to select models. We begin the chapter with a discussion of the interplay between bias, variance and model complexity.

7.2 Bias, Variance and Model Complexity

Figure 7.1 illustrates the important issue in assessing the ability of a learning method to generalize. Consider first the case of a quantitative or interval-scale response. We have a target variable Y, a vector of inputs X, and a prediction model $\hat f(X)$ that has been estimated from a training set T. The loss function for measuring errors between Y and $\hat f(X)$ is denoted by $L(Y, \hat f(X))$. Typical choices are

$$L(Y, \hat f(X)) = \begin{cases} (Y - \hat f(X))^2 & \text{squared error} \\ |Y - \hat f(X)| & \text{absolute error.} \end{cases} \qquad (7.1)$$

FIGURE 7.1. [Panel plotting Prediction Error against Model Complexity (df), annotated "High Bias, Low Variance" at the left and "Low Bias, High Variance" at the right.] Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error err, while the light red curves show the conditional test error Err_T for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error Err and the expected training error E[err].

Test error, also referred to as generalization error, is the prediction error over an independent test sample

$$\mathrm{Err}_{\mathcal T} = \mathrm{E}[L(Y, \hat f(X)) \mid \mathcal T] \qquad (7.2)$$

where both X and Y are drawn randomly from their joint distribution (population). Here the training set T is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error)

$$\mathrm{Err} = \mathrm{E}[L(Y, \hat f(X))] = \mathrm{E}[\mathrm{Err}_{\mathcal T}]. \qquad (7.3)$$

Note that this expectation averages over everything that is random, including the randomness in the training set that produced $\hat f$.

Figure 7.1 shows the prediction error (light red curves) Err_T for 100 simulated training sets each of size 50. The lasso (Section 3.4.2) was used to produce the sequence of fits. The solid red curve is the average, and hence an estimate of Err.

Estimation of Err_T will be our goal, although we will see that Err is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional

error effectively, given only the information in the same training set. Some discussion of this point is given in Section 7.12.

Training error is the average loss over the training sample

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i)). \qquad (7.4)$$

We would like to know the expected test error of our estimated model $\hat f$. As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Hence there is a decrease in bias but an increase in variance. There is some intermediate model complexity that gives minimum expected test error.

Unfortunately training error is not a good estimate of the test error, as seen in Figure 7.1. Training error consistently decreases with model complexity, typically dropping to zero if we increase the model complexity enough. However, a model with zero training error is overfit to the training data and will typically generalize poorly.

The story is similar for a qualitative or categorical response G taking one of K values in a set G, labeled for convenience as 1, 2, ..., K. Typically we model the probabilities p_k(X) = Pr(G = k | X) (or some monotone transformations f_k(X)), and then $\hat G(X) = \arg\max_k \hat p_k(X)$. In some cases, such as 1-nearest-neighbor classification (Chapters 2 and 13) we produce $\hat G(X)$ directly. Typical loss functions are

$$L(G, \hat G(X)) = I(G \ne \hat G(X)) \quad (0\text{–}1 \text{ loss}), \qquad (7.5)$$
$$L(G, \hat p(X)) = -2\sum_{k=1}^K I(G = k)\,\log \hat p_k(X) = -2\,\log \hat p_G(X) \quad (-2 \times \text{log-likelihood}). \qquad (7.6)$$

The quantity −2 × the log-likelihood is sometimes referred to as the deviance.

Again, test error here is $\mathrm{Err}_{\mathcal T} = \mathrm{E}[L(G, \hat G(X)) \mid \mathcal T]$, the population misclassification error of the classifier trained on T, and Err is the expected misclassification error. Training error is the sample analogue, for example,

$$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^N \log \hat p_{g_i}(x_i), \qquad (7.7)$$

the sample log-likelihood for the model.

The log-likelihood can be used as a loss function for general response densities, such as the Poisson, gamma, exponential, log-normal and others.
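The two classification losses (7.5) and (7.6) can be sketched on a toy probability matrix; the matrix and class labels below are invented for illustration, with classes coded 0, ..., K−1 as is conventional in code.

```python
import numpy as np

def zero_one_loss(g, p_hat):
    """0-1 loss (7.5): compare g with the argmax classifier G_hat(X)."""
    return (g != np.argmax(p_hat, axis=1)).astype(float)

def deviance_loss(g, p_hat):
    """-2 x log-likelihood loss (7.6): -2 log p_hat_G(X)."""
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

p_hat = np.array([[0.7, 0.2, 0.1],      # rows: observations, cols: classes
                  [0.1, 0.5, 0.4],
                  [0.3, 0.3, 0.4]])
g = np.array([0, 2, 2])                  # true classes, coded 0..K-1
err01 = zero_one_loss(g, p_hat).mean()   # training error under 0-1 loss
errdev = deviance_loss(g, p_hat).mean()  # training error under deviance
```

Note how the second observation is the only misclassification under 0–1 loss, while the deviance also penalizes the third observation's lukewarm probability 0.4 on the correct class — the deviance rewards confident, correct probabilities, not just correct argmax decisions.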
If $\Pr_{\theta(X)}(Y)$ is the density of Y, indexed by a parameter θ(X) that depends on the predictor X, then

$$L(Y, \theta(X)) = -2 \cdot \log \Pr\nolimits_{\theta(X)}(Y). \qquad (7.8)$$

The "−2" in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.

For ease of exposition, for the remainder of this chapter we will use Y and f(X) to represent all of the above situations, since we focus mainly on the quantitative response (squared-error loss) setting. For the other situations, the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the expected test error for a model. Typically our model will have a tuning parameter or parameters α and so we can write our predictions as $\hat f_\alpha(x)$. The tuning parameter varies the complexity of our model, and we wish to find the value of α that minimizes error, that is, produces the minimum of the average test error curve in Figure 7.1. Having said this, for brevity we will often suppress the dependence of $\hat f(x)$ on α.

It is important to note that there are in fact two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a "vault," and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with smallest test-set error. Then the test-set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size.
A typical split might be 50% for training, and 25% each for validation and testing:

Train | Validation | Test

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.
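In code, the three-way division described above is a single random permutation of the sample indices; the helper below is a minimal sketch (function name and default fractions are ours, following the 50%/25%/25% suggestion in the text).

```python
import numpy as np

def train_val_test_split(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly partition n sample indices into train/validation/test sets."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr = int(fractions[0] * n)
    n_va = int(fractions[1] * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

tr, va, te = train_val_test_split(1000)
```

The discipline matters more than the mechanics: models are fit on `tr`, compared on `va`, and `te` is consulted once, at the very end, so that its error estimate is not contaminated by repeated selection.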

The methods of this chapter approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use (cross-validation and the bootstrap). Besides their use in model selection, we also examine to what extent each method provides a reliable estimate of test error of the final chosen model.

Before jumping into these topics, we first explore in more detail the nature of test error and the bias–variance tradeoff.

7.3 The Bias–Variance Decomposition

As in Chapter 2, if we assume that Y = f(X) + ε where E(ε) = 0 and Var(ε) = σ²_ε, we can derive an expression for the expected prediction error of a regression fit $\hat f(X)$ at an input point X = x_0, using squared-error loss:

$$\begin{aligned}
\mathrm{Err}(x_0) &= \mathrm{E}[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [\mathrm{E}\hat f(x_0) - f(x_0)]^2 + \mathrm{E}[\hat f(x_0) - \mathrm{E}\hat f(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance.}
\end{aligned} \qquad (7.9)$$

The first term is the variance of the target around its true mean f(x_0), and cannot be avoided no matter how well we estimate f(x_0), unless σ²_ε = 0. The second term is the squared bias, the amount by which the average of our estimate differs from the true mean; the last term is the variance, the expected squared deviation of $\hat f(x_0)$ around its mean. Typically the more complex we make the model $\hat f$, the lower the (squared) bias but the higher the variance.

For the k-nearest-neighbor regression fit, these expressions have the simple form

$$\mathrm{Err}(x_0) = \mathrm{E}[(Y - \hat f_k(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + \Bigl[f(x_0) - \frac{1}{k}\sum_{l=1}^k f(x_{(l)})\Bigr]^2 + \frac{\sigma_\varepsilon^2}{k}. \qquad (7.10)$$

Here we assume for simplicity that training inputs x_i are fixed, and the randomness arises from the y_i. The number of neighbors k is inversely related to the model complexity. For small k, the estimate $\hat f_k(x)$ can potentially adapt itself better to the underlying f(x). As we increase k, the bias — the squared difference between f(x_0) and the average of f(x) at the k nearest neighbors — will typically increase, while the variance decreases.
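The k-nearest-neighbor decomposition above is simple enough to verify by Monte Carlo: with the x_i held fixed and fresh y_i drawn each round, the mean of the fits at x_0 should equal the average of f over the k neighbors, and their variance should equal σ²/k. The setup below (sine target, chosen N, k, σ) is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
N, k, sigma, x0 = 100, 9, 0.5, 0.3
x = np.sort(rng.uniform(0, 1, N))          # fixed (non-random) inputs
f = np.sin(2 * np.pi * x)
f0 = np.sin(2 * np.pi * x0)                # true mean at the target point
nn = np.argsort(np.abs(x - x0))[:k]        # indices of the k nearest neighbors

fits = np.empty(4000)
for b in range(4000):                      # randomness enters through the y_i only
    y = f + sigma * rng.normal(size=N)
    fits[b] = y[nn].mean()                 # k-NN regression fit at x0

bias2_theory = (f0 - f[nn].mean()) ** 2    # squared-bias term of the decomposition
var_theory = sigma ** 2 / k                # variance term of the decomposition
var_mc = fits.var()                        # Monte Carlo variance of the fit
```

Increasing k shrinks `var_theory` but pulls the neighbor average further from f0, which is the bias–variance tradeoff in miniature.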
For a linear model fit $\hat f_p(x) = x^T \hat\beta$, where the parameter vector β with p components is fit by least squares, we have

$$\mathrm{Err}(x_0) = \mathrm{E}[(Y - \hat f_p(x_0))^2 \mid X = x_0]$$

$$= \sigma_\varepsilon^2 + [f(x_0) - \mathrm{E}\hat f_p(x_0)]^2 + \|\mathbf h(x_0)\|^2 \sigma_\varepsilon^2. \qquad (7.11)$$

Here $\mathbf h(x_0) = \mathbf X(\mathbf X^T \mathbf X)^{-1} x_0$, the N-vector of linear weights that produce the fit $\hat f_p(x_0) = x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y$, and hence $\mathrm{Var}[\hat f_p(x_0)] = \|\mathbf h(x_0)\|^2 \sigma_\varepsilon^2$. While this variance changes with x_0, its average (with x_0 taken to be each of the sample values x_i) is (p/N) σ²_ε, and hence

$$\frac{1}{N}\sum_{i=1}^N \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^N [f(x_i) - \mathrm{E}\hat f(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2, \qquad (7.12)$$

the in-sample error. Here model complexity is directly related to the number of parameters p.

The test error Err(x_0) for a ridge regression fit $\hat f_\alpha(x_0)$ is identical in form to (7.11), except the linear weights in the variance term are different: $\mathbf h(x_0) = \mathbf X(\mathbf X^T \mathbf X + \alpha \mathbf I)^{-1} x_0$. The bias term will also be different.

For a linear model family such as ridge regression, we can break down the bias more finely. Let β_* denote the parameters of the best-fitting linear approximation to f:

$$\beta_* = \arg\min_\beta \mathrm{E}\bigl(f(X) - X^T\beta\bigr)^2. \qquad (7.13)$$

Here the expectation is taken with respect to the distribution of the input variables X. Then we can write the average squared bias as

$$\mathrm{E}_{x_0}\bigl[f(x_0) - \mathrm{E}\hat f_\alpha(x_0)\bigr]^2 = \mathrm{E}_{x_0}\bigl[f(x_0) - x_0^T\beta_*\bigr]^2 + \mathrm{E}_{x_0}\bigl[x_0^T\beta_* - \mathrm{E}\,x_0^T\hat\beta_\alpha\bigr]^2 = \mathrm{Ave[Model\ Bias]}^2 + \mathrm{Ave[Estimation\ Bias]}^2 \qquad (7.14)$$

The first term on the right-hand side is the average squared model bias, the error between the best-fitting linear approximation and the true function. The second term is the average squared estimation bias, the error between the average estimate $\mathrm E(x_0^T\hat\beta)$ and the best-fitting linear approximation.

For linear models fit by ordinary least squares, the estimation bias is zero. For restricted fits, such as ridge regression, it is positive, and we trade it off with the benefits of a reduced variance. The model bias can only be reduced by enlarging the class of linear models to a richer collection of models, by including interactions and transformations of the variables in the model.

Figure 7.2 shows the bias–variance tradeoff schematically.
In the case of linear models, the model space is the set of all linear predictions from p inputs, and the black dot labeled "closest fit" is x^T β_*. The blue-shaded region indicates the error σ_ε with which we see the truth in the training sample. Also shown is the variance of the least squares fit, indicated by the large yellow circle centered at the black dot labeled "closest fit in population,"

FIGURE 7.2. [Schematic showing Truth, a Realization, the "Closest fit" and "Closest fit in population" within MODEL SPACE, annotated with Model bias, Estimation Bias and Estimation Variance, and a Shrunken fit within a RESTRICTED MODEL SPACE.] Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population". A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.

Now if we were to fit a model with fewer predictors, or regularize the coefficients by shrinking them toward zero (say), we would get the "shrunken fit" shown in the figure. This fit has an additional estimation bias, due to the fact that it is not the closest fit in the model space. On the other hand, it has smaller variance. If the decrease in variance exceeds the increase in (squared) bias, then this is worthwhile.

7.3.1 Example: Bias–Variance Tradeoff

Figure 7.3 shows the bias–variance tradeoff for two simulated examples. There are 80 observations and 20 predictors, uniformly distributed in the hypercube [0, 1]^20. The situations are as follows:

Left panels: Y is 0 if X_1 ≤ 1/2 and 1 if X_1 > 1/2, and we apply k-nearest neighbors.

Right panels: Y is 1 if $\sum_{j=1}^{10} X_j$ is greater than 5 and 0 otherwise, and we use best subset linear regression of size p.

The top row is regression with squared error loss; the bottom row is classification with 0–1 loss. The figures show the prediction error (red), squared bias (green) and variance (blue), all computed for a large test sample.

In the regression problems, bias and variance add to produce the prediction error curves, with minima at about k = 5 for k-nearest neighbors, and p ≥ 10 for the linear model. For classification loss (bottom figures), some interesting phenomena can be seen. The bias and variance curves are the same as in the top figures, and prediction error now refers to misclassification rate. We see that prediction error is no longer the sum of squared bias and variance. For the k-nearest neighbor classifier, prediction error decreases or stays the same as the number of neighbors is increased to 20, despite the fact that the squared bias is rising. For the linear model classifier the minimum occurs for p ≥ 10 as in regression, but the improvement over the p = 1 model is more dramatic. We see that bias and variance seem to interact in determining prediction error.

Why does this happen? There is a simple explanation for the first phenomenon.
Suppose at a given input point, the true probability of class 1 is 0.9 while the expected value of our estimate is 0.6. Then the squared bias, (0.6 − 0.9)², is considerable, but the prediction error is zero since we make the correct decision. In other words, estimation errors that leave us on the right side of the decision boundary don't hurt. Exercise 7.2 demonstrates this phenomenon analytically, and also shows the interaction effect between bias and variance.

The overall point is that the bias–variance tradeoff behaves differently for 0–1 loss than it does for squared error loss. This in turn means that the best choices of tuning parameters may differ substantially in the two

[Figure: four panels titled "k-NN Regression," "Linear Model Regression," "k-NN Classification," and "Linear Model Classification," with x-axes "Number of Neighbors k" and "Subset Size p."]

FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0–1 loss. The models are k-nearest neighbors (left) and best subset regression of size p (right). The variance and bias curves are the same in regression and classification, but the prediction error curve is different.
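A bias–variance computation of this kind is easy to sketch by simulation. The setup below is an illustrative assumption, not the book's exact experiment: a one-dimensional k-nearest-neighbor regression, with the bias and variance of the estimate at a fixed test point x0 estimated over repeated training-set draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # assumed "truth": a smooth function, not the book's simulation
    return np.sin(2 * np.pi * x)

def knn_predict(x_tr, y_tr, x0, k):
    # average the responses of the k training points nearest to x0
    idx = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[idx].mean()

x0, sigma, n, reps = 0.25, 0.3, 50, 2000
results = {}
for k in (1, 10):
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = f_true(x) + rng.normal(0, sigma, n)
        preds[r] = knn_predict(x, y, x0, k)
    # squared bias and variance of the estimate at x0, over training sets
    results[k] = ((preds.mean() - f_true(x0)) ** 2, preds.var())
    print(f"k={k:2d}  bias^2={results[k][0]:.4f}  variance={results[k][1]:.4f}")
```

As in the left panels of Figure 7.3, increasing k lowers the variance (roughly sigma^2/k) while the squared bias creeps up.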

settings. One should base the choice of tuning parameter on an estimate of prediction error, as described in the following sections.

7.4 Optimism of the Training Error Rate

Discussions of error rate estimation can be confusing, because we have to make clear which quantities are fixed and which are random¹. Before we continue, we need a few definitions, elaborating on the material of Section 7.2. Given a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, the generalization error of a model f̂ is

    Err_T = E_{X^0, Y^0}[L(Y^0, f̂(X^0)) | T].    (7.15)

Note that the training set T is fixed in expression (7.15). The point (X^0, Y^0) is a new test data point, drawn from F, the joint distribution of the data. Averaging over training sets T yields the expected error

    Err = E_T E_{X^0, Y^0}[L(Y^0, f̂(X^0)) | T],    (7.16)

which is more amenable to statistical analysis. As mentioned earlier, it turns out that most methods effectively estimate the expected error rather than Err_T; see Section 7.12 for more on this point.

Now typically, the training error

    err̄ = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i))    (7.17)

will be less than the true error Err_T, because the same data is being used to fit the method and assess its error (see Exercise 2.9). A fitting method typically adapts to the training data, and hence the apparent or training error err̄ will be an overly optimistic estimate of the generalization error Err_T.

Part of the discrepancy is due to where the evaluation points occur. The quantity Err_T can be thought of as extra-sample error, since the test input vectors don't need to coincide with the training input vectors. The nature of the optimism in err̄ is easiest to understand when we focus instead on the in-sample error

    Err_in = (1/N) Σ_{i=1}^N E_{Y^0}[L(Y_i^0, f̂(x_i)) | T].    (7.18)

The Y^0 notation indicates that we observe N new response values at each of the training points x_i, i = 1, 2, ..., N. We define the optimism as

¹Indeed, in the first edition of our book, this section wasn't sufficiently clear.

the difference between Err_in and the training error err̄:

    op ≡ Err_in − err̄.    (7.19)

This is typically positive since err̄ is usually biased downward as an estimate of prediction error. Finally, the average optimism is the expectation of the optimism over training sets:

    ω ≡ E_y(op).    (7.20)

Here the predictors in the training set are fixed, and the expectation is over the training set outcome values; hence we have used the notation E_y instead of E_T. We can usually estimate only the expected error ω rather than op, in the same way that we can estimate the expected error Err rather than the conditional error Err_T.

For squared error, 0–1, and other loss functions, one can show quite generally that

    ω = (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i),    (7.21)

where Cov indicates covariance. Thus the amount by which err̄ underestimates the true error depends on how strongly y_i affects its own prediction. The harder we fit the data, the greater Cov(ŷ_i, y_i) will be, thereby increasing the optimism. Exercise 7.4 proves this result for squared error loss where ŷ_i is the fitted value from the regression. For 0–1 loss, ŷ_i ∈ {0, 1} is the classification at x_i, and for entropy loss, ŷ_i ∈ [0, 1] is the fitted probability of class 1 at x_i.

In summary, we have the important relation

    E_y(Err_in) = E_y(err̄) + (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i).    (7.22)

This expression simplifies if ŷ_i is obtained by a linear fit with d inputs or basis functions. For example,

    Σ_{i=1}^N Cov(ŷ_i, y_i) = d σ_ε²    (7.23)

for the additive error model Y = f(X) + ε, and so

    E_y(Err_in) = E_y(err̄) + 2 · (d/N) σ_ε².    (7.24)

Expression (7.23) is the basis for the definition of the effective number of parameters discussed in Section 7.6. The optimism increases linearly with

the number d of inputs or basis functions we use, but decreases as the training sample size increases. Versions of (7.24) hold approximately for other error models, such as binary data and entropy loss.

An obvious way to estimate prediction error is to estimate the optimism and then add it to the training error err̄. The methods described in the next section, C_p, AIC, BIC and others, work in this way, for a special class of estimates that are linear in their parameters. In contrast, cross-validation and bootstrap methods, described later in the chapter, are direct estimates of the extra-sample error Err. These general tools can be used with any loss function, and with nonlinear, adaptive fitting techniques.

In-sample error is not usually of direct interest since future values of the features are not likely to coincide with their training set values. But for comparison between models, in-sample error is convenient and often leads to effective model selection. The reason is that the relative (rather than absolute) size of the error is what matters.

7.5 Estimates of In-Sample Prediction Error

The general form of the in-sample estimates is

    Êrr_in = err̄ + ω̂,    (7.25)

where ω̂ is an estimate of the average optimism. Using expression (7.24), applicable when d parameters are fit under squared error loss, leads to a version of the so-called C_p statistic,

    C_p = err̄ + 2 · (d/N) σ̂_ε².    (7.26)

Here σ̂_ε² is an estimate of the noise variance, obtained from the mean squared error of a low-bias model. Using this criterion we adjust the training error by a factor proportional to the number of basis functions used.

The Akaike information criterion is a similar but more generally applicable estimate of Err_in when a log-likelihood loss function is used. It relies on a relationship similar to (7.24) that holds asymptotically as N → ∞:

    −2 · E[log Pr_θ̂(Y)] ≈ −(2/N) · E[loglik] + 2 · (d/N).    (7.27)

Here Pr_θ(Y) is a family of densities for Y (containing the "true" density), θ̂ is the maximum-likelihood estimate of θ, and "loglik" is the maximized log-likelihood:

    loglik = Σ_{i=1}^N log Pr_θ̂(y_i).    (7.28)
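Relation (7.23) can be checked directly by simulation; the dimensions and noise level below are illustrative assumptions. With the design held fixed, we draw many response vectors, fit least squares, and estimate Σ Cov(ŷ_i, y_i), which theory says equals d·σ_ε²:

```python
import numpy as np

rng = np.random.default_rng(1)

N, d, sigma = 30, 5, 1.0
X = rng.normal(size=(N, d))              # fixed design with d inputs
f = X @ rng.normal(size=d)               # fixed "true" mean values f(x_i)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix: yhat = H y

reps = 5000
Y = f + rng.normal(0, sigma, size=(reps, N))   # many response-vector draws
Yhat = Y @ H                                    # least-squares fits (H symmetric)

# Monte Carlo estimate of sum_i Cov(yhat_i, y_i); theory: d * sigma^2
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(N))
print(cov_sum)
```

The printed value should be close to d·σ_ε² = 5, and equals trace(H)·σ_ε², anticipating the effective-degrees-of-freedom definition of Section 7.6.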

For example, for the logistic regression model, using the binomial log-likelihood, we have

    AIC = −(2/N) · loglik + 2 · (d/N).    (7.29)

For the Gaussian model (with variance σ_ε² = σ̂_ε² assumed known), the AIC statistic is equivalent to C_p, and so we refer to them collectively as AIC.

To use AIC for model selection, we simply choose the model giving smallest AIC over the set of models considered. For nonlinear and other complex models, we need to replace d by some measure of model complexity. We discuss this in Section 7.6.

Given a set of models f_α(x) indexed by a tuning parameter α, denote by err̄(α) and d(α) the training error and number of parameters for each model. Then for this set of models we define

    AIC(α) = err̄(α) + 2 · (d(α)/N) · σ̂_ε².    (7.30)

The function AIC(α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it. Our final chosen model is f_α̂(x). Note that if the basis functions are chosen adaptively, (7.23) no longer holds. For example, if we have a total of p inputs, and we choose the best-fitting linear model with d < p inputs, the optimism will exceed 2 · (d/N) σ_ε². Put another way, by choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d.

Figure 7.4 shows AIC in action for the phoneme recognition example of Section 5.2.3 on page 148. The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies. A linear logistic regression model is used to predict the phoneme class, with coefficient function β(f) = Σ_{m=1}^M h_m(f) θ_m, an expansion in M spline basis functions. For any given M, a basis of natural cubic splines is used for the h_m, with knots chosen uniformly over the range of frequencies (so d(α) = d(M) = M). Using AIC to select the number of basis functions will approximately minimize Err(M) for both entropy and 0–1 loss. The simple formula

    (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i) = 2 · (d/N) σ_ε²

holds exactly for linear models with additive errors and squared error loss, and approximately for linear models and log-likelihoods.
In particular, the formula does not hold in general for 0–1 loss (Efron, 1986), although many authors nevertheless use it in that context (right panel of Figure 7.4).
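AIC(α) in (7.30) is simple to compute for linear fits. The polynomial-regression example below is an illustrative sketch, not the book's phoneme data; the data-generating function, sample size, and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

N, sigma = 100, 0.5
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, sigma, N)

def poly_train_err(deg):
    # least-squares polynomial of degree deg; returns training MSE and d
    coefs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coefs, x)
    return np.mean(resid ** 2), deg + 1

# estimate the noise variance from a deliberately low-bias (high-degree) model
err_low_bias, d_low = poly_train_err(12)
sigma2_hat = err_low_bias * N / (N - d_low)

aic = {}
for deg in range(1, 11):
    err_bar, d = poly_train_err(deg)
    aic[deg] = err_bar + 2 * (d / N) * sigma2_hat   # equation (7.30)

best = min(aic, key=aic.get)
print("AIC-chosen degree:", best)
```

Training error alone would always favor degree 10; the 2·(d/N)·σ̂_ε² penalty pushes the choice back toward a moderate degree.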

[Figure: two panels, "Log-likelihood Loss" and "0–1 Loss," showing train error, test error, and AIC (log-likelihood and misclassification error) against the number of basis functions.]

FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Section 5.2.3. The logistic regression coefficient function β(f) = Σ_{m=1}^M h_m(f) θ_m is modeled as an expansion in M spline basis functions. In the left panel we see the AIC statistic used to estimate Err_in using log-likelihood loss. Included is an estimate of Err based on an independent test sample. It does well except for the extremely over-parametrized case (M = 256 parameters for N = 1000 observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.

7.6 The Effective Number of Parameters

The concept of "number of parameters" can be generalized, especially to models where regularization is used in the fitting. Suppose we stack the outcomes y_1, y_2, ..., y_N into a vector y, and similarly for the predictions ŷ. Then a linear fitting method is one for which we can write

    ŷ = S y,    (7.31)

where S is an N × N matrix depending on the input vectors x_i but not on the y_i. Linear fitting methods include linear regression on the original features or on a derived basis set, and smoothing methods that use quadratic shrinkage, such as ridge regression and cubic smoothing splines. Then the effective number of parameters is defined as

    df(S) = trace(S),    (7.32)

the sum of the diagonal elements of S (also known as the effective degrees of freedom). Note that if S is an orthogonal-projection matrix onto a basis

set spanned by M features, then trace(S) = M. It turns out that trace(S) is exactly the correct quantity to replace d as the number of parameters in the C_p statistic (7.26). If y arises from an additive-error model Y = f(X) + ε with Var(ε) = σ_ε², then one can show that Σ_{i=1}^N Cov(ŷ_i, y_i) = trace(S) σ_ε², which motivates the more general definition

    df(ŷ) = Σ_{i=1}^N Cov(ŷ_i, y_i) / σ_ε²    (7.33)

(Exercises 7.4 and 7.5). Section 5.4.1 on page 153 gives some more intuition for the definition df = trace(S) in the context of smoothing splines.

For models like neural networks, in which we minimize an error function R(w) with weight decay penalty (regularization) α Σ_m w_m², the effective number of parameters has the form

    df(α) = Σ_{m=1}^M θ_m / (θ_m + α),    (7.34)

where the θ_m are the eigenvalues of the Hessian matrix ∂²R(w)/∂w ∂wᵀ. Expression (7.34) follows from (7.33) if we make a quadratic approximation to the error function at the solution (Bishop, 1995).

7.7 The Bayesian Approach and BIC

The Bayesian information criterion (BIC), like AIC, is applicable in settings where the fitting is carried out by maximization of a log-likelihood. The generic form of BIC is

    BIC = −2 · loglik + (log N) · d.    (7.35)

The BIC statistic (times 1/2) is also known as the Schwarz criterion (Schwarz, 1978).

Under the Gaussian model, assuming the variance σ_ε² is known, −2 · loglik equals (up to a constant) Σ_i (y_i − f̂(x_i))² / σ_ε², which is N · err̄ / σ_ε² for squared error loss. Hence we can write

    BIC = (N / σ_ε²) · [ err̄ + (log N) · (d/N) · σ_ε² ].    (7.36)

Therefore BIC is proportional to AIC (C_p), with the factor 2 replaced by log N. Assuming N > e² ≈ 7.4, BIC tends to penalize complex models more heavily, giving preference to simpler models in selection. As with AIC, σ_ε² is typically estimated by the mean squared error of a low-bias model. For classification problems, use of the multinomial log-likelihood leads to a similar relationship with the AIC, using cross-entropy as the error measure.
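For ridge regression the smoother matrix is S_λ = X(XᵀX + λI)⁻¹Xᵀ, and a standard identity expresses its trace through the singular values d_j of X: trace(S_λ) = Σ_j d_j²/(d_j² + λ). The sketch below, with an arbitrary simulated design (an assumption for illustration), checks definition (7.32) against this identity:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 40, 8
X = rng.normal(size=(N, p))

def ridge_df(X, lam):
    # effective degrees of freedom df(lam) = trace(S_lam), eq. (7.32)
    S = X @ np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T
    return np.trace(S)

d = np.linalg.svd(X, compute_uv=False)     # singular values of X
for lam in (0.0, 1.0, 10.0):
    via_svd = np.sum(d**2 / (d**2 + lam))
    print(f"lambda={lam:5.1f}  trace(S)={ridge_df(X, lam):.4f}  svd formula={via_svd:.4f}")
```

At λ = 0 the fit is ordinary least squares and the effective degrees of freedom equal p; as λ grows, df(λ) shrinks smoothly below p, exactly the behavior the text describes for regularized fits.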

Note however that the misclassification error measure does not arise in the BIC context, since it does not correspond to the log-likelihood of the data under any probability model.

Despite its similarity with AIC, BIC is motivated in quite a different way. It arises in the Bayesian approach to model selection, which we now describe. Suppose we have a set of candidate models M_m, m = 1, ..., M, and corresponding model parameters θ_m, and we wish to choose a best model from among them. Assuming we have a prior distribution Pr(θ_m | M_m) for the parameters of each model M_m, the posterior probability of a given model is

    Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m)    (7.37)
               ∝ Pr(M_m) · ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m,

where Z represents the training data {x_i, y_i}_1^N. To compare two models M_m and M_l, we form the posterior odds

    Pr(M_m | Z) / Pr(M_l | Z) = [Pr(M_m) / Pr(M_l)] · [Pr(Z | M_m) / Pr(Z | M_l)].    (7.38)

If the odds are greater than one we choose model m, otherwise we choose model l. The rightmost quantity

    BF(Z) = Pr(Z | M_m) / Pr(Z | M_l)    (7.39)

is called the Bayes factor, the contribution of the data toward the posterior odds. Typically we assume that the prior over models is uniform, so that Pr(M_m) is constant. We need some way of approximating Pr(Z | M_m). A so-called Laplace approximation to the integral, followed by some other simplifications (Ripley, 1996, page 64), gives

    log Pr(Z | M_m) = log Pr(Z | θ̂_m, M_m) − (d_m / 2) · log N + O(1).    (7.40)

Here θ̂_m is a maximum likelihood estimate and d_m is the number of free parameters in model M_m. If we define our loss function to be −2 · log Pr(Z | θ̂_m, M_m), this is equivalent to the BIC criterion of equation (7.35).

Therefore, choosing the model with minimum BIC is equivalent to choosing the model with largest (approximate) posterior probability. But this framework gives us more. If we compute the BIC criterion for a set of M

models, giving BIC_m, m = 1, 2, ..., M, then we can estimate the posterior probability of each model M_m as

    e^{−½·BIC_m} / Σ_{l=1}^M e^{−½·BIC_l}.    (7.41)

Thus we can estimate not only the best model, but also assess the relative merits of the models considered.

For model selection purposes, there is no clear choice between AIC and BIC. BIC is asymptotically consistent as a selection criterion. What this means is that given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N → ∞. This is not the case for AIC, which tends to choose models which are too complex as N → ∞. On the other hand, for finite samples, BIC often chooses models that are too simple, because of its heavy penalty on complexity.

7.8 Minimum Description Length

The minimum description length (MDL) approach gives a selection criterion formally identical to the BIC approach, but is motivated from an optimal coding viewpoint. We first review the theory of coding for data compression, and then apply it to model selection.

We think of our datum z as a message that we want to encode and send to someone else (the "receiver"). We think of our model as a way of encoding the datum, and will choose the most parsimonious model, that is the shortest code, for the transmission.

Suppose first that the possible messages we might want to transmit are z_1, z_2, ..., z_m. Our code uses a finite alphabet of length A: for example, we might use a binary code {0, 1} of length A = 2. Here is an example with four possible messages and a binary coding:

    Message:  z_1    z_2    z_3    z_4
    Code:     0      10     110    111    (7.42)

This code is known as an instantaneous prefix code: no code is the prefix of any other, and the receiver (who knows all of the possible codes) knows exactly when the message has been completely sent. We restrict our discussion to such instantaneous prefix codes.

One could use the coding in (7.42) or we could permute the codes, for example use codes 110, 10, 111, 0 for z_1, z_2, z_3, z_4. How do we decide which to use? It depends on how often we will be sending each of the messages.
If, for example, we will be sending z_1 most often, it makes sense to use the shortest code 0 for z_1. Using this kind of strategy (shorter codes for more frequent messages) the average message length will be shorter.
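The effect is easy to quantify. Below, the message probabilities are the ones the book uses in its four-message example, and the two prefix-code assignments are the original one and a permutation of it:

```python
import math

# expected code length E(length) = sum_i Pr(z_i) * len(code_i)
probs = [1/2, 1/4, 1/8, 1/8]              # Pr(z_1), ..., Pr(z_4)
code_a = ["0", "10", "110", "111"]        # short codes for frequent messages
code_b = ["110", "10", "111", "0"]        # same code lengths, permuted

def expected_length(codes, probs):
    return sum(p * len(c) for p, c in zip(probs, codes))

entropy = -sum(p * math.log2(p) for p in probs)
print(expected_length(code_a, probs))   # 1.75 bits
print(expected_length(code_b, probs))   # 2.5 bits
print(entropy)                          # 1.75 bits
```

The favorable assignment attains 1.75 bits per message, which is also the entropy of the distribution, so no prefix code can do better; the permuted assignment wastes 0.75 bits per message on average.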

In general, if messages are sent with probabilities Pr(z_i), i = 1, 2, ..., 4, a famous theorem due to Shannon says we should use code lengths l_i = −log₂ Pr(z_i), and the average message length satisfies

    E(length) ≥ −Σ Pr(z_i) log₂(Pr(z_i)).    (7.43)

The right-hand side above is also called the entropy of the distribution Pr(z_i). The inequality is an equality when the probabilities satisfy p_i = A^{−l_i}. In our example, if Pr(z_i) = 1/2, 1/4, 1/8, 1/8, respectively, then the coding shown in (7.42) is optimal and achieves the entropy lower bound.

In general the lower bound cannot be achieved, but procedures like the Huffman coding scheme can get close to the bound. Note that with an infinite set of messages, the entropy is replaced by −∫ Pr(z) log₂ Pr(z) dz.

From this result we glean the following: to transmit a random variable z having probability density function Pr(z), we require about −log₂ Pr(z) bits of information. We henceforth change notation from log₂ Pr(z) to log Pr(z) = log_e Pr(z); this is for convenience, and just introduces an unimportant multiplicative constant.

Now we apply this result to the problem of model selection. We have a model M with parameters θ, and data Z = (X, y) consisting of both inputs and outputs. Let the (conditional) probability of the outputs under the model be Pr(y | θ, M, X), assume the receiver knows all of the inputs, and we wish to transmit the outputs. Then the message length required to transmit the outputs is

    length = −log Pr(y | θ, M, X) − log Pr(θ | M),    (7.44)

the log-probability of the target values given the inputs. The second term is the average code length for transmitting the model parameters θ, while the first term is the average code length for transmitting the discrepancy between the model and actual target values.

For example, suppose we have a single target y with y ∼ N(θ, σ²), parameter θ ∼ N(0, 1), and no input (for simplicity). Then the message length is

    length = constant + log σ + (y − θ)²/(2σ²) + θ²/2.    (7.45)

Note that the smaller σ is, the shorter on average is the message length, since y is more concentrated around θ.
The MDL principle says that we should choose the model that minimizes (7.44). We recognize (7.44) as the (negative) log-posterior distribution, and hence minimizing description length is equivalent to maximizing posterior probability. Hence the BIC criterion, derived as an approximation to log-posterior probability, can also be viewed as a device for (approximate) model choice by minimum description length.

[Figure: the curve sin(50x) on x ∈ [0, 1].]

FIGURE 7.5. The solid curve is the function sin(50x) for x ∈ [0, 1]. The green (solid) and blue (hollow) points illustrate how the associated indicator function I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency α.

Note that we have ignored the precision with which a random variable z is coded. With a finite code length we cannot code a continuous variable exactly. However, if we code z within a tolerance δz, the message length needed is the log of the probability in the interval [z, z + δz], which is well approximated by δz · Pr(z) if δz is small. Since log δz·Pr(z) = log δz + log Pr(z), this means we can just ignore the constant log δz and use log Pr(z) as our measure of message length, as we did above.

The preceding view of MDL for model selection says that we should choose the model with highest posterior probability. However, many Bayesians would instead do inference by sampling from the posterior distribution.

7.9 Vapnik–Chervonenkis Dimension

A difficulty in using estimates of in-sample error is the need to specify the number of parameters (or the complexity) d used in the fit. Although the effective number of parameters introduced in Section 7.6 is useful for some nonlinear models, it is not fully general. The Vapnik–Chervonenkis (VC) theory provides such a general measure of complexity, and gives associated bounds on the optimism. Here we give a brief review of this theory.

Suppose we have a class of functions {f(x, α)} indexed by a parameter vector α, with x ∈ ℝᵖ. Assume for now that f is an indicator function, that is, takes the values 0 or 1. If α = (α₀, α₁) and f is the linear indicator function I(α₀ + α₁ᵀx > 0), then it seems reasonable to say that the complexity of the class f is the number of parameters p + 1. But what about f(x, α) = I(sin(α·x) > 0), where α is any real number and x ∈ ℝ? The function sin(50·x) is shown in Figure 7.5.
This is a very wiggly function that gets even rougher as the frequency α increases, but it has only one parameter: despite this, it doesn't seem reasonable to conclude that it has less complexity than the linear indicator function I(α₀ + α₁x) in p = 1 dimension.
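The notion of "shattering" that makes this precise (defined formally below, and pictured in Figure 7.6) can be checked by brute force for lines in the plane. The sketch below uses a perceptron as a separability test; treating non-convergence within a fixed iteration budget as "not separable" is a heuristic assumption, but it is reliable for these tiny point sets:

```python
import itertools
import numpy as np

def linearly_separable(pts, labels, iters=5000):
    # perceptron with a bias term; convergence implies linear separability.
    # (failure to converge within `iters` passes is taken as "not separable")
    X = np.hstack([pts, np.ones((len(pts), 1))])
    y = np.where(np.array(labels) == 1, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout

shatter3 = all(linearly_separable(three, lab)
               for lab in itertools.product([0, 1], repeat=3))
shatter4 = all(linearly_separable(four, lab)
               for lab in itertools.product([0, 1], repeat=4))
print(shatter3, shatter4)   # True False
```

All 8 labelings of three points in general position are separable by a line, but the XOR labeling of four points is not, which is exactly the "VC dimension 3" statement for lines in the plane made below.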

FIGURE 7.6. The first three panels show that the class of lines in the plane can shatter three points. The last panel shows that this class cannot shatter four points, as no line will put the hollow points on one side and the solid points on the other. Hence the VC dimension of the class of straight lines in the plane is three. Note that a class of nonlinear curves could shatter four points, and hence has VC dimension greater than three.

The Vapnik–Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be.

The VC dimension of the class {f(x, α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x, α)}.

A set of points is said to be shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them.

Figure 7.6 shows that the VC dimension of linear indicator functions in the plane is 3 but not 4, since no four points can be shattered by a set of lines. In general, a linear indicator function in p dimensions has VC dimension p + 1, which is also the number of free parameters. On the other hand, it can be shown that the family sin(αx) has infinite VC dimension, as Figure 7.5 suggests. By appropriate choice of α, any set of points can be shattered by this class (Exercise 7.8).

So far we have discussed the VC dimension only of indicator functions, but this can be extended to real-valued functions. The VC dimension of a class of real-valued functions {g(x, α)} is defined to be the VC dimension of the indicator class {I(g(x, α) − β > 0)}, where β takes values over the range of g.

One can use the VC dimension in constructing an estimate of (extra-sample) prediction error; different types of results are available. Using the concept of VC dimension, one can prove results about the optimism of the training error when using a class of functions. An example of such a result is the following.
If we fit N training points using a class of functions {f(x, α)} having VC dimension h, then with probability at least 1 − η over training

sets:

    Err_T ≤ err̄ + (ǫ/2) · (1 + √(1 + 4·err̄/ǫ))    (binary classification)
    Err_T ≤ err̄ / (1 − c√ǫ)₊    (regression)    (7.46)

    where ǫ = a₁ · [h(log(a₂N/h) + 1) − log(η/4)] / N,

and 0 < a₁ ≤ 4, 0 < a₂ ≤ 2. These bounds hold simultaneously for all members f(x, α), and are taken from Cherkassky and Mulier (2007, pages 116–118). They recommend the value c = 1. For regression they suggest a₁ = a₂ = 1, and for classification they make no recommendation, with a₁ = 4 and a₂ = 2 corresponding to worst-case scenarios. They also give an alternative practical bound for regression

    Err_T ≤ err̄ / (1 − √(ρ − ρ log ρ + (log N)/(2N)))₊    (7.47)

with ρ = h/N, which is free of tuning constants. The bounds suggest that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction 2d/N given in (7.24). However, the results in (7.46) are stronger: rather than giving the expected optimism for each fixed function f(x, α), they give probabilistic upper bounds for all functions f(x, α), and hence allow for searching over the class.

Vapnik's structural risk minimization (SRM) approach fits a nested sequence of models of increasing VC dimensions h₁ < h₂ < ···, and then chooses the model with the smallest value of the upper bound.

We note that upper bounds like the ones in (7.46) are often very loose, but that doesn't rule them out as good criteria for model selection, where the relative (not absolute) size of the test error is important. The main drawback of this approach is the difficulty in calculating the VC dimension of a class of functions. Often only a crude upper bound for VC dimension is obtainable, and this may not be adequate. An example in which the structural risk minimization program can be successfully carried out is the support vector classifier, discussed in Section 12.2.

7.9.1 Example (Continued)

Figure 7.7 shows the results when AIC, BIC and SRM are used to select the model size for the examples of Figure 7.3. For the examples labeled KNN, the model index α refers to neighborhood size, while for those labeled REG, α refers to subset size.
Using each selection method (e.g., AIC) we estimated the best model α̂ and found its true prediction error Err_T(α̂) on a test set. For the same training set we computed the prediction error of the best

[Figure: three rows of boxplots, labeled AIC, BIC and SRM, each showing "% Increase Over Best" for the scenarios reg/KNN, reg/linear, class/KNN and class/linear.]

FIGURE 7.7. Boxplots show the distribution of the relative error 100 · [Err_T(α̂) − min_α Err_T(α)] / [max_α Err_T(α) − min_α Err_T(α)] over the four scenarios of Figure 7.3. This is the error in using the chosen model relative to the best model. There are 100 training sets each of size 80 represented in each boxplot, with the errors computed on test sets of size 10,000.

and worst possible model choices: min_α Err_T(α) and max_α Err_T(α). The boxplots show the distribution of the quantity

    100 · [Err_T(α̂) − min_α Err_T(α)] / [max_α Err_T(α) − min_α Err_T(α)],

which represents the error in using the chosen model relative to the best model. For linear regression the model complexity was measured by the number of features; as mentioned in Section 7.5, this underestimates the df, since it does not charge for the search for the best model of that size. This was also used for the VC dimension of the linear classifier. For k-nearest neighbors, we used the quantity N/k. Under an additive-error regression model, this can be justified as the exact effective degrees of freedom (Exercise 7.6); we do not know if it corresponds to the VC dimension. We used a₁ = a₂ = 1 for the constants in (7.46); the results for SRM changed with different constants, and this choice gave the most favorable results. We repeated the SRM selection using the alternative practical bound (7.47), and got almost identical results. For misclassification error we used σ̂_ε² = [N/(N − d)] · err̄(α) for the least restrictive model (k = 5 for KNN, since k = 1 results in zero training error).

The AIC criterion seems to work well in all four scenarios, despite the lack of theoretical support with 0–1 loss. BIC does nearly as well, while the performance of SRM is mixed.

7.10 Cross-Validation

Probably the simplest and most widely used method for estimating prediction error is cross-validation. This method directly estimates the expected extra-sample error Err = E[L(Y, f̂(X))], the average generalization error when the method f̂(X) is applied to an independent test sample from the joint distribution of X and Y. As mentioned earlier, we might hope that cross-validation estimates the conditional error, with the training set T held fixed. But as we will see in Section 7.12, cross-validation typically estimates well only the expected prediction error.

7.10.1 K-Fold Cross-Validation

Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model.
Since data are often scarce, this is usually not possible. To finesse the problem, K-fold cross-validation uses part of the available data to fit the model, and a different part to test it. We split the data into K roughly equal-sized parts; for example, when K = 5, the scenario looks like this:

    1       2       3           4       5
    Train   Train   Validation  Train   Train

For the kth part (third above), we fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the kth part of the data. We do this for k = 1, 2, ..., K and combine the K estimates of prediction error.

Here are more details. Let κ: {1, ..., N} ↦ {1, ..., K} be an indexing function that indicates the partition to which observation i is allocated by the randomization. Denote by f̂^{−k}(x) the fitted function, computed with the kth part of the data removed. Then the cross-validation estimate of prediction error is

    CV(f̂) = (1/N) Σ_{i=1}^N L(y_i, f̂^{−κ(i)}(x_i)).    (7.48)

Typical choices of K are 5 or 10 (see below). The case K = N is known as leave-one-out cross-validation. In this case κ(i) = i, and for the ith observation the fit is computed using all the data except the ith.

Given a set of models f(x, α) indexed by a tuning parameter α, denote by f̂^{−k}(x, α) the αth model fit with the kth part of the data removed. Then for this set of models we define

    CV(f̂, α) = (1/N) Σ_{i=1}^N L(y_i, f̂^{−κ(i)}(x_i, α)).    (7.49)

The function CV(f̂, α) provides an estimate of the test error curve, and we find the tuning parameter α̂ that minimizes it. Our final chosen model is f(x, α̂), which we then fit to all the data.

It is interesting to wonder about what quantity K-fold cross-validation estimates. With K = 5 or 10, we might guess that it estimates the expected error Err, since the training sets in each fold are quite different from the original training set. On the other hand, if K = N we might guess that cross-validation estimates the conditional error Err_T. It turns out that cross-validation only estimates effectively the average error Err, as discussed in Section 7.12.

What value should we choose for K? With K = N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N "training sets" are so similar to one another. The computational burden is also considerable, requiring N applications of the learning method.
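A from-scratch sketch of (7.49): the data-generating process and the grid of candidate k values below are illustrative assumptions; the model class is k-nearest-neighbor regression, as in the chapter's examples.

```python
import numpy as np

rng = np.random.default_rng(4)

N, K = 100, 5
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
kappa = rng.permutation(N) % K            # indexing function kappa(i)

def knn_predict(x_tr, y_tr, x_te, k):
    # k-nearest-neighbor regression in one dimension
    return np.array([y_tr[np.argsort(np.abs(x_tr - x0))[:k]].mean()
                     for x0 in x_te])

def cv_error(k):
    # equation (7.49) with squared error loss
    sq = np.empty(N)
    for fold in range(K):
        held = kappa == fold
        sq[held] = (y[held] - knn_predict(x[~held], y[~held], x[held], k)) ** 2
    return sq.mean()

cv = {k: cv_error(k) for k in (1, 3, 5, 10, 25, 50)}
k_hat = min(cv, key=cv.get)
print("CV-chosen k:", k_hat)
```

CV(f̂, α) traces out an estimated test-error curve over the candidate k values; the chosen k_hat avoids both extremes (k = 1 is too variable, k = 50 too biased), and the final model would be refit to all the data with k = k_hat.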
In certain special problems, this computation can be done quickly; see Exercises 7.3 and 5.13.

[Figure: misclassification error versus subset size p.]

FIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3, using a linear model with best subsets regression of subset size p. Standard error bars are shown, which are the standard errors of the individual misclassification error rates for each of the ten parts.

Both curves have minima at p = 10, although the CV curve is rather flat beyond 10. Often a "one-standard error" rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model. Here it looks like a model with about p = 9 predictors would be chosen, while the true model uses p = 10.

Generalized cross-validation provides a convenient approximation to leave-one-out cross-validation, for linear fitting under squared-error loss. As defined in Section 7.6, a linear fitting method is one for which we can write

    ŷ = S y.    (7.50)

Now for many linear fitting methods,

    (1/N) Σ_{i=1}^N [y_i − f̂^{−i}(x_i)]² = (1/N) Σ_{i=1}^N [ (y_i − f̂(x_i)) / (1 − S_ii) ]²,    (7.51)

where S_ii is the ith diagonal element of S (see Exercise 7.3). The GCV approximation is

    GCV(f̂) = (1/N) Σ_{i=1}^N [ (y_i − f̂(x_i)) / (1 − trace(S)/N) ]².    (7.52)
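Identity (7.51) holds exactly for ordinary least squares, and (7.52) replaces the individual leverages S_ii by their average. Both are easy to verify numerically on an arbitrary simulated design (the sizes below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 30, 4
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(0, 1.0, N)

S = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix of least squares
yhat = S @ y

# brute-force leave-one-out: refit without observation i, predict x_i
loo = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo[i] = X[i] @ beta

loocv = np.mean((y - loo) ** 2)
shortcut = np.mean(((y - yhat) / (1 - np.diag(S))) ** 2)      # eq. (7.51)
gcv = np.mean(((y - yhat) / (1 - np.trace(S) / N)) ** 2)      # eq. (7.52)
print(loocv, shortcut, gcv)
```

The shortcut matches the N refits to machine precision, while GCV agrees approximately; the gap between them grows when the leverages S_ii are very unequal.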

[Figure: two histograms, "Wrong way" and "Right way," of correlations of selected predictors with the outcome.]

FIGURE 7.10. Cross-validation the wrong and right way: histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation.

(a) Find a subset of "good" predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold k.

(b) Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold k.

(c) Use the classifier to predict the class labels for the samples in fold k.

The error estimates from step (c) are then accumulated over all K folds, to produce the cross-validation estimate of prediction error. The lower panel of Figure 7.10 shows the correlations of class labels with the 100 predictors chosen in step (a) of the correct procedure, over the samples in a typical fold k. We see that they average about zero, as they should.

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be "left out" before any selection or filtering steps are applied. There is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors

FIGURE 7.11. Simulation study to investigate the performance of cross-validation in a high-dimensional problem where the predictors are independent of the class labels. The top-left panel shows the number of errors made by individual stump classifiers on the full training set (20 observations). The top right panel shows the errors made by individual stumps trained on a random split of the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations). The best performers are depicted by colored dots in each panel. The bottom left panel shows the effect of re-estimating the split point in each fold: the colored points correspond to the four samples in the 1/5th validation set. The split point derived from the full dataset classifies all four samples correctly, but when the split point is re-estimated on the 4/5ths data (as it should be), it commits two errors on the four validation samples. In the bottom right we see the overall result of five-fold cross-validation applied to 50 simulated datasets. The average error rate is about 50%, as it should be.

of the process. In the present example, this means that the best predictor and corresponding split point are found from 4/5ths of the data. The effect of predictor choice is seen in the top right panel. Since the class labels are independent of the predictors, the performance of a stump on the 4/5ths training data contains no information about its performance in the remaining 1/5th.

The effect of the choice of split point is shown in the bottom left panel. Here we see the data for predictor 436, corresponding to the blue dot in the top left plot. The colored points indicate the 1/5th data, while the remaining points belong to the 4/5ths. The optimal split points for this predictor based on both the full training set and 4/5ths data are indicated. The split based on the full data makes no errors on the 1/5ths data. But cross-validation must base its split on the 4/5ths data, and this incurs two errors out of four samples.

The results of applying five-fold cross-validation to each of 50 simulated datasets is shown in the bottom right panel. As we would hope, the average cross-validation error is around 50%, which is the true expected prediction error for this classifier. Hence cross-validation has behaved as it should. On the other hand, there is considerable variability in the error, underscoring the importance of reporting the estimated standard error of the CV estimate. See Exercise 7.10 for another variation of this problem.

7.11 Bootstrap Methods

The bootstrap is a general tool for assessing statistical accuracy. First we describe the bootstrap in general, and then show how it can be used to estimate extra-sample prediction error. As with cross-validation, the bootstrap seeks to estimate the conditional error Err_T, but typically estimates well only the expected prediction error Err.

Suppose we have a model fit to a set of training data. We denote the training set by Z = (z_1, z_2, ..., z_N) where z_i = (x_i, y_i). The basic idea is to randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
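The basic resampling idea just described can be sketched in a few lines. This illustration is ours, with made-up data and the sample median standing in for the statistic S(Z); the last step computes the bootstrap variance estimate given below as (7.53).

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=10.0, scale=2.0, size=100)      # toy training sample
B = 1000

stats = np.empty(B)
for b in range(B):
    zb = rng.choice(z, size=z.size, replace=True)  # bootstrap dataset Z*b
    stats[b] = np.median(zb)                       # S(Z*b)

# Sample variance of S over the B replications, eq. (7.53)
var_boot = stats.var(ddof=1)
se_boot = np.sqrt(var_boot)
print(se_boot)  # bootstrap standard error of the median
```

The same B replications could be used to estimate any other aspect of the distribution of S(Z), for example percentile confidence limits.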
This is done B times (B = 100 say), producing B bootstrap datasets, as shown in Figure 7.12. Then we refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the B replications.

In the figure, S(Z) is any quantity computed from the data Z, for example, the prediction at some input point. From the bootstrap sampling we can estimate any aspect of the distribution of S(Z), for example, its variance,

\[ \widehat{\mathrm{Var}}[S(\mathbf{Z})] = \frac{1}{B-1}\sum_{b=1}^{B}\bigl(S(\mathbf{Z}^{*b}) - \bar{S}^{*}\bigr)^2, \qquad (7.53) \]

FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity S(Z) computed from our dataset. B training sets Z*b, b = 1,...,B, each of size N, are drawn with replacement from the original dataset. The quantity of interest S(Z) is computed from each bootstrap training set, and the values S(Z*1),...,S(Z*B) are used to assess the statistical accuracy of S(Z).

where \(\bar{S}^{*} = \sum_b S(\mathbf{Z}^{*b})/B\). Note that \(\widehat{\mathrm{Var}}[S(\mathbf{Z})]\) can be thought of as a Monte-Carlo estimate of the variance of S(Z) under sampling from the empirical distribution function \(\hat{F}\) for the data (z_1, z_2,...,z_N).

How can we apply the bootstrap to estimate prediction error? One approach would be to fit the model in question on a set of bootstrap samples, and then keep track of how well it predicts the original training set. If \(\hat{f}^{*b}(x_i)\) is the predicted value at \(x_i\), from the model fitted to the bth bootstrap dataset, our estimate is

\[ \widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\bigl(y_i, \hat{f}^{*b}(x_i)\bigr). \qquad (7.54) \]

However, it is easy to see that \(\widehat{\mathrm{Err}}_{\mathrm{boot}}\) does not provide a good estimate in general. The reason is that the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two samples have observations in common. This overlap can make overfit predictions look unrealistically good, and is the reason that cross-validation explicitly uses non-overlapping data for the training and test samples. Consider for example a 1-nearest neighbor classifier applied to a two-class classification problem with the same number of observations in

each class, in which the predictors and class labels are in fact independent. Then the true error rate is 0.5. But the contributions to the bootstrap estimate \(\widehat{\mathrm{Err}}_{\mathrm{boot}}\) will be zero unless the observation i does not appear in the bootstrap sample b. In this latter case it will have the correct expectation 0.5. Now

\[ \Pr\{\text{observation } i \in \text{bootstrap sample } b\} = 1 - \Bigl(1 - \frac{1}{N}\Bigr)^{N} \approx 1 - e^{-1} = 0.632. \qquad (7.55) \]

Hence the expectation of \(\widehat{\mathrm{Err}}_{\mathrm{boot}}\) is about 0.5 × 0.368 = 0.184, far below the correct error rate 0.5.

By mimicking cross-validation, a better bootstrap estimate can be obtained. For each observation, we only keep track of predictions from bootstrap samples not containing that observation. The leave-one-out bootstrap estimate of prediction error is defined by

\[ \widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\bigl(y_i, \hat{f}^{*b}(x_i)\bigr). \qquad (7.56) \]

Here \(C^{-i}\) is the set of indices of the bootstrap samples b that do not contain observation i, and \(|C^{-i}|\) is the number of such samples. In computing \(\widehat{\mathrm{Err}}^{(1)}\), we either have to choose B large enough to ensure that all of the \(|C^{-i}|\) are greater than zero, or we can just leave out the terms in (7.56) corresponding to \(|C^{-i}|\)'s that are zero.

The leave-one-out bootstrap solves the overfitting problem suffered by \(\widehat{\mathrm{Err}}_{\mathrm{boot}}\), but has the training-set-size bias mentioned in the discussion of cross-validation. The average number of distinct observations in each bootstrap sample is about 0.632·N, so its bias will roughly behave like that of twofold cross-validation. Thus if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the true error. The ".632 estimator" is designed to alleviate this bias. It is defined by

\[ \widehat{\mathrm{Err}}^{(.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}. \qquad (7.57) \]

The derivation of the .632 estimator is complex; intuitively it pulls the leave-one-out bootstrap estimate down toward the training error rate, and hence reduces its upward bias. The use of the constant .632 relates to (7.55).

The .632 estimator works well in "light fitting" situations, but can break down in overfit ones. Here is an example due to Breiman et al. (1984).
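The estimators defined so far can be compared numerically on exactly this kind of overfit example: a 1-nearest-neighbor classifier with class labels independent of the inputs. The sketch below is ours, not the book's; it computes (7.54), (7.56), (7.57), and also the ".632+" correction defined further on in (7.59)–(7.61). The calculations above suggest values of roughly 0.18, 0.5 and 0.32 for the first three, with .632+ recovering about 0.5. Clipping the overfitting rate to [0, 1] is a practical safeguard we add, not part of the definitions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, B = 60, 200
X = rng.normal(size=(N, 2))
y = rng.integers(0, 2, size=N)        # labels independent of X: true error 0.5

def one_nn(Xtr, ytr, Xte):
    # 1-nearest-neighbor predictions for the rows of Xte
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(1)]

loss = np.zeros((B, N))
inbag = np.zeros((B, N), dtype=bool)
for b in range(B):
    idx = rng.integers(0, N, size=N)  # bootstrap sample Z*b
    inbag[b, idx] = True
    loss[b] = one_nn(X[idx], y[idx], X) != y   # predict the ORIGINAL training set

err_boot = loss.mean()                         # (7.54): optimistic
# (7.56): for each i, average only over samples NOT containing observation i
err1 = np.mean([loss[~inbag[:, i], i].mean() for i in range(N)])
err_bar = np.mean(one_nn(X, y, X) != y)        # training error: zero for 1-NN
err_632 = 0.368 * err_bar + 0.632 * err1       # (7.57)

# .632+: no-information rate (7.59), relative overfitting rate (7.60), weight (7.61)
p1 = y.mean()
q1 = (one_nn(X, y, X) == 1).mean()
gamma = p1 * (1 - q1) + (1 - p1) * q1
R = np.clip((err1 - err_bar) / (gamma - err_bar), 0, 1)  # clipping is our safeguard
w = 0.632 / (1 - 0.368 * R)
err_632_plus = (1 - w) * err_bar + w * err1    # (7.61)

print(err_boot, err1, err_632, err_632_plus)
```

Because the training error of 1-nearest neighbors is zero, the plain .632 estimator lands well below the true error rate of 0.5, while the .632+ weight moves the estimate back up toward the leave-one-out bootstrap value.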
Suppose we have two equal-size classes, with the targets independent of the class labels, and we apply a one-nearest neighbor rule. Then \(\overline{\mathrm{err}} = 0\),

\(\widehat{\mathrm{Err}}^{(1)} = 0.5\) and so \(\widehat{\mathrm{Err}}^{(.632)} = 0.632 \times 0.5 = 0.316\). However, the true error rate is 0.5.

One can improve the .632 estimator by taking into account the amount of overfitting. First we define γ to be the no-information error rate: this is the error rate of our prediction rule if the inputs and class labels were independent. An estimate of γ is obtained by evaluating the prediction rule on all possible combinations of targets \(y_i\) and predictors \(x_{i'}\):

\[ \hat{\gamma} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} L\bigl(y_i, \hat{f}(x_{i'})\bigr). \qquad (7.58) \]

For example, consider the dichotomous classification problem: let \(\hat{p}_1\) be the observed proportion of responses \(y_i\) equaling 1, and let \(\hat{q}_1\) be the observed proportion of predictions \(\hat{f}(x_{i'})\) equaling 1. Then

\[ \hat{\gamma} = \hat{p}_1(1 - \hat{q}_1) + (1 - \hat{p}_1)\hat{q}_1. \qquad (7.59) \]

With a rule like 1-nearest neighbors for which \(\hat{q}_1 = \hat{p}_1\) the value of \(\hat{\gamma}\) is \(2\hat{p}_1(1-\hat{p}_1)\). The multi-category generalization of (7.59) is \(\hat{\gamma} = \sum_{\ell} \hat{p}_\ell(1 - \hat{q}_\ell)\). Using this, the relative overfitting rate is defined to be

\[ \hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}}, \qquad (7.60) \]

a quantity that ranges from 0 if there is no overfitting (\(\widehat{\mathrm{Err}}^{(1)} = \overline{\mathrm{err}}\)) to 1 if the overfitting equals the no-information value \(\hat{\gamma} - \overline{\mathrm{err}}\). Finally, we define the ".632+" estimator by

\[ \widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat{w})\,\overline{\mathrm{err}} + \hat{w}\,\widehat{\mathrm{Err}}^{(1)}, \quad \text{with } \hat{w} = \frac{0.632}{1 - 0.368\,\hat{R}}. \qquad (7.61) \]

The weight \(\hat{w}\) ranges from .632 if \(\hat{R} = 0\) to 1 if \(\hat{R} = 1\), so \(\widehat{\mathrm{Err}}^{(.632+)}\) ranges from \(\widehat{\mathrm{Err}}^{(.632)}\) to \(\widehat{\mathrm{Err}}^{(1)}\). Again, the derivation of (7.61) is complicated: roughly speaking, it produces a compromise between the leave-one-out bootstrap and the training error rate that depends on the amount of overfitting. For the 1-nearest-neighbor problem with class labels independent of the inputs, \(\hat{w} = \hat{R} = 1\), so \(\widehat{\mathrm{Err}}^{(.632+)} = \widehat{\mathrm{Err}}^{(1)}\), which has the correct expectation of 0.5. In other problems with less overfitting, \(\widehat{\mathrm{Err}}^{(.632+)}\) will lie somewhere between \(\overline{\mathrm{err}}\) and \(\widehat{\mathrm{Err}}^{(1)}\).

7.11.1 Example (Continued)

Figure 7.13 shows the results of tenfold cross-validation and the .632+ bootstrap estimate in the same four problems of Figure 7.7. As in that figure,

FIGURE 7.13. Boxplots show the distribution of the relative error \(100 \times [\mathrm{Err}_{\hat{\alpha}} - \min_\alpha \mathrm{Err}(\alpha)]/[\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)]\) over the four scenarios of Figure 7.3, for cross-validation (upper panel) and the bootstrap (lower panel). This is the error in using the chosen model relative to the best model. There are 100 training sets represented in each boxplot.

Figure 7.13 shows boxplots of

\[ 100 \times \frac{\mathrm{Err}_{\hat{\alpha}} - \min_\alpha \mathrm{Err}(\alpha)}{\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)}, \]

the error in using the chosen model relative to the best model. There are 100 different training sets represented in each boxplot. Both measures perform well overall, perhaps the same or slightly worse than the AIC in Figure 7.7.

Our conclusion is that for these particular problems and fitting methods, minimization of either AIC, cross-validation or bootstrap yields a model fairly close to the best available. Note that for the purpose of model selection, any of the measures could be biased and it wouldn't affect things, as long as the bias did not change the relative performance of the methods. For example, the addition of a constant to any of the measures would not change the resulting chosen model. However, for many adaptive, nonlinear techniques (like trees), estimation of the effective number of parameters is very difficult. This makes methods like AIC impractical and leaves us with cross-validation or bootstrap as the methods of choice.

A different question is: how well does each method estimate test error? On the average the AIC criterion overestimated prediction error of its cho-

sen model by 38%, 37%, 51%, and 30%, respectively, over the four scenarios, with BIC performing similarly. In contrast, cross-validation overestimated the error by 1%, 4%, 0%, and 4%, with the bootstrap doing about the same. Hence the extra work involved in computing a cross-validation or bootstrap measure is worthwhile, if an accurate estimate of test error is required. With other fitting methods like trees, cross-validation and bootstrap can underestimate the true error by 10%, because the search for best tree is strongly affected by the validation set. In these situations only a separate test set will provide an unbiased estimate of test error.

7.12 Conditional or Expected Test Error?

Figures 7.14 and 7.15 examine the question of whether cross-validation does a good job in estimating Err_T, the error conditional on a given training set T (expression (7.15) on page 228), as opposed to the expected test error.

For each of 100 training sets generated from the "reg/linear" setting in the top-right panel of Figure 7.3, Figure 7.14 shows the conditional error curves Err_T as a function of subset size (top left). The next two panels show 10-fold and N-fold cross-validation, the latter also known as leave-one-out (LOO). The thick red curve in each plot is the expected error Err, while the thick black curves are the expected cross-validation curves. The lower right panel shows how well cross-validation approximates the conditional and expected error.

One might have expected N-fold CV to approximate Err_T well, since it almost uses the full training sample to fit a new test point. 10-fold CV, on the other hand, might be expected to estimate Err well, since it averages over somewhat different training sets. From the figure it appears 10-fold does a better job than N-fold in estimating Err_T, and estimates Err even better. Indeed, the similarity of the two black curves with the red curve suggests both CV curves are approximately unbiased for Err, with 10-fold having less variance. Similar trends were reported by Efron (1983).
Figure 7.15 shows scatterplots of both 10-fold and N-fold cross-validation error estimates versus the true conditional error for the 100 simulations. Although the scatterplots do not indicate much correlation, the lower right panel shows that for the most part the correlations are negative, a curious phenomenon that has been observed before. This negative correlation explains why neither form of CV estimates Err_T well. The broken lines in each plot are drawn at Err(p), the expected error for the best subset of size p. We see again that both forms of CV are approximately unbiased for expected error, but the variation in test error for different training sets is quite substantial.

Among the four experimental conditions in 7.3, this "reg/linear" scenario showed the highest correlation between actual and predicted test error. This

FIGURE 7.14. Conditional prediction-error Err_T, 10-fold cross-validation, and leave-one-out cross-validation curves for 100 simulations from the top-right panel in Figure 7.3. The thick red curve is the expected prediction error Err, while the thick black curves are the expected CV curves \(E_T \mathrm{CV}_{10}\) and \(E_T \mathrm{CV}_N\). The lower-right panel shows the mean absolute deviation of the CV curves from the conditional error, \(E_T|\mathrm{CV}_K - \mathrm{Err}_T|\) for K = 10 (blue) and K = N (green), as well as from the expected error, \(E_T|\mathrm{CV}_{10} - \mathrm{Err}|\) (orange).

FIGURE 7.15. Plots of the CV estimates of error versus the true conditional error for each of the 100 training sets, for the simulation setup in the top right panel of Figure 7.3. Both 10-fold and leave-one-out CV are depicted in different colors. The first three panels correspond to different subset sizes p, and vertical and horizontal lines are drawn at Err(p). Although there appears to be little correlation in these plots, we see in the lower right panel that for the most part the correlation is negative.

phenomenon also occurs for bootstrap estimates of error, and we would guess, for any other estimate of conditional prediction error.

We conclude that estimation of test error for a particular training set is not easy in general, given just the data from that same training set. Instead, cross-validation and related methods may provide reasonable estimates of the expected error Err.

Bibliographic Notes

Key references for cross-validation are Stone (1974), Stone (1977) and Allen (1974). The AIC was proposed by Akaike (1973), while the BIC was introduced by Schwarz (1978). Madigan and Raftery (1994) give an overview of Bayesian model selection. The MDL criterion is due to Rissanen (1983). Cover and Thomas (1991) contains a good description of coding theory and complexity. VC dimension is described in Vapnik (1996). Stone (1977) showed that the AIC and leave-one-out cross-validation are asymptotically equivalent. Generalized cross-validation is described by Golub et al. (1979) and Wahba (1980); a further discussion of the topic may be found in the monograph by Wahba (1990). See also Hastie and Tibshirani (1990), Chapter 3. The bootstrap is due to Efron (1979); see Efron and Tibshirani (1993) for an overview. Efron (1983) proposes a number of bootstrap estimates of prediction error, including the optimism and .632 estimates. Efron (1986) compares CV, GCV and bootstrap estimates of error rates. The use of cross-validation and the bootstrap for model selection is studied by Breiman and Spector (1992), Breiman (1992), Shao (1996), Zhang (1993) and Kohavi (1995). The .632+ estimator was proposed by Efron and Tibshirani (1997). Cherkassky and Ma (2003) published a study on the performance of SRM for model selection in regression, in response to our study of Section 7.9.1. They complained that we had been unfair to SRM because we had not applied it properly. Our response can be found in the same issue of the journal (Hastie et al. (2003)).

Exercises

Ex. 7.1 Derive the estimate of in-sample error (7.24).

Ex. 7.2
For 0–1 loss with \(Y \in \{0, 1\}\) and \(\Pr(Y = 1 \mid x_0) = f(x_0)\), show that

\[ \mathrm{Err}(x_0) = \Pr\bigl(Y \ne \hat{G}(x_0) \mid X = x_0\bigr) = \mathrm{Err_B}(x_0) + |2f(x_0) - 1| \Pr\bigl(\hat{G}(x_0) \ne G(x_0) \mid X = x_0\bigr), \qquad (7.62) \]

where \(\hat{G}(x) = I(\hat{f}(x) > \tfrac12)\), \(G(x) = I(f(x) > \tfrac12)\) is the Bayes classifier, and \(\mathrm{Err_B}(x_0) = \Pr(Y \ne G(x_0) \mid X = x_0)\), the irreducible Bayes error at \(x_0\).

Using the approximation \(\hat{f}(x_0) \sim N(E\hat{f}(x_0), \mathrm{Var}(\hat{f}(x_0)))\), show that

\[ \Pr\bigl(\hat{G}(x_0) \ne G(x_0) \mid X = x_0\bigr) \approx \Phi\!\left( \frac{\operatorname{sign}\bigl(\tfrac12 - f(x_0)\bigr)\bigl(E\hat{f}(x_0) - \tfrac12\bigr)}{\sqrt{\mathrm{Var}(\hat{f}(x_0))}} \right). \qquad (7.63) \]

In the above, \(\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp(-s^2/2)\,ds\), the cumulative Gaussian distribution function. This is an increasing function, with value 0 at \(t = -\infty\) and value 1 at \(t = +\infty\).

We can think of \(\operatorname{sign}(\tfrac12 - f(x_0))(E\hat{f}(x_0) - \tfrac12)\) as a kind of boundary-bias term, as it depends on the true \(f(x_0)\) only through which side of the boundary (\(\tfrac12\)) that it lies. Notice also that the bias and variance combine in a multiplicative rather than additive fashion. If \(E\hat{f}(x_0)\) is on the same side of \(\tfrac12\) as \(f(x_0)\), then the bias is negative, and decreasing the variance will decrease the misclassification error. On the other hand, if \(E\hat{f}(x_0)\) is on the opposite side of \(\tfrac12\) to \(f(x_0)\), then the bias is positive and it pays to increase the variance! Such an increase will improve the chance that \(\hat{f}(x_0)\) falls on the correct side of \(\tfrac12\) (Friedman, 1997).

Ex. 7.3 Let \(\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}\) be a linear smoothing of \(\mathbf{y}\).

(a) If \(S_{ii}\) is the ith diagonal element of \(\mathbf{S}\), show that for \(\mathbf{S}\) arising from least squares projections and cubic smoothing splines, the cross-validated residual can be written as

\[ y_i - \hat{f}^{-i}(x_i) = \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}}. \qquad (7.64) \]

(b) Use this result to show that \(|y_i - \hat{f}^{-i}(x_i)| \ge |y_i - \hat{f}(x_i)|\).

(c) Find general conditions on any smoother \(\mathbf{S}\) to make result (7.64) hold.

Ex. 7.4 Consider the in-sample prediction error (7.18) and the training error \(\overline{\mathrm{err}}\) in the case of squared-error loss:

\[ \mathrm{Err_{in}} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^0}\bigl(Y_i^0 - \hat{f}(x_i)\bigr)^2, \qquad \overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - \hat{f}(x_i)\bigr)^2. \]

Add and subtract \(f(x_i)\) and \(E\hat{f}(x_i)\) in each expression and expand. Hence establish that the average optimism in the training error is

\[ \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i), \]

as given in (7.21).

Ex. 7.5 For a linear smoother \(\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}\), show that

\[ \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = \mathrm{trace}(\mathbf{S})\,\sigma_\varepsilon^2, \qquad (7.65) \]

which justifies its use as the effective number of parameters.

Ex. 7.6 Show that for an additive-error model, the effective degrees-of-freedom for the k-nearest-neighbors regression fit is N/k.

Ex. 7.7 Use the approximation \(1/(1-x)^2 \approx 1 + 2x\) to expose the relationship between \(C_p\)/AIC (7.26) and GCV (7.52), the main difference being the model used to estimate the noise variance \(\sigma_\varepsilon^2\).

Ex. 7.8 Show that the set of functions \(\{I(\sin(\alpha x) > 0)\}\) can shatter the following points on the line:

\[ z^1 = 10^{-1}, \ldots, z^{\ell} = 10^{-\ell}, \qquad (7.66) \]

for any \(\ell\). Hence the VC dimension of the class \(\{I(\sin(\alpha x) > 0)\}\) is infinite.

Ex. 7.9 For the prostate data of Chapter 3, carry out a best-subset linear regression analysis, as in Table 3.3 (third column from left). Compute the AIC, BIC, five- and tenfold cross-validation, and bootstrap .632 estimates of prediction error. Discuss the results.

Ex. 7.10 Referring to the example in Section 7.10.3, suppose instead that all of the p predictors are binary, and hence there is no need to estimate split points. The predictors are independent of the class labels as before. Then if p is very large, we can probably find a predictor that splits the entire training data perfectly, and hence would split the validation data (one-fifth of data) perfectly as well. This predictor would therefore have zero cross-validation error. Does this mean that cross-validation does not provide a good estimate of test error in this situation? [This question was suggested by Li Ma.]


8 Model Inference and Averaging

8.1 Introduction

For most of this book, the fitting (learning) of models has been achieved by minimizing a sum of squares for regression, or by minimizing cross-entropy for classification. In fact, both of these minimizations are instances of the maximum likelihood approach to fitting.

In this chapter we provide a general exposition of the maximum likelihood approach, as well as the Bayesian method for inference. The bootstrap, introduced in Chapter 7, is discussed in this context, and its relation to maximum likelihood and Bayes is described. Finally, we present some related techniques for model averaging and improvement, including committee methods, bagging, stacking and bumping.

8.2 The Bootstrap and Maximum Likelihood Methods

8.2.1 A Smoothing Example

The bootstrap method provides a direct computational way of assessing uncertainty, by sampling from the training data. Here we illustrate the bootstrap in a simple one-dimensional smoothing problem, and show its connection to maximum likelihood.

FIGURE 8.1. (Left panel): Data for smoothing example. (Right panel:) Set of seven B-spline basis functions. The broken vertical lines indicate the placement of the three knots.

Denote the training data by \(\mathbf{Z} = \{z_1, z_2, \ldots, z_N\}\), with \(z_i = (x_i, y_i)\), \(i = 1, 2, \ldots, N\). Here \(x_i\) is a one-dimensional input, and \(y_i\) the outcome, either continuous or categorical. As an example, consider the N = 50 data points shown in the left panel of Figure 8.1.

Suppose we decide to fit a cubic spline to the data, with three knots placed at the quartiles of the X values. This is a seven-dimensional linear space of functions, and can be represented, for example, by a linear expansion of B-spline basis functions (see Section 5.9.2):

\[ \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x). \qquad (8.1) \]

Here the \(h_j(x)\), \(j = 1, 2, \ldots, 7\) are the seven functions shown in the right panel of Figure 8.1. We can think of \(\mu(x)\) as representing the conditional mean \(E(Y \mid X = x)\).

Let \(\mathbf{H}\) be the \(N \times 7\) matrix with ijth element \(h_j(x_i)\). The usual estimate of \(\beta\), obtained by minimizing the squared error over the training set, is given by

\[ \hat{\beta} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}. \qquad (8.2) \]

The corresponding fit \(\hat{\mu}(x) = \sum_{j=1}^{7} \hat{\beta}_j h_j(x)\) is shown in the top left panel of Figure 8.2.

The estimated covariance matrix of \(\hat{\beta}\) is

\[ \widehat{\mathrm{Var}}(\hat{\beta}) = (\mathbf{H}^T\mathbf{H})^{-1}\hat{\sigma}^2, \qquad (8.3) \]

where we have estimated the noise variance by \(\hat{\sigma}^2 = \sum_{i=1}^{N}(y_i - \hat{\mu}(x_i))^2/N\). Letting \(h(x)^T = (h_1(x), h_2(x), \ldots, h_7(x))\), the standard error of a predic-

FIGURE 8.2. (Top left:) B-spline smooth of data. (Top right:) B-spline smooth plus and minus 1.96× standard error bands. (Bottom left:) Ten bootstrap replicates of the B-spline smooth. (Bottom right:) B-spline smooth with 95% standard error bands computed from the bootstrap distribution.

tion \(\hat{\mu}(x) = h(x)^T\hat{\beta}\) is

\[ \widehat{\mathrm{se}}[\hat{\mu}(x)] = \bigl[h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}h(x)\bigr]^{1/2}\hat{\sigma}. \qquad (8.4) \]

In the top right panel of Figure 8.2 we have plotted \(\hat{\mu}(x) \pm 1.96 \cdot \widehat{\mathrm{se}}[\hat{\mu}(x)]\). Since 1.96 is the 97.5% point of the standard normal distribution, these represent approximate \(100 - 2 \times 2.5\% = 95\%\) pointwise confidence bands for \(\mu(x)\).

Here is how we could apply the bootstrap in this example. We draw B datasets each of size N = 50 with replacement from our training data, the sampling unit being the pair \(z_i = (x_i, y_i)\). To each bootstrap dataset \(\mathbf{Z}^*\) we fit a cubic spline \(\hat{\mu}^*(x)\); the fits from ten such samples are shown in the bottom left panel of Figure 8.2. Using B = 200 bootstrap samples, we can form a 95% pointwise confidence band from the percentiles at each x: we find the \(2.5\% \times 200 =\) fifth largest and smallest values at each x. These are plotted in the bottom right panel of Figure 8.2. The bands look similar to those in the top right, being a little wider at the endpoints.

There is actually a close connection between the least squares estimates (8.2) and (8.3), the bootstrap, and maximum likelihood. Suppose we further assume that the model errors are Gaussian,

\[ Y = \mu(X) + \varepsilon; \quad \varepsilon \sim N(0, \sigma^2), \qquad \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x). \qquad (8.5) \]

The bootstrap method described above, in which we sample with replacement from the training data, is called the nonparametric bootstrap. This really means that the method is "model-free," since it uses the raw data, not a specific parametric model, to generate new datasets. Consider a variation of the bootstrap, called the parametric bootstrap, in which we simulate new responses by adding Gaussian noise to the predicted values:

\[ y_i^* = \hat{\mu}(x_i) + \varepsilon_i^*; \quad \varepsilon_i^* \sim N(0, \hat{\sigma}^2); \quad i = 1, 2, \ldots, N. \qquad (8.6) \]

This process is repeated B times, where B = 200 say. The resulting bootstrap datasets have the form \((x_1, y_1^*), \ldots, (x_N, y_N^*)\) and we recompute the B-spline smooth on each. The confidence bands from this method will exactly equal the least squares bands in the top right panel, as the number of bootstrap samples goes to infinity.
A function estimated from a bootstrap sample \(\mathbf{y}^*\) is given by \(\hat{\mu}^*(x) = h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}^*\), and has distribution

\[ \hat{\mu}^*(x) \sim N\bigl(\hat{\mu}(x),\; h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}h(x)\,\hat{\sigma}^2\bigr). \qquad (8.7) \]

Notice that the mean of this distribution is the least squares estimate, and the standard deviation is the same as the approximate formula (8.4).
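The agreement between the parametric bootstrap (8.6) and the analytic formula (8.4) can be verified numerically. The sketch below is illustrative only: it uses simulated data and substitutes a cubic polynomial basis for the B-spline basis, since the algebra is identical for any fixed basis matrix H.

```python
import numpy as np

rng = np.random.default_rng(3)
N, B = 50, 500
x = np.sort(rng.uniform(0, 1, size=N))
H = np.vander(x, 4, increasing=True)       # cubic polynomial basis (B-spline stand-in)
beta_true = np.array([1.0, -2.0, 3.0, 0.5])
y = H @ beta_true + rng.normal(scale=0.3, size=N)

G = np.linalg.solve(H.T @ H, H.T)          # (H'H)^{-1} H'
mu_hat = H @ (G @ y)
sigma2_hat = np.mean((y - mu_hat) ** 2)

# Parametric bootstrap (8.6): resample responses from the fitted Gaussian model
x0 = np.array([0.5])
h0 = np.vander(x0, 4, increasing=True)[0]  # basis vector h(x) at x = 0.5
draws = np.empty(B)
for b in range(B):
    y_star = mu_hat + rng.normal(scale=np.sqrt(sigma2_hat), size=N)
    draws[b] = h0 @ (G @ y_star)           # bootstrap fit mu*(0.5)

se_boot = draws.std(ddof=1)
# Analytic standard error (8.4) at the same point
se_formula = np.sqrt(h0 @ np.linalg.solve(H.T @ H, h0) * sigma2_hat)
print(se_boot, se_formula)                 # should agree closely, per (8.7)
```

The two printed standard errors differ only by Monte-Carlo noise, illustrating that the parametric bootstrap reproduces the least squares sampling distribution (8.7) for this Gaussian model.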

8.2.2 Maximum Likelihood Inference

It turns out that the parametric bootstrap agrees with least squares in the previous example because the model (8.5) has additive Gaussian errors. In general, the parametric bootstrap agrees not with least squares but with maximum likelihood, which we now review.

We begin by specifying a probability density or probability mass function for our observations

\[ z_i \sim g_\theta(z). \qquad (8.8) \]

In this expression \(\theta\) represents one or more unknown parameters that govern the distribution of Z. This is called a parametric model for Z. As an example, if Z has a normal distribution with mean \(\mu\) and variance \(\sigma^2\), then

\[ \theta = (\mu, \sigma^2), \qquad (8.9) \]

and

\[ g_\theta(z) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(z-\mu)^2/2\sigma^2}. \qquad (8.10) \]

Maximum likelihood is based on the likelihood function, given by

\[ L(\theta; \mathbf{Z}) = \prod_{i=1}^{N} g_\theta(z_i), \qquad (8.11) \]

the probability of the observed data under the model \(g_\theta\). The likelihood is defined only up to a positive multiplier, which we have taken to be one. We think of \(L(\theta; \mathbf{Z})\) as a function of \(\theta\), with our data \(\mathbf{Z}\) fixed.

Denote the logarithm of \(L(\theta; \mathbf{Z})\) by

\[ \ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \ell(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i), \qquad (8.12) \]

which we will sometimes abbreviate as \(\ell(\theta)\). This expression is called the log-likelihood, and each value \(\ell(\theta; z_i) = \log g_\theta(z_i)\) is called a log-likelihood component. The method of maximum likelihood chooses the value \(\theta = \hat{\theta}\) to maximize \(\ell(\theta; \mathbf{Z})\).

The likelihood function can be used to assess the precision of \(\hat{\theta}\). We need a few more definitions. The score function is defined by

\[ \dot{\ell}(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \dot{\ell}(\theta; z_i), \qquad (8.13) \]

where \(\dot{\ell}(\theta; z_i) = \partial \ell(\theta; z_i)/\partial\theta\). Assuming that the likelihood takes its maximum in the interior of the parameter space, \(\dot{\ell}(\hat{\theta}; \mathbf{Z}) = 0\). The information matrix is

\[ \mathbf{I}(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial\theta\,\partial\theta^T}. \qquad (8.14) \]

When \(\mathbf{I}(\theta)\) is evaluated at \(\theta = \hat{\theta}\), it is often called the observed information. The Fisher information (or expected information) is

\[ \mathbf{i}(\theta) = E_\theta[\mathbf{I}(\theta)]. \qquad (8.15) \]

Finally, let \(\theta_0\) denote the true value of \(\theta\).

A standard result says that the sampling distribution of the maximum likelihood estimator has a limiting normal distribution

\[ \hat{\theta} \rightarrow N\bigl(\theta_0, \mathbf{i}(\theta_0)^{-1}\bigr), \qquad (8.16) \]

as \(N \rightarrow \infty\). Here we are independently sampling from \(g_{\theta_0}(z)\). This suggests that the sampling distribution of \(\hat{\theta}\) may be approximated by

\[ N\bigl(\hat{\theta}, \mathbf{i}(\hat{\theta})^{-1}\bigr) \quad \text{or} \quad N\bigl(\hat{\theta}, \mathbf{I}(\hat{\theta})^{-1}\bigr), \qquad (8.17) \]

where \(\hat{\theta}\) represents the maximum likelihood estimate from the observed data. The corresponding estimates for the standard errors of \(\hat{\theta}_j\) are obtained from

\[ \sqrt{\mathbf{i}(\hat{\theta})^{-1}_{jj}} \quad \text{and} \quad \sqrt{\mathbf{I}(\hat{\theta})^{-1}_{jj}}. \qquad (8.18) \]

Confidence points for \(\theta_j\) can be constructed from either approximation in (8.17). Such a confidence point has the form \(\hat{\theta}_j - z^{(1-\alpha)}\sqrt{\mathbf{i}(\hat{\theta})^{-1}_{jj}}\) or \(\hat{\theta}_j - z^{(1-\alpha)}\sqrt{\mathbf{I}(\hat{\theta})^{-1}_{jj}}\), respectively, where \(z^{(1-\alpha)}\) is the \(1-\alpha\) percentile of the standard normal distribution. More accurate confidence intervals can be derived from the likelihood function, by using the chi-squared approximation

\[ 2\bigl[\ell(\hat{\theta}) - \ell(\theta_0)\bigr] \sim \chi^2_p, \qquad (8.19) \]

where p is the number of components in \(\theta\). The resulting \(1 - 2\alpha\) confidence interval is the set of all \(\theta_0\) such that \(2[\ell(\hat{\theta}) - \ell(\theta_0)] \le {\chi^2_p}^{(1-2\alpha)}\), where \({\chi^2_p}^{(1-2\alpha)}\) is the \(1 - 2\alpha\) percentile of the chi-squared distribution with p degrees of freedom.
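As a minimal illustration (ours, with a simulated Gaussian sample), the information-based standard error and a confidence interval of the form above can be computed directly for the mean of a normal model, where the information for \(\mu\) is \(N/\sigma^2\) and so \(\mathrm{se}(\hat{\mu}) = \hat{\sigma}/\sqrt{N}\).

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(loc=5.0, scale=2.0, size=400)
N = z.size

# Maximum likelihood estimates for the N(mu, sigma^2) model
mu_hat = z.mean()
sig2_hat = np.mean((z - mu_hat) ** 2)

# Information for mu is N / sigma^2, so se(mu_hat) = sigma-hat / sqrt(N)
se_mu = np.sqrt(sig2_hat / N)

# Approximate 95% interval from (8.17): mu_hat +/- z^(0.975) * se
lo, hi = mu_hat - 1.96 * se_mu, mu_hat + 1.96 * se_mu
print(mu_hat, (lo, hi))
```

Here the interval is symmetric because the Gaussian log-likelihood in \(\mu\) is exactly quadratic; for less regular likelihoods the chi-squared approach (8.19) gives more accurate, possibly asymmetric, intervals.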

Let's return to our smoothing example to see what maximum likelihood yields. The parameters are \(\theta = (\beta, \sigma^2)\). The log-likelihood is

\[ \ell(\theta) = -\frac{N}{2}\log \sigma^2 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - h(x_i)^T\beta\bigr)^2. \qquad (8.20) \]

The maximum likelihood estimate is obtained by setting \(\partial\ell/\partial\beta = 0\) and \(\partial\ell/\partial\sigma^2 = 0\), giving

\[ \hat{\beta} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{\mu}(x_i)\bigr)^2, \qquad (8.21) \]

which are the same as the usual estimates given in (8.2) and below (8.3).

The information matrix for \(\theta = (\beta, \sigma^2)\) is block-diagonal, and the block corresponding to \(\beta\) is

\[ \mathbf{I}(\beta) = \mathbf{H}^T\mathbf{H}/\sigma^2, \qquad (8.22) \]

so that the estimated variance \((\mathbf{H}^T\mathbf{H})^{-1}\hat{\sigma}^2\) agrees with the least squares estimate (8.3).

8.2.3 Bootstrap versus Maximum Likelihood

In essence the bootstrap is a computer implementation of nonparametric or parametric maximum likelihood. The advantage of the bootstrap over the maximum likelihood formula is that it allows us to compute maximum likelihood estimates of standard errors and other quantities in settings where no formulas are available.

In our example, suppose that we adaptively choose by cross-validation the number and position of the knots that define the B-splines, rather than fix them in advance. Denote by \(\lambda\) the collection of knots and their positions. Then the standard errors and confidence bands should account for the adaptive choice of \(\lambda\), but there is no way to do this analytically. With the bootstrap, we compute the B-spline smooth with an adaptive choice of knots for each bootstrap sample. The percentiles of the resulting curves capture the variability from both the noise in the targets as well as that from \(\hat{\lambda}\). In this particular example the confidence bands (not shown) don't look much different than the fixed \(\lambda\) bands. But in other problems, where more adaptation is used, this can be an important effect to capture.

8.3 Bayesian Methods

In the Bayesian approach to inference, we specify a sampling model \(\Pr(\mathbf{Z}\mid\theta)\) (density or probability mass function) for our data given the parameters,

and a prior distribution for the parameters \(\Pr(\theta)\) reflecting our knowledge about \(\theta\) before we see the data. We then compute the posterior distribution

\[ \Pr(\theta \mid \mathbf{Z}) = \frac{\Pr(\mathbf{Z} \mid \theta)\,\Pr(\theta)}{\int \Pr(\mathbf{Z} \mid \theta)\,\Pr(\theta)\,d\theta}, \qquad (8.23) \]

which represents our updated knowledge about \(\theta\) after we see the data. To understand this posterior distribution, one might draw samples from it or summarize by computing its mean or mode. The Bayesian approach differs from the standard ("frequentist") method for inference in its use of a prior distribution to express the uncertainty present before seeing the data, and to allow the uncertainty remaining after seeing the data to be expressed in the form of a posterior distribution.

The posterior distribution also provides the basis for predicting the values of a future observation \(z^{\mathrm{new}}\), via the predictive distribution:

\[ \Pr(z^{\mathrm{new}} \mid \mathbf{Z}) = \int \Pr(z^{\mathrm{new}} \mid \theta)\,\Pr(\theta \mid \mathbf{Z})\,d\theta. \qquad (8.24) \]

In contrast, the maximum likelihood approach would use \(\Pr(z^{\mathrm{new}} \mid \hat{\theta})\), the data density evaluated at the maximum likelihood estimate, to predict future data. Unlike the predictive distribution (8.24), this does not account for the uncertainty in estimating \(\theta\).

Let's walk through the Bayesian approach in our smoothing example. We start with the parametric model given by equation (8.5), and assume for the moment that \(\sigma^2\) is known. We assume that the observed feature values \(x_1, x_2, \ldots, x_N\) are fixed, so that the randomness in the data comes solely from y varying around its mean \(\mu(x)\).

The second ingredient we need is a prior distribution. Distributions on functions are fairly complex entities: one approach is to use a Gaussian process prior in which we specify the prior covariance between any two function values \(\mu(x)\) and \(\mu(x')\) (Wahba, 1990; Neal, 1996).

Here we take a simpler route: by considering a finite B-spline basis for \(\mu(x)\), we can instead provide a prior for the coefficients \(\beta\), and this implicitly defines a prior for \(\mu(x)\). We choose a Gaussian prior centered at zero

\[ \beta \sim N(0, \tau\boldsymbol{\Sigma}) \qquad (8.25) \]

with the choices of the prior correlation matrix \(\boldsymbol{\Sigma}\) and variance \(\tau\) to be discussed below.
The implicit process prior for \(\mu(x)\) is hence Gaussian, with covariance kernel

\[ K(x, x') = \mathrm{cov}[\mu(x), \mu(x')] = \tau\,h(x)^T\boldsymbol{\Sigma}\,h(x'). \qquad (8.26) \]

FIGURE 8.3. Smoothing example: Ten draws from the Gaussian prior distribution for the function \(\mu(x)\).

The posterior distribution for \(\beta\) is also Gaussian, with mean and covariance

\[ E(\beta \mid \mathbf{Z}) = \Bigl(\mathbf{H}^T\mathbf{H} + \frac{\sigma^2}{\tau}\boldsymbol{\Sigma}^{-1}\Bigr)^{-1}\mathbf{H}^T\mathbf{y}, \qquad \mathrm{cov}(\beta \mid \mathbf{Z}) = \Bigl(\mathbf{H}^T\mathbf{H} + \frac{\sigma^2}{\tau}\boldsymbol{\Sigma}^{-1}\Bigr)^{-1}\sigma^2, \qquad (8.27) \]

with the corresponding posterior values for \(\mu(x)\),

\[ E(\mu(x) \mid \mathbf{Z}) = h(x)^T\Bigl(\mathbf{H}^T\mathbf{H} + \frac{\sigma^2}{\tau}\boldsymbol{\Sigma}^{-1}\Bigr)^{-1}\mathbf{H}^T\mathbf{y}, \qquad \mathrm{cov}[\mu(x), \mu(x') \mid \mathbf{Z}] = h(x)^T\Bigl(\mathbf{H}^T\mathbf{H} + \frac{\sigma^2}{\tau}\boldsymbol{\Sigma}^{-1}\Bigr)^{-1}h(x')\,\sigma^2. \qquad (8.28) \]

How do we choose the prior correlation matrix \(\boldsymbol{\Sigma}\)? In some settings the prior can be chosen from subject matter knowledge about the parameters. Here we are willing to say the function \(\mu(x)\) should be smooth, and have guaranteed this by expressing \(\mu\) in a smooth low-dimensional basis of B-splines. Hence we can take the prior correlation matrix to be the identity \(\boldsymbol{\Sigma} = \mathbf{I}\). When the number of basis functions is large, this might not be sufficient, and additional smoothness can be enforced by imposing restrictions on \(\boldsymbol{\Sigma}\); this is exactly the case with smoothing splines (Section 5.8.1).

Figure 8.3 shows ten draws from the corresponding prior for \(\mu(x)\). To generate posterior values of the function \(\mu(x)\), we generate values \(\beta'\) from its posterior (8.27), giving corresponding posterior value \(\mu'(x) = \sum_{j=1}^{7}\beta'_j h_j(x)\). Ten such posterior curves are shown in Figure 8.4. Two different values were used for the prior variance \(\tau\): 1 and 1000. Notice how similar the right panel looks to the bootstrap distribution in the bottom left panel

FIGURE 8.4. Smoothing example: Ten draws from the posterior distribution for the function μ(x), for two different values of the prior variance τ (left panel: τ = 1; right panel: τ = 1000). The purple curves are the posterior means.

of Figure 8.2 on page 263. This similarity is no accident. As τ → ∞, the posterior distribution (8.27) and the bootstrap distribution (8.7) coincide. On the other hand, for τ = 1, the posterior curves μ(x) in the left panel of Figure 8.4 are smoother than the bootstrap curves, because we have imposed more prior weight on smoothness.

The distribution (8.25) with τ → ∞ is called a noninformative prior for θ. In Gaussian models, maximum likelihood and parametric bootstrap analyses tend to agree with Bayesian analyses that use a noninformative prior for the free parameters. These tend to agree, because with a constant prior, the posterior distribution is proportional to the likelihood. This correspondence also extends to the nonparametric case, where the nonparametric bootstrap approximates a noninformative Bayes analysis; Section 8.4 has the details.

We have, however, done some things that are not proper from a Bayesian point of view. We have used a noninformative (constant) prior for σ² and replaced it with the maximum likelihood estimate σ̂² in the posterior. A more standard Bayesian analysis would also put a prior on σ (typically g(σ) ∝ 1/σ), calculate a joint posterior for μ(x) and σ², and then integrate out σ², rather than just extract the maximum of the posterior distribution ("MAP" estimate).
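The posterior mean and covariance formulas above are direct matrix algebra. The following sketch is an illustration only, not code from the book: the basis matrix H, the noise variance, and the prior variance are supplied by the caller, and Σ defaults to the identity as suggested in the text.

```python
import numpy as np

def posterior_beta(H, y, sigma2, tau, Sigma=None):
    """Posterior mean and covariance of the basis coefficients beta:
    E(beta|Z)   = (H'H + (sigma2/tau) Sigma^{-1})^{-1} H'y
    cov(beta|Z) = (H'H + (sigma2/tau) Sigma^{-1})^{-1} sigma2."""
    p = H.shape[1]
    if Sigma is None:
        Sigma = np.eye(p)  # Sigma = I, as suggested in the text
    A = H.T @ H + (sigma2 / tau) * np.linalg.inv(Sigma)
    Ainv = np.linalg.inv(A)
    return Ainv @ H.T @ y, Ainv * sigma2

def draw_posterior_curves(H, y, sigma2, tau, n_draws=10, rng=None):
    """Draw curves mu'(x) = sum_j beta'_j h_j(x), with beta' drawn
    from its Gaussian posterior; each row is one posterior curve
    evaluated at the training inputs."""
    rng = np.random.default_rng(rng)
    mean, cov = posterior_beta(H, y, sigma2, tau)
    betas = rng.multivariate_normal(mean, cov, size=n_draws)
    return betas @ H.T
```

As τ grows, the penalty term vanishes and the posterior mean approaches the least squares fit, mirroring the agreement with the bootstrap described above.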

8.4 Relationship Between the Bootstrap and Bayesian Inference

Consider first a very simple example, in which we observe a single observation z from a normal distribution

z ∼ N(θ, 1).   (8.29)

To carry out a Bayesian analysis for θ, we need to specify a prior. The most convenient and common choice would be θ ∼ N(0, τ), giving the posterior distribution

θ|z ∼ N( z/(1 + 1/τ), 1/(1 + 1/τ) ).   (8.30)

Now the larger we take τ, the more concentrated the posterior becomes around the maximum likelihood estimate θ̂ = z. In the limit as τ → ∞ we obtain a noninformative (constant) prior, and the posterior distribution is

θ|z ∼ N(z, 1).   (8.31)

This is the same as a parametric bootstrap distribution in which we generate bootstrap values z* from the maximum likelihood estimate of the sampling density N(z, 1).

There are three ingredients that make this correspondence work:

1. The choice of noninformative prior for θ.
2. The dependence of the log-likelihood ℓ(θ; Z) on the data Z only through the maximum likelihood estimate θ̂. Hence we can write the log-likelihood as ℓ(θ; θ̂).
3. The symmetry of the log-likelihood in θ and θ̂, that is, ℓ(θ; θ̂) = ℓ(θ̂; θ) + constant.

Properties (2) and (3) essentially only hold for the Gaussian distribution. However, they also hold approximately for the multinomial distribution, leading to a correspondence between the nonparametric bootstrap and Bayes inference, which we outline next.

Assume that we have a discrete sample space with L categories. Let w_j be the probability that a sample point falls in category j, and ŵ_j the observed proportion in category j. Let w = (w_1, w_2, ..., w_L), ŵ = (ŵ_1, ŵ_2, ..., ŵ_L). Denote our estimator by S(ŵ); take as a prior distribution for w a symmetric Dirichlet distribution with parameter a:

w ∼ Di_L(a),   (8.32)

that is, the prior probability mass function is proportional to ∏_{l=1}^{L} w_l^{a-1}. Then the posterior density of w is

w ∼ Di_L(a + Nŵ),   (8.33)

where N is the sample size. Letting a → 0 to obtain a noninformative prior gives

w ∼ Di_L(Nŵ).   (8.34)

Now the bootstrap distribution, obtained by sampling with replacement from the data, can be expressed as sampling the category proportions from a multinomial distribution. Specifically,

Nŵ* ∼ Mult(N, ŵ),   (8.35)

where Mult(N, ŵ) denotes a multinomial distribution, with probability mass function (N choose Nŵ*_1, ..., Nŵ*_L) ∏_l ŵ_l^{Nŵ*_l}. This distribution is similar to the posterior distribution above, having the same support, same mean, and nearly the same covariance matrix. Hence the bootstrap distribution of S(ŵ*) will closely approximate the posterior distribution of S(w). In this sense, the bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution. Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.

8.5 The EM Algorithm

The EM algorithm is a popular tool for simplifying difficult maximum likelihood problems. We first describe it in the context of a simple mixture model.

8.5.1 Two-Component Mixture Model

In this section we describe a simple mixture model for density estimation, and the associated EM algorithm for carrying out maximum likelihood estimation. This has a natural connection to Gibbs sampling methods for Bayesian inference. Mixture models are discussed and demonstrated in several other parts of the book, in particular Sections 6.8, 12.7 and 13.2.3.

The left panel of Figure 8.5 shows a histogram of the 20 fictitious data points in Table 8.1.
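Looking back for a moment at the bootstrap/Dirichlet correspondence of Section 8.4, it is easy to check numerically. The sketch below is an illustration only: the three-category counts and category values are made up, and the statistic S is taken to be a weighted mean of the category values.

```python
import numpy as np

# Hypothetical data: N = 20 observations over L = 3 categories,
# with made-up category values and observed counts.
rng = np.random.default_rng(1)
vals = np.array([1.0, 2.0, 5.0])
counts = np.array([8, 7, 5])
N = counts.sum()
w_hat = counts / N

def S(w):
    """The statistic: a weighted mean of the category values."""
    return w @ vals

B = 20000
# Nonparametric bootstrap: resampled proportions are Mult(N, w_hat) / N.
boot_S = S(rng.multinomial(N, w_hat, size=B) / N)
# Noninformative Bayes posterior: w ~ Di_L(N * w_hat).
post_S = S(rng.dirichlet(N * w_hat, size=B))
```

The two simulated distributions of S share the same mean and nearly the same spread, as the text asserts; the Dirichlet posterior is very slightly tighter (its variances carry an N + 1 rather than N in the denominator).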

FIGURE 8.5. Mixture example. (Left panel:) Histogram of data. (Right panel:) Maximum likelihood fit of Gaussian densities (solid red) and responsibility (dotted green) of the left component density for observation y, as a function of y.

TABLE 8.1. Twenty fictitious data points used in the two-component mixture example in Figure 8.5.

We would like to model the density of the data points, and due to the apparent bi-modality, a Gaussian distribution would not be appropriate. There seem to be two separate underlying regimes, so instead we model Y as a mixture of two normal distributions:

Y_1 ∼ N(μ_1, σ_1²),
Y_2 ∼ N(μ_2, σ_2²),   (8.36)
Y = (1 − Δ) · Y_1 + Δ · Y_2,

where Δ ∈ {0, 1} with Pr(Δ = 1) = π. This generative representation is explicit: generate a Δ ∈ {0, 1} with probability π, and then depending on the outcome, deliver either Y_1 or Y_2. Let φ_θ(x) denote the normal density with parameters θ = (μ, σ²). Then the density of Y is

g_Y(y) = (1 − π) φ_{θ1}(y) + π φ_{θ2}(y).   (8.37)

Now suppose we wish to fit this model to the data in Figure 8.5 by maximum likelihood. The parameters are

θ = (π, θ_1, θ_2) = (π, μ_1, σ_1², μ_2, σ_2²).   (8.38)

The log-likelihood based on the N training cases is

ℓ(θ; Z) = Σ_{i=1}^{N} log[ (1 − π) φ_{θ1}(y_i) + π φ_{θ2}(y_i) ].   (8.39)

Direct maximization of ℓ(θ; Z) is quite difficult numerically, because of the sum of terms inside the logarithm. There is, however, a simpler approach. We consider unobserved latent variables Δ_i taking values 0 or 1 as in (8.36): if Δ_i = 1 then Y_i comes from model 2, otherwise it comes from model 1. Suppose we knew the values of the Δ_i's. Then the log-likelihood would be

ℓ_0(θ; Z, Δ) = Σ_{i=1}^{N} [ (1 − Δ_i) log φ_{θ1}(y_i) + Δ_i log φ_{θ2}(y_i) ]
             + Σ_{i=1}^{N} [ (1 − Δ_i) log(1 − π) + Δ_i log π ],   (8.40)

and the maximum likelihood estimates of μ_1 and σ_1² would be the sample mean and variance for those data with Δ_i = 0, and similarly those for μ_2 and σ_2² would be the sample mean and variance of the data with Δ_i = 1. The estimate of π would be the proportion of Δ_i = 1.

Since the values of the Δ_i's are actually unknown, we proceed in an iterative fashion, substituting for each Δ_i in (8.40) its expected value

γ_i(θ) = E(Δ_i | θ, Z) = Pr(Δ_i = 1 | θ, Z),   (8.41)

also called the responsibility of model 2 for observation i. We use a procedure called the EM algorithm, given in Algorithm 8.1 for the special case of Gaussian mixtures. In the expectation step, we do a soft assignment of each observation to each model: the current estimates of the parameters are used to assign responsibilities according to the relative density of the training points under each model. In the maximization step, these responsibilities are used in weighted maximum-likelihood fits to update the estimates of the parameters.

A good way to construct initial guesses for μ̂_1 and μ̂_2 is simply to choose two of the y_i at random. Both σ̂_1² and σ̂_2² can be set equal to the overall sample variance Σ_{i=1}^{N} (y_i − ȳ)²/N. The mixing proportion π̂ can be started at the value 0.5.

Note that the actual maximizer of the likelihood occurs when we put a spike of infinite height at any one data point, that is, μ̂_1 = y_i for some i and σ̂_1² = 0. This gives infinite likelihood, but is not a useful solution. Hence we are actually looking for a good local maximum of the likelihood, one for which σ̂_1², σ̂_2² > 0. To further complicate matters, there can be more than one local maximum having σ̂_1², σ̂_2² > 0.
In our example, we ran the EM algorithm with a number of different initial guesses for the parameters, all having σ̂_k² > 0.5, and chose the run that gave us the highest maximized likelihood. Figure 8.6 shows the progress of the EM algorithm in maximizing the log-likelihood. Table 8.2 shows π̂ = Σ_i γ̂_i / N, the maximum likelihood estimate of the proportion of observations in class 2, at selected iterations of the EM procedure.

Algorithm 8.1 EM Algorithm for Two-component Gaussian Mixture.

1. Take initial guesses for the parameters μ̂_1, σ̂_1², μ̂_2, σ̂_2², π̂ (see text).
2. Expectation Step: compute the responsibilities

   γ̂_i = π̂ φ_{θ̂2}(y_i) / [ (1 − π̂) φ_{θ̂1}(y_i) + π̂ φ_{θ̂2}(y_i) ],  i = 1, 2, ..., N.   (8.42)

3. Maximization Step: compute the weighted means and variances:

   μ̂_1 = Σ_{i=1}^{N} (1 − γ̂_i) y_i / Σ_{i=1}^{N} (1 − γ̂_i),
   σ̂_1² = Σ_{i=1}^{N} (1 − γ̂_i)(y_i − μ̂_1)² / Σ_{i=1}^{N} (1 − γ̂_i),
   μ̂_2 = Σ_{i=1}^{N} γ̂_i y_i / Σ_{i=1}^{N} γ̂_i,
   σ̂_2² = Σ_{i=1}^{N} γ̂_i (y_i − μ̂_2)² / Σ_{i=1}^{N} γ̂_i,

   and the mixing probability π̂ = Σ_{i=1}^{N} γ̂_i / N.

4. Iterate steps 2 and 3 until convergence.

TABLE 8.2. Selected iterations of the EM algorithm for the mixture example.

The final maximum likelihood estimates are

μ̂_1 = 4.62, σ̂_1² = 0.87,
μ̂_2 = 1.06, σ̂_2² = 0.77,
π̂ = 0.546.

The right panel of Figure 8.5 shows the estimated Gaussian mixture density from this procedure (solid red curve), along with the responsibilities (dotted green curve). Note that mixtures are also useful for supervised learning; in Section 6.7 we show how the Gaussian mixture model leads to a version of radial basis functions.
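Algorithm 8.1 translates almost line for line into code. The sketch below is an illustration only; the initializations follow the suggestions in the text (two random data points for the means, the overall sample variance for both variances, π̂ = 0.5), with an optional explicit mean initialization added for reproducibility.

```python
import numpy as np

def normal_pdf(y, mu, var):
    """Gaussian density phi_theta(y) with theta = (mu, var)."""
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(y, n_iter=100, mu_init=None, rng=None):
    """EM for the two-component Gaussian mixture of Algorithm 8.1."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    if mu_init is None:
        mu1, mu2 = rng.choice(y, size=2, replace=False)
    else:
        mu1, mu2 = mu_init
    var1 = var2 = y.var()  # overall sample variance for both components
    pi = 0.5
    for _ in range(n_iter):
        # E step: responsibilities of component 2.
        p1 = (1 - pi) * normal_pdf(y, mu1, var1)
        p2 = pi * normal_pdf(y, mu2, var2)
        gamma = p2 / (p1 + p2)
        # M step: weighted means, variances and mixing proportion.
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = ((1 - gamma) * y).sum() / w1
        mu2 = (gamma * y).sum() / w2
        var1 = ((1 - gamma) * (y - mu1) ** 2).sum() / w1
        var2 = (gamma * (y - mu2) ** 2).sum() / w2
        pi = gamma.mean()
    return mu1, var1, mu2, var2, pi
```

As the text warns, runs can land in poor local maxima (or variance spikes), so in practice one restarts from several initializations and keeps the run with the highest log-likelihood.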

FIGURE 8.6. EM algorithm: observed data log-likelihood as a function of the iteration number.

8.5.2 The EM Algorithm in General

The above procedure is an example of the EM (or Baum-Welch) algorithm for maximizing likelihoods in certain classes of problems. These problems are ones for which maximization of the likelihood is difficult, but made easier by enlarging the sample with latent (unobserved) data. This is called data augmentation. Here the latent data are the model memberships Δ_i. In other problems, the latent data are actual data that should have been observed but are missing.

Algorithm 8.2 gives the general formulation of the EM algorithm. Our observed data is Z, having log-likelihood ℓ(θ; Z) depending on parameters θ. The latent or missing data is Z^m, so that the complete data is T = (Z, Z^m) with log-likelihood ℓ_0(θ; T), ℓ_0 based on the complete density. In the mixture problem (Z, Z^m) = (y, Δ), and ℓ_0(θ; T) is given in (8.40).

In our mixture example, E(ℓ_0(θ′; T) | Z, θ̂^(j)) is simply (8.40) with the Δ_i replaced by the responsibilities γ̂_i(θ̂), and the maximizers in step 3 are just weighted means and variances.

We now give an explanation of why the EM algorithm works in general. Since

Pr(Z^m | Z, θ′) = Pr(Z^m, Z | θ′) / Pr(Z | θ′),   (8.44)

we can write

Pr(Z | θ′) = Pr(T | θ′) / Pr(Z^m | Z, θ′).   (8.45)

In terms of log-likelihoods, we have ℓ(θ′; Z) = ℓ_0(θ′; T) − ℓ_1(θ′; Z^m | Z), where ℓ_1 is based on the conditional density Pr(Z^m | Z, θ′). Taking conditional expectations with respect to the distribution of T | Z governed by parameter θ gives

ℓ(θ′; Z) = E[ℓ_0(θ′; T) | Z, θ] − E[ℓ_1(θ′; Z^m | Z) | Z, θ]

         ≡ Q(θ′, θ) − R(θ′, θ).   (8.46)

Algorithm 8.2 The EM Algorithm.

1. Start with initial guesses for the parameters θ̂^(0).
2. Expectation Step: at the jth step, compute

   Q(θ′, θ̂^(j)) = E(ℓ_0(θ′; T) | Z, θ̂^(j))   (8.43)

   as a function of the dummy argument θ′.
3. Maximization Step: determine the new estimate θ̂^(j+1) as the maximizer of Q(θ′, θ̂^(j)) over θ′.
4. Iterate steps 2 and 3 until convergence.

In the M step, the EM algorithm maximizes Q(θ′, θ) over θ′, rather than the actual objective function ℓ(θ′; Z). Why does it succeed in maximizing ℓ(θ′; Z)? Note that R(θ*, θ) is the expectation of a log-likelihood of a density (indexed by θ*), with respect to the same density indexed by θ, and hence (by Jensen's inequality) is maximized as a function of θ* when θ* = θ (see Exercise 8.1). So if θ′ maximizes Q(θ′, θ), we see that

ℓ(θ′; Z) − ℓ(θ; Z) = [Q(θ′, θ) − Q(θ, θ)] − [R(θ′, θ) − R(θ, θ)] ≥ 0.   (8.47)

Hence the EM iteration never decreases the log-likelihood.

This argument also makes it clear that a full maximization in the M step is not necessary: we need only find a value θ̂^(j+1) so that Q(θ′, θ̂^(j)) increases as a function of the first argument, that is, Q(θ̂^(j+1), θ̂^(j)) > Q(θ̂^(j), θ̂^(j)). Such procedures are called GEM (generalized EM) algorithms. The EM algorithm can also be viewed as a minorization procedure (see Exercise 8.7).

8.5.3 EM as a Maximization-Maximization Procedure

Here is a different view of the EM procedure, as a joint maximization algorithm. Consider the function

F(θ′, P̃) = E_P̃[ℓ_0(θ′; T)] − E_P̃[log P̃(Z^m)].   (8.48)

Here P̃(Z^m) is any distribution over the latent data Z^m. In the mixture example, P̃(Z^m) comprises the set of probabilities γ_i = Pr(Δ_i = 1 | θ, Z). Note that F evaluated at P̃(Z^m) = Pr(Z^m | Z, θ′) is the log-likelihood of

FIGURE 8.7. Maximization-maximization view of the EM algorithm. Shown are the contours of the (augmented) observed data log-likelihood F(θ′, P̃). The E step is equivalent to maximizing the log-likelihood over the parameters of the latent data distribution. The M step maximizes it over the parameters of the log-likelihood. The red curve corresponds to the observed data log-likelihood, a profile obtained by maximizing F(θ′, P̃) for each value of θ′.

the observed data, from (8.46) (note that (8.46) holds for all θ′, including θ′ = θ). The function F expands the domain of the log-likelihood, to facilitate its maximization.

The EM algorithm can be viewed as a joint maximization method for F over θ′ and P̃(Z^m), by fixing one argument and maximizing over the other. The maximizer over P̃(Z^m) for fixed θ′ can be shown to be

P̃(Z^m) = Pr(Z^m | Z, θ′)   (8.49)

(Exercise 8.2). This is the distribution computed by the E step, for example, (8.42) in the mixture example. In the M step, we maximize F(θ′, P̃) over θ′ with P̃ fixed: this is the same as maximizing the first term E_P̃[ℓ_0(θ′; T)], since the second term does not involve θ′.

Finally, since F(θ′, P̃) and the observed data log-likelihood agree when P̃(Z^m) = Pr(Z^m | Z, θ′), maximization of the former accomplishes maximization of the latter. Figure 8.7 shows a schematic view of this process.

This view of the EM algorithm leads to alternative maximization procedures. For example, one does not need to maximize with respect to all of the latent data parameters at once, but could instead maximize over one of them at a time, alternating with the M step.

8.6 MCMC for Sampling from the Posterior

Having defined a Bayesian model, one would like to draw samples from the resulting posterior distribution, in order to make inferences about the parameters. Except for simple models, this is often a difficult computational problem. In this section we discuss the Markov chain Monte Carlo (MCMC) approach to posterior sampling. We will see that Gibbs sampling, an MCMC procedure, is closely related to the EM algorithm: the main difference is that it samples from the conditional distributions rather than maximizing over them.

Consider first the following abstract problem. We have random variables U_1, U_2, ..., U_K and we wish to draw a sample from their joint distribution. Suppose this is difficult to do, but it is easy to simulate from the conditional distributions Pr(U_j | U_1, U_2, ..., U_{j−1}, U_{j+1}, ..., U_K), j = 1, 2, ..., K. The Gibbs sampling procedure alternately simulates from each of these distributions and, when the process stabilizes, provides a sample from the desired joint distribution. The procedure is defined in Algorithm 8.3.

Algorithm 8.3 Gibbs Sampler.

1. Take some initial values U_k^(0), k = 1, 2, ..., K.
2. Repeat for t = 1, 2, ...: for k = 1, 2, ..., K generate U_k^(t) from
   Pr(U_k^(t) | U_1^(t), ..., U_{k−1}^(t), U_{k+1}^(t−1), ..., U_K^(t−1)).
3. Continue step 2 until the joint distribution of (U_1^(t), U_2^(t), ..., U_K^(t)) does not change.

Under regularity conditions it can be shown that this procedure eventually stabilizes, and the resulting random variables are indeed a sample from the joint distribution of U_1, U_2, ..., U_K. This occurs despite the fact that the samples (U_1^(t), U_2^(t), ..., U_K^(t)) are clearly not independent for different t. More formally, Gibbs sampling produces a Markov chain whose stationary distribution is the true joint distribution, and hence the term "Markov chain Monte Carlo." It is not surprising that the true joint distribution is stationary under this process, as the successive steps leave the marginal distributions of the U_k's unchanged.

Note that we do not need to know the explicit form of the conditional densities, but just need to be able to sample from them. After the procedure reaches stationarity, the marginal density of any subset of the variables can be approximated by a density estimate applied to the sample values. However, if the explicit form of the conditional density Pr(U_k | U_l, l ≠ k) is available, a better estimate of, say, the marginal density of U_k can be obtained from (Exercise 8.3):

P̂r_{U_k}(u) = (1/(M − m + 1)) Σ_{t=m}^{M} Pr(u | U_l^(t), l ≠ k).   (8.50)

Here we have averaged over the last M − m + 1 members of the sequence, to allow for an initial "burn-in" period before stationarity is reached.

Now getting back to Bayesian inference, our goal is to draw a sample from the joint posterior of the parameters given the data Z. Gibbs sampling will be helpful if it is easy to sample from the conditional distribution of each parameter given the other parameters and Z. An example, the Gaussian mixture problem, is detailed next.

There is a close connection between Gibbs sampling from a posterior and the EM algorithm in exponential family models. The key is to consider the latent data Z^m from the EM procedure to be another parameter for the Gibbs sampler. To make this explicit for the Gaussian mixture problem, we take our parameters to be (θ, Z^m). For simplicity we fix the variances σ_1², σ_2² and mixing proportion π at their maximum likelihood values, so that the only unknown parameters in θ are the means μ_1 and μ_2. The Gibbs sampler for the mixture problem is given in Algorithm 8.4. We see that steps 2(a) and 2(b) are the same as the E and M steps of the EM procedure, except that we sample rather than maximize. In step 2(a), rather than compute the maximum likelihood responsibilities γ_i = E(Δ_i | θ, Z), the Gibbs sampling procedure simulates the latent data Δ_i from the distributions Pr(Δ_i | θ, Z). In step 2(b), rather than compute the maximizers of the posterior Pr(μ_1, μ_2, Δ | Z), we simulate from the conditional distribution Pr(μ_1, μ_2 | Δ, Z).
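The Gibbs sampler just described can be sketched as follows. This is an illustration only, not code from the book: as in the text, the variances and mixing proportion are held fixed, and for the mean draws we use the standard flat-prior conditional N(μ̂_k, σ_k²/n_k), where n_k is the current size of group k.

```python
import numpy as np

def gibbs_mixture(y, mu_init, sigma2, pi, n_iter=200, rng=None):
    """Gibbs sampler in the spirit of Algorithm 8.4: sigma2 = (s1, s2)
    and pi are held fixed; only the means mu1, mu2 and the latent
    indicators Delta are sampled. Returns a trace of
    (mu1, mu2, proportion of Delta_i = 1) per iteration."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    mu1, mu2 = mu_init
    s1, s2 = sigma2
    trace = []
    for _ in range(n_iter):
        # (a) Sample each Delta_i from Pr(Delta_i = 1 | theta, Z) = gamma_i.
        # The 1/sqrt(2*pi) factors cancel in the ratio and are omitted.
        p1 = (1 - pi) * np.exp(-(y - mu1) ** 2 / (2 * s1)) / np.sqrt(s1)
        p2 = pi * np.exp(-(y - mu2) ** 2 / (2 * s2)) / np.sqrt(s2)
        gamma = p2 / (p1 + p2)
        delta = rng.random(y.size) < gamma
        # (b) Sample the means around the within-group averages.
        n2 = int(delta.sum())
        n1 = y.size - n2
        if n1 > 0:
            mu1 = rng.normal(y[~delta].mean(), np.sqrt(s1 / n1))
        if n2 > 0:
            mu2 = rng.normal(y[delta].mean(), np.sqrt(s2 / n2))
        trace.append((mu1, mu2, n2 / y.size))
    return np.array(trace)
```

After a short burn-in, the sampled means fluctuate around the within-group maximum likelihood values, mirroring the behavior described for Figure 8.8 below.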
Figure 8.8 shows 200 iterations of Gibbs sampling, with the mean parameters μ_1 (lower) and μ_2 (upper) shown in the left panel, and the proportion of class 2 observations Σ_i Δ_i / N on the right. Horizontal broken lines have been drawn at the maximum likelihood estimate values μ̂_1, μ̂_2 and Σ_i γ̂_i / N in each case. The values seem to stabilize quite quickly, and are distributed evenly around the maximum likelihood values.

Algorithm 8.4 Gibbs sampling for mixtures.

1. Take some initial values θ^(0) = (μ_1^(0), μ_2^(0)).
2. Repeat for t = 1, 2, ...:
   (a) For i = 1, 2, ..., N generate Δ_i^(t) ∈ {0, 1} with Pr(Δ_i^(t) = 1) = γ̂_i(θ^(t)), from equation (8.42).
   (b) Set
       μ̂_1 = Σ_{i=1}^{N} (1 − Δ_i^(t)) y_i / Σ_{i=1}^{N} (1 − Δ_i^(t)),
       μ̂_2 = Σ_{i=1}^{N} Δ_i^(t) y_i / Σ_{i=1}^{N} Δ_i^(t),
       and generate μ_1^(t) ∼ N(μ̂_1, σ̂_1²) and μ_2^(t) ∼ N(μ̂_2, σ̂_2²).
3. Continue step 2 until the joint distribution of (Δ^(t), μ_1^(t), μ_2^(t)) does not change.

FIGURE 8.8. Mixture example. (Left panel:) 200 values of the two mean parameters from Gibbs sampling; horizontal lines are drawn at the maximum likelihood estimates μ̂_1, μ̂_2. (Right panel:) Proportion of values with Δ_i = 1, for each of the 200 Gibbs sampling iterations; a horizontal line is drawn at Σ_i γ̂_i / N.

The above mixture model was simplified, in order to make clear the connection between Gibbs sampling and the EM algorithm. More realistically, one would put a prior distribution on the variances σ_1², σ_2² and mixing proportion π, and include separate Gibbs sampling steps in which we sample from their posterior distributions, conditional on the other parameters. One can also incorporate proper (informative) priors for the mean parameters. These priors must not be improper, as this will lead to a degenerate posterior, with all the mixing weight on one component.

Gibbs sampling is just one of a number of recently developed procedures for sampling from posterior distributions. It uses conditional sampling of each parameter given the rest, and is useful when the structure of the problem makes this sampling easy to carry out. Other methods do not require such structure, for example the Metropolis-Hastings algorithm. These and other computational Bayesian methods have been applied to sophisticated learning algorithms such as Gaussian process models and neural networks. Details may be found in the references given in the Bibliographic Notes at the end of this chapter.

8.7 Bagging

Earlier we introduced the bootstrap as a way of assessing the accuracy of a parameter estimate or a prediction. Here we show how to use the bootstrap to improve the estimate or prediction itself. In Section 8.4 we investigated the relationship between the bootstrap and Bayes approaches, and found that the bootstrap mean is approximately a posterior average. Bagging further exploits this connection.

Consider first the regression problem. Suppose we fit a model to our training data Z = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, obtaining the prediction f̂(x) at input x. Bootstrap aggregation or bagging averages this prediction over a collection of bootstrap samples, thereby reducing its variance. For each bootstrap sample Z*b, b = 1, 2, ..., B, we fit our model, giving prediction f̂*b(x). The bagging estimate is defined by

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*b(x).   (8.51)

Denote by P̂ the empirical distribution putting equal probability 1/N on each of the data points (x_i, y_i). In fact the "true" bagging estimate is defined by E_P̂ f̂*(x), where Z* = {(x_1*, y_1*), (x_2*, y_2*), ..., (x_N*, y_N*)} and each (x_i*, y_i*) ∼ P̂. Expression (8.51) is a Monte Carlo estimate of the true bagging estimate, approaching it as B → ∞.

The bagged estimate (8.51) will differ from the original estimate f̂(x) only when the latter is a nonlinear or adaptive function of the data.
For example, to bag the B-spline smooth of Section 8.2.1, we average the curves in the bottom left panel of Figure 8.2 at each value of x. The B-spline smoother is linear in the data if we fix the inputs; hence if we sample using the parametric bootstrap in equation (8.6), then f̂_bag(x) → f̂(x) as B → ∞ (Exercise 8.4). Hence bagging just reproduces the original smooth in the

top left panel of Figure 8.2. The same is approximately true if we were to bag using the nonparametric bootstrap.

A more interesting example is a regression tree, where f̂(x) denotes the tree's prediction at input vector x (regression trees are described in Chapter 9). Each bootstrap tree will typically involve different features than the original, and might have a different number of terminal nodes. The bagged estimate is the average prediction at x from these B trees.

Now suppose our tree produces a classifier Ĝ(x) for a K-class response. Here it is useful to consider an underlying indicator-vector function f̂(x), with value a single one and K − 1 zeroes, such that Ĝ(x) = arg max_k f̂(x). Then the bagged estimate f̂_bag(x) (8.51) is a K-vector [p_1(x), p_2(x), ..., p_K(x)], with p_k(x) equal to the proportion of trees predicting class k at x. The bagged classifier selects the class with the most "votes" from the B trees, Ĝ_bag(x) = arg max_k f̂_bag(x).

Often we require the class-probability estimates at x, rather than the classifications themselves. It is tempting to treat the voting proportions p_k(x) as estimates of these probabilities. A simple two-class example shows that they fail in this regard. Suppose the true probability of class 1 at x is 0.75, and each of the bagged classifiers accurately predicts a 1. Then p_1(x) = 1, which is incorrect. For many classifiers Ĝ(x), however, there is already an underlying function f̂(x) that estimates the class probabilities at x (for trees, the class proportions in the terminal node). An alternative bagging strategy is to average these instead, rather than the vote indicator vectors.
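The consensus-vote bagging just described can be sketched with a minimal base learner, a single axis-oriented split ("stump"), fit on bootstrap samples. This is an illustration only; the data-generating details in the usage below are made up.

```python
import numpy as np

def fit_stump(X, y):
    """Single axis-oriented split minimizing training misclassification
    over all features, thresholds and split directions. Returns a
    prediction function for 0/1 labels."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - s) > 0, 1, 0)
                err = (pred != y).mean()
                if best is None or err < best[0]:
                    best = (err, j, s, sign)
    _, j, s, sign = best
    return lambda Z: np.where(sign * (Z[:, j] - s) > 0, 1, 0)

def bag_predict(X, y, Xtest, B=50, rng=None):
    """Bagged classifier by consensus vote over B bootstrap stumps."""
    rng = np.random.default_rng(rng)
    votes = np.zeros(len(Xtest))
    for _ in range(B):
        idx = rng.integers(0, len(y), len(y))  # bootstrap sample Z*b
        votes += fit_stump(X[idx], y[idx])(Xtest)
    return (votes / B > 0.5).astype(int)       # majority vote
```

Averaging probability estimates instead of vote indicators would replace the 0/1 stump predictions with terminal-node class proportions, as the alternative strategy above suggests.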
Not only does this produce improved estimates of the class probabilities, but it also tends to produce bagged classifiers with lower variance, especially for small B (see Figure 8.10 in the next example).

8.7.1 Example: Trees with Simulated Data

We generated a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95. The response Y was generated according to Pr(Y = 1 | x_1 ≤ 0.5) = 0.2, Pr(Y = 1 | x_1 > 0.5) = 0.8. The Bayes error is 0.2. A test sample of size 2000 was also generated from the same population. We fit classification trees to the training sample and to each of 200 bootstrap samples (classification trees are described in Chapter 9). No pruning was used. Figure 8.9 shows the original tree and eleven bootstrap trees. Notice how the trees are all different, with different splitting features and cutpoints. The test error for the original tree and the bagged tree is shown in Figure 8.10. In this example the trees have high variance due to the correlation in the predictors. Bagging succeeds in smoothing out this variance and hence reducing the test error.

Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction. A simple argument shows why

FIGURE 8.9. Bagging trees on simulated dataset. The top left panel shows the original tree. Eleven trees grown on bootstrap samples are shown. For each tree, the top split is annotated.

FIGURE 8.10. Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and bagged trees as a function of the number of bootstrap samples. The orange points correspond to the consensus vote, while the green points average the probabilities.

bagging helps under squared-error loss: in short, because averaging reduces variance and leaves bias unchanged.

Assume our training observations (x_i, y_i), i = 1, ..., N are independently drawn from a distribution P, and consider the ideal aggregate estimator f_ag(x) = E_P f̂*(x). Here x is fixed and the bootstrap dataset Z* consists of observations (x_i*, y_i*), i = 1, 2, ..., N sampled from P. Note that f_ag(x) is a bagging estimate, drawing bootstrap samples from the actual population P rather than the data. It is not an estimate that we can use in practice, but is convenient for analysis. We can write

E_P [Y − f̂*(x)]² = E_P [Y − f_ag(x) + f_ag(x) − f̂*(x)]²
                 = E_P [Y − f_ag(x)]² + E_P [f̂*(x) − f_ag(x)]²
                 ≥ E_P [Y − f_ag(x)]².   (8.52)

The extra error on the right-hand side comes from the variance of f̂*(x) around its mean f_ag(x). Therefore true population aggregation never increases mean squared error. This suggests that bagging, drawing samples from the training data, will often decrease mean-squared error.

The above argument does not hold for classification under 0-1 loss, because of the nonadditivity of bias and variance. In that setting, bagging a

good classifier can make it better, but bagging a bad classifier can make it worse. Here is a simple example, using a randomized rule. Suppose Y = 1 for all x, and the classifier Ĝ(x) predicts Y = 1 (for all x) with probability 0.4 and predicts Y = 0 (for all x) with probability 0.6. Then the misclassification error of Ĝ(x) is 0.6, but that of the bagged classifier is 1.0.

For classification we can understand the bagging effect in terms of a consensus of independent "weak learners" (Dietterich, 2000a). Let the Bayes optimal decision at x be G(x) = 1 in a two-class example. Suppose each of the weak learners G*_b has an error-rate e_b = e < 0.5, and let S_1(x) = Σ_{b=1}^{B} I(G*_b(x) = 1) be the consensus vote for class 1. Since the weak learners are assumed to be independent, S_1(x) ∼ Bin(B, 1 − e), and Pr(S_1 > B/2) → 1 as B gets large. This concept has been popularized outside of statistics as the "Wisdom of Crowds" (Surowiecki, 2004): the collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting. Of course, the main caveat here is "independent," and bagged trees are not. Figure 8.11 illustrates the power of a consensus vote in a simulated example, where only 30% of the voters have some knowledge.

In Chapter 15 we see how random forests improve on bagging by reducing the correlation between the sampled trees.

Note that when we bag a model, any simple structure in the model is lost. As an example, a bagged tree is no longer a tree. For interpretation of the model this is clearly a drawback. More stable procedures like nearest neighbors are typically not affected much by bagging. Unfortunately, the unstable models most helped by bagging are unstable because of the emphasis on interpretability, and this is lost in the bagging process.

Figure 8.12 shows an example where bagging doesn't help. The 100 data points shown have two features and two classes, separated by the gray linear boundary x_1 + x_2 = 1.
We choose as our classifier Ĝ(x) a single axis-oriented split, choosing the split along either x_1 or x_2 that produces the largest decrease in training misclassification error. The decision boundary obtained from bagging the 0-1 decision rule over B = 50 bootstrap samples is shown by the blue curve in the left panel. It does a poor job of capturing the true boundary. The single split rule, derived from the training data, splits near 0 (the middle of the range of x_1 or x_2), and hence has little contribution away from the center. Averaging the probabilities rather than the classifications does not help here. Bagging estimates the expected class probabilities from the single split rule, that is, averaged over many replications. Note that the expected class probabilities computed by bagging cannot be realized on any single replication, in the same way that a woman cannot have 2.4 children. In this sense, bagging increases somewhat the space of models of the individual base classifier. However, it doesn't help in this and many other examples where a greater enlargement of the model class is needed. Boosting is a way of doing this,

FIGURE 8.11. Simulated academy awards voting. 50 members vote in 10 categories, each with 4 nominations. For any category, only 15 voters have some knowledge, represented by their probability of selecting the "correct" candidate in that category (so P = 0.25 means they have no knowledge). For each category, the 15 experts are chosen at random from the 50. Results show the expected correct (based on 50 simulations) for the consensus, as well as for the individuals. The error bars indicate one standard deviation. We see, for example, that if the 15 informed for a category have a 50% chance of selecting the correct candidate, the consensus doubles the expected performance of an individual.

FIGURE 8.12. Data with two features and two classes, separated by a linear boundary. (Left panel:) Decision boundary estimated from bagging the decision rule from a single split, axis-oriented classifier. (Right panel:) Decision boundary from boosting the decision rule of the same classifier. The test error rates are 0.166 and 0.065, respectively. Boosting is described in Chapter 10.

and is described in Chapter 10. The decision boundary in the right panel is the result of the boosting procedure, and it roughly captures the diagonal boundary.

8.8 Model Averaging and Stacking

In Section 8.4 we viewed bootstrap values of an estimator as approximate posterior values of a corresponding parameter, from a kind of nonparametric Bayesian analysis. Viewed in this way, the bagged estimate (8.51) is an approximate posterior Bayesian mean. In contrast, the training sample estimate f̂(x) corresponds to the mode of the posterior. Since the posterior mean (not mode) minimizes squared-error loss, it is not surprising that bagging can often reduce mean squared-error.

Here we discuss Bayesian model averaging more generally. We have a set of candidate models M_m, m = 1, ..., M for our training set Z. These models may be of the same type with different parameter values (e.g., subsets in linear regression), or different models for the same task (e.g., neural networks and regression trees). Suppose ζ is some quantity of interest, for example, a prediction f(x) at some fixed feature value x. The posterior distribution of ζ is

Pr(ζ | Z) = Σ_{m=1}^{M} Pr(ζ | M_m, Z) Pr(M_m | Z),   (8.53)

with posterior mean

    E(ζ | Z) = Σ_{m=1}^{M} E(ζ | M_m, Z) Pr(M_m | Z).    (8.54)

This Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model. This formulation leads to a number of different model-averaging strategies. Committee methods take a simple unweighted average of the predictions from each model, essentially giving equal probability to each model. More ambitiously, the development in Section 7.7 shows the BIC criterion can be used to estimate posterior model probabilities. This is applicable in cases where the different models arise from the same parametric model, with different parameter values. The BIC gives weight to each model depending on how well it fits and how many parameters it uses. One can also carry out the Bayesian recipe in full. If each model M_m has parameters θ_m, we write

    Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m)
                ∝ Pr(M_m) · ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m.    (8.55)

In principle one can specify priors Pr(θ_m | M_m) and numerically compute the posterior probabilities from (8.55), to be used as model-averaging weights. However, we have seen no real evidence that this is worth all of the effort, relative to the much simpler BIC approximation. How can we approach model averaging from a frequentist viewpoint? Given predictions f̂_1(x), f̂_2(x), ..., f̂_M(x), under squared-error loss, we can seek the weights w = (w_1, w_2, ..., w_M) such that

    ŵ = argmin_w E_P [Y − Σ_{m=1}^{M} w_m f̂_m(x)]².    (8.56)

Here the input value x is fixed and the N observations in the dataset Z (and the target Y) are distributed according to P. The solution is the population linear regression of Y on F̂(x)^T ≡ [f̂_1(x), f̂_2(x), ..., f̂_M(x)]:

    ŵ = E_P[F̂(x) F̂(x)^T]^{−1} E_P[F̂(x) Y].    (8.57)

Now the full regression has smaller error than any single model,

    E_P [Y − Σ_{m=1}^{M} ŵ_m f̂_m(x)]² ≤ E_P [Y − f̂_m(x)]²  for all m,    (8.58)

so combining models never makes things worse, at the population level.

Of course the population linear regression (8.57) is not available, and it is natural to replace it with the linear regression over the training set. But there are simple examples where this does not work well. For example, if f̂_m(x), m = 1, 2, ..., M represent the prediction from the best subset of inputs of size m among M total inputs, then linear regression would put all of the weight on the largest model, that is, ŵ_M = 1, ŵ_m = 0, m < M. The problem is that we have not put each of the models on the same footing by taking into account their complexity (the number of inputs m in this example). Stacked generalization, or stacking, is a way of doing this. Let f̂_m^{−i}(x) be the prediction at x, using model m, applied to the dataset with the ith training observation removed. The stacking estimate of the weights is obtained from the least squares linear regression of y_i on f̂_m^{−i}(x_i), m = 1, 2, ..., M. In detail the stacking weights are given by

    ŵ^st = argmin_w Σ_{i=1}^{N} [ y_i − Σ_{m=1}^{M} w_m f̂_m^{−i}(x_i) ]².    (8.59)

The final prediction is Σ_m ŵ_m^st f̂_m(x). By using the cross-validated predictions f̂_m^{−i}(x), stacking avoids giving unfairly high weight to models with higher complexity. Better results can be obtained by restricting the weights to be nonnegative, and to sum to 1. This seems like a reasonable restriction if we interpret the weights as posterior model probabilities as in equation (8.54), and it leads to a tractable quadratic programming problem. There is a close connection between stacking and model selection via leave-one-out cross-validation (Section 7.10). If we restrict the minimization in (8.59) to weight vectors w that have one unit weight and the rest zero, this leads to a model choice m̂ with smallest leave-one-out cross-validation error. Rather than choose a single model, stacking combines them with estimated optimal weights. This will often lead to better prediction, but less interpretability than the choice of only one of the M models. The stacking idea is actually more general than described above.
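A minimal sketch of the stacking fit (8.59): synthetic leave-one-out predictions stand in for real cross-validated model fits, and all data and noise levels are invented for illustration.

```python
import numpy as np

# Stacking, eq. (8.59): regress y on the matrix of leave-one-out
# predictions from M models, then combine new predictions with the
# fitted weights.  Synthetic data; no real models are fit here.
rng = np.random.default_rng(0)
N, M = 50, 3
y = rng.normal(size=N)
# F_loo[i, m] plays the role of f_m^{-i}(x_i): model m's prediction for
# observation i, fit without observation i.  Model 0 is least noisy.
F_loo = y[:, None] + rng.normal(scale=[0.2, 0.5, 1.0], size=(N, M))

# unconstrained stacking weights by least squares
w_st, *_ = np.linalg.lstsq(F_loo, y, rcond=None)

def stacked_predict(F_new):
    """Combine the M models' predictions at new points with weights w_st."""
    return F_new @ w_st

print(w_st)   # most weight should land on the least-noisy model
```

Restricting the weights to be nonnegative and sum to 1, as suggested in the text, would replace the `lstsq` call with a small quadratic program.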
One can use any learning method, not just linear regression, to combine the models as in (8.59); the weights could also depend on the input location x. In this way, learning methods are "stacked" on top of one another, to improve prediction performance.

8.9 Stochastic Search: Bumping

The final method described in this chapter does not involve averaging or combining models, but rather is a technique for finding a better single model. Bumping uses bootstrap sampling to move randomly through model space. For problems where a fitting method finds many local minima, bumping can help the method to avoid getting stuck in poor solutions.
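The bumping procedure just described (fit on bootstrap samples, keep the fit with smallest error on the original training data) can be sketched as follows; the "model" here is deliberately trivial, a fitted constant, standing in for the tree-growing algorithm used later in this section:

```python
import numpy as np

# Bumping sketch: fit the model on B bootstrap samples and keep the fit
# with the smallest training error on the ORIGINAL data.  Synthetic data.
rng = np.random.default_rng(1)
x = rng.uniform(size=100)
y = 2.0 + rng.normal(scale=0.5, size=100)

def fit(yb):
    # toy fitting method: predict the mean of the (bootstrap) responses
    c = yb.mean()
    return lambda x: np.full_like(x, c)

B = 20
fits = [fit(y)]                              # include the original sample by convention
for _ in range(B):
    idx = rng.integers(0, len(y), len(y))    # bootstrap sample Z*b
    fits.append(fit(y[idx]))

train_err = [np.mean((y - f(x)) ** 2) for f in fits]
best = fits[int(np.argmin(train_err))]       # model with lowest training error
```

With this toy fitter the original-sample fit wins (the sample mean minimizes training squared error), which is exactly the "free to pick the original model" convention; bumping pays off when the fitter has local minima, as in the XOR tree example below.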

FIGURE 8.13. Data with two features and two classes (blue and orange), displaying a pure interaction. The left panel (Regular 4-Node Tree) shows the partition found by three splits of a standard, greedy, tree-growing algorithm. The vertical grey line near the left edge is the first split, and the broken lines are the two subsequent splits. The algorithm has no idea where to make a good initial split, and makes a poor choice. The right panel (Bumped 4-Node Tree) shows the near-optimal splits found by bumping the tree-growing algorithm 20 times.

As in bagging, we draw bootstrap samples and fit a model to each. But rather than average the predictions, we choose the model estimated from a bootstrap sample that best fits the training data. In detail, we draw bootstrap samples Z*1, ..., Z*B and fit our model to each, giving predictions f̂*b(x), b = 1, 2, ..., B at input point x. We then choose the model that produces the smallest prediction error, averaged over the original training set. For squared error, for example, we choose the model obtained from bootstrap sample b̂, where

    b̂ = argmin_b Σ_{i=1}^{N} [y_i − f̂*b(x_i)]².    (8.60)

The corresponding model predictions are f̂*b̂(x). By convention we also include the original training sample in the set of bootstrap samples, so that the method is free to pick the original model if it has the lowest training error. By perturbing the data, bumping tries to move the fitting procedure around to good areas of model space. For example, if a few data points are causing the procedure to find a poor solution, any bootstrap sample that omits those data points should produce a better solution. For another example, consider the classification data in Figure 8.13, the notorious exclusive or (XOR) problem. There are two classes (blue and orange) and two input features, with the features exhibiting a pure interaction. By splitting the data at x1 = 0 and then splitting each resulting stratum at x2 = 0 (or vice versa), a tree-based classifier could achieve perfect discrimination. However, the greedy, short-sighted CART algorithm (Section 9.2) tries to find the best split on either feature, and then splits the resulting strata. Because of the balanced nature of the data, all initial splits on x1 or x2 appear to be useless, and the procedure essentially generates a random split at the top level. The actual split found for these data is shown in the left panel of Figure 8.13. By bootstrap sampling from the data, bumping breaks the balance in the classes, and with a reasonable number of bootstrap samples (here 20), it will by chance produce at least one tree with initial split near either x1 = 0 or x2 = 0. Using just 20 bootstrap samples, bumping found the near-optimal splits shown in the right panel of Figure 8.13. This shortcoming of the greedy tree-growing algorithm is exacerbated if we add a number of noise features that are independent of the class label. Then the tree-growing algorithm cannot distinguish x1 or x2 from the others, and gets seriously lost. Since bumping compares different models on the training data, one must ensure that the models have roughly the same complexity. In the case of trees, this would mean growing trees with the same number of terminal nodes on each bootstrap sample. Bumping can also help in problems where it is difficult to optimize the fitting criterion, perhaps because of a lack of smoothness. The trick is to optimize a different, more convenient criterion over the bootstrap samples, and then choose the model producing the best results for the desired criterion on the training sample.

Bibliographic Notes

There are many books on classical statistical inference: Cox and Hinkley (1974) and Silvey (1975) give nontechnical accounts. The bootstrap is due to Efron (1979) and is described more fully in Efron and Tibshirani (1993) and Hall (1992). A good modern book on Bayesian inference is Gelman et al. (1995).
A lucid account of the application of Bayesian methods to neural networks is given in Neal (1996). The statistical application of Gibbs sampling is due to Geman and Geman (1984), and Gelfand and Smith (1990), with related work by Tanner and Wong (1987). Markov chain Monte Carlo methods, including Gibbs sampling and the Metropolis–Hastings algorithm, are discussed in Spiegelhalter et al. (1996). The EM algorithm is due to Dempster et al. (1977); as the discussants in that paper make clear, there was much related, earlier work. The view of EM as a joint maximization scheme for a penalized complete-data log-likelihood was elucidated by Neal and Hinton (1998); they credit Csiszar and Tusnády (1984) and Hathaway (1986) as having noticed this connection earlier. Bagging was proposed by Breiman (1996a). Stacking is due to Wolpert (1992);

Breiman (1996b) contains an accessible discussion for statisticians. LeBlanc and Tibshirani (1996) describe variations on stacking based on the bootstrap. Model averaging in the Bayesian framework has been recently advocated by Madigan and Raftery (1994). Bumping was proposed by Tibshirani and Knight (1999).

Exercises

Ex. 8.1 Let r(y) and q(y) be probability density functions. Jensen's inequality states that for a random variable X and a convex function φ(x), E[φ(X)] ≥ φ[E(X)]. Use Jensen's inequality to show that

    E_q log[r(Y)/q(Y)]    (8.61)

is maximized as a function of r(y) when r(y) = q(y). Hence show that R(θ′, θ) ≤ R(θ, θ) as stated below equation (8.46).

Ex. 8.2 Consider the maximization of the log-likelihood (8.48), over distributions P̃(Z^m) such that P̃(Z^m) ≥ 0 and Σ_{Z^m} P̃(Z^m) = 1. Use Lagrange multipliers to show that the solution is the conditional distribution P̃(Z^m) = Pr(Z^m | Z, θ′), as in (8.49).

Ex. 8.3 Justify the estimate (8.50), using the relationship Pr(A) = ∫ Pr(A | B) dPr(B).

Ex. 8.4 Consider the bagging method of Section 8.7. Let our estimate f̂(x) be the B-spline smoother µ̂(x) of Section 8.2.1. Consider the parametric bootstrap of equation (8.6), applied to this estimator. Show that if we bag f̂(x), using the parametric bootstrap to generate the bootstrap samples, the bagging estimate f̂_bag(x) converges to the original estimate f̂(x) as B → ∞.

Ex. 8.5 Suggest generalizations of each of the loss functions in Figure 10.4 to more than two classes, and design an appropriate plot to compare them.

Ex. 8.6 Consider the bone mineral density data of Figure 5.6.

(a) Fit a cubic smooth spline to the relative change in spinal BMD, as a function of age. Use cross-validation to estimate the optimal amount of smoothing. Construct pointwise 90% confidence bands for the underlying function.

(b) Compute the posterior mean and covariance for the true function via (8.28), and compare the posterior bands to those obtained in (a).

(c) Compute 100 bootstrap replicates of the fitted curves, as in the bottom left panel of Figure 8.2. Compare the results to those obtained in (a) and (b).

Ex. 8.7 EM as a minorization algorithm (Hunter and Lange, 2004; Wu and Lange, 2007). A function g(x, y) is said to minorize a function f(x) if

    g(x, y) ≤ f(x),  g(x, x) = f(x)    (8.62)

for all x, y in the domain. This is useful for maximizing f(x), since it is easy to show that f(x) is non-decreasing under the update

    x^{s+1} = argmax_x g(x, x^s).    (8.63)

There are analogous definitions for majorization, for minimizing a function f(x). The resulting algorithms are known as MM algorithms, for "Minorize-Maximize" or "Majorize-Minimize". Show that the EM algorithm (Section 8.5.2) is an example of an MM algorithm, using Q(θ′, θ) + log Pr(Z | θ) − Q(θ, θ) to minorize the observed data log-likelihood ℓ(θ′; Z). (Note that only the first term involves the relevant parameter θ′.)

9 Additive Models, Trees, and Related Methods

In this chapter we begin our discussion of some specific methods for supervised learning. These techniques each assume a (different) structured form for the unknown regression function, and by doing so they finesse the curse of dimensionality. Of course, they pay the possible price of misspecifying the model, and so in each case there is a tradeoff that has to be made. They take off where Chapters 3–6 left off. We describe five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts.

9.1 Generalized Additive Models

Regression models play an important role in many data analyses, providing prediction and classification rules, and data analytic tools for understanding the importance of different inputs. Although attractively simple, the traditional linear model often fails in these situations: in real life, effects are often not linear. In earlier chapters we described techniques that used predefined basis functions to achieve nonlinearities. This section describes more automatic flexible statistical methods that may be used to identify and characterize nonlinear regression effects. These methods are called "generalized additive models". In the regression setting, a generalized additive model has the form

    E(Y | X_1, X_2, ..., X_p) = α + f_1(X_1) + f_2(X_2) + ··· + f_p(X_p).    (9.1)

of the functions f_j need to be nonlinear. We can easily mix in linear and other parametric forms with the nonlinear terms, a necessity when some of the inputs are qualitative variables (factors). The nonlinear terms are not restricted to main effects either; we can have nonlinear components in two or more variables, or separate curves in X_j for each level of the factor X_k. Thus each of the following would qualify:

g(µ) = X^T β + α_k + f(Z): a semiparametric model, where X is a vector of predictors to be modeled linearly, α_k the effect for the kth level of a qualitative input V, and the effect of predictor Z is modeled nonparametrically.

g(µ) = f(X) + g_k(Z): again k indexes the levels of a qualitative input V, and thus creates an interaction term g(V, Z) = g_k(Z) for the effect of V and Z.

g(µ) = f(X) + g(Z, W), where g is a nonparametric function in two features.

Additive models can replace linear models in a wide variety of settings, for example an additive decomposition of time series,

    Y_t = S_t + T_t + ε_t,    (9.5)

where S_t is a seasonal component, T_t is a trend and ε is an error term.

9.1.1 Fitting Additive Models

In this section we describe a modular algorithm for fitting additive models and their generalizations. The building block is the scatterplot smoother for fitting nonlinear effects in a flexible way. For concreteness we use as our scatterplot smoother the cubic smoothing spline described in Chapter 5. The additive model has the form

    Y = α + Σ_{j=1}^{p} f_j(X_j) + ε,    (9.6)

where the error term ε has mean zero. Given observations (x_i, y_i), a criterion like the penalized sum of squares (5.9) of Section 5.4 can be specified for this problem,

    PRSS(α, f_1, f_2, ..., f_p) = Σ_{i=1}^{N} ( y_i − α − Σ_{j=1}^{p} f_j(x_ij) )² + Σ_{j=1}^{p} λ_j ∫ f_j''(t_j)² dt_j,    (9.7)

where the λ_j ≥ 0 are tuning parameters. It can be shown that the minimizer of (9.7) is an additive cubic spline model; each of the functions f_j is a

Algorithm 9.1 The Backfitting Algorithm for Additive Models.

1. Initialize: α̂ = (1/N) Σ_{i=1}^{N} y_i, f̂_j ≡ 0, for all j.
2. Cycle: j = 1, 2, ..., p, ..., 1, 2, ..., p, ...,

       f̂_j ← S_j[ {y_i − α̂ − Σ_{k≠j} f̂_k(x_ik)}_{1}^{N} ],
       f̂_j ← f̂_j − (1/N) Σ_{i=1}^{N} f̂_j(x_ij),

   until the functions f̂_j change less than a prespecified threshold.

cubic spline in the component X_j, with knots at each of the unique values of x_ij, i = 1, ..., N. However, without further restrictions on the model, the solution is not unique. The constant α is not identifiable, since we can add or subtract any constants to each of the functions f_j, and adjust α accordingly. The standard convention is to assume that Σ_{i=1}^{N} f_j(x_ij) = 0 for all j, that is, the functions average zero over the data. It is easily seen that α̂ = ave(y_i) in this case. If in addition to this restriction, the matrix of input values (having ijth entry x_ij) has full column rank, then (9.7) is a strictly convex criterion and the minimizer is unique. If the matrix is singular, then the linear part of the components f_j cannot be uniquely determined (while the nonlinear parts can!) (Buja et al., 1989). Furthermore, a simple iterative procedure exists for finding the solution. We set α̂ = ave(y_i), and it never changes. We apply a cubic smoothing spline S_j to the targets {y_i − α̂ − Σ_{k≠j} f̂_k(x_ik)}_{1}^{N}, as a function of x_ij, to obtain a new estimate f̂_j. This is done for each predictor in turn, using the current estimates of the other functions f̂_k when computing y_i − α̂ − Σ_{k≠j} f̂_k(x_ik). The process is continued until the estimates f̂_j stabilize. This procedure, given in detail in Algorithm 9.1, is known as "backfitting" and the resulting fit is analogous to a multiple regression for linear models. In principle, the second step in (2) of Algorithm 9.1 is not needed, since the smoothing spline fit to a mean-zero response has mean zero (Exercise 9.1). In practice, machine rounding can cause slippage, and the adjustment is advised.
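Algorithm 9.1 can be sketched in a few lines. In this illustration a crude nearest-neighbour running-mean smoother stands in for the cubic smoothing spline S_j, the data are synthetic, and a fixed number of sweeps replaces the convergence test:

```python
import numpy as np

# Backfitting sketch (Algorithm 9.1) on synthetic data with p = 2 terms.
rng = np.random.default_rng(0)
N, p = 200, 2
X = rng.uniform(-1, 1, size=(N, p))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=N)

def smooth(x, r, k=15):
    # crude k-nearest-neighbour running mean of residuals r against x,
    # standing in for the smoothing operator S_j
    order = np.argsort(x)
    out = np.empty_like(r)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k), min(len(x), rank + k + 1)
        out[i] = r[order[lo:hi]].mean()
    return out

alpha = y.mean()                 # alpha-hat = ave(y_i); never changes
f = np.zeros((N, p))
for _ in range(10):              # fixed number of backfitting sweeps
    for j in range(p):
        partial = y - alpha - f.sum(axis=1) + f[:, j]  # partial residuals
        f[:, j] = smooth(X[:, j], partial)
        f[:, j] -= f[:, j].mean()   # step 2's recentring: each f_j averages zero

resid = y - alpha - f.sum(axis=1)
print(resid.var())               # should be much smaller than y.var()
```

The recentring line is exactly the second update in step (2); dropping it would let the smoother leak a constant into each f̂_j.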
This same algorithm can accommodate other fitting methods in exactly the same way, by specifying appropriate smoothing operators S_j: other univariate regression smoothers such as local polynomial regression and kernel methods;

linear regression operators yielding polynomial fits, piecewise constant fits, parametric spline fits, series and Fourier fits; and more complicated operators such as surface smoothers for second or higher-order interactions or periodic smoothers for seasonal effects. If we consider the operation of smoother S_j only at the training points, it can be represented by an N × N operator matrix S_j (see Section 5.4.1). Then the degrees of freedom for the jth term are (approximately) computed as df_j = trace[S_j] − 1, by analogy with degrees of freedom for smoothers discussed in Chapters 5 and 6. For a large class of linear smoothers S_j, backfitting is equivalent to a Gauss–Seidel algorithm for solving a certain linear system of equations. Details are given in Exercise 9.2. For the logistic regression model and other generalized additive models, the appropriate criterion is a penalized log-likelihood. To maximize it, the backfitting procedure is used in conjunction with a likelihood maximizer. The usual Newton–Raphson routine for maximizing log-likelihoods in generalized linear models can be recast as an IRLS (iteratively reweighted least squares) algorithm. This involves repeatedly fitting a weighted linear regression of a working response variable on the covariates; each regression yields a new value of the parameter estimates, which in turn give new working responses and weights, and the process is iterated (see Section 4.4.1). In the generalized additive model, the weighted linear regression is simply replaced by a weighted backfitting algorithm. We describe the algorithm in more detail for logistic regression below, and more generally in Chapter 6 of Hastie and Tibshirani (1990).

9.1.2 Example: Additive Logistic Regression

Probably the most widely used model in medical research is the logistic model for binary data. In this model the outcome Y can be coded as 0 or 1, with 1 indicating an event (like death or relapse of a disease) and 0 indicating no event. We wish to model Pr(Y = 1 | X), the probability of an event given values of the prognostic factors X^T = (X_1, ..., X_p).
The goal is usually to understand the roles of the prognostic factors, rather than to classify new individuals. Logistic models are also used in applications where one is interested in estimating the class probabilities, for use in risk screening. Apart from medical applications, credit risk screening is a popular application. The generalized additive logistic model has the form

    log[ Pr(Y = 1 | X) / Pr(Y = 0 | X) ] = α + f_1(X_1) + ··· + f_p(X_p).    (9.8)

The functions f_1, f_2, ..., f_p are estimated by a backfitting algorithm within a Newton–Raphson procedure, shown in Algorithm 9.2.

Algorithm 9.2 Local Scoring Algorithm for the Additive Logistic Regression Model.

1. Compute starting values: α̂ = log[ȳ/(1 − ȳ)], where ȳ = ave(y_i), the sample proportion of ones, and set f̂_j ≡ 0 for all j.
2. Define η̂_i = α̂ + Σ_j f̂_j(x_ij) and p̂_i = 1/[1 + exp(−η̂_i)]. Iterate:

   (a) Construct the working target variable
       z_i = η̂_i + (y_i − p̂_i) / [p̂_i(1 − p̂_i)].
   (b) Construct weights w_i = p̂_i(1 − p̂_i).
   (c) Fit an additive model to the targets z_i with weights w_i, using a weighted backfitting algorithm. This gives new estimates α̂, f̂_j, for all j.

3. Continue step 2. until the change in the functions falls below a prespecified threshold.

The additive model fitting in step (2c) of Algorithm 9.2 requires a weighted scatterplot smoother. Most smoothing procedures can accept observation weights; see Hastie and Tibshirani (1990) for further details. The additive logistic regression model can be generalized further to handle more than two classes, using the multilogit formulation as outlined in Section 4.4. While the formulation is a straightforward extension of (9.8), the algorithms for fitting such models are more complex. See Yee and Wild (1996) for details, and the VGAM software currently available from: yee.

Example: Predicting Email Spam

We apply a generalized additive model to the spam data introduced in Chapter 1. The data consists of information from 4601 email messages, in a study to screen email for spam (i.e., junk email). The data is publicly available at ftp.ics.uci.edu, and was donated by George Forman from Hewlett-Packard laboratories, Palo Alto, California. The response variable is binary, with values email or spam, and there are 57 predictors as described below:

48 quantitative predictors: the percentage of words in the email that match a given word. Examples include business, address, internet,
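The structure of Algorithm 9.2 (working response, weights, refit) can be sketched with a single predictor. To keep the sketch short, a weighted linear fit replaces the weighted backfitting of step (2c), so this reduces to ordinary IRLS for logistic regression on synthetic data:

```python
import numpy as np

# Local scoring sketch (Algorithm 9.2) with one predictor.  Step (2c)'s
# weighted backfitting is replaced here by a closed-form weighted
# linear fit, so the f_j are linear; the iteration itself is the point.
rng = np.random.default_rng(0)
N = 500
x = rng.normal(size=N)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)

ybar = y.mean()
alpha = np.log(ybar / (1 - ybar))       # step 1: starting value
beta = 0.0
for _ in range(10):
    eta = alpha + beta * x
    p = 1 / (1 + np.exp(-eta))
    z = eta + (y - p) / (p * (1 - p))   # step (2a): working target
    w = p * (1 - p)                     # step (2b): weights
    # step (2c), reduced to weighted least squares of z on x:
    xm = np.average(x, weights=w)
    zm = np.average(z, weights=w)
    beta = np.sum(w * (x - xm) * (z - zm)) / np.sum(w * (x - xm) ** 2)
    alpha = zm - beta * xm

print(alpha, beta)   # should land near the generating values (0.5, 1.5)
```

Swapping the weighted linear fit for a weighted smoother per coordinate recovers the full additive algorithm.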

TABLE 9.1. Test data confusion matrix for the additive logistic regression model fit to the spam training data. The overall test error rate is 5.5%.

                     Predicted Class
  True Class       email (0)    spam (1)
  email (0)          58.3%        2.5%
  spam (1)            3.0%       36.3%

free, and george. The idea was that these could be customized for individual users.

6 quantitative predictors: the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

The average length of uninterrupted sequences of capital letters: CAPAVE.

The length of the longest uninterrupted sequence of capital letters: CAPMAX.

The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

We coded spam as 1 and email as zero. A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set. A generalized additive model was fit, using a cubic smoothing spline with a nominal four degrees of freedom for each predictor. What this means is that for each predictor X_j, the smoothing-spline parameter λ_j was chosen so that trace[S_j(λ_j)] − 1 = 4, where S_j(λ) is the smoothing spline operator matrix constructed using the observed values x_ij, i = 1, ..., N. This is a convenient way of specifying the amount of smoothing in such a complex model. Most of the spam predictors have a very long-tailed distribution. Before fitting the GAM model, we log-transformed each variable (actually log(x + 0.1)), but the plots in Figure 9.1 are shown as a function of the original variables. The test error rates are shown in Table 9.1; the overall error rate is 5.3%. By comparison, a linear logistic regression has a test error rate of 7.6%. Table 9.2 shows the predictors that are highly significant in the additive model. For ease of interpretation, in Table 9.2 the contribution for each variable is decomposed into a linear component and the remaining nonlinear component. The top block of predictors are positively correlated with spam, while the bottom block is negatively correlated.
The linear component is a weighted least squares linear fit of the fitted curve on the predictor, while the nonlinear part is the residual. The linear component of an estimated

TABLE 9.2. Significant predictors from the additive model fit to the spam training data. The coefficients represent the linear part of f̂_j, along with their standard errors and Z-score. The nonlinear P-value is for a test of nonlinearity of f̂_j.

  Positive effects: our, over, remove, internet, free, business, hpl, ch!, ch$, CAPMAX, CAPTOT
  Negative effects: hp, george, 1999, re, edu

function is summarized by the coefficient, standard error and Z-score; the latter is the coefficient divided by its standard error, and is considered significant if it exceeds the appropriate quantile of a standard normal distribution. The column labeled nonlinear P-value is a test of nonlinearity of the estimated function. Note, however, that the effect of each predictor is fully adjusted for the entire effects of the other predictors, not just for their linear parts. The predictors shown in the table were judged significant by at least one of the tests (linear or nonlinear) at the p = 0.01 level (two-sided). Figure 9.1 shows the estimated functions for the significant predictors appearing in Table 9.2. Many of the nonlinear effects appear to account for a strong discontinuity at zero. For example, the probability of spam drops significantly as the frequency of george increases from zero, but then does not change much after that. This suggests that one might replace each of the frequency predictors by an indicator variable for a zero count, and resort to a linear logistic model. This gave a test error rate of 7.4%; including the linear effects of the frequencies as well dropped the test error to 6.6%. It appears that the nonlinearities in the additive model have an additional predictive power.

FIGURE 9.1. Spam analysis: estimated functions for significant predictors. The rug plot along the bottom of each frame indicates the observed values of the corresponding predictor. For many of the predictors the nonlinearity picks up the discontinuity at zero.

It is more serious to classify a genuine email message as spam, since then a good email would be filtered out and would not reach the user. We can alter the balance between the class error rates by changing the losses (see Section 2.4). If we assign a loss L01 for predicting a true class 0 as class 1, and L10 for predicting a true class 1 as class 0, then the estimated Bayes rule predicts class 1 if its probability is greater than L01/(L01 + L10). For example, if we take L01 = 10, L10 = 1 then the (true) class 0 and class 1 error rates change to 0.8% and 8.7%. More ambitiously, we can encourage the model to fit the data better in class 0 by using weights L01 for the class 0 observations and L10 for the class 1 observations. As above, we then use the estimated Bayes rule to predict. This gave error rates of 1.2% and 8.0% in (true) class 0 and class 1, respectively. We discuss the issue of unequal losses further below, in the context of tree-based models. After fitting an additive model, one should check whether the inclusion of some interactions can significantly improve the fit. This can be done "manually", by inserting products of some or all of the significant inputs, or automatically via the MARS procedure (Section 9.4). This example uses the additive model in an automatic fashion. As a data analysis tool, additive models are often used in a more interactive fashion, adding and dropping terms to determine their effect. By calibrating the amount of smoothing in terms of df_j, one can move seamlessly between linear models (df_j = 1) and partially linear models, where some terms are modeled more flexibly. See Hastie and Tibshirani (1990) for more details.

9.1.3 Summary

Additive models provide a useful extension of linear models, making them more flexible while still retaining much of their interpretability. The familiar tools for modeling and inference in linear models are also available for additive models, seen for example in Table 9.2.
The backfitting procedure for fitting these models is simple and modular, allowing one to choose a fitting method appropriate for each input variable. As a result they have become widely used in the statistical community. However, additive models can have limitations for large data-mining applications. The backfitting algorithm fits all predictors, which is not feasible or desirable when a large number are available. The BRUTO procedure (Hastie and Tibshirani, 1990, Chapter 9) combines backfitting with selection of inputs, but is not designed for large data-mining problems. There has also been recent work using lasso-type penalties to estimate sparse additive models, for example the COSSO procedure of Lin and Zhang (2006) and the SpAM proposal of Ravikumar et al. (2008). For large problems a forward stagewise approach such as boosting (Chapter 10) is more effective, and also allows for interactions to be included in the model.

9.2 Tree-Based Methods

9.2.1 Background

Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one. They are conceptually simple yet powerful. We first describe a popular method for tree-based regression and classification called CART, and later contrast it with C4.5, a major competitor. Let's consider a regression problem with continuous response Y and inputs X1 and X2, each taking values in the unit interval. The top left panel of Figure 9.2 shows a partition of the feature space by lines that are parallel to the coordinate axes. In each partition element we can model Y with a different constant. However, there is a problem: although each partitioning line has a simple description like X1 = c, some of the resulting regions are complicated to describe. To simplify matters, we restrict attention to recursive binary partitions like that in the top right panel of Figure 9.2. We first split the space into two regions, and model the response by the mean of Y in each region. We choose the variable and split-point to achieve the best fit. Then one or both of these regions are split into two more regions, and this process is continued, until some stopping rule is applied. For example, in the top right panel of Figure 9.2, we first split at X1 = t1. Then the region X1 ≤ t1 is split at X2 = t2 and the region X1 > t1 is split at X1 = t3. Finally, the region X1 > t3 is split at X2 = t4. The result of this process is a partition into the five regions R1, R2, ..., R5 shown in the figure. The corresponding regression model predicts Y with a constant c_m in region R_m, that is,

    f̂(X) = Σ_{m=1}^{5} c_m I{(X1, X2) ∈ R_m}.    (9.9)

This same model can be represented by the binary tree in the bottom left panel of Figure 9.2. The full dataset sits at the top of the tree. Observations satisfying the condition at each junction are assigned to the left branch, and the others to the right branch. The terminal nodes or leaves of the tree correspond to the regions R1, R2, ..., R5. The bottom right panel of Figure 9.2
is a perspective plot of the regression surface from this model. For illustration, we chose the node means c1 = −5, c2 = −7, c3 = 0, c4 = 2, c5 = 4 to make this plot. A key advantage of the recursive binary tree is its interpretability. The feature space partition is fully described by a single tree. With more than two inputs, partitions like that in the top right panel of Figure 9.2 are difficult to draw, but the binary tree representation works in the same way. This representation is also popular among medical scientists, perhaps because it mimics the way that a doctor thinks. The tree stratifies the

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.

population into strata of high and low outcome, on the basis of patient characteristics.

9.2.2 Regression Trees

We now turn to the question of how to grow a regression tree. Our data consists of p inputs and a response, for each of N observations: that is, (x_i, y_i) for i = 1, 2, ..., N, with x_i = (x_i1, x_i2, ..., x_ip). The algorithm needs to automatically decide on the splitting variables and split points, and also what topology (shape) the tree should have. Suppose first that we have a partition into M regions R1, R2, ..., RM, and we model the response as a constant c_m in each region:

    f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m).    (9.10)

If we adopt as our criterion minimization of the sum of squares Σ(y_i − f(x_i))², it is easy to see that the best ĉ_m is just the average of y_i in region R_m:

    ĉ_m = ave(y_i | x_i ∈ R_m).    (9.11)

Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

    R_1(j, s) = {X | X_j ≤ s}  and  R_2(j, s) = {X | X_j > s}.    (9.12)

Then we seek the splitting variable j and split point s that solve

    min_{j, s} [ min_{c_1} Σ_{x_i ∈ R_1(j,s)} (y_i − c_1)² + min_{c_2} Σ_{x_i ∈ R_2(j,s)} (y_i − c_2)² ].    (9.13)

For any choice j and s, the inner minimization is solved by

    ĉ_1 = ave(y_i | x_i ∈ R_1(j, s))  and  ĉ_2 = ave(y_i | x_i ∈ R_2(j, s)).    (9.14)

For each splitting variable, the determination of the split point s can be done very quickly and hence by scanning through all of the inputs, determination of the best pair (j, s) is feasible. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions. How large should we grow the tree? Clearly a very large tree might overfit the data, while a small tree might not capture the important structure.
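The greedy search of (9.13)–(9.14) is easy to sketch directly; this minimal version scans every variable and every distinct split point, and the toy data are invented for illustration:

```python
import numpy as np

# Greedy best-split search, eqs. (9.12)-(9.14): the inner minimisations
# are solved by the region means, so for each (j, s) we just compute the
# two within-region sums of squares and keep the best pair.
def best_split(X, y):
    best = (None, None, np.inf)               # (j, s, rss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:     # candidate split points
            left = X[:, j] <= s               # R1(j, s); complement is R2(j, s)
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

# toy data with a clean split on variable 0
X = np.array([[0.1, 9], [0.2, 1], [0.8, 5], [0.9, 2]])
y = np.array([1.0, 1.0, 5.0, 5.0])
j, s, rss = best_split(X, y)
print(j, s, rss)   # 0 0.2 0.0
```

Recursing `best_split` on the two resulting regions, with a stopping rule, grows the full tree; per-variable sorting tricks make the scan fast in practice, as the text notes.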

Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

The preferred strategy is to grow a large tree $T_0$, stopping the splitting process only when some minimum node size (say 5) is reached. Then this large tree is pruned using cost-complexity pruning, which we now describe.

We define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, that is, collapsing any number of its internal (non-terminal) nodes. We index terminal nodes by m, with node m representing region $R_m$. Let $|T|$ denote the number of terminal nodes in T. Letting

$$N_m = \#\{x_i \in R_m\},$$
$$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i,$$
$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2, \quad (9.15)$$

we define the cost complexity criterion

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|. \quad (9.16)$$

The idea is to find, for each α, the subtree $T_\alpha \subseteq T_0$ to minimize $C_\alpha(T)$. The tuning parameter $\alpha \ge 0$ governs the tradeoff between tree size and its goodness of fit to the data. Large values of α result in smaller trees $T_\alpha$, and conversely for smaller values of α. As the notation suggests, with α = 0 the solution is the full tree $T_0$. We discuss how to adaptively choose α below.

For each α one can show that there is a unique smallest subtree $T_\alpha$ that minimizes $C_\alpha(T)$. To find $T_\alpha$ we use weakest link pruning: we successively collapse the internal node that produces the smallest per-node increase in $\sum_m N_m Q_m(T)$, and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain $T_\alpha$. See Breiman et al. (1984) or Ripley (1996) for details. Estimation of α is achieved by five- or tenfold cross-validation: we choose the value $\hat\alpha$ to minimize the cross-validated sum of squares. Our final tree is $T_{\hat\alpha}$.
9.2.3 Classification Trees

If the target is a classification outcome taking values 1, 2, ..., K, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression we used the squared-error node

FIGURE 9.3. Node impurity measures for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).

impurity measure $Q_m(T)$ defined in (9.15), but this is not suitable for classification. In a node m, representing a region $R_m$ with $N_m$ observations, let

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),$$

the proportion of class k observations in node m. We classify the observations in node m to class $k(m) = \arg\max_k \hat{p}_{mk}$, the majority class in node m. Different measures $Q_m(T)$ of node impurity include the following:

Misclassification error: $\frac{1}{N_m} \sum_{i \in R_m} I(y_i \ne k(m)) = 1 - \hat{p}_{m\,k(m)}$.
Gini index: $\sum_{k \ne k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$.
Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$.   (9.17)

For two classes, if p is the proportion in the second class, these three measures are $1 - \max(p, 1-p)$, $2p(1-p)$ and $-p \log p - (1-p)\log(1-p)$, respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. Comparing (9.13) and (9.15), we see that we need to weight the node impurity measures by the numbers $N_{mL}$ and $N_{mR}$ of observations in the two child nodes created by splitting node m.

In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For example, in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while

the other created nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second split produces a pure node and is probably preferable. Both the Gini index and cross-entropy are lower for the second split. For this reason, either the Gini index or cross-entropy should be used when growing the tree. To guide cost-complexity pruning, any of the three measures can be used, but typically it is the misclassification rate.

The Gini index can be interpreted in two interesting ways. Rather than classify observations to the majority class in the node, we could classify them to class k with probability $\hat{p}_{mk}$. Then the training error rate of this rule in the node is $\sum_{k \ne k'} \hat{p}_{mk}\hat{p}_{mk'}$, the Gini index. Similarly, if we code each observation as 1 for class k and zero otherwise, the variance over the node of this 0-1 response is $\hat{p}_{mk}(1 - \hat{p}_{mk})$. Summing over classes k again gives the Gini index.

9.2.4 Other Issues

Categorical Predictors

When splitting a predictor having q possible unordered values, there are $2^{q-1} - 1$ possible partitions of the q values into two groups, and the computations become prohibitive for large q. However, with a 0-1 outcome, this computation simplifies. We order the predictor classes according to the proportion falling in outcome class 1. Then we split this predictor as if it were an ordered predictor. One can show this gives the optimal split, in terms of cross-entropy or Gini index, among all possible $2^{q-1} - 1$ splits. This result also holds for a quantitative outcome and squared-error loss: the categories are ordered by increasing mean of the outcome. Although intuitive, the proofs of these assertions are not trivial. The proof for binary outcomes is given in Breiman et al. (1984) and Ripley (1996); the proof for quantitative outcomes can be found in Fisher (1958). For multicategory outcomes, no such simplifications are possible, although various approximations have been proposed (Loh and Vanichsetakul, 1988).
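The (400, 400) example above is easy to check numerically: both candidate splits have weighted misclassification error 0.25, while the Gini index and cross-entropy are lower for the split that produces a pure node. A small sketch, with `impurities` and `weighted` as hypothetical helper names:

```python
import numpy as np

def impurities(counts):
    """Misclassification error, Gini index, and cross-entropy, as in (9.17),
    for a single node, given per-class observation counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    misclass = 1.0 - p.max()
    gini = (p * (1.0 - p)).sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    return misclass, gini, entropy

def weighted(measure_idx, children):
    """N-weighted average impurity over child nodes, as used to score a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * impurities(c)[measure_idx] for c in children)

# Two candidate splits of a (400, 400) parent node:
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]   # second child is pure
```

Here `weighted(0, split_a)` and `weighted(0, split_b)` both come out to 0.25 (misclassification error cannot distinguish the splits), while the Gini index (index 1) and cross-entropy (index 2) are both strictly lower for `split_b`.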
The partitioning algorithm tends to favor categorical predictors with many levels q; the number of partitions grows exponentially in q, and the more choices we have, the more likely we can find a good one for the data at hand. This can lead to severe overfitting if q is large, and such variables should be avoided.

The Loss Matrix

In classification problems, the consequences of misclassifying observations are more serious in some classes than others. For example, it is probably worse to predict that a person will not have a heart attack when he/she actually will, than vice versa. To account for this, we define a K × K loss matrix L, with $L_{kk'}$ being the loss incurred for classifying a class k observation as class k'. Typically no loss is incurred for correct classifications,

that is, $L_{kk} = 0\ \forall k$. To incorporate the losses into the modeling process, we could modify the Gini index to $\sum_{k \ne k'} L_{kk'} \hat{p}_{mk}\hat{p}_{mk'}$; this would be the expected loss incurred by the randomized rule. This works for the multiclass case, but in the two-class case has no effect, since the coefficient of $\hat{p}_{mk}\hat{p}_{mk'}$ is $L_{kk'} + L_{k'k}$. For two classes a better approach is to weight the observations in class k by $L_{kk'}$. This can be used in the multiclass case only if, as a function of k, $L_{kk'}$ doesn't depend on k'. Observation weighting can be used with the deviance as well. The effect of observation weighting is to alter the prior probability on the classes. In a terminal node, the empirical Bayes rule implies that we classify to class $k(m) = \arg\min_k \sum_l L_{lk} \hat{p}_{ml}$.

Missing Predictor Values

Suppose our data has some missing predictor values in some or all of the variables. We might discard any observation with some missing values, but this could lead to serious depletion of the training set. Alternatively we might try to fill in (impute) the missing values, with say the mean of that predictor over the nonmissing observations. For tree-based models, there are two better approaches. The first is applicable to categorical predictors: we simply make a new category for "missing." From this we might discover that observations with missing values for some measurement behave differently than those with nonmissing values. The second, more general approach is the construction of surrogate variables. When considering a predictor for a split, we use only the observations for which that predictor is not missing. Having chosen the best (primary) predictor and split point, we form a list of surrogate predictors and split points. The first surrogate is the predictor and corresponding split point that best mimics the split of the training data achieved by the primary split. The second surrogate is the predictor and corresponding split point that does second best, and so on.
When sending observations down the tree, either in the training phase or during prediction, we use the surrogate splits in order, if the primary splitting predictor is missing. Surrogate splits exploit correlations between predictors to try and alleviate the effect of missing data. The higher the correlation between the missing predictor and the other predictors, the smaller the loss of information due to the missing value. The general problem of missing data is discussed in Section 9.6.

Why Binary Splits?

Rather than splitting each node into just two groups at each stage (as above), we might consider multiway splits into more than two groups. While this can sometimes be useful, it is not a good general strategy. The problem is that multiway splits fragment the data too quickly, leaving insufficient data at the next level down. Hence we would want to use such splits only when needed. Since multiway splits can be achieved by a series of binary splits, the latter are preferred.
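Returning to the categorical-predictor simplification described earlier (order the q levels by the proportion in outcome class 1, then scan the q - 1 contiguous splits instead of all $2^{q-1} - 1$ subsets), the trick can be sketched as follows; `best_categorical_split` is a hypothetical helper using the Gini index:

```python
from collections import defaultdict

def best_categorical_split(levels, y):
    """For a binary (0/1) outcome, find a two-group split of a categorical
    predictor by ordering levels by their class-1 proportion and scanning
    the q-1 contiguous splits. Returns (left_levels, weighted_gini)."""
    groups = defaultdict(list)
    for lv, yi in zip(levels, y):
        groups[lv].append(yi)
    # order levels by the proportion of observations in outcome class 1
    ordered = sorted(groups, key=lambda lv: sum(groups[lv]) / len(groups[lv]))
    n = len(levels)

    def gini(ys):
        if not ys:
            return 0.0
        p = sum(ys) / len(ys)
        return 2.0 * p * (1.0 - p)

    best_set, best_score = None, float("inf")
    for i in range(1, len(ordered)):
        left = [yi for lv in ordered[:i] for yi in groups[lv]]
        right = [yi for lv in ordered[i:] for yi in groups[lv]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_set, best_score = set(ordered[:i]), score
    return best_set, best_score
```

The result of this q - 1 scan matches the optimum over all subsets for binary outcomes, per the Breiman et al. (1984) result cited above.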

Other Tree-Building Procedures

The discussion above focuses on the CART (classification and regression tree) implementation of trees. The other popular methodology is ID3 and its later versions, C4.5 and C5.0 (Quinlan, 1993). Early versions of the program were limited to categorical predictors, and used a top-down rule with no pruning. With more recent developments, C5.0 has become quite similar to CART. The most significant feature unique to C5.0 is a scheme for deriving rule sets. After a tree is grown, the splitting rules that define the terminal nodes can sometimes be simplified: that is, one or more conditions can be dropped without changing the subset of observations that fall in the node. We end up with a simplified set of rules defining each terminal node; these no longer follow a tree structure, but their simplicity might make them more attractive to the user.

Linear Combination Splits

Rather than restricting splits to be of the form $X_j \le s$, one can allow splits along linear combinations of the form $\sum_j a_j X_j \le s$. The weights $a_j$ and split point s are optimized to minimize the relevant criterion (such as the Gini index). While this can improve the predictive power of the tree, it can hurt interpretability. Computationally, the discreteness of the split point search precludes the use of a smooth optimization for the weights. A better way to incorporate linear combination splits is in the hierarchical mixtures of experts (HME) model, the topic of Section 9.5.

Instability of Trees

One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it. One can alleviate this to some degree by trying to use a more stable split criterion, but the inherent instability is not removed. It is the price to be paid for estimating a simple, tree-based structure from the data.
Bagging (Section 8.7) averages many trees to reduce this variance.

Lack of Smoothness

Another limitation of trees is the lack of smoothness of the prediction surface, as can be seen in the bottom right panel of Figure 9.2. In classification with 0/1 loss, this doesn't hurt much, since bias in estimation of the class probabilities has a limited effect. However, this can degrade performance in the regression setting, where we would normally expect the underlying function to be smooth. The MARS procedure, described in Section 9.4, can be viewed as a modification of CART designed to alleviate this lack of smoothness.

FIGURE 9.4. Results for spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the "one-standard-error" rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α, shown above. The tree sizes shown below refer to $|T_\alpha|$, the size of the original tree indexed by α.

However, if in addition the phrase hp occurs frequently, then this is likely to be company business and we classify as email. All of the cases in the test set satisfying these criteria were correctly classified. If the second condition is not met, and in addition the average length of repeated capital letters CAPAVE is larger than 2.9, then we classify as spam. Of the test cases satisfying these criteria, only seven were misclassified.

In medical classification problems, the terms sensitivity and specificity are used to characterize a rule. They are defined as follows:

Sensitivity: probability of predicting disease given true state is disease.
Specificity: probability of predicting non-disease given true state is non-disease.
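The two definitions translate directly into code; the confusion-matrix counts below are made-up numbers for illustration, not the spam results:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity: fraction of true-disease cases predicted as disease.
    Specificity: fraction of true-non-disease cases predicted as non-disease."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts: 90 of 100 diseased and 80 of 100 healthy correctly predicted.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
```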

FIGURE 9.5. The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.

FIGURE 9.6. ROC curves for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend represent the area under the curve: tree (0.95), GAM (0.98), weighted tree (0.90).

If we think of spam and email as the presence and absence of disease, respectively, then from Table 9.3 we have

Sensitivity = 86.3%, Specificity = 93.4%.

In this analysis we have used equal losses. As before let $L_{kk'}$ be the loss associated with predicting a class k object as class k'. By varying the relative sizes of the losses $L_{01}$ and $L_{10}$, we increase the sensitivity and decrease the specificity of the rule, or vice versa. In this example, we want to avoid marking good email as spam, and thus we want the specificity to be very high. We can achieve this by setting $L_{01} > 1$, say, with $L_{10} = 1$. The Bayes rule in each terminal node classifies to class 1 (spam) if the proportion of spam is $\ge L_{01}/(L_{01} + L_{10})$, and class zero otherwise. The

receiver operating characteristic (ROC) curve is a commonly used summary for assessing the tradeoff between sensitivity and specificity. It is a plot of the sensitivity versus specificity as we vary the parameters of a classification rule. Varying the loss $L_{01}$ between 0.1 and 10, and applying the Bayes rule to the 17-node tree selected in Figure 9.4, produced the ROC curve shown in Figure 9.6. The standard error of each curve near 0.9 is approximately $\sqrt{0.9(1-0.9)/1536} = 0.008$, and hence the standard error of the difference is about 0.01. We see that in order to achieve a specificity of close to 100%, the sensitivity has to drop to about 50%. The area under the curve is a commonly used quantitative summary; extending the curve linearly in each direction so that it is defined over [0, 100], the area is approximately 0.95. For comparison, we have included the ROC curve for the GAM model fit to these data in Section 9.2; it gives a better classification rule for any loss, with an area of 0.98.

Rather than just modifying the Bayes rule in the nodes, it is better to take full account of the unequal losses in growing the tree, as was done in Section 9.2. With just two classes 0 and 1, losses may be incorporated into the tree-growing process by using weight $L_{k,1-k}$ for an observation in class k. Here we chose $L_{01} = 5$, $L_{10} = 1$ and fit the same size tree as before ($|T_\alpha| = 17$). This tree has higher sensitivity at high values of the specificity than the original tree, but does more poorly at the other extreme. Its top few splits are the same as the original tree, and then it departs from it. For this application the tree grown using $L_{01} = 5$ is clearly better than the original tree.

The area under the ROC curve, used above, is sometimes called the c-statistic. Interestingly, it can be shown that the area under the ROC curve is equivalent to the Mann-Whitney U statistic (or Wilcoxon rank-sum test), for the median difference between the prediction scores in the two groups (Hanley and McNeil, 1982).
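The equivalence noted above can be seen directly: the c-statistic is the fraction of (positive, negative) pairs in which the positive observation receives the higher prediction score, with ties counting one half. A sketch (an illustrative implementation, with `auc_mann_whitney` a hypothetical name):

```python
import numpy as np

def auc_mann_whitney(scores, labels):
    """Area under the ROC curve computed as the (normalized) Mann-Whitney
    U statistic: the proportion of (positive, negative) score pairs where
    the positive observation scores higher; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # pairwise comparison of all positive/negative score pairs
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfectly separating score gives area 1.0; one misordered pair out of four gives 0.75, and so on.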
For evaluating the contribution of an additional predictor when added to a standard model, the c-statistic may not be an informative measure. The new predictor can be very significant in terms of the change in model deviance, but show only a small increase in the c-statistic. For example, removal of the highly significant term george from the model of Table 9.2 results in a decrease in the c-statistic of less than 0.01. Instead, it is useful to examine how the additional predictor changes the classification on an individual sample basis. A good discussion of this point appears in Cook (2007).

9.3 PRIM: Bump Hunting

Tree-based methods (for regression) partition the feature space into box-shaped regions, to try to make the response averages in each box as different as possible. The splitting rules defining the boxes are related to each other through a binary tree, facilitating their interpretation.

The patient rule induction method (PRIM) also finds boxes in the feature space, but seeks boxes in which the response average is high. Hence it looks for maxima in the target function, an exercise known as bump hunting. (If minima rather than maxima are desired, one simply works with the negative response values.)

PRIM also differs from tree-based partitioning methods in that the box definitions are not described by a binary tree. This makes interpretation of the collection of rules more difficult; however, by removing the binary tree constraint, the individual rules are often simpler.

The main box construction method in PRIM works from the top down, starting with a box containing all of the data. The box is compressed along one face by a small amount, and the observations then falling outside the box are peeled off. The face chosen for compression is the one resulting in the largest box mean, after the compression is performed. Then the process is repeated, stopping when the current box contains some minimum number of data points.

This process is illustrated in Figure 9.7. There are 200 data points uniformly distributed over the unit square. The color-coded plot indicates the response Y taking the value 1 (red) when $0.5 < X_1 < 0.8$ and $0.4 < X_2 < 0.6$, and zero (blue) otherwise. The panels show the successive boxes found by the top-down peeling procedure, peeling off a proportion α = 0.1 of the remaining data points at each stage.

Figure 9.8 shows the mean of the response values in the box, as the box is compressed. After the top-down sequence is computed, PRIM reverses the process, expanding along any edge, if such an expansion increases the box mean. This is called pasting. Since the top-down procedure is greedy at each step, such an expansion is often possible. The result of these steps is a sequence of boxes, with different numbers of observations in each box.
Cross-validation, combined with the judgment of the data analyst, is used to choose the optimal box size. Denote by $B_1$ the indices of the observations in the box found in step 1. The PRIM procedure then removes the observations in $B_1$ from the training set, and the two-step process (top-down peeling, followed by bottom-up pasting) is repeated on the remaining dataset. This entire process is repeated several times, producing a sequence of boxes $B_1, B_2, \ldots, B_k$. Each box is defined by a set of rules involving a subset of predictors, like $(a_1 \le X_1 \le b_1)$ and $(b_1 \le X_3 \le b_2)$. A summary of the PRIM procedure is given in Algorithm 9.3.

PRIM can handle a categorical predictor by considering all partitions of the predictor, as in CART. Missing values are also handled in a manner similar to CART. PRIM is designed for regression (quantitative response

FIGURE 9.7. Illustration of PRIM algorithm. There are two classes, indicated by the blue (class 0) and red (class 1) points. The procedure starts with a rectangle (broken black lines) surrounding all of the data, and then peels away points along one edge by a prespecified amount in order to maximize the mean of the points remaining in the box. Starting at the top left panel, the sequence of peelings is shown, until a pure red region is isolated in the bottom right panel. The iteration number is indicated at the top of each panel.

FIGURE 9.8. Box mean as a function of number of observations in the box.
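The top-down peeling illustrated in Figures 9.7 and 9.8 can be sketched as follows. This is an illustrative toy implementation only (peeling, with no bottom-up pasting and no cross-validation), and `prim_peel` is a hypothetical name:

```python
import numpy as np

def prim_peel(X, y, alpha=0.10, min_obs=10):
    """Top-down peeling: repeatedly trim roughly the fraction alpha of
    observations from whichever face of the box raises the mean response
    of the remaining box the most. Returns a list of (box, mean, n)
    snapshots, one per peel."""
    N, p = X.shape
    box = np.array([[-np.inf, np.inf]] * p)   # per-variable [low, high] bounds
    inside = np.ones(N, dtype=bool)
    trail = []
    while inside.sum() > min_obs:
        best = None   # best candidate: (mean, j, side, cut, keep-mask)
        k = max(1, int(np.ceil(alpha * inside.sum())))
        for j in range(p):
            xj = np.sort(X[inside, j])
            for side, cut in (("low", xj[k - 1]), ("high", xj[-k])):
                keep = inside & ((X[:, j] > cut) if side == "low" else (X[:, j] < cut))
                if keep.sum() == 0:
                    continue
                m = y[keep].mean()
                if best is None or m > best[0]:
                    best = (m, j, side, cut, keep)
        if best is None:
            break   # no admissible peel remains
        m, j, side, cut, keep = best
        box[j, 0 if side == "low" else 1] = cut
        inside = keep
        trail.append((box.copy(), m, int(inside.sum())))
    return trail
```

On a one-dimensional example where the high-response region is an interval, the procedure peels away the low-response end until the box is pure, mirroring the behavior in Figure 9.7.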

Algorithm 9.3 Patient Rule Induction Method.

1. Start with all of the training data, and a maximal box containing all of the data.

2. Consider shrinking the box by compressing one face, so as to peel off the proportion α of observations having either the highest values of a predictor $X_j$, or the lowest. Choose the peeling that produces the highest response mean in the remaining box. (Typically α = 0.05 or 0.10.)

3. Repeat step 2 until some minimal number of observations (say 10) remain in the box.

4. Expand the box along any face, as long as the resulting box mean increases.

5. Steps 1-4 give a sequence of boxes, with different numbers of observations in each box. Use cross-validation to choose a member of the sequence. Call the box $B_1$.

6. Remove the data in box $B_1$ from the dataset and repeat steps 2-5 to obtain a second box, and continue to get as many boxes as desired.

variable); a two-class outcome can be handled simply by coding it as 0 and 1. There is no simple way to deal with k > 2 classes simultaneously: one approach is to run PRIM separately for each class versus a baseline class.

An advantage of PRIM over CART is its patience. Because of its binary splits, CART fragments the data quite quickly. Assuming splits of equal size, with N observations it can only make $\log_2(N)$ splits before running out of data. If PRIM peels off a proportion α of training points at each stage, it can perform approximately $\log(N)/\log(1-\alpha)$ peeling steps before running out of data. For example, if N = 128 and α = 0.10, then $\log_2(N) = 7$ while $\log(N)/\log(1-\alpha) \approx 46$. Taking into account that there must be an integer number of observations at each stage, PRIM in fact can peel only 29 times. In any case, the ability of PRIM to be more patient should help the top-down greedy algorithm find a better solution.

9.3.1 Spam Example (Continued)

We applied PRIM to the spam data, with the response coded as 1 for spam and 0 for email. The first two boxes found by PRIM are summarized below:

Rule 1:  ch! > 0.029  and  CAPAVE > 2.331  and  your > 0.705  and  1999 < 0.040  and  CAPTOT > 79.50  and  edu < 0.070  and  re < 0.535  and  ch; < 0.030.

Rule 2 (fit to the observations remaining after box 1 is removed):  remove > 0.010  and  george < 0.110.

The box support is the proportion of observations falling in the box. The first box is purely spam, and contains about 15% of the test data. The second box contains 10.6% of the test observations, 92.6% of which are spam. Together the two boxes contain 26% of the data and are about 97% spam. The next few boxes (not shown) are quite small, containing only about 3% of the data.

The predictors are listed in order of importance. Interestingly the top splitting variables in the CART tree (Figure 9.5) do not appear in PRIM's first box.

9.4 MARS: Multivariate Adaptive Regression Splines

MARS is an adaptive procedure for regression, and is well suited for high-dimensional problems (i.e., a large number of inputs). It can be viewed as a generalization of stepwise linear regression or a modification of the CART method to improve the latter's performance in the regression setting. We introduce MARS from the first point of view, and later make the connection to CART.

MARS uses expansions in piecewise linear basis functions of the form $(x - t)_+$ and $(t - x)_+$. The "+" means positive part, so

$$(x - t)_+ = \begin{cases} x - t, & \text{if } x > t, \\ 0, & \text{otherwise}, \end{cases} \qquad (t - x)_+ = \begin{cases} t - x, & \text{if } x < t, \\ 0, & \text{otherwise}. \end{cases}$$
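The reflected pairs are one line of code each, and the full collection of candidate basis functions evaluates to a simple design matrix. A sketch, with `mars_basis` a hypothetical name; for clarity it does not deduplicate repeated knot values:

```python
import numpy as np

def mars_basis(X):
    """Evaluate, for every input X_j and every observed value t = x_ij,
    the reflected pair (X_j - t)_+ and (t - X_j)_+.
    Returns an (N, 2*N*p) array of basis-function values."""
    N, p = X.shape
    cols = []
    for j in range(p):
        for t in X[:, j]:
            cols.append(np.maximum(X[:, j] - t, 0.0))   # (x - t)_+
            cols.append(np.maximum(t - X[:, j], 0.0))   # (t - x)_+
    return np.column_stack(cols)
```

With N observations on p distinct-valued inputs this yields the 2Np candidate functions discussed below; MARS never fits them all at once, but selects products of them in a forward stepwise fashion.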

FIGURE 9.9. The basis functions $(x - t)_+$ (solid orange) and $(t - x)_+$ (broken blue) used by MARS.

As an example, the functions $(x - 0.5)_+$ and $(0.5 - x)_+$ are shown in Figure 9.9. Each function is piecewise linear, with a knot at the value t. In the terminology of Chapter 5, these are linear splines. We call the two functions a reflected pair in the discussion below.

The idea is to form reflected pairs for each input $X_j$ with knots at each observed value $x_{ij}$ of that input. Therefore, the collection of basis functions is

$$\mathcal{C} = \{(X_j - t)_+, (t - X_j)_+\}, \quad t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}, \quad j = 1, 2, \ldots, p. \quad (9.18)$$

If all of the input values are distinct, there are 2Np basis functions altogether. Note that although each basis function depends only on a single $X_j$, for example, $h(X) = (X_j - t)_+$, it is considered as a function over the entire input space $\mathbb{R}^p$.

The model-building strategy is like a forward stepwise linear regression, but instead of using the original inputs, we are allowed to use functions from the set $\mathcal{C}$ and their products. Thus the model has the form

$$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X), \quad (9.19)$$

where each $h_m(X)$ is a function in $\mathcal{C}$, or a product of two or more such functions.

Given a choice for the $h_m$, the coefficients $\beta_m$ are estimated by minimizing the residual sum-of-squares, that is, by standard linear regression. The real art, however, is in the construction of the functions $h_m(x)$. We start with only the constant function $h_0(X) = 1$ in our model, and all functions in the set $\mathcal{C}$ are candidate functions. This is depicted in Figure 9.10. At each stage we consider as a new basis function pair all products of a function $h_m$ in the model set $\mathcal{M}$ with one of the reflected pairs in $\mathcal{C}$. We add to the model $\mathcal{M}$ the term of the form

$$\hat\beta_{M+1}\, h_\ell(X) \cdot (X_j - t)_+ + \hat\beta_{M+2}\, h_\ell(X) \cdot (t - X_j)_+, \quad h_\ell \in \mathcal{M},$$

FIGURE 9.10. Schematic of the MARS forward model-building procedure. On the left are the basis functions currently in the model: initially, this is the constant function h(X) = 1. On the right are all candidate basis functions to be considered in building the model. These are pairs of piecewise linear basis functions as in Figure 9.9, with knots t at all unique observed values $x_{ij}$ of each predictor $X_j$. At each stage we consider all products of a candidate pair with a basis function in the model. The product that decreases the residual error the most is added into the current model. Above we illustrate the first three steps of the procedure, with the selected functions shown in red.

FIGURE 9.11. The function $h(X_1, X_2) = (X_1 - x_{51})_+ \cdot (x_{72} - X_2)_+$, resulting from multiplication of two piecewise linear MARS basis functions.

that produces the largest decrease in training error. Here $\hat\beta_{M+1}$ and $\hat\beta_{M+2}$ are coefficients estimated by least squares, along with all the other M + 1 coefficients in the model. Then the winning products are added to the model and the process is continued until the model set $\mathcal{M}$ contains some preset maximum number of terms.

For example, at the first stage we consider adding to the model a function of the form $\beta_1 (X_j - t)_+ + \beta_2 (t - X_j)_+$, $t \in \{x_{ij}\}$, since multiplication by the constant function just produces the function itself. Suppose the best choice is $\hat\beta_1 (X_2 - x_{72})_+ + \hat\beta_2 (x_{72} - X_2)_+$. Then this pair of basis functions is added to the set $\mathcal{M}$, and at the next stage we consider including a pair of products of the form

$$h_m(X) \cdot (X_j - t)_+ \quad \text{and} \quad h_m(X) \cdot (t - X_j)_+, \quad t \in \{x_{ij}\},$$

where for $h_m$ we have the choices $h_0(X) = 1$, $h_1(X) = (X_2 - x_{72})_+$, or $h_2(X) = (x_{72} - X_2)_+$. The third choice produces functions such as $(X_1 - x_{51})_+ (x_{72} - X_2)_+$, depicted in Figure 9.11.

At the end of this process we have a large model of the form (9.19). This model typically overfits the data, and so a backward deletion procedure is applied. The term whose removal causes the smallest increase in residual squared error is deleted from the model at each stage, producing an estimated best model $\hat{f}_\lambda$ of each size (number of terms) λ. One could use cross-validation to estimate the optimal value of λ, but for computational

savings the MARS procedure instead uses generalized cross-validation. This criterion is defined as

$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} (y_i - \hat{f}_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}. \quad (9.20)$$

The value M(λ) is the effective number of parameters in the model: this accounts both for the number of terms in the model, plus the number of parameters used in selecting the optimal positions of the knots. Some mathematical and simulation results suggest that one should pay a price of three parameters for selecting a knot in a piecewise linear regression. Thus if there are r linearly independent basis functions in the model, and K knots were selected in the forward process, the formula is M(λ) = r + cK, where c = 3. (When the model is restricted to be additive (details below), a penalty of c = 2 is used.) Using this, we choose the model along the backward sequence that minimizes GCV(λ).

Why these piecewise linear basis functions, and why this particular model strategy? A key property of the functions of Figure 9.9 is their ability to operate locally; they are zero over part of their range. When they are multiplied together, as in Figure 9.11, the result is nonzero only over the small part of the feature space where both component functions are nonzero. As a result, the regression surface is built up parsimoniously, using nonzero components locally, only where they are needed. This is important, since one should "spend" parameters carefully in high dimensions, as they can run out quickly. The use of other basis functions, such as polynomials, would produce a nonzero product everywhere, and would not work as well.

The second important advantage of the piecewise linear basis function concerns computation. Consider the product of a function in $\mathcal{M}$ with each of the N reflected pairs for an input $X_j$. This appears to require the fitting of N single-input linear regression models, each of which uses O(N) operations, making a total of $O(N^2)$ operations. However, we can exploit the simple form of the piecewise linear function. We first fit the reflected pair with rightmost knot.
As the knot is moved successively one position at a time to the left, the basis functions differ by zero over the left part of the domain, and by a constant over the right part. Hence after each such move we can update the fit in O(1) operations. This allows us to try every knot in only O(N) operations.

The forward modeling strategy in MARS is hierarchical, in the sense that multiway products are built up from products involving terms already in the model. For example, a four-way product can only be added to the model if one of its three-way components is already in the model. The philosophy here is that a high-order interaction will likely only exist if some of its lower-order "footprints" exist as well. This need not be true, but is a reasonable working assumption and avoids the search over an exponentially growing space of alternatives.
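The GCV criterion (9.20) itself is a one-line computation once the fitted values and the knot count are known. A sketch, with `gcv` a hypothetical helper; the effective-parameter charge follows the M(λ) = r + cK formula above:

```python
import numpy as np

def gcv(y, yhat, r, K, c=3.0):
    """Generalized cross-validation criterion, as in (9.20), for a MARS fit.
    Effective parameters M = r + c*K: r linearly independent basis
    functions plus a charge of c per selected knot (c = 3 in general,
    c = 2 for additive models)."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    N = len(y)
    M = r + c * K
    rss = ((y - yhat) ** 2).sum()
    return rss / (1.0 - M / N) ** 2
```

Along the backward deletion sequence, one evaluates this quantity for each candidate model size and keeps the minimizer.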

FIGURE 9.12. Spam data: test error misclassification rate for the MARS procedure, as a function of the rank (number of independent basis functions) in the model.

There is one restriction put on the formation of model terms: each input can appear at most once in a product. This prevents the formation of higher-order powers of an input, which increase or decrease too sharply near the boundaries of the feature space. Such powers can be approximated in a more stable way with piecewise linear functions.

A useful option in the MARS procedure is to set an upper limit on the order of interaction. For example, one can set a limit of two, allowing pairwise products of piecewise linear functions, but not three- or higher-way products. This can aid in the interpretation of the final model. An upper limit of one results in an additive model.

9.4.1 Spam Example (Continued)

We applied MARS to the spam data analyzed earlier in this chapter. To enhance interpretability, we restricted MARS to second-degree interactions. Although the target is a two-class variable, we used the squared-error loss function nonetheless (see Section 9.4.3). Figure 9.12 shows the test error misclassification rate as a function of the rank (number of independent basis functions) in the model. The error rate levels off at about 5.5%, which is slightly higher than that of the generalized additive model (5.3%) discussed earlier. GCV chose a model size of 60, which is roughly the smallest model giving optimal performance. The leading interactions found by MARS involved inputs (ch\$, remove), (ch\$, free) and (hp, CAPTOT). However, these interactions give no improvement in performance over the generalized additive model.

9.4 MARS: Multivariate Adaptive Regression Splines

9.4.2 Example (Simulated Data)

Here we examine the performance of MARS in three contrasting scenarios. There are N = 100 observations, and the predictors X_1, X_2, ..., X_p and errors ε have independent standard normal distributions.

Scenario 1: The data generation model is

Y = (X_1 - 1)_+ + (X_1 - 1)_+ (X_2 - 0.8)_+ + 0.12 ε.  (9.21)

The noise standard deviation 0.12 was chosen so that the signal-to-noise ratio was about 5. We call this the tensor-product scenario; the product term gives a surface that looks like that of Figure 9.11.

Scenario 2: This is the same as scenario 1, but with p = 20 total predictors; that is, there are 18 inputs that are independent of the response.

Scenario 3: This has the structure of a neural network:

l_1 = X_1 + X_2 + X_3 + X_4 + X_5,
l_2 = X_6 - X_7 + X_8 - X_9 + X_10,
σ(t) = 1/(1 + e^{-t}),
Y = σ(l_1) + σ(l_2) + 0.12 ε.  (9.22)

Scenarios 1 and 2 are ideally suited for MARS, while scenario 3 contains high-order interactions and may be difficult for MARS to approximate. We ran five simulations from each model, and recorded the results. In scenario 1, MARS typically uncovered the correct model almost perfectly. In scenario 2, it found the correct structure but also found a few extraneous terms involving other predictors.

Let μ(x) be the true mean of Y, and let

MSE_0 = ave_{x in Test} (ȳ - μ(x))²,
MSE = ave_{x in Test} (f̂(x) - μ(x))².  (9.23)

These represent the mean-square error of the constant model and the fitted MARS model, estimated by averaging at the 1000 test values of x. Table 9.4 shows the proportional decrease in model error, or R², for each scenario:

R² = (MSE_0 - MSE) / MSE_0.  (9.24)

The values shown are means and standard errors over the five simulations. The performance of MARS is degraded only slightly by the inclusion of the useless inputs in scenario 2; it performs substantially worse in scenario 3.
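The tensor-product scenario and the model-error measure above can be simulated directly. This is a sketch in plain Python (the function names are ours), where `r2_model_error` follows the MSE_0/MSE definitions above:

```python
import random

def relu(u):
    """The positive-part function (u)_+."""
    return max(u, 0.0)

def scenario1(n, p=2, rng=random):
    """Tensor-product scenario: Y = (X1-1)_+ + (X1-1)_+(X2-0.8)_+ + 0.12*eps.
    Returns a list of (x, y) pairs with x a list of p standard normals."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]
        mu = relu(x[0] - 1) + relu(x[0] - 1) * relu(x[1] - 0.8)
        data.append((x, mu + 0.12 * rng.gauss(0, 1)))
    return data

def r2_model_error(mu_test, fhat_test, ybar):
    """Proportional decrease in model error: (MSE0 - MSE)/MSE0,
    where MSE0 is the error of the constant model ybar."""
    n = len(mu_test)
    mse0 = sum((ybar - mu) ** 2 for mu in mu_test) / n
    mse = sum((f - mu) ** 2 for f, mu in zip(fhat_test, mu_test)) / n
    return (mse0 - mse) / mse0
```

A perfect fit gives R² = 1, and a fit no better than the constant model gives R² = 0, matching the interpretation of the table that follows.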

TABLE 9.4. Proportional decrease in model error (R²) when MARS is applied to three different scenarios.

Scenario | Mean (S.E.)
1: Tensor product, p = 2 | 0.97 (0.01)
2: Tensor product, p = 20 | 0.96 (0.01)
3: Neural network | 0.79 (0.01)

9.4.3 Other Issues

MARS for Classification

The MARS method and algorithm can be extended to handle classification problems. Several strategies have been suggested. For two classes, one can code the output as 0/1 and treat the problem as a regression; we did this for the spam example. For more than two classes, one can use the indicator response approach described in Section 4.2. One codes the K response classes via 0/1 indicator variables, and then performs a multi-response MARS regression. For the latter we use a common set of basis functions for all response variables. Classification is made to the class with the largest predicted response value. There are, however, potential masking problems with this approach, as described in Section 4.2. A generally superior approach is the "optimal scoring" method discussed in Section 12.5.

Stone et al. (1997) developed a hybrid of MARS called PolyMARS specifically designed to handle classification problems. It uses the multiple logistic framework described in Section 4.4. It grows the model in a forward stagewise fashion like MARS, but at each stage uses a quadratic approximation to the multinomial log-likelihood to search for the next basis-function pair. Once found, the enlarged model is fit by maximum likelihood, and the process is repeated.

Relationship of MARS to CART

Although they might seem quite different, the MARS and CART strategies actually have strong similarities. Suppose we take the MARS procedure and make the following changes:

- Replace the piecewise linear basis functions by step functions I(x - t > 0) and I(x - t <= 0).
- When a model term is involved in a multiplication by a candidate term, it gets replaced by the interaction, and hence is not available for further interactions.

With these changes, the MARS forward procedure is the same as the CART tree-growing algorithm.
Multiplying a step function by a pair of reflected

step functions is equivalent to splitting a node at the step. The second restriction implies that a node may not be split more than once, and leads to the attractive binary-tree representation of the CART model. On the other hand, it is this restriction that makes it difficult for CART to model additive structures. MARS forgoes the tree structure and gains the ability to capture additive effects.

Mixed Inputs

MARS can handle "mixed" predictors, quantitative and qualitative, in a natural way, much like CART does. MARS considers all possible binary partitions of the categories for a qualitative predictor into two groups. Each such partition generates a pair of piecewise constant basis functions: indicator functions for the two sets of categories. This basis pair is now treated as any other, and is used in forming tensor products with other basis functions already in the model.

9.5 Hierarchical Mixtures of Experts

The hierarchical mixtures of experts (HME) procedure can be viewed as a variant of tree-based methods. The main difference is that the tree splits are not hard decisions but rather soft probabilistic ones. At each node an observation goes left or right with probabilities depending on its input values. This has some computational advantages since the resulting parameter optimization problem is smooth, unlike the discrete split point search in the tree-based approach. The soft splits might also help in prediction accuracy and provide a useful alternative description of the data.

There are other differences between HMEs and the CART implementation of trees. In an HME, a linear (or logistic regression) model is fit in each terminal node, instead of a constant as in CART. The splits can be multiway, not just binary, and the splits are probabilistic functions of a linear combination of inputs, rather than a single input as in the standard use of CART. However, the relative merits of these choices are not clear, and most were discussed at the end of Section 9.2. A simple two-level HME model is shown in Figure 9.13.
It can be thought of as a tree with soft splits at each non-terminal node. However, the inventors of this methodology use a different terminology. The terminal nodes are called experts, and the non-terminal nodes are called gating networks. The idea is that each expert provides an opinion (prediction) about the response, and these are combined together by the gating networks. As we will see, the model is formally a mixture model, and the two-level model in the figure can be extended to multiple levels, hence the name hierarchical mixtures of experts.

FIGURE 9.13. A two-level hierarchical mixture of experts (HME) model. [Figure: gating networks at the top two levels combine the predictions Pr(y|x, θ_jl) of the expert networks at the terminal nodes.]

Consider the regression or classification problem, as described earlier in the chapter. The data is (x_i, y_i), i = 1, 2, ..., N, with y_i either a continuous or binary-valued response, and x_i a vector-valued input. For ease of notation we assume that the first element of x_i is one, to account for intercepts.

Here is how an HME is defined. The top gating network has the output

g_j(x, γ_j) = e^{γ_j^T x} / Σ_{k=1}^K e^{γ_k^T x},  j = 1, 2, ..., K,  (9.25)

where each γ_j is a vector of unknown parameters. This represents a soft K-way split (K = 2 in Figure 9.13). Each g_j(x, γ_j) is the probability of assigning an observation with feature vector x to the jth branch. Notice that with K = 2 groups, if we take the coefficient of one of the elements of x to be +∞, then we get a logistic curve with infinite slope. In this case, the gating probabilities are either 0 or 1, corresponding to a hard split on that input.

At the second level, the gating networks have a similar form:

g_{l|j}(x, γ_jl) = e^{γ_jl^T x} / Σ_{k=1}^K e^{γ_jk^T x},  l = 1, 2, ..., K.  (9.26)

This is the probability of assignment to the lth branch, given assignment to the jth branch at the level above.

At each expert (terminal node), we have a model for the response variable of the form

Y ~ Pr(y | x, θ_jl).  (9.27)

This differs according to the problem.

Regression: The Gaussian linear regression model is used, with θ_jl = (β_jl, σ²_jl):

Y = β_jl^T x + ε and ε ~ N(0, σ²_jl).  (9.28)

Classification: The linear logistic regression model is used:

Pr(Y = 1 | x, θ_jl) = 1 / (1 + e^{-θ_jl^T x}).  (9.29)

Denoting the collection of all parameters by Ψ = {γ_j, γ_jl, θ_jl}, the total probability that Y = y is

Pr(y | x, Ψ) = Σ_{j=1}^K g_j(x, γ_j) Σ_{l=1}^K g_{l|j}(x, γ_jl) Pr(y | x, θ_jl).  (9.30)

This is a mixture model, with the mixture probabilities determined by the gating network models.

To estimate the parameters, we maximize the log-likelihood of the data, Σ_i log Pr(y_i | x_i, Ψ), over the parameters in Ψ. The most convenient method for doing this is the EM algorithm, which we describe for mixtures in Section 8.5. We define latent variables Δ_j, all of which are zero except for a single one. We interpret these as the branching decisions made by the top-level gating network. Similarly we define latent variables Δ_{l|j} to describe the gating decisions at the second level. In the E-step, the EM algorithm computes the expectations of the Δ_j and Δ_{l|j} given the current values of the parameters. These expectations are then used as observation weights in the M-step of the procedure, to estimate the parameters in the expert networks. The parameters in the internal nodes are estimated by a version of multiple logistic regression. The expectations of the Δ_j or Δ_{l|j} are probability profiles, and these are used as the response vectors for these logistic regressions.

The hierarchical mixtures of experts approach is a promising competitor to CART trees. By using soft splits rather than hard decision rules it can capture situations where the transition from low to high response is gradual. The log-likelihood is a smooth function of the unknown weights and hence is amenable to numerical optimization.
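Evaluating a two-level HME, given its parameters, is straightforward: each gating network is a softmax over linear scores, and the total probability is the gate-weighted sum of the expert probabilities. The sketch below (our own function names; fitting by EM is left aside) just evaluates the gating and mixture formulas above:

```python
import math

def softmax_gate(x, gammas):
    """Gating probabilities g_j(x) = exp(gamma_j^T x) / sum_k exp(gamma_k^T x)."""
    scores = [sum(g_i * x_i for g_i, x_i in zip(g, x)) for g in gammas]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def hme_mixture_prob(x, y, top_gammas, sub_gammas, expert_prob):
    """Total probability Pr(y|x): sum over top branches j and sub-branches l of
    g_j * g_{l|j} * Pr(y|x, theta_jl).  expert_prob(j, l, x, y) supplies the
    expert model's probability (e.g., Gaussian or logistic)."""
    g_top = softmax_gate(x, top_gammas)
    total = 0.0
    for j, gj in enumerate(g_top):
        g_sub = softmax_gate(x, sub_gammas[j])
        for l, gl in enumerate(g_sub):
            total += gj * gl * expert_prob(j, l, x, y)
    return total
```

With all gating parameters zero, every gate is uniform, so the mixture reduces to a plain average of the expert probabilities.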
The model is similar to CART with linear combination splits, but the latter is more difficult to optimize. On

the other hand, to our knowledge there are no methods for finding a good tree topology for the HME model, as there are in CART. Typically one uses a fixed tree of some depth, possibly the output of the CART procedure. The emphasis in the research on HMEs has been on prediction rather than interpretation of the final model. A close cousin of the HME is the latent class model (Lin et al., 2000), which typically has only one layer; here the nodes or latent classes are interpreted as groups of subjects that show similar response behavior.

9.6 Missing Data

It is quite common to have observations with missing values for one or more input features. The usual approach is to impute (fill in) the missing values in some way. However, the first issue in dealing with the problem is determining whether the missing data mechanism has distorted the observed data. Roughly speaking, data are missing at random if the mechanism resulting in its omission is independent of its (unobserved) value. A more precise definition is given in Little and Rubin (2002). Suppose y is the response vector and X is the N × p matrix of inputs (some of which are missing). Denote by X_obs the observed entries in X and let Z = (y, X), Z_obs = (y, X_obs). Finally, if R is an indicator matrix with ijth entry 1 if x_ij is missing and zero otherwise, then the data is said to be missing at random (MAR) if the distribution of R depends on the data Z only through Z_obs:

Pr(R | Z, θ) = Pr(R | Z_obs, θ).  (9.31)

Here θ are any parameters in the distribution of R. Data are said to be missing completely at random (MCAR) if the distribution of R doesn't depend on the observed or missing data:

Pr(R | Z, θ) = Pr(R | θ).  (9.32)

MCAR is a stronger assumption than MAR: most imputation methods rely on MCAR for their validity. For example, if a patient's measurement was not taken because the doctor felt he was too sick, that observation would not be MAR or MCAR.
In this case the missing data mechanism causes our observed training data to give a distorted picture of the true population, and data imputation is dangerous in this instance. Often the determination of whether features are MCAR must be made from information about the data collection process. For categorical features, one way to diagnose this problem is to code "missing" as an additional class. Then we fit our model to the training data and see if class "missing" is predictive of the response.

Assuming the features are missing completely at random, there are a number of ways of proceeding:

1. Discard observations with any missing values.
2. Rely on the learning algorithm to deal with missing values in its training phase.
3. Impute all missing values before training.

Approach (1) can be used if the relative amount of missing data is small, but otherwise should be avoided. Regarding (2), CART is one learning algorithm that deals effectively with missing values, through surrogate splits (Section 9.2.4). MARS and PRIM use similar approaches. In generalized additive modeling, all observations missing for a given input feature are omitted when the partial residuals are smoothed against that feature in the backfitting algorithm, and their fitted values are set to zero. Since the fitted curves have mean zero (when the model includes an intercept), this amounts to assigning the average fitted value to the missing observations.

For most learning methods, the imputation approach (3) is necessary. The simplest tactic is to impute the missing value with the mean or median of the nonmissing values for that feature. (Note that the above procedure for generalized additive models is analogous to this.) If the features have at least some moderate degree of dependence, one can do better by estimating a predictive model for each feature given the other features and then imputing each missing value by its prediction from the model. In choosing the learning method for imputation of the features, one must remember that this choice is distinct from the method used for predicting y from X. Thus a flexible, adaptive method will often be preferred, even for the eventual purpose of carrying out a linear regression of y on X. In addition, if there are many missing feature values in the training set, the learning method must itself be able to deal with missing feature values. CART therefore is an ideal choice for this imputation "engine."

After imputation, missing values are typically treated as if they were actually observed.
This ignores the uncertainty due to the imputation, which will itself introduce additional uncertainty into estimates and predictions from the response model. One can measure this additional uncertainty by doing multiple imputations and hence creating many different training sets. The predictive model for y can be fit to each training set, and the variation across training sets can be assessed. If CART was used for the imputation engine, the multiple imputations could be done by sampling from the values in the corresponding terminal nodes.
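The simplest tactic above, imputing each missing entry by the mean of the nonmissing values for that feature, can be sketched in a few lines (our own helper names, with `None` standing for a missing entry):

```python
def column_means(X):
    """Mean of the nonmissing (non-None) values in each column of X."""
    p = len(X[0])
    means = []
    for j in range(p):
        vals = [row[j] for row in X if row[j] is not None]
        means.append(sum(vals) / len(vals))
    return means

def impute_mean(X):
    """Approach (3), simplest tactic: replace each missing entry by its
    column mean, leaving observed entries untouched."""
    means = column_means(X)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in X]
```

A predictive-model imputation engine (e.g., CART, as suggested above) would replace `column_means` with a model of each feature given the others; the interface would be the same.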

9.7 Computational Considerations

With N observations and p predictors, additive model fitting requires some number mp of applications of a one-dimensional smoother or regression method. The required number of cycles m of the backfitting algorithm is usually less than 20 and often less than 10, and depends on the amount of correlation in the inputs. With cubic smoothing splines, for example, N log N operations are needed for an initial sort and N operations for the spline fit. Hence the total operations for an additive model fit is pN log N + mpN.

Trees require pN log N operations for an initial sort for each predictor, and typically another pN log N operations for the split computations. If the splits occurred near the edges of the predictor ranges, this number could increase to N²p.

MARS requires Nm² + pmN operations to add a basis function to a model with m terms already present, from a pool of p predictors. Hence to build an M-term model requires NM³ + pM²N computations, which can be quite prohibitive if M is a reasonable fraction of N.

Each of the components of an HME are typically inexpensive to fit at each M-step: Np² for the regressions, and Np²K² for a K-class logistic regression. The EM algorithm, however, can take a long time to converge, and so sizable HME models are considered costly to fit.

Bibliographic Notes

The most comprehensive source for generalized additive models is the text of that name by Hastie and Tibshirani (1990). Different applications of this work in medical problems are discussed in Hastie et al. (1989) and Hastie and Herman (1990), and the software implementation in Splus is described in Chambers and Hastie (1992). Green and Silverman (1994) discuss penalization and spline models in a variety of settings. Efron and Tibshirani (1991) give an exposition of modern developments in statistics (including generalized additive models), for a nonmathematical audience. Classification and regression trees date back at least as far as Morgan and Sonquist (1963). We have followed the modern approaches of Breiman et al. (1984) and Quinlan (1993).
The PRIM method is due to Friedman and Fisher (1999), while MARS is introduced in Friedman (1991), with an additive precursor in Friedman and Silverman (1989). Hierarchical mixtures of experts were proposed in Jordan and Jacobs (1994); see also Jacobs et al. (1991).

Exercises

Ex. 9.1 Show that a smoothing spline fit of y_i to x_i preserves the linear part of the fit. In other words, if y_i = ŷ_i + r_i, where ŷ_i represents the linear regression fits, and S is the smoothing matrix, then Sy = ŷ + Sr. Show that the same is true for local linear regression (Section 6.1.1). Hence argue that the adjustment step in the second line of (2) in Algorithm 9.1 is unnecessary.

Ex. 9.2 Let A be a known k × k matrix, b be a known k-vector, and z be an unknown k-vector. A Gauss-Seidel algorithm for solving the linear system of equations Az = b works by successively solving for element z_j in the jth equation, fixing all other z_j's at their current guesses. This process is repeated for j = 1, 2, ..., k, 1, 2, ..., k, ..., until convergence (Golub and Van Loan, 1983).

(a) Consider an additive model with N observations and p terms, with the jth term to be fit by a linear smoother S_j. Consider the following system of equations:

[ I    S_1  ...  S_1 ] [ f_1 ]   [ S_1 y ]
[ S_2  I    ...  S_2 ] [ f_2 ] = [ S_2 y ]
[ ...           ... ] [ ... ]   [ ...   ]
[ S_p  S_p  ...  I   ] [ f_p ]   [ S_p y ]   (9.33)

Here each f_j is an N-vector of evaluations of the jth function at the data points, and y is an N-vector of the response values. Show that backfitting is a blockwise Gauss-Seidel algorithm for solving this system of equations.

(b) Let S_1 and S_2 be symmetric smoothing operators (matrices) with eigenvalues in [0, 1). Consider a backfitting algorithm with response vector y and smoothers S_1, S_2. Show that with any starting values, the algorithm converges, and give a formula for the final iterates.

Ex. 9.3 Backfitting equations. Consider a backfitting procedure with orthogonal projections, and let D be the overall regression matrix whose columns span V = L_col(S_1) ⊕ L_col(S_2) ⊕ ... ⊕ L_col(S_p), where L_col(S) denotes the column space of a matrix S. Show that the estimating equations

[ I    S_1  ...  S_1 ] [ f_1 ]   [ S_1 y ]
[ S_2  I    ...  S_2 ] [ f_2 ] = [ S_2 y ]
[ ...           ... ] [ ... ]   [ ...   ]
[ S_p  S_p  ...  I   ] [ f_p ]   [ S_p y ]

are equivalent to the least squares normal equations D^T D β = D^T y where β is the vector of coefficients.

Ex. 9.4 Suppose the same smoother S is used to estimate both terms in a two-term additive model (i.e., both variables are identical). Assume that S is symmetric with eigenvalues in [0, 1). Show that the backfitting residual converges to (I + S)^{-1}(I - S)y, and that the residual sum of squares converges upward. Can the residual sum of squares converge upward in less structured situations? How does this fit compare to the fit with a single term fit by S? [Hint: Use the eigen-decomposition of S to help with this comparison.]

Ex. 9.5 Degrees of freedom of a tree. Given data y_i with mean f(x_i) and variance σ², and a fitting operation y → ŷ, let's define the degrees of freedom of a fit by Σ_i cov(y_i, ŷ_i)/σ². Consider a fit ŷ estimated by a regression tree, fit to a set of predictors X_1, X_2, ..., X_p.

(a) In terms of the number of terminal nodes m, give a rough formula for the degrees of freedom of the fit.

(b) Generate 100 observations with predictors X_1, X_2, ..., X_10 as independent standard Gaussian variates and fix these values.

(c) Generate response values also as standard Gaussian (σ² = 1), independent of the predictors. Fit regression trees to the data of fixed size 1, 5 and 10 terminal nodes and hence estimate the degrees of freedom of each fit. [Do ten simulations of the response and average the results, to get a good estimate of degrees of freedom.]

(d) Compare your estimates of degrees of freedom in (a) and (c) and discuss.

(e) If the regression tree fit were a linear operation, we could write ŷ = Sy for some matrix S. Then the degrees of freedom would be tr(S). Suggest a way to compute an approximate S matrix for a regression tree, compute it and compare the resulting degrees of freedom to those in (a) and (c).

Ex. 9.6 Consider the ozone data of Figure 6.9.

(a) Fit an additive model to the cube root of ozone concentration as a function of temperature, wind speed, and radiation. Compare your results to those obtained via the trellis display in Figure 6.9.
(b) Fit trees, MARS, and PRIM to the same data, and compare the results to those found in (a) and in Figure 6.9.

10 Boosting and Additive Trees

10.1 Boosting Methods

Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification problems, but as will be seen in this chapter, it can profitably be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee." From this perspective boosting bears a resemblance to bagging and other committee-based approaches (Section 8.8). However we shall see that the connection is at best superficial and that boosting is fundamentally different.

We begin by describing the most popular boosting algorithm due to Freund and Schapire (1997) called "AdaBoost.M1." Consider a two-class problem, with the output variable coded as Y ∈ {-1, 1}. Given a vector of predictor variables X, a classifier G(X) produces a prediction taking one of the two values {-1, 1}. The error rate on the training sample is

err = (1/N) Σ_{i=1}^N I(y_i ≠ G(x_i)),

and the expected error rate on future predictions is E_XY I(Y ≠ G(X)).

A weak classifier is one whose error rate is only slightly better than random guessing. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers G_m(x), m = 1, 2, ..., M.

FIGURE 10.1. Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction. [Figure: a training sample feeds a sequence of weighted samples, each producing a classifier G_1(x), G_2(x), ..., G_M(x), combined into the final classifier G(x) = sign(Σ_m α_m G_m(x)).]

The predictions from all of them are then combined through a weighted majority vote to produce the final prediction:

G(x) = sign( Σ_{m=1}^M α_m G_m(x) ).  (10.1)

Here α_1, α_2, ..., α_M are computed by the boosting algorithm, and weight the contribution of each respective G_m(x). Their effect is to give higher influence to the more accurate classifiers in the sequence. Figure 10.1 shows a schematic of the AdaBoost procedure.

The data modifications at each boosting step consist of applying weights w_1, w_2, ..., w_N to each of the training observations (x_i, y_i), i = 1, 2, ..., N. Initially all of the weights are set to w_i = 1/N, so that the first step simply trains the classifier on the data in the usual manner. For each successive iteration m = 2, 3, ..., M the observation weights are individually modified and the classification algorithm is reapplied to the weighted observations. At step m, those observations that were misclassified by the classifier G_{m-1}(x) induced at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced

Algorithm 10.1 AdaBoost.M1.

1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M:
   (a) Fit a classifier G_m(x) to the training data using weights w_i.
   (b) Compute
       err_m = Σ_{i=1}^N w_i I(y_i ≠ G_m(x_i)) / Σ_{i=1}^N w_i.
   (c) Compute α_m = log((1 - err_m)/err_m).
   (d) Set w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))], i = 1, 2, ..., N.
3. Output G(x) = sign[ Σ_{m=1}^M α_m G_m(x) ].

to concentrate on those training observations that are missed by previous ones in the sequence.

Algorithm 10.1 shows the details of the AdaBoost.M1 algorithm. The current classifier G_m(x) is induced on the weighted observations at line 2a. The resulting weighted error rate is computed at line 2b. Line 2c calculates the weight α_m given to G_m(x) in producing the final classifier G(x) (line 3). The individual weights of each of the observations are updated for the next iteration at line 2d. Observations misclassified by G_m(x) have their weights scaled by a factor exp(α_m), increasing their relative influence for inducing the next classifier G_{m+1}(x) in the sequence.

The AdaBoost.M1 algorithm is known as "Discrete AdaBoost" in Friedman et al. (2000), because the base classifier G_m(x) returns a discrete class label. If the base classifier instead returns a real-valued prediction (e.g., a probability mapped to the interval [-1, 1]), AdaBoost can be modified appropriately (see "Real AdaBoost" in Friedman et al. (2000)).

The power of AdaBoost to dramatically increase the performance of even a very weak classifier is illustrated in Figure 10.2. The features X_1, ..., X_10 are standard independent Gaussian, and the deterministic target Y is defined by

Y = 1 if Σ_{j=1}^{10} X_j² > χ²_{10}(0.5), and -1 otherwise.  (10.2)

Here χ²_{10}(0.5) = 9.34 is the median of a chi-squared random variable with 10 degrees of freedom (sum of squares of 10 standard Gaussians). There are 2000 training cases, with approximately 1000 cases in each class, and 10,000 test observations. Here the weak classifier is just a "stump": a two terminal-node classification tree.
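The AdaBoost.M1 algorithm above translates almost line for line into code. The sketch below is our own plain-Python implementation, with a weighted decision stump as the weak learner; it is a toy version for small datasets, not an efficient one:

```python
import math

def fit_stump(X, y, w):
    """Weighted stump: pick feature j, threshold t, and sign s minimizing
    the weighted misclassification error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for s in (1, -1):
                pred = [s if row[j] > t else -s for row in X]
                err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    _, j, t, s = best
    return lambda row: s if row[j] > t else -s

def adaboost_m1(X, y, M):
    N = len(y)
    w = [1.0 / N] * N                        # step 1: uniform weights
    classifiers = []
    for _ in range(M):                       # step 2
        G = fit_stump(X, y, w)               # (a) fit weak classifier
        miss = [1.0 if G(x) != yi else 0.0 for x, yi in zip(X, y)]
        err = sum(wi * mi for wi, mi in zip(w, miss)) / sum(w)   # (b)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against err = 0 or 1
        alpha = math.log((1 - err) / err)    # (c)
        w = [wi * math.exp(alpha * mi) for wi, mi in zip(w, miss)]  # (d)
        classifiers.append((alpha, G))
    def G_final(x):                          # step 3: weighted majority vote
        return 1 if sum(a * G(x) for a, G in classifiers) >= 0 else -1
    return G_final
```

The clipping of `err` away from 0 and 1 is our own numerical guard; the algorithm as stated assumes a weak learner with error strictly between 0 and 1.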
Applying this classifier alone to the training data set yields a very poor test set error rate of 45.8%, compared to 50% for

FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rate for a single stump, and a 244-node classification tree.

random guessing. However, as boosting iterations proceed the error rate steadily decreases, reaching 5.8% after 400 iterations. Thus, boosting this simple very weak classifier reduces its prediction error rate by almost a factor of four. It also outperforms a single large classification tree (error rate 24.7%). Since its introduction, much has been written to explain the success of AdaBoost in producing accurate classifiers. Most of this work has centered on using classification trees as the "base learner" G(x), where improvements are often most dramatic. In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the "best off-the-shelf classifier in the world" (see also Breiman (1998)). This is especially the case for data-mining applications, as discussed more fully in Section 10.7 later in this chapter.

10.1.1 Outline of This Chapter

Here is an outline of the developments in this chapter:

We show that AdaBoost fits an additive model in a base learner, optimizing a novel exponential loss function. This loss function is

very similar to the (negative) binomial log-likelihood (Sections 10.2-10.4).

The population minimizer of the exponential loss function is shown to be the log-odds of the class probabilities (Section 10.5).

We describe loss functions for regression and classification that are more robust than squared error or exponential loss (Section 10.6).

It is argued that decision trees are an ideal base learner for data mining applications of boosting (Sections 10.7 and 10.9).

We develop a class of gradient boosted models (GBMs), for boosting trees with any loss function (Section 10.10).

The importance of "slow learning" is emphasized, and implemented by shrinkage of each new term that enters the model (Section 10.12), as well as randomization (Section 10.12.2).

Tools for interpretation of the fitted model are described (Section 10.13).

10.2 Boosting Fits an Additive Model

The success of boosting is really not very mysterious. The key lies in expression (10.1). Boosting is a way of fitting an additive expansion in a set of elementary "basis" functions. Here the basis functions are the individual classifiers G_m(x) ∈ {-1, 1}. More generally, basis function expansions take the form

f(x) = Σ_{m=1}^M β_m b(x; γ_m),  (10.3)

where β_m, m = 1, 2, ..., M are the expansion coefficients, and b(x; γ) ∈ IR are usually simple functions of the multivariate argument x, characterized by a set of parameters γ. We discuss basis expansions in some detail in Chapter 5.

Additive expansions like this are at the heart of many of the learning techniques covered in this book:

- In single-hidden-layer neural networks (Chapter 11), b(x; γ) = σ(γ_0 + γ_1^T x), where σ(t) = 1/(1 + e^{-t}) is the sigmoid function, and γ parameterizes a linear combination of the input variables.
- In signal processing, wavelets (Section 5.9.1) are a popular choice with γ parameterizing the location and scale shifts of a "mother" wavelet.
- Multivariate adaptive regression splines (Section 9.4) uses truncated-power spline basis functions where γ parameterizes the variables and values for the knots.

Algorithm 10.2 Forward Stagewise Additive Modeling.

1. Initialize f_0(x) = 0.
2. For m = 1 to M:
   (a) Compute
       (β_m, γ_m) = arg min_{β,γ} Σ_{i=1}^N L(y_i, f_{m-1}(x_i) + β b(x_i; γ)).
   (b) Set f_m(x) = f_{m-1}(x) + β_m b(x; γ_m).

- For trees, γ parameterizes the split variables and split points at the internal nodes, and the predictions at the terminal nodes.

Typically these models are fit by minimizing a loss function averaged over the training data, such as the squared-error or a likelihood-based loss function,

min_{ {β_m, γ_m}_1^M } Σ_{i=1}^N L( y_i, Σ_{m=1}^M β_m b(x_i; γ_m) ).  (10.4)

For many loss functions L(y, f(x)) and/or basis functions b(x; γ), this requires computationally intensive numerical optimization techniques. However, a simple alternative often can be found when it is feasible to rapidly solve the subproblem of fitting just a single basis function,

min_{β,γ} Σ_{i=1}^N L(y_i, β b(x_i; γ)).  (10.5)

10.3 Forward Stagewise Additive Modeling

Forward stagewise modeling approximates the solution to (10.4) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added. This is outlined in Algorithm 10.2. At each iteration m, one solves for the optimal basis function b(x; γ_m) and corresponding coefficient β_m to add to the current expansion f_{m-1}(x). This produces f_m(x), and the process is repeated. Previously added terms are not modified.

For squared-error loss

L(y, f(x)) = (y - f(x))²,  (10.6)

one has

L(y_i, f_{m-1}(x_i) + β b(x_i; γ)) = (y_i - f_{m-1}(x_i) - β b(x_i; γ))²
                                   = (r_im - β b(x_i; γ))²,  (10.7)

where r_im = y_i - f_{m-1}(x_i) is simply the residual of the current model on the ith observation. Thus, for squared-error loss, the term β_m b(x; γ_m) that best fits the current residuals is added to the expansion at each step. This idea is the basis for "least squares" regression boosting discussed in Section 10.10.2. However, as we show near the end of the next section, squared-error loss is generally not a good choice for classification; hence the need to consider other loss criteria.

10.4 Exponential Loss and AdaBoost

We now show that AdaBoost.M1 (Algorithm 10.1) is equivalent to forward stagewise additive modeling (Algorithm 10.2) using the loss function

L(y, f(x)) = exp(-y f(x)).  (10.8)

The appropriateness of this criterion is addressed in the next section.

For AdaBoost the basis functions are the individual classifiers G_m(x) ∈ {-1, 1}. Using the exponential loss function, one must solve

(β_m, G_m) = arg min_{β,G} Σ_{i=1}^N exp[-y_i (f_{m-1}(x_i) + β G(x_i))]

for the classifier G_m and corresponding coefficient β_m to be added at each step. This can be expressed as

(β_m, G_m) = arg min_{β,G} Σ_{i=1}^N w_i^{(m)} exp(-β y_i G(x_i))  (10.9)

with w_i^{(m)} = exp(-y_i f_{m-1}(x_i)). Since each w_i^{(m)} depends neither on β nor G(x), it can be regarded as a weight that is applied to each observation. This weight depends on f_{m-1}(x_i), and so the individual weight values change with each iteration m.

The solution to (10.9) can be obtained in two steps. First, for any value of β > 0, the solution to (10.9) for G_m(x) is

G_m = arg min_G Σ_{i=1}^N w_i^{(m)} I(y_i ≠ G(x_i)),  (10.10)

which is the classifier that minimizes the weighted error rate in predicting y. This can be easily seen by expressing the criterion in (10.9) as

e^{-β} Σ_{y_i = G(x_i)} w_i^{(m)} + e^{β} Σ_{y_i ≠ G(x_i)} w_i^{(m)},

which in turn can be written as

(e^{β} - e^{-β}) Σ_{i=1}^N w_i^{(m)} I(y_i ≠ G(x_i)) + e^{-β} Σ_{i=1}^N w_i^{(m)}.  (10.11)

Plugging this G_m into (10.9) and solving for β one obtains

β_m = (1/2) log((1 - err_m)/err_m),  (10.12)

where err_m is the minimized weighted error rate

err_m = Σ_{i=1}^N w_i^{(m)} I(y_i ≠ G_m(x_i)) / Σ_{i=1}^N w_i^{(m)}.  (10.13)

The approximation is then updated,

f_m(x) = f_{m-1}(x) + β_m G_m(x),

which causes the weights for the next iteration to be

w_i^{(m+1)} = w_i^{(m)} e^{-β_m y_i G_m(x_i)}.  (10.14)

Using the fact that -y_i G_m(x_i) = 2 · I(y_i ≠ G_m(x_i)) - 1, (10.14) becomes

w_i^{(m+1)} = w_i^{(m)} e^{α_m I(y_i ≠ G_m(x_i))} e^{-β_m},  (10.15)

where α_m = 2β_m is the quantity defined at line 2c of AdaBoost.M1 (Algorithm 10.1). The factor e^{-β_m} in (10.15) multiplies all weights by the same value, so it has no effect. Thus (10.15) is equivalent to line 2(d) of Algorithm 10.1.

One can view line 2(a) of the AdaBoost.M1 algorithm as a method for approximately solving the minimization in (10.11) and hence (10.10). Hence we conclude that AdaBoost.M1 minimizes the exponential loss criterion (10.8) via a forward-stagewise additive modeling approach.

Figure 10.3 shows the training-set misclassification error rate and average exponential loss for the simulated data problem (10.2) of Figure 10.2. The training-set misclassification error decreases to zero at around 250 iterations (and remains there), but the exponential loss keeps decreasing. Notice also in Figure 10.2 that the test-set misclassification error continues to improve after iteration 250. Clearly AdaBoost is not optimizing training-set misclassification error; the exponential loss is more sensitive to changes in the estimated class probabilities.
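The closed form β_m = ½ log((1 - err_m)/err_m) derived above can be checked numerically: for a fixed classifier G, the weighted exponential criterion is a smooth convex function of β, and a grid search lands on the closed-form value. A small verification sketch (our own code):

```python
import math

def exp_criterion(beta, weights, misclassified):
    """Sum_i w_i * exp(-beta * y_i G(x_i)): contributes e^{-beta} for each
    correctly classified observation and e^{beta} for each miss."""
    return sum(w * (math.exp(beta) if m else math.exp(-beta))
               for w, m in zip(weights, misclassified))

# Example: five equally weighted observations, one misclassified,
# so the weighted error rate is err = 0.2.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
misclassified = [True, False, False, False, False]
err = 0.2
beta_star = 0.5 * math.log((1 - err) / err)   # the closed-form minimizer

# A fine grid search over beta should agree with the closed form.
grid = [i / 1000.0 for i in range(0, 2000)]
beta_grid = min(grid, key=lambda b: exp_criterion(b, weights, misclassified))
```

Setting the derivative of the criterion to zero gives e^{2β} = (1 - err)/err, which is exactly the closed form; the grid search confirms it to within the grid spacing.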

FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss: (1/N) Σ_{i=1}^N exp(−y_i f(x_i)). After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.

10.5 Why Exponential Loss?

The AdaBoost.M1 algorithm was originally motivated from a very different perspective than presented in the previous section. Its equivalence to forward stagewise additive modeling based on exponential loss was only discovered five years after its inception. By studying the properties of the exponential loss criterion, one can gain insight into the procedure and discover ways it might be improved.

The principal attraction of exponential loss in the context of additive modeling is computational; it leads to the simple modular reweighting AdaBoost algorithm. However, it is of interest to inquire about its statistical properties. What does it estimate and how well is it being estimated? The first question is answered by seeking its population minimizer. It is easy to show (Friedman et al., 2000) that

f*(x) = arg min_{f(x)} E_{Y|x}(e^{−Y f(x)}) = (1/2) log [Pr(Y = 1|x) / Pr(Y = −1|x)],                (10.16)

or equivalently

Pr(Y = 1|x) = 1 / (1 + e^{−2f*(x)}).

Thus, the additive expansion produced by AdaBoost is estimating one-half the log-odds of Pr(Y = 1|x). This justifies using its sign as the classification rule in (10.1).

Another loss criterion with the same population minimizer is the binomial negative log-likelihood or deviance (also known as cross-entropy), interpreting f as the logit transform. Let

p(x) = Pr(Y = 1|x) = e^{f(x)} / (e^{−f(x)} + e^{f(x)}) = 1 / (1 + e^{−2f(x)})                (10.17)

and define Y′ = (Y + 1)/2 ∈ {0, 1}. Then the binomial log-likelihood loss function is

l(Y, p(x)) = Y′ log p(x) + (1 − Y′) log(1 − p(x)),

or equivalently the deviance is

−l(Y, f(x)) = log(1 + e^{−2Y f(x)}).                (10.18)

Since the population maximizer of the log-likelihood is at the true probabilities p(x) = Pr(Y = 1|x), we see from (10.17) that the population minimizers of the deviance E_{Y|x}[−l(Y, f(x))] and of E_{Y|x}[e^{−Y f(x)}] are the same. Thus, using either criterion leads to the same solution at the population level. Note that e^{−Y f} itself is not a proper log-likelihood, since it is not the logarithm of any probability mass function for a binary random variable Y ∈ {−1, 1}.

10.6 Loss Functions and Robustness

In this section we examine the different loss functions for classification and regression more closely, and characterize them in terms of their robustness to extreme data.

Robust Loss Functions for Classification

Although both the exponential (10.8) and binomial deviance (10.18) criteria yield the same solution when applied to the population joint distribution, the same is not true for finite data sets. Both criteria are monotone decreasing functions of the "margin" yf(x). In classification (with a −1/1 response) the margin plays a role analogous to the residuals y − f(x) in regression. The classification rule G(x) = sign[f(x)] implies that observations with positive margin y_i f(x_i) > 0 are classified correctly whereas those with negative margin y_i f(x_i) < 0 are misclassified. The decision boundary is defined by

f(x) = 0. The goal of the classification algorithm is to produce positive margins as frequently as possible. Any loss criterion used for classification should penalize negative margins more heavily than positive ones, since positive-margin observations are already correctly classified.

FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: I(sign(f) ≠ y); exponential: exp(−yf); binomial deviance: log(1 + exp(−2yf)); squared error: (y − f)²; and support vector: (1 − yf)_+ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).

Figure 10.4 shows both the exponential (10.8) and binomial deviance criteria as a function of the margin y·f(x). Also shown is misclassification loss L(y, f(x)) = I(y·f(x) < 0), which gives unit penalty for negative margin values, and no penalty at all for positive ones. Both the exponential and deviance losses can be viewed as monotone continuous approximations to misclassification loss. They continuously penalize increasingly negative margin values more heavily than they reward increasingly positive ones. The difference between them is in degree. The penalty associated with binomial deviance increases linearly for large increasingly negative margin, whereas the exponential criterion increases the influence of such observations exponentially. At any point in the training process the exponential criterion concentrates much more influence on observations with large negative margins. Binomial deviance concentrates relatively less influence on such observations.

f_K(x) = 0, as in (4.17). Here we prefer to retain the symmetry, and impose the constraint Σ_{k=1}^K f_k(x) = 0. The binomial deviance extends naturally to the K-class multinomial deviance loss function:

L(y, p(x)) = − Σ_{k=1}^K I(y = G_k) log p_k(x)
           = − Σ_{k=1}^K I(y = G_k) f_k(x) + log ( Σ_{l=1}^K e^{f_l(x)} ).                (10.22)

As in the two-class case, the criterion (10.22) penalizes incorrect predictions only linearly in their degree of incorrectness. Zhu et al. (2005) generalize the exponential loss for K-class problems. See Exercise 10.5 for details.

Robust Loss Functions for Regression

In the regression setting, analogous to the relationship between exponential loss and binomial log-likelihood is the relationship between squared-error loss L(y, f(x)) = (y − f(x))² and absolute loss L(y, f(x)) = |y − f(x)|. The population solutions are f(x) = E(Y|x) for squared-error loss, and f(x) = median(Y|x) for absolute loss; for symmetric error distributions these are the same. However, on finite samples squared-error loss places much more emphasis on observations with large absolute residuals |y_i − f(x_i)| during the fitting process. It is thus far less robust, and its performance severely degrades for long-tailed error distributions and especially for grossly mismeasured y-values ("outliers"). Other more robust criteria, such as absolute loss, perform much better in these situations. In the statistical robustness literature, a variety of regression loss criteria have been proposed that provide strong resistance (if not absolute immunity) to gross outliers while being nearly as efficient as least squares for Gaussian errors. They are often better than either for error distributions with moderately heavy tails. One such criterion is the Huber loss criterion used for M-regression (Huber, 1964)

L(y, f(x)) = { [y − f(x)]²                for |y − f(x)| ≤ δ,
             { 2δ|y − f(x)| − δ²          otherwise.                (10.23)

Figure 10.5 compares these three loss functions.
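The loss criteria discussed in this section are straightforward to write down and compare numerically. The sketch below is our own illustrative code, not code from the book (all names and test values are ours): it evaluates the two-class margin losses, each scaled to pass through (0, 1) as in the discussion above, together with the Huber regression criterion.

```python
import numpy as np

# Two-class margin losses, each a function of the margin m = y*f and
# scaled to pass through (0, 1), plus the Huber regression loss.
margin_losses = {
    "misclassification": lambda m: (m < 0).astype(float),
    "exponential":       lambda m: np.exp(-m),
    "binomial deviance": lambda m: np.log1p(np.exp(-2 * m)) / np.log(2),
    "squared error":     lambda m: (1 - m) ** 2,          # (y - f)^2 with y = 1
    "support vector":    lambda m: np.maximum(0, 1 - m),  # hinge (1 - yf)_+
}

def huber(y, f, delta):
    """Quadratic within delta of the fit, linear beyond (robust to outliers)."""
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2 * delta * r - delta ** 2)

m = np.array([-2.0, 0.0, 2.0])
for name, L in margin_losses.items():
    print(name, np.round(L(m), 3))

resid = np.array([0.5, 3.0, 10.0])
print("huber", huber(resid, 0.0, delta=1.0))   # [0.25, 5.0, 19.0]
```

For large negative margins the exponential loss dominates the deviance, which is the "degree" difference described above; the two Huber pieces meet continuously at |y − f| = δ.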
These considerations suggest that when robustness is a concern, as is especially the case in data mining applications (see Section 10.7), squared-error loss for regression and exponential loss for classification are not the best criteria from a statistical perspective. However, they both lead to the elegant modular boosting algorithms in the context of forward stagewise additive modeling. For squared-error loss one simply fits the base learner to the residuals from the current model y_i − f_{m−1}(x_i) at each step. For

FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin y − f. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when |y − f| is large.

exponential loss one performs a weighted fit of the base learner to the output values y_i, with weights w_i = exp(−y_i f_{m−1}(x_i)). Using other more robust criteria directly in their place does not give rise to such simple feasible boosting algorithms. However, in Section 10.10.2 we show how one can derive simple elegant boosting algorithms based on any differentiable loss criterion, thereby producing highly robust boosting procedures for data mining.

10.7 "Off-the-Shelf" Procedures for Data Mining

Predictive learning is an important aspect of data mining. As can be seen from this book, a wide variety of methods have been developed for predictive learning from data. For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data. We have attempted to characterize appropriate situations in our discussions of each of the respective methods. However, it is seldom known in advance which procedure will perform best or even well for any given problem. Table 10.1 summarizes some of the characteristics of a number of learning methods.

TABLE 10.1. Some characteristics of different learning methods. Key: ▲ = good, ◆ = fair, and ▼ = poor.

Characteristic                                        Neural Nets   SVM   Trees   MARS   k-NN, Kernels
Natural handling of data of "mixed" type                   ▼         ▼      ▲       ▲          ▼
Handling of missing values                                 ▼         ▼      ▲       ▲          ▲
Robustness to outliers in input space                      ▼         ▼      ▲       ▼          ▲
Insensitive to monotone transformations of inputs          ▼         ▼      ▲       ▼          ▼
Computational scalability (large N)                        ▼         ▼      ▲       ▲          ▼
Ability to deal with irrelevant inputs                     ▼         ▼      ▲       ▲          ▼
Ability to extract linear combinations of features         ▲         ▲      ▼       ▼          ◆
Interpretability                                           ▼         ▼      ◆       ▲          ▼
Predictive power                                           ▲         ▲      ▼       ◆          ▲

Industrial and commercial data mining applications tend to be especially challenging in terms of the requirements placed on learning procedures. Data sets are often very large in terms of number of observations and number of variables measured on each of them. Thus, computational considerations play an important role. Also, the data are usually messy: the inputs tend to be mixtures of quantitative, binary, and categorical variables, the latter often with many levels. There are generally many missing values, complete observations being rare. Distributions of numeric predictor and response variables are often long-tailed and highly skewed. This is the case for the spam data (Section 9.1.2); when fitting a generalized additive model, we first log-transformed each of the predictors in order to get a reasonable fit. In addition they usually contain a substantial fraction of gross mis-measurements (outliers). The predictor variables are generally measured on very different scales.

In data mining applications, usually only a small fraction of the large number of predictor variables that have been included in the analysis are actually relevant to prediction. Also, unlike many applications such as pattern recognition, there is seldom reliable domain knowledge to help create especially relevant features and/or filter out the irrelevant ones, the inclusion of which dramatically degrades the performance of many methods.

In addition, data mining applications generally require interpretable models. It is not enough to simply produce predictions. It is also desirable to have information providing qualitative understanding of the relationship

between joint values of the input variables and the resulting predicted response value. Thus, black box methods such as neural networks, which can be quite useful in purely predictive settings such as pattern recognition, are far less useful for data mining.

These requirements of speed, interpretability, and the messy nature of the data sharply limit the usefulness of most learning procedures as off-the-shelf methods for data mining. An "off-the-shelf" method is one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure.

Of all the well-known learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. They are relatively fast to construct and they produce interpretable models (if the trees are small). As discussed in Section 9.2, they naturally incorporate mixtures of numeric and categorical predictor variables and missing values. They are invariant under (strictly monotone) transformations of the individual predictors. As a result, scaling and/or more general transformations are not an issue, and they are immune to the effects of predictor outliers. They perform internal feature selection as an integral part of the procedure. They are thereby resistant, if not completely immune, to the inclusion of many irrelevant predictor variables. These properties of decision trees are largely the reason that they have emerged as the most popular learning method for data mining.

Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy. They seldom provide predictive accuracy comparable to the best that can be achieved with the data at hand. As seen in Section 10.1, boosting decision trees improves their accuracy, often dramatically. At the same time it maintains most of their desirable properties for data mining.
Some advantages of trees that are sacrificed by boosting are speed, interpretability, and, for AdaBoost, robustness against overlapping class distributions and especially mislabeling of the training data. A gradient boosted model (GBM) is a generalization of tree boosting that attempts to mitigate these problems, so as to produce an accurate and effective off-the-shelf procedure for data mining.

10.8 Example: Spam Data

Before we go into the details of gradient boosting, we demonstrate its abilities on a two-class classification problem. The spam data are introduced in Chapter 1, and used as an example for many of the procedures in Chapter 9 (Sections 9.1.2, 9.2.5, 9.3.1 and 9.4.1).

Applying gradient boosting to these data resulted in a test error rate of 4.5%, using the same test set as was used in Section 9.1.2. By comparison, an additive logistic regression achieved 5.5%, a CART tree fully grown and

pruned by cross-validation 8.7%, and MARS 5.5%. The standard error of these estimates is around 0.6%, although gradient boosting is significantly better than all of them using the McNemar test (Exercise 10.6).

In Section 10.13 below we develop a relative importance measure for each predictor, as well as a partial dependence plot describing a predictor's contribution to the fitted model. We now illustrate these for the spam data.

Figure 10.6 displays the relative importance spectrum for all 57 predictor variables. Clearly some predictors are more important than others in separating spam from email. The frequencies of the character strings !, $, hp, and remove are estimated to be the four most relevant predictor variables. At the other end of the spectrum, the character strings 857, 415, table, and 3d have virtually no relevance. The quantity being modeled here is the log-odds of spam versus email

f(x) = log [Pr(spam|x) / Pr(email|x)]                (10.24)

(see Section 10.13 below). Figure 10.7 shows the partial dependence of the log-odds on selected important predictors, two positively associated with spam (! and remove), and two negatively associated (edu and hp). These particular dependencies are seen to be essentially monotonic. There is a general agreement with the corresponding functions found by the additive logistic regression model; see Figure 9.1 on page 303.

Running a gradient boosted model on these data with J = 2 terminal-node trees produces a purely additive (main effects) model for the log-odds, with a corresponding error rate of 4.7%, as compared to 4.5% for the full gradient boosted model (with J = 5 terminal-node trees). Although not significant, this slightly higher error rate suggests that there may be interactions among some of the important predictor variables. This can be diagnosed through two-variable partial dependence plots. Figure 10.8 shows one of the several such plots displaying strong interaction effects. One sees that for very low frequencies of hp, the log-odds of spam are greatly increased.
For high frequencies of hp, the log-odds of spam tend to be much lower and roughly constant as a function of !. As the frequency of hp decreases, the functional relationship with ! strengthens.

10.9 Boosting Trees

Regression and classification trees are discussed in detail in Section 9.2. They partition the space of all joint predictor variable values into disjoint regions R_j, j = 1, 2, ..., J, as represented by the terminal nodes of the tree. A constant γ_j is assigned to each such region and the predictive rule is

x ∈ R_j ⇒ f(x) = γ_j.
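This predictive rule amounts to evaluating an indicator sum over regions, T(x) = Σ_j γ_j I(x ∈ R_j). A minimal sketch of such a piecewise-constant predictor (the regions, constants, and names here are our own invention for illustration):

```python
import numpy as np

# A tiny two-region tree on [0, 1]: R_1 = {x <= 0.5}, R_2 = {x > 0.5},
# with constants gamma_1 = -1 and gamma_2 = 2.
regions = [lambda x: x <= 0.5, lambda x: x > 0.5]
gammas = [-1.0, 2.0]

def tree_predict(x):
    # Indicator sum: each point receives the constant of the region it falls in.
    return sum(g * r(x) for g, r in zip(gammas, regions))

x = np.array([0.1, 0.4, 0.6, 0.9])
print(tree_predict(x))   # constant within each region
```

Because the regions partition the input space, exactly one indicator is nonzero for each x, so the sum simply selects that region's constant.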

FIGURE 10.6. Predictor variable importance spectrum for the spam data. The variable names are written on the vertical axis.

FIGURE 10.7. Partial dependence of log-odds of spam on four important predictors. The red ticks at the base of the plots are deciles of the input variable.

FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of joint frequencies of hp and the character !.

Thus a tree can be formally expressed as

T(x; Θ) = Σ_{j=1}^J γ_j I(x ∈ R_j),                (10.25)

with parameters Θ = {R_j, γ_j}_1^J. J is usually treated as a meta-parameter. The parameters are found by minimizing the empirical risk

Θ̂ = arg min_Θ Σ_{j=1}^J Σ_{x_i ∈ R_j} L(y_i, γ_j).                (10.26)

This is a formidable combinatorial optimization problem, and we usually settle for approximate suboptimal solutions. It is useful to divide the optimization problem into two parts:

Finding γ_j given R_j: Given the R_j, estimating the γ_j is typically trivial, and often γ̂_j = ȳ_j, the mean of the y_i falling in region R_j. For misclassification loss, γ̂_j is the modal class of the observations falling in region R_j.

Finding R_j: This is the difficult part, for which approximate solutions are found. Note also that finding the R_j entails estimating the γ_j as well. A typical strategy is to use a greedy, top-down recursive partitioning algorithm to find the R_j. In addition, it is sometimes necessary to approximate (10.26) by a smoother and more convenient criterion for optimizing the R_j:

Θ̃ = arg min_Θ Σ_{i=1}^N L̃(y_i, T(x_i, Θ)).                (10.27)

Then given the R̂_j = R̃_j, the γ_j can be estimated more precisely using the original criterion. In Section 9.2 we described such a strategy for classification trees. The Gini index replaced misclassification loss in the growing of the tree (identifying the R_j).

The boosted tree model is a sum of such trees,

f_M(x) = Σ_{m=1}^M T(x; Θ_m),                (10.28)

induced in a forward stagewise manner (Algorithm 10.2). At each step in the forward stagewise procedure one must solve

Θ̂_m = arg min_{Θ_m} Σ_{i=1}^N L(y_i, f_{m−1}(x_i) + T(x_i; Θ_m))                (10.29)

for the region set and constants Θ_m = {R_jm, γ_jm}_1^{J_m} of the next tree, given the current model f_{m−1}(x).

Given the regions R_jm, finding the optimal constants γ_jm in each region is typically straightforward:

γ̂_jm = arg min_{γ_jm} Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + γ_jm).                (10.30)

Finding the regions is difficult, and even more difficult than for a single tree. For a few special cases, the problem simplifies.

For squared-error loss, the solution to (10.29) is no harder than for a single tree. It is simply the regression tree that best predicts the current residuals y_i − f_{m−1}(x_i), and γ̂_jm is the mean of these residuals in each corresponding region.

For two-class classification and exponential loss, this stagewise approach gives rise to the AdaBoost method for boosting classification trees (Algorithm 10.1). In particular, if the trees T(x; Θ_m) are restricted to be scaled classification trees, then we showed in Section 10.4 that the solution to (10.29) is the tree that minimizes the weighted error rate Σ_{i=1}^N w_i^{(m)} I(y_i ≠ T(x_i; Θ_m)) with weights w_i^{(m)} = e^{−y_i f_{m−1}(x_i)}. By a scaled classification tree, we mean β_m T(x; Θ_m), with the restriction that γ_jm ∈ {−1, 1}.

Without this restriction, (10.29) still simplifies for exponential loss to a weighted exponential criterion for the new tree:

Θ̂_m = arg min_{Θ_m} Σ_{i=1}^N w_i^{(m)} exp[−y_i T(x_i; Θ_m)].                (10.31)

It is straightforward to implement a greedy recursive-partitioning algorithm using this weighted exponential loss as a splitting criterion. Given the R_jm, one can show (Exercise 10.7) that the solution to (10.30) is one-half the weighted log-odds in each corresponding region

γ̂_jm = (1/2) log [ Σ_{x_i ∈ R_jm} w_i^{(m)} I(y_i = 1) / Σ_{x_i ∈ R_jm} w_i^{(m)} I(y_i = −1) ].                (10.32)

This requires a specialized tree-growing algorithm; in practice, we prefer the approximation presented below that uses a weighted least squares regression tree.

Using loss criteria such as the absolute error or the Huber loss (10.23) in place of squared-error loss for regression, and the deviance (10.22) in place of exponential loss for classification, will serve to robustify boosting trees.
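The weighted log-odds solution for a region constant can be verified numerically. In this sketch (the data, weights, and names are made up for illustration), a grid search over the weighted exponential criterion recovers the closed form:

```python
import numpy as np

# Observations falling in one terminal region, with labels in {-1, +1}
# and current weights w_i^(m).
y = np.array([ 1,  1,  1, -1, -1])
w = np.array([0.1, 0.3, 0.2, 0.2, 0.2])

# Closed form: one-half the weighted log-odds of the region.
gamma = 0.5 * np.log(w[y == 1].sum() / w[y == -1].sum())

# Check: gamma minimizes the weighted exponential loss sum_i w_i e^{-y_i g}.
grid = np.linspace(-2, 2, 4001)
risk = np.array([(w * np.exp(-y * g)).sum() for g in grid])
print(gamma, grid[risk.argmin()])   # agree to within the grid spacing
```

The check works because, within a single region, the weighted exponential criterion reduces to W₊ e^{−γ} + W₋ e^{γ}, whose minimizer is γ = ½ log(W₊/W₋).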
Unfortunately, unlike their nonrobust counterparts, these robust criteria do not give rise to simple fast boosting algorithms.

For more general loss criteria the solution to (10.30), given the R_jm, is typically straightforward since it is a simple location estimate. For

absolute loss it is just the median of the residuals in each respective region. For the other criteria fast iterative algorithms exist for solving (10.30), and usually their faster "single-step" approximations are adequate. The problem is tree induction. Simple fast algorithms do not exist for solving (10.29) for these more general loss criteria, and approximations like (10.27) become essential.

10.10 Numerical Optimization via Gradient Boosting

Fast approximate algorithms for solving (10.29) with any differentiable loss criterion can be derived by analogy to numerical optimization. The loss in using f(x) to predict y on the training data is

L(f) = Σ_{i=1}^N L(y_i, f(x_i)).                (10.33)

The goal is to minimize L(f) with respect to f, where here f(x) is constrained to be a sum of trees (10.28). Ignoring this constraint, minimizing (10.33) can be viewed as a numerical optimization

f̂ = arg min_f L(f),                (10.34)

where the "parameters" f ∈ IR^N are the values of the approximating function f(x_i) at each of the N data points x_i:

f = {f(x_1), f(x_2), ..., f(x_N)}.

Numerical optimization procedures solve (10.34) as a sum of component vectors

f_M = Σ_{m=0}^M h_m,   h_m ∈ IR^N,

where f_0 = h_0 is an initial guess, and each successive f_m is induced based on the current parameter vector f_{m−1}, which is the sum of the previously induced updates. Numerical optimization methods differ in their prescriptions for computing each increment vector h_m (the "step").

10.10.1 Steepest Descent

Steepest descent chooses h_m = −ρ_m g_m, where ρ_m is a scalar and g_m ∈ IR^N is the gradient of L(f) evaluated at f = f_{m−1}. The components of the gradient g_m are

g_im = [ ∂L(y_i, f(x_i)) / ∂f(x_i) ]_{f(x_i) = f_{m−1}(x_i)}.                (10.35)

The step length ρ_m is the solution to

ρ_m = arg min_ρ L(f_{m−1} − ρ g_m).                (10.36)

The current solution is then updated

f_m = f_{m−1} − ρ_m g_m

and the process repeated at the next iteration. Steepest descent can be viewed as a very greedy strategy, since −g_m is the local direction in IR^N for which L(f) is most rapidly decreasing at f = f_{m−1}.

10.10.2 Gradient Boosting

Forward stagewise boosting (Algorithm 10.2) is also a very greedy strategy. At each step the solution tree is the one that maximally reduces (10.29), given the current model f_{m−1} and its fits f_{m−1}(x_i). Thus, the tree predictions T(x_i; Θ_m) are analogous to the components of the negative gradient (10.35). The principal difference between them is that the tree components t_m = (T(x_1; Θ_m), ..., T(x_N; Θ_m)) are not independent. They are constrained to be the predictions of a J_m-terminal node decision tree, whereas the negative gradient is the unconstrained maximal descent direction.

The solution to (10.30) in the stagewise approach is analogous to the line search (10.36) in steepest descent. The difference is that (10.30) performs a separate line search for those components of t_m that correspond to each separate terminal region {T(x_i; Θ_m)}_{x_i ∈ R_jm}.

If minimizing loss on the training data (10.33) were the only goal, steepest descent would be the preferred strategy. The gradient (10.35) is trivial to calculate for any differentiable loss function L(y, f(x)), whereas solving (10.29) is difficult for the robust criteria discussed in Section 10.6. Unfortunately the gradient (10.35) is defined only at the training data points x_i, whereas the ultimate goal is to generalize f_M(x) to new data not represented in the training set.

A possible resolution to this dilemma is to induce a tree T(x; Θ_m) at the mth iteration whose predictions t_m are as close as possible to the negative gradient. Using squared error to measure closeness, this leads us to

Θ̃_m = arg min_Θ Σ_{i=1}^N (−g_im − T(x_i; Θ))².                (10.37)

That is, one fits the tree T to the negative gradient values (10.35) by least squares.
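The fit-to-the-negative-gradient idea can be sketched end to end. The code below is our own minimal illustration, not the book's implementation: it uses one-split stumps as the trees and absolute loss as the criterion, so the negative gradient is sign(y − f), the stump is fit to these signs by least squares, and each terminal region then receives the median of its residuals as its constant (the region-wise "line search" for absolute loss).

```python
import numpy as np

def stump_regions(x, targets):
    """One-split regression tree fit to the targets by least squares;
    returns a boolean mask for the left region."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = targets[left].var() * left.sum() + targets[~left].var() * (~left).sum()
        if best is None or sse < best[0]:
            best = (sse, s)
    return x <= best[1]

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = np.where(x > 0.5, 2.0, -1.0) + 0.1 * rng.standard_normal(300)
y[:5] += 20.0                       # a few grossly mismeasured responses

f = np.full_like(y, np.median(y))   # initialize with the optimal constant
for m in range(30):
    r = np.sign(y - f)              # negative gradient of |y - f|
    left = stump_regions(x, r)      # terminal regions of the induced tree
    for region in (left, ~left):    # separate line search per region:
        f[region] += np.median(y[region] - f[region])   # median residual

print(np.median(np.abs(y - f)))     # small despite the outliers
```

Despite the five grossly mismeasured responses, the bulk of the data is fit well, illustrating the robustness argument of Section 10.6.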
As noted in Section 10.9, fast algorithms exist for least squares decision tree induction. Although the solution regions R̃_jm to (10.37) will not be identical to the regions R̂_jm that solve (10.29), they are generally similar enough to serve the same purpose. In any case, the forward stagewise

Algorithm 10.3 Gradient Tree Boosting Algorithm.

1. Initialize f_0(x) = arg min_γ Σ_{i=1}^N L(y_i, γ).

2. For m = 1 to M:

   (a) For i = 1, 2, ..., N compute

       r_im = − [ ∂L(y_i, f(x_i)) / ∂f(x_i) ]_{f = f_{m−1}}.

   (b) Fit a regression tree to the targets r_im giving terminal regions R_jm, j = 1, 2, ..., J_m.

   (c) For j = 1, 2, ..., J_m compute

       γ_jm = arg min_γ Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + γ).

   (d) Update f_m(x) = f_{m−1}(x) + Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm).

3. Output f̂(x) = f_M(x).

The algorithm for classification is similar. Lines 2(a)–(d) are repeated K times at each iteration m, once for each class, using (10.38). The result at line 3 is K different (coupled) tree expansions f_kM(x), k = 1, 2, ..., K. These produce probabilities via (10.21) or do classification as in (10.20). Details are given in Exercise 10.9. Two basic tuning parameters are the number of iterations M and the sizes of each of the constituent trees J_m, m = 1, 2, ..., M.

The original implementation of this algorithm was called MART for "multiple additive regression trees," and was referred to in the first edition of this book. Many of the figures in this chapter were produced by MART. Gradient boosting as described here is implemented in the R gbm package (Ridgeway, 1999, "Gradient Boosted Models"), and is freely available. The gbm package is used in Section 10.14.2, and extensively in Chapters 16 and 15. Another R implementation of boosting is mboost (Hothorn and Bühlmann, 2006). A commercial implementation of gradient boosting/MART called TreeNet is available from Salford Systems, Inc.

10.11 Right-Sized Trees for Boosting

Historically, boosting was considered to be a technique for combining models, here trees. As such, the tree building algorithm was regarded as a

primitive that produced models to be combined by the boosting procedure. In this scenario, the optimal size of each tree is estimated separately in the usual manner when it is built (Section 9.2). A very large (oversized) tree is first induced, and then a bottom-up procedure is employed to prune it to the estimated optimal number of terminal nodes. This approach assumes implicitly that each tree is the last one in the expansion (10.28). Except perhaps for the very last tree, this is clearly a very poor assumption. The result is that trees tend to be much too large, especially during the early iterations. This substantially degrades performance and increases computation.

The simplest strategy for avoiding this problem is to restrict all trees to be the same size, J_m = J for all m. At each iteration a J-terminal node regression tree is induced. Thus J becomes a meta-parameter of the entire boosting procedure, to be adjusted to maximize estimated performance for the data at hand.

One can get an idea of useful values for J by considering the properties of the "target" function

η = arg min_f E_{XY} L(Y, f(X)).                (10.39)

Here the expected value is over the population joint distribution of (X, Y). The target function η(x) is the one with minimum prediction risk on future data. This is the function we are trying to approximate.

One relevant property of η(X) is the degree to which the coordinate variables X^T = (X_1, X_2, ..., X_p) interact with one another. This is captured by its ANOVA (analysis of variance) expansion

η(X) = Σ_j η_j(X_j) + Σ_{jk} η_jk(X_j, X_k) + Σ_{jkl} η_jkl(X_j, X_k, X_l) + ··· .                (10.40)

The first sum in (10.40) is over functions of only a single predictor variable X_j. The particular functions η_j(X_j) are those that jointly best approximate η(X) under the loss criterion being used. Each such η_j(X_j) is called the "main effect" of X_j. The second sum is over those two-variable functions that when added to the main effects best fit η(X). These are called the second-order interactions of each respective variable pair (X_j, X_k).
The third sum represents third-order interactions, and so on. For many problems encountered in practice, low-order interaction effects tend to dominate. When this is the case, models that produce strong higher-order interaction effects, such as large decision trees, suffer in accuracy.

The interaction level of tree-based approximations is limited by the tree size J. Namely, no interaction effects of level greater than J − 1 are possible. Since boosted models are additive in the trees (10.28), this limit extends to them as well. Setting J = 2 (single-split "decision stump") produces boosted models with only main effects; no interactions are permitted. With J = 3, two-variable interaction effects are also allowed, and

FIGURE 10.9. Boosting with different sized trees, applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.

so on. This suggests that the value chosen for J should reflect the level of dominant interactions of η(X). This is of course generally unknown, but in most situations it will tend to be low. Figure 10.9 illustrates the effect of interaction order (choice of J) on the simulation example (10.2). The generative function is additive (sum of quadratic monomials), so boosting models with J > 2 incurs unnecessary variance and hence the higher test error. Figure 10.10 compares the coordinate functions found by boosted stumps with the true functions.

Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 ≤ J ≤ 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range. One can fine-tune the value for J by trying several different values and choosing the one that produces the lowest risk on a validation sample. However, this seldom provides significant improvement over using J ≃ 6.

FIGURE 10.10. Coordinate functions estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.

10.12 Regularization

Besides the size of the constituent trees, J, the other meta-parameter of gradient boosting is the number of boosting iterations M. Each iteration usually reduces the training risk L(f_M), so that for M large enough this risk can be made arbitrarily small. However, fitting the training data too well can lead to overfitting, which degrades the risk on future predictions. Thus, there is an optimal number M* minimizing future risk that is application dependent. A convenient way to estimate M* is to monitor prediction risk as a function of M on a validation sample. The value of M that minimizes this risk is taken to be an estimate of M*. This is analogous to the early stopping strategy often used with neural networks (Section 11.4).

10.12.1 Shrinkage

Controlling the value of M is not the only possible regularization strategy. As with ridge regression and neural networks, shrinkage techniques can be employed as well (see Sections 3.4.1 and 11.5). The simplest implementation of shrinkage in the context of boosting is to scale the contribution of each tree by a factor 0 < ν < 1 when it is added to the current approximation. That is, line 2(d) of Algorithm 10.3 is replaced by

f_m(x) = f_{m−1}(x) + ν · Σ_{j=1}^J γ_jm I(x ∈ R_jm).                (10.41)

The parameter ν can be regarded as controlling the learning rate of the boosting procedure. Smaller values of ν (more shrinkage) result in larger training risk for the same number of iterations M. Thus, both ν and M control prediction risk on the training data. However, these parameters do

not operate independently. Smaller values of ν lead to larger values of M for the same training risk, so that there is a tradeoff between them.

Empirically it has been found (Friedman, 2001) that smaller values of ν favor better test error, and require correspondingly larger values of M. In fact, the best strategy appears to be to set ν to be very small (ν < 0.1) and then choose M by early stopping. This yields dramatic improvements (over no shrinkage, ν = 1) for regression and for probability estimation. The corresponding improvements in misclassification risk via (10.20) are less, but still substantial. The price paid for these improvements is computational: smaller values of ν give rise to larger values of M, and computation is proportional to the latter. However, as seen below, many iterations are generally computationally feasible even on very large data sets. This is partly due to the fact that small trees are induced at each step with no pruning.

Figure 10.11 shows test error curves for the simulated example (10.2) of Figure 10.2. A gradient boosted model (MART) was trained using binomial deviance, using either stumps or six terminal-node trees, and with or without shrinkage. The benefits of shrinkage are evident, especially when the binomial deviance is tracked. With shrinkage, each test error curve reaches a lower value, and stays there for many iterations.

Section 16.2.1 draws a connection between forward stagewise shrinkage in boosting and the use of an L1 penalty for regularizing model parameters (the "lasso"). We argue that L1 penalties may be superior to the L2 penalties used by methods such as the support vector machine.

10.12.2 Subsampling

We saw in Section 8.7 that bootstrap averaging (bagging) improves the performance of a noisy classifier through averaging. Chapter 15 discusses in some detail the variance-reduction mechanism of this sampling followed by averaging. We can exploit the same device in gradient boosting, both to improve performance and computational efficiency.
With stochastic gradient boosting (Friedman, 1999), at each iteration we sample a fraction η of the training observations (without replacement), and grow the next tree using that subsample. The rest of the algorithm is identical. A typical value for η can be 1/2, although for large N, η can be substantially smaller than 1/2. Not only does the sampling reduce the computing time by the same fraction η, but in many cases it actually produces a more accurate model.

Figure 10.12 illustrates the effect of subsampling using the simulated example (10.2), both as a classification and as a regression example. We see in both cases that sampling along with shrinkage slightly outperformed the rest. It appears here that subsampling without shrinkage does poorly.
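The shrinkage update (10.41) and the subsampling step are straightforward to sketch in code. The following is a minimal illustration, not the MART implementation referenced in the book, of stochastic gradient boosting for squared-error loss with six-leaf trees; the data, the shrinkage ν = 0.1, and the subsample fraction η = 0.5 are illustrative choices only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

nu, eta, M = 0.1, 0.5, 200           # shrinkage, subsample fraction, iterations
f = np.full(len(y), y.mean())        # initialize with the optimal constant
trees = []
for m in range(M):
    # sample a fraction eta of the observations without replacement
    idx = rng.choice(len(y), size=int(eta * len(y)), replace=False)
    resid = y[idx] - f[idx]          # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X[idx], resid)
    f += nu * tree.predict(X)        # shrunken update as in (10.41)
    trees.append(tree)

print(float(np.mean((y - f) ** 2)))  # training risk after M iterations
```

Smaller ν slows the decrease of the training risk per iteration, which is exactly the ν versus M tradeoff discussed above.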

FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, either stumps or six terminal-node trees, and with or without shrinkage. The left panels report test deviance, while the right panels show misclassification error. The beneficial effect of shrinkage can be seen in all cases, especially for deviance in the left panels.

FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity. For the curves labeled "Sample = 0.5," a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel using square-error loss.

The downside is that we now have four parameters to set: J, M, ν and η. Typically some early explorations determine suitable values for J, ν and η, leaving M as the primary parameter.

10.13 Interpretation

Single decision trees are highly interpretable. The entire model can be completely represented by a simple two-dimensional graphic (binary tree) that is easily visualized. Linear combinations of trees (10.28) lose this important feature, and must therefore be interpreted in a different way.

10.13.1 Relative Importance of Predictor Variables

In data mining applications the input predictor variables are seldom equally relevant. Often only a few of them have substantial influence on the response; the vast majority are irrelevant and could just as well have not been included. It is often useful to learn the relative importance or contribution of each input variable in predicting the response.

For a single decision tree T, Breiman et al. (1984) proposed

I_l²(T) = Σ_{t=1}^{J−1} î_t² I(v(t) = l)   (10.42)

as a measure of relevance for each predictor variable X_l. The sum is over the J − 1 internal nodes of the tree. At each such node t, one of the input variables X_{v(t)} is used to partition the region associated with that node into two subregions; within each a separate constant is fit to the response values. The particular variable chosen is the one that gives maximal estimated improvement î_t² in squared error risk over that for a constant fit over the entire region. The squared relative importance of variable X_l is the sum of such squared improvements over all internal nodes for which it was chosen as the splitting variable.

This importance measure is easily generalized to additive tree expansions (10.28); it is simply averaged over the trees

I_l² = (1/M) Σ_{m=1}^{M} I_l²(T_m).   (10.43)

Due to the stabilizing effect of averaging, this measure turns out to be more reliable than is its counterpart (10.42) for a single tree. Also, because of shrinkage (Section 10.12.1) the masking of important variables by others with which they are highly correlated is much less of a problem. Note that (10.42) and (10.43) refer to squared relevance; the actual relevances are their respective square roots. Since these measures are relative, it is customary to assign the largest a value of 100 and then scale the others accordingly. Figure 10.6 shows the relative importance of the 57 inputs in predicting spam versus email.

For K-class classification, K separate models f_k(x), k = 1, 2, ..., K are induced, each consisting of a sum of trees

f_k(x) = Σ_{m=1}^{M} T_{km}(x).   (10.44)

In this case (10.43) generalizes to

I_{lk}² = (1/M) Σ_{m=1}^{M} I_l²(T_{km}).   (10.45)

Here I_{lk} is the relevance of X_l in separating the class k observations from the other classes. The overall relevance of X_l is obtained by averaging over all of the classes

I_l² = (1/K) Σ_{k=1}^{K} I_{lk}².   (10.46)
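The squared relevances in (10.42) and (10.43) can be sketched directly from fitted regression trees: at each internal node, credit the splitting variable with the estimated decrease in squared-error risk. The code below is a hedged illustration, not the book's software; it reads the needed quantities from scikit-learn's internal tree arrays, using the weighted impurity decrease as the improvement î_t² (up to normalization).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def squared_relevance(fitted_tree, p):
    """Squared relevance (10.42): sum of improvements per splitting variable."""
    t = fitted_tree.tree_
    imp = np.zeros(p)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                        # leaf: no split at this node
            continue
        # decrease in total squared error achieved by this internal node
        gain = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
        imp[t.feature[node]] += gain          # credit the splitting variable
    return imp

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 3 * X[:, 2] + 0.1 * rng.normal(size=400)  # only the third input matters
trees = []
for m in range(10):                           # a small ensemble of subsampled trees
    idx = rng.choice(400, size=300, replace=False)
    trees.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))

I2 = np.mean([squared_relevance(t, 4) for t in trees], axis=0)   # (10.43)
relevance = 100 * np.sqrt(I2) / np.sqrt(I2).max()  # largest scaled to 100
print(relevance)
```

As the text notes, the relevances are the square roots of the averaged squared improvements, scaled so the largest equals 100.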

Figures 10.23 and 10.24 illustrate the use of these averaged and separate relative importances.

10.13.2 Partial Dependence Plots

After the most relevant variables have been identified, the next step is to attempt to understand the nature of the dependence of the approximation f(X) on their joint values. Graphical renderings of f(X) as a function of its arguments provide a comprehensive summary of its dependence on the joint values of the input variables.

Unfortunately, such visualization is limited to low-dimensional views. We can easily display functions of one or two arguments, either continuous or discrete (or mixed), in a variety of different ways; this book is filled with such displays. Functions of slightly higher dimensions can be plotted by conditioning on particular sets of values of all but one or two of the arguments, producing a trellis of plots (Becker et al., 1996), as implemented by the lattice package in R.

For more than two or three variables, viewing functions of the corresponding higher-dimensional arguments is more difficult. A useful alternative can sometimes be to view a collection of plots, each one of which shows the partial dependence of the approximation f(X) on a selected small subset of the input variables. Although such a collection can seldom provide a comprehensive depiction of the approximation, it can often produce helpful clues, especially when f(X) is dominated by low-order interactions (10.40).

Consider the subvector X_S of l < p of the input predictor variables X^T = (X_1, X_2, ..., X_p), indexed by S ⊂ {1, 2, ..., p}. Let C be the complement set, with S ∪ C = {1, 2, ..., p}. A general function f(X) will in principle depend on all of the input variables: f(X) = f(X_S, X_C). One way to define the average or partial dependence of f(X) on X_S is

f_S(X_S) = E_{X_C} f(X_S, X_C).   (10.47)

This is a marginal average of f, and can serve as a useful description of the effect of the chosen subset on f(X) when, for example, the variables in X_S do not have strong interactions with those in X_C.

Partial dependence functions can be used to interpret the results of any "black box" learning method.
They can be estimated by

f̄_S(X_S) = (1/N) Σ_{i=1}^{N} f(X_S, x_{iC}),   (10.48)

where {x_{1C}, x_{2C}, ..., x_{NC}} are the values of X_C occurring in the training data. This requires a pass over the data for each set of joint values of X_S for which f̄_S(X_S) is to be evaluated. This can be computationally intensive,

even for moderately sized data sets. Fortunately with decision trees, f̄_S(X_S) (10.48) can be rapidly computed from the tree itself without reference to the data (Exercise 10.11).

It is important to note that partial dependence functions defined in (10.47) represent the effect of X_S on f(X) after accounting for the (average) effects of the other variables X_C on f(X). They are not the effect of X_S on f(X) ignoring the effects of X_C. The latter is given by the conditional expectation

f̃_S(X_S) = E(f(X_S, X_C) | X_S),   (10.49)

and is the best least squares approximation to f(X) by a function of X_S alone. The quantities f̃_S(X_S) and f̄_S(X_S) will be the same only in the unlikely event that X_S and X_C are independent. For example, if the effect of the chosen variable subset happens to be purely additive,

f(X) = h_1(X_S) + h_2(X_C),   (10.50)

then (10.47) produces h_1(X_S) up to an additive constant. If the effect is purely multiplicative,

f(X) = h_1(X_S) · h_2(X_C),   (10.51)

then (10.47) produces h_1(X_S) up to a multiplicative constant factor. On the other hand, (10.49) will not produce h_1(X_S) in either case. In fact, (10.49) can produce strong effects on variable subsets for which f(X) has no dependence at all.

Viewing plots of the partial dependence of the boosted-tree approximation (10.28) on selected variable subsets can help to provide a qualitative description of its properties. Illustrations are shown in Sections 10.8 and 10.14. Owing to the limitations of computer graphics, and human perception, the size of the subsets X_S must be small (l ≈ 1, 2, 3). There are of course a large number of such subsets, but only those chosen from among the usually much smaller set of highly relevant predictors are likely to be informative. Also, those subsets whose effect on f(X) is approximately additive (10.50) or multiplicative (10.51) will be most revealing.

For K-class classification, there are K separate models (10.44), one for each class. Each one is related to the respective probabilities (10.21) through

f_k(X) = log p_k(X) − (1/K) Σ_{l=1}^{K} log p_l(X).   (10.52)
Thus each f_k(X) is a monotone increasing function of its respective probability on a logarithmic scale. Partial dependence plots of each respective f_k(X) (10.44) on its most relevant predictors (10.45) can help reveal how the log-odds of realizing that class depend on the respective input variables.
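The distinction between the marginal average (10.47), estimated as in (10.48), and the conditional expectation (10.49) can be checked numerically. In the sketch below, f(X_1, X_2) = X_2 ignores X_1 entirely, but X_1 and X_2 are correlated; the estimated partial dependence on X_1 is flat, while a crude conditional average shows the strong spurious effect ρ·x_1 (the setup of Exercise 10.12). All values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.9, 100_000
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)  # corr(x1, x2) = rho
f = lambda a, b: b                                  # depends on X2 only

def partial_dependence(v):                          # estimate (10.48), S = {1}
    return f(v, x2).mean()                          # average over observed x_i2

def conditional_average(v, width=0.1):              # crude estimate of (10.49)
    return f(v, x2[np.abs(x1 - v) < width]).mean()  # condition on X1 near v

for v in (-1.0, 0.0, 1.0):
    print(v, round(partial_dependence(v), 2), round(conditional_average(v), 2))
```

The partial dependence stays near zero for every v, while the conditional average tracks ρ·v, illustrating the text's warning that (10.49) can show strong effects on variables the function does not use. For fitted scikit-learn models, `sklearn.inspection.partial_dependence` performs a similar brute-force average.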

10.14 Illustrations

In this section we illustrate gradient boosting on a number of larger datasets, using different loss functions as appropriate.

10.14.1 California Housing

This data set (Pace and Barry, 1997) is available from the Carnegie-Mellon StatLib repository. It consists of aggregated data from each of 20,460 neighborhoods (1990 census block groups) in California. The response variable Y is the median house value in each neighborhood measured in units of $100,000. The predictor variables are demographics such as median income MedInc, housing density as reflected by the number of houses House, and the average occupancy in each house AveOccup. Also included as predictors are the location of each neighborhood (longitude and latitude), and several quantities reflecting the properties of the houses in the neighborhood: average number of rooms AveRooms and bedrooms AveBedrms. There are thus a total of eight predictors, all numeric.

We fit a gradient boosting model using the MART procedure, with J = 6 terminal nodes, a learning rate (10.41) of ν = 0.1, and the Huber loss criterion for predicting the numeric response. We randomly divided the dataset into a training set (80%) and a test set (20%).

Figure 10.13 shows the average absolute error

AAE = E |y − f̂_M(x)|   (10.53)

as a function of number of iterations M on both the training data and test data. The test error is seen to decrease monotonically with increasing M, more rapidly during the early stages and then leveling off to being nearly constant as iterations increase. Thus, the choice of a particular value of M is not critical, as long as it is not too small. This tends to be the case in many applications. The shrinkage strategy (10.41) tends to eliminate the problem of overfitting, especially for larger data sets.

The value of AAE after 800 iterations is 0.31.
This can be compared to that of the optimal constant predictor median{y_i}, which is 0.89. In terms of more familiar quantities, the squared multiple correlation coefficient of this model is R² = 0.84. Pace and Barry (1997) use a sophisticated spatial autoregression procedure, where prediction for each neighborhood is based on median house values in nearby neighborhoods, using the other predictors as covariates. Experimenting with transformations, they achieved R² = 0.85, predicting log Y. Using log Y as the response, the corresponding value for gradient boosting was R² = 0.86.
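A fit of this kind is easy to reproduce in outline. The sketch below mirrors the settings described above (J = 6 terminal nodes, learning rate 0.1, Huber loss) and tracks the average absolute error (10.53) after each boosting iteration; synthetic heavy-tailed data stand in for the StatLib housing set, so the numbers it produces are not the book's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 8))                    # eight numeric predictors
# heavy-tailed noise (Student t) motivates the robust Huber criterion
y = X[:, 0] + np.sin(2 * X[:, 1]) + 0.3 * rng.standard_t(df=3, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
gbm = GradientBoostingRegressor(loss="huber", learning_rate=0.1,
                                max_leaf_nodes=6, n_estimators=300).fit(X_tr, y_tr)

# staged_predict yields the prediction after each of the M boosting iterations
test_aae = [np.mean(np.abs(y_te - p)) for p in gbm.staged_predict(X_te)]
print(test_aae[0], test_aae[-1])
```

As in Figure 10.13, the test error drops quickly at first and then levels off, so the exact choice of M is not critical provided it is not too small.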

FIGURE 10.13. Average absolute error as a function of number of iterations for the California housing data.

Figure 10.14 displays the relative variable importances for each of the eight predictor variables. Not surprisingly, median income in the neighborhood is the most relevant predictor. Longitude, latitude, and average occupancy all have roughly half the relevance of income, whereas the others are somewhat less influential.

Figure 10.15 shows single-variable partial dependence plots on the most relevant nonlocation predictors. Note that the plots are not strictly smooth. This is a consequence of using tree-based models. Decision trees produce discontinuous piecewise constant models (10.25). This carries over to sums of trees (10.28), with of course many more pieces. Unlike most of the methods discussed in this book, there is no smoothness constraint imposed on the result. Arbitrarily sharp discontinuities can be modeled. The fact that these curves generally exhibit a smooth trend is because that is what is estimated to best predict the response for this problem. This is often the case.

The hash marks at the base of each plot delineate the deciles of the data distribution of the corresponding variables. Note that here the data density is lower near the edges, especially for larger values. This causes the curves to be somewhat less well determined in those regions. The vertical scales of the plots are the same, and give a visual comparison of the relative importance of the different variables.

The partial dependence of median house value on median income is monotonic increasing, being nearly linear over the main body of data. House value is generally monotonic decreasing with increasing average occupancy, except perhaps for average occupancy rates less than one. Median house

FIGURE 10.14. Relative importance of the predictors (MedInc, Longitude, AveOccup, Latitude, HouseAge, AveRooms, AveBedrms, Population) for the California housing data.

value has a nonmonotonic partial dependence on average number of rooms. It has a minimum at approximately three rooms and is increasing both for smaller and larger values.

Median house value is seen to have a very weak partial dependence on house age that is inconsistent with its importance ranking (Figure 10.14). This suggests that this weak main effect may be masking stronger interaction effects with other variables. Figure 10.16 shows the two-variable partial dependence of housing value on joint values of median age and average occupancy. An interaction between these two variables is apparent. For values of average occupancy greater than two, house value is nearly independent of median age, whereas for values less than two there is a strong dependence on age.

Figure 10.17 shows the two-variable partial dependence of the fitted model on joint values of longitude and latitude, displayed as a shaded contour plot. There is clearly a very strong dependence of median house value on the neighborhood location in California. Note that Figure 10.17 is not a plot of house value versus location ignoring the effects of the other predictors (10.49). Like all partial dependence plots, it represents the effect of location after accounting for the effects of the other neighborhood and house attributes (10.47). It can be viewed as representing an extra premium one pays for location. This premium is seen to be relatively large near the Pacific coast, especially in the Bay Area and Los Angeles–San Diego regions.

FIGURE 10.15. Partial dependence of housing value on the nonlocation variables for the California housing data. The red ticks at the base of the plots are deciles of the input variables.

FIGURE 10.16. Partial dependence of house value on median age and average occupancy. There appears to be a strong interaction effect between these two variables.

10.14.2 New Zealand Fisheries

This example concerns the presence and abundance of the Black Oreo Dory, a marine fish found in the oceanic waters around New Zealand. Figure 10.18 shows the locations of 17,000 trawls (deep-water net fishing, with a maximum depth of 2 km), and the red points indicate those 2353 trawls for which the Black Oreo was present, one of over a hundred species regularly recorded. The catch size in kg for each species was recorded for each trawl.

Along with the species catch, a number of environmental measurements are available for each trawl. These include the average depth of the trawl (AvgDepth), and the temperature and salinity of the water. Since the latter two are strongly correlated with depth, Leathwick et al. (2006) derived instead TempResid and SalResid, the residuals obtained when these two measures are adjusted for depth (via separate non-parametric regressions). SSTGrad is a measure of the gradient of the sea surface temperature, and Chla is a broad indicator of ecosystem productivity via satellite-image measurements. SusPartMatter provides a measure of suspended particulate matter, particularly in coastal waters, and is also satellite derived.

The goal of this analysis is to estimate the probability of finding Black Oreo in a trawl, as well as the expected catch size, standardized to take into account the effects of variation in trawl speed and distance, as well as the mesh size of the trawl net. The authors used logistic regression for estimating the probability. For the catch size, it might seem natural to assume a Poisson distribution and model the log of the mean count, but this is often not appropriate because of the excessive number of zeros. Although specialized approaches have been developed, such as the zero-inflated Poisson (Lambert, 1992), they chose a simpler approach. If Y is the (non-negative) catch size,

E(Y | X) = E(Y | Y > 0, X) · Pr(Y > 0 | X).   (10.54)

The second term is estimated by the logistic regression, and the first term can be estimated using only the 2353 trawls with a positive catch.
For the logistic regression the authors used a gradient boosted model (GBM) with binomial deviance loss function, depth-10 trees, and a shrinkage factor ν = 0.025. For the positive-catch regression, they modeled log(Y) using a GBM with squared-error loss (also depth-10 trees, but ν = 0.01), and un-logged the predictions. In both cases they used 10-fold cross-validation for selecting the number of terms, as well as the shrinkage factor.

The models, data, and maps shown here were kindly provided by Dr John Leathwick of the National Institute of Water and Atmospheric Research in New Zealand, and Dr Jane Elith, School of Botany, University of Melbourne. The collection of the research trawl data took place from 1979 to 2005, and was funded by the New Zealand Ministry of Fisheries.

Version 1.5-7 of package gbm in R.
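The two-part decomposition (10.54) can be sketched as follows: one boosted classifier for Pr(Y > 0 | X), one boosted regression for log(Y) on the positive catches, and a product of the two for the expected catch. This is a hedged illustration of the general approach, not the authors' gbm analysis; the synthetic data and all parameter settings are assumptions for the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(3000, 4))
p_pos = 1 / (1 + np.exp(-(X[:, 0] - 0.5)))        # true presence probability
present = rng.uniform(size=3000) < p_pos
# positive catches are lognormal in X[:, 1]; absent trawls record zero
y = np.where(present, np.exp(0.5 * X[:, 1] + 0.2 * rng.normal(size=3000)), 0.0)

# Pr(Y > 0 | X): boosted logistic model on presence/absence
clf = GradientBoostingClassifier(max_depth=3, learning_rate=0.05,
                                 n_estimators=200).fit(X, y > 0)
# E(Y | Y > 0, X): squared-error boosting on log(Y), positive catches only
reg = GradientBoostingRegressor(max_depth=3, learning_rate=0.05,
                                n_estimators=200).fit(X[y > 0], np.log(y[y > 0]))

# combine the two terms of (10.54), un-logging the regression predictions
expected_catch = np.exp(reg.predict(X)) * clf.predict_proba(X)[:, 1]
```

Un-logging the predictions is a simple plug-in, as described in the text; it ignores the retransformation bias of the lognormal, which a more careful analysis might correct.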

FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic zone, showing the locations of 17,000 trawls (small blue dots) taken between 1979 and 2005. The red points indicate trawls for which the species Black Oreo Dory were present.

FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (and 1 s.e. bars), and test deviance on the test data. Also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in left plot) and the GAM model.

Figure 10.19 (left panel) shows the mean binomial deviance for the sequence of GBM models, both for 10-fold CV and test data. There is a modest improvement over the performance of a GAM model, fit using smoothing splines with 8 degrees-of-freedom (df) per term. The right panel shows the ROC curves (see Section 9.2.5) for both models, which measures predictive performance. From this point of view, the performance looks very similar, with GBM perhaps having a slight edge as summarized by the AUC (area under the curve). At the point of equal sensitivity/specificity, GBM achieves 91%, and GAM 90%.

Figure 10.20 summarizes the contributions of the variables in the logistic GBM fit. We see that there is a well-defined depth range over which Black Oreo are caught, with much more frequent capture in colder waters. We do not give details of the quantitative catch model; the important variables were much the same.

All the predictors used in these models are available on a fine geographical grid; in fact they were derived from environmental atlases, satellite images and the like (see Leathwick et al. (2006) for details). This also means that predictions can be made on this grid, and imported into GIS mapping systems. Figure 10.21 shows prediction maps for both presence and catch size, with both standardized to a common set of trawl conditions; since the predictors vary in a continuous fashion with geographical location, so do the predictions.

FIGURE 10.20. The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables (TempResid, AvgDepth, SusPartMatter, SalResid and SSTGrad), all plotted on the same scale for comparison.

Because of their ability to model interactions and automatically select variables, as well as robustness to outliers and missing data, GBM models are rapidly gaining popularity in this data-rich and enthusiastic community.

10.14.3 Demographics Data

In this section we illustrate gradient boosting on a multiclass classification problem, using MART. The data come from 9243 questionnaires filled out by shopping mall customers in the San Francisco Bay Area (Impact Resources, Inc., Columbus, OH). Among the questions are 14 concerning demographics. For this illustration the goal is to predict occupation using the other variables as predictors, and hence identify demographic variables that discriminate between different occupational categories. We randomly divided the data into a training set (80%) and test set (20%), and used J = 6 node trees with a learning rate ν = 0.1.

Figure 10.22 shows the K = 9 occupation class values along with their corresponding error rates. The overall error rate is 42.5%, which can be compared to the null rate of 69% obtained by predicting the most numerous

FIGURE 10.21. Geological prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.

class Prof/Man (Professional/Managerial). The four best predicted classes are seen to be Retired, Student, Prof/Man, and Homemaker.

Figure 10.23 shows the relative predictor variable importances as averaged over all classes (10.46). Figure 10.24 displays the individual relative importance distributions (10.45) for each of the four best predicted classes. One sees that the most relevant predictors are generally different for each respective class. An exception is age, which is among the three most relevant for predicting Retired, Student, and Prof/Man.

Figure 10.25 shows the partial dependence of the log-odds (10.52) on age for these three classes. The abscissa values are ordered codes for respective equally spaced age intervals. One sees that after accounting for the contributions of the other variables, the odds of being retired are higher for older people, whereas the opposite is the case for being a student. The odds of being professional/managerial are highest for middle-aged people. These results are of course not surprising. They illustrate that inspecting partial dependences separately for each class can lead to sensible results.

Bibliographic Notes

Schapire (1990) developed the first simple boosting procedure in the PAC learning framework (Valiant, 1984; Kearns and Vazirani, 1994). Schapire

FIGURE 10.22. Error rate for each occupation in the demographics data. The overall error rate is 0.425.

FIGURE 10.23. Relative importance of the predictors as averaged over all classes for the demographics data.

FIGURE 10.24. Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data.

FIGURE 10.25. Partial dependence of the odds of three different occupations (Retired, Student, Prof/Man) on age, for the demographics data.

showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin-flip. After learning an initial classifier G_1 on the first N training points, G_2 is learned on a new sample of N points, half of which are misclassified by G_1; G_3 is learned on N points for which G_1 and G_2 disagree; and the boosted classifier is G_B = majority vote(G_1, G_2, G_3). Schapire's "Strength of Weak Learnability" theorem proves that G_B has improved performance over G_1.

Freund (1995) proposed a "boost by majority" variation which combined many weak learners simultaneously and improved the performance of the simple boosting algorithm of Schapire. The theory supporting both of these

Exercises

Ex. 10.4

(a) Write a program implementing AdaBoost with trees.

(b) Redo the computations for the example of Figure 10.2. Plot the training error as well as test error, and discuss its behavior.

(c) Investigate the number of iterations needed to make the test error finally start to rise.

(d) Change the setup of this example as follows: define two classes, with the features in Class 1 being X_1, X_2, ..., X_10, standard independent Gaussian variates. In Class 2, the features X_1, X_2, ..., X_10 are also standard independent Gaussian, but conditioned on the event Σ_j X_j² > 12. Now the classes have significant overlap in feature space. Repeat the AdaBoost experiments as in Figure 10.2 and discuss the results.

Ex. 10.5 Multiclass exponential loss (Zhu et al., 2005). For a K-class classification problem, consider the coding Y = (Y_1, ..., Y_K)^T with

Y_k = 1 if G = G_k, and −1/(K − 1) otherwise.   (10.55)

Let f = (f_1, ..., f_K)^T with Σ_{k=1}^{K} f_k = 0, and define

L(Y, f) = exp(−(1/K) Y^T f).   (10.56)

(a) Using Lagrange multipliers, derive the population minimizer f* of E L(Y, f), subject to the zero-sum constraint, and relate these to the class probabilities.

(b) Show that a multiclass boosting using this loss function leads to a reweighting algorithm similar to AdaBoost, as in Section 10.4.

Ex. 10.6 McNemar test (Agresti, 1996). We report the test error rates on the spam data to be 5.5% for a generalized additive model (GAM), and 4.5% for gradient boosting (GBM), with a test sample of size 1536.

(a) Show that the standard error of these estimates is about 0.6%.

Since the same test data are used for both methods, the error rates are correlated, and we cannot perform a two-sample t-test. We can compare the methods directly on each test observation, leading to the summary

                    GBM Correct   GBM Error
    GAM Correct        1434           18
    GAM Error            33           51

The McNemar test focuses on the discordant errors, 33 vs. 18.

(b) Conduct a test to show that GAM makes significantly more errors than gradient boosting, with a two-sided p-value of 0.036.

Ex. 10.7 Derive expression (10.32).

Ex. 10.8 Consider a K-class problem where the targets y_ik are coded as 1 if observation i is in class k and zero otherwise. Suppose we have a current model f_k(x), k = 1, ..., K, with Σ_{k=1}^{K} f_k(x) = 0 (see (10.21) in Section 10.6). We wish to update the model for observations in a region R in predictor space, by adding constants f_k(x) + γ_k, with γ_K = 0.

(a) Write down the multinomial log-likelihood for this problem, and its first and second derivatives.

(b) Using only the diagonal of the Hessian matrix in (a), and starting from γ_k = 0 ∀k, show that a one-step approximate Newton update for γ_k is

γ_k¹ = Σ_{x_i ∈ R} (y_ik − p_ik) / Σ_{x_i ∈ R} p_ik(1 − p_ik),  k = 1, ..., K − 1,   (10.57)

where p_ik = exp(f_k(x_i)) / Σ_{l=1}^{K} exp(f_l(x_i)).

(c) We prefer our update to sum to zero, as the current model does. Using symmetry arguments, show that

γ̂_k = ((K − 1)/K) (γ_k¹ − (1/K) Σ_{l=1}^{K} γ_l¹),  k = 1, ..., K   (10.58)

is an appropriate update, where γ_k¹ is defined as in (10.57) for all k = 1, ..., K.

Ex. 10.9 Consider a K-class problem where the targets y_ik are coded as 1 if observation i is in class k and zero otherwise. Using the multinomial deviance loss function (10.22) and the symmetric logistic transform, use the arguments leading to the gradient boosting Algorithm 10.3 to derive Algorithm 10.4. Hint: See Exercise 10.8 for step 2(b)iii.

Ex. 10.10 Show that for K = 2 class classification, only one tree needs to be grown at each gradient-boosting iteration.

Ex. 10.11 Show how to compute the partial dependence function f_S(X_S) in (10.47) efficiently.

Ex. 10.12 Referring to (10.49), let S = {1} and C = {2}, with f(X_1, X_2) = X_2. Assume X_1 and X_2 are bivariate Gaussian, each with mean zero, variance one, and E(X_1 X_2) = ρ. Show that E(f(X_1, X_2) | X_1) = ρX_1, even though f is not a function of X_1.

Algorithm 10.4 Gradient Boosting for K-class Classification.

1. Initialize f_k0(x) = 0, k = 1, 2, ..., K.

2. For m = 1 to M:

   (a) Set p_k(x) = exp(f_k(x)) / Σ_{l=1}^{K} exp(f_l(x)), k = 1, 2, ..., K.

   (b) For k = 1 to K:

       i. Compute r_ikm = y_ik − p_k(x_i), i = 1, 2, ..., N.

       ii. Fit a regression tree to the targets r_ikm, i = 1, 2, ..., N, giving terminal regions R_jkm, j = 1, 2, ..., J_m.

       iii. Compute

            γ_jkm = ((K − 1)/K) · Σ_{x_i ∈ R_jkm} r_ikm / Σ_{x_i ∈ R_jkm} |r_ikm| (1 − |r_ikm|),  j = 1, 2, ..., J_m.

       iv. Update f_km(x) = f_{k,m−1}(x) + Σ_{j=1}^{J_m} γ_jkm I(x ∈ R_jkm).

3. Output f̂_k(x) = f_kM(x), k = 1, 2, ..., K.
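The steps of Algorithm 10.4 can be rendered compactly with NumPy and scikit-learn regression trees. This is a minimal sketch under illustrative settings (M, J, and the toy data are assumptions), with the one-step Newton leaf values of step 2(b)iii and no shrinkage.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_kclass(X, y, K, M=50, J=4, eps=1e-10):
    N = len(y)
    Y = np.eye(K)[y]                         # indicator coding y_ik
    F = np.zeros((N, K))                     # 1. initialize f_k0(x) = 0
    ensembles = [[] for _ in range(K)]
    for m in range(M):                       # 2. for m = 1 to M
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)    # (a) class probabilities p_k(x)
        for k in range(K):                   # (b) for k = 1 to K
            r = Y[:, k] - P[:, k]            # i. residuals r_ikm
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)  # ii.
            leaf = tree.apply(X)             # terminal region of each point
            gamma = {}
            for j in np.unique(leaf):        # iii. one-step Newton leaf values
                rj = r[leaf == j]
                gamma[j] = ((K - 1) / K) * rj.sum() / (
                    np.sum(np.abs(rj) * (1 - np.abs(rj))) + eps)
            F[:, k] += np.array([gamma[j] for j in leaf])  # iv. update f_km
            ensembles[k].append((tree, gamma))
    return F, ensembles                      # 3. output fitted functions

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] - X[:, 1] > 1).astype(int)
F, _ = boost_kclass(X, y, K=3)
print((F.argmax(axis=1) == y).mean())        # training accuracy
```

Note that `F - F.max(...)` is the usual numerically stable softmax, and the small `eps` guards the denominator of step iii when residuals vanish; in practice one would also apply shrinkage (10.41) to each update.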


11 Neural Networks

11.1 Introduction

In this chapter we describe a class of learning methods that was developed separately in different fields (statistics and artificial intelligence) based on essentially identical models. The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features. The result is a powerful learning method, with widespread applications in many fields. We first discuss the projection pursuit model, which evolved in the domain of semiparametric statistics and smoothing. The rest of the chapter is devoted to neural network models.

11.2 Projection Pursuit Regression

As in our generic supervised learning problem, assume we have an input vector X with p components, and a target Y. Let ω_m, m = 1, 2, ..., M, be unit p-vectors of unknown parameters. The projection pursuit regression (PPR) model has the form

f(X) = Σ_{m=1}^{M} g_m(ω_m^T X).   (11.1)

This is an additive model, but in the derived features V_m = ω_m^T X rather than the inputs themselves. The functions g_m are unspecified and are esti-

FIGURE 11.1. Perspective plots of two ridge functions. (Left:) g(V) = 1/[1 + exp(−5(V − 0.5))], where V = (X_1 + X_2)/√2. (Right:) g(V) = (V + 0.1) sin(1/(V/3 + 0.1)), where V = X_1.

mated along with the directions ω_m using some flexible smoothing method (see below).

The function g_m(ω_m^T X) is called a ridge function in R^p. It varies only in the direction defined by the vector ω_m. The scalar variable V_m = ω_m^T X is the projection of X onto the unit vector ω_m, and we seek ω_m so that the model fits well, hence the name "projection pursuit." Figure 11.1 shows some examples of ridge functions. In the example on the left ω = (1/√2)(1, 1)^T, so that the function only varies in the direction X_1 + X_2. In the example on the right, ω = (1, 0).

The PPR model (11.1) is very general, since the operation of forming nonlinear functions of linear combinations generates a surprisingly large class of models. For example, the product X_1 · X_2 can be written as [(X_1 + X_2)² − (X_1 − X_2)²]/4, and higher-order products can be represented similarly. In fact, if M is taken arbitrarily large, for appropriate choice of g_m the PPR model can approximate any continuous function in R^p arbitrarily well. Such a class of models is called a universal approximator. However this generality comes at a price. Interpretation of the fitted model is usually difficult, because each input enters into the model in a complex and multifaceted way. As a result, the PPR model is most useful for prediction, and not very useful for producing an understandable model for the data. The M = 1 model, known as the single index model in econometrics, is an exception. It is slightly more general than the linear regression model, and offers a similar interpretation.

How do we fit a PPR model, given training data (x_i, y_i), i = 1, 2, ..., N? We seek the approximate minimizers of the error function

Σ_{i=1}^{N} [ y_i − Σ_{m=1}^{M} g_m(ω_m^T x_i) ]²   (11.2)

over functions g_m and direction vectors ω_m, m = 1, 2, ..., M. As in other smoothing problems, we need either explicitly or implicitly to impose complexity constraints on the g_m, to avoid overfit solutions.

Consider just one term (M = 1, and drop the subscript). Given the direction vector ω, we form the derived variables v_i = ω^T x_i. Then we have a one-dimensional smoothing problem, and we can apply any scatterplot smoother, such as a smoothing spline, to obtain an estimate of g.

On the other hand, given g, we want to minimize (11.2) over ω. A Gauss–Newton search is convenient for this task. This is a quasi-Newton method, in which the part of the Hessian involving the second derivative of g is discarded. It can be simply derived as follows. Let ω_old be the current estimate for ω. We write

g(ω^T x_i) ≈ g(ω_old^T x_i) + g′(ω_old^T x_i)(ω − ω_old)^T x_i    (11.3)

to give

Σ_{i=1}^{N} [ y_i − g(ω^T x_i) ]^2 ≈ Σ_{i=1}^{N} g′(ω_old^T x_i)^2 [ ( ω_old^T x_i + (y_i − g(ω_old^T x_i)) / g′(ω_old^T x_i) ) − ω^T x_i ]^2.    (11.4)

To minimize the right-hand side, we carry out a least squares regression with target ω_old^T x_i + (y_i − g(ω_old^T x_i))/g′(ω_old^T x_i) on the input x_i, with weights g′(ω_old^T x_i)^2 and no intercept (bias) term. This produces the updated coefficient vector ω_new.

These two steps, estimation of g and ω, are iterated until convergence. With more than one term in the PPR model, the model is built in a forward stage-wise manner, adding a pair (ω_m, g_m) at each stage. There are a number of implementation details.

- Although any smoothing method can in principle be used, it is convenient if the method provides derivatives. Local regression and smoothing splines are convenient.

- After each step the g_m's from previous steps can be readjusted using the backfitting procedure described in Chapter 9. While this may lead ultimately to fewer terms, it is not clear whether it improves prediction performance. Usually the ω_m are not readjusted (partly to avoid excessive computation), although in principle they could be as well.
- The number of terms M is usually estimated as part of the forward stage-wise strategy. The model building stops when the next term does not appreciably improve the fit of the model. Cross-validation can also be used to determine M.
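The two alternating steps above, smoothing to estimate g and the Gauss–Newton weighted least squares update (11.4) of ω, can be sketched for a single term (M = 1). This is a minimal illustration under my own assumptions: a polynomial fit stands in for the scatterplot smoother so that g′ is available in closed form, and the simulated ridge function is hypothetical.

```python
import numpy as np

def fit_ppr_single_term(X, y, n_iter=20, degree=5):
    """One-term PPR fit f(x) = g(w^T x), alternating (a) a smooth of y on
    v = Xw to estimate g and (b) the Gauss-Newton weighted least squares
    update of w from (11.4).  A polynomial fit stands in for the
    scatterplot smoother so that g' is available in closed form."""
    N, p = X.shape
    w = np.ones(p) / np.sqrt(p)                 # starting direction, a unit vector
    for _ in range(n_iter):
        v = X @ w
        g = np.poly1d(np.polyfit(v, y, degree)) # step (a): estimate g by smoothing
        gp = g.deriv()(v)
        gp = np.where(np.abs(gp) < 1e-8, 1e-8, gp)  # guard the division
        target = v + (y - g(v)) / gp            # adjusted response in (11.4)
        A = X * (gp ** 2)[:, None]              # weights g'(w_old^T x_i)^2
        w = np.linalg.solve(X.T @ A, A.T @ target)  # weighted LS, no intercept
        w = w / np.linalg.norm(w)               # keep w a unit vector
    v = X @ w
    return w, np.poly1d(np.polyfit(v, y, degree))

# Hypothetical ridge-function data: y = g(w*^T x) + noise, with g(v) = v + v^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
w_star = np.array([2.0, -1.0]) / np.sqrt(5.0)
v_star = X @ w_star
y = v_star + v_star ** 3 + 0.1 * rng.normal(size=500)
w_hat, g_hat = fit_ppr_single_term(X, y)
```

Since g here is odd, the direction is identified only up to sign; one checks |w_hat · w_star| rather than the vector itself.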

There are many other applications, such as density estimation (Friedman et al., 1984; Friedman, 1987), where the projection pursuit idea can be used. In particular, see the discussion of ICA in Section 14.7 and its relationship with exploratory projection pursuit. However the projection pursuit regression model has not been widely used in the field of statistics, perhaps because at the time of its introduction (1981), its computational demands exceeded the capabilities of most readily available computers. But it does represent an important intellectual advance, one that has blossomed in its reincarnation in the field of neural networks, the topic of the rest of this chapter.

11.3 Neural Networks

The term neural network has evolved to encompass a large class of models and learning methods. Here we describe the most widely used "vanilla" neural net, sometimes called the single hidden layer back-propagation network, or single layer perceptron. There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious. As we make clear in this section, they are just nonlinear statistical models, much like the projection pursuit regression model discussed above.

A neural network is a two-stage regression or classification model, typically represented by a network diagram as in Figure 11.2. This network applies both to regression and classification. For regression, typically K = 1 and there is only one output unit Y_1 at the top. However, these networks can handle multiple quantitative responses in a seamless fashion, so we will deal with the general case. For K-class classification, there are K units at the top, with the kth unit modeling the probability of class k. There are K target measurements Y_k, k = 1, ..., K, each being coded as a 0–1 variable for the kth class.
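The 0–1 coding of the K target measurements is just an indicator ("one-hot") response matrix; a minimal sketch:

```python
import numpy as np

def one_hot(labels, K):
    """Code class labels g_i in {0, ..., K-1} as K columns of 0-1 targets:
    y_ik = 1 if observation i is in class k, and 0 otherwise."""
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# Each row has a single 1 in the column of its class.
Y = one_hot(np.array([0, 2, 1, 2]), K=3)
```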
Derived features Z_m are created from linear combinations of the inputs, and then the target Y_k is modeled as a function of linear combinations of the Z_m:

Z_m = σ(α_{0m} + α_m^T X),  m = 1, ..., M,
T_k = β_{0k} + β_k^T Z,  k = 1, ..., K,
f_k(X) = g_k(T),  k = 1, ..., K,    (11.5)

where Z = (Z_1, Z_2, ..., Z_M), and T = (T_1, T_2, ..., T_K). The activation function σ(v) is usually chosen to be the sigmoid σ(v) = 1/(1 + e^{−v}); see Figure 11.3 for a plot of 1/(1 + e^{−v}). Sometimes Gaussian radial basis functions (Chapter 6) are used for the σ(v), producing what is known as a radial basis function network. Neural network diagrams like Figure 11.2 are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers.

FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network, with inputs X_1, ..., X_p, hidden units Z_1, ..., Z_M, and outputs Y_1, ..., Y_K.

Thinking of the constant "1" as an additional input feature, this bias unit captures the intercepts α_{0m} and β_{0k} in model (11.5).

The output function g_k(T) allows a final transformation of the vector of outputs T. For regression we typically choose the identity function g_k(T) = T_k. Early work in K-class classification also used the identity function, but this was later abandoned in favor of the softmax function

g_k(T) = e^{T_k} / Σ_{l=1}^{K} e^{T_l}.    (11.6)

This is of course exactly the transformation used in the multilogit model (Section 4.4), and produces positive estimates that sum to one. In Section 4.2 we discuss other problems with linear activation functions, in particular potentially severe masking effects.

The units in the middle of the network, computing the derived features Z_m, are called hidden units because the values Z_m are not directly observed. In general there can be more than one hidden layer, as illustrated in the example at the end of this chapter. We can think of the Z_m as a basis expansion of the original inputs X; the neural network is then a standard linear model, or linear multilogit model, using these transformations as inputs. There is, however, an important enhancement over the basis-expansion techniques discussed in Chapter 5; here the parameters of the basis functions are learned from the data.
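As a concrete illustration, model (11.5) with sigmoid hidden units and the softmax output (11.6) can be sketched as follows; the array shapes and random weights are illustrative assumptions, not values from the text:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(T):
    """g_k(T) = exp(T_k) / sum_l exp(T_l), computed stably row by row."""
    T = T - T.max(axis=1, keepdims=True)
    e = np.exp(T)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, alpha0, alpha, beta0, beta):
    """Model (11.5): Z = sigma(alpha0 + X alpha), T = beta0 + Z beta,
    f(X) = softmax(T).  X is N x p, alpha is p x M, beta is M x K."""
    Z = sigmoid(alpha0 + X @ alpha)   # hidden (derived) features, N x M
    T = beta0 + Z @ beta              # linear outputs, N x K
    return softmax(T)                 # class probabilities, N x K

# Tiny example: p = 2 inputs, M = 3 hidden units, K = 2 classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
probs = forward(X,
                alpha0=rng.normal(size=3), alpha=rng.normal(size=(2, 3)),
                beta0=rng.normal(size=2),  beta=rng.normal(size=(3, 2)))
```

The rows of `probs` are positive and sum to one, as (11.6) guarantees.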

FIGURE 11.3. Plot of the sigmoid function σ(v) = 1/(1 + exp(−v)) (red curve), commonly used in the hidden layer of a neural network. Included are σ(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate, and we can see that large s amounts to a hard activation at v = 0. Note that σ(s(v − v_0)) shifts the activation threshold from 0 to v_0.

Notice that if σ is the identity function, then the entire model collapses to a linear model in the inputs. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification. By introducing the nonlinear transformation σ, it greatly enlarges the class of linear models. In Figure 11.3 we see that the rate of activation of the sigmoid depends on the norm of α_m, and if ‖α_m‖ is very small, the unit will indeed be operating in the linear part of its activation function.

Notice also that the neural network model with one hidden layer has exactly the same form as the projection pursuit model described above. The difference is that the PPR model uses nonparametric functions g_m(v), while the neural network uses a far simpler function based on σ(v), with three free parameters in its argument. In detail, viewing the neural network model as a PPR model, we identify

g_m(ω_m^T x) = β_m σ(α_{0m} + α_m^T x)
             = β_m σ(α_{0m} + ‖α_m‖(ω_m^T x)),    (11.7)

where ω_m = α_m/‖α_m‖ is the mth unit-vector. Since σ_{β,α_0,s}(v) = β σ(α_0 + sv) has lower complexity than a more general nonparametric g(v), it is not surprising that a neural network might use 20 or 100 such functions, while the PPR model typically uses fewer terms (M = 5 or 10, for example).

Finally, we note that the name "neural networks" derives from the fact that they were first developed as models for the human brain. Each unit represents a neuron, and the connections (links in Figure 11.2) represent synapses. In early models, the neurons fired when the total signal passed to that unit exceeded a certain threshold. In the model above, this corresponds

to the use of a step function for σ(Z) and g_m(T). Later the neural network was recognized as a useful tool for nonlinear statistical modeling, and for this purpose the step function is not smooth enough for optimization. Hence the step function was replaced by a smoother threshold function, the sigmoid in Figure 11.3.

11.4 Fitting Neural Networks

The neural network model has unknown parameters, often called weights, and we seek values for them that make the model fit the training data well. We denote the complete set of weights by θ, which consists of

{α_{0m}, α_m; m = 1, 2, ..., M}  M(p + 1) weights,
{β_{0k}, β_k; k = 1, 2, ..., K}  K(M + 1) weights.    (11.8)

For regression, we use sum-of-squared errors as our measure of fit (error function)

R(θ) = Σ_{k=1}^{K} Σ_{i=1}^{N} (y_ik − f_k(x_i))^2.    (11.9)

For classification we use either squared error or cross-entropy (deviance):

R(θ) = − Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log f_k(x_i),    (11.10)

and the corresponding classifier is G(x) = argmax_k f_k(x). With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

Typically we don't want the global minimizer of R(θ), as this is likely to be an overfit solution. Instead some regularization is needed: this is achieved directly through a penalty term, or indirectly by early stopping. Details are given in the next section.

The generic approach to minimizing R(θ) is by gradient descent, called back-propagation in this setting. Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation. This can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit.
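A small sketch of the bookkeeping in (11.8) and the cross-entropy criterion (11.10); the function names are my own:

```python
import numpy as np

def n_weights(p, M, K):
    """Size of theta in (11.8): M(p+1) hidden-layer weights plus
    K(M+1) output-layer weights."""
    return M * (p + 1) + K * (M + 1)

def cross_entropy(Y, F, eps=1e-12):
    """Error function (11.10): R = -sum_i sum_k y_ik log f_k(x_i).
    Y is the N x K matrix of 0-1 targets, F the N x K fitted probabilities;
    probabilities are clipped away from 0 for numerical safety."""
    return -np.sum(Y * np.log(np.clip(F, eps, 1.0)))

# For example, a net with p = 256 inputs, M = 12 hidden units and K = 10
# outputs has 12*257 + 10*13 = 3214 weights.
```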

Here is back-propagation in detail for squared error loss. Let z_mi = σ(α_{0m} + α_m^T x_i), from (11.5), and let z_i = (z_1i, z_2i, ..., z_Mi). Then we have

R(θ) ≡ Σ_{i=1}^{N} R_i = Σ_{i=1}^{N} Σ_{k=1}^{K} (y_ik − f_k(x_i))^2,    (11.11)

with derivatives

∂R_i/∂β_km = −2(y_ik − f_k(x_i)) g_k′(β_k^T z_i) z_mi,
∂R_i/∂α_ml = − Σ_{k=1}^{K} 2(y_ik − f_k(x_i)) g_k′(β_k^T z_i) β_km σ′(α_m^T x_i) x_il.    (11.12)

Given these derivatives, a gradient descent update at the (r + 1)st iteration has the form

β_km^{(r+1)} = β_km^{(r)} − γ_r Σ_{i=1}^{N} ∂R_i/∂β_km^{(r)},
α_ml^{(r+1)} = α_ml^{(r)} − γ_r Σ_{i=1}^{N} ∂R_i/∂α_ml^{(r)},    (11.13)

where γ_r is the learning rate, discussed below. Now write (11.12) as

∂R_i/∂β_km = δ_ki z_mi,
∂R_i/∂α_ml = s_mi x_il.    (11.14)

The quantities δ_ki and s_mi are "errors" from the current model at the output and hidden layer units, respectively. From their definitions, these errors satisfy

s_mi = σ′(α_m^T x_i) Σ_{k=1}^{K} β_km δ_ki,    (11.15)

known as the back-propagation equations. Using this, the updates in (11.13) can be implemented with a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values f̂_k(x_i) are computed from formula (11.5). In the backward pass, the errors δ_ki are computed, and then back-propagated via (11.15) to give the errors s_mi. Both sets of errors are then used to compute the gradients for the updates in (11.13), via (11.14).
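The forward and backward sweeps of (11.11)–(11.15) can be sketched for a single-output regression network with identity g. This is a minimal illustration of my own (the matrix layout, variable names, and the finite-difference check are assumptions, not the book's code):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop(X, y, alpha, beta):
    """Gradient of R = sum_i (y_i - f(x_i))^2 for a single-output regression
    net f(x) = beta_0 + beta^T z, z_m = sigmoid(alpha_m0 + alpha_m^T x),
    computed by the two-pass algorithm of (11.12)-(11.15).
    alpha is M x (p+1) with the bias in column 0; beta has length M+1 with
    the bias first.  Returns (R, dR/dalpha, dR/dbeta)."""
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])   # prepend the constant "1" input
    Z = sigmoid(X1 @ alpha.T)              # forward pass: hidden features, N x M
    Z1 = np.hstack([np.ones((N, 1)), Z])
    f = Z1 @ beta                          # identity output function g
    resid = y - f
    R = np.sum(resid ** 2)
    delta = -2.0 * resid                   # output-unit errors delta_i (g' = 1)
    # backward pass, equation (11.15): hidden-unit errors s_mi
    S = Z * (1.0 - Z) * np.outer(delta, beta[1:])
    grad_beta = Z1.T @ delta               # (11.14): delta_i z_mi, summed over i
    grad_alpha = S.T @ X1                  # (11.14): s_mi x_il, summed over i
    return R, grad_alpha, grad_beta

# Sanity check one coordinate against a finite difference.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
alpha = 0.1 * rng.normal(size=(4, 4))      # M = 4 hidden units, p = 3 inputs
beta = 0.1 * rng.normal(size=5)
R, gA, gB = backprop(X, y, alpha, beta)
eps = 1e-6
alpha_pert = alpha.copy()
alpha_pert[1, 2] += eps
num_grad = (backprop(X, y, alpha_pert, beta)[0] - R) / eps
```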

This two-pass procedure is what is known as back-propagation. It has also been called the delta rule (Widrow and Hoff, 1960). The computational components for cross-entropy have the same form as those for the sum of squares error function, and are derived in Exercise 11.3.

The advantages of back-propagation are its simple, local nature. In the back-propagation algorithm, each hidden unit passes and receives information only to and from units that share a connection. Hence it can be implemented efficiently on a parallel architecture computer.

The updates in (11.13) are a kind of batch learning, with the parameter updates being a sum over all of the training cases. Learning can also be carried out online, processing each observation one at a time, updating the gradient after each training case, and cycling through the training cases many times. In this case, the sums in equations (11.13) are replaced by a single summand. A training epoch refers to one sweep through the entire training set. Online training allows the network to handle very large training sets, and also to update the weights as new observations come in.

The learning rate γ_r for batch learning is usually taken to be a constant, and can also be optimized by a line search that minimizes the error function at each update. With online learning γ_r should decrease to zero as the iteration r → ∞. This learning is a form of stochastic approximation (Robbins and Munro, 1951); results in this field ensure convergence if γ_r → 0, Σ_r γ_r = ∞, and Σ_r γ_r^2 < ∞ (satisfied, for example, by γ_r = 1/r).

Back-propagation can be very slow, and for that reason is usually not the method of choice. Second-order techniques such as Newton's method are not attractive here, because the second derivative matrix of R (the Hessian) can be very large. Better approaches to fitting include conjugate gradients and variable metric methods.
These avoid explicit computation of the second derivative matrix while still providing faster convergence.

11.5 Some Issues in Training Neural Networks

There is quite an art in training neural networks. The model is generally overparametrized, and the optimization problem is nonconvex and unstable unless certain guidelines are followed. In this section we summarize some of the important issues.

11.5.1 Starting Values

Note that if the weights are near zero, then the operative part of the sigmoid (Figure 11.3) is roughly linear, and hence the neural network collapses into an approximately linear model (Exercise 11.2). Usually starting values for weights are chosen to be random values near zero. Hence the model starts out nearly linear, and becomes nonlinear as the weights increase. Individual

units localize to directions and introduce nonlinearities where needed. Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting instead with large weights often leads to poor solutions.

11.5.2 Overfitting

Often neural networks have too many weights and will overfit the data at the global minimum of R. In early developments of neural networks, either by design or by accident, an early stopping rule was used to avoid overfitting. Here we train the model only for a while, and stop well before we approach the global minimum. Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model. A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.

A more explicit method for regularization is weight decay, which is analogous to ridge regression used for linear models (Section 3.4.1). We add a penalty to the error function R(θ) + λJ(θ), where

J(θ) = Σ_{km} β_km^2 + Σ_{ml} α_ml^2    (11.16)

and λ ≥ 0 is a tuning parameter. Larger values of λ will tend to shrink the weights toward zero: typically cross-validation is used to estimate λ. The effect of the penalty is to simply add terms 2β_km and 2α_ml to the respective gradient expressions (11.13). Other forms for the penalty have been proposed, for example,

J(θ) = Σ_{km} β_km^2/(1 + β_km^2) + Σ_{ml} α_ml^2/(1 + α_ml^2),    (11.17)

known as the weight elimination penalty. This has the effect of shrinking smaller weights more than (11.16) does.

Figure 11.4 shows the result of training a neural network with ten hidden units, without weight decay (upper panel) and with weight decay (lower panel), to the mixture example of Chapter 2. Weight decay has clearly improved the prediction. Figure 11.5 shows heat maps of the estimated weights from the training (grayscale versions of these are called Hinton diagrams). We see that weight decay has dampened the weights in both layers: the resulting weights are spread fairly evenly over the ten hidden units.

11.5.3
Scaling of the Inputs

Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final

FIGURE 11.4. A neural network on the mixture example of Chapter 2. The upper panel (Neural Network - 10 Units, No Weight Decay; Training Error: 0.100, Test Error: 0.259, Bayes Error: 0.210) uses no weight decay, and overfits the training data. The lower panel (Neural Network - 10 Units, Weight Decay=0.02; Training Error: 0.160, Test Error: 0.223, Bayes Error: 0.210) uses weight decay, and achieves close to the Bayes error rate (broken purple boundary). Both use the softmax activation function and cross-entropy error.

FIGURE 11.5. Heat maps of the estimated weights from the training of the neural networks from Figure 11.4 (no weight decay, left; weight decay, right). The display ranges from bright green (negative) to bright red (positive).

solution. At the outset it is best to standardize all inputs to have mean zero and standard deviation one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over the range [−0.7, +0.7].

11.5.4 Number of Hidden Units and Layers

Generally speaking it is better to have too many hidden units than too few. With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data; with too many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used. Typically the number of hidden units is somewhere in the range of 5 to 100, with the number increasing with the number of inputs and number of training cases. It is most common to put down a reasonably large number of units and train them with regularization. Some researchers use cross-validation to estimate the optimal number, but this seems unnecessary if cross-validation is used to estimate the regularization parameter. Choice of the number of hidden layers is guided by background knowledge and experimentation. Each layer extracts features of the input for regression or classification. Use of multiple hidden layers allows construction of hierarchical features at different levels of resolution. An example of the effective use of multiple layers is given in Section 11.7.

11.5.5 Multiple Minima

The error function R(θ) is nonconvex, possessing many local minima. As a result, the final solution obtained is quite dependent on the choice of starting weights. One must at least try a number of random starting configurations, and choose the solution giving lowest (penalized) error. Probably a better approach is to use the average predictions over the collection of networks as the final prediction (Ripley, 1996). This is preferable to averaging the weights, since the nonlinearity of the model implies that this averaged solution could be quite poor. Another approach is via bagging, which averages the predictions of networks trained from randomly perturbed versions of the training data. This is described in Section 8.7.

11.6 Example: Simulated Data

We generated data from two additive error models Y = f(X) + ε:

Sum of sigmoids: Y = σ(a_1^T X) + σ(a_2^T X) + ε_1;
Radial: Y = ∏_{m=1}^{10} φ(X_m) + ε_2.

Here X^T = (X_1, X_2, ..., X_p), each X_j being a standard Gaussian variate, with p = 2 in the first model, and p = 10 in the second. For the sigmoid model, a_1 = (3, 3), a_2 = (3, −3); for the radial model, φ(t) = (1/2π)^{1/2} exp(−t^2/2). Both ε_1 and ε_2 are Gaussian errors, with variance chosen so that the signal-to-noise ratio

Var(E(Y|X)) / Var(Y − E(Y|X)) = Var(f(X)) / Var(ε)    (11.18)

is 4 in both models. We took a training sample of size 100 and a test sample of size 10,000. We fit neural networks with weight decay and various numbers of hidden units, and recorded the average test error E_Test(Y − f̂(X))^2 for each of 10 random starting weights. Only one training set was generated, but the results are typical for an "average" training set. The test errors are shown in Figure 11.6. Note that the zero hidden unit model refers to linear least squares regression. The neural network is perfectly suited to the sum of sigmoids model, and the two-unit model does perform the best, achieving an error close to the Bayes rate. (Recall that the Bayes rate for regression with squared error is the error variance; in the figures, we report test error relative to the Bayes error.) Notice, however, that with more hidden units, overfitting quickly creeps in, and with some starting weights the model does worse than the linear (zero hidden unit) model.
Even with two hidden units, two of the ten starting weight configurations produced results no better than the linear model, confirming the importance of multiple starting values.

A radial function is in a sense the most difficult for the neural net, as it is spherically symmetric and with no preferred directions. We see in the right

FIGURE 11.6. Boxplots of test error, for the simulated data example, relative to the Bayes error (broken horizontal line). The true function is a sum of two sigmoids on the left, and a radial function on the right. The test error is displayed for 10 different starting weights, for a single hidden layer neural network with the number of units as indicated.

panel of Figure 11.6 that it does poorly in this case, with the test error staying well above the Bayes error (note the different vertical scale from the left panel). In fact, since a constant fit (such as the sample average) achieves a relative error of 5 (when the SNR is 4), we see that the neural networks perform increasingly worse than the mean.

In this example we used a fixed weight decay parameter of 0.0005, representing a mild amount of regularization. The results in the left panel of Figure 11.6 suggest that more regularization is needed with greater numbers of hidden units.

In Figure 11.7 we repeated the experiment for the sum of sigmoids model, with no weight decay in the left panel, and stronger weight decay (λ = 0.1) in the right panel. With no weight decay, overfitting becomes even more severe for larger numbers of hidden units. The weight decay value λ = 0.1 produces good results for all numbers of hidden units, and there does not appear to be overfitting as the number of units increases. Finally, Figure 11.8 shows the test error for a ten hidden unit network, varying the weight decay parameter over a wide range. The value 0.1 is approximately optimal.

In summary, there are two free parameters to select: the weight decay λ and number of hidden units M. As a learning strategy, one could fix either parameter at the value corresponding to the least constrained model, to ensure that the model is rich enough, and use cross-validation to choose the other parameter. Here the least constrained values are zero weight decay and ten hidden units.
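A sketch of how the sum-of-sigmoids data above might be generated; the Monte Carlo calibration of the noise variance to hit the signal-to-noise ratio (11.18) is my own device, not a procedure described in the text.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sum_of_sigmoids_data(N, snr=4.0, seed=0):
    """Y = sigma(a1^T X) + sigma(a2^T X) + eps, with X ~ N(0, I_2) and
    Var(eps) set so that Var(f(X)) / Var(eps) = snr, as in (11.18).
    Var(f(X)) is estimated from a large Monte Carlo sample."""
    rng = np.random.default_rng(seed)
    a1, a2 = np.array([3.0, 3.0]), np.array([3.0, -3.0])
    f = lambda X: sigmoid(X @ a1) + sigmoid(X @ a2)
    var_f = f(rng.normal(size=(100000, 2))).var()  # calibrate the noise level
    sd_eps = np.sqrt(var_f / snr)
    X = rng.normal(size=(N, 2))
    y = f(X) + sd_eps * rng.normal(size=N)
    return X, y, sd_eps

X_train, y_train, sd = sum_of_sigmoids_data(100)   # training sample of size 100
```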
Comparing the left panel of Figure 11.7 to Figure 11.8, we see that the test error is less sensitive to the value of the weight

FIGURE 11.7. Boxplots of test error, for the simulated data example, relative to the Bayes error. The true function is a sum of two sigmoids. The test error is displayed for ten different starting weights, for a single hidden layer neural network with the number of units as indicated. The two panels represent no weight decay (left) and strong weight decay λ = 0.1 (right).

FIGURE 11.8. Boxplots of test error, for the simulated data example. The true function is a sum of two sigmoids. The test error is displayed for ten different starting weights, for a single hidden layer neural network with ten hidden units and weight decay parameter value as indicated.

decay parameter, and hence cross-validation of this parameter would be preferred.

11.7 Example: ZIP Code Data

This example is a character recognition task: classification of handwritten numerals. This problem captured the attention of the machine learning and neural network community for many years, and has remained a benchmark problem in the field. Figure 11.9 shows some examples of normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images shown here have been deslanted and size normalized, resulting in 16 × 16 grayscale images (Le Cun et al., 1990). These 256 pixel values are used as inputs to the neural network classifier.

FIGURE 11.9. Examples of training cases from ZIP code data. Each image is a 16 × 16 8-bit grayscale representation of a handwritten digit.

A black box neural network is not ideally suited to this pattern recognition task, partly because the pixel representation of the images lacks certain invariances (such as small rotations of the image). Consequently early attempts with neural networks yielded misclassification rates around 4.5% on various examples of the problem. In this section we show some of the pioneering efforts to handcraft the neural network to overcome some of these deficiencies (Le Cun, 1989), which ultimately led to the state of the art in neural network performance (Le Cun et al., 1998).

Although current digit datasets have tens of thousands of training and test examples, the sample size here is deliberately modest in order to emphasize the effects. (The figures and tables in this example were recreated from Le Cun (1989).) The examples were obtained by scanning some actual hand-drawn digits, and then generating additional images by random horizontal shifts. Details may be found in Le Cun (1989). There are 320 digits in the training set, and 160 in the test set.

Five different networks were fit to the data:

Net-1: No hidden layer, equivalent to multinomial logistic regression.
Net-2: One hidden layer, 12 hidden units fully connected.
Net-3: Two hidden layers locally connected.
Net-4: Two hidden layers, locally connected with weight sharing.
Net-5: Two hidden layers, locally connected, two levels of weight sharing.

These are depicted in Figure 11.10.

FIGURE 11.10. Architecture of the five networks used in the ZIP code example.

Net-1 for example has 256 inputs, one each for the 16 × 16 input pixels, and ten output units, one for each of the digits 0–9. The predicted value f̂_k(x) represents the estimated probability that an image x has digit class k, for k = 0, 1, 2, ..., 9.

FIGURE 11.11. Test performance curves, as a function of the number of training epochs, for the five networks of Table 11.1 applied to the ZIP code data (Le Cun, 1989).

The networks all have sigmoidal output units, and were all fit with the sum-of-squares error function. The first network has no hidden layer, and hence is nearly equivalent to a linear multinomial regression model (Exercise 11.4). Net-2 is a single hidden layer network with 12 hidden units, of the kind described above. The training set error for all of the networks was 0%, since in all cases there are more parameters than training observations. The evolution of the test error during the training epochs is shown in Figure 11.11. The linear network (Net-1) starts to overfit fairly quickly, while test performance of the others levels off at successively superior values.

The other three networks have additional features which demonstrate the power and flexibility of the neural network paradigm. They introduce constraints on the network, natural for the problem at hand, which allow for more complex connectivity but fewer parameters.

Net-3 uses local connectivity: this means that each hidden unit is connected to only a small patch of units in the layer below. In the first hidden layer (an 8 × 8 array), each unit takes inputs from a 3 × 3 patch of the input layer; for units in the first hidden layer that are one unit apart, their receptive fields overlap by one row or column, and hence are two pixels apart. In the second hidden layer, inputs are from a 5 × 5 patch, and again units that are one unit apart have receptive fields that are two units apart. The weights for all other connections are set to zero. Local connectivity makes each unit responsible for extracting local features from the layer below, and

TABLE 11.1. Test set performance of five different neural networks on a handwritten digit classification example (Le Cun, 1989).

Network Architecture            Links   Weights   % Correct
Net-1: Single layer network      2570      2570       80.0%
Net-2: Two layer network         3214      3214       87.0%
Net-3: Locally connected         1226      1226       88.5%
Net-4: Constrained network       2266      1132       94.0%
Net-5: Constrained network       5194      1060       98.4%

reduces considerably the total number of weights. With many more hidden units than Net-2, Net-3 has fewer links and hence weights (1226 vs. 3214), and achieves similar performance.

Net-4 and Net-5 have local connectivity with shared weights. All units in a local feature map perform the same operation on different parts of the image, achieved by sharing the same weights. The first hidden layer of Net-4 has two 8 × 8 arrays, and each unit takes input from a 3 × 3 patch just like in Net-3. However, each of the units in a single 8 × 8 feature map share the same set of nine weights (but have their own bias parameter). This forces the extracted features in different parts of the image to be computed by the same linear functional, and consequently these networks are sometimes known as convolutional networks. The second hidden layer of Net-4 has no weight sharing, and is the same as in Net-3. The gradient of the error function R with respect to a shared weight is the sum of the gradients of R with respect to each connection controlled by the weight in question.

Table 11.1 gives the number of links, the number of weights and the optimal test performance for each of the networks. We see that Net-4 has more links but fewer weights than Net-3, and superior test performance. Net-5 has four 4 × 4 feature maps in the second hidden layer, each unit connected to a 5 × 5 local patch in the layer below. Weights are shared in each of these feature maps. We see that Net-5 does the best, having errors of only 1.6%, compared to 13% for the "vanilla" network Net-2. The clever design of network Net-5, motivated by the fact that features of handwriting style should appear in more than one part of a digit, was the result of many person years of experimentation.
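The link-versus-weight accounting for a locally connected layer with shared weights can be sketched as follows. This is a hypothetical helper illustrating the counting convention suggested by the Net-4 description (per-unit biases, one shared patch kernel per feature map); it is not a reproduction of the exact Table 11.1 totals, which also involve the other layers.

```python
def conv_layer_counts(n_maps, map_h, map_w, patch, shared=True):
    """Links vs. distinct weights for a locally connected hidden layer.
    Each of the n_maps * map_h * map_w units looks at a patch x patch
    window (patch*patch links) plus its own bias link.  With weight
    sharing, the patch weights are common to a whole feature map, so the
    distinct weights are n_maps * patch * patch plus one bias per unit."""
    units = n_maps * map_h * map_w
    links = units * (patch * patch + 1)           # every connection, incl. bias
    if shared:
        weights = n_maps * patch * patch + units  # shared kernels + per-unit biases
    else:
        weights = links                           # no sharing: one weight per link
    return links, weights

# First hidden layer of Net-4: two 8x8 feature maps, 3x3 patches, shared weights.
links, weights = conv_layer_counts(2, 8, 8, 3, shared=True)
```

Sharing leaves the number of links unchanged but collapses the distinct weights dramatically, which is the point of the constrained networks.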
This and similar networks gave better performance on ZIP code problems than any other learning method at that time (early 1990s). This example also shows that neural networks are not a fully automatic tool, as they are sometimes advertised. As with all statistical models, subject matter knowledge can and should be used to improve their performance.

This network was later outperformed by the tangent distance approach (Simard et al., 1993) described in Section 13.3.3, which explicitly incorporates natural affine invariances. At this point the digit recognition datasets became test beds for every new learning procedure, and researchers worked

hard to drive down the error rates. As of this writing, the best error rates on a large database (60,000 training, 10,000 test observations), derived from standard NIST databases, were reported to be the following (Le Cun et al., 1998):

- 1.1% for tangent distance with a 1-nearest neighbor classifier (Section 13.3.3);
- 0.8% for a degree-9 polynomial SVM (Section 12.3);
- 0.8% for LeNet-5, a more complex version of the convolutional network described here;
- 0.7% for boosted LeNet-4. Boosting is described in Chapter 10. LeNet-4 is a predecessor of LeNet-5.

Le Cun et al. (1998) report a much larger table of performance results, and it is evident that many groups have been working very hard to bring these test error rates down. They report a standard error of 0.1% on the error estimates, which is based on a binomial average with N = 10,000 and p ≈ 0.01. This implies that error rates within 0.1–0.2% of one another are statistically equivalent. Realistically the standard error is even higher, since the test data has been implicitly used in the tuning of the various procedures.

11.8 Discussion

Both projection pursuit regression and neural networks take nonlinear functions of linear combinations ("derived features") of the inputs. This is a powerful and very general approach for regression and classification, and has been shown to compete well with the best learning methods on many problems. These tools are especially effective in problems with a high signal-to-noise ratio and settings where prediction without interpretation is the goal. They are less effective for problems where the goal is to describe the physical process that generated the data and the roles of individual inputs. Each input enters into the model in many places, in a nonlinear fashion. Some authors (Hinton, 1989) plot a diagram of the estimated weights into each hidden unit, to try to understand the feature that each unit is extracting. This is limited however by the lack of identifiability of the parameter vectors α_m, m = 1, ..., M.
(Footnote: The National Institute of Standards and Technology maintains large databases, including handwritten character databases.)

Often there are solutions with α_m spanning the same linear space as the ones found during training, giving predicted values that

are roughly the same. Some authors suggest carrying out a principal component analysis of these weights, to try to find an interpretable solution. In general, the difficulty of interpreting these models has limited their use in fields like medicine, where interpretation of the model is very important.

There has been a great deal of research on the training of neural networks. Unlike methods like CART and MARS, neural networks are smooth functions of real-valued parameters. This facilitates the development of Bayesian inference for these models. The next section discusses a successful Bayesian implementation of neural networks.

11.9 Bayesian Neural Nets and the NIPS 2003 Challenge

A classification competition was held in 2003, in which five labeled training datasets were provided to participants. It was organized for a Neural Information Processing Systems (NIPS) workshop. Each of the data sets constituted a two-class classification problem, with different sizes and from a variety of domains (see Table 11.2). Feature measurements for a validation dataset were also available. Participants developed and applied statistical learning procedures to make predictions on the datasets, and could submit predictions to a website on the validation set for a period of 12 weeks. With this feedback, participants were then asked to submit predictions for a separate test set and they received their results. Finally, the class labels for the validation set were released and participants had one week to train their algorithms on the combined training and validation sets, and submit their final predictions to the competition website. A total of 75 groups participated, with 20 and 16 eventually making submissions on the validation and test sets, respectively.

There was an emphasis on feature extraction in the competition. Artificial "probes" were added to the data: these are noise features with distributions resembling the real features but independent of the class labels.
The percentage of probes that were added to each dataset, relative to the total set of features, is shown in Table 11.2. Thus each learning algorithm had to figure out a way of identifying the probes and downweighting or eliminating them.

A number of metrics were used to evaluate the entries, including the percentage correct on the test set, the area under the ROC curve, and a combined score that compared each pair of classifiers head-to-head. The results of the competition are very interesting and are detailed in Guyon et al. (2006). The most notable result: the entries of Neal and Zhang (2006) were the clear overall winners. In the final competition they finished first

TABLE 11.2. NIPS 2003 challenge data sets. The column labeled p is the number of features. For the Dorothea dataset the features are binary. N_tr, N_val and N_te are the number of training, validation and test cases, respectively.

Dataset    Domain                Feature Type   p        Percent Probes   N_tr   N_val   N_te
Arcene     Mass spectrometry     Dense          10,000   30               100    100     700
Dexter     Text classification   Sparse         20,000   50               300    300     2000
Dorothea   Drug discovery        Sparse         100,000  50               800    350     800
Gisette    Digit recognition     Dense          5000     50               6000   1000    6500
Madelon    Artificial            Dense          500      96               2000   600     1800

in three of the five datasets, and were 5th and 7th on the remaining two datasets. In their winning entries, Neal and Zhang (2006) used a series of preprocessing feature-selection steps, followed by Bayesian neural networks, Dirichlet diffusion trees, and combinations of these methods. Here we focus only on the Bayesian neural network approach, and try to discern which aspects of their approach were important for its success. We rerun their programs and compare the results to boosted neural networks and boosted trees, and other related methods.

11.9.1 Bayes, Boosting and Bagging

Let us first review briefly the Bayesian approach to inference and its application to neural networks. Given training data X_tr, y_tr, we assume a sampling model with parameters θ; Neal and Zhang (2006) use a two-hidden-layer neural network, with output nodes the class probabilities Pr(Y|X, θ) for the binary outcomes. Given a prior distribution Pr(θ), the posterior distribution for the parameters is

    Pr(θ|X_tr, y_tr) = Pr(θ) Pr(y_tr|X_tr, θ) / ∫ Pr(θ) Pr(y_tr|X_tr, θ) dθ    (11.19)

For a test case with features X_new, the predictive distribution for the label Y_new is

    Pr(Y_new|X_new, X_tr, y_tr) = ∫ Pr(Y_new|X_new, θ) Pr(θ|X_tr, y_tr) dθ    (11.20)

(c.f. equation (8.24)). Since the integral in (11.20) is intractable, sophisticated Markov chain Monte Carlo (MCMC) methods are used to sample from the posterior distribution Pr(Y_new|X_new, X_tr, y_tr). A few hundred values of θ are generated and then a simple average of these values estimates the integral. Neal and Zhang (2006) use diffuse Gaussian priors for all of the parameters.
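The averaging behind (11.20) is easy to illustrate: given posterior draws of θ, the predictive probability is just the mean of the per-draw probabilities. Below is a toy sketch (not Neal and Zhang's actual sampler), using a made-up one-parameter logistic model and Gaussian stand-ins for MCMC draws.

```python
# Sketch of the Monte Carlo average behind (11.20): given posterior draws
# theta_1, ..., theta_L, estimate Pr(Y_new = 1 | x_new) by averaging
# Pr(Y_new = 1 | x_new, theta_l) over the draws. The "model" here is a
# made-up one-parameter logistic score, purely for illustration.
import math
import random

def prob_given_theta(x_new: float, theta: float) -> float:
    """Pr(Y = 1 | x, theta) under a toy logistic model (an assumption)."""
    return 1.0 / (1.0 + math.exp(-theta * x_new))

random.seed(0)
# Stand-ins for a few hundred MCMC draws from the posterior of theta.
posterior_draws = [random.gauss(1.0, 0.5) for _ in range(300)]

x_new = 2.0
pred = sum(prob_given_theta(x_new, t) for t in posterior_draws) / len(posterior_draws)
print(f"Monte Carlo predictive probability: {pred:.3f}")
```

Replacing the Gaussian stand-ins with genuine posterior samples gives exactly the estimator described in the text.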
The particular MCMC approach that was used is called hybrid Monte Carlo, and may be important for the success of the method. It includes an auxiliary momentum vector and implements Hamiltonian dynamics in which the potential function is the target density. This is done to avoid

random walk behavior; the successive candidates move across the sample space in larger steps. They tend to be less correlated and hence converge to the target distribution more rapidly. Neal and Zhang (2006) also tried different forms of pre-processing of the features:

1. univariate screening using t-tests, and
2. automatic relevance determination.

In the latter method (ARD), the weights (coefficients) for the jth feature to each of the first hidden layer units all share a common prior variance σ_j², and prior mean zero. The posterior distributions for each variance σ_j² are computed, and the features whose posterior variance concentrates on small values are discarded.

There are thus three main features of this approach that could be important for its success: (a) the feature selection and pre-processing, (b) the neural network model, and (c) the Bayesian inference for the model using MCMC. According to Neal and Zhang (2006), feature screening in (a) is carried out purely for computational efficiency; the MCMC procedure is slow with a large number of features. There is no need to use feature selection to avoid overfitting. The posterior average (11.20) takes care of this automatically.

We would like to understand the reasons for the success of the Bayesian method. In our view, the power of modern Bayesian methods does not lie in their use as a formal inference procedure; most people would not believe that the priors in a high-dimensional, complex neural network model are actually correct. Rather, the Bayesian/MCMC approach gives an efficient way of sampling the relevant parts of model space, and then averaging the predictions for the high-probability models.

Bagging and boosting are non-Bayesian procedures that have some similarity to MCMC in a Bayesian model. The Bayesian approach fixes the data and perturbs the parameters, according to the current estimate of the posterior distribution. Bagging perturbs the data in an i.i.d. fashion and then re-estimates the model to give a new set of model parameters.
At the end, a simple average of the model predictions from different bagged samples is computed. Boosting is similar to bagging, but fits a model that is additive in the models of each individual base learner, which are learned using non-i.i.d. samples. We can write all of these models in the form

    f̂(x_new) = Σ_{l=1}^{L} w_l E(Y_new | x_new, θ̂_l)    (11.21)

In all cases the θ̂_l are a large collection of model parameters. For the Bayesian model the w_l = 1/L, and the average estimates the posterior mean (11.21) by sampling θ_l from the posterior distribution. For bagging, w_l = 1/L as well, and the θ̂_l are the parameters refit to bootstrap resamples of the training data. For boosting, the weights are all equal to 1, but the θ̂_l are typically chosen in a nonrandom sequential fashion to constantly improve the fit.

11.9.2 Performance Comparisons

Based on the similarities above, we decided to compare Bayesian neural networks to boosted trees, boosted neural networks, random forests and bagged neural networks on the five datasets in Table 11.2. Bagging and boosting of neural networks are not methods that we have previously used in our work. We decided to try them here, because of the success of Bayesian neural networks in this competition, and the good performance of bagging and boosting with trees. We also felt that by bagging and boosting neural nets, we could assess both the choice of model as well as the model search strategy. Here are the details of the learning methods that were compared:

Bayesian neural nets. The results here are taken from Neal and Zhang (2006), using their Bayesian approach to fitting neural networks. The models had two hidden layers of 20 and 8 units. We re-ran some networks for timing purposes only.

Boosted trees. We used the gbm package (version 1.5-7) in the R language. Tree depth and shrinkage factors varied from dataset to dataset. We consistently bagged 80% of the data at each boosting iteration (the default is 50%). Shrinkage was between 0.001 and 0.1. Tree depth was between 2 and 9.

Boosted neural networks. Since boosting is typically most effective with "weak" learners, we boosted a single hidden layer neural network with two or four units, fit with the nnet package (version 7.2-36) in R.

Random forests. We used the R package randomForest (version 4.5-16) with default settings for the parameters.

Bagged neural networks.
We used the same architecture as in the Bayesian neural network above (two hidden layers of 20 and 8 units), fit using both Neal's C language package "Flexible Bayesian Modeling" (2004-11-10 release), and Matlab neural-net toolbox (version 5.1).
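The unified form (11.21) behind all of these methods can be sketched directly: Bayesian averaging and bagging use weights w_l = 1/L, while boosting uses w_l = 1. The per-model predictions below are made-up toy values, not results from the comparison above.

```python
# Sketch of the unified ensemble form (11.21):
#   f_hat(x_new) = sum_l w_l * E(Y_new | x_new, theta_hat_l).
# The per-model predictions are toy numbers for illustration.

def ensemble_prediction(predictions, weights):
    """Weighted combination of individual model predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

predictions = [0.62, 0.58, 0.71, 0.65]  # E(Y_new | x_new, theta_l), toy values
L = len(predictions)

# Bayesian averaging / bagging: equal weights 1/L (a simple average).
bagged = ensemble_prediction(predictions, [1.0 / L] * L)

# Boosting: weights all equal to 1; the fitted terms add up.
boosted = ensemble_prediction(predictions, [1.0] * L)

print(f"average of predictions: {bagged:.3f}")          # 0.640
print(f"additive (boosting) combination: {boosted:.3f}")  # 2.560
```

The only structural difference between the three procedures, in this notation, is how the θ̂_l are generated and how the w_l are set.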

FIGURE 11.12. Performance of different learning methods on five problems (Arcene, Dexter, Dorothea, Gisette and Madelon), using both univariate screening of features (top panel) and a reduced feature set from automatic relevance determination. The error bars at the top of each plot have width equal to one standard error of the difference between two error rates. On most of the problems several competitors are within this error bound.

This analysis was carried out by Nicholas Johnson, and full details may be found in Johnson (2008). The results are shown in Figure 11.12 and Table 11.3. The figure and table show Bayesian, boosted and bagged neural networks, boosted trees, and random forests, using both the screened and reduced feature sets. The error bars at the top of each plot indicate one standard error of the difference between two error rates. Bayesian neural networks again emerge as the winner, although for some datasets the differences between the test error rates are not statistically significant. Random forests performs the best among the competitors using the selected feature set, while the boosted neural networks perform best with the reduced feature set, and nearly match the Bayesian neural net.

The superiority of boosted neural networks over boosted trees suggests that the neural network model is better suited to these particular problems. Specifically, individual features might not be good predictors here

We also thank Isabelle Guyon for help in preparing the results of this section.

TABLE 11.3. Performance of different methods. Values are average rank of test error across the five problems (low is good), and mean computation time and standard error of the mean, in minutes.

                             Screened Features          ARD Reduced Features
Method                       Average Rank  Average Time  Average Rank  Average Time
Bayesian neural networks     1.5           384(138)      1.6           600(186)
Boosted trees                3.4           3.03(2.5)     4.0           34.1(32.4)
Boosted neural networks      3.8           9.4(8.6)      2.2           35.6(33.5)
Random forests               2.7           1.9(1.7)      3.2           11.2(9.3)
Bagged neural networks       3.6           3.5(1.1)      4.0           6.4(4.4)

and linear combinations of features work better. However, the impressive performance of random forests is at odds with this explanation, and came as a surprise to us. Since the reduced feature sets come from the Bayesian neural network approach, only the methods that use the screened features are legitimate, self-contained procedures. However, this does suggest that better methods for internal feature selection might help the overall performance of boosted neural networks.

The table also shows the approximate training time required for each method. Here the non-Bayesian methods show a clear advantage.

Overall, the superior performance of Bayesian neural networks here may be due to the fact that (a) the neural network model is well suited to these five problems, and (b) the MCMC approach provides an efficient way of exploring the important part of the parameter space, and then averaging the resulting models according to their quality. The Bayesian approach works well for smoothly parametrized models like neural nets; it is not yet clear that it works as well for non-smooth models like trees.

11.10 Computational Considerations

With N observations, p predictors, M hidden units and L training epochs, a neural network fit typically requires O(NpML) operations. There are many packages available for fitting neural networks, probably many more than exist for mainstream statistical methods. Because the available software varies widely in quality, and the learning problem for neural networks is sensitive to issues such as input scaling, such software should be carefully chosen and tested.
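The O(NpML) operation count above can be made concrete with a back-of-the-envelope calculation; the problem sizes below are made up for illustration.

```python
# Back-of-the-envelope cost of a neural network fit, following the
# O(N * p * M * L) count above: each of L epochs visits N cases, and each
# case costs on the order of p * M multiply-adds for a single hidden layer.
def training_cost(n_obs: int, n_inputs: int, n_hidden: int, n_epochs: int) -> int:
    return n_obs * n_inputs * n_hidden * n_epochs

# Made-up example sizes: 6000 observations, 256 inputs, 12 hidden units, 30 epochs.
ops = training_cost(6000, 256, 12, 30)
print(f"~{ops:.2e} operations")  # ~5.53e+08
```

Doubling any one of N, p, M or L doubles the count, which is why feature screening (reducing p) can matter so much for slow fitting procedures.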

Bibliographic Notes

Projection pursuit was proposed by Friedman and Tukey (1974), and specialized to regression by Friedman and Stuetzle (1981). Huber (1985) gives a scholarly overview, and Roosen and Hastie (1994) present a formulation using smoothing splines. The motivation for neural networks dates back to McCulloch and Pitts (1943), Widrow and Hoff (1960) (reprinted in Anderson and Rosenfeld (1988)) and Rosenblatt (1962). Hebb (1949) heavily influenced the development of learning algorithms. The resurgence of neural networks in the mid 1980s was due to Werbos (1974), Parker (1985) and Rumelhart et al. (1986), who proposed the back-propagation algorithm. Today there are many books written on the topic, for a broad range of audiences. For readers of this book, Hertz et al. (1991), Bishop (1995) and Ripley (1996) may be the most informative. Bayesian learning for neural networks is described in Neal (1996). The ZIP code example was taken from Le Cun (1989); see also Le Cun et al. (1990) and Le Cun et al. (1998). We do not discuss theoretical topics such as approximation properties of neural networks, such as the work of Barron (1993), Girosi et al. (1995) and Jones (1992). Some of these results are summarized by Ripley (1996).

Exercises

Ex. 11.1 Establish the exact correspondence between the projection pursuit regression model (11.1) and the neural network (11.5). In particular, show that the single-layer regression network is equivalent to a PPR model with g_m(ω_m^T x) = β_m σ(α_0m + s_m(ω_m^T x)), where ω_m is the mth unit vector. Establish a similar equivalence for a classification network.

Ex. 11.2 Consider a neural network for a quantitative outcome as in (11.5), using squared-error loss and identity output function g_k(t) = t. Suppose that the weights α_m from the input to hidden layer are nearly zero. Show that the resulting model is nearly linear in the inputs.

Ex. 11.3 Derive the forward and backward propagation equations for the cross-entropy loss function.

Ex. 11.4 Consider a neural network for a K class outcome that uses cross-entropy loss.
If the network has no hidden layer, show that the model is equivalent to the multinomial logistic model described in Chapter 4.

Ex. 11.5
(a) Write a program to fit a single hidden layer neural network (ten hidden units) via back-propagation and weight decay.

(b) Apply it to 100 observations from the model

    Y = σ(a_1^T X) + (a_2^T X)² + 0.30 Z,

where σ is the sigmoid function, Z is standard normal, X^T = (X_1, X_2), each X_j being independent standard normal, and a_1 = (3, 3), a_2 = (3, -3). Generate a test sample of size 1000, and plot the training and test error curves as a function of the number of training epochs, for different values of the weight decay parameter. Discuss the overfitting behavior in each case.

(c) Vary the number of hidden units in the network, from 1 up to 10, and determine the minimum number needed to perform well for this task.

Ex. 11.6 Write a program to carry out projection pursuit regression, using cubic smoothing splines with fixed degrees of freedom. Fit it to the data from the previous exercise, for various values of the smoothing parameter and number of model terms. Find the minimum number of model terms necessary for the model to perform well and compare this to the number of hidden units from the previous exercise.

Ex. 11.7 Fit a neural network to the spam data of Section 9.1.2, and compare the results to those for the additive model given in that chapter. Compare both the classification performance and interpretability of the final model.

12
Support Vector Machines and Flexible Discriminants

12.1 Introduction

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable. Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space. The second set of methods generalize Fisher's linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines, penalized discriminant analysis for problems such as signal and image classification where the large number of features are highly correlated, and mixture discriminant analysis for irregularly shaped classes.

12.2 The Support Vector Classifier

In Chapter 4 we discussed a technique for constructing an optimal separating hyperplane between two perfectly separated classes. We review this and generalize to the nonseparable case, where the classes may not be separable by a linear boundary.

FIGURE 12.1. Support vector classifiers. The left panel shows the separable case. The decision boundary is the solid line, while broken lines bound the shaded maximal margin of width 2M = 2/‖β‖. The right panel shows the nonseparable (overlap) case. The points labeled ξ*_j are on the wrong side of their margin by an amount ξ*_j = Mξ_j; points on the correct side have ξ*_j = 0. The margin is maximized subject to a total budget Σ ξ_i ≤ constant. Hence Σ ξ*_j is the total distance of points on the wrong side of their margin.

Our training data consists of N pairs (x_1, y_1), (x_2, y_2), . . . , (x_N, y_N), with x_i ∈ ℝ^p and y_i ∈ {-1, 1}. Define a hyperplane by

    {x : f(x) = x^T β + β_0 = 0},    (12.1)

where β is a unit vector: ‖β‖ = 1. A classification rule induced by f(x) is

    G(x) = sign[x^T β + β_0].    (12.2)

The geometry of hyperplanes is reviewed in Section 4.5, where we show that f(x) in (12.1) gives the signed distance from a point x to the hyperplane f(x) = x^T β + β_0 = 0. Since the classes are separable, we can find a function f(x) = x^T β + β_0 with y_i f(x_i) > 0 ∀i. Hence we are able to find the hyperplane that creates the biggest margin between the training points for class 1 and -1 (see Figure 12.1). The optimization problem

    max_{β, β_0, ‖β‖=1} M
    subject to y_i(x_i^T β + β_0) ≥ M, i = 1, . . . , N,    (12.3)

captures this concept. The band in the figure is M units away from the hyperplane on either side, and hence 2M units wide. It is called the margin.

We showed that this problem can be more conveniently rephrased as

    min_{β, β_0} ‖β‖
    subject to y_i(x_i^T β + β_0) ≥ 1, i = 1, . . . , N,    (12.4)

where we have dropped the norm constraint on β. Note that M = 1/‖β‖. Expression (12.4) is the usual way of writing the support vector criterion for separated data. This is a convex optimization problem (quadratic criterion, linear inequality constraints), and the solution is characterized in Section 4.5.2.

Suppose now that the classes overlap in feature space. One way to deal with the overlap is to still maximize M, but allow for some points to be on the wrong side of the margin. Define the slack variables ξ = (ξ_1, ξ_2, . . . , ξ_N). There are two natural ways to modify the constraint in (12.3):

    y_i(x_i^T β + β_0) ≥ M - ξ_i,    (12.5)
or
    y_i(x_i^T β + β_0) ≥ M(1 - ξ_i),    (12.6)

∀i, ξ_i ≥ 0, Σ_{i=1}^N ξ_i ≤ constant. The two choices lead to different solutions. The first choice seems more natural, since it measures overlap in actual distance from the margin; the second choice measures the overlap in relative distance, which changes with the width of the margin M. However, the first choice results in a nonconvex optimization problem, while the second is convex; thus (12.6) leads to the "standard" support vector classifier, which we use from here on.

Here is the idea of the formulation. The value ξ_i in the constraint y_i(x_i^T β + β_0) ≥ M(1 - ξ_i) is the proportional amount by which the prediction f(x_i) = x_i^T β + β_0 is on the wrong side of its margin. Hence by bounding the sum Σ ξ_i, we bound the total proportional amount by which predictions fall on the wrong side of their margin. Misclassifications occur when ξ_i > 1, so bounding Σ ξ_i at a value K say, bounds the total number of training misclassifications at K.

As in (4.48) in Section 4.5.2, we can drop the norm constraint on β, define M = 1/‖β‖, and write (12.4) in the equivalent form

    min ‖β‖ subject to y_i(x_i^T β + β_0) ≥ 1 - ξ_i ∀i,
                       ξ_i ≥ 0, Σ ξ_i ≤ constant.    (12.7)

This is the usual way the support vector classifier is defined for the nonseparable case. However, we find confusing the presence of the fixed scale "1" in the constraint y_i(x_i^T β + β_0) ≥ 1 - ξ_i, and prefer to start with (12.6). The right panel of Figure 12.1 illustrates this overlapping case.
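For a fixed hyperplane, the slack variables above are easy to compute: with the margin rescaled to 1 as in (12.7), ξ_i = max(0, 1 - y_i f(x_i)). A small sketch, with made-up points and coefficients:

```python
# Sketch of the slack variables in (12.7): for a fixed hyperplane
# f(x) = x^T beta + beta_0, each point's slack is xi_i = max(0, 1 - y_i f(x_i)).
# The data and coefficients below are made up for illustration.

def f(x, beta, beta0):
    """Linear decision function f(x) = x^T beta + beta_0."""
    return sum(b * xj for b, xj in zip(beta, x)) + beta0

beta, beta0 = [1.0, -1.0], 0.0
points = [([2.0, 0.0], +1),   # well beyond the correct margin: xi = 0
          ([0.5, 0.0], +1),   # inside the margin band: 0 < xi < 1
          ([-1.0, 0.0], +1)]  # misclassified: xi > 1

slacks = [max(0.0, 1.0 - y * f(x, beta, beta0)) for x, y in points]
print(slacks)  # [0.0, 0.5, 2.0]
```

The three cases mirror the text: zero slack outside the margin, fractional slack inside the band, and slack greater than 1 exactly when the point is misclassified.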
By the nature of the criterion (12.7), we see that points well inside their class boundary do not play a big role in shaping the boundary. This seems like an attractive property, and one that differentiates it from linear discriminant analysis (Section 4.3). In LDA, the decision boundary is determined by the covariance of the class distributions and the positions of the class centroids. We will see in Section 12.3.2 that logistic regression is more similar to the support vector classifier in this regard.

12.2.1 Computing the Support Vector Classifier

The problem (12.7) is quadratic with linear inequality constraints, hence it is a convex optimization problem. We describe a quadratic programming solution using Lagrange multipliers. Computationally it is convenient to re-express (12.7) in the equivalent form

    min_{β, β_0} ½‖β‖² + C Σ_{i=1}^N ξ_i
    subject to ξ_i ≥ 0, y_i(x_i^T β + β_0) ≥ 1 - ξ_i ∀i,    (12.8)

where the "cost" parameter C replaces the constant in (12.7); the separable case corresponds to C = ∞.

The Lagrange (primal) function is

    L_P = ½‖β‖² + C Σ_{i=1}^N ξ_i - Σ_{i=1}^N α_i[y_i(x_i^T β + β_0) - (1 - ξ_i)] - Σ_{i=1}^N μ_i ξ_i,    (12.9)

which we minimize w.r.t. β, β_0 and ξ_i. Setting the respective derivatives to zero, we get

    β = Σ_{i=1}^N α_i y_i x_i,    (12.10)
    0 = Σ_{i=1}^N α_i y_i,    (12.11)
    α_i = C - μ_i, ∀i,    (12.12)

as well as the positivity constraints α_i, μ_i, ξ_i ≥ 0 ∀i. By substituting (12.10)-(12.12) into (12.9), we obtain the Lagrangian (Wolfe) dual objective function

    L_D = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{i'=1}^N α_i α_{i'} y_i y_{i'} x_i^T x_{i'},    (12.13)

which gives a lower bound on the objective function (12.8) for any feasible point. We maximize L_D subject to 0 ≤ α_i ≤ C and Σ_{i=1}^N α_i y_i = 0. In addition to (12.10)-(12.12), the Karush-Kuhn-Tucker conditions include the constraints

    α_i[y_i(x_i^T β + β_0) - (1 - ξ_i)] = 0,    (12.14)
    μ_i ξ_i = 0,    (12.15)
    y_i(x_i^T β + β_0) - (1 - ξ_i) ≥ 0,    (12.16)

for i = 1, . . . , N. Together these equations (12.10)-(12.16) uniquely characterize the solution to the primal and dual problem.

From (12.10) we see that the solution for β has the form

    β̂ = Σ_{i=1}^N α̂_i y_i x_i,    (12.17)

with nonzero coefficients α̂_i only for those observations i for which the constraints in (12.16) are exactly met (due to (12.14)). These observations are called the support vectors, since β̂ is represented in terms of them alone. Among these support points, some will lie on the edge of the margin (ξ̂_i = 0), and hence from (12.15) and (12.12) will be characterized by 0 < α̂_i < C; the remainder (ξ̂_i > 0) have α̂_i = C. From (12.14) we can see that any of these margin points (0 < α̂_i, ξ̂_i = 0) can be used to solve for β_0, and we typically use an average of all the solutions for numerical stability.

Maximizing the dual (12.13) is a simpler convex quadratic programming problem than the primal (12.9), and can be solved with standard techniques (Murray et al., 1981, for example). Given the solutions β̂_0 and β̂, the decision function can be written as

    Ĝ(x) = sign[f̂(x)] = sign[x^T β̂ + β̂_0].    (12.18)

The tuning parameter of this procedure is the cost parameter C.

12.2.2 Mixture Example (Continued)

Figure 12.2 shows the support vector boundary for the mixture example of Figure 2.5 on page 21, with two overlapping classes, for two different values of the cost parameter C. The classifiers are rather similar in their performance. Points on the wrong side of the boundary are support vectors. In addition, points on the correct side of the boundary but close to it (in the margin), are also support vectors. The margin is larger for C = 0.01 than it is for C = 10,000. Hence larger values of C focus attention more on (correctly classified) points near the decision boundary, while smaller values involve data further away. Either way, misclassified points are given weight, no matter how far away. In this example the procedure is not very sensitive to choices of C, because of the rigidity of a linear boundary.

The optimal value for C can be estimated by cross-validation, as discussed in Chapter 7. Interestingly, the leave-one-out cross-validation error can be bounded above by the proportion of support points in the data.
The reason is that leaving out an observation that is not a support vector will not change the solution. Hence these observations, being classified correctly by the original boundary, will be classified correctly in the cross-validation process. However, this bound tends to be too high, and not generally useful for choosing C (62% and 85%, respectively, in our examples).

[Figure 12.2, upper panel: C = 10,000 (Training Error: 0.270, Test Error: 0.288, Bayes Error: 0.210); lower panel: C = 0.01 (Training Error: 0.26, Test Error: 0.30, Bayes Error: 0.21).]

FIGURE 12.2. The linear support vector boundary for the mixture data example with two overlapping classes, for two different values of C. The broken lines indicate the margins, where f(x) = ±1. The support points (α_i > 0) are all the points on the wrong side of their margin. The black solid dots are those support points falling exactly on the margin (ξ_i = 0, α_i > 0). In the upper panel 62% of the observations are support points, while in the lower panel 85% are. The broken purple curve in the background is the Bayes decision boundary.

12.3 Support Vector Machines and Kernels

The support vector classifier described so far finds linear boundaries in the input feature space. As with other linear methods, we can make the procedure more flexible by enlarging the feature space using basis expansions such as polynomials or splines (Chapter 5). Generally linear boundaries in the enlarged space achieve better training-class separation, and translate to nonlinear boundaries in the original space. Once the basis functions h_m(x), m = 1, . . . , M are selected, the procedure is the same as before. We fit the SV classifier using input features h(x_i) = (h_1(x_i), h_2(x_i), . . . , h_M(x_i)), i = 1, . . . , N, and produce the (nonlinear) function f̂(x) = h(x)^T β̂ + β̂_0. The classifier is Ĝ(x) = sign(f̂(x)) as before.

The support vector machine classifier is an extension of this idea, where the dimension of the enlarged space is allowed to get very large, infinite in some cases. It might seem that the computations would become prohibitive. It would also seem that with sufficient basis functions, the data would be separable, and overfitting would occur. We first show how the SVM technology deals with these issues. We then see that in fact the SVM classifier is solving a function-fitting problem using a particular criterion and form of regularization, and is part of a much bigger class of problems that includes the smoothing splines of Chapter 5. The reader may wish to consult Section 5.8, which provides background material and overlaps somewhat with the next two sections.

12.3.1 Computing the SVM for Classification

We can represent the optimization problem (12.9) and its solution in a special way that only involves the input features via inner products. We do this directly for the transformed feature vectors h(x_i). We then see that for particular choices of h, these inner products can be computed very cheaply.

The Lagrange dual function (12.13) has the form

    L_D = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{i'=1}^N α_i α_{i'} y_i y_{i'} ⟨h(x_i), h(x_{i'})⟩.
(12.19)

From (12.10) we see that the solution function f(x) can be written

    f(x) = h(x)^T β + β_0
         = Σ_{i=1}^N α_i y_i ⟨h(x), h(x_i)⟩ + β_0.    (12.20)

As before, given α_i, β_0 can be determined by solving y_i f(x_i) = 1 in (12.20) for any (or all) x_i for which 0 < α_i < C.

So both (12.19) and (12.20) involve h(x) only through inner products. In fact, we need not specify the transformation h(x) at all, but require only knowledge of the kernel function

    K(x, x') = ⟨h(x), h(x')⟩    (12.21)

that computes inner products in the transformed space. K should be a symmetric positive (semi-) definite function; see Section 5.8.1.

Three popular choices for K in the SVM literature are

    dth-degree polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d,
    Radial basis: K(x, x') = exp(-γ‖x - x'‖²),
    Neural network: K(x, x') = tanh(κ_1⟨x, x'⟩ + κ_2).    (12.22)

Consider for example a feature space with two inputs X_1 and X_2, and a polynomial kernel of degree 2. Then

    K(X, X') = (1 + ⟨X, X'⟩)²
             = (1 + X_1X'_1 + X_2X'_2)²
             = 1 + 2X_1X'_1 + 2X_2X'_2 + (X_1X'_1)² + (X_2X'_2)² + 2X_1X'_1X_2X'_2.    (12.23)

Then M = 6, and if we choose h_1(X) = 1, h_2(X) = √2 X_1, h_3(X) = √2 X_2, h_4(X) = X_1², h_5(X) = X_2², and h_6(X) = √2 X_1X_2, then K(X, X') = ⟨h(X), h(X')⟩. From (12.20) we see that the solution can be written

    f̂(x) = Σ_{i=1}^N α̂_i y_i K(x, x_i) + β̂_0.    (12.24)

The role of the parameter C is clearer in an enlarged feature space, since perfect separation is often achievable there. A large value of C will discourage any positive ξ_i, and lead to an overfit wiggly boundary in the original feature space; a small value of C will encourage a small value of ‖β‖, which in turn causes f(x) and hence the boundary to be smoother. Figure 12.3 shows two nonlinear support vector machines applied to the mixture example of Chapter 2. The regularization parameter was chosen in both cases to achieve good test error. The radial basis kernel produces a boundary quite similar to the Bayes optimal boundary for this example; compare Figure 2.5.

In the early literature on support vectors, there were claims that the kernel property of the support vector machine is unique to it and allows one to finesse the curse of dimensionality. Neither of these claims is true, and we go into both of these issues in the next three subsections.
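The degree-2 expansion in (12.23) can be checked numerically: the kernel value (1 + ⟨X, X'⟩)² must equal the inner product of the six explicit basis functions listed above. A quick sketch with made-up inputs:

```python
# Verify the degree-2 polynomial kernel identity (12.23):
#   K(X, X') = (1 + <X, X'>)^2  equals  <h(X), h(X')>
# with h(X) = (1, sqrt(2) X1, sqrt(2) X2, X1^2, X2^2, sqrt(2) X1 X2).
import math

def kernel(x, z):
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def h(x):
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (0.3, -1.2), (2.0, 1.0)  # made-up points
assert abs(kernel(x, z) - inner(h(x), h(z))) < 1e-12
print(f"{kernel(x, z):.4f}")
```

This is the point of the "kernel trick": the left side costs O(p) regardless of how large the implicit basis on the right side grows.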

[Figure 12.3, upper panel: SVM - Degree-4 Polynomial in Feature Space (Training Error: 0.180, Test Error: 0.245, Bayes Error: 0.210); lower panel: SVM - Radial Kernel in Feature Space (Training Error: 0.160, Test Error: 0.218, Bayes Error: 0.210).]

FIGURE 12.3. Two nonlinear SVMs for the mixture data. The upper plot uses a 4th degree polynomial kernel, the lower a radial basis kernel (with γ = 1). In each case C was tuned to approximately achieve the best test error performance, and C = 1 worked well in both cases. The radial basis kernel performs the best (close to Bayes optimal), as might be expected given the data arise from mixtures of Gaussians. The broken purple curve in the background is the Bayes decision boundary.

FIGURE 12.4. The support vector loss function (hinge loss), compared to the negative log-likelihood loss (binomial deviance) for logistic regression, squared-error loss, and a "Huberized" version of the squared hinge loss. All are shown as a function of yf rather than f, because of the symmetry between the y = +1 and y = -1 case. The deviance and Huber have the same asymptotes as the SVM loss, but are rounded in the interior. All are scaled to have the limiting left-tail slope of -1.

12.3.2 The SVM as a Penalization Method

With f(x) = h(x)^T β + β_0, consider the optimization problem

    min_{β_0, β} Σ_{i=1}^N [1 - y_i f(x_i)]_+ + (λ/2)‖β‖²    (12.25)

where the subscript "+" indicates positive part. This has the form loss + penalty, which is a familiar paradigm in function estimation. It is easy to show (Exercise 12.1) that the solution to (12.25), with λ = 1/C, is the same as that for (12.8).

Examination of the "hinge" loss function L(y, f) = [1 - yf]_+ shows that it is reasonable for two-class classification, when compared to other more traditional loss functions. Figure 12.4 compares it to the log-likelihood loss for logistic regression, as well as squared-error loss and a variant thereof. The (negative) log-likelihood or binomial deviance has similar tails as the SVM loss, giving zero penalty to points well inside their margin, and a

TABLE 12.1. The population minimizers for the different loss functions in Figure 12.4. Logistic regression uses the binomial log-likelihood or deviance. Linear discriminant analysis (Exercise 4.2) uses squared-error loss. The SVM hinge loss estimates the mode of the posterior class probabilities, whereas the others estimate a linear transformation of these probabilities.

Loss Function                  L[y, f(x)]                                Minimizing Function
Binomial Deviance              log[1 + e^{-yf(x)}]                       f(x) = log [Pr(Y = +1|x) / Pr(Y = -1|x)]
SVM Hinge Loss                 [1 - yf(x)]_+                             f(x) = sign[Pr(Y = +1|x) - ½]
Squared Error                  [y - f(x)]² = [1 - yf(x)]²                f(x) = 2Pr(Y = +1|x) - 1
Huberised Square Hinge Loss    -4yf(x), yf(x) < -1;                      f(x) = 2Pr(Y = +1|x) - 1
                               [1 - yf(x)]²_+ otherwise

linear penalty to points on the wrong side and far away. Squared-error, on the other hand, gives a quadratic penalty, and points well inside their own margin have a strong influence on the model as well. The squared hinge loss L(y, f) = [1 - yf]²_+ is like the quadratic, except it is zero for points inside their margin. It still rises quadratically in the left tail, and will be less robust than hinge or deviance to misclassified observations. Recently Rosset and Zhu (2007) proposed a "Huberized" version of the squared hinge loss, which converts smoothly to a linear loss at yf = -1.

We can characterize these loss functions in terms of what they are estimating at the population level. We consider minimizing EL(Y, f(X)). Table 12.1 summarizes the results. Whereas the hinge loss estimates the classifier G(x) itself, all the others estimate a transformation of the class posterior probabilities. The "Huberized" square hinge loss shares attractive properties of logistic regression (smooth loss function, estimates probabilities), as well as the SVM hinge loss (support points).

Formulation (12.25) casts the SVM as a regularized function estimation problem, where the coefficients of the linear expansion f(x) = β_0 + h(x)^T β are shrunk toward zero (excluding the constant). If h(x) represents a hierarchical basis having some ordered structure (such as ordered in roughness),

then the uniform shrinkage makes more sense if the rougher elements h_j in the vector h have smaller norm.

All the loss functions in Table 12.1 except squared-error are so-called "margin maximizing loss functions" (Rosset et al., 2004b). This means that if the data are separable, then the limit of β̂_λ in (12.25) as λ → 0 defines the optimal separating hyperplane.¹

12.3.3 Function Estimation and Reproducing Kernels

Here we describe SVMs in terms of function estimation in reproducing kernel Hilbert spaces, where the kernel property abounds. This material is discussed in some detail in Section 5.8. This provides another view of the support vector classifier, and helps to clarify how it works.

Suppose the basis h arises from the (possibly finite) eigen-expansion of a positive definite kernel K,

  K(x, x′) = Σ_{m=1}^∞ φ_m(x) φ_m(x′) δ_m    (12.26)

and h_m(x) = √δ_m φ_m(x). Then with θ_m = √δ_m β_m, we can write (12.25) as

  min_{β_0, θ}  Σ_{i=1}^N [1 − y_i (β_0 + Σ_{m=1}^∞ θ_m φ_m(x_i))]_+ + (λ/2) Σ_{m=1}^∞ θ_m²/δ_m.    (12.27)

Now (12.27) is identical in form to (5.49) on page 169 in Section 5.8, and the theory of reproducing kernel Hilbert spaces described there guarantees a finite-dimensional solution of the form

  f(x) = β_0 + Σ_{i=1}^N α_i K(x, x_i).    (12.28)

In particular we see there an equivalent version of the optimization criterion (12.19) [Equation (5.67) in Section 5.8.2; see also Wahba et al. (2000)],

  min_{β_0, α}  Σ_{i=1}^N (1 − y_i f(x_i))_+ + (λ/2) α^T K α,    (12.29)

where K is the N × N matrix of kernel evaluations for all pairs of training features (Exercise 12.2). These models are quite general, and include, for example, the entire family of smoothing splines, additive and interaction spline models discussed

¹For logistic regression with separable data, β̂_λ diverges, but β̂_λ/||β̂_λ|| converges to the optimal separating direction.

in Chapters 5 and 9, and in more detail in Wahba (1990) and Hastie and Tibshirani (1990). They can be expressed more generally as

  min_{f∈H}  Σ_{i=1}^N [1 − y_i f(x_i)]_+ + λJ(f),    (12.30)

where H is the structured space of functions, and J(f) an appropriate regularizer on that space. For example, suppose H is the space of additive functions f(x) = Σ_{j=1}^p f_j(x_j), and J(f) = Σ_j ∫ {f_j″(x_j)}² dx_j. Then the solution to (12.30) is an additive cubic spline, and has a kernel representation (12.28) with K(x, x′) = Σ_{j=1}^p K_j(x_j, x_j′). Each of the K_j is the kernel appropriate for the univariate smoothing spline in x_j (Wahba, 1990).

Conversely this discussion also shows that, for example, any of the kernels described in (12.22) above can be used with any convex loss function, and will also lead to a finite-dimensional representation of the form (12.28). Figure 12.5 uses the same kernel functions as in Figure 12.3, except using the binomial log-likelihood as a loss function.² The fitted function is hence an estimate of the log-odds,

  f̂(x) = log [ P̂r(Y = +1|x) / P̂r(Y = −1|x) ] = β̂_0 + Σ_{i=1}^N α̂_i K(x, x_i),    (12.31)

or conversely we get an estimate of the class probabilities

  P̂r(Y = +1|x) = 1 / (1 + e^{−β̂_0 − Σ_{i=1}^N α̂_i K(x, x_i)}).    (12.32)

The fitted models are quite similar in shape and performance. Examples and more details are given in Section 5.8.

It does happen that for SVMs, a sizable fraction of the N values of α_i can be zero (the nonsupport points). In the two examples in Figure 12.3, these fractions are 42% and 45%, respectively. This is a consequence of the piecewise linear nature of the first part of the criterion (12.25). The lower the class overlap (on the training data), the greater this fraction will be. Reducing λ will generally reduce the overlap (allowing a more flexible f). A small number of support points means that f̂(x) can be evaluated more quickly, which is important at lookup time. Of course, reducing the overlap too much can lead to poor generalization.

²Ji Zhu assisted in the preparation of these examples.
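Once coefficients of the kernel expansion have been obtained, the log-odds and the class-probability estimate above are direct to evaluate. A minimal NumPy sketch, using an assumed radial kernel and made-up coefficients (the alpha, beta0 and gamma values here are hypothetical illustrations, not fitted values):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis kernel K(x, z) = exp(-gamma ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2, axis=-1))

def fitted_logodds(x, X_train, alpha, beta0, gamma=1.0):
    """Evaluate f(x) = beta0 + sum_i alpha_i K(x, x_i), the fitted log-odds."""
    return beta0 + np.sum(alpha * rbf_kernel(X_train, x, gamma))

def posterior_prob(x, X_train, alpha, beta0, gamma=1.0):
    """Class-probability estimate 1 / (1 + exp(-f(x)))."""
    return 1.0 / (1.0 + np.exp(-fitted_logodds(x, X_train, alpha, beta0, gamma)))
```

By construction the probability and log-odds are consistent: applying the logit transform to the returned probability recovers f(x) exactly.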

FIGURE 12.5. The logistic regression versions of the SVM models in Figure 12.3, using the identical kernels and hence penalties, but the log-likelihood loss instead of the SVM loss function. The left panel shows a degree-4 polynomial kernel in feature space, the right panel a radial kernel. The two broken contours correspond to posterior probabilities of 0.75 and 0.25 for the +1 class (or vice versa). The broken purple curve in the background is the Bayes decision boundary.

TABLE 12.2. Skin of the orange: Shown are mean (standard error of the mean) of the test error over 50 simulations. BRUTO fits an additive spline model adaptively, while MARS fits a low-order interaction model adaptively.

                                Test Error (SE)
  Method               No Noise Features    Six Noise Features
  1  SV Classifier     0.450 (0.003)        0.472 (0.003)
  2  SVM/poly 2        0.078 (0.003)        0.152 (0.004)
  3  SVM/poly 5        0.180 (0.004)        0.370 (0.004)
  4  SVM/poly 10       0.230 (0.003)        0.434 (0.002)
  5  BRUTO             0.084 (0.003)        0.090 (0.003)
  6  MARS              0.156 (0.004)        0.173 (0.005)
     Bayes             0.029                0.029

12.3.4 SVMs and the Curse of Dimensionality

In this section, we address the question of whether SVMs have some edge on the curse of dimensionality. Notice that in expression (12.22) we are not allowed a fully general inner product in the space of powers and products. For example, all terms of the form 2X_j X_j′ are given equal weight, and the kernel cannot adapt itself to concentrate on subspaces. If the number of features p were large, but the class separation occurred only in the linear subspace spanned by say X_1 and X_2, this kernel would not easily find the structure and would suffer from having many dimensions to search over. One would have to build knowledge about the subspace into the kernel; that is, tell it to ignore all but the first two inputs. If such knowledge were available a priori, much of statistical learning would be made much easier. A major goal of adaptive methods is to discover such structure.

We support these statements with an illustrative example. We generated 100 observations in each of two classes. The first class has four standard normal independent features X_1, X_2, X_3, X_4. The second class also has four standard normal independent features, but conditioned on 9 ≤ Σ X_j² ≤ 16. This is a relatively easy problem. As a second harder problem, we augmented the features with an additional six standard Gaussian noise features. Hence the second class almost completely surrounds the first, like the skin surrounding the orange, in a four-dimensional subspace. The Bayes error rate for this problem is 0.029 (irrespective of dimension). We generated 1000 test observations to compare different procedures.
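The simulation just described is easy to reproduce. A sketch of one way to generate the "skin of the orange" data (the function name and the rejection-sampling approach are ours, for illustration; any sampler for the conditional distribution would do):

```python
import numpy as np

def skin_of_the_orange(n_per_class=100, noise_features=0, seed=0):
    """Generate the 'skin of the orange' data described in the text.

    Class 1 (label -1): four independent standard normal features.
    Class 2 (label +1): the same, but conditioned on 9 <= sum_j X_j^2 <= 16,
    so it surrounds class 1 in a four-dimensional subspace.
    Optionally append independent standard Gaussian noise features.
    """
    rng = np.random.default_rng(seed)
    X1 = rng.normal(size=(n_per_class, 4))
    # Rejection sampling from the conditional (spherical shell) distribution.
    X2 = np.empty((0, 4))
    while X2.shape[0] < n_per_class:
        cand = rng.normal(size=(20 * n_per_class, 4))
        r2 = np.sum(cand ** 2, axis=1)
        X2 = np.vstack([X2, cand[(r2 >= 9) & (r2 <= 16)]])
    X2 = X2[:n_per_class]
    X = np.vstack([X1, X2])
    if noise_features > 0:
        X = np.hstack([X, rng.normal(size=(2 * n_per_class, noise_features))])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y
```

With six noise features appended, each observation has ten features, but the class separation lives entirely in the first four coordinates.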
The average test errors over 50 simulations, with and without noise features, are shown in Table 12.2. Line 1 uses the support vector classifier in the original feature space. Lines 2–4 refer to the support vector machine with a 2-, 5- and 10-dimensional polynomial kernel. For all support vector procedures, we chose the cost parameter C to minimize the test error, to be as fair as possible to the

FIGURE 12.7. A simple example illustrates the SVM path algorithm. (left panel:) This plot illustrates the state of the model at λ = 1/2; the width of the soft margin is 2/||β||. The "+1" points are orange, the "−1" blue. Two blue points {3, 5} are misclassified, while the two orange points {10, 11} are correctly classified, but on the wrong side of their margin f(x) = +1; each of these has y_i f(x_i) < 1. The three square shaped points {2, 6, 7} are exactly on their margins. (right panel:) This plot shows the piecewise linear profiles α_i(λ). The horizontal broken line at λ = 1/2 indicates the state of the α_i for the model in the left plot.

(the value used in Figure 12.3), an intermediate value of C is required. Clearly in situations such as these, we need to determine a good choice for C, perhaps by cross-validation. Here we describe a path algorithm (in the spirit of Section 3.8) for efficiently fitting the entire sequence of SVM models obtained by varying C.

It is convenient to use the loss + penalty formulation (12.25), along with Figure 12.4. This leads to a solution for β at a given value of λ:

  β_λ = (1/λ) Σ_{i=1}^N α_i y_i x_i.    (12.33)

The α_i are again Lagrange multipliers, but in this case they all lie in [0, 1]. Figure 12.7 illustrates the setup. It can be shown that the KKT optimality conditions imply that the labeled points (x_i, y_i) fall into three distinct groups:

• Observations correctly classified and outside their margins. They have y_i f(x_i) > 1, and Lagrange multipliers α_i = 0. Examples are the orange points 8, 9 and 12, and the blue points 1 and 4.

• Observations sitting on their margins with y_i f(x_i) = 1, and Lagrange multipliers α_i ∈ [0, 1]. Examples are the orange 7 and the blue 2 and 8.

• Observations inside their margins have y_i f(x_i) < 1, with α_i = 1. Examples are the blue 3 and 5, and the orange 10 and 11.

The idea for the path algorithm is as follows. Initially λ is large, the margin 1/||β_λ|| is wide, and all points are inside their margin and have α_i = 1. As λ decreases, 1/||β_λ|| decreases, and the margin gets narrower. Some points will move from inside their margins to outside their margins, and their α_i will change from 1 to 0. By continuity of the α_i(λ), these points will linger on the margin during this transition. From (12.33) we see that the points with α_i = 1 make fixed contributions to β(λ), and those with α_i = 0 make no contribution. So all that changes as λ decreases are the α_i ∈ [0, 1] of those (small number of) points on the margin. Since all these points have y_i f(x_i) = 1, this results in a small set of linear equations that prescribe how α_i(λ) and hence β_λ changes during these transitions. This results in piecewise linear paths for each of the α_i(λ). The breaks occur when points cross the margin. Figure 12.7 (right panel) shows the α_i(λ) profiles for the small example in the left panel.

Although we have described this for linear SVMs, exactly the same idea works for nonlinear models, in which (12.33) is replaced by

  f_λ(x) = (1/λ) Σ_{i=1}^N α_i y_i K(x, x_i).    (12.34)

Details can be found in Hastie et al. (2004). An R package svmpath is available on CRAN for fitting these models.

12.3.6 Support Vector Machines for Regression

In this section we show how SVMs can be adapted for regression with a quantitative response, in ways that inherit some of the properties of the SVM classifier. We first discuss the linear regression model

  f(x) = x^T β + β_0,    (12.35)

and then handle nonlinear generalizations.
To estimate β, we consider minimization of

  H(β, β_0) = Σ_{i=1}^N V(y_i − f(x_i)) + (λ/2) ||β||²,    (12.36)

FIGURE 12.8. The left panel shows the ε-insensitive error function V_ε(r) used by the support vector regression machine. The right panel shows the error function V_H(r) used in Huber's robust regression (blue curve). Beyond |c|, the function changes from quadratic to linear.

where

  V_ε(r) = 0 if |r| < ε,  |r| − ε otherwise.    (12.37)

This is an ε-insensitive error measure, ignoring errors of size less than ε (left panel of Figure 12.8). There is a rough analogy with the support vector classification setup, where points on the correct side of the decision boundary and far away from it, are ignored in the optimization. In regression, these "low error" points are the ones with small residuals.

It is interesting to contrast this with error measures used in robust regression in statistics. The most popular, due to Huber (1964), has the form

  V_H(r) = r²/2 if |r| ≤ c,  c|r| − c²/2 if |r| > c,    (12.38)

shown in the right panel of Figure 12.8. This function reduces from quadratic to linear the contributions of observations with absolute residual greater than a prechosen constant c. This makes the fitting less sensitive to outliers. The support vector error measure (12.37) also has linear tails (beyond ε), but in addition it flattens the contributions of those cases with small residuals.

If β̂, β̂_0 are the minimizers of H, the solution function can be shown to have the form

  β̂ = Σ_{i=1}^N (α̂_i* − α̂_i) x_i,    (12.39)

  f̂(x) = Σ_{i=1}^N (α̂_i* − α̂_i) ⟨x, x_i⟩ + β_0,    (12.40)

where α̂_i, α̂_i* are positive and solve the quadratic programming problem

  min_{α_i, α_i*}  ε Σ_{i=1}^N (α_i* + α_i) − Σ_{i=1}^N y_i (α_i* − α_i) + (1/2) Σ_{i,i′=1}^N (α_i* − α_i)(α_i′* − α_i′) ⟨x_i, x_i′⟩

subject to the constraints

  0 ≤ α_i, α_i* ≤ 1/λ,
  Σ_{i=1}^N (α_i* − α_i) = 0,    (12.41)
  α_i α_i* = 0.

Due to the nature of these constraints, typically only a subset of the solution values (α̂_i* − α̂_i) are nonzero, and the associated data values are called the support vectors. As was the case in the classification setting, the solution depends on the input values only through the inner products ⟨x_i, x_i′⟩. Thus we can generalize the methods to richer spaces by defining an appropriate inner product, for example, one of those defined in (12.22).

Note that there are parameters, ε and λ, associated with the criterion (12.36). These seem to play different roles. ε is a parameter of the loss function V_ε, just like c is for V_H. Note that both V_ε and V_H depend on the scale of y and hence r. If we scale our response (and hence use V_H(r/σ) and V_ε(r/σ) instead), then we might consider using preset values for c and ε (the value c = 1.345 achieves 95% efficiency for the Gaussian). The quantity λ is a more traditional regularization parameter, and can be estimated for example by cross-validation.

12.3.7 Regression and Kernels

As discussed in Section 12.3.3, this kernel property is not unique to support vector machines. Suppose we consider approximation of the regression function in terms of a set of basis functions {h_m(x)}, m = 1, 2, ..., M:

  f(x) = Σ_{m=1}^M β_m h_m(x) + β_0.    (12.42)

To estimate β and β_0 we minimize

  H(β, β_0) = Σ_{i=1}^N V(y_i − f(x_i)) + λ Σ_m β_m²    (12.43)

for some general error measure V(r). For any choice of V(r), the solution f̂(x) = Σ β̂_m h_m(x) + β̂_0 has the form

  f̂(x) = Σ_{i=1}^N â_i K(x, x_i)    (12.44)

with K(x, y) = Σ_{m=1}^M h_m(x) h_m(y). Notice that this has the same form as both the radial basis function expansion and a regularization estimate, discussed in Chapters 5 and 6.

For concreteness, let's work out the case V(r) = r². Let H be the N × M basis matrix with im-th element h_m(x_i), and suppose that M > N is large. For simplicity we assume that β_0 = 0, or that the constant is absorbed in h; see Exercise 12.3 for an alternative.

We estimate β by minimizing the penalized least squares criterion

  H(β) = (y − Hβ)^T (y − Hβ) + λ||β||².    (12.45)

The solution is

  ŷ = Hβ̂    (12.46)

with β̂ determined by

  −H^T (y − Hβ̂) + λβ̂ = 0.    (12.47)

From this it appears that we need to evaluate the M × M matrix of inner products in the transformed space. However, we can premultiply by H to give

  Hβ̂ = (HH^T + λI)^{−1} HH^T y.    (12.48)

The N × N matrix HH^T consists of inner products between pairs of observations i, i′; that is, the evaluation of an inner product kernel {HH^T}_{i,i′} = K(x_i, x_i′). It is easy to show (12.44) directly in this case: the predicted values at an arbitrary x satisfy

  f̂(x) = h(x)^T β̂ = Σ_{i=1}^N α̂_i K(x, x_i),    (12.49)

where α̂ = (HH^T + λI)^{−1} y. As in the support vector machine, we need not specify or evaluate the large set of functions h_1(x), h_2(x), ..., h_M(x). Only the inner product kernel K(x_i, x_i′) need be evaluated, at the N training points for each i, i′ and at points x for predictions there. Careful choice of h_m (such as the eigenfunctions of particular, easy-to-evaluate kernels K) means, for example, that HH^T can be computed at a cost of N²/2 evaluations of K, rather than the direct cost N²M.

Note, however, that this property depends on the choice of squared norm ||β||² in the penalty. It does not hold, for example, for the L_1 norm |β|, which may lead to a superior model.
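The primal/dual equivalence above is easy to verify numerically. A small sketch with a random basis matrix H (the data here are synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 8, 20, 0.5          # N observations, M > N basis functions
H = rng.normal(size=(N, M))     # basis matrix, H[i, m] = h_m(x_i)
y = rng.normal(size=N)

# Primal solution: beta = (H^T H + lam I)^{-1} H^T y  (M-dimensional solve)
beta = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)

# Dual solution via the kernel matrix K = H H^T: alpha = (K + lam I)^{-1} y
K = H @ H.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)

# The fits agree: f(x) = h(x)^T beta = sum_i alpha_i K(x, x_i),
# checked at the training points and at a new point x.
assert np.allclose(H @ beta, K @ alpha)
h_new = rng.normal(size=M)      # h(x) at a new point
k_new = H @ h_new               # K(x, x_i) = h(x_i)^T h(x)
assert np.allclose(h_new @ beta, alpha @ k_new)
```

The dual solve is N × N rather than M × M, which is the computational point of the kernel trick when M is much larger than N.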

12.3.8 Discussion

The support vector machine can be extended to multiclass problems, essentially by solving many two-class problems. A classifier is built for each pair of classes, and the final classifier is the one that dominates the most (Kressel, 1999; Friedman, 1996; Hastie and Tibshirani, 1998). Alternatively, one could use the multinomial loss function along with a suitable kernel, as in Section 12.3.3. SVMs have applications in many other supervised and unsupervised learning problems. At the time of this writing, empirical evidence suggests that it performs well in many real learning problems.

Finally, we mention the connection of the support vector machine and structural risk minimization (Section 7.9). Suppose the training points (or their basis expansion) are contained in a sphere of radius R, and let G(x) = sign[f(x)] = sign[β^T x + β_0] as in (12.2). Then one can show that the class of functions {G(x), ||β|| ≤ A} has VC-dimension h satisfying

  h ≤ R²A².    (12.50)

If f(x) separates the training data, optimally for ||β|| ≤ A, then with probability at least 1 − η over training sets (Vapnik, 1996, page 139):

  Error_Test ≤ 4 [h(log(2N/h) + 1) − log(η/4)] / N.    (12.51)

The support vector classifier was one of the first practical learning procedures for which useful bounds on the VC dimension could be obtained, and hence the SRM program could be carried out. However in the derivation, balls are put around the data points, a process that depends on the observed values of the features. Hence in a strict sense, the VC complexity of the class is not fixed a priori, before seeing the features.

The regularization parameter C controls an upper bound on the VC dimension of the classifier. Following the SRM paradigm, we could choose C by minimizing the upper bound on the test error, given in (12.51). However, it is not clear that this has any advantage over the use of cross-validation for choice of C.

12.4 Generalizing Linear Discriminant Analysis

In Section 4.3 we discussed linear discriminant analysis (LDA), a fundamental tool for classification.
For the remainder of this chapter we discuss a class of techniques that produce better classifiers than LDA by directly generalizing LDA. Some of the virtues of LDA are as follows:

• It is a simple prototype classifier. A new observation is classified to the class with closest centroid. A slight twist is that distance is measured in the Mahalanobis metric, using a pooled covariance estimate.

• LDA is the estimated Bayes classifier if the observations are multivariate Gaussian in each class, with a common covariance matrix. Since this assumption is unlikely to be true, this might not seem to be much of a virtue.

• The decision boundaries created by LDA are linear, leading to decision rules that are simple to describe and implement.

• LDA provides natural low-dimensional views of the data. For example, Figure 12.12 is an informative two-dimensional view of data in 256 dimensions with ten classes.

• Often LDA produces the best classification results, because of its simplicity and low variance. LDA was among the top three classifiers for many of the datasets studied in the STATLOG project (Michie et al., 1994).³

Unfortunately the simplicity of LDA causes it to fail in a number of situations as well:

• Often linear decision boundaries do not adequately separate the classes. When N is large, it is possible to estimate more complex decision boundaries. Quadratic discriminant analysis (QDA) is often useful here, and allows for quadratic decision boundaries. More generally we would like to be able to model irregular decision boundaries.

• The aforementioned shortcoming of LDA can often be paraphrased by saying that a single prototype per class is insufficient. LDA uses a single prototype (class centroid) plus a common covariance matrix to describe the spread of the data in each class. In many situations, several prototypes are more appropriate.

• At the other end of the spectrum, we may have way too many (correlated) predictors, for example, in the case of digitized analogue signals and images. In this case LDA uses too many parameters, which are estimated with high variance, and its performance suffers. In cases such as this we need to restrict or regularize LDA even further.

In the remainder of this chapter we describe a class of techniques that attend to all these issues by generalizing the LDA model. This is achieved largely by three different ideas.

The first idea is to recast the LDA problem as a linear regression problem.
Many techniques exist for generalizing linear regression to more flexible, nonparametric forms of regression. This in turn leads to more flexible forms of discriminant analysis, which we call FDA. In most cases of interest, the

³This study predated the emergence of SVMs.

regression procedures can be seen to identify an enlarged set of predictors via basis expansions. FDA amounts to LDA in this enlarged space, the same paradigm used in SVMs.

In the case of too many predictors, such as the pixels of a digitized image, we do not want to expand the set: it is already too large. The second idea is to fit an LDA model, but penalize its coefficients to be smooth or otherwise coherent in the spatial domain, that is, as an image. We call this procedure penalized discriminant analysis or PDA. With FDA itself, the expanded basis set is often so large that regularization is also required (again as in SVMs). Both of these can be achieved via a suitably regularized regression in the context of the FDA model.

The third idea is to model each class by a mixture of two or more Gaussians with different centroids, but with every component Gaussian, both within and between classes, sharing the same covariance matrix. This allows for more complex decision boundaries, and allows for subspace reduction as in LDA. We call this extension mixture discriminant analysis or MDA.

All three of these generalizations use a common framework by exploiting their connection with LDA.

12.5 Flexible Discriminant Analysis

In this section we describe a method for performing LDA using linear regression on derived responses. This in turn leads to nonparametric and flexible alternatives to LDA. As in Chapter 4, we assume we have observations with a quantitative response G falling into one of K classes G = {1, ..., K}, each having measured features X. Suppose θ : G → IR¹ is a function that assigns scores to the classes, such that the transformed class labels are optimally predicted by linear regression on X: if our training sample has the form (g_i, x_i), i = 1, 2, ..., N, then we solve

  min_{β,θ}  Σ_{i=1}^N (θ(g_i) − x_i^T β)²,    (12.52)

with restrictions on θ to avoid a trivial solution (mean zero and unit variance over the training data). This produces a one-dimensional separation between the classes.
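For two balanced classes the score constraints (mean zero, unit variance) essentially fix θ up to sign, and the one-dimensional separation can be seen directly. A small synthetic sketch (the data and class means are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Two Gaussian classes in R^2 with a common covariance (synthetic data).
X = np.vstack([rng.normal(loc=(-1.5, 0.0), size=(n, 2)),
               rng.normal(loc=(+1.5, 0.0), size=(n, 2))])
g = np.repeat([1, 2], n)

# With K = 2 balanced classes, the mean-zero / unit-variance constraints
# fix the scores up to sign: theta(1) = -1, theta(2) = +1.
theta = np.where(g == 1, -1.0, 1.0)

# Least squares fit of the scored labels on X (intercept included),
# i.e. minimize sum_i (theta(g_i) - x_i^T beta)^2 over beta.
Xa = np.hstack([np.ones((2 * n, 1)), X])
beta = np.linalg.lstsq(Xa, theta, rcond=None)[0]
fits = Xa @ beta

# The fitted values give a one-dimensional separation of the two classes.
accuracy = np.mean((fits > 0) == (g == 2))
```

Classifying by the sign of the fitted value recovers, up to scale, the two-class LDA rule.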
More generally, we can find up to L ≤ K − 1 sets of independent scorings for the class labels, θ_1, θ_2, ..., θ_L, and L corresponding linear maps η_l(X) = X^T β_l, l = 1, ..., L, chosen to be optimal for multiple regression in IR^p. The scores θ_l(g) and the maps β_l are chosen to minimize the average squared residual,

  ASR = (1/N) Σ_{l=1}^L [ Σ_{i=1}^N (θ_l(g_i) − x_i^T β_l)² ].    (12.53)

The set of scores are assumed to be mutually orthogonal and normalized with respect to an appropriate inner product to prevent trivial zero solutions.

Why are we going down this road? It can be shown that the sequence of discriminant (canonical) vectors ν_l derived in Section 4.3.3 are identical to the sequence β_l up to a constant (Mardia et al., 1979; Hastie et al., 1995). Moreover, the Mahalanobis distance of a test point x to the kth class centroid μ̂_k is given by

  δ_J(x, μ̂_k) = Σ_{l=1}^{K−1} w_l (η̂_l(x) − η̄_l^k)² + D(x),    (12.54)

where η̄_l^k is the mean of the η̂_l(x_i) in the kth class, and D(x) does not depend on k. Here w_l are coordinate weights that are defined in terms of the mean squared residual r_l² of the lth optimally scored fit

  w_l = 1 / [r_l² (1 − r_l²)].    (12.55)

In Section 4.3.2 we saw that these canonical distances are all that is needed for classification in the Gaussian setup, with equal covariances in each class. To summarize:

  LDA can be performed by a sequence of linear regressions, followed by classification to the closest class centroid in the space of fits. The analogy applies both to the reduced rank version, or the full rank case when L = K − 1.

The real power of this result is in the generalizations that it invites. We can replace the linear regression fits η_l(x) = x^T β_l by far more flexible, nonparametric fits, and by analogy achieve a more flexible classifier than LDA. We have in mind generalized additive fits, spline functions, MARS models and the like. In this more general form the regression problems are defined via the criterion

  ASR({θ_l, η_l}_{l=1}^L) = (1/N) Σ_{l=1}^L [ Σ_{i=1}^N (θ_l(g_i) − η_l(x_i))² + λJ(η_l) ],    (12.56)

where J is a regularizer appropriate for some forms of nonparametric regression, such as smoothing splines, additive splines and lower-order ANOVA spline models. Also included are the classes of functions and associated penalties generated by kernels, as in Section 12.3.3.

Before we describe the computations involved in this generalization, let us consider a very simple example.
Suppose we use degree-2 polynomial regression for each η_l. The decision boundaries implied by (12.54) will be quadratic surfaces, since each of the fitted functions is quadratic, and as

FIGURE 12.9. The data consist of 50 points generated from each of N(0, I) and N(0, 9/4 I). The solid black ellipse is the decision boundary found by FDA using degree-two polynomial regression. The dashed purple circle is the Bayes decision boundary.

in LDA their squares cancel out when comparing distances. We could have achieved identical quadratic boundaries in a more conventional way, by augmenting our original predictors with their squares and cross-products. In the enlarged space one performs an LDA, and the linear boundaries in the enlarged space map down to quadratic boundaries in the original space. A classic example is a pair of multivariate Gaussians centered at the origin, one having covariance matrix I, and the other cI for c > 1; Figure 12.9 illustrates. The Bayes decision boundary is the sphere ||x||² = pc log c/(c − 1), which is a linear boundary in the enlarged space.

Many nonparametric regression procedures operate by generating a basis expansion of derived variables, and then performing a linear regression in the enlarged space. The MARS procedure (Chapter 9) is exactly of this form. Smoothing splines and additive spline models generate an extremely large basis set (N × p basis functions for additive splines), but then perform a penalized regression fit in the enlarged space. SVMs do as well; see also the kernel-based regression example in Section 12.3.7. FDA in this case can be shown to perform a penalized linear discriminant analysis in the enlarged space. We elaborate in Section 12.6.

Linear boundaries in the enlarged space map down to nonlinear boundaries in the reduced space. This is exactly the same paradigm that is used with support vector machines (Section 12.3).

We illustrate FDA on the speech recognition example used in Chapter 4, with K = 11 classes and p = 10 predictors. The classes correspond to

FIGURE 12.10. The left plot shows the first two LDA canonical variates for the vowel training data. The right plot shows the corresponding projection when FDA/BRUTO is used to fit the model; plotted are the fitted regression functions η̂_1(x_i) and η̂_2(x_i). Notice the improved separation. The colors represent the eleven different vowel sounds.

11 vowel sounds, each contained in 11 different words. Here are the words, preceded by the symbols that represent them:

  Vowel  Word     Vowel  Word     Vowel  Word     Vowel  Word
  i:     heed     O      hod      I      hid      C:     hoard
  E      head     U      hood     A      had      u:     who'd
  a:     hard     3:     heard    Y      hud

Each of eight speakers spoke each word six times in the training set, and likewise seven speakers in the test set. The ten predictors are derived from the digitized speech in a rather complicated way, but standard in the speech recognition world. There are thus 528 training observations, and 462 test observations.

Figure 12.10 shows two-dimensional projections produced by LDA and FDA. The FDA model used adaptive additive-spline regression functions to model the η_l(x), and the points plotted in the right plot have coordinates η̂_1(x_i) and η̂_2(x_i). The routine used in S-PLUS is called bruto, hence the heading on the plot and in Table 12.3. We see that flexible modeling has helped to separate the classes in this case. Table 12.3 shows training and test error rates for a number of classification techniques. FDA/MARS refers to Friedman's multivariate adaptive regression splines; degree = 2 means pairwise products are permitted. Notice that for FDA/MARS, the best classification results are obtained in a reduced-rank subspace.

TABLE 12.3. Vowel recognition data performance results. The results for neural networks are the best among a much larger set, taken from a neural network archive. The notation FDA/BRUTO refers to the regression method used with FDA.

  Technique                                        Training   Test
  (1)  LDA                                           0.32     0.56
       Softmax                                       0.48     0.67
  (2)  QDA                                           0.01     0.53
  (3)  CART                                          0.05     0.56
  (4)  CART (linear combination splits)              0.05     0.54
  (5)  Single-layer perceptron                                0.67
  (6)  Multi-layer perceptron (88 hidden units)               0.49
  (7)  Gaussian node network (528 hidden units)               0.45
  (8)  Nearest neighbor                                       0.44
  (9)  FDA/BRUTO                                     0.06     0.44
       Softmax                                       0.11     0.50
  (10) FDA/MARS (degree = 1)                         0.09     0.45
       Best reduced dimension (= 2)                  0.18     0.42
       Softmax                                       0.14     0.48
  (11) FDA/MARS (degree = 2)                         0.02     0.42
       Best reduced dimension (= 6)                  0.13     0.39
       Softmax                                       0.10     0.50

12.5.1 Computing the FDA Estimates

The computations for the FDA coordinates can be simplified in many important cases, in particular when the nonparametric regression procedure can be represented as a linear operator. We will denote this operator by S_λ; that is, ŷ = S_λ y, where y is the vector of responses and ŷ the vector of fits. Additive splines have this property, if the smoothing parameters are fixed, as does MARS once the basis functions are selected. The subscript λ denotes the entire set of smoothing parameters. In this case optimal scoring is equivalent to a canonical correlation problem, and the solution can be computed by a single eigen-decomposition. This is pursued in Exercise 12.6, and the resulting algorithm is presented here.

We create an N × K indicator response matrix Y from the responses g_i, such that y_ik = 1 if g_i = k, otherwise y_ik = 0. For a five-class problem Y might look like the following:

           C1  C2  C3  C4  C5
  g1 = 2    0   1   0   0   0
  g2 = 1    1   0   0   0   0
  g3 = 1    1   0   0   0   0
  g4 = 3    0   0   1   0   0
  g5 = 2    0   1   0   0   0
   ⋮
  gN = 3    0   0   1   0   0

Here are the computational steps:

1. Multivariate nonparametric regression. Fit a multiresponse, adaptive nonparametric regression of Y on X, giving fitted values Ŷ. Let S_λ be the linear operator that fits the final chosen model, and η*(x) be the vector of fitted regression functions.

2. Optimal scores. Compute the eigen-decomposition of Y^T Ŷ = Y^T S_λ Y, where the eigenvectors Θ are normalized: Θ^T D_π Θ = I. Here D_π = Y^T Y/N is a diagonal matrix of the estimated class prior probabilities.

3. Update the model from step 1 using the optimal scores: η(x) = Θ^T η*(x).

The first of the K functions in η(x) is the constant function, a trivial solution; the remaining K − 1 functions are the discriminant functions. The constant function, along with the normalization, causes all the remaining functions to be centered.

Again S_λ can correspond to any regression method. When S_λ = H_X, the linear regression projection operator, then FDA is linear discriminant analysis. The software that we reference in the Computational Considerations section on page 455 makes good use of this modularity; the fda function has a method= argument that allows one to supply any regression function, as long as it follows some natural conventions. The regression functions we provide allow for polynomial regression, adaptive additive models and MARS. They all efficiently handle multiple responses, so step (1) is a single call to a regression routine. The eigen-decomposition in step (2) simultaneously computes all the optimal scoring functions.

In Section 4.2 we discussed the pitfalls of using linear regression on an indicator response matrix as a method for classification. In particular, severe masking can occur with three or more classes. FDA uses the fits from such a regression in step (1), but then transforms them further to produce useful discriminant functions that are devoid of these pitfalls. Exercise 12.9 takes another view of this phenomenon.
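The three steps above can be sketched numerically. A minimal NumPy version with S_λ taken to be ordinary linear regression (in which case, as noted, the procedure reproduces LDA); the data and class layout are synthetic:

```python
import numpy as np

def fda_linear(X, g, K):
    """Steps 1-3 of the FDA algorithm with S_lambda = linear regression."""
    N = len(g)
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                 # indicator response matrix
    Xa = np.hstack([np.ones((N, 1)), X])     # include an intercept
    # Step 1: multiresponse regression of Y on X; Yhat = S_lambda Y.
    B = np.linalg.lstsq(Xa, Y, rcond=None)[0]
    Yhat = Xa @ B
    # Step 2: eigen-decomposition of Y^T Yhat in the D_pi inner product,
    # so that the eigenvectors Theta satisfy Theta^T D_pi Theta = I.
    pi = Y.sum(axis=0) / N                   # class proportions
    M = Y.T @ Yhat / N                       # symmetric, since S is a projection
    Dr = np.diag(1.0 / np.sqrt(pi))
    evals, V = np.linalg.eigh(Dr @ M @ Dr)
    order = np.argsort(evals)[::-1]
    Theta = Dr @ V[:, order]
    # Step 3: update the fits with the optimal scores.
    return evals[order], Yhat @ Theta

# Three well-separated Gaussian classes along the first coordinate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(c, 0.0), size=(50, 2)) for c in (-3.0, 0.0, 3.0)])
g = np.repeat([0, 1, 2], 50)
evals, eta = fda_linear(X, g, K=3)

# The leading "score" is the trivial constant function, with eigenvalue 1.
assert abs(evals[0] - 1.0) < 1e-8
assert np.allclose(eta[:, 0], eta[0, 0])
```

The second column of eta is the first nontrivial discriminant function; on this synthetic example its class means are monotone in the class centers, giving the one-dimensional separation.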

12.6 Penalized Discriminant Analysis

Although FDA is motivated by generalizing optimal scoring, it can also be viewed directly as a form of regularized discriminant analysis. Suppose the regression procedure used in FDA amounts to a linear regression onto a basis expansion h(X), with a quadratic penalty on the coefficients:

  ASR({θ_l, β_l}_{l=1}^L) = (1/N) Σ_{l=1}^L [ Σ_{i=1}^N (θ_l(g_i) − h^T(x_i) β_l)² + λ β_l^T Ω β_l ].    (12.57)

The choice of Ω depends on the problem. If η_l(x) = h(x)^T β_l is an expansion on spline basis functions, Ω might constrain η_l to be smooth over IR^p. In the case of additive splines, there are N spline basis functions for each coordinate, resulting in a total of Np basis functions in h(x); Ω in this case is Np × Np and block diagonal.

The steps in FDA can then be viewed as a generalized form of LDA, which we call penalized discriminant analysis, or PDA:

• Enlarge the set of predictors X via a basis expansion h(X).

• Use (penalized) LDA in the enlarged space, where the penalized Mahalanobis distance is given by

  D(x, μ) = (h(x) − h(μ))^T (Σ_W + λΩ)^{−1} (h(x) − h(μ)),    (12.58)

  where Σ_W is the within-class covariance matrix of the derived variables h(x_i).

• Decompose the classification subspace using a penalized metric:

  max u^T Σ_Bet u  subject to  u^T (Σ_W + λΩ) u = 1.

Loosely speaking, the penalized Mahalanobis distance tends to give less weight to "rough" coordinates, and more weight to "smooth" ones; since the penalty is not diagonal, the same applies to linear combinations that are rough or smooth.

For some classes of problems, the first step, involving the basis expansion, is not needed; we already have far too many (correlated) predictors. A leading example is when the objects to be classified are digitized analog signals:

• the log-periodogram of a fragment of spoken speech, sampled at a set of 256 frequencies; see Figure 5.5 on page 149.

• the grayscale pixel values in a digitized image of a handwritten digit.
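A concrete choice of roughness penalty Ω for such one-dimensional signals is a squared second-difference penalty (our illustration, not the book's specific construction), which charges "rough" coefficient vectors far more than smooth ones of comparable size:

```python
import numpy as np

def second_difference_penalty(p):
    """Omega = D^T D, where D takes second differences of a length-p
    coefficient vector, so beta^T Omega beta measures its roughness."""
    D = np.zeros((p - 2, p))
    for i in range(p - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D.T @ D

p = 50
Omega = second_difference_penalty(p)
t = np.linspace(0.0, 1.0, p)
beta_smooth = np.sin(2 * np.pi * t)              # a smooth coefficient vector
rng = np.random.default_rng(0)
beta_rough = rng.choice([-1.0, 1.0], size=p)     # "salt and pepper" coefficients

# The quadratic penalty beta^T Omega beta charges roughness, not magnitude:
# both vectors have entries of order one, but very different penalties.
penalty_smooth = beta_smooth @ Omega @ beta_smooth
penalty_rough = beta_rough @ Omega @ beta_rough
```

Used inside (Σ_W + λΩ)^{−1}, such a penalty downweights exactly the noisy, sign-alternating coefficient directions that plain LDA produces for highly correlated pixel predictors.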

FIGURE 12.11. The images appear in pairs, and represent the nine discriminant coefficient functions for the digit recognition problem. The left member of each pair is the LDA coefficient, while the right member is the PDA coefficient, regularized to enforce spatial smoothness.

It is also intuitively clear in these cases why regularization is needed. Take the digitized image as an example. Neighboring pixel values will tend to be correlated, being often almost the same. This implies that the pair of corresponding LDA coefficients for these pixels can be wildly different and opposite in sign, and thus cancel when applied to similar pixel values. Positively correlated predictors lead to noisy, negatively correlated coefficient estimates, and this noise results in unwanted sampling variance. A reasonable strategy is to regularize the coefficients to be smooth over the spatial domain, as with images. This is what PDA does. The computations proceed just as for FDA, except that an appropriate penalized regression method is used. Here h^T(X) beta_l = X beta_l, and Omega is chosen so that beta_l^T Omega beta_l penalizes roughness in beta_l when viewed as an image.

Figure 1.2 on page 4 shows some examples of handwritten digits. Figure 12.11 shows the discriminant variates using LDA and PDA. Those produced by LDA appear as salt-and-pepper images, while those produced by PDA are smooth images. The first smooth image can be seen as the coefficients of a linear contrast functional for separating images with a dark central vertical strip (ones, possibly sevens) from images that are hollow in the middle (zeros, some fours). Figure 12.12 supports this interpretation, and with more difficulty allows an interpretation of the second coordinate. This and other

FIGURE 12.12. The first two penalized canonical variates, evaluated for the test data. The circles indicate the class centroids. The first coordinate contrasts mainly 0's and 1's, while the second contrasts 6's and 7/9's.

examples are discussed in more detail in Hastie et al. (1995), who also show that the regularization improves the classification performance of LDA on independent test data by a factor of around 25% in the cases they tried.

12.7 Mixture Discriminant Analysis

Linear discriminant analysis can be viewed as a prototype classifier. Each class is represented by its centroid, and we classify to the closest using an appropriate metric. In many situations a single prototype is not sufficient to represent inhomogeneous classes, and mixture models are more appropriate. In this section we review Gaussian mixture models and show how they can be generalized via the FDA and PDA methods discussed earlier. A Gaussian mixture model for the kth class has density

P(X | G = k) = sum_{r=1}^{R_k} pi_kr phi(X; mu_kr, Sigma),   (12.59)

where the mixing proportions pi_kr sum to one. This has R_k prototypes for the kth class, and in our specification, the same covariance matrix Sigma is used as the metric throughout. Given such a model for each class, the class posterior probabilities are given by

P(G = k | X = x) = [ sum_{r=1}^{R_k} pi_kr phi(x; mu_kr, Sigma) Pi_k ] / [ sum_{l=1}^K sum_{r=1}^{R_l} pi_lr phi(x; mu_lr, Sigma) Pi_l ],   (12.60)

where the Pi_k represent the class prior probabilities. We saw these calculations for the special case of two components in Chapter 8.

As in LDA, we estimate the parameters by maximum likelihood, using the joint log-likelihood based on P(G, X):

sum_{k=1}^K sum_{g_i = k} log [ sum_{r=1}^{R_k} pi_kr phi(x_i; mu_kr, Sigma) Pi_k ].   (12.61)

The sum within the log makes this a rather messy optimization problem if tackled directly. The classical and natural method for computing the maximum-likelihood estimates (MLEs) for mixture distributions is the EM algorithm (Dempster et al., 1977), which is known to possess good convergence properties. EM alternates between the two steps:

E-step: Given the current parameters, compute the responsibility of subclass c_kr within class k for each of the class-k observations (g_i = k):

W(c_kr | x_i, g_i) = pi_kr phi(x_i; mu_kr, Sigma) / sum_{l=1}^{R_k} pi_kl phi(x_i; mu_kl, Sigma).   (12.62)

M-step: Compute the weighted MLEs for the parameters of each of the component Gaussians within each of the classes, using the weights from the E-step.

In the E-step, the algorithm apportions the unit weight of an observation in class k to the various subclasses assigned to that class. If it is close to the centroid of a particular subclass, and far from the others, it will receive a mass close to one for that subclass. On the other hand, observations halfway between two subclasses will get approximately equal weight for both. In the M-step, an observation in class k is used R_k times, to estimate the parameters in each of the R_k component densities, with a different weight for each. The EM algorithm is studied in detail in Chapter 8.

The algorithm requires initialization, which can have an impact, since mixture likelihoods are generally multimodal. Our software (referenced in the Computational Considerations on page 455) allows several strategies; here we describe the default. The user supplies the number R_k of subclasses per class. Within class k, a k-means clustering model, with multiple random starts, is fitted to the data. This partitions the observations into R_k disjoint groups, from which an initial weight matrix, consisting of zeros and ones, is created.

Our assumption of an equal component covariance matrix Sigma throughout buys an additional simplicity; we can incorporate rank restrictions in the mixture formulation just like in LDA. To understand this, we review a little-known fact about LDA. The rank-L LDA fit (Section 4.3.3) is equivalent to the maximum-likelihood fit of a Gaussian model, where the different mean vectors in each class are confined to a rank-L subspace of R^p (Exercise 4.8).
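The E- and M-steps above can be sketched as follows. This is an illustrative loop, not the mda software: it keeps the book's common covariance Sigma across all subclasses, but initializes from randomly chosen class points rather than the per-class k-means default described above, and the function names and interface are invented for the sketch.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at the rows of X."""
    p = X.shape[1]
    d = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    q = np.sum(d * np.linalg.solve(Sigma, d.T).T, axis=1)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + q)

def mda_em(X, g, K, R, n_iter=25, seed=0):
    """Sketch of MDA's EM: R subclasses per class, common Sigma."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mus = np.stack([X[g == k][rng.choice((g == k).sum(), R, replace=False)]
                    for k in range(K)])           # (K, R, p) subclass centers
    pis = np.full((K, R), 1.0 / R)                # subclass mixing proportions
    Sigma = np.cov(X.T) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        Sigma_new = np.zeros((p, p))
        for k in range(K):
            Xk = X[g == k]
            # E-step: responsibilities of the R subclasses of class k
            logw = np.stack([np.log(pis[k, r]) +
                             log_gauss(Xk, mus[k, r], Sigma)
                             for r in range(R)], axis=1)
            logw -= logw.max(axis=1, keepdims=True)
            W = np.exp(logw)
            W /= W.sum(axis=1, keepdims=True)
            # M-step: weighted MLEs of the subclass parameters
            for r in range(R):
                w = W[:, r]
                mus[k, r] = (w[:, None] * Xk).sum(0) / w.sum()
                d = Xk - mus[k, r]
                Sigma_new += (w[:, None] * d).T @ d
            pis[k] = W.mean(axis=0)
        Sigma = Sigma_new / N + 1e-6 * np.eye(p)
    return mus, pis, Sigma
```

Note how each class-k observation contributes to all R_k subclass estimates, with the E-step weights, exactly as described in the text.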
We can inherit this property for the mixture model, and maximize the log-likelihood (12.61) subject to rank constraints on all the sum_k R_k centroids: rank{mu_kr} = L.

Again the EM algorithm is available, and the M-step turns out to be a weighted version of LDA, with R = sum_{k=1}^K R_k classes. Furthermore, we can use optimal scoring as before to solve the weighted LDA problem, which allows us to use a weighted version of FDA or PDA at this stage. One would expect, in addition to an increase in the number of classes, a similar increase in the number of observations in the kth class by a factor of R_k. It turns out that this is not the case if linear operators are used for the optimal scoring regression. The enlarged indicator Y matrix collapses in this case to a blurred response matrix Z, which is intuitively pleasing. For example, suppose there are K = 3 classes, and R_k = 3 subclasses per class. Then Z might be

[Display (12.63): the N x 9 blurred response matrix Z for K = 3 classes with three subclasses each, columns c_11, ..., c_13, c_21, ..., c_33; each class-k row contains the subclass responsibilities in the columns for class k, and zeros elsewhere.]

where the entries in a class-k row correspond to W(c_kr | x_i, g_i). The remaining steps are the same:

- Fit Zhat = S Z.
- Compute the eigen-decomposition Z^T Zhat = Theta D Theta^T.
- Update the model, and the pi's and Pi's, in the M-step of MDA.

These simple modifications add considerable flexibility to the mixture model:

- The dimension reduction step in LDA, FDA or PDA is limited by the number of classes; in particular, for K = 2 classes no reduction is possible. MDA substitutes subclasses for classes, and then allows us to look at low-dimensional views of the subspace spanned by these subclass centroids. This subspace will often be an important one for discrimination.
- By using FDA or PDA in the M-step, we can adapt even more to particular situations. For example, we can fit MDA models to digitized analog signals and images, with smoothness constraints built in.

Figure 12.13 compares FDA and MDA on the mixture example.

12.7.1 Example: Waveform Data

We now illustrate some of these ideas on a popular simulated example, taken from Breiman et al. (1984, pages 49-55), and used in Hastie and Tibshirani (1996b) and elsewhere. It is a three-class problem with 21 variables, and is considered to be a difficult pattern recognition problem. The predictors are defined by

X_j = U h_1(j) + (1 - U) h_2(j) + eps_j   Class 1,
X_j = U h_1(j) + (1 - U) h_3(j) + eps_j   Class 2,   (12.64)
X_j = U h_2(j) + (1 - U) h_3(j) + eps_j   Class 3,

where j = 1, 2, ..., 21, U is uniform on (0, 1), the eps_j are standard normal variates, and the h_l are the shifted triangular waveforms: h_1(j) = max(6 - |j - 11|, 0), h_2(j) = h_1(j - 4) and h_3(j) = h_1(j + 4).

FIGURE 12.13. FDA and MDA on the mixture data. The upper panel uses FDA with MARS as the regression procedure. The lower panel uses MDA with five mixture centers per class (indicated). The MDA solution is close to Bayes optimal, as might be expected given the data arise from mixtures of Gaussians. The broken purple curve in the background is the Bayes decision boundary.

FIGURE 12.14. Some examples of the waveforms generated from model (12.64) before the Gaussian noise is added.

Figure 12.14 shows some example waveforms from each class.

Table 12.4 shows the results of MDA applied to the waveform data, as well as several other methods from this and other chapters. Each training sample has 300 observations, and equal priors were used, so there are roughly 100 observations in each class. We used test samples of size 500. The two MDA models are described in the caption.

Figure 12.15 shows the leading canonical variates for the penalized MDA model, evaluated at the test data. As we might have guessed, the classes appear to lie on the edges of a triangle. This is because the h_l(j) are represented by three points in 21-space, thereby forming vertices of a triangle, and each class is represented as a convex combination of a pair of vertices, and hence lies on an edge. Also it is clear visually that all the information lies in the first two dimensions; the percentage of variance explained by the first two coordinates is 99.8%, and we would lose nothing by truncating the solution there.

The Bayes risk for this problem has been estimated to be about 0.140 (Breiman et al., 1984). MDA comes close to the optimal rate, which is not surprising since the structure of the MDA model is similar to the generating model.
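The generating model (12.64) is easy to simulate. The sketch below (function name and interface are ours) draws N observations with equal class priors, building the three shifted triangular waveforms and mixing each class's pair with a uniform weight U plus standard normal noise:

```python
import numpy as np

def waveform_data(N, seed=1):
    """Simulate the three-class waveform data of model (12.64):
    21 features from convex combinations of shifted triangles."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, 22)
    h1 = np.maximum(6 - np.abs(j - 11), 0)
    h2 = np.maximum(6 - np.abs(j - 15), 0)   # h1(j - 4)
    h3 = np.maximum(6 - np.abs(j - 7), 0)    # h1(j + 4)
    pairs = [(h1, h2), (h1, h3), (h2, h3)]   # classes 1, 2, 3
    y = rng.integers(0, 3, N)                # equal class priors
    U = rng.uniform(size=(N, 1))
    eps = rng.normal(size=(N, 21))
    X = np.empty((N, 21))
    for c, (ha, hb) in enumerate(pairs):
        m = y == c
        X[m] = U[m] * ha + (1 - U[m]) * hb + eps[m]
    return X, y
```

Since U has mean 1/2, each class's mean curve is the average of its two triangles, which is what Figure 12.14 displays before the noise is added.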

TABLE 12.4. Results for waveform data. The values are averages over ten simulations, with the standard error of the average in parentheses. The five entries above the line are taken from Hastie et al. (1994). The first model below the line is MDA with three subclasses per class. The next line is the same, except that the discriminant coefficients are penalized via a roughness penalty to effectively 4 df. The third is the corresponding penalized LDA or PDA model.

Technique                             Training        Test
LDA                                   0.121 (0.006)   0.191 (0.006)
QDA                                   0.039 (0.004)   0.205 (0.006)
CART                                  0.072 (0.003)   0.289 (0.004)
FDA/MARS (degree = 1)                 0.100 (0.006)   0.191 (0.006)
FDA/MARS (degree = 2)                 0.068 (0.004)   0.215 (0.002)
MDA (3 subclasses)                    0.087 (0.005)   0.169 (0.006)
MDA (3 subclasses, penalized 4 df)    0.137 (0.006)   0.157 (0.005)
PDA (penalized 4 df)                  0.150 (0.005)   0.171 (0.005)
Bayes                                                 0.140

FIGURE 12.15. Some two-dimensional views of the MDA model fitted to a sample of the waveform model. The points are independent test data, projected onto the leading two canonical coordinates (left panel), and the third and fourth (right panel). The subclass centers are indicated.

Computational Considerations

With N training cases, p predictors, and m support vectors, the support vector machine requires m^3 + mN + mpN operations, assuming m is comparable to N. They do not scale well with N, although computational shortcuts are available (Platt, 1999). Since these are evolving rapidly, the reader is urged to search the web for the latest technology.

LDA requires Np^2 + p^3 operations, as does PDA. The complexity of FDA depends on the regression method used. Many techniques are linear in N, such as additive models and MARS. General splines and kernel-based regression methods will typically require N^3 operations.

Software is available for fitting FDA, PDA and MDA models in the R package mda, which is also available in S-PLUS.

Bibliographic Notes

The theory behind support vector machines is due to Vapnik and is described in Vapnik (1996). There is a burgeoning literature on SVMs; an online bibliography, created and maintained by Alex Smola and Bernhard Schölkopf, can be found online. Our treatment is based on Wahba et al. (2000) and Evgeniou et al. (2000), and the tutorial by Burges (Burges, 1998).

Linear discriminant analysis is due to Fisher (1936) and Rao (1973). The connection with optimal scoring dates back at least to Breiman and Ihaka (1984), and in a simple form to Fisher (1936). There are strong connections with correspondence analysis (Greenacre, 1984). The description of flexible, penalized and mixture discriminant analysis is taken from Hastie et al. (1994), Hastie et al. (1995) and Hastie and Tibshirani (1996b), and all three are summarized in Hastie et al. (1998); see also Ripley (1996).

Exercises

Ex. 12.1 Show that the criteria (12.25) and (12.8) are equivalent.

Ex. 12.2 Show that the solution to (12.29) is the same as the solution to (12.25) for a particular kernel.

Ex. 12.3 Consider a modification to (12.43) where you do not penalize the constant. Formulate the problem, and characterize its solution.

Ex. 12.4 Suppose you perform a reduced-subspace linear discriminant analysis for a K-group problem. You compute the canonical variables of dimension L <= K - 1 given by z = U^T x, where U is the p x L matrix of discriminant coefficients, and p > K is the dimension of x.

(a) If L = K - 1 show that

    ||z - zbar_k||^2 - ||z - zbar_k'||^2 = ||x - xbar_k||^2_W - ||x - xbar_k'||^2_W,

where ||.||_W denotes Mahalanobis distance with respect to the covariance W.

(b) If L < K - 1, show that the same expression on the left measures the difference in Mahalanobis squared distances for the distributions projected onto the subspace spanned by U.

Ex. 12.5 The data in phoneme.subset, available from this book's website, consists of digitized log-periodograms for phonemes uttered by 60 speakers, each speaker having produced phonemes from each of five classes. It is appropriate to plot each vector of 256 features against the frequencies.

(a) Produce a separate plot of all the phoneme curves against frequency for each class.

(b) You plan to use a nearest prototype classification scheme to classify the curves into phoneme classes. In particular, you will use a K-means clustering algorithm in each class (kmeans() in R), and then classify observations to the class of the closest cluster center. The curves are high-dimensional and you have a rather small sample-size-to-variables ratio. You decide to restrict all the prototypes to be smooth functions of frequency. In particular, you decide to represent each prototype m as m = B theta where B is a 256 x J matrix of natural spline basis functions with J knots uniformly chosen in (0, 255) and boundary knots at 0 and 255. Describe how to proceed analytically, and in particular, how to avoid costly high-dimensional fitting procedures. (Hint: It may help to restrict B to be orthogonal.)

(c) Implement your procedure on the phoneme data, and try it out. Divide the data into a training set and a test set (50-50), making sure that speakers are not split across sets (why?). Use K = 1, 3, 5, 7 centers per class, and for each use J = 5, 10, 15 knots (taking care to start the K-means procedure at the same starting values for each value of J), and compare the results.

Ex. 12.6 Suppose that the regression procedure used in FDA (Section 12.5.1) is a linear expansion of basis functions h_m(x), m = 1, ..., M. Let D_pi = Y^T Y/N be the diagonal matrix of class proportions.

(a) Show that the optimal scoring problem (12.52) can be written in vector notation as

    min_{theta, beta} ||Y theta - H beta||^2,   (12.65)

where theta is a vector of K real numbers, and H is the N x M matrix of evaluations h_j(x_i).

(b) Suppose that the normalization on theta is theta^T D_pi 1 = 0 and theta^T D_pi theta = 1. Interpret these normalizations in terms of the original scored theta(g_i).

(c) Show that, with this normalization, (12.65) can be partially optimized w.r.t. beta, and leads to

    max_theta theta^T S theta,   (12.66)

subject to the normalization constraints, where S is the projection operator corresponding to the basis matrix H.

(d) Suppose that the h_j include the constant function. Show that the largest eigenvalue of S is 1.

(e) Let Theta be a K x K matrix of scores (in columns), and suppose the normalization is Theta^T D_pi Theta = I. Show that the solution to (12.52) is given by the complete set of eigenvectors of S; the first eigenvector is trivial, and takes care of the centering of the scores. The remainder characterize the optimal scoring solution.

Ex. 12.7 Derive the solution to the penalized optimal scoring problem (12.57).

Ex. 12.8 Show that the coefficients beta_l found by optimal scoring are proportional to the discriminant directions nu_l found by linear discriminant analysis.

Ex. 12.9 Let Yhat = X Bhat be the fitted N x K indicator response matrix after linear regression on the N x p matrix X, where p > K. Consider the reduced features x_i* = Bhat^T x_i. Show that LDA using x_i* is equivalent to LDA in the original space.

Ex. 12.10 Kernels and linear discriminant analysis. Suppose you wish to carry out a linear discriminant analysis (two classes) using a vector of transformations of the input variables h(x). Since h(x) is high-dimensional, you will use a regularized within-class covariance matrix W_h + gamma I. Show that the model can be estimated using only the inner products K(x_i, x_i') = <h(x_i), h(x_i')>. Hence the kernel property of support vector machines is also shared by regularized linear discriminant analysis.

Ex. 12.11 The MDA procedure models each class as a mixture of Gaussians. Hence each mixture center belongs to one and only one class.
A more general model allows each mixture center to be shared by all classes. We take the joint density of labels and features to be

P(G, X) = sum_{r=1}^R pi_r P_r(G, X),   (12.67)

a mixture of joint densities. Furthermore we assume

P_r(G, X) = P_r(G) phi(X; mu_r, Sigma).   (12.68)

This model consists of regions centered at mu_r, and for each there is a class profile P_r(G). The posterior class distribution is given by

P(G = k | X = x) = [ sum_{r=1}^R pi_r P_r(G = k) phi(x; mu_r, Sigma) ] / [ sum_{r=1}^R pi_r phi(x; mu_r, Sigma) ],   (12.69)

where the denominator is the marginal distribution P(X).

(a) Show that this model (called MDA2) can be viewed as a generalization of MDA since

P(X | G = k) = [ sum_{r=1}^R pi_r P_r(G = k) phi(X; mu_r, Sigma) ] / [ sum_{r=1}^R pi_r P_r(G = k) ],   (12.70)

where pi_rk = pi_r P_r(G = k) / sum_{r=1}^R pi_r P_r(G = k) corresponds to the mixing proportions for the kth class.

(b) Derive the EM algorithm for MDA2.

(c) Show that if the initial weight matrix is constructed as in MDA, involving separate k-means clustering in each class, then the algorithm for MDA2 is identical to the original MDA procedure.

13 Prototype Methods and Nearest-Neighbors

13.1 Introduction

In this chapter we discuss some simple and essentially model-free methods for classification and pattern recognition. Because they are highly unstructured, they typically are not useful for understanding the nature of the relationship between the features and class outcome. However, as black box prediction engines, they can be very effective, and are often among the best performers in real data problems. The nearest-neighbor technique can also be used in regression; this was touched on in Chapter 2 and works reasonably well for low-dimensional problems. However, with high-dimensional features, the bias-variance tradeoff does not work as favorably for nearest-neighbor regression as it does for classification.

13.2 Prototype Methods

Throughout this chapter, our training data consists of the N pairs (x_1, g_1), ..., (x_N, g_N) where g_i is a class label taking values in {1, 2, ..., K}. Prototype methods represent the training data by a set of points in feature space. These prototypes are typically not examples from the training sample, except in the case of 1-nearest-neighbor classification discussed later.

Each prototype has an associated class label, and classification of a query point x is made to the class of the closest prototype. "Closest" is usually defined by Euclidean distance in the feature space, after each feature has

been standardized to have overall mean 0 and variance 1 in the training sample. Euclidean distance is appropriate for quantitative features. We discuss distance measures between qualitative and other kinds of feature values in Chapter 14.

These methods can be very effective if the prototypes are well positioned to capture the distribution of each class. Irregular class boundaries can be represented, with enough prototypes in the right places in feature space. The main challenge is to figure out how many prototypes to use and where to put them. Methods differ according to the number and way in which prototypes are selected.

13.2.1 K-means Clustering

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. One chooses the desired number of cluster centers, say R, and the K-means procedure iteratively moves the centers to minimize the total within-cluster variance. Given an initial set of centers, the K-means algorithm alternates the two steps:

- for each center we identify the subset of training points (its cluster) that is closer to it than any other center;
- the means of each feature for the data points in each cluster are computed, and this mean vector becomes the new center for that cluster.

These two steps are iterated until convergence. Typically the initial centers are R randomly chosen observations from the training data. Details of the K-means procedure, as well as generalizations allowing for different variable types and more general distance measures, are given in Chapter 14.

To use K-means clustering for classification of labeled data, the steps are:

- apply K-means clustering to the training data in each class separately, using R prototypes per class;
- assign a class label to each of the K x R prototypes;
- classify a new feature x to the class of the closest prototype.

Figure 13.1 (upper panel) shows a simulated example with three classes and two features. We used R = 5 prototypes per class, and show the classification regions and the decision boundary.
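The labeled-data recipe above can be sketched with a small Lloyd-style K-means loop per class. This is an illustration, not a library implementation: the function names and interface are ours, initialization is a simple random draw of class points, and empty clusters keep their previous center.

```python
import numpy as np

def fit_prototypes(X, y, R, n_iter=50, seed=0):
    """Per-class K-means: R labeled prototypes per class."""
    rng = np.random.default_rng(seed)
    protos, labels = [], []
    for k in np.unique(y):
        Xk = X[y == k]
        centers = Xk[rng.choice(len(Xk), R, replace=False)]
        for _ in range(n_iter):
            d = ((Xk[:, None, :] - centers[None]) ** 2).sum(-1)
            assign = d.argmin(1)                 # nearest-center assignment
            for r in range(R):
                if np.any(assign == r):
                    centers[r] = Xk[assign == r].mean(0)
        protos.append(centers)
        labels.append(np.full(R, k))
    return np.vstack(protos), np.concatenate(labels)

def classify(x, protos, labels):
    """Classify query points to the class of the closest prototype."""
    d = ((x[:, None, :] - protos[None]) ** 2).sum(-1)
    return labels[d.argmin(1)]
```

In practice the features would first be standardized, as described at the start of this section.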
Notice that a number of the

[Footnote: The "K" in K-means refers to the number of cluster centers. Since we have already reserved K to denote the number of classes, we denote the number of clusters by R.]

FIGURE 13.1. Simulated example with three classes and five prototypes per class. The data in each class are generated from a mixture of Gaussians. In the upper panel, the prototypes were found by applying the K-means clustering algorithm separately in each class. In the lower panel, the LVQ algorithm (starting from the K-means solution) moves the prototypes away from the decision boundary. The broken purple curve in the background is the Bayes decision boundary.

Algorithm 13.1 Learning Vector Quantization (LVQ).

1. Choose R initial prototypes for each class: m_1(k), m_2(k), ..., m_R(k), k = 1, 2, ..., K, for example, by sampling R training points at random from each class.

2. Sample a training point x_i randomly (with replacement), and let (j, k) index the closest prototype m_j(k) to x_i.

   (a) If g_i = k (i.e., they are in the same class), move the prototype towards the training point:

       m_j(k) <- m_j(k) + eps (x_i - m_j(k)),

       where eps is the learning rate.

   (b) If g_i != k (i.e., they are in different classes), move the prototype away from the training point:

       m_j(k) <- m_j(k) - eps (x_i - m_j(k)).

3. Repeat step 2, decreasing the learning rate eps with each iteration towards zero.

prototypes are near the class boundaries, leading to potential misclassification errors for points near these boundaries. This results from an obvious shortcoming with this method: for each class, the other classes do not have a say in the positioning of the prototypes for that class. A better approach, discussed next, uses all of the data to position all prototypes.

13.2.2 Learning Vector Quantization

In this technique due to Kohonen (1989), prototypes are placed strategically with respect to the decision boundaries in an ad-hoc way. LVQ is an online algorithm: observations are processed one at a time. The idea is that the training points attract prototypes of the correct class, and repel other prototypes. When the iterations settle down, prototypes should be close to the training points in their class. The learning rate eps is decreased to zero with each iteration, following the guidelines for stochastic approximation learning rates (Section 11.4).

Figure 13.1 (lower panel) shows the result of LVQ, using the K-means solution as starting values. The prototypes have tended to move away from the decision boundaries, and away from prototypes of competing classes.

The procedure just described is actually called LVQ1. Modifications (LVQ2, LVQ3, etc.) have been proposed that can sometimes improve performance. A drawback of learning vector quantization methods is the fact

481 . k-nearest-neighbr Classifiers 46 that they are defined by algrithms, rather than ptimizatin f sme fixed criteria; this makes it difficult t understand their prperties... Gaussian Mixtures The Gaussian mixture mdel can als be thught f as a prttype methd, similar in spirit t K-means and LVQ. We discuss Gaussian mixtures in sme detail in Sectins 6.8, 8.5 and.7. Each cluster is described in terms f a Gaussian density, which has a centrid (as in K-means), and a cvariance matrix. The cmparisn becmes crisper if we restrict the cmpnent Gaussians t have a scalar cvariance matrix (Exercise.). The tw steps f the alternating EM algrithm are very similar t the tw steps in K- means: In the E-step, each bservatin is assigned a respnsibility r weight fr each cluster, based n the likelihd f each f the crrespnding Gaussians. bservatins clse t the center f a cluster will mst likely get weight fr that cluster, and weight 0 fr every ther cluster. bservatins half-way between tw clusters divide their weight accrdingly. In the M-step, each bservatin cntributes t the weighted means (and cvariances) fr every cluster. As a cnsequence, the Gaussian mixture mdel is ften referred t as a sft clustering methd, while K-means is hard. Similarly, when Gaussian mixture mdels are used t represent the feature density in each class, it prduces smth psterir prbabilities ˆp(x) = {ˆp (x),..., ˆp K (x)} fr classifying x (see (.60) n page 449.) ften this is interpreted as a sft classificatin, while in fact the classificatin rule is Ĝ(x) = arg max k ˆp k (x). Figure. cmpares the results f K-means and Gaussian mixtures n the simulated mixture prblem f Chapter. We see that althugh the decisin bundaries are rughly similar, thse fr the mixture mdel are smther (althugh the prttypes are in apprximately the same psitins.) We als see that while bth prcedures devte a blue prttype (incrrectly) t a regin in the nrthwest, the Gaussian mixture classifier can ultimately ignre this regin, while K-means cannt. 
LVQ gave very similar results to K-means on this example, and is not shown.

13.3 k-Nearest-Neighbor Classifiers

These classifiers are memory-based, and require no model to be fit. Given a query point x_0, we find the k training points x_(r), r = 1, ..., k closest in distance to x_0, and then classify using majority vote among the k neighbors.
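The rule can be sketched in a few lines (an illustration with an invented interface, not a library routine). One simplification to note: where the text breaks ties at random, this sketch deterministically favors the lowest class label.

```python
import numpy as np

def knn_classify(x0, X, y, k):
    """Majority vote among the k training points nearest each query row
    of x0, using squared Euclidean distance."""
    d = ((x0[:, None, :] - X[None]) ** 2).sum(-1)   # (queries, N) distances
    idx = np.argsort(d, axis=1)[:, :k]              # k nearest neighbors
    votes = y[idx]
    out = np.empty(len(x0), dtype=y.dtype)
    for i, v in enumerate(votes):
        vals, counts = np.unique(v, return_counts=True)
        out[i] = vals[counts.argmax()]              # ties: lowest label wins
    return out
```

As the next page emphasizes, the features should be standardized first, since they may be measured in different units.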

FIGURE 13.2. The upper panel shows the K-means classifier applied to the mixture data example. The decision boundary is piecewise linear. The lower panel shows a Gaussian mixture model with a common covariance for all component Gaussians. The EM algorithm for the mixture model was started at the K-means solution. The broken purple curve in the background is the Bayes decision boundary.

483 . k-nearest-neighbr Classifiers 465 Ties are brken at randm. Fr simplicity we will assume that the features are real-valued, and we use Euclidean distance in feature space: d (i) = x (i) x 0. (.) Typically we first standardize each f the features t have mean zer and variance, since it is pssible that they are measured in different units. In Chapter 4 we discuss distance measures apprpriate fr qualitative and rdinal features, and hw t cmbine them fr mixed data. Adaptively chsen distance metrics are discussed later in this chapter. Despite its simplicity, k-nearest-neighbrs has been successful in a large number f classificatin prblems, including handwritten digits, satellite image scenes and EKG patterns. It is ften successful where each class has many pssible prttypes, and the decisin bundary is very irregular. Figure. (upper panel) shws the decisin bundary f a 5-nearestneighbr classifier applied t the three-class simulated data. The decisin bundary is fairly smth cmpared t the lwer panel, where a -nearestneighbr classifier was used. There is a clse relatinship between nearestneighbr and prttype methds: in -nearest-neighbr classificatin, each training pint is a prttype. Figure.4 shws the training, test and tenfld crss-validatin errrs as a functin f the neighbrhd size, fr the tw-class mixture prblem. Since the tenfld CV errrs are averages f ten numbers, we can estimate a standard errr. Because it uses nly the training pint clsest t the query pint, the bias f the -nearest-neighbr estimate is ften lw, but the variance is high. A famus result f Cver and Hart (967) shws that asympttically the errr rate f the -nearest-neighbr classifier is never mre than twice the Bayes rate. The rugh idea f the prf is as fllws (using squared-errr lss). We assume that the query pint cincides with ne f the training pints, s that the bias is zer. This is true asympttically if the dimensin f the feature space is fixed and the training data fills up the space in a dense fashin. 
Then the error of the Bayes rule is just the variance of a Bernoulli random variate (the target at the query point), while the error of the 1-nearest-neighbor rule is twice the variance of a Bernoulli random variate, one contribution each for the training and query targets.

We now give more detail for misclassification loss. At x let k* be the dominant class, and p_k(x) the true conditional probability for class k. Then

Bayes error = 1 - p_k*(x),   (13.2)
1-nearest-neighbor error = sum_{k=1}^K p_k(x) (1 - p_k(x))   (13.3)
                         >= 1 - p_k*(x).   (13.4)

The asymptotic 1-nearest-neighbor error rate is that of a random rule; we pick both the classification and the test point at random with probabili-

[Panels: 5-Nearest Neighbors; 1-Nearest Neighbor.]

FIGURE 13.3. k-nearest-neighbor classifiers applied to the simulation data of Figure 13.1. The broken purple curve in the background is the Bayes decision boundary.

485 . k-nearest-neighbr Classifiers 467 Number f Neighbrs Misclassificatin Errrs Test Errr 0-fld CV Training Errr Bayes Errr 7-Nearest Neighbrs Training Errr: 0.45 Test Errr: 0.5 Bayes Errr: 0.0 FIGURE.4. k-nearest-neighbrs n the tw-class mixture data. The upper panel shws the misclassificatin errrs as a functin f neighbrhd size. Standard errr bars are included fr 0-fld crss validatin. The lwer panel shws the decisin bundary fr 7-nearest-neighbrs, which appears t be ptimal fr minimizing test errr. The brken purple curve in the backgrund is the Bayes decisin bundary.

ties p_k(x), k = 1, ..., K. For K = 2 the 1-nearest-neighbor error rate is 2 p_k*(x)(1 - p_k*(x)) <= 2 (1 - p_k*(x)) (twice the Bayes error rate). More generally, one can show (Exercise 13.3)

sum_{k=1}^K p_k(x) (1 - p_k(x)) <= 2 (1 - p_k*(x)) - (K / (K - 1)) (1 - p_k*(x))^2.   (13.5)

Many additional results of this kind have been derived; Ripley (1996) summarizes a number of them.

This result can provide a rough idea about the best performance that is possible in a given problem. For example, if the 1-nearest-neighbor rule has a 10% error rate, then asymptotically the Bayes error rate is at least 5%. The kicker here is the asymptotic part, which assumes the bias of the nearest-neighbor rule is zero. In real problems the bias can be substantial. The adaptive nearest-neighbor rules, described later in this chapter, are an attempt to alleviate this bias. For simple nearest-neighbors, the bias and variance characteristics can dictate the optimal number of near neighbors for a given problem. This is illustrated in the next example.

13.3.1 Example: A Comparative Study

We tested the nearest-neighbors, K-means and LVQ classifiers on two simulated problems. There are 10 independent features X_j, each uniformly distributed on [0, 1]. The two-class 0-1 target variable is defined as follows:

Y = I(X_1 > 1/2),                                    problem 1: "easy";
Y = I( sign{ prod_{j=1}^3 (X_j - 1/2) } > 0 ),       problem 2: "difficult".   (13.6)

Hence in the first problem the two classes are separated by the hyperplane X_1 = 1/2; in the second problem, the two classes form a checkerboard pattern in the hypercube defined by the first three features. The Bayes error rate is zero in both problems. There were 100 training and 1000 test observations.

Figure 13.5 shows the mean and standard error of the misclassification error for nearest-neighbors, K-means and LVQ over ten realizations, as the tuning parameters are varied. We see that K-means and LVQ give nearly identical results. For the best choices of their tuning parameters, K-means and LVQ outperform nearest-neighbors for the first problem, and they perform similarly for the second problem.
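The two problems of (13.6) are straightforward to simulate; the sketch below (function name and interface are ours) generates either one. Note that the product in the "difficult" problem is positive exactly when an even number of the first three features fall below 1/2, which is the checkerboard structure described in the text.

```python
import numpy as np

def simulate(N, problem="easy", seed=0):
    """The two simulated problems of (13.6): 10 uniform features on [0,1]."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(N, 10))
    if problem == "easy":
        y = (X[:, 0] > 0.5).astype(int)      # separated by X_1 = 1/2
    else:
        # "difficult": checkerboard in the first three features
        y = (np.sign(np.prod(X[:, :3] - 0.5, axis=1)) > 0).astype(int)
    return X, y
```

Both problems have Bayes error zero, so all of the reported misclassification error in Figure 13.5 is estimation error.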
Notice that the best value of each tuning parameter is clearly situation dependent. For example 25-nearest-neighbors outperforms 1-nearest-neighbor by a factor of 70% in the

FIGURE 13.5. Mean +/- one standard error of misclassification error for nearest-neighbors, K-means (blue) and LVQ (red) over ten realizations for two simulated problems: "easy" and "difficult," described in the text. (Panels: Nearest Neighbors / Easy; K-means & LVQ / Easy; Nearest Neighbors / Difficult; K-means & LVQ / Difficult.)

FIGURE 13.6. The first four panels are LANDSAT images for an agricultural area in four spectral bands, depicted by heatmap shading. The remaining two panels give the actual land usage (color coded) and the predicted land usage using a five-nearest-neighbor rule described in the text.

first problem, while 1-nearest-neighbor is best in the second problem by a factor of 18%. These results underline the importance of using an objective, data-based method like cross-validation to estimate the best value of a tuning parameter (see Figure 13.4 and Chapter 7).

13.3.3 Example: k-Nearest-Neighbors and Image Scene Classification

The STATLOG project (Michie et al., 1994) used part of a LANDSAT image as a benchmark for classification (82 x 100 pixels). Figure 13.6 shows four heat-map images, two in the visible spectrum and two in the infrared, for an area of agricultural land in Australia. Each pixel has a class label from the 7-element set G = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}, determined manually by research assistants surveying the area. The lower middle panel shows the actual land usage, shaded by different colors to indicate the classes. The objective is to classify the land usage at a pixel, based on the information in the four spectral bands.

Five-nearest-neighbors produced the predicted map shown in the bottom right panel, and was computed as follows. For each pixel we extracted an 8-neighbor feature map: the pixel itself and its 8 immediate neighbors

FIGURE 13.7. A pixel and its 8-neighbor feature map.

(see Figure 13.7). This is done separately in the four spectral bands, giving (1 + 8) x 4 = 36 input features per pixel. Then five-nearest-neighbors classification was carried out in this 36-dimensional feature space. The resulting test error rate was about 9.5% (see Figure 13.8). Of all the methods used in the STATLOG project, including LVQ, CART, neural networks, linear discriminant analysis and many others, k-nearest-neighbors performed best on this task. Hence it is likely that the decision boundaries in IR^36 are quite irregular.

13.3.4 Invariant Metrics and Tangent Distance

In some problems, the training features are invariant under certain natural transformations. The nearest-neighbor classifier can exploit such invariances by incorporating them into the metric used to measure the distances between objects. Here we give an example where this idea was used with great success, and the resulting classifier outperformed all others at the time of its development (Simard et al., 1993).

The problem is handwritten digit recognition, as discussed in Chapter 1 and Section 11.7. The inputs are grayscale images with 16 x 16 = 256 pixels; some examples are shown in Figure 13.9. At the top of Figure 13.10, a "3" is shown, in its actual orientation (middle) and rotated 7.5 and 15 degrees in either direction. Such rotations can often occur in real handwriting, and it is obvious to our eye that this "3" is still a "3" after small rotations. Hence we want our nearest-neighbor classifier to consider these two "3"s to be close together (similar). However the 256 grayscale pixel values for a rotated "3" will look quite different from those in the original image, and hence the two objects can be far apart in Euclidean distance in IR^256. We wish to remove the effect of rotation in measuring distances between two digits of the same class.

Consider the set of pixel values consisting of the original "3" and its rotated versions. This is a one-dimensional curve in IR^256, depicted by the green curve passing through the "3" in Figure 13.10. Figure 13.11
shows a stylized version of IR^256, with two images indicated by x_i and x_i'. These might be two different "3"s, for example. Through each image we have drawn the curve of rotated versions of that image, called

FIGURE 13.8. Test-error performance for a number of classifiers, as reported by the STATLOG project (methods shown include LDA, DANN, k-NN, LVQ, RBF, ALLOC80, CART, Neural, NewID, C4.5, SMART, Logistic and QDA). The entry DANN is a variant of k-nearest neighbors, using an adaptive metric (Section 13.4.2).

FIGURE 13.9. Examples of grayscale images of handwritten digits.

FIGURE 13.10. The top row shows a "3" in its original orientation (middle) and rotated versions of it. The green curve in the middle of the figure depicts this set of rotated "3"s in 256-dimensional space. The red line is the tangent line to the curve at the original image, with some "3"s on this tangent line, and its equation shown at the bottom of the figure.

invariance manifolds in this context. Now, rather than using the usual Euclidean distance between the two images, we use the shortest distance between the two curves. In other words, the distance between the two images is taken to be the shortest Euclidean distance between any rotated version of the first image, and any rotated version of the second image. This distance is called an invariant metric.

In principle one could carry out 1-nearest-neighbor classification using this invariant metric. However there are two problems with it. First, it is very difficult to calculate for real images. Second, it allows large transformations that can lead to poor performance. For example a "6" would be considered close to a "9" after a rotation of 180 degrees. We need to restrict attention to small rotations.

The use of tangent distance solves both of these problems. As shown in Figure 13.10, we can approximate the invariance manifold of the image "3" by its tangent at the original image. This tangent can be computed by estimating the direction vector from small rotations of the image, or by more sophisticated spatial smoothing methods (Exercise 13.4). For large rotations, the tangent image no longer looks like a "3", so the problem with large transformations is alleviated.
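With each image represented by a point plus a one-dimensional tangent direction, tangent distance reduces to the shortest distance between two lines; a minimal numpy sketch (the "images" and tangent vectors here are tiny synthetic placeholders, not real digit data):

```python
import numpy as np

def tangent_distance(x1, t1, x2, t2):
    """Shortest Euclidean distance between the lines x1 + a*t1 and x2 + b*t2.

    Minimizing ||(x1 + a*t1) - (x2 + b*t2)|| over (a, b) is a linear
    least-squares problem in the coefficients (a, b).
    """
    A = np.column_stack([t1, -t2])                     # unknowns (a, b)
    coef, *_ = np.linalg.lstsq(A, x2 - x1, rcond=None)
    a, b = coef
    return float(np.linalg.norm((x1 + a * t1) - (x2 + b * t2)))

# toy 2-pixel "images": shifted points sharing the same tangent direction
x1, x2 = np.array([0.0, 0.0]), np.array([0.0, 1.0])
t = np.array([1.0, 0.0])
print(tangent_distance(x1, t, x2, t))  # parallel lines, distance 1.0
```

With 7-dimensional tangent spaces (rotation plus the six other transformations), t1 and t2 simply become matrices of tangent vectors and the same least-squares computation applies column-wise.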

FIGURE 13.11. Tangent distance computation for two images x_i and x_i'. Rather than using the Euclidean distance between x_i and x_i', or the shortest distance between the two curves, we use the shortest distance between the two tangent lines.

The idea then is to compute the invariant tangent line for each training image. For a query image to be classified, we compute its invariant tangent line, and find the closest line to it among the lines in the training set. The class (digit) corresponding to this closest line is our predicted class for the query image. In Figure 13.11 the two tangent lines intersect, but this is only because we have been forced to draw a two-dimensional representation of the actual 256-dimensional situation. In IR^256 the probability of two such lines intersecting is effectively zero.

Now a simpler way to achieve this invariance would be to add into the training set a number of rotated versions of each training image, and then just use a standard nearest-neighbor classifier. This idea is called "hints" in Abu-Mostafa (1995), and works well when the space of invariances is small.

So far we have presented a simplified version of the problem. In addition to rotation, there are six other types of transformations under which we would like our classifier to be invariant. These are translation (two directions), scaling (two directions), shear, and character thickness. Hence the curves and tangent lines in Figures 13.10 and 13.11 are actually 7-dimensional manifolds and hyperplanes. It is infeasible to add transformed versions of each training image to capture all of these possibilities. The tangent manifolds provide an elegant way of capturing the invariances.

Table 13.1 shows the test misclassification error for a problem with 7291 training images and 2007 test digits (the U.S. Postal Services database), for a carefully constructed neural network, and simple 1-nearest-neighbor and

TABLE 13.1. Test error rates for the handwritten ZIP code problem.

    Method                                   Error rate
    Neural-net                               0.049
    1-nearest-neighbor/Euclidean distance    0.055
    1-nearest-neighbor/tangent distance      0.026

tangent distance 1-nearest-neighbor rules. The tangent distance 1-nearest-neighbor classifier works remarkably well, with test error rates near those for the human eye (this is a notoriously difficult test set). In practice, it turned out that nearest-neighbors are too slow for online classification in this application (see Section 13.5), and neural network classifiers were subsequently developed to mimic it.

13.4 Adaptive Nearest-Neighbor Methods

When nearest-neighbor classification is carried out in a high-dimensional feature space, the nearest neighbors of a point can be very far away, causing bias and degrading the performance of the rule. To quantify this, consider N data points uniformly distributed in the unit cube [-1/2, 1/2]^p. Let R be the radius of a 1-nearest-neighborhood centered at the origin. Then

    median(R) = v_p^{-1/p} * ( 1 - (1/2)^{1/N} )^{1/p},   (13.7)

where v_p r^p is the volume of the sphere of radius r in p dimensions. Figure 13.12 shows the median radius for various training sample sizes and dimensions. We see that the median radius quickly approaches 0.5, the distance to the edge of the cube.

What can be done about this problem? Consider the two-class situation in Figure 13.13. There are two features, and a nearest-neighborhood at a query point is depicted by the circular region. Implicit in near-neighbor classification is the assumption that the class probabilities are roughly constant in the neighborhood, and hence simple averages give good estimates. However, in this example the class probabilities vary only in the horizontal direction. If we knew this, we would stretch the neighborhood in the vertical direction, as shown by the tall rectangular region. This will reduce the bias of our estimate and leave the variance the same.
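Formula (13.7) follows from requiring that a ball of radius r capture the nearest of N uniform points with probability 1/2; a small numpy sketch evaluates it and checks it by simulation (the choices p = 5, N = 100 and the simulation size are illustrative):

```python
import numpy as np
from math import gamma, pi

def median_radius(p, N):
    # median(R) = v_p^(-1/p) * (1 - (1/2)^(1/N))^(1/p), eq. (13.7)
    v_p = pi ** (p / 2) / gamma(p / 2 + 1)  # volume of the unit p-sphere
    return v_p ** (-1 / p) * (1 - 0.5 ** (1 / N)) ** (1 / p)

# Monte Carlo check: median distance from the origin to the nearest of
# N points uniform on [-1/2, 1/2]^p
rng = np.random.default_rng(0)
p, N = 5, 100
X = rng.uniform(-0.5, 0.5, size=(4000, N, p))
R = np.linalg.norm(X, axis=2).min(axis=1)
print(round(median_radius(p, N), 3), round(float(np.median(R)), 3))
```

The closed form is exact as long as the ball of the median radius fits inside the cube, which holds for these settings; for larger p the median radius approaches the 0.5 edge distance noted in the text.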
In general, this calls for adapting the metric used in nearest-neighbor classification, so that the resulting neighborhoods stretch out in directions for which the class probabilities don't change much. In high-dimensional feature space, the class probabilities might change only in a low-dimensional subspace and hence there can be considerable advantage to adapting the metric.

FIGURE 13.12. Median radius of a 1-nearest-neighborhood, for uniform data with N observations in p dimensions. (Curves shown for N = 100, N = 1,000 and N = 10,000.)

FIGURE 13.13. The points are uniform in the cube, with the vertical line separating class red and green. The vertical strip denotes the 5-nearest-neighbor region using only the horizontal coordinate to find the nearest-neighbors for the target point (solid dot). The sphere shows the 5-nearest-neighbor region using both coordinates, and we see in this case it has extended into the class-red region (and is dominated by the wrong class in this instance).

Friedman (1994a) proposed a method in which rectangular neighborhoods are found adaptively by successively carving away edges of a box containing the training data. Here we describe the discriminant adaptive nearest-neighbor (DANN) rule of Hastie and Tibshirani (1996a). Earlier, related proposals appear in Short and Fukunaga (1981) and Myles and Hand (1990).

At each query point a neighborhood of say 50 points is formed, and the class distribution among the points is used to decide how to deform the neighborhood, that is, to adapt the metric. The adapted metric is then used in a nearest-neighbor rule at the query point. Thus at each query point a potentially different metric is used.

In Figure 13.13 it is clear that the neighborhood should be stretched in the direction orthogonal to the line joining the class centroids. This direction also coincides with the linear discriminant boundary, and is the direction in which the class probabilities change the least. In general this direction of maximum change will not be orthogonal to the line joining the class centroids (see Figure 4.9). Assuming a local discriminant model, the information contained in the local within- and between-class covariance matrices is all that is needed to determine the optimal shape of the neighborhood.

The discriminant adaptive nearest-neighbor (DANN) metric at a query point x_0 is defined by

    D(x, x_0) = (x - x_0)^T Sigma (x - x_0),   (13.8)

where

    Sigma = W^{-1/2} [ W^{-1/2} B W^{-1/2} + eps*I ] W^{-1/2}
          = W^{-1/2} [ B* + eps*I ] W^{-1/2}.   (13.9)

Here W is the pooled within-class covariance matrix sum_{k=1}^K pi_k W_k and B is the between-class covariance matrix sum_{k=1}^K pi_k (xbar_k - xbar)(xbar_k - xbar)^T, with W and B computed using only the 50 nearest neighbors around x_0. After computation of the metric, it is used in a nearest-neighbor rule at x_0.

This complicated formula is actually quite simple in its operation. It first spheres the data with respect to W, and then stretches the neighborhood in the zero-eigenvalue directions of B* (the between-matrix for the sphered data).
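A minimal sketch of the DANN metric (13.9) computed from a local neighborhood, in pure numpy; the symmetric inverse square root of W is taken via eigendecomposition, and eps = 1 as suggested below:

```python
import numpy as np

def dann_metric(X, y, eps=1.0):
    """Sigma = W^{-1/2} [W^{-1/2} B W^{-1/2} + eps*I] W^{-1/2}, eq. (13.9).

    X, y: the (say 50) nearest neighbors of the query point and their labels.
    """
    classes, counts = np.unique(y, return_counts=True)
    pi = counts / len(y)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))   # pooled within-class covariance
    B = np.zeros((p, p))   # between-class covariance
    for k, pk in zip(classes, pi):
        Xk = X[y == k]
        W += pk * np.cov(Xk, rowvar=False)
        d = (Xk.mean(axis=0) - xbar)[:, None]
        B += pk * (d @ d.T)
    # symmetric inverse square root of W
    lam, U = np.linalg.eigh(W)
    W_isqrt = U @ np.diag(lam ** -0.5) @ U.T
    Bstar = W_isqrt @ B @ W_isqrt
    return W_isqrt @ (Bstar + eps * np.eye(p)) @ W_isqrt

def dann_distance(x, x0, Sigma):
    d = x - x0
    return float(d @ Sigma @ d)   # eq. (13.8)
```

In a pure one-class neighborhood B = 0 and the formula collapses to Sigma = W^{-1}, i.e., plain Mahalanobis sphering, consistent with the "circular neighborhoods" remark below.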
This makes sense, since locally the observed class means do not differ in these directions. The eps parameter rounds the neighborhood, from an infinite strip to an ellipsoid, to avoid using points far away from the query point. The value eps = 1 seems to work well in general. Figure 13.14 shows the resulting neighborhoods for a problem where the classes form two concentric circles. Notice how the neighborhoods stretch out orthogonally to the decision boundaries when both classes are present in the neighborhood. In the pure regions with only one class, the neighborhoods remain circular;

FIGURE 13.14. Neighborhoods found by the DANN procedure, at various query points (centers of the crosses). There are two classes in the data, with one class surrounding the other. 50 nearest-neighbors were used to estimate the local metrics. Shown are the resulting metrics used to form 15-nearest-neighborhoods.

in these cases the between matrix B = 0, and the Sigma in (13.8) is the identity matrix.

13.4.1 Example

Here we generate two-class data in ten dimensions, analogous to the two-dimensional example of Figure 13.14. All ten predictors in class 1 are independent standard normal, conditioned on the radius being greater than 22.4 and less than 40, while the predictors in class 2 are independent standard normal without the restriction. There are 250 observations in each class. Hence the first class almost completely surrounds the second class in the full ten-dimensional space.

In this example there are no pure noise variables, the kind that a nearest-neighbor subset selection rule might be able to weed out. At any given point in the feature space, the class discrimination occurs along only one direction. However, this direction changes as we move across the feature space and all variables are important somewhere in the space.

Figure 13.15 shows boxplots of the test error rates over ten realizations, for standard 5-nearest-neighbors, LVQ, and discriminant adaptive 5-nearest-neighbors. We used 50 prototypes per class for LVQ, to make it comparable to 5-nearest-neighbors (since 250/5 = 50). The adaptive metric significantly reduces the error rate, compared to LVQ or standard nearest-neighbors.

FIGURE 13.15. Ten-dimensional simulated example: boxplots of the test error rates over ten realizations, for standard 5-nearest-neighbors, LVQ with 50 centers, and discriminant-adaptive 5-nearest-neighbors.

13.4.2 Global Dimension Reduction for Nearest-Neighbors

The discriminant-adaptive nearest-neighbor method carries out local dimension reduction, that is, dimension reduction separately at each query point. In many problems we can also benefit from global dimension reduction, that is, apply a nearest-neighbor rule in some optimally chosen subspace of the original feature space. For example, suppose that the two classes form two nested spheres in four dimensions of feature space, and there are an additional six noise features whose distribution is independent of class. Then we would like to discover the important four-dimensional subspace, and carry out nearest-neighbor classification in that reduced subspace.

Hastie and Tibshirani (1996a) discuss a variation of the discriminant-adaptive nearest-neighbor method for this purpose. At each training point x_i, the between-centroids sum of squares matrix B_i is computed, and then these matrices are averaged over all training points:

    Bbar = (1/N) sum_{i=1}^N B_i.   (13.10)

Let e_1, e_2, ..., e_p be the eigenvectors of the matrix Bbar, ordered from largest to smallest eigenvalue theta_k. Then these eigenvectors span the optimal subspaces for global subspace reduction. The derivation is based on the fact that the best rank-L approximation to Bbar, Bbar^[L] = sum_{l=1}^L theta_l e_l e_l^T, solves the least squares problem

    min_{rank(M)=L} sum_{i=1}^N trace[(B_i - M)^2].   (13.11)

Since each B_i contains information on (a) the local discriminant subspace, and (b) the strength of discrimination in that subspace, (13.11) can be seen

as a way of finding the best approximating subspace of dimension L to a series of N subspaces by weighted least squares (Exercise 13.5).

In the four-dimensional sphere example mentioned above and examined in Hastie and Tibshirani (1996a), four of the eigenvalues theta_l turn out to be large (having eigenvectors nearly spanning the interesting subspace), and the remaining six are near zero. Operationally, we project the data into the leading four-dimensional subspace, and then carry out nearest-neighbor classification. In the satellite image classification example in Section 13.3.3, the technique labeled DANN in Figure 13.8 used 5-nearest-neighbors in a globally reduced subspace. There are also connections of this technique with the sliced inverse regression proposal of Duan and Li (1991). These authors use similar ideas in the regression setting, but do global rather than local computations. They assume and exploit spherical symmetry of the feature distribution to estimate interesting subspaces.

13.5 Computational Considerations

One drawback of nearest-neighbor rules in general is the computational load, both in finding the neighbors and storing the entire training set. With N observations and p predictors, nearest-neighbor classification requires Np operations to find the neighbors per query point. There are fast algorithms for finding nearest-neighbors (Friedman et al., 1975; Friedman et al., 1977) which can reduce this load somewhat. Hastie and Simard (1998) reduce the computations for tangent distance by developing analogs of K-means clustering in the context of this invariant metric.

Reducing the storage requirements is more difficult, and various editing and condensing procedures have been proposed. The idea is to isolate a subset of the training set that suffices for nearest-neighbor predictions, and throw away the remaining training data. Intuitively, it seems important to keep the training points that are near the decision boundaries and on the correct side of those boundaries, while some points far from the boundaries could be discarded.
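The Np-per-query cost of brute-force search is easy to see in a vectorized sketch (a naive implementation for illustration, not one of the fast algorithms cited above; the sizes are arbitrary):

```python
import numpy as np

def nn_query(X, x0):
    # one query: N*p multiply-adds to form all N squared distances
    d2 = ((X - x0) ** 2).sum(axis=1)
    return int(np.argmin(d2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # N = 1000 stored points, p = 20
x0 = X[3] + 1e-6                  # a query very near a known stored point
print(nn_query(X, x0))  # → 3
```

Editing and condensing reduce the N in this cost (and the storage) by shrinking the retained training set.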
The multi-edit algorithm of Devijver and Kittler (1982) divides the data cyclically into training and test sets, computing a nearest-neighbor rule on the training set and deleting test points that are misclassified. The idea is to keep homogeneous clusters of training observations. The condensing procedure of Hart (1968) goes further, trying to keep only important exterior points of these clusters. Starting with a single randomly chosen observation as the training set, each additional data item is processed one at a time, adding it to the training set only if it is misclassified by a nearest-neighbor rule computed on the current training set.

These procedures are surveyed in Dasarathy (1991) and Ripley (1996). They can also be applied to other learning procedures besides nearest-neighbors. While such methods are sometimes useful, we have not had much practical experience with them, nor have we found any systematic comparison of their performance in the literature.

Bibliographic Notes

The nearest-neighbor method goes back at least to Fix and Hodges (1951). The extensive literature on the topic is reviewed by Dasarathy (1991); Chapter 6 of Ripley (1996) contains a good summary. K-means clustering is due to Lloyd (1957) and MacQueen (1967). Kohonen (1989) introduced learning vector quantization. The tangent distance method is due to Simard et al. (1993). Hastie and Tibshirani (1996a) proposed the discriminant adaptive nearest-neighbor technique.

Exercises

Ex. 13.1 Consider a Gaussian mixture model where the covariance matrices are assumed to be scalar: Sigma_r = sigma*I, r = 1, ..., R, and sigma is a fixed parameter. Discuss the analogy between the K-means clustering algorithm and the EM algorithm for fitting this mixture model in detail. Show that in the limit sigma -> 0 the two methods coincide.

Ex. 13.2 Derive formula (13.7) for the median radius of the 1-nearest-neighborhood.

Ex. 13.3 Let E* be the error rate of the Bayes rule in a K-class problem, where the true class probabilities are given by p_k(x), k = 1, ..., K. Assuming the test point and training point have identical features x, prove (13.5)

    sum_{k=1}^K p_k(x)(1 - p_k(x)) <= 2(1 - p_{k*}(x)) - (K/(K-1)) (1 - p_{k*}(x))^2,

where k* = arg max_k p_k(x). Hence argue that the error rate of the 1-nearest-neighbor rule converges in L_1, as the size of the training set increases, to a value E_1, bounded above by

    E* (2 - E* K/(K-1)).   (13.12)

[This statement of the theorem of Cover and Hart (1967) is taken from Chapter 6 of Ripley (1996), where a short proof is also given.]

Ex. 13.4 Consider an image to be a function F(x): IR^2 -> IR over the two-dimensional spatial domain (paper coordinates). Then F(c + x_0 + A(x - x_0)) represents an affine transformation of the image F, where A is a 2 x 2 matrix.

1. Decompose A (via Q-R) in such a way that parameters identifying the four affine transformations (two scale, shear and rotation) are clearly identified.

2. Using the chain rule, show that the derivative of F(c + x_0 + A(x - x_0)) w.r.t. each of these parameters can be represented in terms of the two spatial derivatives of F.

3. Using a two-dimensional kernel smoother (Chapter 6), describe how to implement this procedure when the images are quantized to 16 x 16 pixels.

Ex. 13.5 Let B_i, i = 1, 2, ..., N be square p x p positive semi-definite matrices and let Bbar = (1/N) sum_i B_i. Write the eigen-decomposition of Bbar as sum_{l=1}^p theta_l e_l e_l^T with theta_1 >= theta_2 >= ... >= theta_p. Show that the best rank-L approximation for the B_i,

    min_{rank(M)=L} sum_{i=1}^N trace[(B_i - M)^2],

is given by Bbar^[L] = sum_{l=1}^L theta_l e_l e_l^T. (Hint: Write sum_{i=1}^N trace[(B_i - M)^2] as

    sum_{i=1}^N trace[(B_i - Bbar)^2] + N * trace[(M - Bbar)^2].)

Ex. 13.6 Here we consider the problem of shape averaging. In particular, L_i, i = 1, ..., M are each N x 2 matrices of points in IR^2, each sampled from corresponding positions of handwritten (cursive) letters. We seek an affine invariant average V, also N x 2, V^T V = I, of the M letters L_i with the following property: V minimizes

    sum_{j=1}^M min_{A_j} || L_j - V A_j ||^2.

Characterize the solution.

This solution can suffer if some of the letters are big and dominate the average. An alternative approach is to minimize instead:

    sum_{j=1}^M min_{A_j} || L_j A_j - V ||^2.

Derive the solution to this problem. How do the criteria differ? Use the SVD of the L_j to simplify the comparison of the two approaches.
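A quick numeric check of the rank-L result in Ex. 13.5, comparing the truncated eigenexpansion of Bbar against an arbitrary rank-L competitor on random positive semi-definite matrices (a sanity check of the claim, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, L = 4, 6, 2

# random positive semi-definite matrices B_i and their average Bbar
Bs = []
for _ in range(N):
    A = rng.normal(size=(p, p))
    Bs.append(A @ A.T)
Bbar = sum(Bs) / N

def objective(M):
    # sum_i trace[(B_i - M)^2], the criterion of Ex. 13.5
    return float(sum(np.trace((Bi - M) @ (Bi - M)) for Bi in Bs))

# truncated eigenexpansion of Bbar: the claimed minimizer over rank-L matrices
lam, U = np.linalg.eigh(Bbar)            # eigenvalues in ascending order
idx = np.argsort(lam)[::-1][:L]
B_L = (U[:, idx] * lam[idx]) @ U[:, idx].T

# any other rank-L candidate should do no better
v = rng.normal(size=(p, L))
competitor = v @ v.T                      # a random rank-L PSD matrix
assert objective(B_L) <= objective(competitor) + 1e-9
```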

Ex. 13.7 Consider the application of nearest-neighbors to the "easy" and "hard" problems in the left panel of Figure 13.5.

1. Replicate the results in the left panel of Figure 13.5.

2. Estimate the misclassification errors using fivefold cross-validation, and compare the error rate curves to those in 1.

3. Consider an "AIC-like" penalization of the training set misclassification error. Specifically, add t/N to the training set misclassification error, where t is the approximate number of parameters N/r, r being the number of nearest-neighbors. Compare plots of the resulting penalized misclassification error to those in 1 and 2. Which method gives a better estimate of the optimal number of nearest-neighbors: cross-validation or AIC?

Ex. 13.8 Generate data in two classes, with two features. These features are all independent Gaussian variates with standard deviation 1. Their mean vectors are (-1, -1) in class 1 and (1, 1) in class 2. To each feature vector apply a random rotation of angle theta, theta chosen uniformly from 0 to 2*pi. Generate 50 observations from each class to form the training set, and 500 in each class as the test set. Apply four different classifiers:

1. Nearest-neighbors.

2. Nearest-neighbors with hints: ten randomly rotated versions of each data point are added to the training set before applying nearest-neighbors.

3. Invariant metric nearest-neighbors, using Euclidean distance invariant to rotations about the origin.

4. Tangent distance nearest-neighbors.

In each case choose the number of neighbors by tenfold cross-validation. Compare the results.


14 Unsupervised Learning

14.1 Introduction

The previous chapters have been concerned with predicting the values of one or more outputs or response variables Y = (Y_1, ..., Y_m) for a given set of input or predictor variables X^T = (X_1, ..., X_p). Denote by x_i^T = (x_{i1}, ..., x_{ip}) the inputs for the ith training case, and let y_i be a response measurement. The predictions are based on the training sample (x_1, y_1), ..., (x_N, y_N) of previously solved cases, where the joint values of all of the variables are known. This is called supervised learning or "learning with a teacher." Under this metaphor the "student" presents an answer yhat_i for each x_i in the training sample, and the supervisor or "teacher" provides either the correct answer and/or an error associated with the student's answer. This is usually characterized by some loss function L(y, yhat), for example, L(y, yhat) = (y - yhat)^2.

If one supposes that (X, Y) are random variables represented by some joint probability density Pr(X, Y), then supervised learning can be formally characterized as a density estimation problem where one is concerned with determining properties of the conditional density Pr(Y | X). Usually the properties of interest are the "location" parameters mu that minimize the expected error at each x,

    mu(x) = argmin_theta E_{Y|X} L(Y, theta).   (14.1)
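For squared-error loss, the minimizer in (14.1) is the conditional mean; a small numeric illustration via a grid search over theta on simulated draws (the distribution and grid here are arbitrary choices for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
# draws standing in for Pr(Y | X = x); the true conditional mean is 2.0
y = rng.normal(loc=2.0, scale=1.0, size=20_000)

# estimate E_{Y|X} L(Y, theta) for L(y, theta) = (y - theta)^2 on a grid
thetas = np.linspace(0.0, 4.0, 201)
risk = ((y[:, None] - thetas[None, :]) ** 2).mean(axis=0)
best = float(thetas[np.argmin(risk)])
print(best, float(y.mean()))  # the empirical minimizer tracks the sample mean
```

Replacing the squared-error loss with absolute error would move the minimizer to the conditional median instead.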

Conditioning one has Pr(X, Y) = Pr(Y | X) * Pr(X), where Pr(X) is the joint marginal density of the X values alone. In supervised learning Pr(X) is typically of no direct concern. One is interested mainly in the properties of the conditional density Pr(Y | X). Since Y is often of low dimension (usually one), and only its location mu(x) is of interest, the problem is greatly simplified. As discussed in the previous chapters, there are many approaches for successfully addressing supervised learning in a variety of contexts.

In this chapter we address unsupervised learning or "learning without a teacher." In this case one has a set of N observations (x_1, x_2, ..., x_N) of a random p-vector X having joint density Pr(X). The goal is to directly infer the properties of this probability density without the help of a supervisor or teacher providing correct answers or degree-of-error for each observation. The dimension of X is sometimes much higher than in supervised learning, and the properties of interest are often more complicated than simple location estimates. These factors are somewhat mitigated by the fact that X represents all of the variables under consideration; one is not required to infer how the properties of Pr(X) change, conditioned on the changing values of another set of variables.

In low-dimensional problems (say p <= 3), there are a variety of effective nonparametric methods for directly estimating the density Pr(X) itself at all X-values, and representing it graphically (Silverman, 1986, e.g.). Owing to the curse of dimensionality, these methods fail in high dimensions. One must settle for estimating rather crude global models, such as Gaussian mixtures or various simple descriptive statistics that characterize Pr(X).

Generally, these descriptive statistics attempt to characterize X-values, or collections of such values, where Pr(X) is relatively large. Principal components, multidimensional scaling, self-organizing maps, and principal curves, for example, attempt to identify low-dimensional manifolds within the X-space that represent high data density.
This provides information about the associations among the variables and whether or not they can be considered as functions of a smaller set of "latent" variables. Cluster analysis attempts to find multiple convex regions of the X-space that contain modes of Pr(X). This can tell whether or not Pr(X) can be represented by a mixture of simpler densities representing distinct types or classes of observations. Mixture modeling has a similar goal. Association rules attempt to construct simple descriptions (conjunctive rules) that describe regions of high density in the special case of very high dimensional binary-valued data.

With supervised learning there is a clear measure of success, or lack thereof, that can be used to judge adequacy in particular situations and to compare the effectiveness of different methods over various situations.

Lack of success is directly measured by expected loss over the joint distribution Pr(X, Y). This can be estimated in a variety of ways including cross-validation. In the context of unsupervised learning, there is no such direct measure of success. It is difficult to ascertain the validity of inferences drawn from the output of most unsupervised learning algorithms. One must resort to heuristic arguments not only for motivating the algorithms, as is often the case in supervised learning as well, but also for judgments as to the quality of the results. This uncomfortable situation has led to heavy proliferation of proposed methods, since effectiveness is a matter of opinion and cannot be verified directly.

In this chapter we present those unsupervised learning techniques that are among the most commonly used in practice, and additionally, a few others that are favored by the authors.

14.2 Association Rules

Association rule analysis has emerged as a popular tool for mining commercial data bases. The goal is to find joint values of the variables X = (X_1, X_2, ..., X_p) that appear most frequently in the data base. It is most often applied to binary-valued data X_j in {0, 1}, where it is referred to as "market basket" analysis. In this context the observations are sales transactions, such as those occurring at the checkout counter of a store. The variables represent all of the items sold in the store. For observation i, each variable X_j is assigned one of two values; x_ij = 1 if the jth item is purchased as part of the transaction, whereas x_ij = 0 if it was not purchased. Those variables that frequently have joint values of one represent items that are frequently purchased together. This information can be quite useful for stocking shelves, cross-marketing in sales promotions, catalog design, and consumer segmentation based on buying patterns.

More generally, the basic goal of association rule analysis is to find a collection of prototype X-values v_1, ..., v_L for the feature vector X, such that the probability density Pr(v_l) evaluated at each of those values is relatively large.
In this general framework, the problem can be viewed as "mode finding" or "bump hunting." As formulated, this problem is impossibly difficult. A natural estimator for each Pr(v_l) is the fraction of observations for which X = v_l. For problems that involve more than a small number of variables, each of which can assume more than a small number of values, the number of observations for which X = v_l will nearly always be too small for reliable estimation. In order to have a tractable problem, both the goals of the analysis and the generality of the data to which it is applied must be greatly simplified.

The first simplification modifies the goal. Instead of seeking values x where Pr(x) is large, one seeks regions of the X-space with high probability

content relative to their size or support. Let S_j represent the set of all possible values of the jth variable (its support), and let s_j, a subset of S_j, be a subset of these values. The modified goal can be stated as attempting to find subsets of variable values s_1, ..., s_p such that the probability of each of the variables simultaneously assuming a value within its respective subset,

    Pr[ intersection_{j=1}^p (X_j in s_j) ],   (14.2)

is relatively large. The intersection of subsets intersection_{j=1}^p (X_j in s_j) is called a conjunctive rule. For quantitative variables the subsets s_j are contiguous intervals; for categorical variables the subsets are delineated explicitly. Note that if the subset s_j is in fact the entire set of values s_j = S_j, as is often the case, the variable X_j is said not to appear in the rule (14.2).

14.2.1 Market Basket Analysis

General approaches to solving (14.2) are discussed in Section 14.2.5. These can be quite useful in many applications. However, they are not feasible for the very large (p ~ 10^4, N ~ 10^8) commercial data bases to which market basket analysis is often applied. Several further simplifications of (14.2) are required. First, only two types of subsets are considered; either s_j consists of a single value of X_j, s_j = v_{0j}, or it consists of the entire set of values that X_j can assume, s_j = S_j. This simplifies the problem (14.2) to finding subsets of the integers J, a subset of {1, ..., p}, and corresponding values v_{0j}, j in J, such that

    Pr[ intersection_{j in J} (X_j = v_{0j}) ]   (14.3)

is large. Figure 14.1 illustrates this assumption.

One can apply the technique of dummy variables to turn (14.3) into a problem involving only binary-valued variables. Here we assume that the support S_j is finite for each variable X_j. Specifically, a new set of variables Z_1, ..., Z_K is created, one such variable for each of the values v_{lj} attainable by each of the original variables X_1, ..., X_p. The number of dummy variables K is

    K = sum_{j=1}^p |S_j|,

where |S_j| is the number of distinct values attainable by X_j.
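The dummy-variable expansion just described, together with the support (prevalence) of an item set, can be sketched in a few lines; the toy transaction table and variable names below are purely illustrative:

```python
import numpy as np

# toy data: N = 5 transactions over p = 2 original variables,
# X_1 in {a, b} and X_2 in {1, 2, 3}, so K = 2 + 3 dummy variables
X = [("a", 1), ("a", 2), ("b", 1), ("a", 1), ("b", 3)]
levels = [("a", "b"), (1, 2, 3)]

# Z_k = 1 iff the associated original variable takes the associated value
Z = np.array([[int(x[j] == v) for j, S in enumerate(levels) for v in S]
              for x in X])

def support(Z, item_set):
    # fraction of observations containing the item set
    return float(Z[:, list(item_set)].all(axis=1).mean())

print(Z.shape)             # (5, 5): N observations, K dummies
print(support(Z, {0, 2}))  # support of {X_1 = a, X_2 = 1}: 0.4
```

Dummy columns 0-1 encode the levels of X_1 and columns 2-4 those of X_2, so an item set is just a set of column indices.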
Each dummy variable is assigned the value Z_k = 1 if the variable with which it is associated takes on the corresponding value to which Z_k is assigned, and Z_k = 0 otherwise. This transforms (14.3) to finding a subset of the integers K ⊂ {1, ..., K} such that

FIGURE 14.1. Simplifications for association rules. Here there are two inputs X_1 and X_2, taking four and six distinct values, respectively. The red squares indicate areas of high density. To simplify the computations, we assume that the derived subset corresponds to either a single value of an input or all values. With this assumption we could find either the middle or right pattern, but not the left one.

Pr[ ⋂_{k∈K} (Z_k = 1) ]    (14.4)

is large. This is the standard formulation of the market basket problem. The set K is called an "item set." The number of variables Z_k in the item set is called its "size" (note that the size is no bigger than p). The estimated value of (14.4) is taken to be the fraction of observations in the data base for which the conjunction in (14.4) is true:

P̂r[ ⋂_{k∈K} (Z_k = 1) ] = (1/N) Σ_{i=1}^N ∏_{k∈K} z_{ik}.    (14.5)

Here z_{ik} is the value of Z_k for the ith case. This is called the "support" or "prevalence" T(K) of the item set K. An observation i for which ∏_{k∈K} z_{ik} = 1 is said to "contain" the item set K. In association rule mining a lower support bound t is specified, and one seeks all item sets K_l that can be formed from the variables Z_1, ..., Z_K with support in the data base greater than this lower bound t:

{ K_l | T(K_l) > t }.    (14.6)

14.2.2 The Apriori Algorithm

The solution to this problem (14.6) can be obtained with feasible computation for very large data bases provided the threshold t is adjusted so that (14.6) consists of only a small fraction of all 2^K possible item sets. The Apriori algorithm (Agrawal et al., 1995) exploits several aspects of the

curse of dimensionality to solve (14.6) with a small number of passes over the data. Specifically, for a given support threshold t:

- The cardinality |{K | T(K) > t}| is relatively small.
- Any item set L consisting of a subset of the items in K must have support greater than or equal to that of K: L ⊆ K ⇒ T(L) ≥ T(K).

The first pass over the data computes the support of all single-item sets. Those whose support is less than the threshold are discarded. The second pass computes the support of all item sets of size two that can be formed from pairs of the single items surviving the first pass. In other words, to generate all frequent item sets with |K| = m, we need to consider only candidates such that all of their m ancestral item sets of size m − 1 are frequent. Those size-two item sets with support less than the threshold are discarded. Each successive pass over the data considers only those item sets that can be formed by combining those that survived the previous pass with those retained from the first pass. Passes over the data continue until all candidate rules from the previous pass have support less than the specified threshold. The Apriori algorithm requires only one pass over the data for each value of |K|, which is crucial since we assume the data cannot be fitted into a computer's main memory. If the data are sufficiently sparse (or if the threshold t is high enough), then the process will terminate in reasonable time even for huge data sets. There are many additional tricks that can be used as part of this strategy to increase speed and convergence (Agrawal et al., 1995). The Apriori algorithm represents one of the major advances in data mining technology. Each high-support item set K (14.6) returned by the Apriori algorithm is cast into a set of association rules. The items Z_k, k ∈ K, are partitioned into two disjoint subsets, A ∪ B = K, and written

A ⇒ B.    (14.7)

The first item subset A is called the "antecedent" and the second B the "consequent."
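The level-wise strategy just described can be sketched in a few lines. This is an illustrative toy implementation, not Agrawal et al.'s algorithm: the `apriori` function, the sample baskets, and the threshold are all made up, and a real implementation would avoid rescanning the data for every candidate.

```python
# Toy level-wise frequent-item-set miner in the spirit of Apriori
# (illustrative sketch only; names and data are hypothetical).
from itertools import combinations

def apriori(baskets, t):
    """Return all item sets whose support (fraction of baskets
    containing them) exceeds the threshold t."""
    n = len(baskets)
    support = lambda items: sum(items <= b for b in baskets) / n
    items = {i for b in baskets for i in b}
    # First pass: frequent single-item sets.
    levels = [{frozenset([i]) for i in items if support(frozenset([i])) > t}]
    while levels[-1]:
        prev, size = levels[-1], len(next(iter(levels[-1]))) + 1
        # Candidates: unions of surviving sets, kept only if every
        # (size-1)-subset also survived the previous pass (the Apriori property).
        cands = {a | b for a in prev for b in prev if len(a | b) == size}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, size - 1))}
        levels.append({c for c in cands if support(c) > t})
    return set().union(*levels[:-1])

baskets = [frozenset(b) for b in [
    {'bread', 'milk'}, {'bread', 'jam'}, {'bread', 'milk', 'jam'},
    {'milk'}, {'bread', 'milk'}]]
freq = apriori(baskets, t=0.5)  # {bread}, {milk} and {bread, milk}
```

The pruning step is where the second property above does its work: a candidate is scored against the data only if all of its immediate subsets were already found frequent.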
Association rules are defined to have several properties based on the prevalence of the antecedent and consequent item sets in the data base. The "support" of the rule T(A ⇒ B) is the fraction of observations in the union of the antecedent and consequent, which is just the support of the item set K from which they were derived. It can be viewed as an estimate (14.5) of the probability of simultaneously observing both item sets, Pr(A and B), in a randomly selected market basket. The "confidence" or "predictability" C(A ⇒ B) of the rule is its support divided by the support of the antecedent

C(A ⇒ B) = T(A ⇒ B) / T(A),    (14.8)

which can be viewed as an estimate of Pr(B | A). The notation Pr(A), the probability of an item set A occurring in a basket, is an abbreviation for

Pr(⋂_{k∈A} (Z_k = 1)). The "expected confidence" is defined as the support of the consequent T(B), which is an estimate of the unconditional probability Pr(B). Finally, the "lift" of the rule is defined as the confidence divided by the expected confidence

L(A ⇒ B) = C(A ⇒ B) / T(B).

This is an estimate of the association measure Pr(A and B)/[Pr(A)Pr(B)]. As an example, suppose the item set K = {peanut butter, jelly, bread} and consider the rule {peanut butter, jelly} ⇒ {bread}. A support value of 0.01 for this rule means that peanut butter, jelly, and bread appeared together in 1% of the market baskets. A confidence of 0.82 for this rule implies that when peanut butter and jelly were purchased, 82% of the time bread was also purchased. If bread appeared in 42% of all market baskets then the rule {peanut butter, jelly} ⇒ {bread} would have a lift of 1.95. The goal of this analysis is to produce association rules (14.7) with both high values of support and confidence (14.8). The Apriori algorithm returns all item sets with high support as defined by the support threshold t (14.6). A confidence threshold c is set, and all rules that can be formed from those item sets (14.6) with confidence greater than this value,

{A ⇒ B | C(A ⇒ B) > c},    (14.9)

are reported. For each item set K of size |K| there are 2^|K| − 2 rules of the form A ⇒ (K − A), A ⊂ K. Agrawal et al. (1995) present a variant of the Apriori algorithm that can rapidly determine which rules survive the confidence threshold (14.9) from all possible rules that can be formed from the solution item sets (14.6). The output of the entire analysis is a collection of association rules (14.7) that satisfy the constraints T(A ⇒ B) > t and C(A ⇒ B) > c. These are generally stored in a data base that can be queried by the user. Typical requests might be to display the rules in sorted order of confidence, lift or support. More specifically, one might request such a list conditioned on particular items in the antecedent or especially the consequent.
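As a quick numerical check, the definitions of confidence and lift can be evaluated directly. The probabilities below are hypothetical, chosen to roughly mirror the peanut-butter illustration above:

```python
# Confidence and lift of a rule A => B from (estimated) probabilities;
# the numbers are made up to mirror the grocery example in the text.
def rule_stats(p_ab, p_a, p_b):
    """p_ab ~ Pr(A and B), p_a ~ Pr(A), p_b ~ Pr(B)."""
    confidence = p_ab / p_a   # estimate of Pr(B | A), as in (14.8)
    lift = confidence / p_b   # Pr(A and B) / (Pr(A) Pr(B))
    return confidence, lift

conf, lift = rule_stats(p_ab=0.01, p_a=0.0122, p_b=0.42)
```

With these inputs the rule has confidence about 0.82 and lift about 1.95, matching the arithmetic in the example.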
For example, a request might be the following: "Display all transactions in which ice skates are the consequent that have confidence over 80% and support of more than 2%." This could provide information on those items (antecedent) that predicate sales of ice skates. Focusing on a particular consequent casts the problem into the framework of supervised learning. Association rules have become a popular tool for analyzing very large commercial data bases in settings where market basket is relevant. That is

when the data can be cast in the form of a multidimensional contingency table. The output is in the form of conjunctive rules (14.4) that are easily understood and interpreted. The Apriori algorithm allows this analysis to be applied to huge data bases, much larger than are amenable to other types of analyses. Association rules are among data mining's biggest successes. Besides the restrictive form of the data to which they can be applied, association rules have other limitations. Critical to computational feasibility is the support threshold (14.6). The number of solution item sets, their size, and the number of passes required over the data can grow exponentially with decreasing size of this lower bound. Thus, rules with high confidence or lift, but low support, will not be discovered. For example, a high confidence rule such as vodka ⇒ caviar will not be uncovered owing to the low sales volume of the consequent caviar.

14.2.3 Example: Market Basket Analysis

We illustrate the use of Apriori on a moderately sized demographics data base. This data set consists of N = 9409 questionnaires filled out by shopping mall customers in the San Francisco Bay Area (Impact Resources, Inc., Columbus OH, 1987). Here we use answers to the first 14 questions, relating to demographics, for illustration. These questions are listed in Table 14.1. The data are seen to consist of a mixture of ordinal and (unordered) categorical variables, many of the latter having more than a few values. There are many missing values. We used a freeware implementation of the Apriori algorithm due to Christian Borgelt. After removing observations with missing values, each ordinal predictor was cut at its median and coded by two dummy variables; each categorical predictor with k categories was coded by k dummy variables. This resulted in a matrix of 6876 observations on 50 dummy variables. The algorithm found a total of 6288 association rules, involving ≤ 5 predictors, with support of at least 10%. Understanding this large set of rules is itself a challenging data analysis task.
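The preprocessing just described (median cuts for ordinals, one dummy per level for categoricals) can be sketched as follows; the toy columns are invented, not the survey data:

```python
# Dummy coding as described in the text (hypothetical toy columns):
# an ordinal predictor is cut at its median into two dummies, and a
# categorical predictor with k levels is coded by k dummies.
income = [10, 30, 25, 60, 45]            # ordinal, in $1000s (made up)
median = sorted(income)[len(income) // 2]
income_dummies = [(int(v < median), int(v >= median)) for v in income]

status = ['rent', 'own', 'own', 'rent', 'family']  # categorical, k = 3
levels = sorted(set(status))
status_dummies = [tuple(int(s == l) for l in levels) for s in status]
```

Each observation thus contributes exactly one "on" dummy per input, which is what makes the item-set formulation (14.4) applicable.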
We will not attempt this here, but only illustrate in Figure 14.2 the relative frequency of each dummy variable in the data (top) and the association rules (bottom). Prevalent categories tend to appear more often in the rules, for example, the first category in language (English). However, others such as occupation are under-represented, with the exception of the first and fifth level. Here are three examples of association rules found by the Apriori algorithm (the freeware implementation is available from Borgelt's website): Association rule 1: Support 25%, confidence 99.7% and lift 1.03.

FIGURE 14.2. Market basket analysis: relative frequency of each dummy variable (coding an input category) in the data (top), and the association rules found by the Apriori algorithm (bottom). [Two bar charts, "Relative Frequency in Data" and "Relative Frequency in Association Rules," with the attributes income, sex, marstat, age, educ, occup, yrs-bay, dualinc, perhous, peryoung, house, typehome, ethnic and language on the horizontal axis.]

TABLE 14.1. Inputs for the demographic data.

Feature  Demographic             # Values  Type
  1      Sex                        2      Categorical
  2      Marital status             5      Categorical
  3      Age                        7      Ordinal
  4      Education                  6      Ordinal
  5      Occupation                 9      Categorical
  6      Income                     9      Ordinal
  7      Years in Bay Area          5      Ordinal
  8      Dual incomes               3      Categorical
  9      Number in household        9      Ordinal
 10      Number of children         9      Ordinal
 11      Householder status         3      Categorical
 12      Type of home               5      Categorical
 13      Ethnic classification      8      Categorical
 14      Language in home           3      Categorical

[number in household = 1 and number of children = 0] ⇒ language in home = English

Association rule 2: Support 13.4%, confidence 80.8%, and lift 2.13.

[language in home = English and householder status = own and occupation ∈ {professional/managerial}] ⇒ income ≥ $40,000

Association rule 3: Support 26.5%, confidence 82.8% and lift 2.15.

[language in home = English and income < $40,000 and marital status = not married and number of children = 0] ⇒ education ∉ {college graduate, graduate study}

We chose the first and third rules based on their high support. The second rule is an association rule with a high-income consequent, and could be used to try to target high-income individuals. As stated above, we created dummy variables for each category of the input predictors, for example, Z_1 = I(income < $40,000) and Z_2 = I(income ≥ $40,000) for below and above the median income. If we were interested only in finding associations with the high-income category, we would include Z_2 but not Z_1. This is often the case in actual market basket problems, where we are interested in finding associations with the presence of a relatively rare item, but not associations with its absence.

14.2.4 Unsupervised as Supervised Learning

Here we discuss a technique for transforming the density estimation problem into one of supervised function approximation. This forms the basis for the generalized association rules described in the next section. Let g(x) be the unknown data probability density to be estimated, and g_0(x) be a specified probability density function used for reference. For example, g_0(x) might be the uniform density over the range of the variables. Other possibilities are discussed below. The data set x_1, x_2, ..., x_N is presumed to be an i.i.d. random sample drawn from g(x). A sample of size N_0 can be drawn from g_0(x) using Monte Carlo methods. Pooling these two data sets, and assigning mass w = N_0/(N + N_0) to those drawn from g(x), and w_0 = N/(N + N_0) to those drawn from g_0(x), results in a random sample drawn from the mixture density (g(x) + g_0(x))/2. If one assigns the value Y = 1 to each sample point drawn from g(x) and Y = 0 to those drawn from g_0(x), then

µ(x) = E(Y | x) = g(x)/[g(x) + g_0(x)] = [g(x)/g_0(x)] / [1 + g(x)/g_0(x)]    (14.10)

can be estimated by supervised learning using the combined sample

(y_1, x_1), (y_2, x_2), ..., (y_{N+N_0}, x_{N+N_0})    (14.11)

as training data. The resulting estimate µ̂(x) can be inverted to provide an estimate for g(x):

ĝ(x) = g_0(x) · µ̂(x)/[1 − µ̂(x)].    (14.12)
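The inversion step (14.12) is easy to verify numerically when µ(x) is known exactly. The sketch below uses a standard normal as a stand-in for the "unknown" g(x) and a uniform reference g_0(x); in practice µ̂ would come from a fitted classifier, not from the true densities.

```python
import math

def g(x):
    """Stand-in for the unknown data density: standard normal (hypothetical)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def g0(x):
    """Reference density: uniform on [-4, 4]."""
    return 1 / 8 if -4 <= x <= 4 else 0.0

def mu(x):
    """E(Y | x) = g/(g + g0) for the pooled equal-mass sample, as in (14.10)."""
    return g(x) / (g(x) + g0(x))

x = 1.3
g_hat = g0(x) * mu(x) / (1 - mu(x))  # inversion (14.12) recovers g(x)
```

Since µ/(1 − µ) equals the density ratio g/g_0 exactly, multiplying back by g_0 returns g, up to floating-point error.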
Generalized versions of logistic regression (Section 4.4) are especially well suited for this application since the log-odds,

f(x) = log [g(x)/g_0(x)],    (14.13)

are estimated directly. In this case one has

FIGURE 14.3. Density estimation via classification. (Left panel:) Training set of 200 data points. (Right panel:) Training set plus 200 reference data points, generated uniformly over the rectangle containing the training data. The training sample was labeled as class 1, and the reference sample class 0, and a semiparametric logistic regression model was fit to the data. Some contours for ĝ(x) are shown.

ĝ(x) = g_0(x) e^{f̂(x)}.    (14.14)

An example is shown in Figure 14.3. We generated a training set of size 200 shown in the left panel. The right panel shows the reference data (blue) generated uniformly over the rectangle containing the training data. The training sample was labeled as class 1, and the reference sample class 0, and a logistic regression model, using a tensor product of natural splines (Section 5.2.1), was fit to the data. Some probability contours of µ̂(x) are shown in the right panel; these are also the contours of the density estimate ĝ(x), since ĝ(x) = g_0(x) µ̂(x)/[1 − µ̂(x)] is a monotone function of µ̂(x). The contours roughly capture the data density. In principle any reference density can be used for g_0(x) in (14.14). In practice the accuracy of the estimate ĝ(x) can depend greatly on particular choices. Good choices will depend on the data density g(x) and the procedure used to estimate (14.10) or (14.13). If accuracy is the goal, g_0(x) should be chosen so that the resulting functions µ(x) or f(x) are approximated easily by the method being used. However, accuracy is not always the primary goal. Both µ(x) and f(x) are monotonic functions of the density ratio g(x)/g_0(x). They can thus be viewed as "contrast" statistics that provide information concerning departures of the data density g(x) from the chosen reference density g_0(x). Therefore, in data analytic settings, a choice for g_0(x) is dictated by types of departures that are deemed most interesting in the context of the specific problem at hand. For example, if departures from uniformity are of interest, g_0(x) might be the uniform density over the range of the variables.
If departures from joint normality

are of interest, a good choice for g_0(x) would be a Gaussian distribution with the same mean vector and covariance matrix as the data. Departures from independence could be investigated by using

g_0(x) = ∏_{j=1}^p g_j(x_j),    (14.15)

where g_j(x_j) is the marginal data density of X_j, the jth coordinate of X. A sample from this independent density (14.15) is easily generated from the data itself by applying a different random permutation to the data values of each of the variables. As discussed above, unsupervised learning is concerned with revealing properties of the data density g(x). Each technique focuses on a particular property or set of properties. Although this approach of transforming the problem to one of supervised learning (14.10)–(14.14) seems to have been part of the statistics folklore for some time, it does not appear to have had much impact despite its potential to bring well-developed supervised learning methodology to bear on unsupervised learning problems. One reason may be that the problem must be enlarged with a simulated data set generated by Monte Carlo techniques. Since the size of this data set should be at least as large as the data sample, N_0 ≥ N, the computation and memory requirements of the estimation procedure are at least doubled. Also, substantial computation may be required to generate the Monte Carlo sample itself. Although perhaps a deterrent in the past, these increased computational requirements are becoming much less of a burden as increased resources become routinely available. We illustrate the use of supervised learning methods for unsupervised learning in the next section.

14.2.5 Generalized Association Rules

The more general problem (14.2) of finding high-density regions in the data space can be addressed using the supervised learning approach described above. Although not applicable to the huge data bases for which market basket analysis is feasible, useful information can be obtained from moderately sized data sets. The problem (14.2)
can be formulated as finding subsets of the integers J ⊂ {1, 2, ..., p} and corresponding value subsets s_j, j ∈ J for the corresponding variables X_j, such that

P̂r[ ⋂_{j∈J} (X_j ∈ s_j) ] = (1/N) Σ_{i=1}^N ∏_{j∈J} I(x_{ij} ∈ s_j)    (14.16)

is large. Following the nomenclature of association rule analysis, {(X_j ∈ s_j)}_{j∈J} will be called a "generalized" item set. The subsets s_j corresponding to quantitative variables are taken to be contiguous intervals within

their range of values, and subsets for categorical variables can involve more than a single value. The ambitious nature of this formulation precludes a thorough search for all generalized item sets with support (14.16) greater than a specified minimum threshold, as was possible in the more restrictive setting of market basket analysis. Heuristic search methods must be employed, and the most one can hope for is to find a useful collection of such generalized item sets. Both market basket analysis (14.5) and the generalized formulation (14.16) implicitly reference the uniform probability distribution. One seeks item sets that are more frequent than would be expected if all joint data values (x_1, x_2, ..., x_N) were uniformly distributed. This favors the discovery of item sets whose marginal constituents (X_j ∈ s_j) are individually frequent, that is, the quantity

(1/N) Σ_{i=1}^N I(x_{ij} ∈ s_j)    (14.17)

is large. Conjunctions of frequent subsets (14.17) will tend to appear more often among item sets of high support (14.16) than conjunctions of marginally less frequent subsets. This is why the rule vodka ⇒ caviar is not likely to be discovered in spite of a high association (lift); neither item has high marginal support, so that their joint support is especially small. Reference to the uniform distribution can cause highly frequent item sets with low associations among their constituents to dominate the collection of highest support item sets. Highly frequent subsets s_j are formed as disjunctions of the most frequent X_j-values. Using the product of the variable marginal data densities (14.15) as a reference distribution removes the preference for highly frequent values of the individual variables in the discovered item sets. This is because the density ratio g(x)/g_0(x) is uniform if there are no associations among the variables (complete independence), regardless of the frequency distribution of the individual variable values. Rules like vodka ⇒ caviar would have a chance to emerge.
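Drawing a reference sample from the product of marginals (14.15) really is as simple as permuting each column of the data matrix independently; a small sketch with made-up rows:

```python
import random

random.seed(0)  # only so the sketch is reproducible
data = [[1, 'a'], [2, 'b'], [3, 'a'], [4, 'c']]  # hypothetical observations

# Permute the values of each variable separately: every marginal
# distribution is preserved exactly, while any association between
# the variables is destroyed.
columns = [list(col) for col in zip(*data)]
for col in columns:
    random.shuffle(col)
reference = [list(row) for row in zip(*columns)]
```

The resulting `reference` rows are a draw from the independence density built from the empirical marginals, which is exactly what the supervised formulation needs as its class-0 sample.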
It is not clear, however, how to incorporate reference distributions other than the uniform into the Apriori algorithm. As explained in Section 14.2.4, it is straightforward to generate a sample from the product density (14.15), given the original data set. After choosing a reference distribution, and drawing a sample from it as in (14.11), one has a supervised learning problem with a binary-valued output variable Y ∈ {0, 1}. The goal is to use this training data to find regions

R = ⋂_{j∈J} (X_j ∈ s_j)    (14.18)

for which the target function µ(x) = E(Y | x) is relatively large. In addition, one might wish to require that the data support of these regions

T(R) = ∫_{x∈R} g(x) dx    (14.19)

not be too small.

14.2.6 Choice of Supervised Learning Method

The regions (14.18) are defined by conjunctive rules. Hence supervised methods that learn such rules would be most appropriate in this context. The terminal nodes of a CART decision tree are defined by rules precisely of the form (14.18). Applying CART to the pooled data (14.11) will produce a decision tree that attempts to model the target (14.10) over the entire data space by a disjoint set of regions (terminal nodes). Each region is defined by a rule of the form (14.18). Those terminal nodes t with high average y-values

ȳ_t = ave(y_i | x_i ∈ t)

are candidates for high-support generalized item sets (14.16). The actual (data) support is given by

T(R) = ȳ_t · N_t/(N + N_0),

where N_t is the number of (pooled) observations within the region represented by the terminal node. By examining the resulting decision tree, one might discover interesting generalized item sets of relatively high support. These can then be partitioned into antecedents and consequents in a search for generalized association rules of high confidence and/or lift. Another natural learning method for this purpose is the patient rule induction method PRIM described in Section 9.3. PRIM also produces rules precisely of the form (14.18), but it is especially designed for finding high-support regions that maximize the average target (14.10) value within them, rather than trying to model the target function over the entire data space. It also provides more control over the support/average-target-value tradeoff. Exercise 14.3 addresses an issue that arises with either of these methods when we generate random data from the product of the marginal distributions.

14.2.7 Example: Market Basket Analysis (Continued)

We illustrate the use of PRIM on the demographics data of Table 14.1. Three of the high-support generalized item sets emerging from the PRIM analysis were the following:

Item set 1: Support = 24%.

marital status = married
householder status = own
type of home ≠ apartment

Item set 2: Support = 24%.

age ≤ 24
marital status ∈ {living together-not married, single}
occupation ∉ {professional, homemaker, retired}
householder status ∈ {rent, live with family}

Item set 3: Support = 15%.

householder status = rent
type of home ≠ house
number in household ≤ 2
number of children = 0
occupation ∉ {homemaker, student, unemployed}
income ∈ [$20,000, $150,000]

Generalized association rules derived from these item sets with confidence (14.8) greater than 95% are the following:

Association rule 1: Support 25%, confidence 99.7% and lift 1.35.

[marital status = married and householder status = own] ⇒ type of home ≠ apartment

Association rule 2: Support 25%, confidence 98.7% and lift 1.97.

[age ≤ 24 and occupation ∉ {professional, homemaker, retired} and householder status ∈ {rent, live with family}] ⇒ marital status ∈ {single, living together-not married}

Association rule 3: Support 25%, confidence 95.9% and lift 2.61.

[householder status = own and type of home ≠ apartment] ⇒ marital status = married

Association rule 4: Support 15%, confidence 95.4% and lift 1.50.

[householder status = rent and type of home ≠ house and number in household ≤ 2 and occupation ∉ {homemaker, student, unemployed} and income ∈ [$20,000, $150,000]] ⇒ number of children = 0

There are no great surprises among these particular rules. For the most part they verify intuition. In other contexts where there is less prior information available, unexpected results have a greater chance to emerge. These results do illustrate the type of information generalized association rules can provide, and that the supervised learning approach, coupled with a rule induction method such as CART or PRIM, can uncover item sets exhibiting high associations among their constituents. How do these generalized association rules compare to those found earlier by the Apriori algorithm? Since the Apriori procedure gives thousands of rules, it is difficult to compare them. However some general points can be made. The Apriori algorithm is exhaustive: it finds all rules with support greater than a specified amount. In contrast, PRIM is a greedy algorithm and is not guaranteed to give an "optimal" set of rules. On the other hand, the Apriori algorithm can deal only with dummy variables and hence could not find some of the above rules. For example, since type of home is a categorical input, with a dummy variable for each level, Apriori could not find a rule involving the set "type of home ≠ apartment." To find this set, we would have to code a dummy variable for apartment versus the other categories of type of home. It will not generally be feasible to precode all such potentially interesting comparisons.

14.3 Cluster Analysis

Cluster analysis, also called data segmentation, has a variety of goals. All relate to grouping or segmenting a collection of objects into subsets or "clusters," such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements, or by its relation to other objects. In addition, the goal is sometimes to arrange the clusters into a natural hierarchy.
This involves successively grouping the clusters themselves so

FIGURE 14.4. Simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.

that at each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups. Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. This latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters. Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it. This can only come from subject matter considerations. The situation is somewhat similar to the specification of a loss or cost function in prediction problems (supervised learning). There the cost associated with an inaccurate prediction depends on considerations outside the data. Figure 14.4 shows some simulated data clustered into three groups via the popular K-means algorithm. In this case two of the clusters are not well separated, so that "segmentation" more accurately describes the part of this process than "clustering." K-means clustering starts with guesses for the three cluster centers. Then it alternates the following steps until convergence:

- for each data point, the closest cluster center (in Euclidean distance) is identified;

- each cluster center is replaced by the coordinate-wise average of all data points that are closest to it.

We describe K-means clustering in more detail later, including the problem of how to choose the number of clusters (three in this example). K-means clustering is a top-down procedure, while other cluster approaches that we discuss are bottom-up. Fundamental to all clustering techniques is the choice of distance or dissimilarity measure between two objects. We first discuss distance measures before describing a variety of algorithms for clustering.

14.3.1 Proximity Matrices

Sometimes the data is represented directly in terms of the proximity (alikeness or affinity) between pairs of objects. These can be either similarities or dissimilarities (difference or lack of affinity). For example, in social science experiments, participants are asked to judge by how much certain objects differ from one another. Dissimilarities can then be computed by averaging over the collection of such judgments. This type of data can be represented by an N × N matrix D, where N is the number of objects, and each element d_{ii'} records the proximity between the ith and i'th objects. This matrix is then provided as input to the clustering algorithm. Most algorithms presume a matrix of dissimilarities with nonnegative entries and zero diagonal elements: d_{ii} = 0, i = 1, 2, ..., N. If the original data were collected as similarities, a suitable monotone-decreasing function can be used to convert them to dissimilarities. Also, most algorithms assume symmetric dissimilarity matrices, so if the original matrix D is not symmetric it must be replaced by (D + Dᵀ)/2. Subjectively judged dissimilarities are seldom distances in the strict sense, since the triangle inequality d_{ii'} ≤ d_{ik} + d_{i'k}, for all k ∈ {1, ..., N}, does not hold. Thus, some algorithms that assume distances cannot be used with such data.

14.3.2 Dissimilarities Based on Attributes

Most often we have measurements x_{ij} for i = 1, 2, ..., N, on variables j = 1, 2, ..., p (also called attributes).
Since most of the popular clustering algorithms take a dissimilarity matrix as their input, we must first construct pairwise dissimilarities between the observations. In the most common case, we define a dissimilarity d_j(x_{ij}, x_{i'j}) between values of the jth attribute, and then define

D(x_i, x_{i'}) = Σ_{j=1}^p d_j(x_{ij}, x_{i'j})    (14.20)

as the dissimilarity between objects i and i'. By far the most common choice is squared distance

d_j(x_{ij}, x_{i'j}) = (x_{ij} − x_{i'j})².    (14.21)

However, other choices are possible, and can lead to potentially different results. For nonquantitative attributes (e.g., categorical data), squared distance may not be appropriate. In addition, it is sometimes desirable to weigh attributes differently rather than giving them equal weight as in (14.20). We first discuss alternatives in terms of the attribute type:

Quantitative variables. Measurements of this type of variable or attribute are represented by continuous real-valued numbers. It is natural to define the error between them as a monotone-increasing function of their absolute difference, d(x_i, x_{i'}) = l(|x_i − x_{i'}|). Besides squared-error loss (x_i − x_{i'})², a common choice is the identity (absolute error). The former places more emphasis on larger differences than smaller ones. Alternatively, clustering can be based on the correlation

ρ(x_i, x_{i'}) = Σ_j (x_{ij} − x̄_i)(x_{i'j} − x̄_{i'}) / √[ Σ_j (x_{ij} − x̄_i)² · Σ_j (x_{i'j} − x̄_{i'})² ],    (14.22)

with x̄_i = Σ_j x_{ij}/p. Note that this is averaged over variables, not observations. If the observations are first standardized, then Σ_j (x_{ij} − x_{i'j})² ∝ 2[1 − ρ(x_i, x_{i'})]. Hence clustering based on correlation (similarity) is equivalent to that based on squared distance (dissimilarity).

Ordinal variables. The values of this type of variable are often represented as contiguous integers, and the realizable values are considered to be an ordered set. Examples are academic grades (A, B, C, D, F), degree of preference (can't stand, dislike, OK, like, terrific). Rank data are a special kind of ordinal data. Error measures for ordinal variables are generally defined by replacing their M original values with

(i − 1/2)/M,  i = 1, ..., M    (14.23)

in the prescribed order of their original values. They are then treated as quantitative variables on this scale.

Categorical variables. With unordered categorical (also called nominal) variables, the degree-of-difference between pairs of values must be delineated explicitly.
If the variable assumes M distinct values, these can be arranged in a symmetric M × M matrix with elements L_{rr'} = L_{r'r}, L_{rr} = 0, L_{rr'} ≥ 0. The most common choice is L_{rr'} = 1 for all r ≠ r', while unequal losses can be used to emphasize some errors more than others.
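The equivalence claimed above for standardized observations, Σ_j (x_{ij} − x_{i'j})² = 2p[1 − ρ(x_i, x_{i'})] when each observation has mean 0 and variance 1 across its variables (so the proportionality becomes an equality), can be checked on arbitrary numbers; the two observations below are made up:

```python
import math

def standardize(x):
    """Center and scale one observation across its p variables
    (population standard deviation, dividing by p)."""
    p = len(x)
    m = sum(x) / p
    s = math.sqrt(sum((v - m) ** 2 for v in x) / p)
    return [(v - m) / s for v in x]

def rho(a, b):
    """Correlation (14.22) between two observations, taken over variables."""
    p = len(a)
    ma, mb = sum(a) / p, sum(b) / p
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    da = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    db = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return num / (da * db)

xi = standardize([2.0, 5.0, 1.0, 8.0])   # hypothetical observation
xj = standardize([3.0, 1.0, 4.0, 9.0])   # hypothetical observation
sqdist = sum((a - b) ** 2 for a, b in zip(xi, xj))
identity_gap = abs(sqdist - 2 * len(xi) * (1 - rho(xi, xj)))
```

Since squared distance is then a decreasing function of ρ alone, clustering on 1 − ρ and clustering on squared distance rank all pairs identically.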

14.3.3 Object Dissimilarity

Next we define a procedure for combining the p individual attribute dissimilarities d_j(x_{ij}, x_{i'j}), j = 1, 2, ..., p into a single overall measure of dissimilarity D(x_i, x_{i'}) between two objects or observations (x_i, x_{i'}) possessing the respective attribute values. This is nearly always done by means of a weighted average (convex combination)

D(x_i, x_{i'}) = Σ_{j=1}^p w_j · d_j(x_{ij}, x_{i'j});  Σ_{j=1}^p w_j = 1.    (14.24)

Here w_j is a weight assigned to the jth attribute regulating the relative influence of that variable in determining the overall dissimilarity between objects. This choice should be based on subject matter considerations. It is important to realize that setting the weight w_j to the same value for each variable (say, w_j = 1 for all j) does not necessarily give all attributes equal influence. The influence of the jth attribute X_j on object dissimilarity D(x_i, x_{i'}) (14.24) depends upon its relative contribution to the average object dissimilarity measure over all pairs of observations in the data set,

D̄ = (1/N²) Σ_{i=1}^N Σ_{i'=1}^N D(x_i, x_{i'}) = Σ_{j=1}^p w_j · d̄_j,

with

d̄_j = (1/N²) Σ_{i=1}^N Σ_{i'=1}^N d_j(x_{ij}, x_{i'j})    (14.25)

being the average dissimilarity on the jth attribute. Thus, the relative influence of the jth variable is w_j · d̄_j, and setting w_j ∼ 1/d̄_j would give all attributes equal influence in characterizing overall dissimilarity between objects. For example, with p quantitative variables and squared-error distance used for each coordinate, then (14.24) becomes the (weighted) squared Euclidean distance

D_I(x_i, x_{i'}) = Σ_{j=1}^p w_j (x_{ij} − x_{i'j})²    (14.26)

between pairs of points in IR^p, with the quantitative variables as axes. In this case (14.25) becomes

d̄_j = (1/N²) Σ_{i=1}^N Σ_{i'=1}^N (x_{ij} − x_{i'j})² = 2 · var_j,    (14.27)

where var_j is the sample estimate of Var(X_j). Thus, the relative importance of each such variable is proportional to its variance over the data

FIGURE 14.5. Simulated data: on the left, K-means clustering (with K = 2) has been applied to the raw data. The two colors indicate the cluster memberships. On the right, the features were first standardized before clustering. This is equivalent to using feature weights 1/[2 · var(X_j)]. The standardization has obscured the two well-separated groups. Note that each plot uses the same units in the horizontal and vertical axes.

set. In general, setting w_j = 1/\bar{d}_j for all attributes, irrespective of type, will cause each one of them to equally influence the overall dissimilarity between pairs of objects (x_i, x_{i'}). Although this may seem reasonable, and is often recommended, it can be highly counterproductive. If the goal is to segment the data into groups of similar objects, all attributes may not contribute equally to the (problem-dependent) notion of dissimilarity between objects. Some attribute value differences may reflect greater actual object dissimilarity in the context of the problem domain.

If the goal is to discover natural groupings in the data, some attributes may exhibit more of a grouping tendency than others. Variables that are more relevant in separating the groups should be assigned a higher influence in defining object dissimilarity. Giving all attributes equal influence in this case will tend to obscure the groups to the point where a clustering algorithm cannot uncover them. Figure 14.5 shows an example.

Although simple generic prescriptions for choosing the individual attribute dissimilarities d_j(x_{ij}, x_{i'j}) and their weights w_j can be comforting, there is no substitute for careful thought in the context of each individual problem. Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm. This aspect of the problem is emphasized less in the clustering literature than the algorithms themselves, since it depends on domain knowledge specifics and is less amenable to general research.
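The equal-influence weighting (14.24)–(14.25) can be sketched in a few lines of plain Python (our own illustration, using squared-error attribute dissimilarities; the function names are not from the book):

```python
# Sketch of (14.24)-(14.25): combine per-attribute squared differences
# into one overall dissimilarity, with weights chosen proportional to
# 1/d̄_j so that every attribute exerts equal influence on average.

def attribute_means(X):
    """d̄_j = (1/N^2) sum_{i,i'} (x_ij - x_i'j)^2, one value per attribute j."""
    N, p = len(X), len(X[0])
    dbar = []
    for j in range(p):
        s = sum((X[i][j] - X[k][j]) ** 2 for i in range(N) for k in range(N))
        dbar.append(s / N ** 2)
    return dbar

def equal_influence_weights(dbar):
    """Weights w_j proportional to 1/d̄_j, normalized to sum to 1."""
    raw = [1.0 / d for d in dbar]
    total = sum(raw)
    return [r / total for r in raw]

def dissimilarity(xi, xk, w):
    """D(x_i, x_i') = sum_j w_j (x_ij - x_i'j)^2, as in (14.24)."""
    return sum(wj * (a - b) ** 2 for wj, a, b in zip(w, xi, xk))
```

By construction w_j · d̄_j is the same for every attribute, which is exactly the equal-influence property discussed above; as the text warns, whether that is desirable depends on the problem.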

Finally, often observations have missing values in one or more of the attributes. The most common method of incorporating missing values in dissimilarity calculations (14.24) is to omit each observation pair x_{ij}, x_{i'j} having at least one value missing, when computing the dissimilarity between observations x_i and x_{i'}. This method can fail in the circumstance when both observations have no measured values in common. In this case both observations could be deleted from the analysis. Alternatively, the missing values could be imputed using the mean or median of each attribute over the nonmissing data. For categorical variables, one could consider the value "missing" as just another categorical value, if it were reasonable to consider two objects as being similar if they both have missing values on the same variables.

14.3.4 Clustering Algorithms

The goal of cluster analysis is to partition the observations into groups ("clusters") so that the pairwise dissimilarities between those assigned to the same cluster tend to be smaller than those in different clusters. Clustering algorithms fall into three distinct types: combinatorial algorithms, mixture modeling, and mode seeking.

Combinatorial algorithms work directly on the observed data with no direct reference to an underlying probability model. Mixture modeling supposes that the data is an i.i.d. sample from some population described by a probability density function. This density function is characterized by a parameterized model taken to be a mixture of component density functions; each component density describes one of the clusters. This model is then fit to the data by maximum likelihood or corresponding Bayesian approaches. Mode seekers ("bump hunters") take a nonparametric perspective, attempting to directly estimate distinct modes of the probability density function. Observations closest to each respective mode then define the individual clusters.

Mixture modeling is described in Section 6.8. The PRIM algorithm, discussed in Sections 9.3 and 14.2.5, is an example of mode seeking or "bump hunting."
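The pairwise-deletion convention for missing values described above can be sketched as follows (our own illustration, with `None` standing for a missing entry and squared-error attribute dissimilarities):

```python
# Sketch of the most common missing-data convention: average d_j over
# only those attributes observed in BOTH objects, and fail when the two
# observations have no measured values in common.

def dissimilarity_with_missing(xi, xk):
    terms = [(a - b) ** 2 for a, b in zip(xi, xk)
             if a is not None and b is not None]
    if not terms:
        # both observations have no measured values in common
        raise ValueError("no measured values in common")
    return sum(terms) / len(terms)
```

Averaging over the shared attributes (rather than summing) keeps dissimilarities comparable across pairs with different numbers of missing entries; imputation by the attribute mean or median is the alternative mentioned in the text.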
We discuss combinatorial algorithms next.

14.3.5 Combinatorial Algorithms

The most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data. Each observation is uniquely labeled by an integer i ∈ {1, ..., N}. A prespecified number of clusters K < N is postulated, and each one is labeled by an integer k ∈ {1, ..., K}. Each observation is assigned to one and only one cluster. These assignments can be characterized by a many-to-one mapping, or encoder k = C(i), that assigns the ith observation to the kth cluster. One seeks the particular encoder C*(i) that achieves the

required goal (details below), based on the dissimilarities d(x_i, x_{i'}) between every pair of observations. These are specified by the user as described above. Generally, the encoder C(i) is explicitly delineated by giving its value (cluster assignment) for each observation i. Thus, the "parameters" of the procedure are the individual cluster assignments for each of the N observations. These are adjusted so as to minimize a "loss" function that characterizes the degree to which the clustering goal is not met.

One approach is to directly specify a mathematical loss function and attempt to minimize it through some combinatorial optimization algorithm. Since the goal is to assign close points to the same cluster, a natural loss (or "energy") function would be

W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'}). (14.28)

This criterion characterizes the extent to which observations assigned to the same cluster tend to be close to one another. It is sometimes referred to as the "within cluster" point scatter since

T = \frac{1}{2} \sum_{i=1}^N \sum_{i'=1}^N d_{ii'} = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \Big( \sum_{C(i')=k} d_{ii'} + \sum_{C(i') \ne k} d_{ii'} \Big),

or

T = W(C) + B(C),

where d_{ii'} = d(x_i, x_{i'}). Here T is the total point scatter, which is a constant given the data, independent of cluster assignment. The quantity

B(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i') \ne k} d_{ii'} (14.29)

is the between-cluster point scatter. This will tend to be large when observations assigned to different clusters are far apart. Thus one has W(C) = T − B(C), and minimizing W(C) is equivalent to maximizing B(C).

Cluster analysis by combinatorial optimization is straightforward in principle. One simply minimizes W or equivalently maximizes B over all possible assignments of the N data points to K clusters. Unfortunately, such optimization by complete enumeration is feasible only for very small data sets. The number of distinct assignments is (Jain and Dubes, 1988)

S(N, K) = \frac{1}{K!} \sum_{k=1}^K (-1)^{K-k} \binom{K}{k} k^N. (14.30)

For example, S(10, 4) = 34,105, which is quite feasible. But S(N, K) grows very rapidly with increasing values of its arguments. Already S(19, 4)

≃ 10^{10}, and most clustering problems involve much larger data sets than N = 19. For this reason, practical clustering algorithms are able to examine only a very small fraction of all possible encoders k = C(i). The goal is to identify a small subset that is likely to contain the optimal one, or at least a good suboptimal partition.

Such feasible strategies are based on iterative greedy descent. An initial partition is specified. At each iterative step, the cluster assignments are changed in such a way that the value of the criterion is improved from its previous value. Clustering algorithms of this type differ in their prescriptions for modifying the cluster assignments at each iteration. When the prescription is unable to provide an improvement, the algorithm terminates with the current assignments as its solution. Since the assignment of observations to clusters at any iteration is a perturbation of that for the previous iteration, only a very small fraction of all possible assignments (14.30) are examined. However, these algorithms converge to local optima which may be highly suboptimal when compared to the global optimum.

14.3.6 K-means

The K-means algorithm is one of the most popular iterative descent clustering methods. It is intended for situations in which all variables are of the quantitative type, and squared Euclidean distance

d(x_i, x_{i'}) = \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2

is chosen as the dissimilarity measure. Note that weighted Euclidean distance can be used by redefining the x_{ij} values (Exercise 14.1).

The within-point scatter (14.28) can be written as

W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2 = \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2, (14.31)

where \bar{x}_k = (\bar{x}_{1k}, ..., \bar{x}_{pk}) is the mean vector associated with the kth cluster, and N_k = \sum_{i=1}^N I(C(i) = k). Thus, the criterion is minimized by assigning the N observations to the K clusters in such a way that within each cluster the average dissimilarity of the observations from the cluster mean, as defined by the points in that cluster, is minimized. An iterative descent algorithm for solving

C* = \min_C \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2

can be obtained by noting that for any set of observations S

\bar{x}_S = \mathrm{argmin}_m \sum_{i \in S} \|x_i - m\|^2. (14.32)

Hence we can obtain C* by solving the enlarged optimization problem

\min_{C, \{m_k\}_1^K} \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - m_k\|^2. (14.33)

This can be minimized by an alternating optimization procedure given in Algorithm 14.1.

Algorithm 14.1 K-means Clustering.

1. For a given cluster assignment C, the total cluster variance (14.33) is minimized with respect to {m_1, ..., m_K}, yielding the means of the currently assigned clusters (14.32).

2. Given a current set of means {m_1, ..., m_K}, (14.33) is minimized by assigning each observation to the closest (current) cluster mean. That is,

C(i) = \mathrm{argmin}_{1 \le k \le K} \|x_i - m_k\|^2. (14.34)

3. Steps 1 and 2 are iterated until the assignments do not change.

Each of steps 1 and 2 reduces the value of the criterion (14.33), so that convergence is assured. However, the result may represent a suboptimal local minimum. The algorithm of Hartigan and Wong (1979) goes further, and ensures that there is no single switch of an observation from one group to another group that will decrease the objective. In addition, one should start the algorithm with many different random choices for the starting means, and choose the solution having smallest value of the objective function.

Figure 14.6 shows some of the K-means iterations for the simulated data of Figure 14.4. The centroids are depicted by "O"s. The straight lines show the partitioning of points, each sector being the set of points closest to each centroid. This partitioning is called the Voronoi tessellation. After 20 iterations the procedure has converged.

14.3.7 Gaussian Mixtures as Soft K-means Clustering

The K-means clustering procedure is closely related to the EM algorithm for estimating a certain Gaussian mixture model (Sections 6.8 and 8.5).
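A minimal sketch of Algorithm 14.1 in plain Python follows. For determinism it seeds the means with the first K points; as the text recommends, a serious run would use several random restarts and keep the solution with the smallest objective.

```python
# Sketch of Algorithm 14.1 (K-means). Deterministic initialization from
# the first K observations; X is a list of equal-length numeric lists.

def kmeans(X, K, iters=100):
    means = [list(x) for x in X[:K]]          # initial cluster centers
    assign = None
    for _ in range(iters):
        # Step 2 / (14.34): assign each point to its closest current mean
        new = [min(range(K),
                   key=lambda k: sum((a - b) ** 2
                                     for a, b in zip(x, means[k])))
               for x in X]
        if new == assign:                     # Step 3: stop when stable
            break
        assign = new
        # Step 1 / (14.32): recompute each mean from its assigned points
        for k in range(K):
            pts = [x for x, c in zip(X, assign) if c == k]
            if pts:
                means[k] = [sum(col) / len(pts) for col in zip(*pts)]
    return assign, means
```

Each pass can only decrease the criterion (14.33), so the loop terminates, though possibly at a suboptimal local minimum, which is exactly why multiple restarts matter in practice.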

FIGURE 14.6. Successive iterations of the K-means clustering algorithm for the simulated data of Figure 14.4. (Panels: Initial Centroids; Initial Partition; Iteration Number 2; Iteration Number 20.)

FIGURE 14.7. (Left panels:) two Gaussian densities g_0(x) and g_1(x) (blue and orange) on the real line, and a single data point (green dot) at x = 0.5. The colored squares are plotted at x = −1.0 and x = 1.0, the means of each density. (Right panels:) the relative densities g_0(x)/(g_0(x) + g_1(x)) and g_1(x)/(g_0(x) + g_1(x)), called the "responsibilities" of each cluster, for this data point. In the top panels, the Gaussian standard deviation σ = 1.0; in the bottom panels σ = 0.2. The EM algorithm uses these responsibilities to make a "soft" assignment of each data point to each of the two clusters. When σ is fairly large, the responsibilities can be near 0.5 (they are 0.36 and 0.64 in the top right panel). As σ → 0, the responsibilities → 1, for the cluster center closest to the target point, and 0 for all other clusters. This "hard" assignment is seen in the bottom right panel.

The E-step of the EM algorithm assigns "responsibilities" for each data point based on its relative density under each mixture component, while the M-step recomputes the component density parameters based on the current responsibilities. Suppose we specify K mixture components, each with a Gaussian density having scalar covariance matrix σ²I. Then the relative density under each mixture component is a monotone function of the Euclidean distance between the data point and the mixture center. Hence in this setup EM is a "soft" version of K-means clustering, making probabilistic (rather than deterministic) assignments of points to cluster centers. As the variance σ² → 0, these probabilities become 0 and 1, and the two methods coincide. Details are given in Exercise 14.2. Figure 14.7 illustrates this result for two clusters on the real line.

14.3.8 Example: Human Tumor Microarray Data

We apply K-means clustering to the human tumor microarray data described in Chapter 1. This is an example of high-dimensional clustering.
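The responsibilities for a two-component Gaussian mixture on the line (with equal mixing proportions) can be computed directly; this is our own sketch of the E-step quantity discussed above, with the centers and query point chosen for illustration:

```python
# Sketch: E-step responsibilities for two equal-weight Gaussian
# components on the real line. As sigma shrinks, the assignment of a
# point hardens toward 0/1, recovering the K-means rule.

from math import exp

def responsibilities(x, mu0, mu1, sigma):
    """Return (r0, r1): relative densities g_l(x) / (g_0(x) + g_1(x))."""
    g0 = exp(-(x - mu0) ** 2 / (2 * sigma ** 2))
    g1 = exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
    return g0 / (g0 + g1), g1 / (g0 + g1)
```

For a point between the two centers, a large σ yields a genuinely soft split, while a small σ pushes the responsibilities to nearly 0 and 1, i.e. the hard K-means assignment.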

FIGURE 14.8. Total within-cluster sum of squares for K-means clustering applied to the human tumor microarray data.

TABLE 14.2. Human tumor data: number of cancer cases of each type, in each of the three clusters from K-means clustering. The cancer types are Breast, CNS, Colon, K562, Leukemia, MCF7, Melanoma, NSCLC, Ovarian, Prostate, Renal and Unknown.

The data are a 6830 × 64 matrix of real numbers, each representing an expression measurement for a gene (row) and sample (column). Here we cluster the samples, each of which is a vector of length 6830, corresponding to expression values for the 6830 genes. Each sample has a label such as breast (for breast cancer), melanoma, and so on; we don't use these labels in the clustering, but will examine posthoc which labels fall into which clusters.

We applied K-means clustering with K running from 1 to 10, and computed the total within-sum of squares for each clustering, shown in Figure 14.8. Typically one looks for a "kink" in the sum of squares curve (or its logarithm) to locate the optimal number of clusters (see Section 14.3.11). Here there is no clear indication: for illustration we chose K = 3, giving the three clusters shown in Table 14.2.

FIGURE 14.9. Sir Ronald A. Fisher (1890–1962) was one of the founders of modern day statistics, to whom we owe maximum-likelihood, sufficiency, and many other fundamental concepts. The image on the left is a 1024 × 1024 grayscale image at 8 bits per pixel. The center image is the result of 2 × 2 block VQ, using 200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel.

We see that the procedure is successful at grouping together samples of the same cancer. In fact, the two breast cancers in the second cluster were later found to be misdiagnosed and were melanomas that had metastasized. However, K-means clustering has shortcomings in this application. For one, it does not give a linear ordering of objects within a cluster: we have simply listed them in alphabetic order above. Secondly, as the number of clusters K is changed, the cluster memberships can change in arbitrary ways. That is, with say four clusters, the clusters need not be nested within the three clusters above. For these reasons, hierarchical clustering (described later), is probably preferable for this application.

14.3.9 Vector Quantization

The K-means clustering algorithm represents a key tool in the apparently unrelated area of image and signal compression, particularly in vector quantization or VQ (Gersho and Gray, 1992). The left image in Figure 14.9 is a digitized photograph of a famous statistician, Sir Ronald Fisher. It consists of 1024 × 1024 pixels, where each pixel is a grayscale value ranging from 0 to 255, and hence requires 8 bits of storage per pixel. The entire image occupies 1 megabyte of storage. The center image is a VQ-compressed version of the left panel, and requires 0.239 of the storage (at some loss in quality). The right image is compressed even more, and requires only 0.0625 of the storage (at a considerable loss in quality).

The version of VQ implemented here first breaks the image into small blocks, in this case 2 × 2 blocks of pixels. Each of the 512 × 512 blocks of four

(This example was prepared by Maya Gupta.)

numbers is regarded as a vector in \mathbb{R}^4. A K-means clustering algorithm (also known as Lloyd's algorithm in this context) is run in this space. The center image uses K = 200, while the right image K = 4. Each of the 512 × 512 pixel blocks (or points) is approximated by its closest cluster centroid, known as a codeword. The clustering process is called the encoding step, and the collection of centroids is called the codebook.

To represent the approximated image, we need to supply, for each block, the identity of the codebook entry that approximates it. This will require log₂(K) bits per block. We also need to supply the codebook itself, which is K × 4 real numbers (typically negligible). Overall, the storage for the compressed image amounts to log₂(K)/(4 × 8) of the original (0.239 for K = 200, 0.0625 for K = 4). This is typically expressed as a rate in bits per pixel: log₂(K)/4, which are 1.9 and 0.50, respectively. The process of constructing the approximate image from the centroids is called the decoding step.

Why do we expect VQ to work at all? The reason is that for typical everyday images like photographs, many of the blocks look the same. In this case there are many almost pure white blocks, and similarly pure gray blocks of various shades. These require only one block each to represent them, and then multiple pointers to that block.

What we have described is known as lossy compression, since our images are degraded versions of the original. The degradation or distortion is usually measured in terms of mean squared error. In this case D = 0.89 for K = 200 and D = 16.95 for K = 4. More generally a rate/distortion curve would be used to assess the tradeoff. One can also perform lossless compression using block clustering, and still capitalize on the repeated patterns. If you took the original image and losslessly compressed it, the best you would do is 4.48 bits per pixel.

We claimed above that log₂(K) bits were needed to identify each of the K codewords in the codebook. This uses a fixed-length code, and is inefficient if some codewords occur many more times than others in the image.
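The storage accounting above is easy to reproduce (our own sketch; the entropy function anticipates the variable-length-code rate discussed next):

```python
# Sketch of the VQ storage arithmetic for 2x2 blocks: a fixed-length
# code spends log2(K) bits per block, i.e. log2(K)/4 bits per pixel,
# versus 8 bits per pixel for the raw grayscale image.

from math import log2

def fixed_rate_bits_per_pixel(K, block_pixels=4):
    """Rate of a fixed-length code over a K-word codebook."""
    return log2(K) / block_pixels

def storage_fraction(K, block_pixels=4, orig_bits=8):
    """Compressed size as a fraction of the original image."""
    return fixed_rate_bits_per_pixel(K, block_pixels) / orig_bits

def entropy_rate_bits_per_pixel(p, block_pixels=4):
    """Shannon bound for a variable-length code over codeword frequencies p."""
    return -sum(pl * log2(pl) for pl in p if pl > 0) / block_pixels
```

With K = 200 the fixed rate is about 1.91 bits/pixel (0.239 of the original storage) and with K = 4 it is exactly 0.50 bits/pixel (0.0625), matching the figures in the text; when the codeword frequencies are far from uniform, the entropy rate is strictly smaller.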
Using Shannon coding theory, we know that in general a variable length code will do better, and the rate then becomes −\sum_{l=1}^K p_l \log_2(p_l)/4. The term in the numerator is the entropy of the distribution p_l of the codewords in the image. Using variable length coding, our rates come down to 1.42 and 0.39, respectively. Finally, there are many generalizations of VQ that have been developed: for example, tree-structured VQ finds the centroids with a top-down, 2-means style algorithm, as alluded to in Section 14.3.12. This allows successive refinement of the compression. Further details may be found in Gersho and Gray (1992).

14.3.10 K-medoids

As discussed above, the K-means algorithm is appropriate when the dissimilarity measure is taken to be squared Euclidean distance D(x_i, x_{i'})

(14.26). This requires all of the variables to be of the quantitative type. In addition, using squared Euclidean distance places the highest influence on the largest distances. This causes the procedure to lack robustness against outliers that produce very large distances. These restrictions can be removed at the expense of computation.

The only part of the K-means algorithm that assumes squared Euclidean distance is the minimization step (14.32); the cluster representatives {m_1, ..., m_K} in (14.33) are taken to be the means of the currently assigned clusters. The algorithm can be generalized for use with arbitrarily defined dissimilarities D(x_i, x_{i'}) by replacing this step by an explicit optimization with respect to {m_1, ..., m_K} in (14.33). In the most common form, centers for each cluster are restricted to be one of the observations assigned to the cluster, as summarized in Algorithm 14.2.

Algorithm 14.2 K-medoids Clustering.

1. For a given cluster assignment C, find the observation in the cluster minimizing total distance to other points in that cluster:

i_k^* = \mathrm{argmin}_{\{i : C(i) = k\}} \sum_{C(i')=k} D(x_i, x_{i'}). (14.35)

Then m_k = x_{i_k^*}, k = 1, 2, ..., K are the current estimates of the cluster centers.

2. Given a current set of cluster centers {m_1, ..., m_K}, minimize the total error by assigning each observation to the closest (current) cluster center:

C(i) = \mathrm{argmin}_{1 \le k \le K} D(x_i, m_k). (14.36)

3. Iterate steps 1 and 2 until the assignments do not change.

This algorithm assumes attribute data, but the approach can also be applied to data described only by proximity matrices (Section 14.3.1). There is no need to explicitly compute cluster centers; rather we just keep track of the indices i_k^*.

Solving (14.32) for each provisional cluster k requires an amount of computation proportional to the number of observations assigned to it, whereas for solving (14.35) the computation increases to O(N_k²). Given a set of cluster "centers," {i_1, ..., i_K}, obtaining the new assignments

C(i) = \mathrm{argmin}_{1 \le k \le K} d_{i i_k^*} (14.37)

requires computation proportional to K · N as before.
Thus, K-medoids is far more computationally intensive than K-means. Alternating between (14.35) and (14.37) represents a particular heuristic search strategy for trying to solve

TABLE 14.3. Data from a political science survey: values are average pairwise dissimilarities of countries (BEL, BRA, CHI, CUB, EGY, FRA, IND, ISR, USA, USS, YUG, ZAI) from a questionnaire given to political science students.

\min_{C, \{i_k\}_1^K} \sum_{k=1}^K \sum_{C(i)=k} d_{i i_k}. (14.38)

Kaufman and Rousseeuw (1990) propose an alternative strategy for directly solving (14.38) that provisionally exchanges each center i_k with an observation that is not currently a center, selecting the exchange that produces the greatest reduction in the value of the criterion (14.38). This is repeated until no advantageous exchanges can be found. Massart et al. (1983) derive a branch-and-bound combinatorial method that finds the global minimum of (14.38) that is practical only for very small data sets.

Example: Country Dissimilarities

This example, taken from Kaufman and Rousseeuw (1990), comes from a study in which political science students were asked to provide pairwise dissimilarity measures for twelve countries: Belgium, Brazil, Chile, Cuba, Egypt, France, India, Israel, United States, Union of Soviet Socialist Republics, Yugoslavia and Zaire. The average dissimilarity scores are given in Table 14.3. We applied 3-medoid clustering to these dissimilarities. Note that K-means clustering could not be applied because we have only distances rather than raw observations. The left panel of Figure 14.10 shows the dissimilarities reordered and blocked according to the 3-medoid clustering. The right panel is a two-dimensional multidimensional scaling plot, with the 3-medoid clusters assignments indicated by colors (multidimensional scaling is discussed in Section 14.8). Both plots show three well-separated clusters, but the MDS display indicates that Egypt falls about halfway between two clusters.
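Algorithm 14.2 can be sketched working purely from a dissimilarity matrix, exactly the situation of the country example where no raw coordinates exist. This is our own minimal illustration, with medoids initialized deterministically to the first K indices; a careful run would use several restarts:

```python
# Sketch of Algorithm 14.2 (K-medoids) driven only by a symmetric
# dissimilarity matrix d (d[i][j] = dissimilarity of objects i and j).

def kmedoids(d, K, iters=100):
    N = len(d)
    medoids = list(range(K))                  # initial medoid indices
    assign = None
    for _ in range(iters):
        # (14.37): assign each object to its closest current medoid
        new = [min(range(K), key=lambda k: d[i][medoids[k]])
               for i in range(N)]
        if new == assign:                     # stop when assignments settle
            break
        assign = new
        # (14.35): new medoid = member minimizing total within-cluster distance
        for k in range(K):
            members = [i for i in range(N) if assign[i] == k]
            medoids[k] = min(members,
                             key=lambda i: sum(d[i][j] for j in members))
    return assign, medoids
```

Because the "centers" are always observation indices, the procedure never needs coordinates, which is why it applies to proximity-matrix data like Table 14.3.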

FIGURE 14.10. Survey of country dissimilarities. (Left panel:) dissimilarities reordered and blocked according to 3-medoid clustering. Heat map is coded from most similar (dark red) to least similar (bright red). (Right panel:) two-dimensional multidimensional scaling plot, with 3-medoid clusters indicated by different colors.

14.3.11 Practical Issues

In order to apply K-means or K-medoids one must select the number of clusters K and an initialization. The latter can be defined by specifying an initial set of centers {m_1, ..., m_K} or {i_1, ..., i_K}, or an initial encoder C(i). Usually specifying the centers is more convenient. Suggestions range from simple random selection to a deliberate strategy based on forward stepwise assignment. At each step a new center i_k is chosen to minimize the criterion (14.33) or (14.38), given the centers i_1, ..., i_{k−1} chosen at the previous steps. This continues for K steps, thereby producing K initial centers with which to begin the optimization algorithm.

A choice for the number of clusters K depends on the goal. For data segmentation K is usually defined as part of the problem. For example, a company may employ K sales people, and the goal is to partition a customer database into K segments, one for each sales person, such that the customers assigned to each one are as similar as possible. Often, however, cluster analysis is used to provide a descriptive statistic for ascertaining the extent to which the observations comprising the data base fall into natural distinct groupings. Here the number of such groups K is unknown and one requires that it, as well as the groupings themselves, be estimated from the data.

Data-based methods for estimating K typically examine the within-cluster dissimilarity W_K as a function of the number of clusters K. Separate solutions are obtained for K ∈ {1, 2, ..., K_max}. The corresponding values

{W_1, W_2, ..., W_{K_max}} generally decrease with increasing K. This will be the case even when the criterion is evaluated on an independent test set, since a large number of cluster centers will tend to fill the feature space densely and thus will be close to all data points. Thus cross-validation techniques, so useful for model selection in supervised learning, cannot be utilized in this context.

The intuition underlying the approach is that if there are actually K* distinct groupings of the observations (as defined by the dissimilarity measure), then for K < K* the clusters returned by the algorithm will each contain a subset of the true underlying groups. That is, the solution will not assign observations in the same naturally occurring group to different estimated clusters. To the extent that this is the case, the solution criterion value will tend to decrease substantially with each successive increase in the number of specified clusters, W_{K+1} ≪ W_K, as the natural groups are successively assigned to separate clusters. For K > K*, one of the estimated clusters must partition at least one of the natural groups into two subgroups. This will tend to provide a smaller decrease in the criterion as K is further increased. Splitting a natural group, within which the observations are all quite close to each other, reduces the criterion less than partitioning the union of two well-separated groups into their proper constituents.

To the extent this scenario is realized, there will be a sharp decrease in successive differences in criterion value, W_K − W_{K+1}, at K = K*. That is, {W_K − W_{K+1} | K < K*} ≫ {W_K − W_{K+1} | K ≥ K*}. An estimate K̂* for K* is then obtained by identifying a "kink" in the plot of W_K as a function of K. As with other aspects of clustering procedures, this approach is somewhat heuristic.

The recently proposed Gap statistic (Tibshirani et al., 2001b) compares the curve log W_K to the curve obtained from data uniformly distributed over a rectangle containing the data.
It estimates the optimal number of clusters to be the place where the gap between the two curves is largest. Essentially this is an automatic way of locating the aforementioned "kink." It also works reasonably well when the data fall into a single cluster, and in that case will tend to estimate the optimal number of clusters to be one. This is the scenario where most other competing methods fail.

Figure 14.11 shows the result of the Gap statistic applied to the simulated data of Figure 14.4. The left panel shows log W_K for K = 1, 2, ..., 8 clusters (green curve) and the expected value of log W_K over 20 simulations from uniform data (blue curve). The right panel shows the gap curve, which is the expected curve minus the observed curve. Shown also are error bars of half-width s'_K = s_K \sqrt{1 + 1/20}, where s_K is the standard deviation of log W_K over the 20 simulations. The Gap curve is maximized at K = 2 clusters. If G(K) is the Gap curve at K clusters, the formal rule for estimating K* is

K* = \mathrm{argmin}_K \{ K \mid G(K) \ge G(K+1) - s'_{K+1} \}. (14.39)
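A rough one-dimensional sketch of the Gap comparison follows. This is our own simplification, not the published procedure: it uses a crude quantile-initialized K-means for W_K and omits the error bars, but it shows the core idea of comparing log W_K on the data against its average over uniform reference draws.

```python
# Sketch of the Gap idea for 1-D data: gap(K) = E_ref[log W_K] - log W_K,
# where the reference draws are uniform on the data's range.

import random
from math import log

def within_ss(X, K, iters=50):
    """Within-cluster sum of squares from a simple quantile-seeded K-means."""
    X = sorted(X)
    centers = [X[int((k + 0.5) * len(X) / K)] for k in range(K)]
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in X:
            clusters[min(range(K), key=lambda k: (x - centers[k]) ** 2)].append(x)
        centers = [sum(c) / len(c) if c else m for c, m in zip(clusters, centers)]
    return sum((x - centers[min(range(K), key=lambda k: (x - centers[k]) ** 2)]) ** 2
               for x in X)

def gap(X, Kmax=5, B=20, seed=0):
    """Gap curve G(1..Kmax) using B uniform reference data sets."""
    rng = random.Random(seed)
    lo, hi = min(X), max(X)
    gaps = []
    for K in range(1, Kmax + 1):
        ref = [log(within_ss([rng.uniform(lo, hi) for _ in X], K))
               for _ in range(B)]
        gaps.append(sum(ref) / B - log(within_ss(X, K)))
    return gaps
```

On clearly grouped data the gap jumps at the true number of clusters, because W_K collapses there for the data but not for the uniform reference; the published rule (14.39) additionally uses the simulation standard deviations s'_K to pick the smallest adequate K.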

FIGURE 14.11. (Left panel): observed (green) and expected (blue) values of log W_K for the simulated data of Figure 14.4. Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W_K. The Gap estimate K* is the smallest K producing a gap within one standard deviation of the gap at K + 1; here K* = 2.

This gives K* = 2, which looks reasonable from Figure 14.4.

14.3.12 Hierarchical Clustering

The results of applying K-means or K-medoids clustering algorithms depend on the choice for the number of clusters to be searched and a starting configuration assignment. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups. As the name suggests, they produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single observation. At the highest level there is only one cluster containing all of the data.

Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up) and divisive (top-down). Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster. The pair chosen for merging consist of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters at

that level into two new clusters. The split is chosen to produce two new groups with the largest between-group dissimilarity. With both paradigms there are N − 1 levels in the hierarchy.

Each level of the hierarchy represents a particular grouping of the data into disjoint clusters of observations. The entire hierarchy represents an ordered sequence of such groupings. It is up to the user to decide which level (if any) actually represents a "natural" clustering in the sense that observations within each of its groups are sufficiently more similar to each other than to observations assigned to different groups at that level. The Gap statistic described earlier can be used for this purpose.

Recursive binary splitting/agglomeration can be represented by a rooted binary tree. The nodes of the trees represent groups. The root node represents the entire data set. The N terminal nodes each represent one of the individual observations (singleton clusters). Each nonterminal node ("parent") has two daughter nodes. For divisive clustering the two daughters represent the two groups resulting from the split of the parent; for agglomerative clustering the daughters represent the two groups that were merged to form the parent.

All agglomerative and some divisive methods (when viewed bottom-up) possess a monotonicity property. That is, the dissimilarity between merged clusters is monotone increasing with the level of the merger. Thus the binary tree can be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its two daughters. The terminal nodes representing individual observations are all plotted at zero height. This type of graphical display is called a dendrogram.

A dendrogram provides a highly interpretable complete description of the hierarchical clustering in a graphical format. This is one of the main reasons for the popularity of hierarchical clustering methods. For the microarray data, Figure 14.12
shows the dendrogram resulting from agglomerative clustering with average linkage; agglomerative clustering and this example are discussed in more detail later in this chapter. Cutting the dendrogram horizontally at a particular height partitions the data into disjoint clusters represented by the vertical lines that intersect it. These are the clusters that would be produced by terminating the procedure when the optimal intergroup dissimilarity exceeds that threshold cut value. Groups that merge at high values, relative to the merger values of the subgroups contained within them lower in the tree, are candidates for natural clusters. Note that this may occur at several different levels, indicating a clustering hierarchy: that is, clusters nested within clusters.

Such a dendrogram is often viewed as a graphical summary of the data itself, rather than a description of the results of the algorithm. However, such interpretations should be treated with caution. First, different hierarchical methods (see below), as well as small changes in the data, can lead to quite different dendrograms. Also, such a summary will be valid only to the extent that the pairwise observation dissimilarities possess the hierarchical structure produced by the algorithm.

FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data.

Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.

The extent to which the hierarchical structure produced by a dendrogram actually represents the data itself can be judged by the cophenetic correlation coefficient. This is the correlation between the N(N − 1)/2 pairwise observation dissimilarities d_{ii'} input to the algorithm and their corresponding cophenetic dissimilarities C_{ii'} derived from the dendrogram. The cophenetic dissimilarity C_{ii'} between two observations (i, i') is the intergroup dissimilarity at which observations i and i' are first joined together in the same cluster.

The cophenetic dissimilarity is a very restrictive dissimilarity measure. First, the C_{ii'} over the observations must contain many ties, since only N − 1 of the total N(N − 1)/2 values can be distinct. Also these dissimilarities obey the ultrametric inequality

C_{ii'} \le \max\{C_{ik}, C_{i'k}\} (14.40)

for any three observations (i, i', k). As a geometric example, suppose the data were represented as points in a Euclidean coordinate system. In order for the set of interpoint distances over the data to conform to (14.40), the triangles formed by all triples of points must be isosceles triangles with the unequal length no longer than the length of the two equal sides (Jain and Dubes, 1988). Therefore it is unrealistic to expect general dissimilarities over arbitrary data sets to closely resemble their corresponding cophenetic dissimilarities as calculated from a dendrogram, especially if there are not many tied values. Thus the dendrogram should be viewed mainly as a description of the clustering structure of the data as imposed by the particular algorithm employed.

Agglomerative Clustering

Agglomerative clustering algorithms begin with every observation representing a singleton cluster. At each of the N-1 steps the closest two (least dissimilar) clusters are merged into a single cluster, producing one less cluster at the next higher level. Therefore, a measure of dissimilarity between two clusters (groups of observations) must be defined.

Let G and H represent two such groups. The dissimilarity d(G, H) between G and H is computed from the set of pairwise observation dissimilarities d_ii', where one member of the pair i is in G and the other i' is in H. Single linkage (SL) agglomerative clustering takes the intergroup dissimilarity to be that of the closest (least dissimilar) pair

d_SL(G, H) = min_{i∈G, i'∈H} d_ii'.    (14.41)

This is also often called the nearest-neighbor technique. Complete linkage (CL) agglomerative clustering (furthest-neighbor technique) takes the intergroup dissimilarity to be that of the furthest (most dissimilar) pair

d_CL(G, H) = max_{i∈G, i'∈H} d_ii'.    (14.42)

Group average (GA) clustering uses the average dissimilarity between the groups

d_GA(G, H) = (1/(N_G N_H)) Σ_{i∈G} Σ_{i'∈H} d_ii',    (14.43)

where N_G and N_H are the respective numbers of observations in each group.
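As a concrete check on these definitions, the three intergroup dissimilarities (14.41)-(14.43) can be computed directly from pairwise Euclidean distances. The following sketch uses made-up points and a helper function of our own naming:

```python
import numpy as np

def intergroup_dissimilarities(X, G, H):
    """Single-linkage, complete-linkage, and group-average dissimilarities
    between two index sets G and H, per (14.41)-(14.43)."""
    # All pairwise Euclidean distances d_ii' with i in G, i' in H.
    D = np.linalg.norm(X[G][:, None, :] - X[H][None, :, :], axis=-1)
    return D.min(), D.max(), D.mean()

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
d_sl, d_cl, d_ga = intergroup_dissimilarities(X, [0, 1], [2, 3])
print(d_sl, d_cl, d_ga)  # 3.0 6.0 4.5
```

Note that the group average is the mean over all N_G * N_H cross-pairs, matching the 1/(N_G N_H) factor in (14.43).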
Although there have been many other proposals for defining intergroup dissimilarity in the context of agglomerative clustering, the above three are the ones most commonly used. Figure 14.13 shows examples of all three.

FIGURE 14.13. Dendrograms from agglomerative hierarchical clustering of human tumor microarray data: average linkage, complete linkage, and single linkage.

If the data dissimilarities {d_ii'} exhibit a strong clustering tendency, with each of the clusters being compact and well separated from others, then all three methods produce similar results. Clusters are compact if all of the observations within them are relatively close together (small dissimilarities) as compared with observations in different clusters. To the extent this is not the case, results will differ.

Single linkage (14.41) only requires that a single dissimilarity d_ii', i ∈ G and i' ∈ H, be small for two groups G and H to be considered close together, irrespective of the other observation dissimilarities between the groups. It will therefore have a tendency to combine, at relatively low thresholds, observations linked by a series of close intermediate observations. This phenomenon, referred to as chaining, is often considered a defect of the method. The clusters produced by single linkage can violate the compactness property that all observations within each cluster tend to be similar to one another, based on the supplied observation dissimilarities {d_ii'}. If we define the diameter D_G of a group of observations as the largest dissimilarity among its members,

D_G = max_{i∈G, i'∈G} d_ii',    (14.44)

then single linkage can produce clusters with very large diameters.

Complete linkage (14.42) represents the opposite extreme. Two groups G and H are considered close only if all of the observations in their union are relatively similar. It will tend to produce compact clusters with small diameters (14.44). However, it can produce clusters that violate the closeness property. That is, observations assigned to a cluster can be much

closer to members of other clusters than they are to some members of their own cluster.

Group average clustering (14.43) represents a compromise between the two extremes of single and complete linkage. It attempts to produce relatively compact clusters that are relatively far apart. However, its results depend on the numerical scale on which the observation dissimilarities d_ii' are measured. Applying a monotone strictly increasing transformation h(·) to the d_ii', h_ii' = h(d_ii'), can change the result produced by (14.43). In contrast, (14.41) and (14.42) depend only on the ordering of the d_ii' and are thus invariant to such monotone transformations. This invariance is often used as an argument in favor of single or complete linkage over group average methods.

One can argue that group average clustering has a statistical consistency property violated by single and complete linkage. Assume we have attribute-value data X^T = (X_1, ..., X_p) and that each cluster k is a random sample from some population joint density p_k(x). The complete data set is a random sample from a mixture of K such densities. The group average dissimilarity d_GA(G, H) (14.43) is an estimate of

∫∫ d(x, x') p_G(x) p_H(x') dx dx',    (14.45)

where d(x, x') is the dissimilarity between points x and x' in the space of attribute values. As the sample size N approaches infinity, d_GA(G, H) (14.43) approaches (14.45), which is a characteristic of the relationship between the two densities p_G(x) and p_H(x). For single linkage, d_SL(G, H) (14.41) approaches zero as N → ∞, independent of p_G(x) and p_H(x). For complete linkage, d_CL(G, H) (14.42) becomes infinite as N → ∞, again independent of the two densities. Thus, it is not clear what aspects of the population distribution are being estimated by d_SL(G, H) and d_CL(G, H).

Example: Human Cancer Microarray Data (Continued)

The left panel of Figure 14.13 shows the dendrogram resulting from average linkage agglomerative clustering of the samples (columns) of the microarray data. The middle and right panels show the result using complete and single linkage.
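The cophenetic correlation coefficient described earlier in this section can be computed with standard tools; the following is a minimal sketch using scipy on simulated two-cluster data (not the microarray data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in five dimensions.
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])

d = pdist(X)                      # the N(N-1)/2 pairwise dissimilarities d_ii'
Z = linkage(d, method="average")  # agglomerative clustering, average linkage
c, C = cophenet(Z, d)             # c = cophenetic correlation, C = the C_ii'

print(round(c, 3))                # close to 1 when the tree fits the data well
```

For data with genuinely hierarchical structure c is near 1; for data without it, c can be much lower, reflecting the restrictiveness of the cophenetic dissimilarities.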
Average and complete linkage gave similar results, while single linkage produced unbalanced groups with long thin clusters. We focus on the average linkage clustering.

Like K-means clustering, hierarchical clustering is successful at clustering simple cancers together. However, it has other nice features. By cutting off the dendrogram at various heights, different numbers of clusters emerge, and the sets of clusters are nested within one another. Secondly, it gives some partial ordering information about the samples. In Figure 14.14, we have arranged the genes (rows) and samples (columns) of the expression matrix in orderings derived from hierarchical clustering.

Note that if we flip the orientation of the branches of a dendrogram at any merge, the resulting dendrogram is still consistent with the series of hierarchical clustering operations. Hence to determine an ordering of the leaves, we must add a constraint. To produce the row ordering of Figure 14.14, we have used the default rule in S-PLUS: at each merge, the subtree with the tighter cluster is placed to the left (toward the bottom in the rotated dendrogram in the figure). Individual genes are the tightest clusters possible, and merges involving two individual genes place them in order by their observation number. The same rule was used for the columns. Many other rules are possible, for example, ordering by a multidimensional scaling of the genes; see Section 14.8.

The two-way rearrangement of Figure 14.14 produces an informative picture of the genes and samples. This picture is more informative than the randomly ordered rows and columns of Figure 1.3 of Chapter 1. Furthermore, the dendrograms themselves are useful, as biologists can, for example, interpret the gene clusters in terms of biological processes.

Divisive Clustering

Divisive clustering algorithms begin with the entire data set as a single cluster, and recursively divide one of the existing clusters into two daughter clusters at each iteration in a top-down fashion. This approach has not been studied nearly as extensively as agglomerative methods in the clustering literature. It has been explored somewhat in the engineering literature (Gersho and Gray, 1992) in the context of compression. In the clustering setting, a potential advantage of divisive over agglomerative methods can occur when interest is focused on partitioning the data into a relatively small number of clusters.

The divisive paradigm can be employed by recursively applying any of the combinatorial methods such as K-means (Section 14.3.6) or K-medoids (Section 14.3.10), with K = 2, to perform the splits at each iteration. However, such an approach would depend on the starting configuration specified at each step.
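The recursive 2-means paradigm just described can be sketched as follows; this is an illustrative implementation with our own helper names, not the algorithm used in the book's examples:

```python
import numpy as np

def two_means(X, seed=0, n_iter=20):
    """Split one cluster into two with K = 2 means (Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    return labels

def divisive(X, idx=None, depth=2):
    """Recursively split clusters top-down; returns a list of index arrays."""
    idx = np.arange(len(X)) if idx is None else idx
    if depth == 0 or len(idx) < 2:
        return [idx]
    labels = two_means(X[idx])
    return (divisive(X, idx[labels == 0], depth - 1)
            + divisive(X, idx[labels == 1], depth - 1))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, (20, 2)) for m in (0, 3, 6, 9)])
clusters = divisive(X, depth=2)
print([len(c) for c in clusters])
```

As the text notes, each split depends on the 2-means starting configuration, and the resulting sequence of splits need not be monotone in the dendrogram sense.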
In addition, it would not necessarily produce a splitting sequence that possesses the monotonicity property required for dendrogram representation.

A divisive algorithm that avoids these problems was proposed by Macnaughton-Smith et al. (1965). It begins by placing all observations in a single cluster G. It then chooses that observation whose average dissimilarity from all the other observations is largest. This observation forms the first member of a second cluster H. At each successive step, that observation in G whose average distance from those in H, minus that for the remaining observations in G, is largest, is transferred to H. This continues until the corresponding difference in averages becomes negative. That is, there are no longer any observations in G that are, on average, closer to those in H. The result is a split of the original cluster into two daughter clusters,

FIGURE 14.14. DNA microarray data: average linkage hierarchical clustering has been applied independently to the rows (genes) and columns (samples), determining the ordering of the rows and columns (see text). The colors range from bright green (negative, under-expressed) to bright red (positive, over-expressed).

the observations transferred to H, and those remaining in G. These two clusters represent the second level of the hierarchy. Each successive level is produced by applying this splitting procedure to one of the clusters at the previous level. Kaufman and Rousseeuw (1990) suggest choosing the cluster at each level with the largest diameter (14.44) for splitting. An alternative would be to choose the one with the largest average dissimilarity among its members,

d̄_G = (1/N_G²) Σ_{i∈G} Σ_{i'∈G} d_ii'.

The recursive splitting continues until all clusters either become singletons or all members of each one have zero dissimilarity from one another.

14.4 Self-Organizing Maps

This method can be viewed as a constrained version of K-means clustering, in which the prototypes are encouraged to lie in a one- or two-dimensional manifold in the feature space. The resulting manifold is also referred to as a constrained topological map, since the original high-dimensional observations can be mapped down onto the two-dimensional coordinate system. The original SOM algorithm was online (observations are processed one at a time); a batch version was proposed later. The technique also bears a close relationship to principal curves and surfaces, which are discussed in the next section.

We consider a SOM with a two-dimensional rectangular grid of K prototypes m_j ∈ R^p (other choices, such as hexagonal grids, can also be used). Each of the K prototypes is parametrized with respect to an integer coordinate pair ℓ_j ∈ Q_1 × Q_2. Here Q_1 = {1, 2, ..., q_1}, similarly Q_2, and K = q_1 · q_2. The m_j are initialized, for example, to lie in the two-dimensional principal component plane of the data (next section). We can think of the prototypes as buttons, sewn onto the principal component plane in a regular pattern. The SOM procedure tries to bend the plane so that the buttons approximate the data points as well as possible. Once the model is fit, the observations can be mapped down onto the two-dimensional grid.

The observations x_i are processed one at a time.
We find the closest prototype m_j to x_i in Euclidean distance in R^p, and then for all neighbors m_k of m_j, move m_k toward x_i via the update

m_k ← m_k + α(x_i − m_k).    (14.46)

The neighbors of m_j are defined to be all m_k such that the distance between ℓ_j and ℓ_k is small. The simplest approach uses Euclidean distance, and "small" is determined by a threshold r. This neighborhood always includes the closest prototype m_j itself.
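A minimal online SOM following update (14.46), with linearly shrinking learning rate and neighborhood radius, might look like this. The random initialization and the schedules are simplifying assumptions of ours (the text instead initializes from the principal component plane):

```python
import numpy as np

def som_fit(X, q1=5, q2=5, n_passes=40, alpha0=1.0, r0=2.0, seed=0):
    """Online SOM sketch: K = q1*q2 prototypes on an integer grid,
    each observation pulls its winner's grid neighborhood toward it."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Integer grid coordinates l_j for the prototypes.
    grid = np.array([(a, b) for a in range(q1) for b in range(q2)], float)
    M = rng.normal(scale=0.1, size=(q1 * q2, p))  # random init (assumption)
    T = n_passes * len(X)
    t = 0
    for _ in range(n_passes):
        for x in rng.permutation(X):
            frac = 1 - t / T                       # linear decay schedules
            alpha, r = alpha0 * frac, max(r0 * frac, 1e-9)
            j = np.argmin(((M - x) ** 2).sum(1))   # closest prototype m_j
            nbrs = np.linalg.norm(grid - grid[j], axis=1) <= r
            M[nbrs] += alpha * (x - M[nbrs])       # update (14.46)
            t += 1
    return M, grid

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
M, grid = som_fit(X)
```

As the radius r shrinks toward zero, each neighborhood contains only the winner itself, and the procedure reduces to an online version of K-means, as discussed below.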

Notice that distance is defined in the space Q_1 × Q_2 of integer topological coordinates of the prototypes, rather than in the feature space R^p. The effect of the update (14.46) is to move the prototypes closer to the data, but also to maintain a smooth two-dimensional spatial relationship between the prototypes.

The performance of the SOM algorithm depends on the learning rate α and the distance threshold r. Typically α is decreased from, say, 1.0 to 0.0 over a few thousand iterations (one per observation). Similarly r is decreased linearly from a starting value R to 1 over a few thousand iterations. We illustrate a method for choosing R in the example below.

We have described the simplest version of the SOM. More sophisticated versions modify the update step according to distance:

m_k ← m_k + α h(‖ℓ_j − ℓ_k‖)(x_i − m_k),    (14.47)

where the neighborhood function h gives more weight to prototypes m_k with indices ℓ_k closer to ℓ_j than to those further away.

If we take the distance r small enough so that each neighborhood contains only one point, then the spatial connection between prototypes is lost. In that case one can show that the SOM algorithm is an online version of K-means clustering, and eventually stabilizes at one of the local minima found by K-means. Since the SOM is a constrained version of K-means clustering, it is important to check whether the constraint is reasonable in any given problem. One can do this by computing the reconstruction error ‖x − m_j‖², summed over observations, for both methods. This will necessarily be smaller for K-means, but should not be much smaller if the SOM is a reasonable approximation.

As an illustrative example, we generated 90 data points in three dimensions, near the surface of a half sphere of radius 1. The points were in each of three clusters (red, green, and blue) located near (0, 1, 0), (0, 0, 1) and (1, 0, 0). The data are shown in Figure 14.15. By design, the red cluster was much tighter than the green or blue ones. (Full details of the data generation are given in Exercise 14.5.)
A 5×5 grid of prototypes was used, with initial grid size R = 2; this meant that about a third of the prototypes were initially in each neighborhood. We did a total of 40 passes through the dataset of 90 observations, and let r and α decrease linearly over the 3600 iterations.

In Figure 14.16 the prototypes are indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle. The left panel shows the initial configuration, while the right panel shows the final one. The algorithm has succeeded in separating the clusters; however, the separation of the red cluster indicates that the manifold has folded back on itself (see Figure 14.17). Since the distances in the two-dimensional display are not used, there is little indication in the SOM projection that the red cluster is tighter than the others.

FIGURE 14.15. Simulated data in three classes, near the surface of a half sphere.

FIGURE 14.16. Self-organizing map applied to half-sphere data example. Left panel is the initial configuration, right panel the final one. The 5×5 grid of prototypes is indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle.

FIGURE 14.17. Wiremesh representation of the fitted SOM model in R³. The lines represent the horizontal and vertical edges of the topological lattice. The double lines indicate that the surface was folded diagonally back on itself in order to model the red points. The cluster members have been jittered to indicate their color, and the purple points are the node centers.

Figure 14.18 shows the reconstruction error, equal to the total sum of squares of each data point around its prototype. For comparison we carried out a K-means clustering with 25 centroids, and indicate its reconstruction error by the horizontal line on the graph. We see that the SOM significantly decreases the error, nearly to the level of the K-means solution. This provides evidence that the two-dimensional constraint used by the SOM is reasonable for this particular dataset.

In the batch version of the SOM, we update each m_j via

m_j = Σ_k w_k x_k / Σ_k w_k.    (14.48)

The sum is over points x_k that mapped (i.e., were closest to) neighbors m_k of m_j. The weight function may be rectangular, that is, equal to 1 for the neighbors of m_k, or may decrease smoothly with distance ‖ℓ_k − ℓ_j‖ as before. If the neighborhood size is chosen small enough so that it consists only of m_k, with rectangular weights, this reduces to the K-means clustering procedure described earlier. It can also be thought of as a discrete version of principal curves and surfaces, described in Section 14.5.

FIGURE 14.18. Half-sphere data: reconstruction error for the SOM as a function of iteration. Error for K-means clustering is indicated by the horizontal line.

Example: Document Organization and Retrieval

Document retrieval has gained importance with the rapid development of the Internet and the Web, and SOMs have proved to be useful for organizing and indexing large corpora. This example is taken from the WEBSOM homepage (Kohonen et al., 2000). Figure 14.19 represents a SOM fit to 12,088 newsgroup comp.ai.neural-nets articles. The labels are generated automatically by the WEBSOM software and provide a guide as to the typical content of a node.

In applications such as this, the documents have to be preprocessed in order to create a feature vector. A term-document matrix is created, where each row represents a single document. The entries in each row are the relative frequency of each of a predefined set of terms. These terms could be a large set of dictionary entries (50,000 words), or an even larger set of bigrams (word pairs), or subsets of these. These matrices are typically very sparse, and so often some preprocessing is done to reduce the number of features (columns). Sometimes the SVD (next section) is used to reduce the matrix; Kohonen et al. (2000) use a randomized variant thereof. These reduced vectors are then the input to the SOM.
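The construction of such a term-document matrix of relative frequencies can be sketched directly; the three toy documents below are our own, standing in for a real corpus:

```python
import numpy as np
from collections import Counter

docs = ["neural nets learn weights",
        "nets classify documents",
        "documents index retrieval"]

# Term-document matrix: rows = documents, entries = relative term frequencies.
vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}
T = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    counts = Counter(d.split())
    n = sum(counts.values())
    for w, c in counts.items():
        T[i, col[w]] = c / n   # relative frequency of each term in document i

print(T.shape)
```

With a realistic vocabulary of tens of thousands of terms, T is very sparse, which is why a dimension-reduction step such as the (randomized) SVD is applied before the SOM.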

FIGURE 14.19. Heatmap representation of the SOM model fit to a corpus of 12,088 newsgroup comp.ai.neural-nets contributions (courtesy WEBSOM homepage). The lighter areas indicate higher-density areas. Populated nodes are automatically labeled according to typical content.

FIGURE 14.20. The first linear principal component of a set of data. The line minimizes the total squared distance from each point to its orthogonal projection onto the line.

In this application the authors have developed a zoom feature, which allows one to interact with the map in order to get more detail. The final level of zooming retrieves the actual news articles, which can then be read.

14.5 Principal Components, Curves and Surfaces

Principal components are discussed in Section 3.4.1, where they shed light on the shrinkage mechanism of ridge regression. Principal components are a sequence of projections of the data, mutually uncorrelated and ordered in variance. In the next section we present principal components as linear manifolds approximating a set of N points x_i ∈ R^p. We then present some nonlinear generalizations in Section 14.5.2. Other recent proposals for nonlinear approximating manifolds are discussed in Section 14.7.

14.5.1 Principal Components

The principal components of a set of data in R^p provide a sequence of best linear approximations to that data, of all ranks q ≤ p. Denote the observations by x_1, x_2, ..., x_N, and consider the rank-q linear model for representing them

f(λ) = μ + V_q λ,    (14.49)

where μ is a location vector in R^p, V_q is a p × q matrix with q orthogonal unit vectors as columns, and λ is a q-vector of parameters. This is the parametric representation of an affine hyperplane of rank q. Figures 14.20 and 14.21 illustrate for q = 1 and q = 2, respectively. Fitting such a model to the data by least squares amounts to minimizing the reconstruction error

min_{μ, {λ_i}, V_q} Σ_{i=1}^N ‖x_i − μ − V_q λ_i‖².    (14.50)

We can partially optimize for μ and the λ_i (Exercise 14.7) to obtain

μ̂ = x̄,    (14.51)
λ̂_i = V_q^T (x_i − x̄).    (14.52)

This leaves us to find the orthogonal matrix V_q:

min_{V_q} Σ_{i=1}^N ‖(x_i − x̄) − V_q V_q^T (x_i − x̄)‖².    (14.53)

For convenience we assume that x̄ = 0 (otherwise we simply replace the observations by their centered versions x̃_i = x_i − x̄). The p × p matrix H_q = V_q V_q^T is a projection matrix, and maps each point x_i onto its rank-q reconstruction H_q x_i, the orthogonal projection of x_i onto the subspace spanned by the columns of V_q. The solution can be expressed as follows. Stack the (centered) observations into the rows of an N × p matrix X. We construct the singular value decomposition of X:

X = U D V^T.    (14.54)

This is a standard decomposition in numerical analysis, and many algorithms exist for its computation (Golub and Van Loan, 1983, for example). Here U is an N × p orthogonal matrix (U^T U = I_p) whose columns u_j are called the left singular vectors; V is a p × p orthogonal matrix (V^T V = I_p) with columns v_j called the right singular vectors, and D is a p × p diagonal matrix, with diagonal elements d_1 ≥ d_2 ≥ ··· ≥ d_p ≥ 0 known as the singular values. For each rank q, the solution V_q to (14.53) consists of the first q columns of V. The columns of UD are called the principal components of X (see Section 3.5.1). The N optimal λ̂_i in (14.52) are given by the first q principal components (the N rows of the N × q matrix U_q D_q).

The one-dimensional principal component line in R² is illustrated in Figure 14.20. For each data point x_i, there is a closest point on the line, given by u_{i1} d_1 v_1.
Here v_1 is the direction of the line and λ̂_i = u_{i1} d_1 measures distance along the line from the origin. Similarly Figure 14.21 shows the

two-dimensional principal component surface fit to the half-sphere data (left panel). The right panel shows the projection of the data onto the first two principal components. This projection was the basis for the initial configuration for the SOM method shown earlier. The procedure is quite successful at separating the clusters. Since the half-sphere is nonlinear, a nonlinear projection will do a better job, and this is the topic of the next section.

FIGURE 14.21. The best rank-two linear approximation to the half-sphere data. The right panel shows the projected points with coordinates given by U_2 D_2, the first two principal components of the data.

Principal components have many other nice properties; for example, the linear combination Xv_1 has the highest variance among all linear combinations of the features; Xv_2 has the highest variance among all linear combinations satisfying v_2 orthogonal to v_1, and so on.

Example: Handwritten Digits

Principal components are a useful tool for dimension reduction and compression. We illustrate this feature on the handwritten digits data described in Chapter 1. Figure 14.22 shows a sample of 130 handwritten 3s, each a digitized 16×16 grayscale image, from a total of 658 such 3s. We see considerable variation in writing styles, character thickness and orientation. We consider these images as points x_i in R^256, and compute their principal components via the SVD (14.54).

Figure 14.23 shows the first two principal components of these data. For each of these first two principal components u_{i1} and u_{i2}, we computed the 5%, 25%, 50%, 75% and 95% quantile points, and used them to define the rectangular grid superimposed on the plot. The circled points indicate

FIGURE 14.22. A sample of 130 handwritten 3s shows a variety of writing styles.

those images close to the vertices of the grid, where the distance measure focuses mainly on these projected coordinates, but gives some weight to the components in the orthogonal subspace. The right plot shows the images corresponding to these circled points. This allows us to visualize the nature of the first two principal components. We see that v_1 (horizontal movement) mainly accounts for the lengthening of the lower tail of the three, while v_2 (vertical movement) accounts for character thickness. In terms of the parametrized model (14.49), this two-component model has the form

f̂(λ) = x̄ + λ_1 v_1 + λ_2 v_2.    (14.55)

Here we have displayed the first two principal component directions, v_1 and v_2, as images. Although there are a possible 256 principal components, approximately 50 account for 90% of the variation in the threes, and 12 account for 63%.

Figure 14.24 compares the singular values to those obtained for equivalent uncorrelated data, obtained by randomly scrambling each column of X. The pixels in a digitized image are inherently correlated, and since these are all the same digit the correlations are even stronger.

FIGURE 14.23. (Left panel:) the first two principal components of the handwritten threes. The circled points are the closest projected images to the vertices of a grid, defined by the marginal quantiles of the principal components. (Right panel:) The images corresponding to the circled points. These show the nature of the first two principal components.

FIGURE 14.24. The 256 singular values for the digitized threes, compared to those for a randomized version of the data (each column of X was scrambled).

A relatively small subset of the principal components serves as excellent lower-dimensional features for representing the high-dimensional data.

Example: Procrustes Transformations and Shape Averaging

FIGURE 14.25. (Left panel:) Two different digitized handwritten Ss, each represented by 96 corresponding points in R². The green S has been deliberately rotated and translated for visual effect. (Right panel:) A Procrustes transformation applies a translation and rotation to best match up the two sets of points.

Figure 14.25 represents two sets of points, the orange and green, in the same plot. In this instance these points represent two digitized versions of a handwritten S, extracted from the signature of a subject Suresh. Figure 14.26 shows the entire signatures from which these were extracted (third and fourth panels). The signatures are recorded dynamically using touch-screen devices, familiar sights in modern supermarkets. There are N = 96 points representing each S, which we denote by the N × 2 matrices X_1 and X_2. There is a correspondence between the points: the ith rows of X_1 and X_2 are meant to represent the same positions along the two S's. In the language of morphometrics, these points represent landmarks on the two objects. How one finds such corresponding landmarks is in general difficult and subject specific. In this particular case we used dynamic time warping of the speed signal along each signature (Hastie et al., 1992), but will not go into details here.

In the right panel we have applied a translation and rotation to the green points so as best to match the orange: a so-called Procrustes transformation (Mardia et al., 1979, for example). Consider the problem

min_{μ,R} ‖X_2 − (X_1 R + 1 μ^T)‖_F,    (14.56)

Procrustes was an African bandit in Greek mythology, who stretched or squashed his visitors to fit his iron bed (eventually killing them).

with X_1 and X_2 both N × p matrices of corresponding points, R an orthonormal p × p matrix⁴, and μ a p-vector of location coordinates. Here ‖X‖²_F = trace(X^T X) is the squared Frobenius matrix norm.

Let x̄_1 and x̄_2 be the column mean vectors of the matrices, and X̃_1 and X̃_2 be the versions of these matrices with the means removed. Consider the SVD X̃_1^T X̃_2 = U D V^T. Then the solution to (14.56) is given by (Exercise 14.8)

R̂ = U V^T,
μ̂ = x̄_2 − R̂^T x̄_1,    (14.57)

and the minimal distance is referred to as the Procrustes distance. From the form of the solution, we can center each matrix at its column centroid, and then ignore location completely. Hereafter we assume this is the case.

The Procrustes distance with scaling solves a slightly more general problem,

min_{β,R} ‖X_2 − β X_1 R‖_F,    (14.58)

where β > 0 is a positive scalar. The solution for R is as before, with β̂ = trace(D)/‖X̃_1‖²_F.

Related to Procrustes distance is the Procrustes average of a collection of L shapes, which solves the problem

min_{{R_l}, M} Σ_{l=1}^L ‖X_l R_l − M‖²_F;    (14.59)

that is, find the shape M closest in average squared Procrustes distance to all the shapes. This is solved by a simple alternating algorithm:

0. Initialize M = X_1 (for example).
1. Solve the L Procrustes rotation problems with M fixed, yielding X_l ← X_l R̂_l.
2. Let M ← (1/L) Σ_{l=1}^L X_l.

Steps 1 and 2 are repeated until the criterion (14.59) converges.

Figure 14.26 shows a simple example with three shapes. Note that we can only expect a solution up to a rotation; alternatively, we can impose a constraint, such as that M be upper-triangular, to force uniqueness. One can easily incorporate scaling in the definition (14.59); see Exercise 14.9.

Most generally we can define the affine-invariant average of a set of shapes via

⁴To simplify matters, we consider only orthogonal matrices, which include reflections as well as rotations [the O(p) group]; although reflections are unlikely here, these methods can be restricted further to allow only rotations [the SO(p) group].
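The solution (14.57) is easy to verify numerically; a sketch with synthetic landmark points (the translation is written as μ̂ = x̄_2 − R̂^T x̄_1, which is the form consistent with fitted values X_1 R + 1 μ^T):

```python
import numpy as np

def procrustes(X1, X2):
    """Procrustes match of X1 to X2: center both, take the SVD of
    X1c.T @ X2c, and return rotation R and translation mu (14.56)-(14.57)."""
    m1, m2 = X1.mean(0), X2.mean(0)
    U, D, Vt = np.linalg.svd((X1 - m1).T @ (X2 - m2))
    R = U @ Vt                 # orthogonal matrix minimizing (14.56)
    mu = m2 - R.T @ m1         # translation so fitted means match
    return R, mu

rng = np.random.default_rng(0)
X1 = rng.normal(size=(96, 2))                      # synthetic landmarks
theta = 0.7
R_true = np.array([[np.cos(theta), np.sin(theta)],
                   [-np.sin(theta), np.cos(theta)]])
X2 = X1 @ R_true + np.array([3.0, -1.0])           # rotated, shifted copy
R, mu = procrustes(X1, X2)
print(np.allclose(X1 @ R + mu, X2))                # True
```

Since the data here are an exact rotation plus shift, the recovered (R, μ) matches X_2 perfectly; with noisy landmarks the residual is the Procrustes distance.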

FIGURE 14.26. The Procrustes average of three versions of the leading S in Suresh's signatures. The left panel shows the preshape average, with each of the shapes X_l in preshape space superimposed. The right three panels map the preshape M separately to match each of the original S's.

min_{{A_l}, M} Σ_{l=1}^L ‖X_l A_l − M‖²_F,    (14.60)

where the A_l are any p × p nonsingular matrices. Here we require a standardization, such as M^T M = I, to avoid a trivial solution. The solution is attractive, and can be computed without iteration (Exercise 14.10):

1. Let H_l = X_l (X_l^T X_l)^{-1} X_l^T be the rank-p projection matrix defined by X_l.
2. M is the N × p matrix formed from the p largest eigenvectors of H̄ = (1/L) Σ_{l=1}^L H_l.

14.5.2 Principal Curves and Surfaces

Principal curves generalize the principal component line, providing a smooth one-dimensional curved approximation to a set of data points in R^p. A principal surface is more general, providing a curved manifold approximation of dimension 2 or more.

We will first define principal curves for random variables X ∈ R^p, and then move to the finite data case. Let f(λ) be a parameterized smooth curve in R^p. Hence f(λ) is a vector function with p coordinates, each a smooth function of the single parameter λ. The parameter λ can be chosen, for example, to be arc-length along the curve from some fixed origin. For each data value x, let λ_f(x) define the closest point on the curve to x. Then f(λ) is called a principal curve for the distribution of the random vector X if

f(λ) = E(X | λ_f(X) = λ).    (14.61)

This says f(λ) is the average of all data points that project to it, that is, the points for which it is "responsible." This is also known as a self-consistency property. Although in practice, continuous multivariate distributions have infinitely many principal curves (Duchamp and Stuetzle, 1996), we are

FIGURE 14.27. The principal curve of a set of data. Each point on the curve is the average of all data points that project there.

interested mainly in the smooth ones. A principal curve is illustrated in Figure 14.27.

Principal points are an interesting related concept. Consider a set of k prototypes and, for each point x in the support of a distribution, identify the closest prototype, that is, the prototype that is responsible for it. This induces a partition of the feature space into so-called Voronoi regions. The set of k points that minimize the expected distance from X to its prototype are called the principal points of the distribution. Each principal point is self-consistent, in that it equals the mean of X in its Voronoi region. For example, with k = 1, the principal point of a circular normal distribution is the mean vector; with k = 2 they are a pair of points symmetrically placed on a ray through the mean vector. Principal points are the distributional analogs of centroids found by K-means clustering. Principal curves can be viewed as k = ∞ principal points, but constrained to lie on a smooth curve, in a similar way that a SOM constrains K-means cluster centers to fall on a smooth manifold.

To find a principal curve f(λ) of a distribution, we consider its coordinate functions f(λ) = [f_1(λ), f_2(λ), ..., f_p(λ)] and let X^T = (X_1, X_2, ..., X_p). Consider the following alternating steps:

(a) f̂_j(λ) ← E(X_j | λ(X) = λ), j = 1, 2, ..., p;
(b) λ̂_f(x) ← argmin_{λ'} ‖x − f̂(λ')‖².    (14.62)

The first equation fixes λ and enforces the self-consistency requirement (14.61). The second equation fixes the curve and finds the closest point on

the curve to each data point. With finite data, the principal curve algorithm starts with the linear principal component, and iterates the two steps in (14.62) until convergence. A scatterplot smoother is used to estimate the conditional expectations in step (a) by smoothing each X_j as a function of the arc-length λ̂(X), and the projection in (b) is done for each of the observed data points. Proving convergence in general is difficult, but one can show that if a linear least squares fit is used for the scatterplot smoothing, then the procedure converges to the first linear principal component, and is equivalent to the power method for finding the largest eigenvector of a matrix.

FIGURE 14.28. Principal surface fit to half-sphere data. (Left panel:) fitted two-dimensional surface. (Right panel:) projections of data points onto the surface, resulting in coordinates λ̂_1, λ̂_2.

Principal surfaces have exactly the same form as principal curves, but are of higher dimension. The most commonly used is the two-dimensional principal surface, with coordinate functions

f(λ_1, λ_2) = [f_1(λ_1, λ_2), ..., f_p(λ_1, λ_2)].

The estimates in step (a) above are obtained from two-dimensional surface smoothers. Principal surfaces of dimension greater than two are rarely used, since the visualization aspect is less attractive, as is smoothing in high dimensions.

Figure 14.28 shows the result of a principal surface fit to the half-sphere data. Plotted are the data points as a function of the estimated nonlinear coordinates λ̂_1(x_i), λ̂_2(x_i). The class separation is evident.

Principal surfaces are very similar to self-organizing maps. If we use a kernel surface smoother to estimate each coordinate function f_j(λ_1, λ_2), this has the same form as the batch version of SOMs (14.48). The SOM weights w_k are just the weights in the kernel. There is a difference, however:

the principal surface estimates a separate prototype f(λ_1(x_i), λ_2(x_i)) for each data point x_i, while the SOM shares a smaller number of prototypes for all data points. As a result, the SOM and principal surface will agree only as the number of SOM prototypes grows very large.

There also is a conceptual difference between the two. Principal surfaces provide a smooth parameterization of the entire manifold in terms of its coordinate functions, while SOMs are discrete and produce only the estimated prototypes for approximating the data. The smooth parameterization in principal surfaces preserves distance locally: in Figure 14.28 it reveals that the red cluster is tighter than the green or blue clusters. In simple examples the estimated coordinate functions themselves can be informative; see the exercises.

14.5.3 Spectral Clustering

Traditional clustering methods like K-means use a spherical or elliptical metric to group data points. Hence they will not work well when the clusters are non-convex, such as the concentric circles in the top left panel of Figure 14.29. Spectral clustering is a generalization of standard clustering methods, designed for these situations. It has close connections with the local multidimensional-scaling techniques (Section 14.9) that generalize MDS.

The starting point is an N × N matrix of pairwise similarities s_{ii'} ≥ 0 between all observation pairs. We represent the observations in an undirected similarity graph G = ⟨V, E⟩. The N vertices v_i represent the observations, and pairs of vertices are connected by an edge if their similarity is positive (or exceeds some threshold). The edges are weighted by the s_{ii'}. Clustering is now rephrased as a graph-partition problem, where we identify connected components with clusters. We wish to partition the graph such that edges between different groups have low weight, and edges within a group have high weight. The idea in spectral clustering is to construct similarity graphs that represent the local neighborhood relationships between observations.
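Circling back to principal curves for a moment: the alternating steps (a) and (b) in (14.62) can be sketched directly. This is a minimal illustration, not the authors' implementation; the crude running-mean smoother, the span parameter, and the function names are our own simplifications, and step (b) is approximated by snapping each point to its nearest fitted point.

```python
import numpy as np

def running_mean(y, order, span=9):
    # Crude scatterplot smoother: running mean of y over the points,
    # ordered by their current projection index lambda.
    n = len(y)
    h = span // 2
    ys = y[order]
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - h), min(n, i + h + 1)
        out[order[i]] = ys[lo:hi].mean()
    return out

def principal_curve(X, n_iter=10, span=9):
    # Start from the first linear principal component, as the text suggests.
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]                        # projection index lambda(x_i)
    F = np.empty_like(Xc)
    for _ in range(n_iter):
        order = np.argsort(lam)
        # Step (a): smooth each coordinate X_j against lambda.
        for j in range(Xc.shape[1]):
            F[:, j] = running_mean(Xc[:, j], order, span)
        # Step (b): re-project each point to its nearest fitted point.
        d2 = ((Xc[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        lam = lam[d2.argmin(axis=1)]
    return F + mu, lam
```

On data lying exactly on a straight line, the fitted curve reproduces the data in the interior (the running mean of collinear, evenly spaced points is exact), which matches the claim that a linear smoother recovers the first principal component.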
To make things more concrete, consider a set of N points x_i ∈ IR^p, and let d_{ii'} be the Euclidean distance between x_i and x_{i'}. We will use as similarity matrix the radial-kernel gram matrix; that is, s_{ii'} = exp(−d²_{ii'}/c), where c > 0 is a scale parameter.

There are many ways to define a similarity matrix and its associated similarity graph that reflect local behavior. The most popular is the mutual K-nearest-neighbor graph. Define N_K to be the symmetric set of nearby pairs of points; specifically, a pair (i, i') is in N_K if point i is among the K nearest neighbors of i', or vice-versa. Then we connect all symmetric nearest neighbors, and give them edge weight w_{ii'} = s_{ii'}; otherwise the edge weight is zero. Equivalently, we set to zero all the pairwise similarities not in N_K, and draw the graph for this modified similarity matrix.

Alternatively, a fully connected graph includes all pairwise edges with weights w_{ii'} = s_{ii'}, and the local behavior is controlled by the scale parameter c.

The matrix of edge weights W = {w_{ii'}} from a similarity graph is called the adjacency matrix. The degree of vertex i is g_i = Σ_{i'} w_{ii'}, the sum of the weights of the edges connected to it. Let G be a diagonal matrix with diagonal elements g_i. Finally, the graph Laplacian is defined by

L = G − W.     (14.63)

This is called the unnormalized graph Laplacian; a number of normalized versions have been proposed, which standardize the Laplacian with respect to the node degrees g_i, for example, \tilde{L} = I − G^{−1}W.

Spectral clustering finds the m eigenvectors Z_{N×m} corresponding to the m smallest eigenvalues of L (ignoring the trivial constant eigenvector). Using a standard method like K-means, we then cluster the rows of Z to yield a clustering of the original data points.

An example is presented in Figure 14.29. The top left panel shows 450 simulated data points in three circular clusters indicated by the colors. K-means clustering would clearly have difficulty identifying the outer clusters. We applied spectral clustering using a 10-nearest-neighbor similarity graph, and display the eigenvectors corresponding to the second and third smallest eigenvalues of the graph Laplacian in the lower left. The 15 smallest eigenvalues are shown in the top right panel. The two eigenvectors shown have identified the three clusters, and a scatterplot of the rows of the eigenvector matrix Z in the bottom right clearly separates the clusters. A procedure such as K-means clustering applied to these transformed points would easily identify the three groups.

Why does spectral clustering work? For any vector f we have

f^T L f = Σ_{i=1}^N g_i f_i² − Σ_{i=1}^N Σ_{i'=1}^N f_i f_{i'} w_{ii'}
        = (1/2) Σ_{i=1}^N Σ_{i'=1}^N w_{ii'} (f_i − f_{i'})².     (14.64)

Formula (14.64) suggests that a small value of f^T L f will be achieved if pairs of points with large adjacencies have coordinates f_i and f_{i'} close together.
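The construction just described (radial-kernel similarities, a mutual k-nearest-neighbor graph, the Laplacian (14.63), and its smallest eigenvectors) can be sketched as follows. This is a hedged illustration, with the function name and defaults our own; the returned rows of Z would then be clustered with any K-means routine.

```python
import numpy as np

def spectral_embedding(X, k=10, c=2.0, m=2):
    # Pairwise squared distances and radial-kernel similarities s_ii'.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-d2 / c)
    N = X.shape[0]
    # Mutual k-nearest-neighbor graph: keep s_ii' if i is among the k
    # nearest neighbors of i', or vice versa; other weights are zero.
    order = np.argsort(d2, axis=1)
    knn = np.zeros((N, N), dtype=bool)
    rows = np.repeat(np.arange(N), k)
    knn[rows, order[:, 1:k + 1].ravel()] = True
    W = np.where(knn | knn.T, S, 0.0)
    np.fill_diagonal(W, 0.0)
    G = np.diag(W.sum(axis=1))
    L = G - W                       # unnormalized graph Laplacian (14.63)
    _, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # Columns 1..m: the eigenvectors of the next-smallest eigenvalues,
    # skipping the (near-)constant one.
    return vecs[:, 1:m + 1]
```

For a graph with well-separated components, any vector in the zero-eigenvalue eigenspace is constant within each component, so the embedding coordinates are (nearly) piecewise constant across clusters.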
Since 1^T L 1 = 0 for any graph, the constant vector 1 is a trivial eigenvector with eigenvalue zero. Not so obvious is the fact that if the graph is connected (a graph is connected if any two nodes can be reached via a path of connected nodes), it is the only zero eigenvector (Exercise 14.21). Generalizing this argument, it is easy to show that for a graph with m connected components,
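As a quick numerical check of this fact, a graph built from two disjoint triangles should yield a Laplacian with exactly two zero eigenvalues, one per connected component:

```python
import numpy as np

# Two disjoint triangles: the adjacency matrix is block diagonal.
W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[a, b] = W[b, a] = 1.0
L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = G - W
vals = np.linalg.eigvalsh(L)            # ascending eigenvalues
n_zero = int((np.abs(vals) < 1e-10).sum())
```

Each triangle contributes eigenvalues {0, 3, 3}, so `n_zero` comes out as 2.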

FIGURE 14.29. Toy example illustrating spectral clustering. Data in top left are 450 points falling in three concentric clusters of 150 points each. The points are uniformly distributed in angle, with radius 1, 2.8 and 5 in the three groups, and Gaussian noise with standard deviation 0.25 added to each point. Using a k = 10 nearest-neighbor similarity graph, the eigenvectors corresponding to the second and third smallest eigenvalues of L are shown in the bottom left; the smallest eigenvector is constant. The data points are colored in the same way as in the top left. The 15 smallest eigenvalues are shown in the top right panel. The coordinates of the second and third eigenvectors (the 450 rows of Z) are plotted in the bottom right panel. Spectral clustering does standard (e.g., K-means) clustering of these points and will easily recover the three original clusters.

the nodes can be reordered so that L is block diagonal, with one block per connected component. Then L has m eigenvectors of eigenvalue zero, and the eigenspace of eigenvalue zero is spanned by the indicator vectors of the connected components. In practice one has strong and weak connections, so zero eigenvalues are approximated by small eigenvalues.

Spectral clustering is an interesting approach for finding non-convex clusters. When a normalized graph Laplacian is used, there is another way to view this method. Defining P = G^{−1}W, we consider a random walk on the graph with transition probability matrix P. Then spectral clustering yields groups of nodes such that the random walk seldom transitions from one group to another.

There are a number of issues that one must deal with in applying spectral clustering in practice. We must choose the type of similarity graph (e.g., fully connected or nearest neighbors), and associated parameters such as the number of nearest neighbors k or the scale parameter of the kernel c. We must also choose the number of eigenvectors to extract from L and, finally, as with all clustering methods, the number of clusters. In the toy example of Figure 14.29 we obtained good results for k ∈ [5, 200], the value 200 corresponding to a fully connected graph. With k < 5 the results deteriorated. Looking at the top-right panel of Figure 14.29, we see no strong separation between the smallest three eigenvalues and the rest. Hence it is not clear how many eigenvectors to select.

14.5.4 Kernel Principal Components

Spectral clustering is related to kernel principal components, a non-linear version of linear principal components. Standard linear principal components (PCA) are obtained from the eigenvectors of the covariance matrix, and give directions in which the data have maximal variance. Kernel PCA (Schölkopf et al., 1999) expands the scope of PCA, mimicking what we would obtain if we were to expand the features by non-linear transformations, and then apply PCA in this transformed feature space. We show in Section 18.5.1
that the principal components variables Z of a data matrix X can be computed from the inner-product (gram) matrix K = XX^T. In detail, we compute the eigen-decomposition of the double-centered version of the gram matrix

\tilde{K} = (I − M)K(I − M) = UD²U^T,     (14.65)

with M = 11^T/N, and then Z = UD. Exercise 18.15 shows how to compute the projections of new observations in this space.

Kernel PCA simply mimics this procedure, interpreting the kernel matrix K = {K(x_i, x_{i'})} as an inner-product matrix of the implicit features ⟨φ(x_i), φ(x_{i'})⟩ and finding its eigenvectors. The elements of the mth component z_m (mth column of Z) can be written (up to centering) as z_{im} = Σ_{j=1}^N α_{jm} K(x_i, x_j), where α_{jm} = u_{jm}/d_m (Exercise 14.16).
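A minimal sketch of this computation; `kernel_pca` and `radial_kernel` are our own illustrative names, and only the double-centering and eigen-decomposition of (14.65) are taken from the text.

```python
import numpy as np

def kernel_pca(X, kernel, m=2):
    # Eigen-decompose the double-centered kernel matrix, as in (14.65),
    # and return Z = UD for the top m components.
    K = kernel(X, X)
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N     # I - M, with M = 11^T / N
    Kt = J @ K @ J                          # double centering
    vals, vecs = np.linalg.eigh(Kt)
    idx = np.argsort(vals)[::-1][:m]        # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def radial_kernel(X, Y, c=2.0):
    # K(x, x') = exp(-||x - x'||^2 / c)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / c)
```

Passing the linear kernel `lambda A, B: A @ B.T` recovers ordinary principal component scores up to column signs, which makes a handy sanity check.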

We can gain more insight into kernel PCA by viewing the z_m as sample evaluations of principal component functions g_m ∈ H_K, with H_K the reproducing kernel Hilbert space generated by K (see Section 5.8.1). The first principal component function g_1 solves

max_{g_1 ∈ H_K} Var_T g_1(X) subject to ||g_1||_{H_K} = 1.     (14.66)

Here Var_T refers to the sample variance over the training data T. The norm constraint ||g_1||_{H_K} = 1 controls the size and roughness of the function g_1, as dictated by the kernel K. As in the regression case, it can be shown that the solution to (14.66) is finite dimensional, with representation g_1(x) = Σ_{j=1}^N c_j K(x, x_j). Exercise 14.17 shows that the solution is defined by \hat{c}_j = α_{j1}, j = 1, ..., N above. The second principal component function is defined in a similar way, with the additional constraint that ⟨g_1, g_2⟩_{H_K} = 0, and so on.

Schölkopf et al. (1999) demonstrate the use of kernel principal components as features for handwritten-digit classification, and show that they can improve the performance of a classifier when these are used instead of linear principal components. Note that if we use the radial kernel

K(x, x') = exp(−||x − x'||²/c),     (14.67)

then the kernel matrix K has the same form as the similarity matrix S in spectral clustering. The matrix of edge weights W is a localized version of K, setting to zero all similarities for pairs of points that are not nearest neighbors.

Kernel PCA finds the eigenvectors corresponding to the largest eigenvalues of K; this is equivalent to finding the eigenvectors corresponding to the smallest eigenvalues of

I − K.     (14.68)

This is almost the same as the Laplacian (14.63), the differences being the centering of K and the fact that G has the degrees of the nodes along the diagonal.

Figure 14.30 examines the performance of kernel principal components on the toy example of Figure 14.29. In the upper left panel we used the radial kernel with c = 2, the same value that was used in spectral clustering. This does not separate the groups, but with c = 10 (upper right panel), the first component separates the groups well.
In the lower-left panel we applied kernel PCA using the nearest-neighbor radial kernel W from spectral clustering. In the lower right panel we use the kernel matrix itself as the similarity matrix in spectral clustering. (This section benefited from helpful discussions with Jonathan Taylor.)

FIGURE 14.30. Kernel principal components applied to the toy example of Figure 14.29, using different kernels. (Top left:) Radial kernel (14.67) with c = 2. (Top right:) Radial kernel with c = 10. (Bottom left:) Nearest-neighbor radial kernel W from spectral clustering. (Bottom right:) Spectral clustering with Laplacian constructed from the radial kernel.

FIGURE 14.31. Standard and sparse principal components from a study of corpus callosum variation. The shape variations corresponding to significant principal components (red curves) are overlaid on the mean CC shape (black curves).

For multiple components, the sparse principal components procedure minimizes

Σ_{i=1}^N ||x_i − ΘV^T x_i||² + λ Σ_{k=1}^K ||v_k||² + Σ_{k=1}^K λ_{1k} ||v_k||_1,     (14.71)

subject to Θ^T Θ = I_K. Here V is a p × K matrix with columns v_k, and Θ is also p × K. Criterion (14.71) is not jointly convex in V and Θ, but it is convex in each parameter with the other parameter fixed. (Note that the usual principal component criterion, for example (14.50), is not jointly convex in the parameters either. Nevertheless, the solution is well defined and an efficient algorithm is available.) Minimization over V with Θ fixed is equivalent to K elastic net problems (Section 18.4) and can be done efficiently. On the other hand, minimization over Θ with V fixed is a version of the Procrustes problem (14.56), and is solved by a simple SVD calculation (see the exercises). These steps are alternated until convergence.

Figure 14.31 shows an example of sparse principal components analysis using (14.71), taken from Sjöstrand et al. (2007). Here the shape of the mid-sagittal cross-section of the corpus callosum (CC) is related to various clinical parameters in a study involving 569 elderly persons. (We thank Rasmus Larsen and Karl Sjöstrand for suggesting this application, and supplying us with the postscript figures reproduced here.)

FIGURE 14.32. An example of a mid-sagittal brain slice, with the corpus callosum annotated with landmarks.

In this example PCA is applied to shape data, and is a popular tool in morphometrics. For such applications, a number of landmarks are identified along the circumference of the shape; an example is given in Figure 14.32. These are aligned by Procrustes analysis to allow for rotations, and in this case scaling as well (see Section 14.5.1). The features used for PCA are the sequence of coordinate pairs for each landmark, unpacked into a single vector.

In this analysis, both standard and sparse principal components were computed, and components that were significantly associated with various clinical parameters were identified. In the figure, the shape variations corresponding to significant principal components (red curves) are overlaid on the mean CC shape (black curves). Low walking speed relates to CCs that are thinner (displaying atrophy) in regions connecting the motor control and cognitive centers of the brain. Low verbal fluency relates to CCs that are thinner in regions connecting auditory/visual/cognitive centers. The sparse principal components procedure gives a more parsimonious, and potentially more informative, picture of the important differences.
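The Θ-update mentioned above, a Procrustes problem solved by an SVD, is short enough to sketch. Here M stands for the relevant cross-product matrix (in the sparse-PCA iteration one would take M = X^T X V; that choice is our assumption, not spelled out in the text):

```python
import numpy as np

def procrustes_step(M):
    # argmax_Theta trace(Theta^T M) subject to Theta^T Theta = I,
    # solved by the thin SVD M = U D V^T, giving Theta = U V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt
```

The returned Θ has orthonormal columns, and trace(Θ^T M) equals the sum of the singular values of M, which no other matrix with orthonormal columns can exceed.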

14.6 Non-negative Matrix Factorization

Non-negative matrix factorization (Lee and Seung, 1999) is a recent alternative approach to principal components analysis, in which the data and components are assumed to be non-negative. It is useful for modeling non-negative data such as images. The N × p data matrix X is approximated by

X ≈ WH,     (14.72)

where W is N × r and H is r × p, r ≤ max(N, p). We assume that x_{ij}, w_{ik}, h_{kj} ≥ 0. The matrices W and H are found by maximizing

L(W, H) = Σ_{i=1}^N Σ_{j=1}^p [x_{ij} log(WH)_{ij} − (WH)_{ij}].     (14.73)

This is the log-likelihood from a model in which x_{ij} has a Poisson distribution with mean (WH)_{ij}, quite reasonable for positive data. The following alternating algorithm (Lee and Seung, 2001) converges to a local maximum of L(W, H):

w_{ik} ← w_{ik} [Σ_{j=1}^p h_{kj} x_{ij}/(WH)_{ij}] / [Σ_{j=1}^p h_{kj}],
h_{kj} ← h_{kj} [Σ_{i=1}^N w_{ik} x_{ij}/(WH)_{ij}] / [Σ_{i=1}^N w_{ik}].     (14.74)

This algorithm can be derived as a minorization procedure for maximizing L(W, H) and is also related to the iterative-proportional-scaling algorithm for log-linear models (see the exercises).

Figure 14.33 shows an example taken from Lee and Seung (1999) (we thank Sebastian Seung for providing this image), comparing non-negative matrix factorization (NMF), vector quantization (VQ, equivalent to k-means clustering) and principal components analysis (PCA). The three learning methods were applied to a database of N = 2,429 facial images, each consisting of 19 × 19 pixels, resulting in a 2,429 × 361 matrix X. As shown in the 7 × 7 array of montages (each a 19 × 19 image), each method has learned a set of r = 49 basis images. Positive values are illustrated with black pixels and negative values with red pixels. A particular instance of a face, shown at top right, is approximated by a linear superposition of basis images. The coefficients of the linear superposition are shown next to each montage, in a 7 × 7 array (these 7 × 7 arrangements allow for a compact display, and have no structural significance), and the resulting superpositions are shown to the right of the equality sign. The authors point

out that, unlike VQ and PCA, NMF learns to represent faces with a set of basis images resembling parts of faces.

Donoho and Stodden (2004) point out a potentially serious problem with non-negative matrix factorization. Even in situations where X = WH holds exactly, the decomposition may not be unique. Figure 14.34 illustrates the problem. The data points lie in p = 2 dimensions, and there is open space between the data and the coordinate axes. We can choose the basis vectors h_1 and h_2 anywhere in this open space, and represent each data point exactly with a non-negative linear combination of these vectors. This non-uniqueness means that the solution found by the above algorithm depends on the starting values, and it would seem to hamper the interpretability of the factorization. Despite this interpretational drawback, non-negative matrix factorization and its applications have attracted a lot of interest.

14.6.1 Archetypal Analysis

This method, due to Cutler and Breiman (1994), approximates data points by prototypes that are themselves linear combinations of data points. In this sense it has a similar flavor to K-means clustering. However, rather than approximating each data point by a single nearby prototype, archetypal analysis approximates each data point by a convex combination of a collection of prototypes. The use of a convex combination forces the prototypes to lie on the convex hull of the data cloud. In this sense, the prototypes are "pure," or "archetypal."

As in (14.72), the N × p data matrix X is modeled as

X ≈ WH,     (14.75)

where W is N × r and H is r × p. We assume that w_{ik} ≥ 0 and Σ_{k=1}^r w_{ik} = 1 for all i. Hence the N data points (rows of X) in p-dimensional space are represented by convex combinations of the r archetypes (rows of H). We also assume that

H = BX,     (14.76)

where B is r × N with b_{ki} ≥ 0 and Σ_{i=1}^N b_{ki} = 1 for all k. Thus the archetypes themselves are convex combinations of the data points. Using both (14.75) and (14.76), we minimize

J(W, B) = ||X − WH||² = ||X − WBX||²     (14.77)

over the weights W and B.
This function is minimized in an alternating fashion, with each separate minimization involving a convex optimization. The overall problem is not convex, however, and so the algorithm converges to a local minimum of the criterion.
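A rough sketch of this alternating scheme follows. As a simplification we replace the exact convex solves with projected gradient steps using conservative 1/L step sizes; the Euclidean projection onto the probability simplex and all names are our own, not the authors' algorithm.

```python
import numpy as np

def project_rows_to_simplex(V):
    # Euclidean projection of each row onto {w : w >= 0, sum(w) = 1}.
    n, r = V.shape
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    ind = np.arange(1, r + 1)
    rho = (U - css / ind > 0).sum(axis=1)
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)

def archetypes(X, r, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    N, p = X.shape
    W = project_rows_to_simplex(rng.random((N, r)))
    B = project_rows_to_simplex(rng.random((r, N)))
    for _ in range(n_iter):
        H = B @ X
        # Projected gradient step on J with respect to W (B fixed).
        step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2 + 1e-12)
        W = project_rows_to_simplex(W - step * 2.0 * (W @ H - X) @ H.T)
        # Projected gradient step on J with respect to B (W fixed).
        G = 2.0 * W.T @ (W @ (B @ X) - X) @ X.T
        step = 1.0 / (2.0 * (np.linalg.norm(W, 2) * np.linalg.norm(X, 2)) ** 2 + 1e-12)
        B = project_rows_to_simplex(B - step * G)
    return W, B @ X
```

With these step sizes each half-step cannot increase J, so the criterion decreases monotonically to a local minimum, mirroring the convergence statement above.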

FIGURE 14.33. Non-negative matrix factorization (NMF), vector quantization (VQ, equivalent to k-means clustering) and principal components analysis (PCA) applied to a database of facial images. Details are given in the text. Unlike VQ and PCA, NMF learns to represent faces with a set of basis images resembling parts of faces.
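Returning to the updates (14.74): they translate almost line-for-line into code. A minimal sketch, with a small `eps` guard (our addition) to avoid division by zero:

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0, eps=1e-9):
    # Multiplicative updates (14.74) for the Poisson log-likelihood (14.73).
    rng = np.random.default_rng(seed)
    N, p = X.shape
    W = rng.random((N, r)) + 0.1
    H = rng.random((r, p)) + 0.1
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= (X / WH) @ H.T / (H.sum(axis=1) + eps)     # update w_ik
        WH = W @ H + eps
        H *= W.T @ (X / WH) / (W.sum(axis=0)[:, None] + eps)  # update h_kj
    return W, H
```

The updates keep W and H non-negative by construction, and (per the Lee and Seung analysis) never decrease the log-likelihood L(W, H).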

FIGURE 14.34. Non-uniqueness of the non-negative matrix factorization. There are data points in two dimensions. Any choice of the basis vectors h_1 and h_2 in the open space between the coordinate axes and the data gives an exact reconstruction of the data.

Figure 14.35 shows an example with simulated data in two dimensions. The top panel displays the results of archetypal analysis, while the bottom panel shows the results from K-means clustering. In order to best reconstruct the data from convex combinations of the prototypes, it pays to locate the prototypes on the convex hull of the data. This is seen in the top panels of Figure 14.35 and is the case in general, as proven by Cutler and Breiman (1994). K-means clustering, shown in the bottom panels, chooses prototypes in the middle of the data cloud.

We can think of K-means clustering as a special case of the archetypal model, in which each row of W has a single one and the rest of the entries are zero.

Notice also that the archetypal model (14.75) has the same general form as the non-negative matrix factorization model (14.72). However, the two models are applied in different settings, and have somewhat different goals. Non-negative matrix factorization aims to approximate the columns of the data matrix X, and the main output of interest are the columns of W, representing the primary non-negative components in the data. Archetypal analysis focuses instead on the approximation of the rows of X using the rows of H, which represent the archetypal data points. Non-negative matrix factorization also assumes that r ≤ p. With r = p, we can get an exact reconstruction simply by choosing W to be the data X with columns scaled so that they sum to 1. In contrast, archetypal analysis requires r ≤ N, but allows r > p. In Figure 14.35, for example, p = 2, N = 50 while r = 2, 4 or 8. The additional constraint (14.76) implies that the archetypal approximation will not be perfect, even if r > p.

Figure 14.36 shows the results of archetypal analysis applied to the database of handwritten 3's displayed in Figure 14.22.
The three rows in Figure 14.36 are the resulting archetypes from three runs, specifying two, three and four

FIGURE 14.35. Archetypal analysis (top panels) and K-means clustering (bottom panels) applied to 50 data points drawn from a bivariate Gaussian distribution. The panels show solutions with two, four and eight prototypes; the colored points show the positions of the prototypes in each case.

archetypes, respectively. As expected, the algorithm has produced extreme 3's, both in size and shape.

14.7 Independent Component Analysis and Exploratory Projection Pursuit

Multivariate data are often viewed as m