An Introduction to Statistical Learning


Springer Texts in Statistics

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

An Introduction to Statistical Learning, with Applications in R

Springer Texts in Statistics
Series Editors: G. Casella, S. Fienberg, I. Olkin
For further volumes:



Gareth James, Department of Information and Operations Management, University of Southern California, Los Angeles, CA, USA
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA, USA
Daniela Witten, Department of Biostatistics, University of Washington, Seattle, WA, USA
Robert Tibshirani, Department of Statistics, Stanford University, Stanford, CA, USA

Springer New York Heidelberg Dordrecht London

© Springer Science+Business Media New York 2013 (corrected at 4th printing 2014)

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media

To our parents:

Alison and Michael James
Chiara Nappi and Edward Witten
Valerie and Patrick Hastie
Vera and Sami Tibshirani

and to our families:

Michael, Daniel, and Catherine
Ari
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl


Preface

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines. With the explosion of Big Data problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area, The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.

This book is appropriate for advanced undergraduates or master's students in statistics or related quantitative fields, or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It's tough to make predictions, especially about the future.
-Yogi Berra

Gareth James, Los Angeles, USA
Daniela Witten, Seattle, USA
Trevor Hastie, Palo Alto, USA
Robert Tibshirani, Palo Alto, USA

Contents

Preface
1 Introduction
2 Statistical Learning: What Is Statistical Learning?; Why Estimate f?; How Do We Estimate f?; The Trade-Off Between Prediction Accuracy and Model Interpretability; Supervised Versus Unsupervised Learning; Regression Versus Classification Problems; Assessing Model Accuracy; Measuring the Quality of Fit; The Bias-Variance Trade-Off; The Classification Setting; Lab: Introduction to R; Basic Commands; Graphics; Indexing Data; Loading Data; Additional Graphical and Numerical Summaries; Exercises

3 Linear Regression: Simple Linear Regression; Estimating the Coefficients; Assessing the Accuracy of the Coefficient Estimates; Assessing the Accuracy of the Model; Multiple Linear Regression; Estimating the Regression Coefficients; Some Important Questions; Other Considerations in the Regression Model; Qualitative Predictors; Extensions of the Linear Model; Potential Problems; The Marketing Plan; Comparison of Linear Regression with K-Nearest Neighbors; Lab: Linear Regression; Libraries; Simple Linear Regression; Multiple Linear Regression; Interaction Terms; Non-linear Transformations of the Predictors; Qualitative Predictors; Writing Functions; Exercises
4 Classification: An Overview of Classification; Why Not Linear Regression?; Logistic Regression; The Logistic Model; Estimating the Regression Coefficients; Making Predictions; Multiple Logistic Regression; Logistic Regression for >2 Response Classes; Linear Discriminant Analysis; Using Bayes' Theorem for Classification; Linear Discriminant Analysis for p = 1; Linear Discriminant Analysis for p > 1; Quadratic Discriminant Analysis; A Comparison of Classification Methods; Lab: Logistic Regression, LDA, QDA, and KNN; The Stock Market Data; Logistic Regression; Linear Discriminant Analysis; Quadratic Discriminant Analysis; K-Nearest Neighbors; An Application to Caravan Insurance Data; Exercises

5 Resampling Methods: Cross-Validation; The Validation Set Approach; Leave-One-Out Cross-Validation; k-Fold Cross-Validation; Bias-Variance Trade-Off for k-Fold Cross-Validation; Cross-Validation on Classification Problems; The Bootstrap; Lab: Cross-Validation and the Bootstrap; The Validation Set Approach; Leave-One-Out Cross-Validation; k-Fold Cross-Validation; The Bootstrap; Exercises
6 Linear Model Selection and Regularization: Subset Selection; Best Subset Selection; Stepwise Selection; Choosing the Optimal Model; Shrinkage Methods; Ridge Regression; The Lasso; Selecting the Tuning Parameter; Dimension Reduction Methods; Principal Components Regression; Partial Least Squares; Considerations in High Dimensions; High-Dimensional Data; What Goes Wrong in High Dimensions?; Regression in High Dimensions; Interpreting Results in High Dimensions; Lab 1: Subset Selection Methods; Best Subset Selection; Forward and Backward Stepwise Selection; Choosing Among Models Using the Validation Set Approach and Cross-Validation; Lab 2: Ridge Regression and the Lasso; Ridge Regression; The Lasso; Lab 3: PCR and PLS Regression; Principal Components Regression; Partial Least Squares; Exercises

7 Moving Beyond Linearity: Polynomial Regression; Step Functions; Basis Functions; Regression Splines; Piecewise Polynomials; Constraints and Splines; The Spline Basis Representation; Choosing the Number and Locations of the Knots; Comparison to Polynomial Regression; Smoothing Splines; An Overview of Smoothing Splines; Choosing the Smoothing Parameter λ; Local Regression; Generalized Additive Models; GAMs for Regression Problems; GAMs for Classification Problems; Lab: Non-linear Modeling; Polynomial Regression and Step Functions; Splines; GAMs; Exercises
8 Tree-Based Methods: The Basics of Decision Trees; Regression Trees; Classification Trees; Trees Versus Linear Models; Advantages and Disadvantages of Trees; Bagging, Random Forests, Boosting; Bagging; Random Forests; Boosting; Lab: Decision Trees; Fitting Classification Trees; Fitting Regression Trees; Bagging and Random Forests; Boosting; Exercises

9 Support Vector Machines: Maximal Margin Classifier; What Is a Hyperplane?; Classification Using a Separating Hyperplane; The Maximal Margin Classifier; Construction of the Maximal Margin Classifier; The Non-separable Case; Support Vector Classifiers; Overview of the Support Vector Classifier; Details of the Support Vector Classifier; Support Vector Machines; Classification with Non-linear Decision Boundaries; The Support Vector Machine; An Application to the Heart Disease Data; SVMs with More than Two Classes; One-Versus-One Classification; One-Versus-All Classification; Relationship to Logistic Regression; Lab: Support Vector Machines; Support Vector Classifier; Support Vector Machine; ROC Curves; SVM with Multiple Classes; Application to Gene Expression Data; Exercises
10 Unsupervised Learning: The Challenge of Unsupervised Learning; Principal Components Analysis; What Are Principal Components?; Another Interpretation of Principal Components; More on PCA; Other Uses for Principal Components; Clustering Methods; K-Means Clustering; Hierarchical Clustering; Practical Issues in Clustering; Lab 1: Principal Components Analysis; Lab 2: Clustering; K-Means Clustering; Hierarchical Clustering; Lab 3: NCI60 Data Example; PCA on the NCI60 Data; Clustering the Observations of the NCI60 Data; Exercises

Index

1 Introduction

An Overview of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data. To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets that are considered in this book.

Wage Data

In this application (which we refer to as the Wage data set throughout this book), we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand the association between an employee's age and education, as well as the calendar year, on his wage. Consider, for example, the left-hand panel of Figure 1.1, which displays wage versus age for each of the individuals in the data set. There is evidence that wage increases with age but then decreases again after approximately age 60. The blue line, which provides an estimate of the average wage for a given age, makes this trend clearer.

FIGURE 1.1. Wage data, which contains income survey information for males from the central Atlantic region of the United States. Left: wage as a function of age. On average, wage increases with age until about 60 years of age, at which point it begins to decline. Center: wage as a function of year. There is a slow but steady increase of approximately $10,000 in the average wage between 2003 and 2009. Right: Boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) and 5 the highest level (an advanced graduate degree). On average, wage increases with the level of education.

Given an employee's age, we can use this curve to predict his wage. However, it is also clear from Figure 1.1 that there is a significant amount of variability associated with this average value, and so age alone is unlikely to provide an accurate prediction of a particular man's wage.

We also have information regarding each employee's education level and the year in which the wage was earned. The center and right-hand panels of Figure 1.1, which display wage as a function of both year and education, indicate that both of these factors are associated with wage. Wages increase by approximately $10,000, in a roughly linear (or straight-line) fashion, between 2003 and 2009, though this rise is very slight relative to the variability in the data. Wages are also typically greater for individuals with higher education levels: men with the lowest education level (1) tend to have substantially lower wages than those with the highest education level (5). Clearly, the most accurate prediction of a given man's wage will be obtained by combining his age, his education, and the year. In Chapter 3, we discuss linear regression, which can be used to predict wage from this data set. Ideally, we should predict wage in a way that accounts for the non-linear relationship between wage and age. In Chapter 7, we discuss a class of approaches for addressing this problem.
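A plot in the spirit of the left-hand panel of Figure 1.1 can be produced with a few lines of R. This is a minimal sketch, assuming the ISLR package (which accompanies this book and ships the Wage data) is installed; it is not necessarily how the book's figure was generated:

library(ISLR)

# Scatterplot of wage against age for the men in the Wage data,
# with a smooth curve estimating the average wage at each age
plot(Wage$age, Wage$wage, col = "darkgrey", xlab = "Age", ylab = "Wage")
lines(lowess(Wage$age, Wage$wage), col = "blue", lwd = 2)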

Stock Market Data

The Wage data involves predicting a continuous or quantitative output value. This is often referred to as a regression problem. However, in certain cases we may instead wish to predict a non-numerical value, that is, a categorical or qualitative output. For example, in Chapter 4 we examine a stock market data set that contains the daily movements in the Standard & Poor's 500 (S&P) stock index over a 5-year period between 2001 and 2005. We refer to this as the Smarket data. The goal is to predict whether the index will increase or decrease on a given day, using the past 5 days' percentage changes in the index. Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day's stock market performance will fall into the Up bucket or the Down bucket. This is known as a classification problem. A model that could accurately predict the direction in which the market will move would be very useful!

FIGURE 1.2. Left: Boxplots of the previous day's percentage change in the S&P index for the days for which the market increased or decreased, obtained from the Smarket data. Center and Right: Same as left panel, but the percentage changes for 2 and 3 days previous are shown.

The left-hand panel of Figure 1.2 displays two boxplots of the previous day's percentage changes in the stock index: one for the 648 days for which the market increased on the subsequent day, and one for the 602 days for which the market decreased. The two plots look almost identical, suggesting that there is no simple strategy for using yesterday's movement in the S&P to predict today's returns. The remaining panels, which display boxplots for the percentage changes 2 and 3 days previous to today, similarly indicate little association between past and present returns. Of course, this lack of pattern is to be expected: in the presence of strong correlations between successive days' returns, one could adopt a simple trading strategy to generate profits from the market. Nevertheless, in Chapter 4, we explore these data using several different statistical learning methods. Interestingly, there are hints of some weak trends in the data that suggest that, at least for this 5-year period, it is possible to correctly predict the direction of movement in the market approximately 60% of the time (Figure 1.3).
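An analysis of this kind can be carried out along the lines of the Chapter 4 lab. The sketch below fits a quadratic discriminant analysis model to the earlier years of the Smarket data and evaluates it on 2005; the choice of the first two lags as predictors is an assumption for illustration, since the exact model behind Figure 1.3 is not specified here:

library(ISLR)   # Smarket data
library(MASS)   # qda()

train <- Smarket$Year < 2005                                    # fit on the earlier years
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

# Predict the 2005 directions and compute the proportion predicted correctly
qda.pred <- predict(qda.fit, Smarket[!train, ])
mean(qda.pred$class == Smarket$Direction[!train])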

FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset of the Smarket data corresponding to the 2001-2004 time period, and predicted the probability of a stock market decrease using the 2005 data. On average, the predicted probability of decrease is higher for the days in which the market does decrease. Based on these results, we are able to correctly predict the direction of movement in the market 60% of the time.

Gene Expression Data

The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe input variables, with no corresponding output. For example, in a marketing setting, we might have demographic information for a number of current or potential customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics. This is known as a clustering problem. Unlike in the previous examples, here we are not trying to predict an output variable.

We devote Chapter 10 to a discussion of statistical learning methods for problems in which no natural output variable is available. We consider the NCI60 data set, which consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.

The left-hand panel of Figure 1.4 addresses this problem by representing each of the 64 cell lines using just two numbers, Z_1 and Z_2. These are the first two principal components of the data, which summarize the 6,830 expression measurements for each cell line down to two numbers or dimensions.

FIGURE 1.4. Left: Representation of the NCI60 gene expression data set in a two-dimensional space, Z_1 and Z_2. Each point corresponds to one of the 64 cell lines. There appear to be four groups of cell lines, which we have represented using different colors. Right: Same as left panel except that we have represented each of the 14 different types of cancer using a different colored symbol. Cell lines corresponding to the same cancer type tend to be nearby in the two-dimensional space.

While it is likely that this dimension reduction has resulted in some loss of information, it is now possible to visually examine the data for evidence of clustering. Deciding on the number of clusters is often a difficult problem. But the left-hand panel of Figure 1.4 suggests at least four groups of cell lines, which we have represented using separate colors. We can now examine the cell lines within each cluster for similarities in their types of cancer, in order to better understand the relationship between gene expression levels and cancer.

In this particular data set, it turns out that the cell lines correspond to 14 different types of cancer. (However, this information was not used to create the left-hand panel of Figure 1.4.) The right-hand panel of Figure 1.4 is identical to the left-hand panel, except that the 14 cancer types are shown using distinct colored symbols. There is clear evidence that cell lines with the same cancer type tend to be located near each other in this two-dimensional representation. In addition, even though the cancer information was not used to produce the left-hand panel, the clustering obtained does bear some resemblance to some of the actual cancer types observed in the right-hand panel. This provides some independent verification of the accuracy of our clustering analysis.
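A plot in the spirit of Figure 1.4 can be sketched in a few lines of R. This assumes the NCI60 object from the ISLR package, which stores the expression matrix in $data and the cancer-type labels in $labs; it is an illustration only, not necessarily the code used for the book's figure:

library(ISLR)

nci.data <- NCI60$data   # 64 cell lines by 6,830 expression measurements
nci.labs <- NCI60$labs   # cancer type of each cell line

# First two principal components, with points colored by cancer type
pr.out <- prcomp(nci.data, scale = TRUE)
plot(pr.out$x[, 1:2], col = as.numeric(as.factor(nci.labs)),
     pch = 19, xlab = "Z1", ylab = "Z2")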

A Brief History of Statistical Learning

Though the term statistical learning is fairly new, many of the concepts that underlie the field were developed long ago. At the beginning of the nineteenth century, Legendre and Gauss published papers on the method of least squares, which implemented the earliest form of what is now known as linear regression. The approach was first successfully applied to problems in astronomy. Linear regression is used for predicting quantitative values, such as an individual's salary. In order to predict qualitative values, such as whether a patient survives or dies, or whether the stock market increases or decreases, Fisher proposed linear discriminant analysis in 1936. In the 1940s, various authors put forth an alternative approach, logistic regression. In the early 1970s, Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases.

By the end of the 1970s, many more techniques for learning from data were available. However, they were almost exclusively linear methods, because fitting non-linear relationships was computationally infeasible at the time. By the 1980s, computing technology had finally improved sufficiently that non-linear methods were no longer computationally prohibitive. In the mid 1980s, Breiman, Friedman, Olshen, and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection. Hastie and Tibshirani coined the term generalized additive models in 1986 for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation.

Since that time, inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, focused on supervised and unsupervised modeling and prediction. In recent years, progress in statistical learning has been marked by the increasing availability of powerful and relatively user-friendly software, such as the popular and freely available R system. This has the potential to continue the transformation of the field from a set of techniques used and developed by statisticians and computer scientists to an essential toolkit for a much broader community.

This Book

The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman was first published in 2001. Since that time, it has become an important reference on the fundamentals of statistical machine learning. Its success derives from its comprehensive and detailed treatment of many important topics in statistical learning, as well as the fact that (relative to many upper-level statistics textbooks) it is accessible to a wide audience. However, the greatest factor behind the success of ESL has been its topical nature.

At the time of its publication, interest in the field of statistical learning was starting to explode. ESL provided one of the first accessible and comprehensive introductions to the topic. Since ESL was first published, the field of statistical learning has continued to flourish. The field's expansion has taken two forms. The most obvious growth has involved the development of new and improved statistical learning approaches aimed at answering a range of scientific questions across a number of fields. However, the field of statistical learning has also expanded its audience. In the 1990s, increases in computational power generated a surge of interest in the field from non-statisticians who were eager to use cutting-edge statistical tools to analyze their data. Unfortunately, the highly technical nature of these approaches meant that the user community remained primarily restricted to experts in statistics, computer science, and related fields with the training (and time) to understand and implement them.

In recent years, new and improved software packages have significantly eased the implementation burden for many statistical learning methods. At the same time, there has been growing recognition across a number of fields, from business to health care to genetics to the social sciences and beyond, that statistical learning is a powerful tool with important practical applications. As a result, the field has moved from one of primarily academic interest to a mainstream discipline, with an enormous potential audience. This trend will surely continue with the increasing availability of enormous quantities of data and the software to analyze it.

The purpose of An Introduction to Statistical Learning (ISL) is to facilitate the transition of statistical learning from an academic to a mainstream field. ISL is not intended to replace ESL, which is a far more comprehensive text both in terms of the number of approaches considered and the depth to which they are explored. We consider ESL to be an important companion for professionals (with graduate degrees in statistics, machine learning, or related fields) who need to understand the technical details behind statistical learning approaches. However, the community of users of statistical learning techniques has expanded to include individuals with a wider range of interests and backgrounds. Therefore, we believe that there is now a place for a less technical and more accessible version of ESL.

In teaching these topics over the years, we have discovered that they are of interest to master's and PhD students in fields as disparate as business administration, biology, and computer science, as well as to quantitatively-oriented upper-division undergraduates. It is important for this diverse group to be able to understand the models, intuitions, and strengths and weaknesses of the various approaches. But for this audience, many of the technical details behind statistical learning methods, such as optimization algorithms and theoretical properties, are not of primary interest.

We believe that these students do not need a deep understanding of these aspects in order to become informed users of the various methodologies, and in order to contribute to their chosen fields through the use of statistical learning tools.

ISL is based on the following four premises.

1. Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. We believe that many contemporary statistical learning procedures should, and will, become as widely available and used as is currently the case for classical methods such as linear regression. As a result, rather than attempting to consider every possible approach (an impossible task), we have concentrated on presenting the methods that we believe are most widely applicable.

2. Statistical learning should not be viewed as a series of black boxes. No single approach will perform well in all possible applications. Without understanding all of the cogs inside the box, or the interaction between those cogs, it is impossible to select the best box. Hence, we have attempted to carefully describe the model, intuition, assumptions, and trade-offs behind each of the methods that we consider.

3. While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! Thus, we have minimized discussion of technical details related to fitting procedures and theoretical properties. We assume that the reader is comfortable with basic mathematical concepts, but we do not assume a graduate degree in the mathematical sciences. For instance, we have almost completely avoided the use of matrix algebra, and it is possible to understand the entire book without a detailed knowledge of matrices and vectors.

4. We presume that the reader is interested in applying statistical learning methods to real-world problems. In order to facilitate this, as well as to motivate the techniques discussed, we have devoted a section within each chapter to R computer labs. In each lab, we walk the reader through a realistic application of the methods considered in that chapter. When we have taught this material in our courses, we have allocated roughly one-third of classroom time to working through the labs, and we have found them to be extremely useful. Many of the less computationally-oriented students who were initially intimidated by R's command level interface got the hang of things over the course of the quarter or semester. We have used R because it is freely available and is powerful enough to implement all of the methods discussed in the book. It also has optional packages that can be downloaded to implement literally thousands of additional methods.

Most importantly, R is the language of choice for academic statisticians, and new approaches often become available in R years before they are implemented in commercial packages. However, the labs in ISL are self-contained, and can be skipped if the reader wishes to use a different software package or does not wish to apply the methods discussed to real-world problems.

Who Should Read This Book?

This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, or quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business. We expect that the reader will have had at least one elementary course in statistics. Background in linear regression is also useful, though not required, since we review the key concepts behind linear regression in Chapter 3. The mathematical level of this book is modest, and a detailed knowledge of matrix operations is not required. This book provides an introduction to the statistical programming language R. Previous exposure to a programming language, such as MATLAB or Python, is useful but not required.

We have successfully taught material at this level to master's and PhD students in business, computer science, biology, earth sciences, psychology, and many other areas of the physical and social sciences. This book could also be appropriate for advanced undergraduates who have already taken a course on linear regression. In the context of a more mathematically rigorous course in which ESL serves as the primary textbook, ISL could be used as a supplementary text for teaching computational aspects of the various approaches.

Notation and Simple Matrix Algebra

Choosing notation for a textbook is always a difficult task. For the most part we adopt the same notational conventions as ESL. We will use n to represent the number of distinct data points, or observations, in our sample. We will let p denote the number of variables that are available for use in making predictions. For example, the Wage data set consists of 12 variables for 3,000 people, so we have n = 3,000 observations and p = 12 variables (such as year, age, wage, and more). Note that throughout this book, we indicate variable names using colored font: Variable Name. In some examples, p might be quite large, such as on the order of thousands or even millions; this situation arises quite often, for example, in the analysis of modern biological data or web-based advertising data.
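In R, n and p are simply the dimensions of the data frame or matrix holding the data. A minimal sketch, assuming the ISLR package that accompanies this book is installed (the exact column count may vary slightly across package versions):

library(ISLR)

dim(Wage)        # c(n, p): roughly 3,000 observations and 12 variables, as described above
n <- nrow(Wage)  # number of observations
p <- ncol(Wage)  # number of variables
names(Wage)      # the variable names, including year, age, and wage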

In general, we will let x_ij represent the value of the jth variable for the ith observation, where i = 1, 2, ..., n and j = 1, 2, ..., p. Throughout this book, i will be used to index the samples or observations (from 1 to n) and j will be used to index the variables (from 1 to p). We let X denote an n × p matrix whose (i, j)th element is x_ij. That is,

X = ( x_11  x_12  ...  x_1p
      x_21  x_22  ...  x_2p
       ...   ...        ...
      x_n1  x_n2  ...  x_np ).

For readers who are unfamiliar with matrices, it is useful to visualize X as a spreadsheet of numbers with n rows and p columns. At times we will be interested in the rows of X, which we write as x_1, x_2, ..., x_n. Here x_i is a vector of length p, containing the p variable measurements for the ith observation. That is,

x_i = ( x_i1, x_i2, ..., x_ip )^T.    (1.1)

(Vectors are by default represented as columns.) For example, for the Wage data, x_i is a vector of length 12, consisting of year, age, wage, and other values for the ith individual. At other times we will instead be interested in the columns of X, which we write as x_1, x_2, ..., x_p. Each is a vector of length n. That is,

x_j = ( x_1j, x_2j, ..., x_nj )^T.

For example, for the Wage data, x_1 contains the n = 3,000 values for year. Using this notation, the matrix X can be written as

X = ( x_1  x_2  ...  x_p ),

or

X = ( x_1^T
      x_2^T
       ...
      x_n^T ).

The ^T notation denotes the transpose of a matrix or vector. So, for example,

X^T = ( x_11  x_21  ...  x_n1
        x_12  x_22  ...  x_n2
         ...   ...        ...
        x_1p  x_2p  ...  x_np ),

while

x_i^T = ( x_i1  x_i2  ...  x_ip ).

We use y_i to denote the ith observation of the variable on which we wish to make predictions, such as wage. Hence, we write the set of all n observations in vector form as

y = ( y_1, y_2, ..., y_n )^T.

Then our observed data consists of {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each x_i is a vector of length p. (If p = 1, then x_i is simply a scalar.)

In this text, a vector of length n will always be denoted in lower case bold; e.g.

a = ( a_1, a_2, ..., a_n )^T.

However, vectors that are not of length n (such as feature vectors of length p, as in (1.1)) will be denoted in lower case normal font, e.g. a. Scalars will also be denoted in lower case normal font, e.g. a. In the rare cases in which these two uses for lower case normal font lead to ambiguity, we will clarify which use is intended. Matrices will be denoted using bold capitals, such as A. Random variables will be denoted using capital normal font, e.g. A, regardless of their dimensions.

Occasionally we will want to indicate the dimension of a particular object. To indicate that an object is a scalar, we will use the notation a ∈ R. To indicate that it is a vector of length k, we will use a ∈ R^k (or a ∈ R^n if it is of length n). We will indicate that an object is an r × s matrix using A ∈ R^(r×s).

We have avoided using matrix algebra whenever possible. However, in a few instances it becomes too cumbersome to avoid it entirely. In these rare instances it is important to understand the concept of multiplying two matrices.

Suppose that A ∈ R^(r×d) and B ∈ R^(d×s). Then the product of A and B is denoted AB. The (i, j)th element of AB is computed by multiplying each element of the ith row of A by the corresponding element of the jth column of B. That is, (AB)_ij = Σ_{k=1}^{d} a_ik b_kj. As an example, consider

A = ( 1  2
      3  4 )

and

B = ( 5  6
      7  8 ).

Then the (1,1) element of AB is 1×5 + 2×7 = 19, the (1,2) element is 1×6 + 2×8 = 22, the (2,1) element is 3×5 + 4×7 = 43, and the (2,2) element is 3×6 + 4×8 = 50, so that

AB = ( 19  22
       43  50 ).

Note that this operation produces an r × s matrix. It is only possible to compute AB if the number of columns of A is the same as the number of rows of B.
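The same product can be checked directly in R, where %*% denotes matrix multiplication. A minimal sketch of the example above:

A <- matrix(c(1, 3, 2, 4), nrow = 2)   # the matrix A above (R fills matrices column by column)
B <- matrix(c(5, 7, 6, 8), nrow = 2)   # the matrix B above

A %*% B   # the 2 x 2 product computed above
t(A)      # transpose, corresponding to the A^T notation
A[1, ]    # first row of A
A[, 2]    # second column of A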

Organization of This Book

Chapter 2 introduces the basic terminology and concepts behind statistical learning. This chapter also presents the K-nearest neighbor classifier, a very simple method that works surprisingly well on many problems. Chapters 3 and 4 cover classical linear methods for regression and classification. In particular, Chapter 3 reviews linear regression, the fundamental starting point for all regression methods. In Chapter 4 we discuss two of the most important classical classification methods, logistic regression and linear discriminant analysis. A central problem in all statistical learning situations involves choosing the best method for a given application. Hence, in Chapter 5 we introduce cross-validation and the bootstrap, which can be used to estimate the accuracy of a number of different methods in order to choose the best one.

Much of the recent research in statistical learning has concentrated on non-linear methods. However, linear methods often have advantages over their non-linear competitors in terms of interpretability and sometimes also accuracy. Hence, in Chapter 6 we consider a host of linear methods, both classical and more modern, which offer potential improvements over standard linear regression. These include stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso.

The remaining chapters move into the world of non-linear statistical learning. We first introduce in Chapter 7 a number of non-linear methods that work well for problems with a single input variable. We then show how these methods can be used to fit non-linear additive models for which there is more than one input. In Chapter 8, we investigate tree-based methods, including bagging, boosting, and random forests. Support vector machines, a set of approaches for performing both linear and non-linear classification, are discussed in Chapter 9. Finally, in Chapter 10, we consider a setting in which we have input variables but no output variable. In particular, we present principal components analysis, K-means clustering, and hierarchical clustering.

At the end of each chapter, we present one or more R lab sections in which we systematically work through applications of the various methods discussed in that chapter. These labs demonstrate the strengths and weaknesses of the various approaches, and also provide a useful reference for the syntax required to implement the various methods. The reader may choose to work through the labs at his or her own pace, or the labs may be the focus of group sessions as part of a classroom environment. Within each R lab, we present the results that we obtained when we performed the lab at the time of writing this book. However, new versions of R are continuously released, and over time, the packages called in the labs will be updated. Therefore, in the future, it is possible that the results shown in the lab sections may no longer correspond precisely to the results obtained by the reader who performs the labs. As necessary, we will post updates to the labs on the book website.

We use a special symbol to denote sections or exercises that contain more challenging concepts. These can be easily skipped by readers who do not wish to delve as deeply into the material, or who lack the mathematical background.

Data Sets Used in Labs and Exercises

In this textbook, we illustrate statistical learning methods using applications from marketing, finance, biology, and other areas. The ISLR package available on the book website contains a number of data sets that are required in order to perform the labs and exercises associated with this book. One other data set is contained in the MASS library, and yet another is part of the base R distribution. Table 1.1 contains a summary of the data sets required to perform the labs and exercises. A couple of these data sets are also available as text files on the book website, for use in Chapter 2.

Book Website

The website for this book is located at www.statlearning.com. It contains a number of resources, including the R package associated with this book, and some additional data sets.

Auto: Gas mileage, horsepower, and other information for cars.
Boston: Housing values and other information about Boston suburbs.
Caravan: Information about individuals offered caravan insurance.
Carseats: Information about car seat sales in 400 stores.
College: Demographic characteristics, tuition, and more for USA colleges.
Default: Customer default records for a credit card company.
Hitters: Records and salaries for baseball players.
Khan: Gene expression measurements for four cancer types.
NCI60: Gene expression measurements for 64 cancer cell lines.
OJ: Sales information for Citrus Hill and Minute Maid orange juice.
Portfolio: Past values of financial assets, for use in portfolio allocation.
Smarket: Daily percentage returns for S&P 500 over a 5-year period.
USArrests: Crime statistics per 100,000 residents in 50 states of USA.
Wage: Income survey data for males in central Atlantic region of USA.
Weekly: 1,089 weekly stock market returns for 21 years.

TABLE 1.1. A list of data sets needed to perform the labs and exercises in this textbook. All data sets are available in the ISLR library, with the exception of Boston (part of MASS) and USArrests (part of the base R distribution).

Acknowledgements

A few of the plots in this book were taken from ESL, among them Figures 6.7 and 8.3. All other plots are new to this book.
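Before starting the labs, the data sets listed in Table 1.1 above can be made available in an R session roughly as follows (a minimal sketch; the one-time package installation assumes an internet connection):

install.packages("ISLR")   # one-time installation of the book's package
library(ISLR)              # most of the data sets in Table 1.1, e.g. Wage, Smarket
library(MASS)              # the Boston data
data("USArrests")          # part of the base R distribution

head(Wage)                 # first few rows of the Wage data
?Boston                    # documentation for the Boston housing data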

2 Statistical Learning

2.1 What Is Statistical Learning?

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X_1 might be the TV budget, X_2 the radio budget, and X_3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable, in this case sales, is often called the response or dependent variable, and is typically denoted using the symbol Y. Throughout this book, we will use all of these terms interchangeably.
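The simple least squares fits shown in Figure 2.1 can be reproduced roughly as follows. This sketch assumes that the file Advertising.csv has been downloaded from the book website into the working directory, since the Advertising data are not included in the ISLR package; the column names TV and sales are as they appear in that file:

# Assumes Advertising.csv (from the book website) is in the working directory
Advertising <- read.csv("Advertising.csv")

# Least squares fit of sales on the TV budget, as in the left panel of Figure 2.1
plot(Advertising$TV, Advertising$sales, col = "red", xlab = "TV", ylab = "Sales")
abline(lm(sales ~ TV, data = Advertising), col = "blue", lwd = 2)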

FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.

More generally, suppose that we observe a quantitative response Y and p different predictors, X_1, X_2, ..., X_p. We assume that there is some relationship between Y and X = (X_1, X_2, ..., X_p), which can be written in the very general form

Y = f(X) + ε.    (2.1)

Here f is some fixed but unknown function of X_1, ..., X_p, and ε is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y.

As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function f that connects the input variable to the output variable is in general unknown. In this situation one must estimate f based on the observed points. Since Income is a simulated data set, f is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms ε. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.

In general, the function f may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here f is a two-dimensional surface that must be estimated based on the observed data.

FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.

In essence, statistical learning refers to a set of approaches for estimating f. In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.

2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference. We discuss each in turn.

Prediction

In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

Ŷ = f̂(X),    (2.2)

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.

FIGURE 2.3. The plot displays income as a function of years of education and seniority in the Income data set. The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.

As an example, suppose that X_1, ..., X_p are characteristics of a patient's blood sample that can be easily measured in a lab, and Y is a variable encoding the patient's risk for a severe adverse reaction to a particular drug. It is natural to seek to predict Y using X, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction, that is, patients for whom the estimate of Y is high.

The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error. In general, f̂ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f. However, even if it were possible to form a perfect estimate for f, so that our estimated response took the form Ŷ = f(X), our prediction would still have some error in it! This is because Y is also a function of ε, which, by definition, cannot be predicted using X. Therefore, variability associated with ε also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by ε.

Why is the irreducible error larger than zero? The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity ε may also contain unmeasurable variation.

For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient's general feeling of well-being on that day.

Consider a given estimate f̂ and a set of predictors X, which yields the prediction Ŷ = f̂(X). Assume for a moment that both f̂ and X are fixed. Then, it is easy to show that

E(Y - Ŷ)² = E[f(X) + ε - f̂(X)]² = [f(X) - f̂(X)]² + Var(ε),    (2.3)

where the first term, [f(X) - f̂(X)]², is the reducible error and the second term, Var(ε), is the irreducible error. Here E(Y - Ŷ)² represents the average, or expected value, of the squared difference between the predicted and actual value of Y, and Var(ε) represents the variance associated with the error term ε.

The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.

Inference

We are often interested in understanding the way that Y is affected as X_1, ..., X_p change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y, or more specifically, to understand how Y changes as a function of X_1, ..., X_p. Now f̂ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.

What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y, in the sense that increasing the predictor is associated with increasing values of Y. Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors.

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
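The decomposition in (2.3) above can be illustrated with a small simulation. The sketch below uses a hypothetical true f and a deliberately simple linear estimate f̂; it is purely illustrative and not one of the book's examples:

set.seed(1)
n <- 10000
x <- runif(n)
f <- function(x) sin(2 * pi * x)   # a hypothetical true f
eps <- rnorm(n, sd = 0.5)          # irreducible error, Var(eps) = 0.25
y <- f(x) + eps

fhat <- lm(y ~ x)                  # a deliberately crude (linear) estimate of f
pred <- predict(fhat)

mean((y - pred)^2)                 # total expected squared error, approximately
mean((f(x) - pred)^2)              # reducible part: [f(X) - fhat(X)]^2
var(eps)                           # irreducible part: Var(eps), about 0.25

The first quantity is approximately the sum of the other two, matching (2.3).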

In this book, we will see a number of examples that fall into the prediction setting, the inference setting, or a combination of the two.

For instance, consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome. The company is not interested in obtaining a deep understanding of the relationships between each individual predictor and the response; instead, the company simply wants an accurate model to predict the response using the predictors. This is an example of modeling for prediction.

In contrast, consider the Advertising data illustrated in Figure 2.1. One may be interested in answering questions such as: Which media contribute to sales? Which media generate the biggest boost in sales? or How much increase in sales is associated with a given increase in TV advertising? This situation falls into the inference paradigm. Another example involves modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase. For instance, what effect will changing the price of a product have on sales? This is an example of modeling for inference.

Finally, some modeling could be conducted both for prediction and inference. For example, in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth. In this case one might be interested in how the individual input variables affect the prices, that is, how much extra will a house be worth if it has a view of the river? This is an inference problem. Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches that we discuss in the later chapters of this book can potentially provide quite accurate predictions for Y, but this comes at the expense of a less interpretable model for which inference is more challenging.

2.1.2 How Do We Estimate f?

Throughout this book, we explore many linear and non-linear approaches for estimating f. However, these methods generally share certain characteristics. We provide an overview of these shared characteristics in this section. We will always assume that we have observed a set of n different data points. For example, in Figure 2.2 we observed n = 30 data points. These observations are called the training data because we will use these observations to train, or teach, our method how to estimate f. Let x_ij represent the value of the jth predictor, or input, for observation i, where i = 1, 2, ..., n and j = 1, 2, ..., p. Correspondingly, let y_i represent the response variable for the ith observation. Then our training data consist of {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} where x_i = (x_i1, x_i2, ..., x_ip)^T.

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function f. In other words, we want to find a function f̂ such that Y ≈ f̂(X) for any observation (X, Y). Broadly speaking, most statistical learning methods for this task can be characterized as either parametric or non-parametric. We now briefly discuss these two types of approaches.

Parametric Methods

Parametric methods involve a two-step model-based approach.

1. First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in X:

f(X) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p.    (2.4)

This is a linear model, which will be discussed extensively in Chapter 3. Once we have assumed that f is linear, the problem of estimating f is greatly simplified. Instead of having to estimate an entirely arbitrary p-dimensional function f(X), one only needs to estimate the p + 1 coefficients β_0, β_1, ..., β_p.

2. After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model (2.4), we need to estimate the parameters β_0, β_1, ..., β_p. That is, we want to find values of these parameters such that

Y ≈ β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p.

The most common approach to fitting the model (2.4) is referred to as (ordinary) least squares, which we discuss in Chapter 3. However, least squares is one of many possible ways to fit the linear model. In Chapter 6, we discuss other approaches for estimating the parameters in (2.4).

FIGURE 2.4. A linear model fit by least squares to the Income data from Figure 2.3. The observations are shown in red, and the yellow plane indicates the least squares fit to the data.

The model-based approach just described is referred to as parametric; it reduces the problem of estimating f down to one of estimating a set of parameters. Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as β_0, β_1, ..., β_p in the linear model (2.4), than it is to fit an entirely arbitrary function f. The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor. We can try to address this problem by choosing flexible models that can fit many different possible functional forms for f. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely. These issues are discussed throughout this book.

Figure 2.4 shows an example of the parametric approach applied to the Income data from Figure 2.3. We have fit a linear model of the form

income ≈ β_0 + β_1 × education + β_2 × seniority.

Since we have assumed a linear relationship between the response and the two predictors, the entire fitting problem reduces to estimating β_0, β_1, and β_2, which we do using least squares linear regression.
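In R, a least squares fit like the one in Figure 2.4 corresponds to a call to lm(). The sketch below assumes the simulated Income data have been downloaded from the book website as Income2.csv with columns named Education, Seniority, and Income; the file name and column names are assumptions, since this data set is not part of the ISLR package:

# Assumes Income2.csv (from the book website) with columns Education, Seniority, Income
Income <- read.csv("Income2.csv")

# The parametric linear model income ~ beta0 + beta1*education + beta2*seniority
fit <- lm(Income ~ Education + Seniority, data = Income)
coef(fit)   # least squares estimates of beta0, beta1, and beta2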

FIGURE 2.5. A smooth thin-plate spline fit to the Income data from Figure 2.3 is shown in yellow; the observations are displayed in red. Splines are discussed in Chapter 7.

Comparing Figure 2.3 to Figure 2.4, we can see that the linear fit given in Figure 2.4 is not quite right: the true f has some curvature that is not captured in the linear fit. However, the linear fit still appears to do a reasonable job of capturing the positive relationship between years of education and income, as well as the slightly less positive relationship between seniority and income. It may be that with such a small number of observations, this is the best we can do.

Non-parametric Methods

Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.

An example of a non-parametric approach to fitting the Income data is shown in Figure 2.5. A thin-plate spline is used to estimate f. This approach does not impose any pre-specified model on f.

FIGURE 2.6. A rough thin-plate spline fit to the Income data from Figure 2.3. This fit makes zero errors on the training data.

In this case, the non-parametric fit has produced a remarkably accurate estimate of the true f shown in Figure 2.3. In order to fit a thin-plate spline, the data analyst must select a level of smoothness. Figure 2.6 shows the same thin-plate spline fit using a lower level of smoothness, allowing for a rougher fit. The resulting estimate fits the observed data perfectly! However, the spline fit shown in Figure 2.6 is far more variable than the true function f from Figure 2.3. This is an example of overfitting the data, which we discussed previously. It is an undesirable situation because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set. We discuss methods for choosing the correct amount of smoothness in Chapter 5. Splines are discussed in Chapter 7.

As we have seen, there are advantages and disadvantages to parametric and non-parametric methods for statistical learning. We explore both types of methods throughout this book.

2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

Of the many methods that we examine in this book, some are less flexible, or more restrictive, in the sense that they can produce just a relatively small range of shapes to estimate f. For example, linear regression is a relatively inflexible approach, because it can only generate linear functions such as the lines shown in Figure 2.1 or the plane shown in Figure 2.3.

FIGURE 2.7. A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases. (The methods shown, from most to least interpretable, are subset selection and the lasso; least squares; generalized additive models and trees; bagging and boosting; and support vector machines.)

Other methods, such as the thin plate splines shown in Figures 2.5 and 2.6, are considerably more flexible because they can generate a much wider range of possible shapes to estimate f.

One might reasonably ask the following question: why would we ever choose to use a more restrictive method instead of a very flexible approach? There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between Y and X1, X2, ..., Xp. In contrast, very flexible approaches, such as the splines discussed in Chapter 7 and displayed in Figures 2.5 and 2.6, and the boosting methods discussed in Chapter 8, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.

Figure 2.7 provides an illustration of the trade-off between flexibility and interpretability for some of the methods that we cover in this book. Least squares linear regression, discussed in Chapter 3, is relatively inflexible but is quite interpretable. The lasso, discussed in Chapter 6, relies upon the linear model (2.4) but uses an alternative fitting procedure for estimating the coefficients β0, β1, ..., βp. The new procedure is more restrictive in estimating the coefficients, and sets a number of them to exactly zero. Hence in this sense the lasso is a less flexible approach than linear regression. It is also more interpretable than linear regression, because in the final model the response variable will only be related to a small subset of the predictors, namely those with nonzero coefficient estimates.

Generalized additive models (GAMs), discussed in Chapter 7, extend the linear model (2.4) to allow for certain non-linear relationships. Consequently, GAMs are more flexible than linear regression. They are also somewhat less interpretable than linear regression, because the relationship between each predictor and the response is now modeled using a curve. Finally, fully non-linear methods such as bagging, boosting, and support vector machines with non-linear kernels, discussed in Chapters 8 and 9, are highly flexible approaches that are harder to interpret.

We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern. In this setting, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods. We saw an example of overfitting in Figure 2.6. We will discuss this very important concept further in Section 2.2 and throughout this book.

2.1.4 Supervised Versus Unsupervised Learning

Most statistical learning problems fall into one of two categories: supervised or unsupervised. The examples that we have discussed so far in this chapter all fall into the supervised learning domain. For each observation of the predictor measurement(s) x_i, i = 1, ..., n, there is an associated response measurement y_i. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). Many classical statistical learning methods such as linear regression and logistic regression (Chapter 4), as well as more modern approaches such as GAMs, boosting, and support vector machines, operate in the supervised learning domain. The vast majority of this book is devoted to this setting.

In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, ..., n, we observe a vector of measurements x_i but no associated response y_i. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis. What sort of statistical analysis is possible?

FIGURE 2.8. A clustering data set involving three groups. Each group is shown using a different colored symbol. Left: The three groups are well-separated. In this setting, a clustering approach should successfully identify the three groups. Right: There is some overlap among the groups. Now the clustering task is more challenging.

We can seek to understand the relationships between the variables or between the observations. One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of x_1, ..., x_n, whether the observations fall into relatively distinct groups. For example, in a market segmentation study we might observe multiple characteristics (variables) for potential customers, such as zip code, family income, and shopping habits. We might believe that the customers fall into different groups, such as big spenders versus low spenders. If the information about each customer's spending patterns were available, then a supervised analysis would be possible. However, this information is not available; that is, we do not know whether each potential customer is a big spender or not. In this setting, we can try to cluster the customers on the basis of the variables measured, in order to identify distinct groups of potential customers. Identifying such groups can be of interest because it might be that the groups differ with respect to some property of interest, such as spending habits.

Figure 2.8 provides a simple illustration of the clustering problem. We have plotted 150 observations with measurements on two variables, X1 and X2. Each observation corresponds to one of three distinct groups. For illustrative purposes, we have plotted the members of each group using different colors and symbols. However, in practice the group memberships are unknown, and the goal is to determine the group to which each observation belongs. In the left-hand panel of Figure 2.8, this is a relatively easy task because the groups are well-separated. In contrast, the right-hand panel illustrates a more challenging problem in which there is some overlap between the groups; a clustering method could not be expected to assign all of the overlapping points to their correct group (blue, green, or orange).
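As a small illustration of clustering in code, the sketch below simulates 150 observations from three shifted groups, roughly in the spirit of Figure 2.8, and applies k-means clustering with kmeans() from base R. The data and settings are invented for illustration.

# A minimal sketch of clustering simulated three-group data with k-means.
set.seed(3)
x <- matrix(rnorm(150 * 2), ncol = 2)
x[1:50, ]    <- x[1:50, ] + 3          # shift group 1 away from the origin
x[51:100, 2] <- x[51:100, 2] - 4       # shift group 2; group 3 stays near the origin
km <- kmeans(x, centers = 3, nstart = 20)
table(km$cluster)                      # sizes of the three estimated clusters
plot(x, col = km$cluster, pch = 19)    # each point colored by its assigned cluster

In a real market segmentation study the group labels would be unknown, and the analyst would have to judge whether the estimated clusters correspond to meaningful groups of customers.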

In the examples shown in Figure 2.8, there are only two variables, and so one can simply visually inspect the scatterplots of the observations in order to identify clusters. However, in practice we often encounter data sets that contain many more than two variables. In this case, we cannot easily plot the observations. For instance, if there are p variables in our data set, then p(p − 1)/2 distinct scatterplots can be made, and visual inspection is simply not a viable way to identify clusters. For this reason, automated clustering methods are important. We discuss clustering and other unsupervised learning approaches in Chapter 10.

Many problems fall naturally into the supervised or unsupervised learning paradigms. However, sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut. For instance, suppose that we have a set of n observations. For m of the observations, where m < n, we have both predictor measurements and a response measurement. For the remaining n − m observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a semi-supervised learning problem. In this setting, we wish to use a statistical learning method that can incorporate the m observations for which response measurements are available as well as the n − m observations for which they are not. Although this is an interesting topic, it is beyond the scope of this book.

2.1.5 Regression Versus Classification Problems

Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. Examples include a person's age, height, or income, the value of a house, and the price of a stock. In contrast, qualitative variables take on values in one of K different classes, or categories. Examples of qualitative variables include a person's gender (male or female), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia). We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. However, the distinction is not always that crisp. Least squares linear regression (Chapter 3) is used with a quantitative response, whereas logistic regression (Chapter 4) is typically used with a qualitative (two-class, or binary) response. As such it is often used as a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well.

Some statistical methods, such as K-nearest neighbors (Chapters 2 and 4) and boosting (Chapter 8), can be used in the case of either quantitative or qualitative responses. We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative; i.e. we might use linear regression when quantitative and logistic regression when qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed. This is discussed in Chapter 3.

2.2 Assessing Model Accuracy

One of the key aims of this book is to introduce the reader to a wide range of statistical learning methods that extend far beyond the standard linear regression approach. Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

In this section, we discuss some of the most important concepts that arise in selecting a statistical learning procedure for a specific data set. As the book progresses, we will explain how the concepts presented here can be applied in practice.

2.2.1 Measuring the Quality of Fit

In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by

   MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²,   (2.5)

where f̂(x_i) is the prediction that f̂ gives for the ith observation. The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.

The MSE in (2.5) is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE. But in general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. Why is this what we care about? Suppose that we are interested in developing an algorithm to predict a stock's price based on previous stock returns. We can train the method using stock returns from the past 6 months. But we don't really care how well our method predicts last week's stock price. We instead care about how well it will predict tomorrow's price or next month's price. On a similar note, suppose that we have clinical measurements (e.g. weight, blood pressure, height, age, family history of disease) for a number of patients, as well as information about whether each patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In practice, we want this method to accurately predict diabetes risk for future patients based on their clinical measurements. We are not very interested in whether or not the method accurately predicts diabetes risk for patients used to train the model, since we already know which of those patients have diabetes.

To state it more mathematically, suppose that we fit our statistical learning method on our training observations {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, and we obtain the estimate f̂. We can then compute f̂(x_1), f̂(x_2), ..., f̂(x_n). If these are approximately equal to y_1, y_2, ..., y_n, then the training MSE given by (2.5) is small. However, we are really not interested in whether f̂(x_i) ≈ y_i; instead, we want to know whether f̂(x_0) is approximately equal to y_0, where (x_0, y_0) is a previously unseen test observation not used to train the statistical learning method. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute

   Ave(y_0 − f̂(x_0))²,   (2.6)

the average squared prediction error for these test observations (x_0, y_0). We would like to select the model for which the average of this quantity, the test MSE, is as small as possible.

How can we go about trying to select a method that minimizes the test MSE? In some settings, we may have a test data set available; that is, we may have access to a set of observations that were not used to train the statistical learning method. We can then simply evaluate (2.6) on the test observations, and select the learning method for which the test MSE is smallest.
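The distinction between training MSE and test MSE is easy to see in a short simulation. In the sketch below, both quantities are computed for a deliberately flexible polynomial fit; the data-generating function and all settings are invented for illustration.

# A minimal sketch contrasting training MSE (2.5) with test MSE (2.6).
set.seed(4)
f <- function(x) sin(2 * x)                       # an invented "true" f
x_train <- runif(100, 0, 3); y_train <- f(x_train) + rnorm(100, sd = 0.3)
x_test  <- runif(100, 0, 3); y_test  <- f(x_test)  + rnorm(100, sd = 0.3)
# Fit a very flexible polynomial using the training data only.
fit <- lm(y_train ~ poly(x_train, 10))
train_mse <- mean((y_train - fitted(fit))^2)
test_mse  <- mean((y_test - predict(fit, data.frame(x_train = x_test)))^2)
c(train = train_mse, test = test_mse)             # the test MSE is typically the larger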

FIGURE 2.9. Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

But what if no test observations are available? In that case, one might imagine simply selecting a statistical learning method that minimizes the training MSE (2.5). This seems like it might be a sensible approach, since the training MSE and the test MSE appear to be closely related. Unfortunately, there is a fundamental problem with this strategy: there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE. Roughly speaking, the problem is that many statistical methods specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be quite small, but the test MSE is often much larger.

Figure 2.9 illustrates this phenomenon on a simple example. In the left-hand panel of Figure 2.9, we have generated observations from (2.1) with the true f given by the black curve. The orange, blue and green curves illustrate three possible estimates for f obtained using methods with increasing levels of flexibility. The orange line is the linear regression fit, which is relatively inflexible. The blue and green curves were produced using smoothing splines, discussed in Chapter 7, with different levels of smoothness. It is clear that as the level of flexibility increases, the curves fit the observed data more closely. The green curve is the most flexible and matches the data very well; however, we observe that it fits the true f (shown in black) poorly because it is too wiggly. By adjusting the level of flexibility of the smoothing spline fit, we can produce many different fits to this data.

We now move on to the right-hand panel of Figure 2.9. The grey curve displays the average training MSE as a function of flexibility, or more formally the degrees of freedom, for a number of smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a curve; it is discussed more fully in Chapter 7. The orange, blue and green squares indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve; note that in Figure 2.9, linear regression is at the most restrictive end, with two degrees of freedom. The training MSE declines monotonically as flexibility increases. In this example the true f is non-linear, and so the orange linear fit is not flexible enough to estimate f well. The green curve has the lowest training MSE of all three methods, since it corresponds to the most flexible of the three curves fit in the left-hand panel.

In this example, we know the true function f, and so we can also compute the test MSE over a very large test set, as a function of flexibility. (Of course, in general f is unknown, so this will not be possible.) The test MSE is displayed using the red curve in the right-hand panel of Figure 2.9. As with the training MSE, the test MSE initially declines as the level of flexibility increases. However, at some point the test MSE levels off and then starts to increase again. Consequently, the orange and green curves both have high test MSE. The blue curve minimizes the test MSE, which should not be surprising given that visually it appears to estimate f the best in the left-hand panel of Figure 2.9. The horizontal dashed line indicates Var(ε), the irreducible error in (2.3), which corresponds to the lowest achievable test MSE among all possible methods. Hence, the smoothing spline represented by the blue curve is close to optimal.

In the right-hand panel of Figure 2.9, as the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used. As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don't exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
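The pattern in the right-hand panel of Figure 2.9 can be mimicked with a few lines of code. The sketch below fits smoothing splines of increasing degrees of freedom to simulated data and reports the training and test MSE for each fit; the true f and all settings here are invented and are not those used to produce Figure 2.9.

# A minimal sketch of training and test MSE as flexibility (df) increases.
set.seed(5)
f <- function(x) x * sin(x)
x_tr <- runif(150, 0, 6); y_tr <- f(x_tr) + rnorm(150)
x_te <- runif(150, 0, 6); y_te <- f(x_te) + rnorm(150)
for (df in c(2, 5, 10, 25, 40)) {
  fit <- smooth.spline(x_tr, y_tr, df = df)
  tr  <- mean((y_tr - predict(fit, x_tr)$y)^2)
  te  <- mean((y_te - predict(fit, x_te)$y)^2)
  cat(sprintf("df = %2d   training MSE = %.2f   test MSE = %.2f\n", df, tr, te))
}
# Training MSE keeps falling as df grows; test MSE typically falls and then rises.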

FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.

Figure 2.10 provides another example in which the true f is approximately linear. Again we observe that the training MSE decreases monotonically as the model flexibility increases, and that there is a U-shape in the test MSE. However, because the truth is close to linear, the test MSE only decreases slightly before increasing again, so that the orange least squares fit is substantially better than the highly flexible green curve. Finally, Figure 2.11 displays an example in which f is highly non-linear. The training and test MSE curves still exhibit the same general patterns, but now there is a rapid decrease in both curves before the test MSE starts to increase slowly.

In practice, one can usually compute the training MSE with relative ease, but estimating the test MSE is considerably more difficult because usually no test data are available. As the previous three examples illustrate, the flexibility level corresponding to the model with the minimal test MSE can vary considerably among data sets. Throughout this book, we discuss a variety of approaches that can be used in practice to estimate this minimum point. One important method is cross-validation (Chapter 5), which is a method for estimating the test MSE using the training data.

2.2.2 The Bias-Variance Trade-Off

The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out to be the result of two competing properties of statistical learning methods. Though the mathematical proof is beyond the scope of this book, it is possible to show that the expected test MSE, for a given value x_0, can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x_0), the squared bias of f̂(x_0), and the variance of the error term ε.

FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.

That is,

   E(y_0 − f̂(x_0))² = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε).   (2.7)

Here the notation E(y_0 − f̂(x_0))² defines the expected test MSE, and refers to the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each at x_0. The overall expected test MSE can be computed by averaging E(y_0 − f̂(x_0))² over all possible values of x_0 in the test set.

Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below Var(ε), the irreducible error from (2.3).

What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance. Consider the green and orange curves in Figure 2.9.

The flexible green curve is following the observations very closely. It has high variance because changing any one of these data points may cause the estimate f̂ to change considerably. In contrast, the orange least squares line is relatively inflexible and has low variance, because moving any single observation will likely cause only a small shift in the position of the line.

On the other hand, bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, ..., Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f. In Figure 2.11, the true f is substantially non-linear, so no matter how many training observations we are given, it will not be possible to produce an accurate estimate using linear regression. In other words, linear regression results in high bias in this example. However, in Figure 2.10 the true f is very close to linear, and so given enough data, it should be possible for linear regression to produce an accurate estimate. Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. Note that we observed this pattern of decreasing test MSE followed by increasing test MSE in the right-hand panels of Figures 2.9–2.11.

The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in Figures 2.9–2.11. In each case the blue solid curve represents the squared bias, for different levels of flexibility, while the orange curve corresponds to the variance. The horizontal dashed line represents Var(ε), the irreducible error. Finally, the red curve, corresponding to the test set MSE, is the sum of these three quantities. In all three cases, the variance increases and the bias decreases as the method's flexibility increases. However, the flexibility level corresponding to the optimal test MSE differs considerably among the three data sets, because the squared bias and variance change at different rates in each of the data sets. In the left-hand panel of Figure 2.12, the bias initially decreases rapidly, resulting in an initial sharp decrease in the expected test MSE. On the other hand, in the center panel of Figure 2.12 the true f is close to linear, so there is only a small decrease in bias as flexibility increases, and the test MSE only declines slightly before increasing rapidly as the variance increases. Finally, in the right-hand panel of Figure 2.12, as flexibility increases, there is a dramatic decline in bias because the true f is very non-linear.

FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.

There is also very little increase in variance as flexibility increases. Consequently, the test MSE declines substantially before experiencing a small increase as model flexibility increases.

The relationship between bias, variance, and test set MSE given in Equation 2.7 and displayed in Figure 2.12 is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low. This trade-off is one of the most important recurring themes in this book.

In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the bias-variance trade-off in mind. In this book we explore methods that are extremely flexible and hence can essentially eliminate bias. However, this does not guarantee that they will outperform a much simpler method such as linear regression. To take an extreme example, suppose that the true f is linear. In this situation linear regression will have no bias, making it very hard for a more flexible method to compete. In contrast, if the true f is highly non-linear and we have an ample number of training observations, then we may do better using a highly flexible approach, as in Figure 2.11. In Chapter 5 we discuss cross-validation, which is a way to estimate the test MSE using the training data.
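Although Equation 2.7 cannot be computed for real data, it can be checked by simulation when we choose f ourselves. The sketch below repeatedly draws training sets, refits the same model, and estimates the variance and squared bias of f̂ at a single point x0; everything here (the true f, the noise level, the model) is invented for illustration.

# A minimal simulation of the decomposition in (2.7) at a single point x0.
set.seed(6)
f     <- function(x) sin(3 * x)       # an invented "true" f
x0    <- 1                            # the point at which we study f-hat
sigma <- 0.5                          # noise standard deviation, so Var(eps) = 0.25
preds <- replicate(500, {
  x <- runif(50, 0, 2)
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ poly(x, 3))           # a moderately flexible estimate of f
  predict(fit, data.frame(x = x0))
})
c(variance     = var(preds),
  squared_bias = (mean(preds) - f(x0))^2,
  irreducible  = sigma^2)
# The sum of these three quantities approximates the expected test MSE at x0.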

2.2.3 The Classification Setting

Thus far, our discussion of model accuracy has been focused on the regression setting. But many of the concepts that we have encountered, such as the bias-variance trade-off, transfer over to the classification setting with only some modifications due to the fact that y_i is no longer numerical. Suppose that we seek to estimate f on the basis of training observations {(x_1, y_1), ..., (x_n, y_n)}, where now y_1, ..., y_n are qualitative. The most common approach for quantifying the accuracy of our estimate f̂ is the training error rate, the proportion of mistakes that are made if we apply our estimate f̂ to the training observations:

   (1/n) Σ_{i=1}^{n} I(y_i ≠ ŷ_i).   (2.8)

Here ŷ_i is the predicted class label for the ith observation using f̂, and I(y_i ≠ ŷ_i) is an indicator variable that equals 1 if y_i ≠ ŷ_i and zero if y_i = ŷ_i. If I(y_i ≠ ŷ_i) = 0 then the ith observation was classified correctly by our classification method; otherwise it was misclassified. Hence Equation 2.8 computes the fraction of incorrect classifications.

Equation 2.8 is referred to as the training error rate because it is computed based on the data that was used to train our classifier. As in the regression setting, we are most interested in the error rates that result from applying our classifier to test observations that were not used in training. The test error rate associated with a set of test observations of the form (x_0, y_0) is given by

   Ave(I(y_0 ≠ ŷ_0)),   (2.9)

where ŷ_0 is the predicted class label that results from applying the classifier to the test observation with predictor x_0. A good classifier is one for which the test error (2.9) is smallest.

The Bayes Classifier

It is possible to show (though the proof is outside of the scope of this book) that the test error rate given in (2.9) is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector x_0 to the class j for which

   Pr(Y = j | X = x_0)   (2.10)

is largest. Note that (2.10) is a conditional probability: it is the probability that Y = j, given the observed predictor vector x_0. This very simple classifier is called the Bayes classifier. In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class one if Pr(Y = 1 | X = x_0) > 0.5, and class two otherwise.
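When the conditional probabilities are known by construction, the Bayes classifier and the error rate in (2.8)/(2.9) take only a few lines of code. The sketch below simulates a two-class problem with a known Pr(Y = 1 | X); the data-generating mechanism is invented and is not the one behind Figure 2.13.

# A minimal sketch of the Bayes classifier on simulated two-class data.
set.seed(7)
n  <- 2000
x  <- matrix(rnorm(n * 2), ncol = 2)
p1 <- plogis(2 * x[, 1] - x[, 2])        # true Pr(Y = 1 | X), known by construction
y  <- rbinom(n, 1, p1)
bayes_pred <- ifelse(p1 > 0.5, 1, 0)     # assign each observation to its most likely class
mean(y != bayes_pred)                    # the achieved error rate, as in (2.8)/(2.9)
# On average, no classifier built from the data alone can do better than this.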

FIGURE 2.13. A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which a test observation will be assigned to the blue class.

Figure 2.13 provides an example using a simulated data set in a two-dimensional space consisting of predictors X1 and X2. The orange and blue circles correspond to training observations that belong to two different classes. For each value of X1 and X2, there is a different probability of the response being orange or blue. Since this is simulated data, we know how the data were generated and we can calculate the conditional probabilities for each value of X1 and X2. The orange shaded region reflects the set of points for which Pr(Y = orange | X) is greater than 50%, while the blue shaded region indicates the set of points for which the probability is below 50%. The purple dashed line represents the points where the probability is exactly 50%. This is called the Bayes decision boundary. The Bayes classifier's prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which (2.10) is largest, the error rate at X = x_0 will be 1 − max_j Pr(Y = j | X = x_0). In general, the overall Bayes error rate is given by

   1 − E(max_j Pr(Y = j | X)),   (2.11)

where the expectation averages the probability over all possible values of X. For our simulated data, the Bayes error rate is greater than zero, because the classes overlap in the true population, so max_j Pr(Y = j | X = x_0) < 1 for some values of x_0. The Bayes error rate is analogous to the irreducible error, discussed earlier.

K-Nearest Neighbors

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier. Given a positive integer K and a test observation x_0, the KNN classifier first identifies the K points in the training data that are closest to x_0, represented by N_0. It then estimates the conditional probability for class j as the fraction of points in N_0 whose response values equal j:

   Pr(Y = j | X = x_0) = (1/K) Σ_{i∈N_0} I(y_i = j).   (2.12)

Finally, KNN applies Bayes rule and classifies the test observation x_0 to the class with the largest probability.

Figure 2.14 provides an illustrative example of the KNN approach. In the left-hand panel, we have plotted a small training data set consisting of six blue and six orange observations. Our goal is to make a prediction for the point labeled by the black cross. Suppose that we choose K = 3. Then KNN will first identify the three observations that are closest to the cross. This neighborhood is shown as a circle. It consists of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. Hence KNN will predict that the black cross belongs to the blue class. In the right-hand panel of Figure 2.14 we have applied the KNN approach with K = 3 at all of the possible values for X1 and X2, and have drawn in the corresponding KNN decision boundary.

Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier. Figure 2.15 displays the KNN decision boundary, using K = 10, when applied to the larger simulated data set from Figure 2.13. Notice that even though the true distribution is not known by the KNN classifier, the KNN decision boundary is very close to that of the Bayes classifier. The resulting KNN test error rate is close to the Bayes error rate.
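In R, a KNN fit of the kind shown in Figure 2.15 can be obtained with the knn() function from the class package. The sketch below applies it to simulated two-class data; the data are invented and are not the data set behind Figures 2.13–2.16.

# A minimal sketch of the KNN classifier using knn() from the class package.
library(class)
set.seed(8)
train_x <- matrix(rnorm(400), ncol = 2)
train_y <- factor(rbinom(200, 1, plogis(2 * train_x[, 1])))
test_x  <- matrix(rnorm(400), ncol = 2)
test_y  <- factor(rbinom(200, 1, plogis(2 * test_x[, 1])))
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 10)
mean(pred != test_y)     # estimated test error rate for K = 10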

FIGURE 2.14. The KNN approach, using K = 3, is illustrated in a simple situation with six blue observations and six orange observations. Left: a test observation at which a predicted class label is desired is shown as a black cross. The three closest points to the test observation are identified, and it is predicted that the test observation belongs to the most commonly-occurring class, in this case blue. Right: The KNN decision boundary for this example is shown in black. The blue grid indicates the region in which a test observation will be assigned to the blue class, and the orange grid indicates the region in which it will be assigned to the orange class.

The choice of K has a drastic effect on the KNN classifier obtained. Figure 2.16 displays two KNN fits to the simulated data from Figure 2.13, using K = 1 and K = 100. When K = 1, the decision boundary is overly flexible and finds patterns in the data that don't correspond to the Bayes decision boundary. This corresponds to a classifier that has low bias but very high variance. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to a low-variance but high-bias classifier. On this simulated data set, neither K = 1 nor K = 100 gives good predictions: both yield test error rates well above the minimum achieved at intermediate values of K.

Just as in the regression setting, there is not a strong relationship between the training error rate and the test error rate. With K = 1, the KNN training error rate is 0, but the test error rate may be quite high. In general, as we use more flexible classification methods, the training error rate will decline but the test error rate may not. In Figure 2.17, we have plotted the KNN test and training errors as a function of 1/K. As 1/K increases, the method becomes more flexible. As in the regression setting, the training error rate consistently declines as the flexibility increases. However, the test error exhibits a characteristic U-shape, declining at first (with a minimum at approximately K = 10) before increasing again when the method becomes excessively flexible and overfits.
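The U-shaped test error curve in Figure 2.17 is easy to reproduce qualitatively by looping over K on simulated data, as in the sketch below; the data and the grid of K values are invented for illustration.

# A minimal sketch of training and test error rates as K varies.
library(class)
set.seed(9)
tr_x <- matrix(rnorm(400), ncol = 2); tr_y <- factor(rbinom(200, 1, plogis(2 * tr_x[, 1])))
te_x <- matrix(rnorm(400), ncol = 2); te_y <- factor(rbinom(200, 1, plogis(2 * te_x[, 1])))
err <- sapply(c(1, 5, 10, 25, 50, 100), function(k) {
  c(K     = k,
    train = mean(knn(tr_x, tr_x, tr_y, k = k) != tr_y),
    test  = mean(knn(tr_x, te_x, tr_y, k = k) != te_y))
})
round(t(err), 3)   # training error tends to rise with K; test error is typically U-shaped in 1/K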

KNN: K=10

FIGURE 2.15. The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.

KNN: K=1    KNN: K=100

FIGURE 2.16. A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.

FIGURE 2.17. The KNN training error rate (blue, 200 observations) and test error rate (orange, 5,000 observations) on the data from Figure 2.13, as the level of flexibility (assessed using 1/K) increases, or equivalently as the number of neighbors K decreases. The black dashed line indicates the Bayes error rate. The jumpiness of the curves is due to the small size of the training data set.

In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task. In Chapter 5, we return to this topic and discuss various methods for estimating test error rates and thereby choosing the optimal level of flexibility for a given statistical learning method.

2.3 Lab: Introduction to R

In this lab, we will introduce some simple R commands. The best way to learn a new language is to try out the commands. R can be downloaded from the CRAN website (https://cran.r-project.org/).

Basic Commands

R uses functions to perform operations. To run a function called funcname, we type funcname(input1, input2), where the inputs, or arguments, input1 and input2 tell R how to run the function.

A function can have any number of inputs. For example, to create a vector of numbers, we use the function c() (for concatenate). Any numbers inside the parentheses are joined together. The following command instructs R to join together the numbers 1, 3, 2, and 5, and to save them as a vector named x. When we type x, it gives us back the vector.

> x <- c(1,3,2,5)
> x
[1] 1 3 2 5

Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:

> x = c(1,6,2)
> x
[1] 1 6 2
> y = c(1,4,3)

Hitting the up arrow multiple times will display the previous commands, which can then be edited. This is useful since one often wishes to repeat a similar command. In addition, typing ?funcname will always cause R to open a new help file window with additional information about the function funcname.

We can tell R to add two sets of numbers together. It will then add the first number from x to the first number from y, and so on. However, x and y should be the same length. We can check their length using the length() function.

> length(x)
[1] 3
> length(y)
[1] 3
> x+y
[1]  2 10  5

The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far. The rm() function can be used to delete any that we don't want.

> ls()
[1] "x" "y"
> rm(x,y)
> ls()
character(0)

It's also possible to remove all objects at once:

> rm(list=ls())

The matrix() function can be used to create a matrix of numbers. Before we use the matrix() function, we can learn more about it:

> ?matrix

The help file reveals that the matrix() function takes a number of inputs, but for now we focus on the first three: the data (the entries in the matrix), the number of rows, and the number of columns. First, we create a simple matrix.

> x=matrix(data=c(1,2,3,4), nrow=2, ncol=2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Note that we could just as well omit typing data=, nrow=, and ncol= in the matrix() command above: that is, we could just type

> x=matrix(c(1,2,3,4),2,2)

and this would have the same effect. However, it can sometimes be useful to specify the names of the arguments passed in, since otherwise R will assume that the function arguments are passed into the function in the same order that is given in the function's help file. As this example illustrates, by default R creates matrices by successively filling in columns. Alternatively, the byrow=TRUE option can be used to populate the matrix in order of the rows.

> matrix(c(1,2,3,4),2,2,byrow=TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

Notice that in the above command we did not assign the matrix to a value such as x. In this case the matrix is printed to the screen but is not saved for future calculations. The sqrt() function returns the square root of each element of a vector or matrix. The command x^2 raises each element of x to the power 2; any powers are possible, including fractional or negative powers.

> sqrt(x)
         [,1]     [,2]
[1,] 1.000000 1.732051
[2,] 1.414214 2.000000
> x^2
     [,1] [,2]
[1,]    1    9
[2,]    4   16

The rnorm() function generates a vector of random normal variables, with first argument n the sample size. Each time we call this function, we will get a different answer. Here we create two correlated sets of numbers, x and y, and use the cor() function to compute the correlation between them.

> x=rnorm(50)
> y=x+rnorm(50,mean=50,sd=.1)
> cor(x,y)

By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments, as illustrated above. Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. The set.seed() function takes an (arbitrary) integer argument.

> set.seed(1303)
> rnorm(50)

We use set.seed() throughout the labs whenever we perform calculations involving random quantities. In general this should allow the user to reproduce our results. However, it should be noted that as new versions of R become available, it is possible that some small discrepancies may form between the book and the output from R.

The mean() and var() functions can be used to compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation. Or we can simply use the sd() function.

> set.seed(3)
> y=rnorm(100)
> mean(y)
> var(y)
> sqrt(var(y))
> sd(y)

Graphics

The plot() function is the primary way to plot data in R. For instance, plot(x,y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis. To find out more information about the plot() function, type ?plot.

> x=rnorm(100)
> y=rnorm(100)
> plot(x,y)
> plot(x,y,xlab="this is the x-axis",ylab="this is the y-axis",
    main="Plot of X vs Y")

We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a pdf, we use the pdf() function, and to create a jpeg, we use the jpeg() function.

> pdf("figure.pdf")
> plot(x,y,col="green")
> dev.off()
null device
          1

The function dev.off() indicates to R that we are done creating the plot. Alternatively, we can simply copy the plot window and paste it into an appropriate file type, such as a Word document.

The function seq() can be used to create a sequence of numbers. For instance, seq(a,b) makes a vector of integers between a and b. There are many other options: for instance, seq(0,1,length=10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a shorthand for seq(3,11) for integer arguments.

> x=seq(1,10)
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x=1:10
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x=seq(-pi,pi,length=50)

We will now create some more sophisticated plots. The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:

1. A vector of the x values (the first dimension),
2. A vector of the y values (the second dimension), and
3. A matrix whose elements correspond to the z value (the third dimension) for each pair of (x,y) coordinates.

As with the plot() function, there are many other inputs that can be used to fine-tune the output of the contour() function. To learn more about these, take a look at the help file by typing ?contour.

> y=x
> f=outer(x,y,function(x,y)cos(y)/(1+x^2))
> contour(x,y,f)
> contour(x,y,f,nlevels=45,add=T)
> fa=(f-t(f))/2
> contour(x,y,fa,nlevels=15)

The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts.

Alternatively, persp() can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.

> image(x,y,fa)
> persp(x,y,fa)
> persp(x,y,fa,theta=30)
> persp(x,y,fa,theta=30,phi=20)
> persp(x,y,fa,theta=30,phi=70)
> persp(x,y,fa,theta=30,phi=40)

Indexing Data

We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A.

> A=matrix(1:16,4,4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Then, typing

> A[2,3]
[1] 10

will select the element corresponding to the second row and the third column. The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time, by providing vectors as the indices.

> A[c(1,3),c(2,4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
> A[1:3,2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15
> A[1:2,]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
> A[,1:2]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

The last two examples include either no index for the columns or no index for the rows. These indicate that R should include all columns or all rows, respectively. R treats a single row or column of a matrix as a vector.

> A[1,]
[1]  1  5  9 13

The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.

> A[-c(1,3),]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16
> A[-c(1,3),-c(1,3,4)]
[1] 6 8

The dim() function outputs the number of rows followed by the number of columns of a given matrix.

> dim(A)
[1] 4 4

Loading Data

For most analyses, the first step involves importing a data set into R. The read.table() function is one of the primary ways to do this. The help file contains details about how to use this function. We can use the function write.table() to export data.

Before attempting to load a data set, we must make sure that R knows to search for the data in the proper directory. For example, on a Windows system one could select the directory using the Change dir... option under the File menu. However, the details of how to do this depend on the operating system (e.g. Windows, Mac, Unix) that is being used, and so we do not give further details here.

We begin by loading in the Auto data set. This data is part of the ISLR library (we discuss libraries in Chapter 3), but to illustrate the read.table() function we load it now from a text file. The following command will load the Auto.data file into R and store it as an object called Auto, in a format referred to as a data frame. (The text file can be obtained from this book's website.) Once the data has been loaded, the fix() function can be used to view it in a spreadsheet-like window. However, the window must be closed before further R commands can be entered.

> Auto=read.table("Auto.data")
> fix(Auto)

Note that Auto.data is simply a text file, which you could alternatively open on your computer using a standard text editor. It is often a good idea to view a data set using a text editor or other software such as Excel before loading it into R.

This particular data set has not been loaded correctly, because R has assumed that the variable names are part of the data and so has included them in the first row. The data set also includes a number of missing observations, indicated by a question mark ?. Missing values are a common occurrence in real data sets. Using the option header=T (or header=TRUE) in the read.table() function tells R that the first line of the file contains the variable names, and using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix.

> Auto=read.table("Auto.data",header=T,na.strings="?")
> fix(Auto)

Excel is a common-format data storage program. An easy way to load such data into R is to save it as a csv (comma separated value) file and then use the read.csv() function to load it in.

> Auto=read.csv("Auto.csv",header=T,na.strings="?")
> fix(Auto)
> dim(Auto)
[1] 397   9
> Auto[1:4,]

The dim() function tells us that the data has 397 observations, or rows, and nine variables, or columns. There are various ways to deal with the missing data. In this case, only five of the rows contain missing observations, and so we choose to use the na.omit() function to simply remove these rows.

> Auto=na.omit(Auto)
> dim(Auto)
[1] 392   9

Once the data are loaded correctly, we can use names() to check the variable names.

> names(Auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"
[5] "weight"       "acceleration" "year"         "origin"
[9] "name"

Additional Graphical and Numerical Summaries

We can use the plot() function to produce scatterplots of the quantitative variables. However, simply typing the variable names will produce an error message, because R does not know to look in the Auto data set for those variables.

> plot(cylinders, mpg)
Error in plot(cylinders, mpg) : object cylinders not found

To refer to a variable, we must type the data set and the variable name joined with a $ symbol. Alternatively, we can use the attach() function in order to tell R to make the variables in this data frame available by name.

> plot(Auto$cylinders, Auto$mpg)
> attach(Auto)
> plot(cylinders, mpg)

The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, since there are only a small number of possible values for cylinders, one may prefer to treat it as a qualitative variable. The as.factor() function converts quantitative variables into qualitative variables.

> cylinders=as.factor(cylinders)

If the variable plotted on the x-axis is categorical, then boxplots will automatically be produced by the plot() function. As usual, a number of options can be specified in order to customize the plots.

> plot(cylinders, mpg)
> plot(cylinders, mpg, col="red")
> plot(cylinders, mpg, col="red", varwidth=T)
> plot(cylinders, mpg, col="red", varwidth=T, horizontal=T)
> plot(cylinders, mpg, col="red", varwidth=T, xlab="cylinders", ylab="mpg")

The hist() function can be used to plot a histogram. Note that col=2 has the same effect as col="red".

> hist(mpg)
> hist(mpg,col=2)
> hist(mpg,col=2,breaks=15)

The pairs() function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables, for any given data set. We can also produce scatterplots for just a subset of the variables.

> pairs(Auto)
> pairs(~ mpg + displacement + horsepower + weight + acceleration, Auto)

In conjunction with the plot() function, identify() provides a useful interactive method for identifying the value for a particular variable for points on a plot. We pass in three arguments to identify(): the x-axis variable, the y-axis variable, and the variable whose values we would like to see printed for each point. Then clicking on a given point in the plot will cause R to print the value of the variable of interest. Right-clicking on the plot will exit the identify() function (control-click on a Mac). The numbers printed under the identify() function correspond to the rows for the selected points.

> plot(horsepower, mpg)
> identify(horsepower, mpg, name)

The summary() function produces a numerical summary of each variable in a particular data set.

> summary(Auto)
      mpg          cylinders      displacement
 Min.   : 9.00   Min.   :3.000   Min.   :
 1st Qu.:        1st Qu.:        1st Qu.:105.0
 Median :22.75   Median :4.000   Median :151.0
 Mean   :23.45   Mean   :5.472   Mean   :
 3rd Qu.:        3rd Qu.:        3rd Qu.:275.8
 Max.   :46.60   Max.   :8.000   Max.   :455.0
   horsepower        weight      acceleration
 Min.   : 46.0   Min.   :1613   Min.   :
 1st Qu.:        1st Qu.:2225   1st Qu.:13.78
 Median : 93.5   Median :2804   Median :15.50
 Mean   :104.5   Mean   :2978   Mean   :
 3rd Qu.:        3rd Qu.:3615   3rd Qu.:17.02
 Max.   :230.0   Max.   :5140   Max.   :24.80
      year           origin                      name
 Min.   :70.00   Min.   :1.000   amc matador       :  5
 1st Qu.:        1st Qu.:1.000   ford pinto        :  5
 Median :76.00   Median :1.000   toyota corolla    :  5
 Mean   :75.98   Mean   :1.577   amc gremlin       :  4
 3rd Qu.:        3rd Qu.:2.000   amc hornet        :  4
 Max.   :82.00   Max.   :3.000   chevrolet chevette:  4
                                 (Other)           :365

For qualitative variables such as name, R will list the number of observations that fall in each category. We can also produce a summary of just a single variable.

> summary(mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

Once we have finished using R, we type q() in order to shut it down, or quit. When exiting R, we have the option to save the current workspace so that all objects (such as data sets) that we have created in this R session will be available next time. Before exiting R, we may want to save a record of all of the commands that we typed in the most recent session; this can be accomplished using the savehistory() function. Next time we enter R, we can load that history using the loadhistory() function.

2.4 Exercises

Conceptual

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

   (a) The sample size n is extremely large, and the number of predictors p is small.

   (b) The number of predictors p is extremely large, and the number of observations n is small.

   (c) The relationship between the predictors and response is highly non-linear.

   (d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

   (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

   (b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

   (c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for an entire year. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

3. We now revisit the bias-variance decomposition.

   (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

68 2.4 Exercises 53 the amunt f flexibility in the methd, and the y-axis shuld represent the values fr each curve. There shuld be five curves. Make sure t label each ne. (b) Explain why each f the five curves has the shape displayed in part (a). 4. Yu will nw think f sme real-life applicatins fr statistical learning. (a) Describe three real-life applicatins in which classificatin might be useful. Describe the respnse, as well as the predictrs. Is the gal f each applicatin inference r predictin? Explain yur answer. (b) Describe three real-life applicatins in which regressin might be useful. Describe the respnse, as well as the predictrs. Is the gal f each applicatin inference r predictin? Explain yur answer. (c) Describe three real-life applicatins in which cluster analysis might be useful. 5. What are the advantages and disadvantages f a very flexible (versus a less flexible) apprach fr regressin r classificatin? Under what circumstances might a mre flexible apprach be preferred t a less flexible apprach? When might a less flexible apprach be preferred? 6. Describe the differences between a parametric and a nn-parametric statistical learning apprach. What are the advantages f a parametric apprach t regressin r classificatin (as ppsed t a nnparametric apprach)? What are its disadvantages? 7. The table belw prvides a training data set cntaining six bservatins, three predictrs, and ne qualitative respnse variable. Obs. X 1 X 2 X 3 Y Red Red Red Green Green Red Suppse we wish t use this data set t make a predictin fr Y when X 1 = X 2 = X 3 = 0 using K-nearest neighbrs. (a) Cmpute the Euclidean distance between each bservatin and the test pint, X 1 = X 2 = X 3 =0.

69 54 2. Statistical Learning (b) What is ur predictin with K =1?Why? (c) What is ur predictin with K =3?Why? (d) If the Bayes decisin bundary in this prblem is highly nnlinear, then wuld we expect the best value fr K t be large r small? Why? Applied 8. This exercise relates t the Cllege data set, which can be fund in the file Cllege.csv. It cntains a number f variables fr 777 different universities and clleges in the US. The variables are Private : Public/private indicatr Apps : Number f applicatins received Accept : Number f applicants accepted Enrll : Number f new students enrlled Tp10perc : New students frm tp 10 % f high schl class Tp25perc : New students frm tp 25 % f high schl class F.Undergrad : Number f full-time undergraduates P.Undergrad : Number f part-time undergraduates Outstate : Out-f-state tuitin Rm.Bard : Rm and bard csts Bks : Estimated bk csts Persnal : Estimated persnal spending PhD : Percent f faculty with Ph.D. s Terminal : Percent f faculty with terminal degree S.F.Rati : Student/faculty rati perc.alumni : Percent f alumni wh dnate Expend : Instructinal expenditure per student Grad.Rate : Graduatin rate Befre reading the data int R, it can be viewed in Excel r a text editr. (a) Use the read.csv() functin t read the data int R. Callthe laded data cllege. Make sure that yu have the directry set t the crrect lcatin fr the data. (b) Lk at the data using the fix() functin. Yu shuld ntice that the first clumn is just the name f each university. We dn t really want R t treat this as data. Hwever, it may be handy t have these names fr later. Try the fllwing cmmands:

70 2.4 Exercises 55 (c) > rwnames(cllege)=cllege[,1] > fix(cllege) Yu shuld see that there is nw a rw.names clumn with the name f each university recrded. This means that R has given each rw a name crrespnding t the apprpriate university. R will nt try t perfrm calculatins n the rw names. Hwever, we still need t eliminate the first clumn in the data where the names are stred. Try > cllege=cllege[,-1] > fix(cllege) Nw yu shuld see that the first data clumn is Private. Nte that anther clumn labeled rw.names nw appears befre the Private clumn. Hwever, this is nt a data clumn but rather thenamethatr is giving t each rw. i. Use the summary() functin t prduce a numerical summary f the variables in the data set. ii. Use the pairs() functin t prduce a scatterplt matrix f the first ten clumns r variables f the data. Recall that yu can reference the first ten clumns f a matrix A using A[,1:10]. iii. Use the plt() functin t prduce side-by-side bxplts f Outstate versus Private. iv. Create a new qualitative variable, called Elite, bybinning the Tp10perc variable. We are ging t divide universities int tw grups based n whether r nt the prprtin f students cming frm the tp 10 % f their high schl classes exceeds 50 %. > Elite=rep("N",nrw(cllege)) > Elite[cllege$Tp10perc >50]="Yes" > Elite=as.factr(Elite) > cllege=data.frame(cllege,elite) Use the summary() functin t see hw many elite universities there are. Nw use the plt() functin t prduce side-by-side bxplts f Outstate versus Elite. v. Use the hist() functin t prduce sme histgrams with differing numbers f bins fr a few f the quantitative variables. Yu may find the cmmand par(mfrw=c(2,2)) useful: it will divide the print windw int fur regins s that fur plts can be made simultaneusly. Mdifying the arguments t this functin will divide the screen in ther ways. vi. Cntinue explring the data, and prvide a brief summary f what yu discver.

71 56 2. Statistical Learning 9. This exercise invlves the Aut data set studied in the lab. Make sure that the missing values have been remved frm the data. (a) Which f the predictrs are quantitative, and which are qualitative? (b) What is the range f each quantitative predictr? Yu can answer this using the range() functin. (c) What is the mean and standard deviatin f each quantitative predictr? (d) Nw remve the 10th thrugh 85th bservatins. What is the range, mean, and standard deviatin f each predictr in the subset f the data that remains? (e) Using the full data set, investigate the predictrs graphically, using scatterplts r ther tls f yur chice. Create sme plts highlighting the relatinships amng the predictrs. Cmment n yur findings. (f) Suppse that we wish t predict gas mileage (mpg) n the basis f the ther variables. D yur plts suggest that any f the ther variables might be useful in predicting mpg? Justify yur answer. 10. This exercise invlves the Bstn husing data set. (a) T begin, lad in the Bstn data set. The Bstn data set is part f the MASS library in R. > library(mass) Nw the data set is cntained in the bject Bstn. > Bstn Read abut the data set: >?Bstn Hw many rws are in this data set? Hw many clumns? What d the rws and clumns represent? (b) Make sme pairwise scatterplts f the predictrs (clumns) in this data set. Describe yur findings. (c) Are any f the predictrs assciated with per capita crime rate? If s, explain the relatinship. (d) D any f the suburbs f Bstn appear t have particularly high crime rates? Tax rates? Pupil-teacher ratis? Cmment n the range f each predictr. (e) Hw many f the suburbs in this data set bund the Charles river? range()

72 2.4 Exercises 57 (f) What is the median pupil-teacher rati amng the twns in this data set? (g) Which suburb f Bstn has lwest median value f wnerccupied hmes? What are the values f the ther predictrs fr that suburb, and hw d thse values cmpare t the verall ranges fr thse predictrs? Cmment n yur findings. (h) In this data set, hw many f the suburbs average mre than seven rms per dwelling? Mre than eight rms per dwelling? Cmment n the suburbs that average mre than eight rms per dwelling.
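For the Applied exercises above, the following commands illustrate one possible starting point; they are a hedged sketch, not a full solution. They assume that the Auto data frame has already been loaded with its missing values removed, and that its first seven columns (mpg through year) are the quantitative predictors; both assumptions should be checked against your own copy of the data.

> sapply(Auto[, 1:7], range)             # range of each quantitative predictor
> sapply(Auto[, 1:7], mean)              # means
> sapply(Auto[, 1:7], sd)                # standard deviations
> sapply(Auto[-(10:85), 1:7], range)     # after removing the 10th through 85th observations
> pairs(Auto[, 1:7])                     # scatterplot matrix of the quantitative variables
> library(MASS)                          # the Boston data set lives in the MASS package
> dim(Boston)                            # number of rows and columns
> sum(Boston$chas == 1)                  # suburbs that bound the Charles river
> median(Boston$ptratio)                 # median pupil-teacher ratio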


3 Linear Regression

This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression. Consequently, the importance of having a good understanding of linear regression before studying more complex learning methods cannot be overstated.

In this chapter, we review some of the key ideas underlying the linear regression model, as well as the least squares approach that is most commonly used to fit this model.

Recall the Advertising data from Chapter 2. Figure 2.1 displays sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media. Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation? Here are a few important questions that we might seek to address:

1. Is there a relationship between advertising budget and sales? Our first goal should be to determine whether the data provide

75 60 3. Linear Regressin evidence f an assciatin between advertising expenditure and sales. If the evidence is weak, then ne might argue that n mney shuld be spent n advertising! 2. Hw strng is the relatinship between advertising budget and sales? Assuming that there is a relatinship between advertising and sales, we wuld like t knw the strength f this relatinship. In ther wrds, given a certain advertising budget, can we predict sales with a high level f accuracy? This wuld be a strng relatinship. Or is a predictin f sales based n advertising expenditure nly slightly better than a randm guess? This wuld be a weak relatinship. 3. Which media cntribute t sales? D all three media TV, radi, and newspaper cntribute t sales, r d just ne r tw f the media cntribute? T answer this questin, we must find a way t separate ut the individual effects f each medium when we have spent mney n all three media. 4. Hw accurately can we estimate the effect f each medium n sales? Fr every dllar spent n advertising in a particular medium, by what amunt will sales increase? Hw accurately can we predict this amunt f increase? 5. Hw accurately can we predict future sales? Fr any given level f televisin, radi, r newspaper advertising, what is ur predictin fr sales, and what is the accuracy f this predictin? 6. Is the relatinship linear? If there is apprximately a straight-line relatinship between advertising expenditure in the varius media and sales, then linear regressin is an apprpriate tl. If nt, then it may still be pssible t transfrm the predictr r the respnse s that linear regressin can be used. 7. Is there synergy amng the advertising media? Perhaps spending $50,000 n televisin advertising and $50,000 n radi advertising results in mre sales than allcating $100,000 t either televisin r radi individually. In marketing, this is knwn as a synergy effect, while in statistics it is called an interactin effect. It turns ut that linear regressin can be used t answer each f these questins. We will first discuss all f these questins in a general cntext, and then return t them in this specific cntext in Sectin 3.4. synergy interactin

76 3.1 Simple Linear Regressin 3.1 Simple Linear Regressin 61 Simple linear regressin lives up t its name: it is a very straightfrward simple linear apprach fr predicting a quantitative respnse Y n the basis f a single predictr variable X. It assumes that there is apprximately a linear regressin relatinship between X and Y. Mathematically, we can write this linear relatinship as Y β 0 + β 1 X. (3.1) Yu might read as is apprximately mdeled as. We will smetimes describe (3.1) by saying that we are regressing Y n X (r Y nt X). Fr example, X may represent TV advertising and Y may represent sales. Then we can regress sales nt TV by fitting the mdel sales β 0 + β 1 TV. In Equatin 3.1, β 0 and β 1 are tw unknwn cnstants that represent the intercept and slpe terms in the linear mdel. Tgether, β 0 and β 1 are intercept knwn as the mdel cefficients r parameters. Oncewehaveusedur slpe training data t prduce estimates ˆβ 0 and ˆβ 1 fr the mdel cefficients, we can predict future sales n the basis f a particular value f TV advertising by cmputing ŷ = ˆβ 0 + ˆβ 1 x, (3.2) where ŷ indicates a predictin f Y n the basis f X = x. Hereweusea hat symbl, ˆ, t dente the estimated value fr an unknwn parameter r cefficient, r t dente the predicted value f the respnse. cefficient parameter Estimating the Cefficients In practice, β 0 and β 1 are unknwn. S befre we can use (3.1) t make predictins, we must use data t estimate the cefficients. Let (x 1,y 1 ), (x 2,y 2 ),..., (x n,y n ) represent n bservatin pairs, each f which cnsists f a measurement f X and a measurement f Y.Inthe Advertising example, this data set cnsists f the TV advertising budget and prduct sales in n = 200 different markets. (Recall that the data are displayed in Figure 2.1.) Our gal is t btain cefficient estimates ˆβ 0 and ˆβ 1 such that the linear mdel (3.1) fits the available data well that is, s that y i ˆβ 0 + ˆβ 1 x i fr i = 1,...,n. In ther wrds, we want t find an intercept ˆβ 0 and a slpe ˆβ 1 such that the resulting line is as clse as pssible t the n = 200 data pints. There are a number f ways f measuring clseness. Hwever, by far the mst cmmn apprach invlves minimizing the least squares criterin, and we take that apprach in this chapter. Alternative appraches will be cnsidered in Chapter 6. least squares
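In R, the least squares criterion just described is exactly what the lm() function minimizes. As a brief sketch, assuming the Advertising data have been read into a data frame called Advertising with columns named TV and sales (the file name below is illustrative):

> Advertising = read.csv("Advertising.csv")     # assumes the file is in the working directory
> lm.fit = lm(sales ~ TV, data = Advertising)   # least squares fit of sales onto TV
> coef(lm.fit)                                  # the estimated intercept and slope
> plot(Advertising$TV, Advertising$sales)
> abline(lm.fit, col = "red")                   # add the fitted least squares line to the scatterplot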

FIGURE 3.1. For the Advertising data, the least squares fit for the regression of sales onto TV is shown (sales on the y-axis, TV on the x-axis). The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

Let ŷ_i = β̂_0 + β̂_1 x_i be the prediction for Y based on the ith value of X. Then e_i = y_i − ŷ_i represents the ith residual; this is the difference between the ith observed response value and the ith response value that is predicted by our linear model. We define the residual sum of squares (RSS) as

RSS = e_1² + e_2² + · · · + e_n²,

or equivalently as

RSS = (y_1 − β̂_0 − β̂_1 x_1)² + (y_2 − β̂_0 − β̂_1 x_2)² + · · · + (y_n − β̂_0 − β̂_1 x_n)².    (3.3)

The least squares approach chooses β̂_0 and β̂_1 to minimize the RSS. Using some calculus, one can show that the minimizers are

β̂_1 = [ Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) ] / [ Σ_{i=1}^n (x_i − x̄)² ],    β̂_0 = ȳ − β̂_1 x̄,    (3.4)

where ȳ ≡ (1/n) Σ_{i=1}^n y_i and x̄ ≡ (1/n) Σ_{i=1}^n x_i are the sample means. In other words, (3.4) defines the least squares coefficient estimates for simple linear regression.

Figure 3.1 displays the simple linear regression fit to the Advertising data, where β̂_0 = 7.03 and β̂_1 = 0.0475. In other words, according to

78 3.1 Simple Linear Regressin β RSS β β 0 β 0 FIGURE 3.2. Cntur and three-dimensinal plts f the RSS n the Advertising data, using sales as the respnse and TV as the predictr. The red dts crrespnd t the least squares estimates ˆβ 0 and ˆβ 1, given by (3.4). this apprximatin, an additinal $1,000 spent n TV advertising is assciated with selling apprximately 47.5 additinal units f the prduct. In Figure 3.2, we have cmputed RSS fr a number f values f β 0 and β 1, using the advertising data with sales as the respnse and TV as the predictr. In each plt, the red dt represents the pair f least squares estimates ( ˆβ 0, ˆβ 1 ) given by (3.4). These values clearly minimize the RSS Assessing the Accuracy f the Cefficient Estimates Recall frm (2.1) that we assume that the true relatinship between X and Y takes the frm Y = f(x) +ɛ fr sme unknwn functin f, whereɛ is a mean-zer randm errr term. If f is t be apprximated by a linear functin, then we can write this relatinship as Y = β 0 + β 1 X + ɛ. (3.5) Here β 0 is the intercept term that is, the expected value f Y when X =0, and β 1 is the slpe the average increase in Y assciated with a ne-unit increase in X. The errr term is a catch-all fr what we miss with this simple mdel: the true relatinship is prbably nt linear, there may be ther variables that cause variatin in Y, and there may be measurement errr. We typically assume that the errr term is independent f X. The mdel given by (3.5) defines the ppulatin regressin line, which ppulatin is the best linear apprximatin t the true relatinship between X and Y. 1 The least squares regressin cefficient estimates (3.4) characterize the regressin line least squares line (3.2). The left-hand panel f Figure 3.3 displays these least squares line 1 The assumptin f linearity is ften a useful wrking mdel. Hwever, despite what many textbks might tell us, we seldm believe that the true relatinship is linear.

79 64 3. Linear Regressin Y Y X X FIGURE 3.3. A simulated data set. Left: The red line represents the true relatinship, f(x) =2+3X, which is knwn as the ppulatin regressin line. The blue line is the least squares line; it is the least squares estimate fr f(x) based n the bserved data, shwn in black. Right: The ppulatin regressin line is again shwn in red, and the least squares line in dark blue. In light blue, ten least squares lines are shwn, each cmputed n the basis f a separate randm set f bservatins. Each least squares line is different, but n average, the least squares lines are quite clse t the ppulatin regressin line. tw lines in a simple simulated example. We created 100 randm Xs, and generated 100 crrespnding Y s frm the mdel Y =2+3X + ɛ, (3.6) where ɛ was generated frm a nrmal distributin with mean zer. The red line in the left-hand panel f Figure 3.3 displays the true relatinship, f(x) = 2+3X, while the blue line is the least squares estimate based n the bserved data. The true relatinship is generally nt knwn fr real data, but the least squares line can always be cmputed using the cefficient estimates given in (3.4). In ther wrds, in real applicatins, we have access t a set f bservatins frm which we can cmpute the least squares line; hwever, the ppulatin regressin line is unbserved. In the right-hand panel f Figure 3.3 we have generated ten different data sets frm the mdel given by (3.6) and pltted the crrespnding ten least squares lines. Ntice that different data sets generated frm the same true mdel result in slightly different least squares lines, but the unbserved ppulatin regressin line des nt change. At first glance, the difference between the ppulatin regressin line and the least squares line may seem subtle and cnfusing. We nly have ne data set, and s what des it mean that tw different lines describe the relatinship between the predictr and the respnse? Fundamentally, the

concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population. For example, suppose that we are interested in knowing the population mean μ of some random variable Y. Unfortunately, μ is unknown, but we do have access to n observations from Y, which we can write as y_1, ..., y_n, and which we can use to estimate μ. A reasonable estimate is μ̂ = ȳ, where ȳ = (1/n) Σ_{i=1}^n y_i is the sample mean. The sample mean and the population mean are different, but in general the sample mean will provide a good estimate of the population mean. In the same way, the unknown coefficients β_0 and β_1 in linear regression define the population regression line. We seek to estimate these unknown coefficients using β̂_0 and β̂_1 given in (3.4). These coefficient estimates define the least squares line.

The analogy between linear regression and estimation of the mean of a random variable is an apt one based on the concept of bias. If we use the sample mean μ̂ to estimate μ, this estimate is unbiased, in the sense that on average, we expect μ̂ to equal μ. What exactly does this mean? It means that on the basis of one particular set of observations y_1, ..., y_n, μ̂ might overestimate μ, and on the basis of another set of observations, μ̂ might underestimate μ. But if we could average a huge number of estimates of μ obtained from a huge number of sets of observations, then this average would exactly equal μ. Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter. The property of unbiasedness holds for the least squares coefficient estimates given by (3.4) as well: if we estimate β_0 and β_1 on the basis of a particular data set, then our estimates won't be exactly equal to β_0 and β_1. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on! In fact, we can see from the right-hand panel of Figure 3.3 that the average of many least squares lines, each estimated from a separate data set, is pretty close to the true population regression line.

We continue the analogy with the estimation of the population mean μ of a random variable Y. A natural question is as follows: how accurate is the sample mean μ̂ as an estimate of μ? We have established that the average of μ̂'s over many data sets will be very close to μ, but that a single estimate μ̂ may be a substantial underestimate or overestimate of μ. How far off will that single estimate of μ̂ be? In general, we answer this question by computing the standard error of μ̂, written as SE(μ̂). We have the well-known formula

Var(μ̂) = SE(μ̂)² = σ²/n,    (3.7)

where σ is the standard deviation of each of the realizations y_i of Y.²
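The unbiasedness just described can also be checked empirically. The sketch below generates many data sets from a model of the form Y = 2 + 3X + ε, as in Figure 3.3 (the sample size, the number of repetitions, and the error standard deviation are arbitrary choices), fits a least squares line to each, and averages the slope estimates.

> set.seed(1)
> slope.hat = rep(0, 1000)
> for (r in 1:1000) {
+   x = rnorm(100)                       # 100 random X values
+   y = 2 + 3 * x + rnorm(100)           # Y = 2 + 3X + epsilon
+   slope.hat[r] = coef(lm(y ~ x))[2]    # least squares slope for this data set
+ }
> mean(slope.hat)   # very close to 3: no systematic over- or under-estimation
> sd(slope.hat)     # variability of the slope estimate across data sets

The standard deviation reported in the last line is an empirical version of the standard error discussed next.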

Roughly speaking, the standard error tells us the average amount that this estimate μ̂ differs from the actual value of μ. Equation 3.7 also tells us how this deviation shrinks with n: the more observations we have, the smaller the standard error of μ̂. In a similar vein, we can wonder how close β̂_0 and β̂_1 are to the true values β_0 and β_1. To compute the standard errors associated with β̂_0 and β̂_1, we use the following formulas:

SE(β̂_0)² = σ² [ 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² ],    SE(β̂_1)² = σ² / Σ_{i=1}^n (x_i − x̄)²,    (3.8)

where σ² = Var(ε). For these formulas to be strictly valid, we need to assume that the errors ε_i for each observation are uncorrelated with common variance σ². This is clearly not true in Figure 3.1, but the formula still turns out to be a good approximation. Notice in the formula that SE(β̂_1) is smaller when the x_i are more spread out; intuitively we have more leverage to estimate a slope when this is the case. We also see that SE(β̂_0) would be the same as SE(μ̂) if x̄ were zero (in which case β̂_0 would be equal to ȳ). In general, σ² is not known, but can be estimated from the data. This estimate is known as the residual standard error, and is given by the formula RSE = √(RSS/(n − 2)). Strictly speaking, when σ² is estimated from the data we should write ŜE(β̂_1) to indicate that an estimate has been made, but for simplicity of notation we will drop this extra hat.

Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data. For linear regression, the 95% confidence interval for β_1 approximately takes the form

β̂_1 ± 2 · SE(β̂_1).    (3.9)

That is, there is approximately a 95% chance that the interval

[ β̂_1 − 2 · SE(β̂_1),  β̂_1 + 2 · SE(β̂_1) ]    (3.10)

will contain the true value of β_1.³ Similarly, a confidence interval for β_0 approximately takes the form

β̂_0 ± 2 · SE(β̂_0).    (3.11)

² This formula holds provided that the n observations are uncorrelated.
³ Approximately for several reasons. Equation 3.10 relies on the assumption that the errors are Gaussian. Also, the factor of 2 in front of the SE(β̂_1) term will vary slightly depending on the number of observations n in the linear regression. To be precise, rather than the number 2, (3.10) should contain the 97.5% quantile of a t-distribution with n − 2 degrees of freedom. Details of how to compute the 95% confidence interval precisely in R will be provided later in this chapter.
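In R, the standard errors in (3.8), the confidence intervals, and the t-statistics and p-values discussed in what follows are all available from a fitted model. A sketch, reusing the illustrative fit lm.fit of sales onto TV from the earlier sketch:

> summary(lm.fit)$coefficients      # estimates, standard errors, t-statistics, p-values
> confint(lm.fit, level = 0.95)     # confidence intervals for the intercept and slope
> est = coef(lm.fit)[2]
> se  = summary(lm.fit)$coefficients[2, 2]
> c(est - 2 * se, est + 2 * se)     # the approximate interval (3.10)
> t.stat = est / se                 # how many standard errors the slope is away from zero
> 2 * pt(abs(t.stat), df = nrow(Advertising) - 2, lower.tail = FALSE)   # two-sided p-value

Note that confint() uses the exact 97.5% quantile of the t-distribution rather than the factor of 2, in line with the footnote above.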

82 3.1 Simple Linear Regressin 67 In the case f the advertising data, the 95 % cnfidence interval fr β 0 is [6.130, 7.935] and the 95 % cnfidence interval fr β 1 is [0.042, 0.053]. Therefre, we can cnclude that in the absence f any advertising, sales will, n average, fall smewhere between 6,130 and 7,940 units. Furthermre, fr each $1,000 increase in televisin advertising, there will be an average increase in sales f between 42 and 53 units. Standard errrs can als be used t perfrm hypthesis tests n the hypthesis cefficients. The mst cmmn hypthesis test invlves testing the null test hypthesis f null hypthesis H 0 : There is n relatinship between X and Y (3.12) versus the alternative hypthesis H a : There is sme relatinship between X and Y. (3.13) Mathematically, this crrespnds t testing H 0 : β 1 =0 alternative hypthesis versus H a : β 1 0, since if β 1 = 0 then the mdel (3.5) reduces t Y = β 0 + ɛ, andx is nt assciated with Y. T test the null hypthesis, we need t determine whether ˆβ 1, ur estimate fr β 1, is sufficiently far frm zer that we can be cnfident that β 1 is nn-zer. Hw far is far enugh? This f curse depends n the accuracy f ˆβ 1 that is, it depends n SE( ˆβ 1 ). If SE( ˆβ 1 )is small, then even relatively small values f ˆβ 1 may prvide strng evidence that β 1 0, and hence that there is a relatinship between X and Y.In cntrast, if SE( ˆβ 1 ) is large, then ˆβ 1 must be large in abslute value in rder fr us t reject the null hypthesis. In practice, we cmpute a t-statistic, t-statistic given by t = ˆβ 1 0 SE( ˆβ 1 ), (3.14) which measures the number f standard deviatins that ˆβ 1 is away frm 0. If there really is n relatinship between X and Y, then we expect that (3.14) will have a t-distributin with n 2 degrees f freedm. The t- distributin has a bell shape and fr values f n greater than apprximately 30 it is quite similar t the nrmal distributin. Cnsequently, it is a simple matter t cmpute the prbability f bserving any value equal t t r larger, assuming β 1 = 0. We call this prbability the p-value. Rughly p-value speaking, we interpret the p-value as fllws: a small p-value indicates that it is unlikely t bserve such a substantial assciatin between the predictr and the respnse due t chance, in the absence f any real assciatin between the predictr and the respnse. Hence, if we see a small p-value,

83 68 3. Linear Regressin then we can infer that there is an assciatin between the predictr and the respnse. We reject the null hypthesis that is, we declare a relatinship t exist between X and Y if the p-value is small enugh. Typical p-value cutffs fr rejecting the null hypthesis are 5 r 1 %. When n = 30, these crrespnd t t-statistics (3.14) f arund 2 and 2.75, respectively. Cefficient Std. errr t-statistic p-value Intercept < TV < TABLE 3.1. Fr the Advertising data, cefficients f the least squares mdel fr the regressin f number f units sld n TV advertising budget. An increase f $1,000 in the TV advertising budget is assciated with an increase in sales by arund 50 units (Recall that the sales variable is in thusands f units, and the TV variable is in thusands f dllars). Table 3.1 prvides details f the least squares mdel fr the regressin f number f units sld n TV advertising budget fr the Advertising data. Ntice that the cefficients fr ˆβ 0 and ˆβ 1 are very large relative t their standard errrs, s the t-statistics are als large; the prbabilities f seeing such values if H 0 is true are virtually zer. Hence we can cnclude that β 0 0andβ Assessing the Accuracy f the Mdel Once we have rejected the null hypthesis (3.12) in favr f the alternative hypthesis (3.13), it is natural t want t quantify the extent t which the mdel fits the data. The quality f a linear regressin fit is typically assessed using tw related quantities: the residual standard errr (RSE) and the R 2 R 2 statistic. Table 3.2 displays the RSE, the R 2 statistic, and the F-statistic (t be described in Sectin 3.2.2) fr the linear regressin f number f units sld n TV advertising budget. Residual Standard Errr Recall frm the mdel (3.5) that assciated with each bservatin is an errr term ɛ. Due t the presence f these errr terms, even if we knew the true regressin line (i.e. even if β 0 and β 1 were knwn), we wuld nt be able t perfectly predict Y frm X. The RSE is an estimate f the standard 4 In Table 3.1, a small p-value fr the intercept indicates that we can reject the null hypthesis that β 0 = 0, and a small p-value fr TV indicates that we can reject the null hypthesis that β 1 = 0. Rejecting the latter null hypthesis allws us t cnclude that there is a relatinship between TV and sales. Rejecting the frmer allws us t cnclude that in the absence f TV expenditure, sales are nn-zer.

Quantity                     Value
Residual standard error      3.26
R²                           0.61
F-statistic

TABLE 3.2. For the Advertising data, more information about the least squares model for the regression of number of units sold on TV advertising budget.

deviation of ε. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula

RSE = √( RSS/(n − 2) ) = √( (1/(n − 2)) Σ_{i=1}^n (y_i − ŷ_i)² ).    (3.15)

Note that RSS was defined in Section 3.1.1, and is given by the formula

RSS = Σ_{i=1}^n (y_i − ŷ_i)².    (3.16)

In the case of the advertising data, we see from the linear regression output in Table 3.2 that the RSE is 3.26. In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients β_0 and β_1 were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average. Of course, whether or not 3,260 units is an acceptable prediction error depends on the problem context. In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3,260/14,000 = 23%.

The RSE is considered a measure of the lack of fit of the model (3.5) to the data. If the predictions obtained using the model are very close to the true outcome values, that is, if ŷ_i ≈ y_i for i = 1, ..., n, then (3.15) will be small, and we can conclude that the model fits the data very well. On the other hand, if ŷ_i is very far from y_i for one or more observations, then the RSE may be quite large, indicating that the model doesn't fit the data well.

R² Statistic

The RSE provides an absolute measure of lack of fit of the model (3.5) to the data. But since it is measured in the units of Y, it is not always clear what constitutes a good RSE. The R² statistic provides an alternative measure of fit. It takes the form of a proportion, the proportion of variance explained, and so it always takes on a value between 0 and 1, and is independent of the scale of Y.
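Both quantities in Table 3.2, as well as the R² statistic defined in (3.17) just below, can be computed by hand from a fitted model. A sketch, again using the illustrative fit lm.fit of sales onto TV:

> rss = sum(residuals(lm.fit)^2)              # residual sum of squares (3.16)
> n = nrow(Advertising)
> sqrt(rss / (n - 2))                         # the RSE (3.15)
> summary(lm.fit)$sigma                       # R's value of the RSE
> tss = sum((Advertising$sales - mean(Advertising$sales))^2)
> 1 - rss / tss                               # the R squared statistic (3.17)
> summary(lm.fit)$r.squared                   # R's value of R squared
> cor(Advertising$TV, Advertising$sales)^2    # equals R squared in simple linear regression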

To calculate R², we use the formula

R² = (TSS − RSS)/TSS = 1 − RSS/TSS,    (3.17)

where TSS = Σ (y_i − ȳ)² is the total sum of squares, and RSS is defined in (3.16). TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and R² measures the proportion of variability in Y that can be explained using X. An R² statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error σ² is high, or both. In Table 3.2, the R² was 0.61, and so just under two-thirds of the variability in sales is explained by a linear regression on TV.

The R² statistic (3.17) has an interpretational advantage over the RSE (3.15), since unlike the RSE, it always lies between 0 and 1. However, it can still be challenging to determine what is a good R² value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R² value that is extremely close to 1, and a substantially smaller R² value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model (3.5) is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R² value well below 0.1 might be more realistic!

The R² statistic is a measure of the linear relationship between X and Y. Recall that correlation, defined as

Cor(X, Y) = [ Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) ] / [ √(Σ_{i=1}^n (x_i − x̄)²) · √(Σ_{i=1}^n (y_i − ȳ)²) ],    (3.18)

is also a measure of the linear relationship between X and Y.⁵ This suggests that we might be able to use r = Cor(X, Y) instead of R² in order to assess the fit of the linear model. In fact, it can be shown that in the simple linear regression setting, R² = r². In other words, the squared correlation

⁵ We note that in fact, the right-hand side of (3.18) is the sample correlation; thus, it would be more correct to write Ĉor(X, Y); however, we omit the hat for ease of notation.

86 3.2 Multiple Linear Regressin 71 and the R 2 statistic are identical. Hwever, in the next sectin we will discuss the multiple linear regressin prblem, in which we use several predictrs simultaneusly t predict the respnse. The cncept f crrelatin between the predictrs and the respnse des nt extend autmatically t this setting, since crrelatin quantifies the assciatin between a single pair f variables rather than between a larger number f variables. We will see that R 2 fills this rle. 3.2 Multiple Linear Regressin Simple linear regressin is a useful apprach fr predicting a respnse n the basis f a single predictr variable. Hwever, in practice we ften have mre than ne predictr. Fr example, in the Advertising data,wehaveexamined the relatinship between sales and TV advertising. We als have data fr the amunt f mney spent advertising n the radi and in newspapers, and we may want t knw whether either f these tw media is assciated with sales. Hw can we extend ur analysis f the advertising data in rder t accmmdate these tw additinal predictrs? One ptin is t run three separate simple linear regressins, each f which uses a different advertising medium as a predictr. Fr instance, we can fit a simple linear regressin t predict sales n the basis f the amunt spent n radi advertisements. Results are shwn in Table 3.3 (tp table). We find that a $1,000 increase in spending n radi advertising is assciated with an increase in sales by arund 203 units. Table 3.3 (bttm table) cntains the least squares cefficients fr a simple linear regressin f sales nt newspaper advertising budget. A $1,000 increase in newspaper advertising budget is assciated with an increase in sales by apprximately 55 units. Hwever, the apprach f fitting a separate simple linear regressin mdel fr each predictr is nt entirely satisfactry. First f all, it is unclear hw t make a single predictin f sales given levels f the three advertising media budgets, since each f the budgets is assciated with a separate regressin equatin. Secnd, each f the three regressin equatins ignres the ther tw media in frming estimates fr the regressin cefficients. We will see shrtly that if the media budgets are crrelated with each ther in the 200 markets that cnstitute ur data set, then this can lead t very misleading estimates f the individual media effects n sales. Instead f fitting a separate simple linear regressin mdel fr each predictr, a better apprach is t extend the simple linear regressin mdel (3.5) s that it can directly accmmdate multiple predictrs. We can d this by giving each predictr a separate slpe cefficient in a single mdel. In general, suppse that we have p distinct predictrs. Then the multiple linear regressin mdel takes the frm Y = β 0 + β 1 X 1 + β 2 X β p X p + ɛ, (3.19)

TABLE 3.3. More simple linear regression models for the Advertising data. Coefficients of the simple linear regression model for number of units sold on Top: radio advertising budget and Bottom: newspaper advertising budget. A $1,000 increase in spending on radio advertising is associated with an average increase in sales by around 203 units, while the same increase in spending on newspaper advertising is associated with an average increase in sales by around 55 units (Note that the sales variable is in thousands of units, and the radio and newspaper variables are in thousands of dollars).

where X_j represents the jth predictor and β_j quantifies the association between that variable and the response. We interpret β_j as the average effect on Y of a one unit increase in X_j, holding all other predictors fixed. In the advertising example, (3.19) becomes

sales = β_0 + β_1·TV + β_2·radio + β_3·newspaper + ε.    (3.20)

3.2.1 Estimating the Regression Coefficients

As was the case in the simple linear regression setting, the regression coefficients β_0, β_1, ..., β_p in (3.19) are unknown, and must be estimated. Given estimates β̂_0, β̂_1, ..., β̂_p, we can make predictions using the formula

ŷ = β̂_0 + β̂_1 x_1 + β̂_2 x_2 + · · · + β̂_p x_p.    (3.21)

The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression. We choose β_0, β_1, ..., β_p to minimize the sum of squared residuals

RSS = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − β̂_0 − β̂_1 x_{i1} − β̂_2 x_{i2} − · · · − β̂_p x_{ip})².    (3.22)
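Fitting the multiple regression model (3.20) in R requires only a longer formula. A sketch, again assuming the Advertising data frame with columns named TV, radio, newspaper and sales:

> lm.fit2 = lm(sales ~ TV + radio + newspaper, data = Advertising)
> coef(lm.fit2)       # multiple least squares coefficient estimates, as reported in Table 3.4
> summary(lm.fit2)    # standard errors, t-statistics and p-values for each coefficient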

88 3.2 Multiple Linear Regressin 73 Y X 2 X 1 FIGURE 3.4. In a three-dimensinal setting, with tw predictrs and ne respnse, the least squares regressin line becmes a plane. The plane is chsen t minimize the sum f the squared vertical distances between each bservatin (shwn in red) and the plane. The values ˆβ 0, ˆβ 1,..., ˆβ p that minimize (3.22) are the multiple least squares regressin cefficient estimates. Unlike the simple linear regressin estimates given in (3.4), the multiple regressin cefficient estimates have smewhat cmplicated frms that are mst easily represented using matrix algebra. Fr this reasn, we d nt prvide them here. Any statistical sftware package can be used t cmpute these cefficient estimates, and later in this chapter we will shw hw this can be dne in R. Figure 3.4 illustrates an example f the least squares fit t a ty data set with p =2 predictrs. Table 3.4 displays the multiple regressin cefficient estimates when TV, radi, and newspaper advertising budgets are used t predict prduct sales using the Advertising data. We interpret these results as fllws: fr a given amunt f TV and newspaper advertising, spending an additinal $1,000 n radi advertising leads t an increase in sales by apprximately 189 units. Cmparing these cefficient estimates t thse displayed in Tables 3.1 and 3.3, we ntice that the multiple regressin cefficient estimates fr TV and radi are pretty similar t the simple linear regressin cefficient estimates. Hwever, while the newspaper regressin cefficient estimate in Table 3.3 was significantly nn-zer, the cefficient estimate fr newspaper in the multiple regressin mdel is clse t zer, and the crrespnding p-value is n lnger significant, with a value arund This illustrates

89 74 3. Linear Regressin Cefficient Std. errr t-statistic p-value Intercept < TV < radi < newspaper TABLE 3.4. Fr the Advertising data, least squares cefficient estimates f the multiple linear regressin f number f units sld n radi, TV, and newspaper advertising budgets. that the simple and multiple regressin cefficients can be quite different. This difference stems frm the fact that in the simple regressin case, the slpe term represents the average effect f a $1,000 increase in newspaper advertising, ignring ther predictrs such as TV and radi. In cntrast, in the multiple regressin setting, the cefficient fr newspaper represents the average effect f increasing newspaper spending by $1,000 while hlding TV and radi fixed. Des it make sense fr the multiple regressin t suggest n relatinship between sales and newspaper while the simple linear regressin implies the ppsite? In fact it des. Cnsider the crrelatin matrix fr the three predictr variables and respnse variable, displayed in Table 3.5. Ntice that the crrelatin between radi and newspaper is This reveals a tendency t spend mre n newspaper advertising in markets where mre is spent n radi advertising. Nw suppse that the multiple regressin is crrect and newspaper advertising has n direct impact n sales, but radi advertising des increase sales. Then in markets where we spend mre n radi ur sales will tend t be higher, and as ur crrelatin matrix shws, we als tend t spend mre n newspaper advertising in thse same markets. Hence, in a simple linear regressin which nly examines sales versus newspaper, we will bserve that higher values f newspaper tend t be assciated with higher values f sales, even thugh newspaper advertising des nt actually affect sales. S newspaper sales are a surrgate fr radi advertising; newspaper gets credit fr the effect f radi n sales. This slightly cunterintuitive result is very cmmn in many real life situatins. Cnsider an absurd example t illustrate the pint. Running a regressin f shark attacks versus ice cream sales fr data cllected at a given beach cmmunity ver a perid f time wuld shw a psitive relatinship, similar t that seen between sales and newspaper. Ofcurse n ne (yet) has suggested that ice creams shuld be banned at beaches t reduce shark attacks. In reality, higher temperatures cause mre peple t visit the beach, which in turn results in mre ice cream sales and mre shark attacks. A multiple regressin f attacks versus ice cream sales and temperature reveals that, as intuitin implies, the frmer predictr is n lnger significant after adjusting fr temperature.
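The kind of correlation among predictors just described can be checked directly. A one-line sketch, assuming the Advertising data frame used in the earlier sketches (the exact column names may differ in your copy of the file):

> round(cor(Advertising[, c("TV", "radio", "newspaper", "sales")]), 4)   # correlation matrix, as in Table 3.5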

TABLE 3.5. Correlation matrix for TV, radio, newspaper, and sales for the Advertising data.

3.2.2 Some Important Questions

When we perform multiple linear regression, we usually are interested in answering a few important questions.

1. Is at least one of the predictors X_1, X_2, ..., X_p useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

We now address each of these questions in turn.

One: Is There a Relationship Between the Response and Predictors?

Recall that in the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor we can simply check whether β_1 = 0. In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether β_1 = β_2 = · · · = β_p = 0. As in the simple linear regression setting, we use a hypothesis test to answer this question. We test the null hypothesis,

H_0: β_1 = β_2 = · · · = β_p = 0

versus the alternative

H_a: at least one β_j is non-zero.

This hypothesis test is performed by computing the F-statistic,

F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ],    (3.23)
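The F-statistic in (3.23), and the partial F-test of (3.24) described below, are both easy to obtain in R. The sketch below assumes the Advertising data frame from the earlier sketches; summary() reports the overall F-statistic, while anova() compares nested models.

> fit.full = lm(sales ~ TV + radio + newspaper, data = Advertising)
> rss = sum(residuals(fit.full)^2)
> tss = sum((Advertising$sales - mean(Advertising$sales))^2)
> n = nrow(Advertising); p = 3
> ((tss - rss) / p) / (rss / (n - p - 1))    # the F-statistic (3.23), computed by hand
> summary(fit.full)$fstatistic               # R's value, with its degrees of freedom
> fit.small = lm(sales ~ TV + radio, data = Advertising)
> anova(fit.small, fit.full)                 # partial F-test for dropping newspaper (q = 1)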

91 76 3. Linear Regressin Quantity Value Residual standard errr 1.69 R F-statistic 570 TABLE 3.6. Mre infrmatin abut the least squares mdel fr the regressin f number f units sld n TV, newspaper, and radi advertising budgets in the Advertising data. Other infrmatin abut this mdel was displayed in Table 3.4. where, as with simple linear regressin, TSS = (y i ȳ) 2 and RSS = (yi ŷ i ) 2. If the linear mdel assumptins are crrect, ne can shw that and that, prvided H 0 is true, E{RSS/(n p 1)} = σ 2 E{(TSS RSS)/p} = σ 2. Hence, when there is n relatinship between the respnse and predictrs, ne wuld expect the F-statistic t take n a value clse t 1. On the ther hand, if H a is true, then E{(TSS RSS)/p} >σ 2, s we expect F t be greater than 1. The F-statistic fr the multiple linear regressin mdel btained by regressing sales nt radi, TV, andnewspaper is shwn in Table 3.6. In this example the F-statistic is 570. Since this is far larger than 1, it prvides cmpelling evidence against the null hypthesis H 0. In ther wrds, the large F-statistic suggests that at least ne f the advertising media must be related t sales. Hwever, what if the F-statistic had been clser t 1? Hw large des the F-statistic need t be befre we can reject H 0 and cnclude that there is a relatinship? It turns ut that the answer depends n the values f n and p. Whenn is large, an F-statistic that is just a little larger than 1 might still prvide evidence against H 0. In cntrast, a larger F-statistic is needed t reject H 0 if n is small. When H 0 is true and the errrs ɛ i have a nrmal distributin, the F-statistic fllws an F-distributin. 6 Fr any given value f n and p, any statistical sftware package can be used t cmpute the p-value assciated with the F-statistic using this distributin. Based n this p-value, we can determine whether r nt t reject H 0. Fr the advertising data, the p-value assciated with the F-statistic in Table 3.6 is essentially zer, s we have extremely strng evidence that at least ne f the media is assciated with increased sales. In (3.23) we are testing H 0 that all the cefficients are zer. Smetimes we want t test that a particular subset f q f the cefficients are zer. This crrespnds t a null hypthesis H 0 : β p q+1 = β p q+2 =...= β p =0, 6 Even if the errrs are nt nrmally-distributed, the F-statistic apprximately fllws an F-distributin prvided that the sample size n is large.

92 3.2 Multiple Linear Regressin 77 where fr cnvenience we have put the variables chsen fr missin at the end f the list. In this case we fit a secnd mdel that uses all the variables except thse last q. Suppse that the residual sum f squares fr that mdel is RSS 0. Then the apprpriate F-statistic is F = (RSS 0 RSS)/q RSS/(n p 1). (3.24) Ntice that in Table 3.4, fr each individual predictr a t-statistic and a p-value were reprted. These prvide infrmatin abut whether each individual predictr is related t the respnse, after adjusting fr the ther predictrs. It turns ut that each f these are exactly equivalent 7 t the F-test that mits that single variable frm the mdel, leaving all the thers in i.e. q=1 in (3.24). S it reprts the partial effect f adding that variable t the mdel. Fr instance, as we discussed earlier, these p-values indicate that TV and radi are related t sales, but that there is n evidence that newspaper is assciated with sales, in the presence f these tw. Given these individual p-values fr each variable, why d we need t lk at the verall F-statistic? After all, it seems likely that if any ne f the p-values fr the individual variables is very small, then at least ne f the predictrs is related t the respnse. Hwever, this lgic is flawed, especially when the number f predictrs p is large. Fr instance, cnsider an example in which p = 100 and H 0 : β 1 = β 2 =...= β p = 0 is true, s n variable is truly assciated with the respnse. In this situatin, abut 5 % f the p-values assciated with each variable (f the type shwn in Table 3.4) will be belw 0.05 by chance. In ther wrds, we expect t see apprximately five small p-values even in the absence f any true assciatin between the predictrs and the respnse. In fact, we are almst guaranteed that we will bserve at least ne p-value belw 0.05 by chance! Hence, if we use the individual t-statistics and assciated p- values in rder t decide whether r nt there is any assciatin between the variables and the respnse, there is a very high chance that we will incrrectly cnclude that there is a relatinship. Hwever, the F-statistic des nt suffer frm this prblem because it adjusts fr the number f predictrs. Hence, if H 0 is true, there is nly a 5 % chance that the F- statistic will result in a p-value belw 0.05, regardless f the number f predictrs r the number f bservatins. The apprach f using an F-statistic t test fr any assciatin between the predictrs and the respnse wrks when p is relatively small, and certainly small cmpared t n. Hwever, smetimes we have a very large number f variables. If p>nthen there are mre cefficients β j t estimate than bservatins frm which t estimate them. In this case we cannt even fit the multiple linear regressin mdel using least squares, s the 7 The square f each t-statistic is the crrespnding F-statistic.

93 78 3. Linear Regressin F-statistic cannt be used, and neither can mst f the ther cncepts that we have seen s far in this chapter. When p is large, sme f the appraches discussed in the next sectin, such as frward selectin, can be used. This high-dimensinal setting is discussed in greater detail in Chapter 6. highdimensinal Tw: Deciding n Imprtant Variables As discussed in the previus sectin, the first step in a multiple regressin analysis is t cmpute the F-statistic and t examine the assciated p- value. If we cnclude n the basis f that p-value that at least ne f the predictrs is related t the respnse, then it is natural t wnder which are the guilty nes! We culd lk at the individual p-values as in Table 3.4, but as discussed, if p is large we are likely t make sme false discveries. It is pssible that all f the predictrs are assciated with the respnse, but it is mre ften the case that the respnse is nly related t a subset f the predictrs. The task f determining which predictrs are assciated with the respnse, in rder t fit a single mdel invlving nly thse predictrs, is referred t as variable selectin. The variable selectin prblem is studied variable extensively in Chapter 6, and s here we will prvide nly a brief utline selectin f sme classical appraches. Ideally, we wuld like t perfrm variable selectin by trying ut a lt f different mdels, each cntaining a different subset f the predictrs. Fr instance, if p = 2, then we can cnsider fur mdels: (1) a mdel cntaining n variables, (2) a mdel cntaining X 1 nly, (3) a mdel cntaining X 2 nly, and (4) a mdel cntaining bth X 1 and X 2.Wecanthenselect the best mdel ut f all f the mdels that we have cnsidered. Hw d we determine which mdel is best? Varius statistics can be used t judge the quality f a mdel. These include Mallw s C p, Akaike infrma- Mallw s Cp infrmatin criterin Bayesian infrmatin criterin tin criterin (AIC), Bayesian infrmatin criterin (BIC), and adjusted Akaike R 2. These are discussed in mre detail in Chapter 6. We can als deter- mine which mdel is best by pltting varius mdel utputs, such as the residuals, in rder t search fr patterns. Unfrtunately, there are a ttal f 2 p mdels that cntain subsets f p variables. This means that even fr mderate p, trying ut every pssible subset f the predictrs is infeasible. Fr instance, we saw that if p =2,then there are 2 2 = 4 mdels t cnsider. But if p = 30, then we must cnsider 2 30 =1,073,741,824 mdels! This is nt practical. Therefre, unless p is very small, we cannt cnsider all 2 p mdels, and instead we need an autmated and efficient apprach t chse a smaller set f mdels t cnsider. There are three classical appraches fr this task: adjusted R 2 Frward selectin. We begin with the null mdel a mdel that cn- frward tains an intercept but n predictrs. We then fit p simple linear regressins and add t the null mdel the variable that results in the lwest RSS. We then add t that mdel the variable that results selectin null mdel

94 3.2 Multiple Linear Regressin 79 in the lwest RSS fr the new tw-variable mdel. This apprach is cntinued until sme stpping rule is satisfied. Backward selectin. We start with all variables in the mdel, and backward remve the variable with the largest p-value that is, the variable selectin that is the least statistically significant. The new (p 1)-variable mdel is fit, and the variable with the largest p-value is remved. This prcedure cntinues until a stpping rule is reached. Fr instance, we may stp when all remaining variables have a p-value belw sme threshld. Mixed selectin. This is a cmbinatin f frward and backward se- mixed lectin. We start with n variables in the mdel, and as with frward selectin selectin, we add the variable that prvides the best fit. We cntinue t add variables ne-by-ne. Of curse, as we nted with the Advertising example, the p-values fr variables can becme larger as new predictrs are added t the mdel. Hence, if at any pint the p-value fr ne f the variables in the mdel rises abve a certain threshld, then we remve that variable frm the mdel. We cntinue t perfrm these frward and backward steps until all variables in the mdel have a sufficiently lw p-value, and all variables utside the mdel wuld have a large p-value if added t the mdel. Backward selectin cannt be used if p>n, while frward selectin can always be used. Frward selectin is a greedy apprach, and might include variables early that later becme redundant. Mixed selectin can remedy this. Three: Mdel Fit Tw f the mst cmmn numerical measures f mdel fit are the RSE and R 2, the fractin f variance explained. These quantities are cmputed and interpreted in the same fashin as fr simple linear regressin. Recall that in simple regressin, R 2 is the square f the crrelatin f the respnse and the variable. In multiple linear regressin, it turns ut that it equals Cr(Y,Ŷ )2, the square f the crrelatin between the respnse and the fitted linear mdel; in fact ne prperty f the fitted linear mdel is that it maximizes this crrelatin amng all pssible linear mdels. An R 2 value clse t 1 indicates that the mdel explains a large prtin f the variance in the respnse variable. As an example, we saw in Table 3.6 that fr the Advertising data, the mdel that uses all three advertising media t predict sales has an R 2 f On the ther hand, the mdel that uses nly TV and radi t predict sales has an R 2 value f In ther wrds, there is a small increase in R 2 if we include newspaper advertising in the mdel that already cntains TV and radi advertising, even thugh we saw earlier that the p-value fr newspaper advertising in Table 3.4 is nt significant. It turns ut that R 2 will always increase when mre variables

95 80 3. Linear Regressin are added t the mdel, even if thse variables are nly weakly assciated with the respnse. This is due t the fact that adding anther variable t the least squares equatins must allw us t fit the training data (thugh nt necessarily the testing data) mre accurately. Thus, the R 2 statistic, which is als cmputed n the training data, must increase. The fact that adding newspaper advertising t the mdel cntaining nly TV and radi advertising leads t just a tiny increase in R 2 prvides additinal evidence that newspaper can be drpped frm the mdel. Essentially, newspaper prvides n real imprvement in the mdel fit t the training samples, and its inclusin will likely lead t pr results n independent test samples due t verfitting. In cntrast, the mdel cntaining nly TV as a predictr had an R 2 f 0.61 (Table 3.2). Adding radi t the mdel leads t a substantial imprvement in R 2. This implies that a mdel that uses TV and radi expenditures t predict sales is substantially better than ne that uses nly TV advertising. We culd further quantify this imprvement by lking at the p-value fr the radi cefficient in a mdel that cntains nly TV and radi as predictrs. The mdel that cntains nly TV and radi as predictrs has an RSE f 1.681, and the mdel that als cntains newspaper as a predictr has an RSE f (Table 3.6). In cntrast, the mdel that cntains nly TV has an RSE f 3.26 (Table 3.2). This crrbrates ur previus cnclusin that a mdel that uses TV and radi expenditures t predict sales is much mre accurate (n the training data) than ne that nly uses TV spending. Furthermre, given that TV and radi expenditures are used as predictrs, there is n pint in als using newspaper spending as a predictr in the mdel. The bservant reader may wnder hw RSE can increase when newspaper is added t the mdel given that RSS must decrease. In general RSE is defined as 1 RSE = RSS, (3.25) n p 1 which simplifies t (3.15) fr a simple linear regressin. Thus, mdels with mre variables can have higher RSE if the decrease in RSS is small relative t the increase in p. In additin t lking at the RSE and R 2 statistics just discussed, it can be useful t plt the data. Graphical summaries can reveal prblems with a mdel that are nt visible frm numerical statistics. Fr example, Figure 3.5 displays a three-dimensinal plt f TV and radi versus sales. We see that sme bservatins lie abve and sme bservatins lie belw the least squares regressin plane. In particular, the linear mdel seems t verestimate sales fr instances in which mst f the advertising mney was spent exclusively n either TV r radi. It underestimates sales fr instances where the budget was split between the tw media. This prnunced nn-linear pattern cannt be mdeled accurately using linear re-

In addition to looking at the RSE and R² statistics just discussed, it can be useful to plot the data. Graphical summaries can reveal problems with a model that are not visible from numerical statistics. For example, Figure 3.5 displays a three-dimensional plot of TV and radio versus sales. We see that some observations lie above and some observations lie below the least squares regression plane. In particular, the linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio. It underestimates sales for instances where the budget was split between the two media.

FIGURE 3.5. For the Advertising data, a linear regression fit to sales using TV and radio as predictors. From the pattern of the residuals, we can see that there is a pronounced non-linear relationship in the data. The positive residuals (those visible above the surface) tend to lie along the 45-degree line, where TV and radio budgets are split evenly. The negative residuals (most not visible) tend to lie away from this line, where budgets are more lopsided.

This pronounced non-linear pattern cannot be modeled accurately using linear regression. It suggests a synergy or interaction effect between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium. In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.

Four: Predictions

Once we have fit the multiple regression model, it is straightforward to apply (3.21) in order to predict the response Y on the basis of a set of values for the predictors X1, X2, ..., Xp. However, there are three sorts of uncertainty associated with this prediction.

1. The coefficient estimates β̂0, β̂1, ..., β̂p are estimates for β0, β1, ..., βp. That is, the least squares plane

\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p

is only an estimate for the true population regression plane

f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.

The inaccuracy in the coefficient estimates is related to the reducible error from Chapter 2. We can compute a confidence interval in order to determine how close Ŷ will be to f(X).

2. Of course, in practice assuming a linear model for f(X) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias. So when we use a linear model, we are in fact estimating the best linear approximation to the true surface. However, here we will ignore this discrepancy, and operate as if the linear model were correct.

3. Even if we knew f(X), that is, even if we knew the true values for β0, β1, ..., βp, the response value cannot be predicted perfectly because of the random error ε in the model (3.21). In Chapter 2, we referred to this as the irreducible error. How much will Y vary from Ŷ? We use prediction intervals to answer this question. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).

We use a confidence interval to quantify the uncertainty surrounding the average sales over a large number of cities. For example, given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city, the 95% confidence interval is [10,985, 11,528]. We interpret this to mean that 95% of intervals of this form will contain the true value of f(X); in other words, if we collect a large number of data sets like the Advertising data set, and construct a confidence interval for the average sales on the basis of each data set (given $100,000 in TV and $20,000 in radio advertising), then 95% of these confidence intervals will contain the true value of average sales. On the other hand, a prediction interval can be used to quantify the uncertainty surrounding sales for a particular city. Given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in that city, the 95% prediction interval is [7,930, 14,580]. We interpret this to mean that 95% of intervals of this form will contain the true value of Y for this city. Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.
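In R, both kinds of interval can be obtained from predict() on a fitted lm object by setting the interval argument. The sketch below rests on the same assumptions as before (an Advertising data frame and a model of sales on TV and radio); the budget values mirror the example in the text, with TV and radio measured in thousands of dollars.

fit_tv_radio <- lm(sales ~ TV + radio, data = Advertising)

new_city <- data.frame(TV = 100, radio = 20)   # $100,000 on TV, $20,000 on radio

# Confidence interval for the average response f(X)
predict(fit_tv_radio, newdata = new_city, interval = "confidence", level = 0.95)

# Prediction interval for an individual response Y (always wider)
predict(fit_tv_radio, newdata = new_city, interval = "prediction", level = 0.95)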

3.3 Other Considerations in the Regression Model

3.3.1 Qualitative Predictors

In our discussion so far, we have assumed that all variables in our linear regression model are quantitative. But in practice, this is not necessarily the case; often some predictors are qualitative.

For example, the Credit data set displayed in Figure 3.6 records balance (average credit card debt for a number of individuals) as well as several quantitative predictors: age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating). Each panel of Figure 3.6 is a scatterplot for a pair of variables whose identities are given by the corresponding row and column labels. For example, the scatterplot directly to the right of the word "Balance" depicts balance versus age, while the plot directly to the right of "Age" corresponds to age versus cards. In addition to these quantitative variables, we also have four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American or Asian).

FIGURE 3.6. The Credit data set contains information about balance, age, cards, education, income, limit, and rating for a number of potential customers.

TABLE 3.7. Least squares coefficient estimates associated with the regression of balance onto gender in the Credit data set. The linear model is given in (3.27). That is, gender is encoded as a dummy variable, as in (3.26).

Predictors with Only Two Levels

Suppose that we wish to investigate differences in credit card balance between males and females, ignoring the other variables for the moment. If a qualitative predictor (also known as a factor) only has two levels, or possible values, then incorporating it into a regression model is very simple. We simply create an indicator or dummy variable that takes on two possible numerical values. For example, based on the gender variable, we can create a new variable that takes the form

x_i = \begin{cases} 1 & \text{if the $i$th person is female} \\ 0 & \text{if the $i$th person is male,} \end{cases}    (3.26)

and use this variable as a predictor in the regression equation. This results in the model

y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the $i$th person is female} \\ \beta_0 + \epsilon_i & \text{if the $i$th person is male.} \end{cases}    (3.27)

Now β0 can be interpreted as the average credit card balance among males, β0 + β1 as the average credit card balance among females, and β1 as the average difference in credit card balance between females and males.

Table 3.7 displays the coefficient estimates and other information associated with the model (3.27). The average credit card debt for males is estimated to be $509.80, whereas females are estimated to carry $19.73 in additional debt, for a total of $509.80 + $19.73 = $529.53. However, we notice that the p-value for the dummy variable is very high. This indicates that there is no statistical evidence of a difference in average credit card balance between the genders.

The decision to code females as 1 and males as 0 in (3.27) is arbitrary, and has no effect on the regression fit, but it does alter the interpretation of the coefficients. If we had coded males as 1 and females as 0, then the estimates for β0 and β1 would have been 529.53 and -19.73, respectively, leading once again to a prediction of credit card debt of $529.53 - $19.73 = $509.80 for males and a prediction of $529.53 for females. Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable

x_i = \begin{cases} 1 & \text{if the $i$th person is female} \\ -1 & \text{if the $i$th person is male} \end{cases}

and use this variable in the regression equation. This results in the model

y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the $i$th person is female} \\ \beta_0 - \beta_1 + \epsilon_i & \text{if the $i$th person is male.} \end{cases}

Now β0 can be interpreted as the overall average credit card balance (ignoring the gender effect), and β1 is the amount by which females are above the average and males are below the average. In this example, the estimate for β0 would be $519.665, halfway between the male and female averages of $509.80 and $529.53. The estimate for β1 would be $9.865, which is half of $19.73, the average difference between females and males. It is important to note that the final predictions for the credit balances of males and females will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.

Qualitative Predictors with More than Two Levels

When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first could be

x_{i1} = \begin{cases} 1 & \text{if the $i$th person is Asian} \\ 0 & \text{if the $i$th person is not Asian,} \end{cases}    (3.28)

and the second could be

x_{i2} = \begin{cases} 1 & \text{if the $i$th person is Caucasian} \\ 0 & \text{if the $i$th person is not Caucasian.} \end{cases}    (3.29)

Then both of these variables can be used in the regression equation, in order to obtain the model

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the $i$th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the $i$th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the $i$th person is African American.} \end{cases}    (3.30)

Now β0 can be interpreted as the average credit card balance for African Americans, β1 can be interpreted as the difference in the average balance between the Asian and African American categories, and β2 can be interpreted as the difference in the average balance between the Caucasian and African American categories.
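In R, one rarely constructs these dummy variables by hand: passing a factor to lm() creates them automatically, and model.matrix() shows exactly which coding is being used. The sketch below assumes the Credit data frame from the ISLR package, whose column names are capitalized (Balance, Gender); the contrasts() call that switches to a +1/-1 style coding is shown only to illustrate that the coding scheme changes, not the fitted values.

library(ISLR)   # provides the Credit data set used in this section

# Default treatment (0/1 dummy) coding
fit_gender <- lm(Balance ~ Gender, data = Credit)
head(model.matrix(~ Gender, data = Credit))   # shows the 0/1 dummy variable

# Switch to a sum-to-zero (+1/-1) coding; only the interpretation of the
# coefficients changes, the predictions do not
contrasts(Credit$Gender) <- contr.sum(2)
fit_gender_pm <- lm(Balance ~ Gender, data = Credit)

coef(fit_gender)
coef(fit_gender_pm)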

TABLE 3.8. Least squares coefficient estimates associated with the regression of balance onto ethnicity in the Credit data set. The linear model is given in (3.30). That is, ethnicity is encoded via the two dummy variables (3.28) and (3.29).

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable, African American in this example, is known as the baseline.

From Table 3.8, we see that the estimated balance for the baseline, African American, is $531.00. It is estimated that the Asian category will have $18.69 less debt than the African American category, and that the Caucasian category will have $12.50 less debt than the African American category. However, the p-values associated with the coefficient estimates for the two dummy variables are very large, suggesting no statistical evidence of a real difference in credit card balance between the ethnicities. Once again, the level selected as the baseline category is arbitrary, and the final predictions for each group will be the same regardless of this choice. However, the coefficients and their p-values do depend on the choice of dummy variable coding. Rather than rely on the individual coefficients, we can use an F-test to test H0: β1 = β2 = 0; this does not depend on the coding. This F-test has a p-value of 0.96, indicating that we cannot reject the null hypothesis that there is no relationship between balance and ethnicity.

Using this dummy variable approach presents no difficulties when incorporating both quantitative and qualitative predictors. For example, to regress balance on both a quantitative variable such as income and a qualitative variable such as student, we must simply create a dummy variable for student and then fit a multiple regression model using income and the dummy variable as predictors for credit card balance.

There are many different ways of coding qualitative variables besides the dummy variable approach taken here. All of these approaches lead to equivalent model fits, but the coefficients are different and have different interpretations, and are designed to measure particular contrasts. This topic is beyond the scope of the book, and so we will not pursue it further.
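The F-test of H0: β1 = β2 = 0 can be obtained in R by comparing the models with and without the ethnicity factor using anova(); the same F-statistic is reported at the bottom of summary() for the fitted model. This is a sketch under the same assumption as before, namely that the Credit data frame (with capitalized columns Balance and Ethnicity) is available from the ISLR package.

library(ISLR)

fit_null <- lm(Balance ~ 1, data = Credit)          # intercept-only model
fit_ethn <- lm(Balance ~ Ethnicity, data = Credit)  # two dummy variables created automatically

# F-test of H0: all ethnicity coefficients are zero; this does not depend on the coding
anova(fit_null, fit_ethn)

# The same F-statistic and p-value appear at the bottom of summary(fit_ethn)
summary(fit_ethn)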

3.3.2 Extensions of the Linear Model

The standard linear regression model (3.19) provides interpretable results and works quite well on many real-world problems. However, it makes several highly restrictive assumptions that are often violated in practice. Two of the most important assumptions state that the relationship between the predictors and the response is additive and linear. The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors. The linear assumption states that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj. In this book, we examine a number of sophisticated methods that relax these two assumptions. Here, we briefly examine some common classical approaches for extending the linear model.

Removing the Additive Assumption

In our previous analysis of the Advertising data, we concluded that both TV and radio seem to be associated with sales. The linear models that formed the basis for this conclusion assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media. For example, the linear model (3.20) states that the average effect on sales of a one-unit increase in TV is always β1, regardless of the amount spent on radio.

However, this simple model may be incorrect. Suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases. In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio. In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect. Figure 3.5 suggests that such an effect may be present in the advertising data. Notice that when levels of either TV or radio are low, then the true sales are lower than predicted by the linear model. But when advertising is split between the two media, then the model tends to underestimate sales.

Consider the standard linear regression model with two variables,

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.

According to this model, if we increase X1 by one unit, then Y will increase by an average of β1 units. Notice that the presence of X2 does not alter this statement; that is, regardless of the value of X2, a one-unit increase in X1 will lead to a β1-unit increase in Y. One way of extending this model to allow for interaction effects is to include a third predictor, called an interaction term, which is constructed by computing the product of X1 and X2. This results in the model

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.    (3.31)

How does inclusion of this interaction term relax the additive assumption? Notice that (3.31) can be rewritten as

Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon = \beta_0 + \tilde\beta_1 X_1 + \beta_2 X_2 + \epsilon,    (3.32)

where β̃1 = β1 + β3X2. Since β̃1 changes with X2, the effect of X1 on Y is no longer constant: adjusting X2 will change the impact of X1 on Y.

TABLE 3.9. For the Advertising data, least squares coefficient estimates associated with the regression of sales onto TV and radio, with an interaction term, as in (3.33).

For example, suppose that we are interested in studying the productivity of a factory. We wish to predict the number of units produced on the basis of the number of production lines and the total number of workers. It seems likely that the effect of increasing the number of production lines will depend on the number of workers, since if no workers are available to operate the lines, then increasing the number of lines will not increase production. This suggests that it would be appropriate to include an interaction term between lines and workers in a linear model to predict units. Suppose that when we fit the model, we obtain

units ≈ 1.2 + 3.4 × lines + 0.22 × workers + 1.4 × (lines × workers)
      = 1.2 + (3.4 + 1.4 × workers) × lines + 0.22 × workers.

In other words, adding an additional line will increase the number of units produced by 3.4 + 1.4 × workers. Hence the more workers we have, the stronger will be the effect of lines.

We now return to the Advertising example. A linear model that uses radio, TV, and an interaction between the two to predict sales takes the form

sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times (radio \times TV) + \epsilon = \beta_0 + (\beta_1 + \beta_3 \times radio) \times TV + \beta_2 \times radio + \epsilon.    (3.33)

We can interpret β3 as the increase in the effectiveness of TV advertising for a one-unit increase in radio advertising (or vice versa). The coefficients that result from fitting the model (3.33) are given in Table 3.9.

The results in Table 3.9 strongly suggest that the model that includes the interaction term is superior to the model that contains only main effects. The p-value for the interaction term, TV × radio, is extremely low, indicating that there is strong evidence for Ha: β3 ≠ 0. In other words, it is clear that the true relationship is not additive. The R² for the model (3.33) is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term. This means that (96.8 - 89.7)/(100 - 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
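In R, the interaction model (3.33) can be fit by using the * or : operators in the model formula; TV * radio expands to TV + radio + TV:radio, so the main effects are kept alongside the interaction. As before, this sketch assumes an Advertising data frame with columns sales, TV, and radio.

# Additive model and the interaction model (3.33)
fit_additive    <- lm(sales ~ TV + radio, data = Advertising)
fit_interaction <- lm(sales ~ TV * radio, data = Advertising)   # TV + radio + TV:radio

summary(fit_interaction)              # coefficient table, as in Table 3.9
summary(fit_additive)$r.squared       # about 0.897 according to the text
summary(fit_interaction)$r.squared    # about 0.968 according to the text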

The coefficient estimates in Table 3.9 suggest that an increase in TV advertising of $1,000 is associated with increased sales of (β̂1 + β̂3 × radio) × 1,000 = 19 + 1.1 × radio units. And an increase in radio advertising of $1,000 will be associated with an increase in sales of (β̂2 + β̂3 × TV) × 1,000 = 29 + 1.1 × TV units.

In this example, the p-values associated with TV, radio, and the interaction term are all statistically significant (Table 3.9), and so it is obvious that all three variables should be included in the model. However, it is sometimes the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not. The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction between X1 and X2 seems important, then we should include both X1 and X2 in the model even if their coefficient estimates have large p-values. The rationale for this principle is that if X1 × X2 is related to the response, then whether or not the coefficients of X1 or X2 are exactly zero is of little interest. Also, X1 × X2 is typically correlated with X1 and X2, and so leaving them out tends to alter the meaning of the interaction.

In the previous example, we considered an interaction between TV and radio, both of which are quantitative variables. However, the concept of interactions applies just as well to qualitative variables, or to a combination of quantitative and qualitative variables. In fact, an interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation. Consider the Credit data set from Section 3.3.1, and suppose that we wish to predict balance using the income (quantitative) and student (qualitative) variables. In the absence of an interaction term, the model takes the form

balance_i \approx \beta_0 + \beta_1 \times income_i + \begin{cases} \beta_2 & \text{if the $i$th person is a student} \\ 0 & \text{if the $i$th person is not a student} \end{cases}
= \beta_1 \times income_i + \begin{cases} \beta_0 + \beta_2 & \text{if the $i$th person is a student} \\ \beta_0 & \text{if the $i$th person is not a student.} \end{cases}    (3.34)

Notice that this amounts to fitting two parallel lines to the data, one for students and one for non-students. The lines for students and non-students have different intercepts, β0 + β2 versus β0, but the same slope, β1. This is illustrated in the left-hand panel of Figure 3.7. The fact that the lines are parallel means that the average effect on balance of a one-unit increase in income does not depend on whether or not the individual is a student. This represents a potentially serious limitation of the model, since in fact a change in income may have a very different effect on the credit card balance of a student versus a non-student.

This limitation can be addressed by adding an interaction variable, created by multiplying income with the dummy variable for student.
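Fitting the parallel-lines model (3.34) and its interaction extension, given as (3.35) below, again comes down to the model formula: an interaction between a quantitative variable and a factor gives each level of the factor its own intercept and slope. The sketch assumes the Credit data frame from the ISLR package, with its capitalized columns Balance, Income, and Student.

library(ISLR)

# Parallel-lines model (3.34): common slope, different intercepts
fit_parallel <- lm(Balance ~ Income + Student, data = Credit)

# Interaction model (3.35): different intercepts and different slopes
fit_interact <- lm(Balance ~ Income * Student, data = Credit)

coef(fit_parallel)
coef(fit_interact)   # the Income:Student term is the slope difference beta_3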

FIGURE 3.7. For the Credit data, the least squares lines are shown for prediction of balance from income for students and non-students. Left: The model (3.34) was fit. There is no interaction between income and student. Right: The model (3.35) was fit. There is an interaction term between income and student.

Our model now becomes

balance_i \approx \beta_0 + \beta_1 \times income_i + \begin{cases} \beta_2 + \beta_3 \times income_i & \text{if student} \\ 0 & \text{if not student} \end{cases}
= \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \times income_i & \text{if student} \\ \beta_0 + \beta_1 \times income_i & \text{if not student.} \end{cases}    (3.35)

Once again, we have two different regression lines for the students and the non-students. But now those regression lines have different intercepts, β0 + β2 versus β0, as well as different slopes, β1 + β3 versus β1. This allows for the possibility that changes in income may affect the credit card balances of students and non-students differently.

The right-hand panel of Figure 3.7 shows the estimated relationships between income and balance for students and non-students in the model (3.35). We note that the slope for students is lower than the slope for non-students. This suggests that increases in income are associated with smaller increases in credit card balance among students as compared to non-students.

Non-linear Relationships

As discussed previously, the linear regression model (3.19) assumes a linear relationship between the response and the predictors. But in some cases, the true relationship between the response and the predictors may be non-linear. Here we present a very simple way to directly extend the linear model to accommodate non-linear relationships, using polynomial regression. In later chapters, we will present more complex approaches for performing non-linear fits in more general settings.

Consider Figure 3.8, in which the mpg (gas mileage in miles per gallon) versus horsepower is shown for a number of cars in the Auto data set.

FIGURE 3.8. The Auto data set. For a number of cars, mpg and horsepower are shown. The linear regression fit is shown in orange. The linear regression fit for a model that includes horsepower² is shown as a blue curve. The linear regression fit for a model that includes all polynomials of horsepower up to fifth degree is shown in green.

The orange line represents the linear regression fit. There is a pronounced relationship between mpg and horsepower, but it seems clear that this relationship is in fact non-linear: the data suggest a curved relationship. A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors in the model. For example, the points in Figure 3.8 seem to have a quadratic shape, suggesting that a quadratic model of the form

mpg = \beta_0 + \beta_1 \times horsepower + \beta_2 \times horsepower^2 + \epsilon    (3.36)

may provide a better fit. Equation 3.36 involves predicting mpg using a non-linear function of horsepower. But it is still a linear model! That is, (3.36) is simply a multiple linear regression model with X1 = horsepower and X2 = horsepower². So we can use standard linear regression software to estimate β0, β1, and β2 in order to produce a non-linear fit. The blue curve in Figure 3.8 shows the resulting quadratic fit to the data. The quadratic fit appears to be substantially better than the fit obtained when just the linear term is included. The R² of the quadratic fit is 0.688, compared to 0.606 for the linear fit, and the p-value in Table 3.10 for the quadratic term is highly significant.

If including horsepower² led to such a big improvement in the model, why not include horsepower³, horsepower⁴, or even horsepower⁵?
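The quadratic model (3.36) can be fit with ordinary lm(); the only care needed is to wrap the squared term in I() so the formula does not reinterpret the ^ operator, or to use poly() for higher degrees. This sketch assumes the Auto data frame from the ISLR package, with columns mpg and horsepower.

library(ISLR)

fit_linear <- lm(mpg ~ horsepower, data = Auto)
fit_quad   <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)   # model (3.36)
fit_quint  <- lm(mpg ~ poly(horsepower, 5), data = Auto)            # all polynomials up to degree 5

summary(fit_linear)$r.squared
summary(fit_quad)$r.squared
summary(fit_quad)   # the quadratic term's p-value, as in Table 3.10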

TABLE 3.10. For the Auto data set, least squares coefficient estimates associated with the regression of mpg onto horsepower and horsepower².

The green curve in Figure 3.8 displays the fit that results from including all polynomials up to fifth degree in the model (3.36). The resulting fit seems unnecessarily wiggly; that is, it is unclear that including the additional terms really has led to a better fit to the data.

The approach that we have just described for extending the linear model to accommodate non-linear relationships is known as polynomial regression, since we have included polynomial functions of the predictors in the regression model. We further explore this approach and other non-linear extensions of the linear model in Chapter 7.

3.3.3 Potential Problems

When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:

1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.

In practice, identifying and overcoming these problems is as much an art as a science. Many pages in countless books have been written on this topic. Since the linear regression model is not our primary focus here, we will provide only a brief summary of some key points.

1. Non-linearity of the Data

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. Given a simple linear regression model, we can plot the residuals, e_i = y_i - ŷ_i, versus the predictor x_i. In the case of a multiple regression model,

FIGURE 3.9. Plots of residuals versus predicted (or fitted) values for the Auto data set. In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend. Left: A linear regression of mpg on horsepower. A strong pattern in the residuals indicates non-linearity in the data. Right: A linear regression of mpg on horsepower and horsepower². There is little pattern in the residuals.

since there are multiple predictors, we instead plot the residuals versus the predicted (or fitted) values ŷ_i. Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.

The left panel of Figure 3.9 displays a residual plot from the linear regression of mpg onto horsepower on the Auto data set that was illustrated in Figure 3.8. The red line is a smooth fit to the residuals, which is displayed in order to make it easier to identify any trends. The residuals exhibit a clear U-shape, which provides a strong indication of non-linearity in the data. In contrast, the right-hand panel of Figure 3.9 displays the residual plot that results from the model (3.36), which contains a quadratic term. There appears to be little pattern in the residuals, suggesting that the quadratic term improves the fit to the data.

If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as log X, √X, and X², in the regression model. In the later chapters of this book, we will discuss other more advanced non-linear approaches for addressing this issue.
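Residual plots like those in Figure 3.9 can be produced directly from a fitted lm object, either by plotting residuals() against fitted() or by calling plot() on the model, whose first panel is a residuals-versus-fitted plot with a smooth trend line. This sketch reuses the hypothetical fit_linear and fit_quad objects from the polynomial-regression snippet above; any fitted lm object would do.

# Residuals versus fitted values, by hand
plot(fitted(fit_linear), residuals(fit_linear),
     xlab = "Fitted values", ylab = "Residuals")
lines(lowess(fitted(fit_linear), residuals(fit_linear)), col = "red")  # smooth trend line

# Or use R's built-in diagnostics; which = 1 is the residuals-vs-fitted panel
plot(fit_quad, which = 1)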

2. Correlation of Error Terms

An important assumption of the linear regression model is that the error terms, ε1, ε2, ..., εn, are uncorrelated. What does this mean? For instance, if the errors are uncorrelated, then the fact that εi is positive provides little or no information about the sign of εi+1. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

As an extreme example, suppose we accidentally doubled our data, leading to observations and error terms identical in pairs. If we ignored this, our standard error calculations would be as if we had a sample of size 2n, when in fact we have only n samples. Our estimated parameters would be the same for the 2n samples as for the n samples, but the confidence intervals would be narrower by a factor of √2!

Why might correlations among the error terms occur? Such correlations frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time. In many cases, observations that are obtained at adjacent time points will have positively correlated errors. In order to determine if this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no discernible pattern. On the other hand, if the error terms are positively correlated, then we may see tracking in the residuals; that is, adjacent residuals may have similar values. Figure 3.10 provides an illustration. In the top panel, we see the residuals from a linear regression fit to data generated with uncorrelated errors. There is no evidence of a time-related trend in the residuals. In contrast, the residuals in the bottom panel are from a data set in which adjacent errors had a correlation of 0.9. Now there is a clear pattern in the residuals: adjacent residuals tend to take on similar values. Finally, the center panel illustrates a more moderate case in which the residuals had a correlation of 0.5. There is still evidence of tracking, but the pattern is less clear.

Many methods have been developed to properly take account of correlations in the error terms in time series data. Correlation among the error terms can also occur outside of time series data. For instance, consider a study in which individuals' heights are predicted from their weights. The assumption of uncorrelated errors could be violated if some of the individuals in the study are members of the same family, or eat the same diet, or have been exposed to the same environmental factors. In general, the assumption of uncorrelated errors is extremely important for linear regression as well as for other statistical methods, and good experimental design is crucial in order to mitigate the risk of such correlations.

FIGURE 3.10. Plots of residuals from simulated time series data sets generated with differing levels of correlation ρ between error terms for adjacent time points.

3. Non-constant Variance of Error Terms

Another important assumption of the linear regression model is that the error terms have a constant variance, Var(εi) = σ². The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption.

Unfortunately, it is often the case that the variances of the error terms are non-constant. For instance, the variances of the error terms may increase with the value of the response. One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot. An example is shown in the left-hand panel of Figure 3.11, in which the magnitude of the residuals tends to increase with the fitted values. When faced with this problem, one possible solution is to transform the response Y using a concave function such as log Y or √Y. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.

FIGURE 3.11. Residual plots. In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend. The blue lines track the outer quantiles of the residuals, and emphasize patterns. Left: The funnel shape indicates heteroscedasticity. Right: The response has been log-transformed, and there is now no evidence of heteroscedasticity.

The right-hand panel of Figure 3.11 displays the residual plot after transforming the response using log Y. The residuals now appear to have constant variance, though there is some evidence of a slight non-linear relationship in the data.

Sometimes we have a good idea of the variance of each response. For example, the ith response could be an average of n_i raw observations. If each of these raw observations is uncorrelated with variance σ², then their average has variance σ_i² = σ²/n_i. In this case a simple remedy is to fit our model by weighted least squares, with weights proportional to the inverse variances, i.e. w_i = n_i in this case. Most linear regression software allows for observation weights.
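In R, weighted least squares is just lm() with a weights argument. The sketch below is a self-contained toy example of the situation described above: each response is an average of n_i raw observations, so the weights are taken proportional to n_i. The simulated data and all object names are illustrative.

set.seed(1)
n_i <- sample(1:20, 100, replace = TRUE)           # raw observations behind each response
x   <- runif(100)
y   <- 2 + 3 * x + rnorm(100, sd = 1 / sqrt(n_i))  # averages have variance sigma^2 / n_i

fit_ols <- lm(y ~ x)                   # ordinary least squares
fit_wls <- lm(y ~ x, weights = n_i)    # weighted least squares with w_i = n_i

summary(fit_ols)$coefficients
summary(fit_wls)$coefficients          # typically smaller standard errors here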

FIGURE 3.12. Left: The least squares regression line is shown in red, and the regression line after removing the outlier is shown in blue. Center: The residual plot clearly identifies the outlier. Right: The outlier has a studentized residual of 6; typically we expect values between -3 and 3.

4. Outliers

An outlier is a point for which y_i is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.

The red point (observation 20) in the left-hand panel of Figure 3.12 illustrates a typical outlier. The red solid line is the least squares regression fit, while the blue dashed line is the least squares fit after removal of the outlier. In this case, removing the outlier has little effect on the least squares line: it leads to almost no change in the slope, and a minuscule reduction in the intercept. It is typical for an outlier that does not have an unusual predictor value to have little effect on the least squares fit. However, even if an outlier does not have much effect on the least squares fit, it can cause other problems. For instance, in this example, the RSE is 1.09 when the outlier is included in the regression, but it is only 0.77 when the outlier is removed. Since the RSE is used to compute all confidence intervals and p-values, such a dramatic increase caused by a single data point can have implications for the interpretation of the fit. Similarly, inclusion of the outlier causes the R² to decline from 0.892 to 0.805.

Residual plots can be used to identify outliers. In this example, the outlier is clearly visible in the residual plot illustrated in the center panel of Figure 3.12. But in practice, it can be difficult to decide how large a residual needs to be before we consider the point to be an outlier. To address this problem, instead of plotting the residuals, we can plot the studentized residuals, computed by dividing each residual e_i by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers. In the right-hand panel of Figure 3.12, the outlier's studentized residual exceeds 6, while all other observations have studentized residuals between -2 and 2.

If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor.

5. High Leverage Points

We just saw that outliers are observations for which the response y_i is unusual given the predictor x_i. In contrast, observations with high leverage have an unusual value for x_i. For example, observation 41 in the left-hand panel of Figure 3.13 has high leverage, in that the predictor value for this observation is large relative to the other observations. (Note that the data displayed in Figure 3.13 are the same as the data displayed in Figure 3.12, but with the addition of a single high leverage observation.) The red solid line is the least squares fit to the data, while the blue dashed line is the fit produced when observation 41 is removed. Comparing the left-hand panels of Figures 3.12 and 3.13, we observe that removing the high leverage observation has a much more substantial impact on the least squares line than removing the outlier.

FIGURE 3.13. Left: Observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed. Center: The red observation is not unusual in terms of its X1 value or its X2 value, but still falls outside the bulk of the data, and hence has high leverage. Right: Observation 41 has high leverage and a high residual.

In fact, high leverage observations tend to have a sizable impact on the estimated regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit. For this reason, it is important to identify high leverage observations.

In a simple linear regression, high leverage observations are fairly easy to identify, since we can simply look for observations for which the predictor value is outside of the normal range of the observations. But in a multiple linear regression with many predictors, it is possible to have an observation that is well within the range of each individual predictor's values, but that is unusual in terms of the full set of predictors. An example is shown in the center panel of Figure 3.13, for a data set with two predictors, X1 and X2. Most of the observations' predictor values fall within the blue dashed ellipse, but the red observation is well outside of this range. Yet neither its value for X1 nor its value for X2 is unusual. So if we examine just X1 or just X2, we will fail to notice this high leverage point. This problem is more pronounced in multiple regression settings with more than two predictors, because then there is no simple way to plot all dimensions of the data simultaneously.

In order to quantify an observation's leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage. For a simple linear regression,

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}.    (3.37)

It is clear from this equation that h_i increases with the distance of x_i from x̄. There is a simple extension of h_i to the case of multiple predictors, though we do not provide the formula here. The leverage statistic h_i is always between 1/n and 1, and the average leverage for all the observations is always equal to (p + 1)/n. So if a given observation has a leverage statistic that greatly exceeds (p + 1)/n, then we may suspect that the corresponding point has high leverage.
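Both diagnostics discussed above are available directly from a fitted lm object: rstudent() returns the studentized residuals and hatvalues() returns the leverage statistics h_i, whose average is (p + 1)/n. The sketch below reuses the hypothetical fit_quad object for the Auto data from earlier; the threshold of twice the average leverage is only a common rule of thumb, not a rule from the text.

stud_res <- rstudent(fit_quad)    # studentized residuals
lev      <- hatvalues(fit_quad)   # leverage statistics h_i

# Flag possible outliers and high leverage points
which(abs(stud_res) > 3)
p <- length(coef(fit_quad)) - 1
which(lev > 2 * (p + 1) / length(lev))   # well above the average leverage (p + 1)/n

# Studentized residuals against leverage, as in the right panel of Figure 3.13
plot(lev, stud_res, xlab = "Leverage", ylab = "Studentized residuals")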

FIGURE 3.14. Scatterplots of the observations from the Credit data set. Left: A plot of age versus limit. These two variables are not collinear. Right: A plot of rating versus limit. There is high collinearity.

The right-hand panel of Figure 3.13 provides a plot of the studentized residuals versus h_i for the data in the left-hand panel of Figure 3.13. Observation 41 stands out as having a very high leverage statistic as well as a high studentized residual. In other words, it is an outlier as well as a high leverage observation. This is a particularly dangerous combination! This plot also reveals the reason that observation 20 had relatively little effect on the least squares fit in Figure 3.12: it has low leverage.

6. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The concept of collinearity is illustrated in Figure 3.14 using the Credit data set. In the left-hand panel of Figure 3.14, the two predictors limit and age appear to have no obvious relationship. In contrast, in the right-hand panel of Figure 3.14, the predictors limit and rating are very highly correlated with each other, and we say that they are collinear. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. In other words, since limit and rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response, balance.

Figure 3.15 illustrates some of the difficulties that can result from collinearity. The left-hand panel of Figure 3.15 is a contour plot of the RSS (3.22) associated with different possible coefficient estimates for the regression of balance on limit and age. Each ellipse represents a set of coefficients that correspond to the same RSS, with ellipses nearest to the center taking on the lowest values of RSS.

FIGURE 3.15. Contour plots of the RSS values as a function of the parameters β for various regressions involving the Credit data set. In each plot, the black dots represent the coefficient values corresponding to the minimum RSS. Left: A contour plot of RSS for the regression of balance onto age and limit. The minimum value is well defined. Right: A contour plot of RSS for the regression of balance onto rating and limit. Because of the collinearity, there are many pairs (β_Limit, β_Rating) with a similar value for RSS.

The black dots and associated dashed lines represent the coefficient estimates that result in the smallest possible RSS; in other words, these are the least squares estimates. The axes for limit and age have been scaled so that the plot includes possible coefficient estimates that are up to four standard errors on either side of the least squares estimates. Thus the plot includes all plausible values for the coefficients. For example, we see that the true limit coefficient is almost certainly somewhere between 0.15 and 0.20.

In contrast, the right-hand panel of Figure 3.15 displays contour plots of the RSS associated with possible coefficient estimates for the regression of balance onto limit and rating, which we know to be highly collinear. Now the contours run along a narrow valley; there is a broad range of values for the coefficient estimates that result in equal values for RSS. Hence a small change in the data could cause the pair of coefficient values that yield the smallest RSS, that is, the least squares estimates, to move anywhere along this valley. This results in a great deal of uncertainty in the coefficient estimates. Notice that the scale for the limit coefficient now runs from roughly -0.2 to 0.2; this is an eight-fold increase over the plausible range of the limit coefficient in the regression with age. Interestingly, even though the limit and rating coefficients now have much more individual uncertainty, they will almost certainly lie somewhere in this contour valley. For example, we would not expect the true values of the limit and rating coefficients to be -0.1 and 1 respectively, even though such values are plausible for each coefficient individually.

TABLE 3.11. The results for two multiple regression models involving the Credit data set are shown. Model 1 is a regression of balance on age and limit, and Model 2 a regression of balance on rating and limit. The standard error of β̂_limit increases 12-fold in the second regression, due to collinearity.

Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for β̂_j to grow. Recall that the t-statistic for each predictor is calculated by dividing β̂_j by its standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the presence of collinearity, we may fail to reject H0: β_j = 0. This means that the power of the hypothesis test, the probability of correctly detecting a non-zero coefficient, is reduced by collinearity.

Table 3.11 compares the coefficient estimates obtained from two separate multiple regression models. The first is a regression of balance on age and limit, and the second is a regression of balance on rating and limit. In the first regression, both age and limit are highly significant with very small p-values. In the second, the collinearity between limit and rating has caused the standard error for the limit coefficient estimate to increase by a factor of 12 and the p-value to increase to 0.701. In other words, the importance of the limit variable has been masked due to the presence of collinearity. To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model.

A simple way to detect collinearity is to look at the correlation matrix of the predictors. An element of this matrix that is large in absolute value indicates a pair of highly correlated variables, and therefore a collinearity problem in the data. Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of β̂_j when fitting the full model divided by the variance of β̂_j if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of

collinearity. The VIF for each variable can be computed using the formula

\mathrm{VIF}(\hat\beta_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}},

where R²_{Xj|X-j} is the R² from a regression of X_j onto all of the other predictors. If R²_{Xj|X-j} is close to one, then collinearity is present, and so the VIF will be large.

In the Credit data, a regression of balance on age, rating, and limit indicates that the predictors have VIF values of 1.01, 160.67, and 160.59. As we suspected, there is considerable collinearity in the data!

When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression. This can usually be done without much compromise to the regression fit, since the presence of collinearity implies that the information that this variable provides about the response is redundant in the presence of the other variables. For instance, if we regress balance onto age and limit, without the rating predictor, then the resulting VIF values are close to the minimum possible value of 1, and the R² drops from 0.754 to 0.75. So dropping rating from the set of predictors has effectively solved the collinearity problem without compromising the fit. The second solution is to combine the collinear variables together into a single predictor. For instance, we might take the average of standardized versions of limit and rating in order to create a new variable that measures credit worthiness.
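The VIF can be computed either directly from its definition, by regressing each predictor on the others, or with the vif() function in the car package (an additional dependency, not used elsewhere in this book's labs). The sketch below shows both routes for the regression of Balance on Age, Rating, and Limit, again assuming the capitalized column names of the ISLR package's Credit data.

library(ISLR)

fit <- lm(Balance ~ Age + Rating + Limit, data = Credit)

# VIF from the definition: 1 / (1 - R^2 of each predictor regressed on the others)
r2_limit <- summary(lm(Limit ~ Age + Rating, data = Credit))$r.squared
1 / (1 - r2_limit)

# Or use the car package
library(car)
vif(fit)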

3.4 The Marketing Plan

We now briefly return to the seven questions about the Advertising data that we set out to answer at the beginning of this chapter.

1. Is there a relationship between sales and advertising budget?

This question can be answered by fitting a multiple regression model of sales onto TV, radio, and newspaper, as in (3.20), and testing the hypothesis H0: β_TV = β_radio = β_newspaper = 0. In Section 3.2.2, we showed that the F-statistic can be used to determine whether or not we should reject this null hypothesis. In this case the p-value corresponding to the F-statistic in Table 3.6 is very low, indicating clear evidence of a relationship between advertising and sales.

2. How strong is the relationship?

We discussed two measures of model accuracy in Section 3.1.3. First, the RSE estimates the standard deviation of the response from the population regression line. For the Advertising data, the RSE is 1,681 units while the mean value for the response is 14,022, indicating a percentage error of roughly 12%. Second, the R² statistic records the percentage of variability in the response that is explained by the predictors. The predictors explain almost 90% of the variance in sales. The RSE and R² statistics are displayed in Table 3.6.

3. Which media contribute to sales?

To answer this question, we can examine the p-values associated with each predictor's t-statistic (Section 3.1.2). In the multiple linear regression displayed in Table 3.4, the p-values for TV and radio are low, but the p-value for newspaper is not. This suggests that only TV and radio are related to sales. In Chapter 6 we explore this question in greater detail.

4. How large is the effect of each medium on sales?

We saw in Section 3.1.2 that the standard error of β̂_j can be used to construct confidence intervals for β_j. For the Advertising data, the 95% confidence intervals are as follows: (0.043, 0.049) for TV, (0.172, 0.206) for radio, and (-0.013, 0.011) for newspaper. The confidence intervals for TV and radio are narrow and far from zero, providing evidence that these media are related to sales. But the interval for newspaper includes zero, indicating that the variable is not statistically significant given the values of TV and radio.

We saw in Section 3.3.3 that collinearity can result in very wide standard errors. Could collinearity be the reason that the confidence interval associated with newspaper is so wide? The VIF scores are 1.005, 1.145, and 1.145 for TV, radio, and newspaper, suggesting no evidence of collinearity.

In order to assess the association of each medium individually on sales, we can perform three separate simple linear regressions. Results are shown in Tables 3.1 and 3.3. There is evidence of an extremely strong association between TV and sales and between radio and sales. There is evidence of a mild association between newspaper and sales, when the values of TV and radio are ignored.

5. How accurately can we predict future sales?

The response can be predicted using (3.21). The accuracy associated with this estimate depends on whether we wish to predict an individual response, Y = f(X) + ε, or the average response, f(X) (Section 3.2.2). If the former, we use a prediction interval, and if the latter, we use a confidence interval. Prediction intervals will always be wider than confidence intervals because they account for the uncertainty associated with ε, the irreducible error.

6. Is the relationship linear?

In Section 3.3.3, we saw that residual plots can be used in order to identify non-linearity. If the relationships are linear, then the residual plots should display no pattern. In the case of the Advertising data, we observe a non-linear effect in Figure 3.5, though this effect could also be observed in a residual plot. In Section 3.3.2, we discussed the inclusion of transformations of the predictors in the linear regression model in order to accommodate non-linear relationships.

7. Is there synergy among the advertising media?

The standard linear regression model assumes an additive relationship between the predictors and the response. An additive model is easy to interpret because the effect of each predictor on the response is unrelated to the values of the other predictors. However, the additive assumption may be unrealistic for certain data sets. In Section 3.3.2, we showed how to include an interaction term in the regression model in order to accommodate non-additive relationships. A small p-value associated with the interaction term indicates the presence of such relationships. Figure 3.5 suggested that the Advertising data may not be additive. Including an interaction term in the model results in a substantial increase in R², from around 90% to almost 97%.

3.5 Comparison of Linear Regression with K-Nearest Neighbors

As discussed in Chapter 2, linear regression is an example of a parametric approach because it assumes a linear functional form for f(X). Parametric methods have several advantages. They are often easy to fit, because one need estimate only a small number of coefficients. In the case of linear regression, the coefficients have simple interpretations, and tests of statistical significance can be easily performed. But parametric methods do have a disadvantage: by construction, they make strong assumptions about the form of f(X). If the specified functional form is far from the truth, and prediction accuracy is our goal, then the parametric method will perform poorly. For instance, if we assume a linear relationship between X and Y but the true relationship is far from linear, then the resulting model will provide a poor fit to the data, and any conclusions drawn from it will be suspect.

In contrast, non-parametric methods do not explicitly assume a parametric form for f(X), and thereby provide an alternative and more flexible approach for performing regression. We discuss various non-parametric methods in this book. Here we consider one of the simplest and best-known non-parametric methods, K-nearest neighbors regression (KNN regression).

FIGURE 3.16. Plots of f̂(X) using KNN regression on a two-dimensional data set with 64 observations (orange dots). Left: K = 1 results in a rough step function fit. Right: K = 9 produces a much smoother fit.

The KNN regression method is closely related to the KNN classifier discussed in Chapter 2. Given a value for K and a prediction point x₀, KNN regression first identifies the K training observations that are closest to x₀, represented by N₀. It then estimates f(x₀) using the average of all the training responses in N₀. In other words,

\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i.

Figure 3.16 illustrates two KNN fits on a data set with p = 2 predictors. The fit with K = 1 is shown in the left-hand panel, while the right-hand panel corresponds to K = 9. We see that when K = 1, the KNN fit perfectly interpolates the training observations, and consequently takes the form of a step function. When K = 9, the KNN fit still is a step function, but averaging over nine observations results in much smaller regions of constant prediction, and consequently a smoother fit.

In general, the optimal value for K will depend on the bias-variance tradeoff, which we introduced in Chapter 2. A small value for K provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation. In contrast, larger values of K provide a smoother and less variable fit; the prediction in a region is an average of several points, and so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in f(X). In Chapter 5, we introduce several approaches for estimating test error rates. These methods can be used to identify the optimal value of K in KNN regression.
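KNN regression is simple enough to sketch from scratch, which makes the formula above concrete: for each prediction point, find the K nearest training observations and average their responses. The toy data below are simulated and all names are illustrative; in practice one would typically reach for an existing implementation (for example in the FNN or caret packages) rather than this loop.

set.seed(1)
x_train <- runif(100, 0, 10)
y_train <- sin(x_train) + rnorm(100, sd = 0.3)

# KNN regression: average the responses of the K nearest training points
knn_reg <- function(x0, x_train, y_train, K) {
  nearest <- order(abs(x_train - x0))[1:K]   # indices of the K closest observations
  mean(y_train[nearest])
}

x_grid  <- seq(0, 10, length.out = 200)
pred_k1 <- sapply(x_grid, knn_reg, x_train = x_train, y_train = y_train, K = 1)
pred_k9 <- sapply(x_grid, knn_reg, x_train = x_train, y_train = y_train, K = 9)

plot(x_train, y_train, col = "orange", pch = 20)
lines(x_grid, pred_k1, col = "blue")        # rough fit that interpolates the training data
lines(x_grid, pred_k9, col = "darkblue")    # smoother fit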

In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression? The answer is simple: the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f.

Figure 3.17 provides an example with data generated from a one-dimensional linear regression model. The black solid lines represent f(X), while the blue curves correspond to the KNN fits using K = 1 and K = 9. In this case, the K = 1 predictions are far too variable, while the smoother K = 9 fit is much closer to f(X). However, since the true relationship is linear, it is hard for a non-parametric approach to compete with linear regression: a non-parametric approach incurs a cost in variance that is not offset by a reduction in bias. The blue dashed line in the left-hand panel of Figure 3.18 represents the linear regression fit to the same data. It is almost perfect. The right-hand panel of Figure 3.18 reveals that linear regression outperforms KNN for these data. The green solid line, plotted as a function of 1/K, represents the test set mean squared error (MSE) for KNN. The KNN errors are well above the black dashed line, which is the test MSE for linear regression. When the value of K is large, then KNN performs only a little worse than least squares regression in terms of MSE. It performs far worse when K is small.

In practice, the true relationship between X and Y is rarely exactly linear. Figure 3.19 examines the relative performances of least squares regression and KNN under increasing levels of non-linearity in the relationship between X and Y. In the top row, the true relationship is nearly linear. In this case we see that the test MSE for linear regression is still superior to that of KNN for low values of K. However, for K ≥ 4, KNN outperforms linear regression. The second row illustrates a more substantial deviation from linearity. In this situation, KNN substantially outperforms linear regression for all values of K. Note that as the extent of non-linearity increases, there is little change in the test set MSE for the non-parametric KNN method, but there is a large increase in the test set MSE of linear regression.

Figures 3.18 and 3.19 display situations in which KNN performs slightly worse than linear regression when the relationship is linear, but much better than linear regression for non-linear situations. In a real life situation in which the true relationship is unknown, one might draw the conclusion that KNN should be favored over linear regression because it will at worst be slightly inferior to linear regression if the true relationship is linear, and may give substantially better results if the true relationship is non-linear. But in reality, even when the true relationship is highly non-linear, KNN may still provide inferior results to linear regression. In particular, both Figures 3.18 and 3.19 illustrate settings with p = 1 predictor. But in higher dimensions, KNN often performs worse than linear regression.

Figure 3.20 considers the same strongly non-linear situation as in the second row of Figure 3.19, except that we have added additional noise predictors that are not associated with the response.

FIGURE 3.17. Plots of f̂(X) using KNN regression on a one-dimensional data set with 100 observations. The true relationship is given by the black solid line. Left: The blue curve corresponds to K = 1 and interpolates (i.e. passes directly through) the training data. Right: The blue curve corresponds to K = 9, and represents a smoother fit.

FIGURE 3.18. The same data set shown in Figure 3.17 is investigated further. Left: The blue dashed line is the least squares fit to the data. Since f(X) is in fact linear (displayed as the black line), the least squares regression line provides a very good estimate of f(X). Right: The dashed horizontal line represents the least squares test set MSE, while the green solid line corresponds to the MSE for KNN as a function of 1/K (on the log scale). Linear regression achieves a lower test MSE than does KNN regression, since f(X) is in fact linear. For KNN regression, the best results occur with a very large value of K, corresponding to a small value of 1/K.

FIGURE 3.19. Top Left: In a setting with a slightly non-linear relationship between X and Y (solid black line), the KNN fits with K = 1 (blue) and K = 9 (red) are displayed. Top Right: For the slightly non-linear data, the test set MSE for least squares regression (horizontal black) and KNN with various values of 1/K (green) are displayed. Bottom Left and Bottom Right: As in the top panels, but with a strongly non-linear relationship between X and Y.

When p = 1 or p = 2, KNN outperforms linear regression. But for p = 3 the results are mixed, and for p ≥ 4 linear regression is superior to KNN. In fact, the increase in dimension has only caused a small deterioration in the linear regression test set MSE, but it has caused more than a ten-fold increase in the MSE for KNN. This decrease in performance as the dimension increases is a common problem for KNN, and results from the fact that in higher dimensions there is effectively a reduction in sample size. In this data set there are 100 training observations; when p = 1, this provides enough information to accurately estimate f(X). However, spreading 100 observations over p = 20 dimensions results in a phenomenon in which a given observation has no nearby neighbors; this is the so-called curse of dimensionality. That is, the K observations that are nearest to a given test observation x₀ may be very far away from x₀ in p-dimensional space when p is large, leading to a very poor prediction of f(x₀) and hence a poor KNN fit.
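The behavior shown in Figure 3.20 is easy to mimic in a small simulation. The sketch below is our own and is not part of the book's labs; the true function 2 sin(2x₁), the sample sizes, and the choice K = 9 are arbitrary. It compares the test MSE of least squares and of a hand-rolled KNN fit as pure-noise predictors are added.

> knn.mse=function(Xtr, ytr, Xte, yte, K=9){
+ pred=apply(Xte, 1, function(x0){
+ d=sqrt(colSums((t(Xtr) - x0)^2))   # distances to all training points
+ mean(ytr[order(d)[1:K]])           # average of the K nearest responses
+ })
+ mean((yte - pred)^2)
+ }
> set.seed(1)
> n=100; n.test=200
> for(p in c(1, 2, 3, 4, 10, 20)){
+ Xtr=matrix(rnorm(n*p), ncol=p); Xte=matrix(rnorm(n.test*p), ncol=p)
+ colnames(Xtr)=colnames(Xte)=paste0("X", 1:p)
+ ytr=2*sin(2*Xtr[,1]) + rnorm(n)      # non-linear in the first variable only
+ yte=2*sin(2*Xte[,1]) + rnorm(n.test)
+ lm.pred=predict(lm(ytr~., data=data.frame(Xtr)), data.frame(Xte))
+ cat("p =", p, " lm:", round(mean((yte - lm.pred)^2), 2),
+     " knn:", round(knn.mse(Xtr, ytr, Xte, yte), 2), "\n")
+ }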

FIGURE 3.20. Test MSE for linear regression (black dashed lines) and KNN (green curves), plotted against 1/K, as the number of variables p increases (panels p = 1, 2, 3, 4, 10, 20). The true function is non-linear in the first variable, as in the lower panel of Figure 3.19, and does not depend on the additional variables. The performance of linear regression deteriorates slowly in the presence of these additional noise variables, whereas KNN's performance degrades much more quickly as p increases.

As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor.

Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forgo a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which p-values are available.

3.6 Lab: Linear Regression

3.6.1 Libraries

The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR package, which includes the data sets associated with this book.

> library(MASS)
> library(ISLR)

If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR, must be downloaded the first time they are used.

This can be done directly from within R. For example, on a Windows system, select the Install package option under the Packages tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR"). This installation only needs to be done the first time you use a package. However, the library() function must be called each time you wish to use a given package.

3.6.2 Simple Linear Regression

The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).

> fix(Boston)
> names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

To find out more about the data set, we can type ?Boston.

We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic syntax is lm(y~x, data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.

> lm.fit=lm(medv~lstat)
Error in eval(expr, envir, enclos) : Object "medv" not found

The command causes an error because R does not know where to find the variables medv and lstat. The next line tells R that the variables are in Boston. If we attach Boston, the first line works fine because R now recognizes the variables.

> lm.fit=lm(medv~lstat, data=Boston)
> attach(Boston)
> lm.fit=lm(medv~lstat)

If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us p-values and standard errors for the coefficients, as well as the R² statistic and F-statistic for the model.

> lm.fit

Call:
lm(formula = medv ~ lstat)

Coefficients:
(Intercept)        lstat

> summary(lm.fit)

Call:
lm(formula = medv ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                 <2e-16 ***
lstat                                       <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.22 on 504 degrees of freedom
Multiple R-squared: 0.544,  Adjusted R-squared:
F-statistic: 602 on 1 and 504 DF,  p-value: <2e-16

We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name, e.g. lm.fit$coefficients, it is safer to use the extractor functions like coef() to access them.

> names(lm.fit)
 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"
[10] "call"          "terms"         "model"
> coef(lm.fit)
(Intercept)       lstat

In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.

> confint(lm.fit)
             2.5 %  97.5 %
(Intercept)
lstat

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.

> predict(lm.fit, data.frame(lstat=(c(5,10,15))),
    interval="confidence")
  fit  lwr  upr

> predict(lm.fit, data.frame(lstat=(c(5,10,15))),
    interval="prediction")
  fit  lwr  upr

For instance, the 95 % confidence interval associated with a lstat value of 10 is (24.47, 25.63), and the 95 % prediction interval is (12.828, 37.28). As expected, the confidence and prediction intervals are centered around the same point (the predicted value of medv when lstat equals 10), but the latter are substantially wider.

We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.

> plot(lstat, medv)
> abline(lm.fit)

There is some evidence for non-linearity in the relationship between lstat and medv. We will explore this issue later in this lab.

The abline() function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a and slope b, we type abline(a,b). Below we experiment with some additional settings for plotting lines and points. The lwd=3 command causes the width of the regression line to be increased by a factor of 3; this works for the plot() and lines() functions also. We can also use the pch option to create different plotting symbols.

> abline(lm.fit, lwd=3)
> abline(lm.fit, lwd=3, col="red")
> plot(lstat, medv, col="red")
> plot(lstat, medv, pch=20)
> plot(lstat, medv, pch="+")
> plot(1:20, 1:20, pch=1:20)

Next we examine some diagnostic plots, several of which were discussed earlier in this chapter. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow=c(2,2)) divides the plotting region into a 2 × 2 grid of panels.

> par(mfrow=c(2,2))
> plot(lm.fit)

Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.

> plot(predict(lm.fit), residuals(lm.fit))
> plot(predict(lm.fit), rstudent(lm.fit))

On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues() function.

> plot(hatvalues(lm.fit))
> which.max(hatvalues(lm.fit))
375

The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.

3.6.3 Multiple Linear Regression

In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y~x1+x2+x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.

> lm.fit=lm(medv~lstat+age, data=Boston)
> summary(lm.fit)

Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                 <2e-16 ***
lstat                                       <2e-16 ***
age                                                 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.17 on 503 degrees of freedom
Multiple R-squared: 0.551,  Adjusted R-squared:
F-statistic: 309 on 2 and 503 DF,  p-value: <2e-16

The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:

> lm.fit=lm(medv~., data=Boston)
> summary(lm.fit)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  3.646e                             e-12 ***
crim                                                 **
zn           4.642e                                 ***
indus        2.056e
chas         2.687e                                  **
nox                                             e-06 ***
rm           3.810e                          < 2e-16 ***
age          6.922e
dis                                             e-13 ***
rad          3.060e                             e-06 ***
tax                                                  **
ptratio                                         e-12 ***
black        9.312e                                 ***
lstat                                        < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:        on 492 degrees of freedom
Multiple R-squared:      ,  Adjusted R-squared:
F-statistic:        on 13 and 492 DF,  p-value: < 2.2e-16

We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the R², and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIFs are low to moderate for this data. The car package is not part of the base R installation so it must be downloaded the first time you use it via the install.packages option in R.

> library(car)
> vif(lm.fit)
   crim      zn   indus    chas     nox      rm     age
    dis     rad     tax ptratio   black   lstat

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.

> lm.fit1=lm(medv~.-age, data=Boston)
> summary(lm.fit1)
...

Alternatively, the update() function can be used.

> lm.fit1=update(lm.fit, ~.-age)

3.6.4 Interaction Terms

It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat×age as predictors; it is a shorthand for lstat+age+lstat:age.

> summary(lm(medv~lstat*age, data=Boston))

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                              < 2e-16 ***
lstat                                       e-16 ***
age
lstat:age                                            *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.15 on 502 degrees of freedom
Multiple R-squared: 0.556,  Adjusted R-squared:
F-statistic: 209 on 3 and 502 DF,  p-value: <2e-16

3.6.5 Non-linear Transformations of the Predictors

The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor X, we can create a predictor X² using I(X^2). The function I() is needed since the ^ has a special meaning in a formula; wrapping as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat².

> lm.fit2=lm(medv~lstat+I(lstat^2))
> summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + I(lstat^2))

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                 <2e-16 ***
lstat                                       <2e-16 ***
I(lstat^2)                                  <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.52 on 503 degrees of freedom
Multiple R-squared: 0.641,  Adjusted R-squared:
F-statistic: 449 on 2 and 503 DF,  p-value: <2e-16

The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.

> lm.fit=lm(medv~lstat)
> anova(lm.fit, lm.fit2)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df  RSS  Df  Sum of Sq  F  Pr(>F)
1
2                                <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here Model 1 represents the linear submodel containing only one predictor, lstat, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat and lstat². The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat² is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat. If we type

> par(mfrow=c(2,2))
> plot(lm.fit2)

then we see that when the lstat² term is included in the model, there is little discernible pattern in the residuals.

In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:

> lm.fit5=lm(medv~poly(lstat,5))
> summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5))

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                   < 2e-16 ***
poly(lstat, 5)1                               < 2e-16 ***
poly(lstat, 5)2                               < 2e-16 ***
poly(lstat, 5)3                                  e-07 ***
poly(lstat, 5)4                                  e-06 ***
poly(lstat, 5)5                                       ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.21 on 500 degrees of freedom
Multiple R-squared: 0.682,  Adjusted R-squared:
F-statistic: 214 on 5 and 500 DF,  p-value: <2e-16

This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant p-values in a regression fit.

Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.

> summary(lm(medv~log(rm), data=Boston))

3.6.6 Qualitative Predictors

We will now examine the Carseats data, which is part of the ISLR library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

> fix(Carseats)
> names(Carseats)
 [1] "Sales"       "CompPrice"   "Income"      "Advertising"
 [5] "Population"  "Price"       "ShelveLoc"   "Age"
 [9] "Education"   "Urban"       "US"

The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location, that is, the space within a store in which the car seat is displayed, at each location. The predictor ShelveLoc takes on three possible values, Bad, Medium, and Good.

Given a qualitative variable such as ShelveLoc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

> lm.fit=lm(Sales~.+Income:Advertising+Price:Age, data=Carseats)
> summary(lm.fit)

Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                      e-10 ***
CompPrice                                     < 2e-16 ***
Income                                           e-05 ***
Advertising                                           **
Population
Price                                         < 2e-16 ***
ShelveLocGood                                 < 2e-16 ***
ShelveLocMedium                               < 2e-16 ***
Age                                                  ***
Education
UrbanYes
USYes
Income:Advertising                                    **
Price:Age
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.01 on 386 degrees of freedom
Multiple R-squared: 0.876,  Adjusted R-squared:
F-statistic: 210 on 13 and 386 DF,  p-value: <2e-16

The contrasts() function returns the coding that R uses for the dummy variables.

> attach(Carseats)
> contrasts(ShelveLoc)
        Good  Medium
Bad        0       0
Good       1       0
Medium     0       1

Use ?contrasts to learn about other contrasts, and how to set them.

R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for

ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location leads to higher sales than a bad shelving location but lower sales than a good shelving location.

3.6.7 Writing Functions

As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function that reads in the ISLR and MASS libraries, called LoadLibraries(). Before we have created the function, R returns an error if we try to call it.

> LoadLibraries
Error: object 'LoadLibraries' not found
> LoadLibraries()
Error: could not find function "LoadLibraries"

We now create the function. Note that the + symbols are printed by R and should not be typed in. The { symbol informs R that multiple commands are about to be input. Hitting Enter after typing { will cause R to print the + symbol. We can then input as many commands as we wish, hitting Enter after each one. Finally the } symbol informs R that no further commands will be entered.

> LoadLibraries=function(){
+ library(ISLR)
+ library(MASS)
+ print("The libraries have been loaded.")
+ }

Now if we type in LoadLibraries, R will tell us what is in the function.

> LoadLibraries
function(){
library(ISLR)
library(MASS)
print("The libraries have been loaded.")
}

If we call the function, the libraries are loaded in and the print statement is output.

> LoadLibraries()
[1] "The libraries have been loaded."
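A natural next step, not shown in the book, is a function that takes an argument. The sketch below is our own; it loads whichever packages are named in its pkgs argument, using the character.only argument of library().

> LoadPackages=function(pkgs){
+ for(p in pkgs) library(p, character.only=TRUE)   # load each named package
+ print("The requested packages have been loaded.")
+ }
> LoadPackages(c("MASS", "ISLR"))
[1] "The requested packages have been loaded."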

135 Linear Regressin 3.7 Exercises Cnceptual 1. Describe the null hyptheses t which the p-values given in Table 3.4 crrespnd. Explain what cnclusins yu can draw based n these p-values. Yur explanatin shuld be phrased in terms f sales, TV, radi, and newspaper, rather than in terms f the cefficients f the linear mdel. 2. Carefully explain the differences between the KNN classifier and KNN regressin methds. 3. Suppse we have a data set with five predictrs, X 1 =GPA,X 2 =IQ, X 3 = Gender (1 fr Female and 0 fr Male), X 4 = Interactin between GPA and IQ, and X 5 = Interactin between GPA and Gender. The respnse is starting salary after graduatin (in thusands f dllars). Suppse we use least squares t fit the mdel, and get ˆβ 0 =50, ˆβ 1 = 20, ˆβ 2 =0.07, ˆβ 3 =35, ˆβ 4 =0.01, ˆβ 5 = 10. (a) Which answer is crrect, and why? i. Fr a fixed value f IQ and GPA, males earn mre n average than females. ii. Fr a fixed value f IQ and GPA, females earn mre n average than males. iii. Fr a fixed value f IQ and GPA, males earn mre n average than females prvided that the GPA is high enugh. iv. Fr a fixed value f IQ and GPA, females earn mre n average than males prvided that the GPA is high enugh. (b) Predict the salary f a female with IQ f 110 and a GPA f 4.0. (c) True r false: Since the cefficient fr the GPA/IQ interactin term is very small, there is very little evidence f an interactin effect. Justify yur answer. 4. I cllect a set f data (n = 100 bservatins) cntaining a single predictr and a quantitative respnse. I then fit a linear regressin mdel t the data, as well as a separate cubic regressin, i.e. Y = β 0 + β 1 X + β 2 X 2 + β 3 X 3 + ɛ. (a) Suppse that the true relatinship between X and Y is linear, i.e. Y = β 0 + β 1 X + ɛ. Cnsider the training residual sum f squares (RSS) fr the linear regressin, and als the training RSS fr the cubic regressin. Wuld we expect ne t be lwer than the ther, wuld we expect them t be the same, r is there nt enugh infrmatin t tell? Justify yur answer.

136 3.7 Exercises 121 (b) Answer (a) using test rather than training RSS. (c) Suppse that the true relatinship between X and Y is nt linear, but we dn t knw hw far it is frm linear. Cnsider the training RSS fr the linear regressin, and als the training RSS fr the cubic regressin. Wuld we expect ne t be lwer than the ther, wuld we expect them t be the same, r is there nt enugh infrmatin t tell? Justify yur answer. (d) Answer (c) using test rather than training RSS. 5. Cnsider the fitted values that result frm perfrming linear regressin withut an intercept. In this setting, the ith fitted value takes the frm ŷ i = x i ˆβ, where Shw that we can write What is a i? ( n ) ( n ) ˆβ = x i y i / x 2 i. (3.38) i=1 ŷ i = i =1 n a i y i. i =1 Nte: We interpret this result by saying that the fitted values frm linear regressin are linear cmbinatins f the respnse values. 6. Using (3.4), argue that in the case f simple linear regressin, the least squares line always passes thrugh the pint ( x, ȳ). 7. It is claimed in the text that in the case f simple linear regressin f Y nt X, ther 2 statistic (3.17) is equal t the square f the crrelatin between X and Y (3.18). Prve that this is the case. Fr simplicity, yu may assume that x =ȳ =0. Applied 8. This questin invlves the use f simple linear regressin n the Aut data set. (a) Use the lm() functin t perfrm a simple linear regressin with mpg as the respnse and hrsepwer as the predictr. Use the summary() functin t print the results. Cmment n the utput. Fr example:

137 Linear Regressin i. Is there a relatinship between the predictr and the respnse? ii. Hw strng is the relatinship between the predictr and the respnse? iii. Is the relatinship between the predictr and the respnse psitive r negative? iv. What is the predicted mpg assciated with a hrsepwer f 98? What are the assciated 95 % cnfidence and predictin intervals? (b) Plt the respnse and the predictr. Use the abline() functin t display the least squares regressin line. (c) Use the plt() functin t prduce diagnstic plts f the least squares regressin fit. Cmment n any prblems yu see with the fit. 9. This questin invlves the use f multiple linear regressin n the Aut data set. (a) Prduce a scatterplt matrix which includes all f the variables in the data set. (b) Cmpute the matrix f crrelatins between the variables using the functin cr(). Yu will need t exclude the name variable, cr() which is qualitative. (c) Use the lm() functin t perfrm a multiple linear regressin with mpg as the respnse and all ther variables except name as thepredictrs.usethesummary() functin t print the results. Cmment n the utput. Fr instance: i. Is there a relatinship between the predictrs and the respnse? ii. Which predictrs appear t have a statistically significant relatinship t the respnse? iii. What des the cefficient fr the year variable suggest? (d) Use the plt() functin t prduce diagnstic plts f the linear regressin fit. Cmment n any prblems yu see with the fit. D the residual plts suggest any unusually large utliers? Des the leverage plt identify any bservatins with unusually high leverage? (e) Use the * and : symbls t fit linear regressin mdels with interactin effects. D any interactins appear t be statistically significant? (f) Try a few different transfrmatins f the variables, such as lg(x), X, X 2. Cmment n yur findings.

138 3.7 Exercises This questin shuld be answered using the Carseats data set. (a) Fit a multiple regressin mdel t predict Sales using Price, Urban, andus. (b) Prvide an interpretatin f each cefficient in the mdel. Be careful sme f the variables in the mdel are qualitative! (c) Write ut the mdel in equatin frm, being careful t handle the qualitative variables prperly. (d) Fr which f the predictrs can yu reject the null hypthesis H 0 : β j =0? (e) On the basis f yur respnse t the previus questin, fit a smaller mdel that nly uses the predictrs fr which there is evidence f assciatin with the utcme. (f) Hw well d the mdels in (a) and (e) fit the data? (g) Using the mdel frm (e), btain 95 % cnfidence intervals fr the cefficient(s). (h) Is there evidence f utliers r high leverage bservatins in the mdel frm (e)? 11. In this prblem we will investigate the t-statistic fr the null hypthesis H 0 : β = 0 in simple linear regressin withut an intercept. T begin, we generate a predictr x and a respnse y as fllws. > set.seed(1) > x=rnrm(100) > y=2*x+rnrm(100) (a) Perfrm a simple linear regressin f y nt x, withut an intercept. Reprt the cefficient estimate ˆβ, the standard errr f this cefficient estimate, and the t-statistic and p-value assciated with the null hypthesis H 0 : β = 0. Cmment n these results. (Yu can perfrm regressin withut an intercept using the cmmand lm(y x+0).) (b) Nw perfrm a simple linear regressin f x nt y withut an intercept, and reprt the cefficient estimate, its standard errr, and the crrespnding t-statistic and p-values assciated with the null hypthesis H 0 : β = 0. Cmment n these results. (c) What is the relatinship between the results btained in (a) and (b)? (d) Fr the regressin f Y nt X withut an intercept, the t- statistic fr H 0 : β =0takesthefrmˆβ/SE( ˆβ), where ˆβ is given by (3.38), and where SE( ˆβ) = n i=1 (y i x i ˆβ) 2 (n 1) n. i =1 x2 i

139 Linear Regressin (These frmulas are slightly different frm thse given in Sectins and 3.1.2, since here we are perfrming regressin withut an intercept.) Shw algebraically, and cnfirm numerically in R, that the t-statistic can be written as ( n 1) n i=1 x iy i n ( i=1 x2 i )( n i =1 y2 i ) ( n i =1 x i y. i )2 (e) Using the results frm (d), argue that the t-statistic fr the regressinf y nt x is the same as the t-statistic fr the regressin f x nt y. (f) In R, shw that when regressin is perfrmed with an intercept, the t-statistic fr H 0 : β 1 = 0 is the same fr the regressin f y nt x as it is fr the regressin f x nt y. 12. This prblem invlves simple linear regressin withut an intercept. (a) Recall that the cefficient estimate ˆβ frthelinearregressinf Y nt X withut an intercept is given by (3.38). Under what circumstance is the cefficient estimate fr the regressin f X nt Y the same as the cefficient estimate fr the regressin f Y nt X? (b) Generate an example in R with n = 100 bservatins in which the cefficient estimate fr the regressin f X nt Y is different frm the cefficient estimate fr the regressin f Y nt X. (c) Generate an example in R with n = 100 bservatins in which the cefficient estimate fr the regressin f X nt Y is the same as the cefficient estimate fr the regressin f Y nt X. 13. In this exercise yu will create sme simulated data and will fit simple linear regressin mdels t it. Make sure t use set.seed(1) prir t starting part (a) t ensure cnsistent results. (a) Using the rnrm() functin, create a vectr, x, cntaining 100 bservatins drawn frm a N(0, 1) distributin. This represents afeature,x. (b) Using the rnrm() functin, create a vectr, eps, cntaining 100 bservatins drawn frm a N(0, 0.25) distributin i.e. a nrmal distributin with mean zer and variance (c) Using x and eps, generate a vectr y accrding t the mdel Y = 1+0.5X + ɛ. (3.39) What is the length f the vectr y? What are the values f β 0 and β 1 in this linear mdel?

140 3.7 Exercises 125 (d) Create a scatterplt displaying the relatinship between x and y. Cmment n what yu bserve. (e) Fit a least squares linear mdel t predict y using x. Cmment n the mdel btained. Hw d ˆβ 0 and ˆβ 1 cmpare t β 0 and β 1? (f) Display the least squares line n the scatterplt btained in (d). Draw the ppulatin regressin line n the plt, in a different clr. Use the legend() cmmand t create an apprpriate legend. (g) Nw fit a plynmial regressin mdel that predicts y using x and x 2. Is there evidence that the quadratic term imprves the mdel fit? Explain yur answer. (h) Repeat (a) (f) after mdifying the data generatin prcess in such a way that there is less nise in the data. The mdel (3.39) shuld remain the same. Yu can d this by decreasing the variance f the nrmal distributin used t generate the errr term ɛ in (b). Describe yur results. (i) Repeat (a) (f) after mdifying the data generatin prcess in such a way that there is mre nise in the data. The mdel (3.39) shuld remain the same. Yu can d this by increasing the variance f the nrmal distributin used t generate the errr term ɛ in (b). Describe yur results. (j) What are the cnfidence intervals fr β 0 and β 1 based n the riginal data set, the nisier data set, and the less nisy data set? Cmment n yur results. 14. This prblem fcuses n the cllinearity prblem. (a) Perfrm the fllwing cmmands in R: > set.seed(1) > x1=runif(100) > x2=0.5*x1+rnrm(100)/10 > y=2+2*x1+0.3*x2+rnrm(100) The last line crrespnds t creating a linear mdel in which y is a functin f x1 and x2. Write ut the frm f the linear mdel. What are the regressin cefficients? (b) What is the crrelatin between x1 and x2? Create a scatterplt displaying the relatinship between the variables. (c) Using this data, fit a least squares regressin t predict y using x1 and x2. Describe the results btained. What are ˆβ 0, ˆβ 1,and ˆβ 2? Hw d these relate t the true β 0, β 1,andβ 2?Canyu reject the null hypthesis H 0 : β 1 = 0? Hw abut the null hypthesis H 0 : β 2 =0?

141 Linear Regressin (d) Nw fit a least squares regressin t predict y using nly x1. Cmment n yur results. Can yu reject the null hypthesis H 0 : β 1 =0? (e) Nw fit a least squares regressin t predict y using nly x2. Cmment n yur results. Can yu reject the null hypthesis H 0 : β 1 =0? (f) D the results btained in (c) (e) cntradict each ther? Explain yur answer. (g) Nw suppse we btain ne additinal bservatin, which was unfrtunately mismeasured. > x1=c(x1, 0.1) > x2=c(x2, 0.8) > y=c(y,6) Re-fit the linear mdels frm (c) t (e) using this new data. What effect des this new bservatin have n the each f the mdels? In each mdel, is this bservatin an utlier? A high-leverage pint? Bth? Explain yur answers. 15. This prblem invlves the Bstn data set, which we saw in the lab fr this chapter. We will nw try t predict per capita crime rate using the ther variables in this data set. In ther wrds, per capita crime rate is the respnse, and the ther variables are the predictrs. (a) Fr each predictr, fit a simple linear regressin mdel t predict the respnse. Describe yur results. In which f the mdels is there a statistically significant assciatin between the predictr and the respnse? Create sme plts t back up yur assertins. (b) Fit a multiple regressin mdel t predict the respnse using all f the predictrs. Describe yur results. Fr which predictrs can we reject the null hypthesis H 0 : β j =0? (c) Hw d yur results frm (a) cmpare t yur results frm (b)? Create a plt displaying the univariate regressin cefficients frm (a) n the x-axis, and the multiple regressin cefficients frm (b) n the y-axis. That is, each predictr is displayed as a single pint in the plt. Its cefficient in a simple linear regressin mdel is shwn n the x-axis, and its cefficient estimate in the multiple linear regressin mdel is shwn n the y-axis. (d) Is there evidence f nn-linear assciatin between any f the predictrs and the respnse? T answer this questin, fr each predictr X, fit a mdel f the frm Y = β 0 + β 1 X + β 2 X 2 + β 3 X 3 + ɛ.

142 4 Classificatin The linear regressin mdel discussed in Chapter 3 assumes that the respnse variable Y is quantitative. But in many situatins, the respnse variable is instead qualitative. Fr example, eye clr is qualitative, taking qualitative n values blue, brwn, r green. Often qualitative variables are referred t as categrical ; we will use these terms interchangeably. In this chapter, we study appraches fr predicting qualitative respnses, a prcess that is knwn as classificatin. Predicting a qualitative respnse fr an bser- classificatin vatin can be referred t as classifying that bservatin, since it invlves assigning the bservatin t a categry, r class. On the ther hand, ften the methds used fr classificatin first predict the prbability f each f the categries f a qualitative variable, as the basis fr making the classificatin. In this sense they als behave like regressin methds. There are many pssible classificatin techniques, r classifiers, that ne classifier might use t predict a qualitative respnse. We tuched n sme f these in Sectins and In this chapter we discuss three f the mst widely-used classifiers: lgistic regressin, linear discriminant analysis, and lgistic K-nearest neighbrs. We discuss mre cmputer-intensive methds in later chapters, such as generalized additive mdels (Chapter 7), trees, randm frests, and bsting (Chapter 8), and supprt vectr machines (Chapter 9). regressin linear discriminant analysis K-nearest neighbrs G. James et al., An Intrductin t Statistical Learning: with Applicatins in R, Springer Texts in Statistics, DOI / , Springer Science+Business Media New Yrk

143 Classificatin 4.1 An Overview f Classificatin Classificatin prblems ccur ften, perhaps even mre s than regressin prblems. Sme examples include: 1. A persn arrives at the emergency rm with a set f symptms that culd pssibly be attributed t ne f three medical cnditins. Which f the three cnditins des the individual have? 2. An nline banking service must be able t determine whether r nt a transactin being perfrmed n the site is fraudulent, n the basis f the user s IP address, past transactin histry, and s frth. 3. On the basis f DNA sequence data fr a number f patients with and withut a given disease, a bilgist wuld like t figure ut which DNA mutatins are deleterius (disease-causing) and which are nt. Just as in the regressin setting, in the classificatin setting we have a set f training bservatins (x 1,y 1 ),...,(x n,y n ) that we can use t build a classifier. We want ur classifier t perfrm well nt nly n the training data, but als n test bservatins that were nt used t train the classifier. In this chapter, we will illustrate the cncept f classificatin using the simulated Default data set. We are interested in predicting whether an individual will default n his r her credit card payment, n the basis f annual incme and mnthly credit card balance. The data set is displayed in Figure 4.1. We have pltted annual incme and mnthly credit card balance fr a subset f 10, 000 individuals. The left-hand panel f Figure 4.1 displays individuals wh defaulted in a given mnth in range, and thse wh did nt in blue. (The verall default rate is abut 3 %, s we have pltted nly a fractin f the individuals wh did nt default.) It appears that individuals wh defaulted tended t have higher credit card balances than thse wh did nt. In the right-hand panel f Figure 4.1, tw pairs f bxplts are shwn. The first shws the distributin f balance split by the binary default variable; the secnd is a similar plt fr incme. In this chapter, we learn hw t build a mdel t predict default (Y ) fr any given value f balance (X 1 )andincme (X 2 ). Since Y is nt quantitative, the simple linear regressin mdel f Chapter 3 is nt apprpriate. It is wrth nting that Figure 4.1 displays a very prnunced relatinship between the predictr balance and the respnse default. Inmstreal applicatins, the relatinship between the predictr and the respnse will nt be nearly s strng. Hwever, fr the sake f illustrating the classificatin prcedures discussed in this chapter, we use an example in which the relatinship between the predictr and the respnse is smewhat exaggerated.

FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status.

4.2 Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not?

Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

$$Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}$$

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors X₁, ..., X_p. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$Y = \begin{cases} 1 & \text{if epileptic seizure;} \\ 2 & \text{if stroke;} \\ 3 & \text{if drug overdose,} \end{cases}$$

which would imply a totally different relationship among the three conditions.
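A quick way to see the consequence of the two codings above is to fit least squares under each of them. The small simulation below is our own illustration (the condition labels and the predictor x are made up), not an analysis from the book.

> set.seed(1)
> cond=sample(c("stroke","overdose","seizure"), 100, replace=TRUE)
> x=rnorm(100)
> y1=c(stroke=1, overdose=2, seizure=3)[cond]   # first coding
> y2=c(seizure=1, stroke=2, overdose=3)[cond]   # second coding
> coef(lm(y1~x))
> coef(lm(y2~x))   # in general a different fitted line, hence different predictions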

Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.

If the response variable's values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.

For a binary (two level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient's medical condition: stroke and drug overdose. We could then potentially use the dummy variable approach from Chapter 3 to code the response as follows:

$$Y = \begin{cases} 0 & \text{if stroke;} \\ 1 & \text{if drug overdose.} \end{cases}$$

We could then fit a linear regression to this binary response, and predict drug overdose if Ŷ > 0.5 and stroke otherwise. In the binary case it is not hard to show that even if we flip the above coding, linear regression will produce the same final predictions.

For a binary response with a 0/1 coding as above, regression by least squares does make sense; it can be shown that the Xβ̂ obtained using linear regression is in fact an estimate of Pr(drug overdose | X) in this special case. However, if we use linear regression, some of our estimates might be outside the [0, 1] interval (see Figure 4.2), making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates. Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.

However, the dummy variable approach cannot be easily extended to accommodate qualitative responses with more than two levels. For these reasons, it is preferable to use a classification method that is truly suited for qualitative response values, such as the ones presented next.
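The point about estimates falling outside [0, 1] can be checked directly on the Default data from the ISLR package; the short sketch below is ours and is not one of the book's labs.

> library(ISLR)
> def01=as.numeric(Default$default == "Yes")   # 0/1 coding of the response
> lin.fit=lm(def01~balance, data=Default)
> range(predict(lin.fit))   # the smallest fitted values are negative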

4.3 Logistic Regression

Consider again the Default data set, where the response default falls into one of two categories, Yes or No. Rather than modeling this response Y directly, logistic regression models the probability that Y belongs to a particular category.

FIGURE 4.2. Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default (No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

Pr(default = Yes | balance).

The values of Pr(default = Yes | balance), which we abbreviate p(balance), will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as p(balance) > 0.1.

4.3.1 The Logistic Model

How should we model the relationship between p(X) = Pr(Y = 1 | X) and X? (For convenience we are using the generic 0/1 coding for the response.) In Section 4.2 we talked of using a linear regression model to represent these probabilities:

$$p(X) = \beta_0 + \beta_1 X. \tag{4.1}$$

If we use this approach to predict default = Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict p(X) < 0 for some values of X and p(X) > 1 for others (unless the range of X is limited).

To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Many functions meet this description. In logistic regression, we use the logistic function,

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. \tag{4.2}$$

To fit the model (4.2), we use a method called maximum likelihood, which we discuss in the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of X, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot. The average fitted probability in both cases (averaged over the training data) is the same as the overall proportion of defaulters in the data set.

After a bit of manipulation of (4.2), we find that

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}. \tag{4.3}$$

The quantity p(X)/[1 − p(X)] is called the odds, and can take on any value between 0 and ∞. Values of the odds close to 0 and ∞ indicate very low and very high probabilities of default, respectively. For example, on average 1 in 5 people with an odds of 1/4 will default, since p(X) = 0.2 implies an odds of 0.2/(1 − 0.2) = 1/4. Likewise, on average nine out of every ten people with an odds of 9 will default, since p(X) = 0.9 implies an odds of 0.9/(1 − 0.9) = 9. Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy.

By taking the logarithm of both sides of (4.3), we arrive at

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X. \tag{4.4}$$

The left-hand side is called the log-odds or logit. We see that the logistic regression model (4.2) has a logit that is linear in X.

Recall from Chapter 3 that in a linear regression model, β₁ gives the average change in Y associated with a one-unit increase in X. In contrast, in a logistic regression model, increasing X by one unit changes the log odds by β₁ (4.4), or equivalently it multiplies the odds by e^{β₁} (4.3).
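A small numerical sketch (ours, with arbitrary coefficient values rather than fitted estimates) of the relationships in (4.2) to (4.4): the logistic function stays strictly between 0 and 1, its odds equal e^{β₀+β₁X}, and its log odds are linear in X.

> b0=-10; b1=0.005            # arbitrary illustrative values
> p=function(x) exp(b0+b1*x)/(1+exp(b0+b1*x))
> x=2000
> p(x)                        # lies strictly between 0 and 1
> p(x)/(1-p(x))               # the odds, equal to exp(b0+b1*x)
> log(p(x)/(1-p(x)))          # the log odds (logit), equal to b0+b1*x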

However, because the relationship between p(X) and X in (4.2) is not a straight line, β₁ does not correspond to the change in p(X) associated with a one-unit increase in X. The amount that p(X) changes due to a one-unit change in X will depend on the current value of X. But regardless of the value of X, if β₁ is positive then increasing X will be associated with increasing p(X), and if β₁ is negative then increasing X will be associated with decreasing p(X). The fact that there is not a straight-line relationship between p(X) and X, and the fact that the rate of change in p(X) per unit change in X depends on the current value of X, can also be seen by inspection of the right-hand panel of Figure 4.2.

4.3.2 Estimating the Regression Coefficients

The coefficients β₀ and β₁ in (4.2) are unknown, and must be estimated based on the available training data. In Chapter 3, we used the least squares approach to estimate the unknown linear regression coefficients. Although we could use (non-linear) least squares to fit the model (4.4), the more general method of maximum likelihood is preferred, since it has better statistical properties. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for β₀ and β₁ such that the predicted probability p̂(xᵢ) of default for each individual, using (4.2), corresponds as closely as possible to the individual's observed default status. In other words, we try to find β̂₀ and β̂₁ such that plugging these estimates into the model for p(X), given in (4.2), yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr). \tag{4.5}$$

The estimates β̂₀ and β̂₁ are chosen to maximize this likelihood function.

Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book. In the linear regression setting, the least squares approach is in fact a special case of maximum likelihood. The mathematical details of maximum likelihood are beyond the scope of this book. However, in general, logistic regression and other models can be easily fit using a statistical software package such as R, and so we do not need to concern ourselves with the details of the maximum likelihood fitting procedure.
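In R, the maximum likelihood fit of (4.2) is what glm() with family=binomial computes. The brief sketch below is ours, not part of this section, and uses the Default data from the ISLR package.

> library(ISLR)
> glm.fit=glm(default~balance, data=Default, family=binomial)
> coef(glm.fit)            # maximum likelihood estimates of beta_0 and beta_1
> summary(glm.fit)$coef    # estimates, standard errors, z-statistics, p-values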

Table 4.1 shows the coefficient estimates and related information that result from fitting a logistic regression model on the Default data in order to predict the probability of default = Yes using balance. We see that the estimated coefficient β̂₁ for balance is positive; this indicates that an increase in balance is associated with an increase in the probability of default. To be precise, a one-unit increase in balance is associated with an increase in the log odds of default by β̂₁ units.

            Coefficient  Std. error  Z-statistic  P-value
Intercept
balance

TABLE 4.1. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance. A one-unit increase in balance is associated with an increase in the log odds of default by the estimated balance coefficient.

Many aspects of the logistic regression output shown in Table 4.1 are similar to the linear regression output of Chapter 3. For example, we can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in Table 4.1 plays the same role as the t-statistic in the linear regression output, for example in Table 3.1 on page 68. For instance, the z-statistic associated with β₁ is equal to β̂₁/SE(β̂₁), and so a large (absolute) value of the z-statistic indicates evidence against the null hypothesis H₀ : β₁ = 0. This null hypothesis implies that p(X) = e^{β₀}/(1 + e^{β₀}); in other words, that the probability of default does not depend on balance. Since the p-value associated with balance in Table 4.1 is tiny, we can reject H₀. In other words, we conclude that there is indeed an association between balance and probability of default. The estimated intercept in Table 4.1 is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data.

4.3.3 Making Predictions

Once the coefficients have been estimated, it is a simple matter to compute the probability of default for any given credit card balance. For example, using the coefficient estimates given in Table 4.1, we predict that the default probability for an individual with a balance of $1,000 is

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}} = \frac{e^{\hat\beta_0 + \hat\beta_1 \times 1{,}000}}{1 + e^{\hat\beta_0 + \hat\beta_1 \times 1{,}000}},$$

which is below 1 %. In contrast, the predicted probability of default for an individual with a balance of $2,000 is much higher, and equals 0.586, or 58.6 %.
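These predicted probabilities can be reproduced with predict() and type="response". The sketch below is ours, using the Default data from the ISLR package.

> library(ISLR)
> glm.fit=glm(default~balance, data=Default, family=binomial)
> predict(glm.fit, data.frame(balance=c(1000, 2000)), type="response")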

One can use qualitative predictors with the logistic regression model using the dummy variable approach from Chapter 3. As an example, the Default data set contains the qualitative variable student. To fit the model we simply create a dummy variable that takes on a value of 1 for students and 0 for non-students. The logistic regression model that results from predicting probability of default from student status can be seen in Table 4.2.

              Coefficient  Std. error  Z-statistic  P-value
Intercept
student[Yes]

TABLE 4.2. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using student status. Student status is encoded as a dummy variable, with a value of 1 for a student and a value of 0 for a non-student, and represented by the variable student[Yes] in the table.

The coefficient associated with the dummy variable is positive, and the associated p-value is statistically significant. This indicates that students tend to have higher default probabilities than non-students:

$$\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{Yes}) = \frac{e^{\hat\beta_0 + \hat\beta_1 \times 1}}{1 + e^{\hat\beta_0 + \hat\beta_1 \times 1}} = 0.0431,$$

$$\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{No}) = \frac{e^{\hat\beta_0 + \hat\beta_1 \times 0}}{1 + e^{\hat\beta_0 + \hat\beta_1 \times 0}}.$$

4.3.4 Multiple Logistic Regression

We now consider the problem of predicting a binary response using multiple predictors. By analogy with the extension from simple to multiple linear regression in Chapter 3, we can generalize (4.4) as follows:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \tag{4.6}$$

where X = (X₁, ..., X_p) are p predictors. Equation 4.6 can be rewritten as

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \tag{4.7}$$

Just as in Section 4.3.2, we use the maximum likelihood method to estimate β₀, β₁, ..., β_p.

Table 4.3 shows the coefficient estimates for a logistic regression model that uses balance, income (in thousands of dollars), and student status to predict probability of default. There is a surprising result here. The p-values associated with balance and the dummy variable for student status are very small, indicating that each of these variables is associated with the probability of default. However, the coefficient for the dummy variable is negative, indicating that students are less likely to default than non-students. In contrast, the coefficient for the dummy variable is positive in Table 4.2. How is it possible for student status to be associated with an increase in probability of default in Table 4.2 and a decrease in probability of default in Table 4.3? The left-hand panel of Figure 4.3 provides a graphical illustration of this apparent paradox.

              Coefficient  Std. error  Z-statistic  P-value
Intercept
balance
income
student[Yes]

TABLE 4.3. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance, income, and student status. Student status is encoded as a dummy variable student[Yes], with a value of 1 for a student and a value of 0 for a non-student. In fitting this model, income was measured in thousands of dollars.

The orange and blue solid lines show the average default rates for students and non-students, respectively, as a function of credit card balance. The negative coefficient for student in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student. Indeed, we observe from the left-hand panel of Figure 4.3 that the student default rate is at or below that of the non-student default rate for every value of balance. But the horizontal broken lines near the base of the plot, which show the default rates for students and non-students averaged over all values of balance and income, suggest the opposite effect: the overall student default rate is higher than the non-student default rate. Consequently, there is a positive coefficient for student in the single variable logistic regression output shown in Table 4.2.

The right-hand panel of Figure 4.3 provides an explanation for this discrepancy. The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default. In other words, students are more likely to have large credit card balances, which, as we know from the left-hand panel of Figure 4.3, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom they should offer credit. A student is riskier than a non-student if no information about the student's credit card balance is available. However, that student is less risky than a non-student with the same credit card balance!

This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors. In general, the phenomenon seen in Figure 4.3 is known as confounding.
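The sign flip described above can be reproduced directly. The sketch below is ours; it fits models analogous to those in Tables 4.2 and 4.3 on the Default data from the ISLR package (income is left in dollars here rather than thousands, which does not affect the sign of the student coefficient).

> library(ISLR)
> coef(glm(default~student, data=Default, family=binomial))["studentYes"]
> # positive: students default more often overall
> coef(glm(default~balance+income+student, data=Default,
+     family=binomial))["studentYes"]
> # negative: at a given balance and income, students default less often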

FIGURE 4.3. Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown.

By substituting estimates for the regression coefficients from Table 4.3 into (4.7), we can make predictions. For example, a student with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default of

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 \times 1{,}500 + \hat\beta_2 \times 40 + \hat\beta_3 \times 1}}{1 + e^{\hat\beta_0 + \hat\beta_1 \times 1{,}500 + \hat\beta_2 \times 40 + \hat\beta_3 \times 1}}. \tag{4.8}$$

A non-student with the same balance and income has an estimated probability of default of

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 \times 1{,}500 + \hat\beta_2 \times 40 + \hat\beta_3 \times 0}}{1 + e^{\hat\beta_0 + \hat\beta_1 \times 1{,}500 + \hat\beta_2 \times 40 + \hat\beta_3 \times 0}}. \tag{4.9}$$

(Here we multiply the income coefficient estimate from Table 4.3 by 40, rather than by 40,000, because in that table the model was fit with income measured in units of $1,000.)

4.3.5 Logistic Regression for >2 Response Classes

We sometimes wish to classify a response variable that has more than two classes. For example, in Section 4.2 we had three categories of medical condition in the emergency room: stroke, drug overdose, epileptic seizure. In this setting, we wish to model both Pr(Y = stroke | X) and Pr(Y = drug overdose | X), with the remaining Pr(Y = epileptic seizure | X) = 1 − Pr(Y = stroke | X) − Pr(Y = drug overdose | X). The two-class logistic regression models discussed in the previous sections have multiple-class extensions, but in practice they tend not to be used all that often. One of the reasons is that the method we discuss in the next section, discriminant

analysis, is popular for multiple-class classification. So we do not go into the details of multiple-class logistic regression here, but simply note that such an approach is possible, and that software for it is available in R.

4.4 Linear Discriminant Analysis

Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function, given by (4.7) for the case of two response classes. In statistical jargon, we model the conditional distribution of the response Y, given the predictor(s) X. We now consider an alternative and less direct approach to estimating these probabilities. In this alternative approach, we model the distribution of the predictors X separately in each of the response classes (i.e. given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k|X = x). When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.

Why do we need another method, when we have logistic regression? There are several reasons:

- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
- If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
- As mentioned in Section 4.3.5, linear discriminant analysis is popular when we have more than two response classes.

4.4.1 Using Bayes' Theorem for Classification

Suppose that we wish to classify an observation into one of K classes, where K ≥ 2. In other words, the qualitative response variable Y can take on K possible distinct and unordered values. Let π_k represent the overall or prior probability that a randomly chosen observation comes from the kth class; this is the probability that a given observation is associated with the kth category of the response variable Y. Let f_k(x) ≡ Pr(X = x|Y = k) denote the density function of X for an observation that comes from the kth class. In other words, f_k(x) is relatively large if there is a high probability that an observation in the kth class has X ≈ x, and f_k(x) is small if it is very

unlikely that an observation in the kth class has X ≈ x. Then Bayes' theorem states that

  \Pr(Y = k|X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}.    (4.10)

In accordance with our earlier notation, we will use the abbreviation p_k(X) = Pr(Y = k|X). This suggests that instead of directly computing p_k(X) as in Section 4.3.1, we can simply plug in estimates of π_k and f_k(X) into (4.10). In general, estimating π_k is easy if we have a random sample of Y's from the population: we simply compute the fraction of the training observations that belong to the kth class. However, estimating f_k(X) tends to be more challenging, unless we assume some simple forms for these densities. We refer to p_k(x) as the posterior probability that an observation X = x belongs to the kth class. That is, it is the probability that the observation belongs to the kth class, given the predictor value for that observation. We know from Chapter 2 that the Bayes classifier, which classifies an observation to the class for which p_k(X) is largest, has the lowest possible error rate out of all classifiers. (This is of course only true if the terms in (4.10) are all correctly specified.) Therefore, if we can find a way to estimate f_k(X), then we can develop a classifier that approximates the Bayes classifier. Such an approach is the topic of the following sections.

4.4.2 Linear Discriminant Analysis for p = 1

For now, assume that p = 1; that is, we have only one predictor. We would like to obtain an estimate for f_k(x) that we can plug into (4.10) in order to estimate p_k(x). We will then classify an observation to the class for which p_k(x) is greatest. In order to estimate f_k(x), we will first make some assumptions about its form. Suppose we assume that f_k(x) is normal or Gaussian. In the one-dimensional setting, the normal density takes the form

  f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right),    (4.11)

where μ_k and σ_k² are the mean and variance parameters for the kth class. For now, let us further assume that σ_1² = ... = σ_K²: that is, there is a shared variance term across all K classes, which for simplicity we can denote by σ². Plugging (4.11) into (4.10), we find that

  p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}.    (4.12)

(Note that in (4.12), π_k denotes the prior probability that an observation belongs to the kth class, not to be confused with π ≈ 3.14159, the mathematical constant.) The Bayes classifier involves assigning an observation

FIGURE 4.4. Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.

X = x to the class for which (4.12) is largest. Taking the log of (4.12) and rearranging the terms, it is not hard to show that this is equivalent to assigning the observation to the class for which

  \delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)    (4.13)

is largest. For instance, if K = 2 and π_1 = π_2, then the Bayes classifier assigns an observation to class 1 if 2x(μ_1 − μ_2) > μ_1² − μ_2², and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

  x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}.    (4.14)

An example is shown in the left-hand panel of Figure 4.4. The two normal density functions that are displayed, f_1(x) and f_2(x), represent two distinct classes. The mean and variance parameters for the two density functions are μ_1 = −1.25, μ_2 = 1.25, and σ_1² = σ_2² = 1. The two densities overlap, and so given that X = x, there is some uncertainty about the class to which the observation belongs. If we assume that an observation is equally likely to come from either class, that is, π_1 = π_2 = 0.5, then by inspection of (4.14), we see that the Bayes classifier assigns the observation to class 1 if x < 0 and class 2 otherwise. Note that in this case, we can compute the Bayes classifier because we know that X is drawn from a Gaussian distribution within each class, and we know all of the parameters involved. In a real-life situation, we are not able to calculate the Bayes classifier.

In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters μ_1, ..., μ_K, π_1, ..., π_K, and σ². The linear discriminant

analysis (LDA) method approximates the Bayes classifier by plugging estimates for π_k, μ_k, and σ² into (4.13). In particular, the following estimates are used:

  \hat\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i,
  \qquad
  \hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^K \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2,    (4.15)

where n is the total number of training observations, and n_k is the number of training observations in the kth class. The estimate for μ_k is simply the average of all the training observations from the kth class, while σ̂² can be seen as a weighted average of the sample variances for each of the K classes. Sometimes we have knowledge of the class membership probabilities π_1, ..., π_K, which can be used directly. In the absence of any additional information, LDA estimates π_k using the proportion of the training observations that belong to the kth class. In other words,

  \hat\pi_k = n_k / n.    (4.16)

The LDA classifier plugs the estimates given in (4.15) and (4.16) into (4.13), and assigns an observation X = x to the class for which

  \hat\delta_k(x) = x \cdot \frac{\hat\mu_k}{\hat\sigma^2} - \frac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k)    (4.17)

is largest. The word linear in the classifier's name stems from the fact that the discriminant functions δ̂_k(x) in (4.17) are linear functions of x (as opposed to a more complex function of x).

The right-hand panel of Figure 4.4 displays a histogram of a random sample of 20 observations from each class. To implement LDA, we began by estimating π_k, μ_k, and σ² using (4.15) and (4.16). We then computed the decision boundary, shown as a black solid line, that results from assigning an observation to the class for which (4.17) is largest. All points to the left of this line will be assigned to the green class, while points to the right of this line are assigned to the purple class. In this case, since n_1 = n_2 = 20, we have π̂_1 = π̂_2. As a result, the decision boundary corresponds to the midpoint between the sample means for the two classes, (μ̂_1 + μ̂_2)/2. The figure indicates that the LDA decision boundary is slightly to the left of the optimal Bayes decision boundary, which instead equals (μ_1 + μ_2)/2 = 0.

How well does the LDA classifier perform on this data? Since this is simulated data, we can generate a large number of test observations in order to compute the Bayes error rate and the LDA test error rate. These are 10.6% and 11.1%, respectively. In other words, the LDA classifier's error rate is only 0.5% above the smallest possible error rate! This indicates that LDA is performing pretty well on this data set.
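The plug-in recipe in (4.15)-(4.17) is simple enough to code directly. The following is a minimal R sketch, assuming the simulation settings described above (μ_1 = −1.25, μ_2 = 1.25, σ = 1, and 20 observations per class); it is not the code used to produce Figure 4.4, and the test point x0 is an arbitrary choice.

set.seed(1)
x = c(rnorm(20, mean = -1.25), rnorm(20, mean = 1.25))  # 20 observations per class
y = rep(c(1, 2), each = 20)
n = length(x); K = 2
mu.hat     = tapply(x, y, mean)                    # class means, as in (4.15)
sigma2.hat = sum((x - mu.hat[y])^2) / (n - K)      # pooled variance, as in (4.15)
pi.hat     = table(y) / n                          # class proportions, as in (4.16)
# Discriminant score (4.17); classify a new point x0 to the class with the larger score
delta = function(x0, k)
  x0 * mu.hat[k] / sigma2.hat - mu.hat[k]^2 / (2 * sigma2.hat) + log(pi.hat[k])
x0 = 0.4
which.max(c(delta(x0, 1), delta(x0, 2)))
# With equal priors the LDA boundary is the midpoint of the two sample means
(mu.hat[1] + mu.hat[2]) / 2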

FIGURE 4.5. Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated. Right: The two variables have a correlation of 0.7.

To reiterate, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance σ², and plugging estimates for these parameters into the Bayes classifier. In Section 4.4.4, we will consider a less stringent set of assumptions, by allowing the observations in the kth class to have a class-specific variance, σ_k².

4.4.3 Linear Discriminant Analysis for p > 1

We now extend the LDA classifier to the case of multiple predictors. To do this, we will assume that X = (X_1, X_2, ..., X_p) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix. We begin with a brief review of such a distribution.

The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, as in (4.11), with some correlation between each pair of predictors. Two examples of multivariate Gaussian distributions with p = 2 are shown in Figure 4.5. The height of the surface at any particular point represents the probability that both X_1 and X_2 fall in a small region around that point. In either panel, if the surface is cut along the X_1 axis or along the X_2 axis, the resulting cross-section will have the shape of a one-dimensional normal distribution. The left-hand panel of Figure 4.5 illustrates an example in which Var(X_1) = Var(X_2) and Cor(X_1, X_2) = 0; this surface has a characteristic bell shape. However, the bell shape will be distorted if the predictors are correlated or have unequal variances, as is illustrated in the right-hand panel of Figure 4.5. In this situation, the base of the bell will have an elliptical, rather than circular,

FIGURE 4.6. An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines.

shape. To indicate that a p-dimensional random variable X has a multivariate Gaussian distribution, we write X ~ N(μ, Σ). Here E(X) = μ is the mean of X (a vector with p components), and Cov(X) = Σ is the p × p covariance matrix of X. Formally, the multivariate Gaussian density is defined as

  f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).    (4.18)

In the case of p > 1 predictors, the LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(μ_k, Σ), where μ_k is a class-specific mean vector, and Σ is a covariance matrix that is common to all K classes. Plugging the density function for the kth class, f_k(X = x), into (4.10) and performing a little bit of algebra reveals that the Bayes classifier assigns an observation X = x to the class for which

  \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k    (4.19)

is largest. This is the vector/matrix version of (4.13).

An example is shown in the left-hand panel of Figure 4.6. Three equally-sized Gaussian classes are shown with class-specific mean vectors and a common covariance matrix. The three ellipses represent regions that contain 95% of the probability for each of the three classes. The dashed lines

are the Bayes decision boundaries. In other words, they represent the set of values x for which δ_k(x) = δ_l(x); i.e.

  x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k = x^T \Sigma^{-1} \mu_l - \frac{1}{2}\mu_l^T \Sigma^{-1} \mu_l    (4.20)

for k ≠ l. (The log π_k term from (4.19) has disappeared because each of the three classes has the same number of training observations; i.e. π_k is the same for each class.) Note that there are three lines representing the Bayes decision boundaries because there are three pairs of classes among the three classes. That is, one Bayes decision boundary separates class 1 from class 2, one separates class 1 from class 3, and one separates class 2 from class 3. These three Bayes decision boundaries divide the predictor space into three regions. The Bayes classifier will classify an observation according to the region in which it is located.

Once again, we need to estimate the unknown parameters μ_1, ..., μ_K, π_1, ..., π_K, and Σ; the formulas are similar to those used in the one-dimensional case, given in (4.15). To assign a new observation X = x, LDA plugs these estimates into (4.19) and classifies to the class for which δ̂_k(x) is largest. Note that in (4.19) δ_k(x) is a linear function of x; that is, the LDA decision rule depends on x only through a linear combination of its elements. Once again, this is the reason for the word linear in LDA.

In the right-hand panel of Figure 4.6, 20 observations drawn from each of the three classes are displayed, and the resulting LDA decision boundaries are shown as solid black lines. Overall, the LDA decision boundaries are pretty close to the Bayes decision boundaries, shown again as dashed lines. The test error rates for the Bayes and LDA classifiers are similar, which indicates that LDA is performing well on this data.

We can perform LDA on the Default data in order to predict whether or not an individual will default on the basis of credit card balance and student status. The LDA model fit to the 10,000 training samples results in a training error rate of 2.75%. This sounds like a low error rate, but two caveats must be noted. First of all, training error rates will usually be lower than test error rates, which are the real quantity of interest. In other words, we might expect this classifier to perform worse if we use it to predict whether or not a new set of individuals will default. The reason is that we specifically adjust the parameters of our model to do well on the training data. The higher the ratio of parameters p to number of samples n, the more we expect this overfitting to play a role. For these data we don't expect this to be a problem, since p = 3 and n = 10,000. Second, since only 3.33% of the individuals in the training sample defaulted, a simple but useless classifier that always predicts that

                         True default status
                         No       Yes     Total
Predicted       No       9,644    252     9,896
default status  Yes      23       81      104
                Total    9,667    333     10,000

TABLE 4.4. A confusion matrix compares the LDA predictions to the true default statuses for the 10,000 training observations in the Default data set. Elements on the diagonal of the matrix represent individuals whose default statuses were correctly predicted, while off-diagonal elements represent individuals that were misclassified. LDA made incorrect predictions for 23 individuals who did not default and for 252 individuals who did default.

each individual will not default, regardless of his or her credit card balance and student status, will result in an error rate of 3.33%. In other words, the trivial null classifier will achieve an error rate that is only a bit higher than the LDA training set error rate.

In practice, a binary classifier such as this one can make two types of errors: it can incorrectly assign an individual who defaults to the no default category, or it can incorrectly assign an individual who does not default to the default category. It is often of interest to determine which of these two types of errors are being made. A confusion matrix, shown for the Default data in Table 4.4, is a convenient way to display this information. The table reveals that LDA predicted that a total of 104 people would default. Of these people, 81 actually defaulted and 23 did not. Hence only 23 out of 9,667 of the individuals who did not default were incorrectly labeled. This looks like a pretty low error rate! However, of the 333 individuals who defaulted, 252 (or 75.7%) were missed by LDA. So while the overall error rate is low, the error rate among individuals who defaulted is very high. From the perspective of a credit card company that is trying to identify high-risk individuals, an error rate of 252/333 = 75.7% among individuals who default may well be unacceptable.

Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test. In this case the sensitivity is the percentage of true defaulters that are identified, a low 24.3% in this case. The specificity is the percentage of non-defaulters that are correctly identified, here (1 − 23/9,667) × 100 = 99.8%.

Why does LDA do such a poor job of classifying the customers who default? In other words, why does it have such a low sensitivity? As we have seen, LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers (if the Gaussian model is correct). That is, the Bayes classifier will yield the smallest possible total number of misclassified observations, irrespective of which class the errors come from. That is, some misclassifications will result from incorrectly assigning

                         True default status
                         No       Yes     Total
Predicted       No       9,432    138     9,570
default status  Yes      235      195     430
                Total    9,667    333     10,000

TABLE 4.5. A confusion matrix compares the LDA predictions to the true default statuses for the 10,000 training observations in the Default data set, using a modified threshold value that predicts default for any individuals whose posterior default probability exceeds 20%.

a customer who does not default to the default class, and others will result from incorrectly assigning a customer who defaults to the non-default class. In contrast, a credit card company might particularly wish to avoid incorrectly classifying an individual who will default, whereas incorrectly classifying an individual who will not default, though still to be avoided, is less problematic. We will now see that it is possible to modify LDA in order to develop a classifier that better meets the credit card company's needs.

The Bayes classifier works by assigning an observation to the class for which the posterior probability p_k(X) is greatest. In the two-class case, this amounts to assigning an observation to the default class if

  \Pr(\text{default} = \text{Yes}\,|\,X = x) > 0.5.    (4.21)

Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. For instance, we might label any customer with a posterior probability of default above 20% to the default class. In other words, instead of assigning an observation to the default class if (4.21) holds, we could instead assign an observation to this class if

  \Pr(\text{default} = \text{Yes}\,|\,X = x) > 0.2.    (4.22)

The error rates that result from taking this approach are shown in Table 4.5. Now LDA predicts that 430 individuals will default. Of the 333 individuals who default, LDA correctly predicts all but 138, or 41.4%. This is a vast improvement over the error rate of 75.7% that resulted from using the threshold of 50%. However, this improvement comes at a cost: now 235 individuals who do not default are incorrectly classified. As a result, the overall error rate has increased slightly to 3.73%. But a credit card company may consider this slight increase in the total error rate to be a small price to pay for more accurate identification of individuals who do indeed default.

Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default. Various error rates are

FIGURE 4.7. For the Default data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment. The black solid line displays the overall error rate. The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers.

shown as a function of the threshold value. Using a threshold of 0.5, as in (4.21), minimizes the overall error rate, shown as a black solid line. This is to be expected, since the Bayes classifier uses a threshold of 0.5 and is known to have the lowest overall error rate. But when a threshold of 0.5 is used, the error rate among the individuals who default is quite high (blue dashed line). As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases. How can we decide which threshold value is best? Such a decision must be based on domain knowledge, such as detailed information about the costs associated with default.

The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. The name ROC is historic, and comes from communications theory. It is an acronym for receiver operating characteristics. Figure 4.8 displays the ROC curve for the LDA classifier on the training data. The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier. For this data the AUC is 0.95, which is close to the maximum of one, so would be considered very good. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training). ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds. It turns out that the ROC curve for the logistic regression model of Section 4.3.4 fit to these data is virtually indistinguishable from this one for the LDA model, so we do not display it here.

As we have seen above, varying the classifier threshold changes its true positive and false positive rate. These are also called the sensitivity and one minus the specificity of our classifier.

FIGURE 4.8. A ROC curve for the LDA classifier on the Default data. It traces out two types of error as we vary the threshold value for the posterior probability of default. The actual thresholds are not shown. The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the "no information" classifier; this is what we would expect if student status and credit card balance are not associated with probability of default.

                              Predicted class
                        − or Null          + or Non-null      Total
True    − or Null       True Neg. (TN)     False Pos. (FP)    N
class   + or Non-null   False Neg. (FN)    True Pos. (TP)     P
        Total           N*                 P*

TABLE 4.6. Possible results when applying a classifier or diagnostic test to a population.

Since there is an almost bewildering array of terms used in this context, we now give a summary. Table 4.6 shows the possible results when applying a classifier (or diagnostic test) to a population. To make the connection with the epidemiology literature, we think of "+" as the disease that we are trying to detect, and "−" as the non-disease state. To make the connection to the classical hypothesis testing literature, we think of "−" as the null hypothesis and "+" as the alternative (non-null) hypothesis. In the context of the Default data, "+" indicates an individual who defaults, and "−" indicates one who does not.
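As a hedged illustration of how an ROC curve and its AUC can be computed in practice, the sketch below uses the ROCR package (which is not part of this chapter's lab) on a small simulated two-class data set; the data, the lda() fit, and the choice of package are all assumptions made purely for illustration.

library(MASS)   # for lda()
library(ROCR)   # for prediction() and performance()
set.seed(1)
# Toy two-class data standing in for an example like Default (not the book's data)
x = c(rnorm(300), rnorm(100, mean = 1))
truth = factor(rep(c("No", "Yes"), c(300, 100)))
fit = lda(truth ~ x)
probs = predict(fit)$posterior[, "Yes"]     # posterior probability of the "Yes" class
pred = prediction(probs, truth)             # ROCR prediction object ("Yes" is the positive class)
perf = performance(pred, "tpr", "fpr")      # true positive rate vs. false positive rate
plot(perf)                                  # traces the ROC curve as the threshold varies
performance(pred, "auc")@y.values[[1]]      # area under the ROC curve (AUC)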

Name               Definition   Synonyms
False Pos. rate    FP/N         Type I error, 1 − Specificity
True Pos. rate     TP/P         1 − Type II error, power, sensitivity, recall
Pos. Pred. value   TP/P*        Precision, 1 − false discovery proportion
Neg. Pred. value   TN/N*

TABLE 4.7. Important measures for classification and diagnostic testing, derived from quantities in Table 4.6.

Table 4.7 lists many of the popular performance measures that are used in this context. The denominators for the false positive and true positive rates are the actual population counts in each class. In contrast, the denominators for the positive predictive value and the negative predictive value are the total predicted counts for each class.

4.4.4 Quadratic Discriminant Analysis

As we have discussed, LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the kth class is of the form X ~ N(μ_k, Σ_k), where Σ_k is a covariance matrix for the kth class. Under this assumption, the Bayes classifier assigns an observation X = x to the class for which

  \delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log\pi_k
              = -\frac{1}{2}x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k    (4.23)

is largest. So the QDA classifier involves plugging estimates for Σ_k, μ_k, and π_k into (4.23), and then assigning an observation X = x to the class for which this quantity is largest. Unlike in (4.19), the quantity x appears as a quadratic function in (4.23). This is where QDA gets its name.

Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice-versa? The answer lies in the bias-variance trade-off. When there are p predictors, then estimating a covariance matrix requires estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters. With 50 predictors this

FIGURE 4.9. Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with Σ_1 = Σ_2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ_1 ≠ Σ_2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA.

is some multiple of 1,275, which is a lot of parameters. By instead assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are Kp linear coefficients to estimate. Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.

Figure 4.9 illustrates the performances of LDA and QDA in two scenarios. In the left-hand panel, the two Gaussian classes have a common correlation of 0.7 between X_1 and X_2. As a result, the Bayes decision boundary is linear and is accurately approximated by the LDA decision boundary. The QDA decision boundary is inferior, because it suffers from higher variance without a corresponding decrease in bias. In contrast, the right-hand panel displays a situation in which the orange class has a correlation of 0.7 between the variables and the blue class has a correlation of −0.7. Now the Bayes decision boundary is quadratic, and so QDA more accurately approximates this boundary than does LDA.
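The bias-variance trade-off just described can be seen in a few lines of R. The following is a minimal sketch, assuming the lda() and qda() functions from the MASS library (used later in the lab of Section 4.6) and simulation settings chosen purely for illustration; these are not the settings behind Figure 4.9.

library(MASS)
set.seed(1)
Sigma1 = matrix(c(1, 0.7, 0.7, 1), 2)      # correlation 0.7
Sigma2 = matrix(c(1, -0.7, -0.7, 1), 2)    # correlation -0.7
sim = function(n, same.cov) {
  # n observations per class; class 2 either shares Sigma1 or uses Sigma2
  x = rbind(mvrnorm(n, c(0, 0), Sigma1),
            mvrnorm(n, c(1, 1), if (same.cov) Sigma1 else Sigma2))
  data.frame(X1 = x[, 1], X2 = x[, 2], y = factor(rep(1:2, each = n)))
}
for (same in c(TRUE, FALSE)) {
  train = sim(100, same); test = sim(1000, same)
  err = function(fit) mean(predict(fit, test)$class != test$y)
  cat("same covariance:", same,
      " LDA error:", err(lda(y ~ X1 + X2, data = train)),
      " QDA error:", err(qda(y ~ X1 + X2, data = train)), "\n")
}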

4.5 A Comparison of Classification Methods

In this chapter, we have considered three different classification approaches: logistic regression, LDA, and QDA. In Chapter 2, we also discussed the K-nearest neighbors (KNN) method. We now consider the types of scenarios in which one approach might dominate the others.

Though their motivations differ, the logistic regression and LDA methods are closely connected. Consider the two-class setting with p = 1 predictor, and let p_1(x) and p_2(x) = 1 − p_1(x) be the probabilities that the observation X = x belongs to class 1 and class 2, respectively. In the LDA framework, we can see from (4.12) to (4.13) (and a bit of simple algebra) that the log odds is given by

  \log\!\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \log\!\left(\frac{p_1(x)}{p_2(x)}\right) = c_0 + c_1 x,    (4.24)

where c_0 and c_1 are functions of μ_1, μ_2, and σ². From (4.4), we know that in logistic regression,

  \log\!\left(\frac{p_1}{1 - p_1}\right) = \beta_0 + \beta_1 x.    (4.25)

Both (4.24) and (4.25) are linear functions of x. Hence, both logistic regression and LDA produce linear decision boundaries. The only difference between the two approaches lies in the fact that β_0 and β_1 are estimated using maximum likelihood, whereas c_0 and c_1 are computed using the estimated mean and variance from a normal distribution. This same connection between LDA and logistic regression also holds for multidimensional data with p > 1.

Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. This is often, but not always, the case. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.

Recall from Chapter 2 that KNN takes a completely different approach from the classifiers seen in this chapter. In order to make a prediction for an observation X = x, the K training observations that are closest to x are identified. Then X is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important; we don't get a table of coefficients as in Table 4.3.
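Returning to (4.24), here is one way to obtain c_0 and c_1 from (4.12) under the shared-variance Gaussian assumption; this short derivation is a sketch and is not spelled out in the text.

\[
\log\frac{p_1(x)}{p_2(x)}
  = \log\frac{\pi_1\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_1)^2\right)}
             {\pi_2\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_2)^2\right)}
  = \underbrace{\log\frac{\pi_1}{\pi_2} - \frac{\mu_1^2-\mu_2^2}{2\sigma^2}}_{c_0}
    \;+\; \underbrace{\frac{\mu_1-\mu_2}{\sigma^2}}_{c_1}\, x .
\]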

FIGURE 4.10. Boxplots of the test error rates for each of the linear scenarios described in the main text. In each panel the methods compared are KNN-1, KNN-CV, LDA, logistic regression, and QDA.

FIGURE 4.11. Boxplots of the test error rates for each of the non-linear scenarios described in the main text. In each panel the methods compared are KNN-1, KNN-CV, LDA, logistic regression, and QDA.

Finally, QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than can the linear methods. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary.

To illustrate the performances of these four classification approaches, we generated data from six different scenarios. In three of the scenarios, the Bayes decision boundary is linear, and in the remaining scenarios it is non-linear. For each scenario, we produced 100 random training data sets. On each of these training sets, we fit each method to the data and computed the resulting test error rate on a large test set. Results for the linear scenarios are shown in Figure 4.10, and the results for the non-linear scenarios are in Figure 4.11. The KNN method requires selection of K, the number of neighbors. We performed KNN with two values of K: K = 1,

and a value of K that was chosen automatically using an approach called cross-validation, which we discuss further in Chapter 5.

In each of the six scenarios, there were p = 2 predictors. The scenarios were as follows:

- Scenario 1: There were 20 training observations in each of two classes. The observations within each class were uncorrelated random normal variables with a different mean in each class. The left-hand panel of Figure 4.10 shows that LDA performed well in this setting, as one would expect since this is the model assumed by LDA. KNN performed poorly because it paid a price in terms of variance that was not offset by a reduction in bias. QDA also performed worse than LDA, since it fit a more flexible classifier than necessary. Since logistic regression assumes a linear decision boundary, its results were only slightly inferior to those of LDA.

- Scenario 2: Details are as in Scenario 1, except that within each class, the two predictors had a correlation of 0.5. The center panel of Figure 4.10 indicates little change in the relative performances of the methods as compared to the previous scenario.

- Scenario 3: We generated X_1 and X_2 from the t-distribution, with 50 observations per class. The t-distribution has a similar shape to the normal distribution, but it has a tendency to yield more extreme points, that is, more points that are far from the mean. In this setting, the decision boundary was still linear, and so fit into the logistic regression framework. The set-up violated the assumptions of LDA, since the observations were not drawn from a normal distribution. The right-hand panel of Figure 4.10 shows that logistic regression outperformed LDA, though both methods were superior to the other approaches. In particular, the QDA results deteriorated considerably as a consequence of non-normality.

- Scenario 4: The data were generated from a normal distribution, with a correlation of 0.5 between the predictors in the first class, and correlation of −0.5 between the predictors in the second class. This setup corresponded to the QDA assumption, and resulted in quadratic decision boundaries. The left-hand panel of Figure 4.11 shows that QDA outperformed all of the other approaches.

- Scenario 5: Within each class, the observations were generated from a normal distribution with uncorrelated predictors. However, the responses were sampled from the logistic function using X_1², X_2², and X_1 × X_2 as predictors. Consequently, there is a quadratic decision boundary. The center panel of Figure 4.11 indicates that QDA once again performed best, followed closely by KNN-CV. The linear methods had poor performance.

- Scenario 6: Details are as in the previous scenario, but the responses were sampled from a more complicated non-linear function. As a result, even the quadratic decision boundaries of QDA could not adequately model the data. The right-hand panel of Figure 4.11 shows that QDA gave slightly better results than the linear methods, while the much more flexible KNN-CV method gave the best results. But KNN with K = 1 gave the worst results out of all methods. This highlights the fact that even when the data exhibits a complex non-linear relationship, a non-parametric method such as KNN can still give poor results if the level of smoothness is not chosen correctly.

These six examples illustrate that no one method will dominate the others in every situation. When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully. In the next chapter we examine a number of approaches for choosing the correct level of smoothness and, in general, for selecting the best overall method.

Finally, recall from Chapter 3 that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including X², X³, and even X⁴ as predictors. This may or may not improve logistic regression's performance, depending on whether the increase in variance due to the added flexibility is offset by a sufficiently large reduction in bias. We could do the same for LDA. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different. This device allows us to move somewhere between an LDA and a QDA model.

4.6 Lab: Logistic Regression, LDA, QDA, and KNN

4.6.1 The Stock Market Data

We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR library. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded

on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).

> library(ISLR)
> names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"
[6] "Lag5"      "Volume"    "Today"     "Direction"
> dim(Smarket)
[1] 1250    9
> summary(Smarket)
      Year          Lag1            Lag2
 Min.   :2001   Min.   :        Min.   :
 1st Qu.:2002   1st Qu.:        1st Qu.:
 Median :2003   Median :        Median :
 Mean   :2003   Mean   :        Mean   :
 3rd Qu.:2004   3rd Qu.:        3rd Qu.:
 Max.   :2005   Max.   :        Max.   :
      Lag3            Lag4            Lag5
 Min.   :        Min.   :        Min.   :
 1st Qu.:        1st Qu.:        1st Qu.:
 Median :        Median :        Median :
 Mean   :        Mean   :        Mean   :
 3rd Qu.:        3rd Qu.:        3rd Qu.:
 Max.   :        Max.   :        Max.   :
     Volume          Today         Direction
 Min.   :0.356   Min.   :        Down:602
 1st Qu.:        1st Qu.:        Up  :648
 Median :1.423   Median :
 Mean   :1.478   Mean   :
 3rd Qu.:        3rd Qu.:
 Max.   :3.152   Max.   :
> pairs(Smarket)

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.

> cor(Smarket)
Error in cor(Smarket) : 'x' must be numeric
> cor(Smarket[,-9])
         Year  Lag1  Lag2  Lag3  Lag4  Lag5  Volume  Today
Year
Lag1
Lag2
Lag3
Lag4
Lag5
Volume
Today

As one would expect, the correlations between the lag variables and today's returns are close to zero. In other words, there appears to be little correlation between today's returns and previous days' returns. The only substantial correlation is between Year and Volume. By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.

> attach(Smarket)
> plot(Volume)

4.6.2 Logistic Regression

Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
    data=Smarket,family=binomial)
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)
Lag1
Lag2
Lag3
Lag4
Lag5
Volume

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:        on 1249  degrees of freedom
Residual deviance:        on 1243  degrees of freedom
AIC: 1742

Number of Fisher Scoring iterations: 3

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.

> coef(glm.fit)
(Intercept)       Lag1       Lag2       Lag3       Lag4
       Lag5     Volume
> summary(glm.fit)$coef
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)
Lag1
Lag2
Lag3
Lag4
Lag5
Volume
> summary(glm.fit)$coef[,4]
(Intercept)       Lag1       Lag2       Lag3       Lag4
       Lag5     Volume

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

> glm.probs=predict(glm.fit,type="response")
> glm.probs[1:10]

> contrasts(Direction)
     Up
Down  0
Up    1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

> glm.pred=rep("Down",1250)
> glm.pred[glm.probs>.5]="Up"

The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

> table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
> (507+145)/1250
[1] 0.5216
> mean(glm.pred==Direction)
[1] 0.5216

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 52.2% of the time.

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, 100 − 52.2 = 47.8% is the training error rate. As we have seen previously, the training error rate is often overly optimistic; it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model's performance not on the data that we used to fit the model, but rather on days in the future for which the market's movements are unknown.

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

> train=(Year<2005)
> Smarket.2005=Smarket[!train,]
> dim(Smarket.2005)
[1] 252   9
> Direction.2005=Direction[!train]

The object train is a vector of 1,250 elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command Smarket[train,] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE. The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train,] yields a submatrix of the stock market data containing only the observations for which train is FALSE, that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set, that is, for the days in 2005.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
    data=Smarket,family=binomial,subset=train)
> glm.probs=predict(glm.fit,Smarket.2005,type="response")

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

> glm.pred=rep("Down",252)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction.2005)
        Direction.2005
glm.pred Down Up
    Down
    Up
> mean(glm.pred==Direction.2005)

[1] 0.48
> mean(glm.pred!=Direction.2005)
[1] 0.52

The != notation means not equal to, and so the last command computes the test set error rate. The results are rather disappointing: the test error rate is 52%, which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days' returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book would be out striking it rich rather than writing a statistics textbook.)

We recall that the logistic regression model had very underwhelming p-values associated with all of the predictors, and that the smallest p-value, though not very small, corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

> glm.fit=glm(Direction~Lag1+Lag2,data=Smarket,family=binomial,
    subset=train)
> glm.probs=predict(glm.fit,Smarket.2005,type="response")
> glm.pred=rep("Down",252)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
> mean(glm.pred==Direction.2005)
[1] 0.56
> 106/(106+76)
[1] 0.582

Now the results appear to be a little better: 56% of the daily movements have been correctly predicted. It is worth noting that in this case, a much simpler strategy of predicting that the market will increase every day will also be correct 56% of the time! Hence, in terms of overall error rate, the logistic regression method is no better than the naïve approach. However, the confusion matrix shows that on days when logistic regression predicts an increase in the market, it has a 58% accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.
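The naïve benchmark mentioned above can be checked directly; this one-line sketch (not part of the original lab) reuses the Direction.2005 vector defined earlier.

> # Fraction of test days on which the market actually went up: the accuracy of
> # a classifier that always predicts "Up"
> mean(Direction.2005=="Up")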

Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8. We do this using the predict() function.

> predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),
    Lag2=c(1.1,-0.8)),type="response")

4.6.3 Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library. Notice that the syntax for the lda() function is identical to that of lm(), and to that of glm() except for the absence of the family option. We fit the model using only the observations before 2005.

> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
 Down    Up
0.492 0.508

Group means:
       Lag1   Lag2
Down
Up

Coefficients of linear discriminants:
       LD1
Lag1
Lag2
> plot(lda.fit)

The LDA output indicates that π̂_1 = 0.492 and π̂_2 = 0.508; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of μ_k. These suggest that there is a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule. In other words, these are the multipliers of the elements of X = x in (4.19). If this linear combination of Lag1 and Lag2 is large, then the LDA classifier will

predict a market increase, and if it is small, then the LDA classifier will predict a market decline. The plot() function produces plots of the linear discriminants, obtained by computing this linear combination of Lag1 and Lag2 for each of the training observations.

The predict() function returns a list with three elements. The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.

> lda.pred=predict(lda.fit, Smarket.2005)
> names(lda.pred)
[1] "class"     "posterior" "x"

As we observed in Section 4.5, the LDA and logistic regression predictions are almost identical.

> lda.class=lda.pred$class
> table(lda.class,Direction.2005)
         Direction.2005
lda.class Down Up
     Down
     Up
> mean(lda.class==Direction.2005)
[1] 0.56

Applying a 50% threshold to the posterior probabilities allows us to recreate the predictions contained in lda.pred$class.

> sum(lda.pred$posterior[,1]>=.5)
[1] 70
> sum(lda.pred$posterior[,1]<.5)
[1] 182

Notice that the posterior probability output by the model corresponds to the probability that the market will decrease:

> lda.pred$posterior[1:20,1]
> lda.class[1:20]

If we wanted to use a posterior probability threshold other than 50% in order to make predictions, then we could easily do so. For instance, suppose that we wish to predict a market decrease only if we are very certain that the market will indeed decrease on that day, say, if the posterior probability is at least 90%.

> sum(lda.pred$posterior[,1]>.9)
[1] 0

No days in 2005 meet that threshold! In fact, the greatest posterior probability of decrease in all of 2005 was below 90%.
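As a hedged aside connecting back to Section 4.4.3, class predictions can be rebuilt from these posterior probabilities with any threshold we like; the 0.35 cut-off below is an arbitrary value chosen only for illustration, and this snippet is not part of the original lab.

> lda.pred.custom=rep("Up",252)
> lda.pred.custom[lda.pred$posterior[,1]>.35]="Down"   # column 1 holds Pr(market decrease)
> table(lda.pred.custom,Direction.2005)
> mean(lda.pred.custom==Direction.2005)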

4.6.4 Quadratic Discriminant Analysis

We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda().

> qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
 Down    Up
0.492 0.508

Group means:
       Lag1   Lag2
Down
Up

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.

> qda.class=predict(qda.fit,Smarket.2005)$class
> table(qda.class,Direction.2005)
         Direction.2005
qda.class Down Up
     Down
     Up
> mean(qda.class==Direction.2005)

Interestingly, the QDA predictions are accurate almost 60% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is known to be quite hard to model accurately. This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, we recommend evaluating this method's performance on a larger test set before betting that this approach will consistently beat the market!

4.6.5 K-Nearest Neighbors

We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs.

1. A matrix containing the predictors associated with the training data, labeled train.X below.

2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.

3. A vector containing the class labels for the training observations, labeled train.Direction below.

4. A value for K, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.

> library(class)
> train.X=cbind(Lag1,Lag2)[train,]
> test.X=cbind(Lag1,Lag2)[!train,]
> train.Direction=Direction[train]

Now the knn() function can be used to predict the market's movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Direction,k=1)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down
    Up
> (83+43)/252
[1] 0.5

The results using K = 1 are not very good, since only 50% of the observations are correctly predicted. Of course, it may be that K = 1 results in an overly flexible fit to the data. Below, we repeat the analysis using K = 3.

> knn.pred=knn(train.X,test.X,train.Direction,k=3)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down
    Up
> mean(knn.pred==Direction.2005)

The results have improved slightly. But increasing K further turns out to provide no further improvements. It appears that for this data, QDA provides the best results of the methods that we have examined so far.
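As a small add-on (not part of the original lab), the claim that larger values of K do not help can be checked with a short loop over candidate values, reusing the objects defined above; the range 1 to 10 is an arbitrary choice.

> set.seed(1)
> # Test-set accuracy of KNN for K = 1, ..., 10
> acc=sapply(1:10,function(k) mean(knn(train.X,test.X,train.Direction,k=k)==Direction.2005))
> round(acc,3)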

4.6.6 An Application to Caravan Insurance Data

Finally, we will apply the KNN approach to the Caravan data set, which is part of the ISLR library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.

> dim(Caravan)
[1] 5822   86
> attach(Caravan)
> summary(Purchase)
  No  Yes
5474  348
> 348/5822
[1] 0.0598

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale. For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect. This is contrary to our intuition that a salary difference of $1,000 is quite small compared to an age difference of 50 years. Furthermore, the importance of scale to the KNN classifier leads to another issue: if we measured salary in Japanese yen, or if we measured age in minutes, then we'd get quite different classification results from what we get if these two variables are measured in dollars and years.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column 86, because that is the qualitative Purchase variable.

> standardized.X=scale(Caravan[,-86])
> var(Caravan[,1])
[1] 165
> var(Caravan[,2])
> var(standardized.X[,1])
[1] 1
> var(standardized.X[,2])
[1] 1

Now every column of standardized.X has a standard deviation of one and a mean of zero.
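To see concretely why this rescaling matters, here is a small made-up example (the numbers are invented purely for illustration and are not from the Caravan data) comparing pairwise distances before and after applying scale().

> toy=data.frame(salary=c(30000,31000,90000),age=c(25,75,26))
> dist(toy)          # distances are driven almost entirely by salary
> dist(scale(toy))   # after standardizing, age differences matter too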

181 Classificatin We nw split the bservatins int a test set, cntaining the first 1,000 bservatins, and a training set, cntaining the remaining bservatins. We fit a KNN mdel n the training data using K = 1, and evaluate its perfrmance n the test data. > test=1:1000 > train.x=standardized.x[-test,] > test.x=standardized.x[test,] > train.y=purchase[-test] > test.y=purchase[test] > set.seed(1) > knn.pred=knn(train.x,test.x,train.y,k=1) > mean(test.y!=knn.pred) [1] > mean(test.y!="n") [1] The vectr test is numeric, with values frm 1 thrugh 1, 000. Typing standardized.x[test,] yields the submatrix f the data cntaining the bservatins whse indices range frm 1 t 1, 000, whereas typing standardized.x[-test,] yields the submatrix cntaining the bservatins whse indices d nt range frm 1 t 1, 000. The KNN errr rate n the 1,000 test bservatins is just under 12 %. At first glance, this may appear t be fairly gd. Hwever, since nly 6 % f custmers purchased insurance, we culd get the errr rate dwn t 6 % by always predicting N regardless f the values f the predictrs! Suppse that there is sme nn-trivial cst t trying t sell insurance t a given individual. Fr instance, perhaps a salespersn must visit each ptential custmer. If the cmpany tries t sell insurance t a randm selectin f custmers, then the success rate will be nly 6 %, which may be far t lw given the csts invlved. Instead, the cmpany wuld like t try t sell insurance nly t custmers wh are likely t buy it. S the verall errr rate is nt f interest. Instead, the fractin f individuals that are crrectly predicted t buy insurance is f interest. It turns ut that KNN with K = 1 des far better than randm guessing amng the custmers that are predicted t buy insurance. Amng 77 such custmers, 9, r 11.7 %, actually d purchase insurance. This is duble the rate that ne wuld btain frm randm guessing. > table(knn.pred,test.y) test.y knn.pred N Yes N Yes 68 9 > 9/(68+9) [1] Using K = 3, the success rate increases t 19 %, and with K =5therateis 26.7 %. This is ver fur times the rate that results frm randm guessing. It appears that KNN is finding sme real patterns in a difficult data set!

182 4.6 Lab: Lgistic Regressin, LDA, QDA, and KNN 167 > knn.pred=knn(train.x,test.x,train.y,k=3) > table(knn.pred,test.y) test.y knn.pred N Yes N Yes 21 5 > 5/26 [1] > knn.pred=knn(train.x,test.x,train.y,k=5) > table(knn.pred,test.y) test.y knn.pred N Yes N Yes 11 4 > 4/15 [1] As a cmparisn, we can als fit a lgistic regressin mdel t the data. If we use 0.5 as the predicted prbability cut-ff fr the classifier, then we have a prblem: nly seven f the test bservatins are predicted t purchase insurance. Even wrse, we are wrng abut all f these! Hwever, we are nt required t use a cut-ff f 0.5. If we instead predict a purchase any time the predicted prbability f purchase exceeds 0.25, we get much better results: we predict that 33 peple will purchase insurance, and we are crrect fr abut 33 % f these peple. This is ver five times better than randm guessing! > glm.fit=glm(purchase.,data=caravan,family=binmial, subset=-test) Warning message: glm.fit: fitted prbabilities numerically 0 r 1 ccurred > glm.prbs=predict(glm.fit,caravan[test,],type="respnse ") > glm.pred=rep("n",1000) > glm.pred[glm.prbs >.5]="Yes" > table(glm.pred,test.y) test.y glm.pred N Yes N Yes 7 0 > glm.pred=rep("n",1000) > glm.pred[glm.prbs >.25]=" Yes" > table(glm.pred,test.y) test.y glm.pred N Yes N Yes > 11/(22+11) [1] 0.333

183 Classificatin 4.7 Exercises Cnceptual 1. Using a little bit f algebra, prve that (4.2) is equivalent t (4.3). In ther wrds, the lgistic functin representatin and lgit representatin fr the lgistic regressin mdel are equivalent. 2. It was stated in the text that classifying an bservatin t the class fr which (4.12) is largest is equivalent t classifying an bservatin t the class fr which (4.13) is largest. Prve that this is the case. In ther wrds, under the assumptin that the bservatins in the kth class are drawn frm a N(μ k,σ 2 ) distributin, the Bayes classifier assigns an bservatin t the class fr which the discriminant functin is maximized. 3. This prblem relates t the QDA mdel, in which the bservatins within each class are drawn frm a nrmal distributin with a classspecific mean vectr and a class specific cvariance matrix. We cnsider the simple case where p = 1; i.e. there is nly ne feature. Suppse that we have K classes, and that if an bservatin belngs t the kth class then X cmes frm a ne-dimensinal nrmal distributin, X N(μ k,σk 2 ). Recall that the density functin fr the ne-dimensinal nrmal distributin is given in (4.11). Prve that in this case, the Bayes classifier is nt linear. Argue that it is in fact quadratic. Hint: Fr this prblem, yu shuld fllw the arguments laid ut in Sectin 4.4.2, but withut making the assumptin that σ1 2 =...= σk When the number f features p is large, there tends t be a deteriratin in the perfrmance f KNN and ther lcal appraches that perfrm predictin using nly bservatins that are near the test bservatin fr which a predictin must be made. This phenmenn is knwn as the curse f dimensinality, and it ties int the fact that curse f dimensinality nn-parametric appraches ften perfrm prly when p is large. We will nw investigate this curse. (a) Suppse that we have a set f bservatins, each with measurements n p = 1 feature,x. We assume that X is unifrmly (evenly) distributed n [0, 1]. Assciated with each bservatin is a respnse value. Suppse that we wish t predict a test bservatin s respnse using nly bservatins that are within 10 % f the range f X clsest t that test bservatin. Fr instance, in rder t predict the respnse fr a test bservatin with X =0.6,

184 4.7 Exercises 169 we will use bservatins in the range [0.55, 0.65]. On average, what fractin f the available bservatins will we use t make the predictin? (b) Nw suppse that we have a set f bservatins, each with measurements n p =2features,X 1 and X 2. We assume that (X 1,X 2 ) are unifrmly distributed n [0, 1] [0, 1]. We wish t predict a test bservatin s respnse using nly bservatins that are within 10 % f the range f X 1 and within 10 % f the range f X 2 clsest t that test bservatin. Fr instance, in rder t predict the respnse fr a test bservatin with X 1 =0.6 and X 2 =0.35, we will use bservatins in the range [0.55, 0.65] fr X 1 and in the range [0.3, 0.4] fr X 2. On average, what fractin f the available bservatins will we use t make the predictin? (c) Nw suppse that we have a set f bservatins n p = 100 features. Again the bservatins are unifrmly distributed n each feature, and again each feature ranges in value frm 0 t 1. We wish t predict a test bservatin s respnse using bservatins within the 10 % f each feature s range that is clsest t that test bservatin. What fractin f the available bservatins will we use t make the predictin? (d) Using yur answers t parts (a) (c), argue that a drawback f KNN when p is large is that there are very few training bservatins near any given test bservatin. (e) Nw suppse that we wish t make a predictin fr a test bservatin by creating a p-dimensinal hypercube centered arund the test bservatin that cntains, n average, 10 % f the training bservatins. Fr p =1, 2, and 100, what is the length f each side f the hypercube? Cmment n yur answer. Nte: A hypercube is a generalizatin f a cube t an arbitrary number f dimensins. When p =1, a hypercube is simply a line segment, when p =2it is a square, and when p = 100 it is a 100-dimensinal cube. 5. We nw examine the differences between LDA and QDA. (a) If the Bayes decisin bundary is linear, d we expect LDA r QDA t perfrm better n the training set? On the test set? (b) If the Bayes decisin bundary is nn-linear, d we expect LDA r QDA t perfrm better n the training set? On the test set? (c) In general, as the sample size n increases, d we expect the test predictin accuracy f QDA relative t LDA t imprve, decline, r be unchanged? Why?

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

6. Suppose we collect data for a group of students in a statistics class with variables X_1 = hours studied, X_2 = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients \hat{\beta}_0 = -6, \hat{\beta}_1 = 0.05, \hat{\beta}_2 = 1.

(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

(b) How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

7. Suppose that we wish to predict whether a given stock will issue a dividend this year ("Yes" or "No") based on X, last year's percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was \bar{X} = 10, while the mean for those that didn't was \bar{X} = 0. In addition, the variance of X for these two sets of companies was \hat{\sigma}^2 = 36. Finally, 80 % of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year.

Hint: Recall that the density function for a normal random variable is f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2 / (2\sigma^2)}. You will need to use Bayes' theorem.

8. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20 % on the training data and 30 % on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18 %. Based on these results, which method should we prefer to use for classification of new observations? Why?

9. This problem has to do with odds.

(a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

(b) Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?

186 Applied 4.7 Exercises This questin shuld be answered using the Weekly data set, which is part f the ISLR package. This data is similar in nature t the Smarket data frm this chapter s lab, except that it cntains 1, 089 weekly returns fr 21 years, frm the beginning f 1990 t the end f (a) Prduce sme numerical and graphical summaries f the Weekly data. D there appear t be any patterns? (b) Use the full data set t perfrm a lgistic regressin with Directin as the respnse and the five lag variables plus Vlume as predictrs. Use the summary functin t print the results. D any f the predictrs appear t be statistically significant? If s, which nes? (c) Cmpute the cnfusin matrix and verall fractin f crrect predictins. Explain what the cnfusin matrix is telling yu abut the types f mistakes made by lgistic regressin. (d) Nw fit the lgistic regressin mdel using a training data perid frm 1990 t 2008, with Lag2 as the nly predictr. Cmpute the cnfusin matrix and the verall fractin f crrect predictins fr the held ut data (that is, the data frm 2009 and 2010). (e) Repeat (d) using LDA. (f) Repeat (d) using QDA. (g) Repeat (d) using KNN with K =1. (h) Which f these methds appears t prvide the best results n this data? (i) Experiment with different cmbinatins f predictrs, including pssible transfrmatins and interactins, fr each f the methds. Reprt the variables, methd, and assciated cnfusin matrix that appears t prvide the best results n the held ut data. Nte that yu shuld als experiment with values fr K in the KNN classifier. 11. In this prblem, yu will develp a mdel t predict whether a given car gets high r lw gas mileage based n the Aut data set. (a) Create a binary variable, mpg01, that cntains a 1 if mpg cntains a value abve its median, and a 0 if mpg cntains a value belw its median. Yu can cmpute the median using the median() functin. Nte yu may find it helpful t use the data.frame() functin t create a single data set cntaining bth mpg01 and the ther Aut variables.

187 Classificatin (b) Explre the data graphically in rder t investigate the assciatin between mpg01 and the ther features. Which f the ther features seem mst likely t be useful in predicting mpg01? Scatterplts and bxplts may be useful tls t answer this questin. Describe yur findings. (c) Split the data int a training set and a test set. (d) Perfrm LDA n the training data in rder t predict mpg01 using the variables that seemed mst assciated with mpg01 in (b). What is the test errr f the mdel btained? (e) Perfrm QDA n the training data in rder t predict mpg01 using the variables that seemed mst assciated with mpg01 in (b). What is the test errr f the mdel btained? (f) Perfrm lgistic regressin n the training data in rder t predict mpg01 using the variables that seemed mst assciated with mpg01 in (b). What is the test errr f the mdel btained? (g) Perfrm KNN n the training data, with several values f K, in rder t predict mpg01. Use nly the variables that seemed mst assciated with mpg01 in (b). What test errrs d yu btain? Which value f K seems t perfrm the best n this data set? 12. This prblem invlves writing functins. (a) Write a functin, Pwer(), that prints ut the result f raising 2 t the 3rd pwer. In ther wrds, yur functin shuld cmpute 2 3 and print ut the results. Hint: Recall that x^a raises x t the pwer a. Usetheprint() functin t utput the result. (b) Create a new functin, Pwer2(), that allws yu t pass any tw numbers, x and a, and prints ut the value f x^a. Yucan d this by beginning yur functin with the line > Pwer2=functin(x,a){ Yu shuld be able t call yur functin by entering, fr instance, > Pwer2(3,8) n the cmmand line. This shuld utput the value f 3 8,namely, 6, 561. (c) Using the Pwer2() functin that yu just wrte, cmpute 10 3, 8 17, and (d) Nw create a new functin, Pwer3(), that actually returns the result x^a as an R bject, rather than simply printing it t the screen. That is, if yu stre the value x^a in an bject called result within yur functin, then yu can simply return() this return() result, using the fllwing line:

188 4.7 Exercises 173 return(result) The line abve shuld be the last line in yur functin, befre the } symbl. (e) Nw using the Pwer3() functin, create a plt f f(x) =x 2. The x-axis shuld display a range f integers frm 1 t 10, and the y-axis shuld display x 2. Label the axes apprpriately, and use an apprpriate title fr the figure. Cnsider displaying either the x-axis, the y-axis, r bth n the lg-scale. Yu can d this by using lg= x, lg= y, rlg= xy as arguments t the plt() functin. (f) Create a functin, PltPwer(), that allws yu t create a plt f x against x^a fr a fixed a and fr a range f values f x. Fr instance, if yu call > PltPwer (1:10,3) then a plt shuld be created with an x-axis taking n values 1, 2,...,10, and a y-axis taking n values 1 3, 2 3,..., Using the Bstn data set, fit classificatin mdels in rder t predict whether a given suburb has a crime rate abve r belw the median. Explre lgistic regressin, LDA, and KNN mdels using varius subsets f the predictrs. Describe yur findings.


5 Resampling Methods

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.

Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap. Both methods are important tools in the practical application of many statistical learning procedures. For example, cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model's performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection. The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.

191 Resampling Methds 5.1 Crss-Validatin In Chapter 2 we discuss the distinctin between the test errr rate and the training errr rate. The test errr is the average errr that results frm using a statistical learning methd t predict the respnse n a new bservatin that is, a measurement that was nt used in training the methd. Given a data set, the use f a particular statistical learning methd is warranted if it results in a lw test errr. The test errr can be easily calculated if a designated test set is available. Unfrtunately, this is usually nt the case. In cntrast, the training errr can be easily calculated by applying the statistical learning methd t the bservatins used in its training. But as we saw in Chapter 2, the training errr rate ften is quite different frm the test errr rate, and in particular the frmer can dramatically underestimate the latter. In the absence f a very large designated test set that can be used t directly estimate the test errr rate, a number f techniques can be used t estimate this quantity using the available training data. Sme methds make a mathematical adjustment t the training errr rate in rder t estimate the test errr rate. Such appraches are discussed in Chapter 6. In this sectin, we instead cnsider a class f methds that estimate the test errr rate by hlding ut a subset f the training bservatins frm the fitting prcess, and then applying the statistical learning methd t thse held ut bservatins. In Sectins , fr simplicity we assume that we are interested in perfrming regressin with a quantitative respnse. In Sectin we cnsider the case f classificatin with a qualitative respnse. As we will see, the key cncepts remain the same regardless f whether the respnse is quantitative r qualitative The Validatin Set Apprach Suppse that we wuld like t estimate the test errr assciated with fitting a particular statistical learning methd n a set f bservatins. The validatin set apprach, displayed in Figure 5.1, is a very simple strategy validatin fr this task. It invlves randmly dividing the available set f bserva- set apprach tins int tw parts, a training set and a validatin set r hld-ut set. The validatin mdel is fit n the training set, and the fitted mdel is used t predict the respnses fr the bservatins in the validatin set. The resulting validatin set errr rate typically assessed using MSE in the case f a quantitative respnse prvides an estimate f the test errr rate. We illustrate the validatin set apprach n the Aut data set. Recall frm Chapter 3 that there appears t be a nn-linear relatinship between mpg and hrsepwer, and that a mdel that predicts mpg using hrsepwer and hrsepwer 2 gives better results than a mdel that uses nly a linear term. It is natural t wnder whether a cubic r higher-rder fit might prvide set hld-ut set

192 5.1 Crss-Validatin n FIGURE 5.1. A schematic display f the validatin set apprach. A set f n bservatins are randmly split int a training set (shwn in blue, cntaining bservatins 7, 22, and 13, amng thers) and a validatin set (shwn in beige, and cntaining bservatin 91, amng thers). The statistical learning methd is fit n the training set, and its perfrmance is evaluated n the validatin set. even better results. We answer this questin in Chapter 3 by lking at the p-values assciated with a cubic term and higher-rder plynmial terms in a linear regressin. But we culd als answer this questin using the validatin methd. We randmly split the 392 bservatins int tw sets, a training set cntaining 196 f the data pints, and a validatin set cntaining the remaining 196 bservatins. The validatin set errr rates that result frm fitting varius regressin mdels n the training sample and evaluating their perfrmance n the validatin sample, using MSE as a measure f validatin set errr, are shwn in the left-hand panel f Figure 5.2. The validatin set MSE fr the quadratic fit is cnsiderably smaller than fr the linear fit. Hwever, the validatin set MSE fr the cubic fit is actually slightly larger than fr the quadratic fit. This implies that including a cubic term in the regressin des nt lead t better predictin than simply using a quadratic term. Recall that in rder t create the left-hand panel f Figure 5.2, we randmly divided the data set int tw parts, a training set and a validatin set. If we repeat the prcess f randmly splitting the sample set int tw parts, we will get a smewhat different estimate fr the test MSE. As an illustratin, the right-hand panel f Figure 5.2 displays ten different validatin set MSE curves frm the Aut data set, prduced using ten different randm splits f the bservatins int training and validatin sets. All ten curves indicate that the mdel with a quadratic term has a dramatically smaller validatin set MSE than the mdel with nly a linear term. Furthermre, all ten curves indicate that there is nt much benefit in including cubic r higher-rder plynmial terms in the mdel. But it is wrth nting that each f the ten curves results in a different test MSE estimate fr each f the ten regressin mdels cnsidered. And there is n cnsensus amng the curves as t which mdel results in the smallest validatin set MSE. Based n the variability amng these curves, all that we can cnclude with any cnfidence is that the linear fit is nt adequate fr this data. The validatin set apprach is cnceptually simple and is easy t implement. But it has tw ptential drawbacks:

193 Resampling Methds Mean Squared Errr Mean Squared Errr Degree f Plynmial Degree f Plynmial FIGURE 5.2. The validatin set apprach was used n the Aut data set in rder t estimate the test errr that results frm predicting mpg using plynmial functins f hrsepwer. Left: Validatin errr estimates fr a single split int training and validatin data sets. Right: The validatin methd was repeated ten times, each time using a different randm split f the bservatins int a training set and a validatin set. This illustrates the variability in the estimated test MSE that results frm this apprach. 1. As is shwn in the right-hand panel f Figure 5.2, the validatin estimate f the test errr rate can be highly variable, depending n precisely which bservatins are included in the training set and which bservatins are included in the validatin set. 2. In the validatin apprach, nly a subset f the bservatins thse that are included in the training set rather than in the validatin set are used t fit the mdel. Since statistical methds tend t perfrm wrse when trained n fewer bservatins, this suggests that the validatin set errr rate may tend t verestimate the test errr rate fr the mdel fit n the entire data set. In the cming subsectins, we will present crss-validatin, a refinement f the validatin set apprach that addresses these tw issues Leave-One-Out Crss-Validatin Leave-ne-ut crss-validatin (LOOCV) is clsely related t the validatin leave-neut set apprach f Sectin 5.1.1, but it attempts t address that methd s drawbacks. crssvalidatin Like the validatin set apprach, LOOCV invlves splitting the set f bservatins int tw parts. Hwever, instead f creating tw subsets f cmparable size, a single bservatin (x 1,y 1 ) is used fr the validatin set, and the remaining bservatins {(x 2,y 2 ),...,(x n,y n )} make up the training set. The statistical learning methd is fit n the n 1 training bservatins, and a predictin ŷ 1 is made fr the excluded bservatin, using its value x 1. Since (x 1,y 1 ) was nt used in the fitting prcess, MSE 1 =

(y_1 - \hat{y}_1)^2 provides an approximately unbiased estimate for the test error. But even though MSE_1 is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation (x_1, y_1).

We can repeat the procedure by selecting (x_2, y_2) for the validation data, training the statistical learning procedure on the n - 1 observations {(x_1, y_1), (x_3, y_3), ..., (x_n, y_n)}, and computing MSE_2 = (y_2 - \hat{y}_2)^2. Repeating this approach n times produces n squared errors, MSE_1, ..., MSE_n. The LOOCV estimate for the test MSE is the average of these n test error estimates:

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i.   (5.1)

A schematic of the LOOCV approach is illustrated in Figure 5.3.

FIGURE 5.3. A schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the n resulting MSE's. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.

LOOCV has a couple of major advantages over the validation set approach. First, it has far less bias. In LOOCV, we repeatedly fit the statistical learning method using training sets that contain n - 1 observations, almost as many as are in the entire data set. This is in contrast to the validation set approach, in which the training set is typically around half the size of the original data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does. Second, in contrast to the validation approach, which will yield different results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits.

FIGURE 5.4. Cross-validation was used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Left: The LOOCV error curve. Right: 10-fold CV was run nine separate times, each with a different random split of the data into ten parts. The figure shows the nine slightly different CV error curves.

We used LOOCV on the Auto data set in order to obtain an estimate of the test set MSE that results from fitting a linear regression model to predict mpg using polynomial functions of horsepower. The results are shown in the left-hand panel of Figure 5.4.

LOOCV has the potential to be expensive to implement, since the model has to be fit n times. This can be very time consuming if n is large, and if each individual model is slow to fit. With least squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2,   (5.2)

where \hat{y}_i is the ith fitted value from the original least squares fit, and h_i is the leverage defined in (3.37) on page 98. This is like the ordinary MSE, except the ith residual is divided by 1 - h_i. The leverage lies between 1/n and 1, and reflects the amount that an observation influences its own fit. Hence the residuals for high-leverage points are inflated in this formula by exactly the right amount for this equality to hold.

LOOCV is a very general method, and can be used with any kind of predictive modeling. For example we could use it with logistic regression or linear discriminant analysis, or any of the methods discussed in later chapters. The magic formula (5.2) does not hold in general, in which case the model has to be refit n times.
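As a quick check on (5.2), the snippet below (an illustration rather than part of the original lab, assuming the ISLR package is installed) computes the LOOCV estimate for the regression of mpg on horsepower in the Auto data set in two ways: by explicitly refitting the model n times, and by applying (5.2) to a single fit using the leverages returned by hatvalues().

library(ISLR)
# Explicit LOOCV: leave each observation out in turn, refit, and record
# the squared prediction error for the held-out observation.
n <- nrow(Auto)
mse <- rep(0, n)
for (i in 1:n) {
  fit <- lm(mpg ~ horsepower, data = Auto[-i, ])
  mse[i] <- (Auto$mpg[i] - predict(fit, Auto[i, ]))^2
}
mean(mse)
# Shortcut (5.2): a single least squares fit, with each residual divided
# by 1 - h_i, where h_i is the leverage of the ith observation.
fit.all <- lm(mpg ~ horsepower, data = Auto)
mean(((Auto$mpg - fitted(fit.all)) / (1 - hatvalues(fit.all)))^2)
# The two results agree, and should match the cv.glm() estimate of
# roughly 24.23 reported in the lab in Section 5.3.2.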

FIGURE 5.5. A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

5.1.3 k-Fold Cross-Validation

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k - 1 folds. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE_1, MSE_2, ..., MSE_k. The k-fold CV estimate is computed by averaging these values,

CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i.   (5.3)

Figure 5.5 illustrates the k-fold CV approach. It is not hard to see that LOOCV is a special case of k-fold CV in which k is set to equal n. In practice, one typically performs k-fold CV using k = 5 or k = 10.

What is the advantage of using k = 5 or k = 10 rather than k = n? The most obvious advantage is computational. LOOCV requires fitting the statistical learning method n times. This has the potential to be computationally expensive (except for linear models fit by least squares, in which case formula (5.2) can be used). But cross-validation is a very general approach that can be applied to almost any statistical learning method. Some statistical learning methods have computationally intensive fitting procedures, and so performing LOOCV may pose computational problems, especially if n is extremely large. In contrast, performing 10-fold

197 Resampling Methds Mean Squared Errr Mean Squared Errr Mean Squared Errr Flexibility Flexibility Flexibility FIGURE 5.6. True and estimated test MSE fr the simulated data sets in Figures 2.9 ( left), 2.10 ( center),and2.11(right). The true test MSE is shwn in blue, the LOOCV estimate is shwn as a black dashed line, and the 10-fld CV estimate is shwn in range. The crsses indicate the minimum f each f the MSE curves. CV requires fitting the learning prcedure nly ten times, which may be much mre feasible. As we see in Sectin 5.1.4, there als can be ther nn-cmputatinal advantages t perfrming 5-fld r 10-fld CV, which invlve the bias-variance trade-ff. The right-hand panel f Figure 5.4 displays nine different 10-fld CV estimates fr the Aut data set, each resulting frm a different randm split f the bservatins int ten flds. As we can see frm the figure, there is sme variability in the CV estimates as a result f the variability in hw the bservatins are divided int ten flds. But this variability is typically much lwer than the variability in the test errr estimates that results frm the validatin set apprach (right-hand panel f Figure 5.2). When we examine real data, we d nt knw the true test MSE, and s it is difficult t determine the accuracy f the crss-validatin estimate. Hwever, if we examine simulated data, then we can cmpute the true test MSE, and can thereby evaluate the accuracy f ur crss-validatin results. In Figure 5.6, we plt the crss-validatin estimates and true test errr rates that result frm applying smthing splines t the simulated data sets illustrated in Figures f Chapter 2. The true test MSE is displayed in blue. The black dashed and range slid lines respectively shw the estimated LOOCV and 10-fld CV estimates. In all three plts, the tw crss-validatin estimates are very similar. In the right-hand panel f Figure 5.6, the true test MSE and the crss-validatin curves are almst identical. In the center panel f Figure 5.6, the tw sets f curves are similar at the lwer degrees f flexibility, while the CV curves verestimate the test set MSE fr higher degrees f flexibility. In the left-hand panel f Figure 5.6, the CV curves have the crrect general shape, but they underestimate the true test MSE.
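The k-fold estimate in (5.3) is also straightforward to compute by hand. The sketch below is one possible implementation (not the code used to produce the figures above, and it assumes the ISLR package is installed): it randomly assigns the observations of the Auto data set to k = 10 folds and averages the ten held-out MSEs for a quadratic fit of mpg on horsepower.

library(ISLR)
set.seed(1)
k <- 10
# Randomly assign each observation to one of k roughly equal-sized folds.
folds <- sample(rep(1:k, length.out = nrow(Auto)))
cv.mse <- rep(0, k)
for (j in 1:k) {
  # Fit on all folds except the jth, then evaluate on the held-out fold.
  fit <- lm(mpg ~ poly(horsepower, 2), data = Auto[folds != j, ])
  held.out <- Auto[folds == j, ]
  cv.mse[j] <- mean((held.out$mpg - predict(fit, held.out))^2)
}
# The k-fold CV estimate (5.3) is the average of the k held-out MSEs.
mean(cv.mse)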

198 5.1 Crss-Validatin 183 When we perfrm crss-validatin, ur gal might be t determine hw well a given statistical learning prcedure can be expected t perfrm n independent data; in this case, the actual estimate f the test MSE is f interest. But at ther times we are interested nly in the lcatin f the minimum pint in the estimated test MSE curve. This is because we might be perfrming crss-validatin n a number f statistical learning methds, r n a single methd using different levels f flexibility, in rder t identify the methd that results in the lwest test errr. Fr this purpse, the lcatin f the minimum pint in the estimated test MSE curve is imprtant, but the actual value f the estimated test MSE is nt. We find in Figure 5.6 that despite the fact that they smetimes underestimate the true test MSE, all f the CV curves cme clse t identifying the crrect level f flexibility that is, the flexibility level crrespnding t the smallest test MSE Bias-Variance Trade-Off fr k-fld Crss-Validatin We mentined in Sectin that k-fld CV with k<nhas a cmputatinal advantage t LOOCV. But putting cmputatinal issues aside, a less bvius but ptentially mre imprtant advantage f k-fld CV is that it ften gives mre accurate estimates f the test errr rate than des LOOCV. This has t d with a bias-variance trade-ff. It was mentined in Sectin that the validatin set apprach can lead t verestimates f the test errr rate, since in this apprach the training set used t fit the statistical learning methd cntains nly half the bservatins f the entire data set. Using this lgic, it is nt hard t see that LOOCV will give apprximately unbiased estimates f the test errr, since each training set cntains n 1 bservatins, which is almst as many as the number f bservatins in the full data set. And perfrming k-fld CV fr, say, k =5rk = 10 will lead t an intermediate level f bias, since each training set cntains (k 1)n/k bservatins fewer than in the LOOCV apprach, but substantially mre than in the validatin set apprach. Therefre, frm the perspective f bias reductin, it is clear that LOOCV is t be preferred t k-fld CV. Hwever, we knw that bias is nt the nly surce fr cncern in an estimating prcedure; we must als cnsider the prcedure s variance. It turns ut that LOOCV has higher variance than des k-fld CV with k<n.why is this the case? When we perfrm LOOCV, we are in effect averaging the utputs f n fitted mdels, each f which is trained n an almst identical set f bservatins; therefre, these utputs are highly (psitively) crrelated with each ther. In cntrast, when we perfrm k-fld CV with k<n, we are averaging the utputs f k fitted mdels that are smewhat less crrelated with each ther, since the verlap between the training sets in each mdel is smaller. Since the mean f many highly crrelated quantities

has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

5.1.5 Cross-Validation on Classification Problems

In this chapter so far, we have illustrated the use of cross-validation in the regression setting where the outcome Y is quantitative, and so have used MSE to quantify test error. But cross-validation can also be a very useful approach in the classification setting when Y is qualitative. In this setting, cross-validation works just as described earlier in this chapter, except that rather than using MSE to quantify test error, we instead use the number of misclassified observations. For instance, in the classification setting, the LOOCV error rate takes the form

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{Err}_i,   (5.4)

where Err_i = I(y_i \neq \hat{y}_i). The k-fold CV error rate and validation set error rates are defined analogously.

As an example, we fit various logistic regression models on the two-dimensional classification data displayed in Chapter 2. In the top-left panel of Figure 5.7, the black solid line shows the estimated decision boundary resulting from fitting a standard logistic regression model to this data set. Since this is simulated data, we can compute the true test error rate, which takes a value of 0.201 and so is substantially larger than the Bayes error rate. Clearly logistic regression does not have enough flexibility to model the Bayes decision boundary in this setting. We can easily extend logistic regression to obtain a non-linear decision boundary by using polynomial functions of the predictors, as we did in the regression setting in Chapter 3. For example, we can fit a quadratic logistic regression model, given by

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2.   (5.5)

The top-right panel of Figure 5.7 displays the resulting decision boundary, which is now curved. However, the test error rate has improved only slightly, to 0.197. A much larger improvement is apparent in the bottom-left panel

200 5.1 Crss-Validatin 185 Degree=1 Degree=2 Degree=3 Degree=4 FIGURE 5.7. Lgistic regressin fits n the tw-dimensinal classificatin data displayed in Figure The Bayes decisin bundary is represented using a purple dashed line. Estimated decisin bundaries frm linear, quadratic, cubic and quartic (degrees 1 4) lgistic regressins are displayed in black. The test errr rates fr the fur lgistic regressin fits are respectively 0.201, 0.197, 0.160, and 0.162, while the Bayes errr rate is f Figure 5.7, in which we have fit a lgistic regressin mdel invlving cubic plynmials f the predictrs. Nw the test errr rate has decreased t Ging t a quartic plynmial (bttm-right) slightly increases the test errr. In practice, fr real data, the Bayes decisin bundary and the test errr rates are unknwn. S hw might we decide between the fur lgistic regressin mdels displayed in Figure 5.7? We can use crss-validatin in rder t make this decisin. The left-hand panel f Figure 5.8 displays in

201 Resampling Methds Errr Rate Errr Rate Order f Plynmials Used /K FIGURE 5.8. Test errr (brwn), training errr (blue), and 10-fld CV errr (black) n the tw-dimensinal classificatin data displayed in Figure 5.7. Left: Lgistic regressin using plynmial functins f the predictrs. The rder f the plynmials used is displayed n the x-axis. Right: The KNN classifier with different values f K, the number f neighbrs used in the KNN classifier. black the 10-fld CV errr rates that result frm fitting ten lgistic regressin mdels t the data, using plynmial functins f the predictrs up t tenth rder. The true test errrs are shwn in brwn, and the training errrs are shwn in blue. As we have seen previusly, the training errr tends t decrease as the flexibility f the fit increases. (The figure indicates that thugh the training errr rate desn t quite decrease mntnically, it tends t decrease n the whle as the mdel cmplexity increases.) In cntrast, the test errr displays a characteristic U-shape. The 10-fld CV errr rate prvides a pretty gd apprximatin t the test errr rate. While it smewhat underestimates the errr rate, it reaches a minimum when furth-rder plynmials are used, which is very clse t the minimum f the test curve, which ccurs when third-rder plynmials are used. In fact, using furth-rder plynmials wuld likely lead t gd test set perfrmance, as the true test errr rate is apprximately the same fr third, furth, fifth, and sixth-rder plynmials. The right-hand panel f Figure 5.8 displays the same three curves using the KNN apprach fr classificatin, as a functin f the value f K (which in this cntext indicates the number f neighbrs used in the KNN classifier, rather than the number f CV flds used). Again the training errr rate declines as the methd becmes mre flexible, and s we see that the training errr rate cannt be used t select the ptimal value fr K. Thugh the crss-validatin errr curve slightly underestimates the test errr rate, it takes n a minimum very clse t the best value fr K.
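In R, the misclassification-based CV error of (5.4) can be obtained from the cv.glm() function used later in the lab by supplying a cost function, since cv.glm() uses squared error by default. The sketch below is purely illustrative; the Smarket data set from the Chapter 4 lab and the 0.5 cut-off are convenient choices, not the data behind Figure 5.8, and the ISLR and boot packages are assumed to be installed.

library(ISLR)
library(boot)
set.seed(1)
# Logistic regression for market direction using two lag variables.
glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial)
# Cost function: fraction of observations misclassified at a 0.5 cut-off,
# i.e. the error rate analogous to (5.4).
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
# 10-fold CV estimate of the classification error rate.
cv.glm(Smarket, glm.fit, cost, K = 10)$delta[1]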

5.2 The Bootstrap

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. As a simple example, the bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit. In the specific case of linear regression, this is not particularly useful, since we saw in Chapter 3 that standard statistical software such as R outputs such standard errors automatically. However, the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.

In this section we illustrate the bootstrap on a toy example in which we wish to determine the best investment allocation under a simple model. In Section 5.3 we explore the use of the bootstrap to assess the variability associated with the regression coefficients in a linear model fit.

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction α of our money in X, and will invest the remaining 1 - α in Y. Since there is variability associated with the returns on these two assets, we wish to choose α to minimize the total risk, or variance, of our investment. In other words, we want to minimize Var(αX + (1 - α)Y). One can show that the value that minimizes the risk is given by

\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},   (5.6)

where \sigma_X^2 = Var(X), \sigma_Y^2 = Var(Y), and \sigma_{XY} = Cov(X, Y).

In reality, the quantities \sigma_X^2, \sigma_Y^2, and \sigma_{XY} are unknown. We can compute estimates for these quantities, \hat{\sigma}_X^2, \hat{\sigma}_Y^2, and \hat{\sigma}_{XY}, using a data set that contains past measurements for X and Y. We can then estimate the value of α that minimizes the variance of our investment using

\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.   (5.7)

Figure 5.9 illustrates this approach for estimating α on a simulated data set. In each panel, we simulated 100 pairs of returns for the investments X and Y. We used these returns to estimate \sigma_X^2, \sigma_Y^2, and \sigma_{XY}, which we then substituted into (5.7) in order to obtain estimates for α. The value of \hat{\alpha} resulting from each simulated data set ranges from 0.532 to 0.657.

It is natural to wish to quantify the accuracy of our estimate of α. To estimate the standard deviation of \hat{\alpha}, we repeated the process of simulating 100 paired observations of X and Y, and estimating α using (5.7), 1,000 times.

FIGURE 5.9. Each panel displays 100 simulated returns for the investments X and Y. From left to right and top to bottom, the resulting estimates for α are 0.576, 0.532, 0.657, and 0.651.

We thereby obtained 1,000 estimates for α, which we can call \hat{\alpha}_1, \hat{\alpha}_2, ..., \hat{\alpha}_{1,000}. The left-hand panel of Figure 5.10 displays a histogram of the resulting estimates. For these simulations the parameters were set to \sigma_X^2 = 1, \sigma_Y^2 = 1.25, and \sigma_{XY} = 0.5, and so we know that the true value of α is 0.6. We indicated this value using a solid vertical line on the histogram. The mean over all 1,000 estimates for α is

\bar{\alpha} = \frac{1}{1{,}000} \sum_{r=1}^{1{,}000} \hat{\alpha}_r = 0.5996,

very close to α = 0.6, and the standard deviation of the estimates is

\sqrt{\frac{1}{1{,}000 - 1} \sum_{r=1}^{1{,}000} (\hat{\alpha}_r - \bar{\alpha})^2} = 0.083.

This gives us a very good idea of the accuracy of \hat{\alpha}: SE(\hat{\alpha}) is approximately 0.083. So roughly speaking, for a random sample from the population, we would expect \hat{\alpha} to differ from α by approximately 0.08, on average.
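The idealized calculation above can be mimicked in code. The sketch below is a hypothetical reconstruction: the text specifies only the variances and covariance, so the zero-mean bivariate normal distribution used here (via mvrnorm() from the MASS package) is an assumption. It repeatedly simulates 100 pairs of returns, applies (5.7) to each simulated data set, and reports the mean and standard deviation of the 1,000 resulting estimates.

library(MASS)  # for mvrnorm()
set.seed(1)
# Assumed data-generating mechanism: bivariate normal, mean zero,
# with Var(X) = 1, Var(Y) = 1.25, and Cov(X, Y) = 0.5.
Sigma <- matrix(c(1, 0.5, 0.5, 1.25), 2, 2)
alpha.hat <- rep(0, 1000)
for (r in 1:1000) {
  z <- mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
  X <- z[, 1]
  Y <- z[, 2]
  # Plug-in estimate (5.7) based on the sample variances and covariance.
  alpha.hat[r] <- (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
mean(alpha.hat)  # should be close to the true value alpha = 0.6
sd(alpha.hat)    # approximates SE(alpha-hat)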

FIGURE 5.10. Left: A histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. Center: A histogram of the estimates of α obtained from 1,000 bootstrap samples from a single data set. Right: The estimates of α displayed in the left and center panels are shown as boxplots. In each panel, the pink line indicates the true value of α.

In practice, however, the procedure for estimating SE(\hat{\alpha}) outlined above cannot be applied, because for real data we cannot generate new samples from the original population. However, the bootstrap approach allows us to use a computer to emulate the process of obtaining new sample sets, so that we can estimate the variability of \hat{\alpha} without generating additional samples. Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set.

This approach is illustrated in Figure 5.11 on a simple data set, which we call Z, that contains only n = 3 observations. We randomly select n observations from the data set in order to produce a bootstrap data set, Z^{*1}. The sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set. In this example, Z^{*1} contains the third observation twice, the first observation once, and no instances of the second observation. Note that if an observation is contained in Z^{*1}, then both its X and Y values are included. We can use Z^{*1} to produce a new bootstrap estimate for α, which we call \hat{\alpha}^{*1}. This procedure is repeated B times for some large value of B, in order to produce B different bootstrap data sets, Z^{*1}, Z^{*2}, ..., Z^{*B}, and B corresponding α estimates, \hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ..., \hat{\alpha}^{*B}. We can compute the standard error of these bootstrap estimates using the formula

SE_B(\hat{\alpha}) = \sqrt{ \frac{1}{B-1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat{\alpha}^{*r'} \right)^2 }.   (5.8)

This serves as an estimate of the standard error of \hat{\alpha} estimated from the original data set.

The bootstrap approach is illustrated in the center panel of Figure 5.10, which displays a histogram of 1,000 bootstrap estimates of α, each computed using a distinct bootstrap data set. This panel was constructed on the basis of a single data set, and hence could be created using real data.
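Formula (5.8) can be sketched directly. The code below is a minimal illustration (the boot() function used in the lab in Section 5.3.4 automates the same computation): it draws B = 1,000 bootstrap samples from the Portfolio data set in the ISLR package, recomputes \hat{\alpha} on each via (5.7), and applies (5.8) to the resulting estimates.

library(ISLR)
set.seed(1)
B <- 1000
n <- nrow(Portfolio)
alpha.star <- rep(0, B)
for (b in 1:B) {
  # Sample n observations with replacement to form a bootstrap data set.
  idx <- sample(n, n, replace = TRUE)
  X <- Portfolio$X[idx]
  Y <- Portfolio$Y[idx]
  alpha.star[b] <- (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
# Bootstrap standard error (5.8); sd() uses the B - 1 divisor.
sd(alpha.star)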

FIGURE 5.11. A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of α.

Note that the histogram looks very similar to the left-hand panel, which displays the idealized histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. In particular the bootstrap estimate SE(\hat{\alpha}) from (5.8) is 0.087, very close to the estimate of 0.083 obtained using 1,000 simulated data sets. The right-hand panel displays the information in the center and left panels in a different way, via boxplots of the estimates for α obtained by generating 1,000 simulated data sets from the true population and using the bootstrap approach. Again, the boxplots are quite similar to each other, indicating that the bootstrap approach can be used to effectively estimate the variability associated with \hat{\alpha}.

5.3 Lab: Cross-Validation and the Bootstrap

In this lab, we explore the resampling techniques covered in this chapter. Some of the commands in this lab may take a while to run on your computer.

206 5.3.1 The Validatin Set Apprach 5.3 Lab: Crss-Validatin and the Btstrap 191 We explre the use f the validatin set apprach in rder t estimate the test errr rates that result frm fitting varius linear mdels n the Aut data set. Befre we begin, we use the set.seed() functin in rder t set a seed fr seed R s randm number generatr, s that the reader f this bk will btain precisely the same results as thse shwn belw. It is generally a gd idea t set a randm seed when perfrming an analysis such as crss-validatin that cntains an element f randmness, s that the results btained can be reprduced precisely at a later time. We begin by using the sample() functin t split the set f bservatins sample() int tw halves, by selecting a randm subset f 196 bservatins ut f the riginal 392 bservatins. We refer t these bservatins as the training set. > library(islr) > set.seed(1) > train=sample(392,196) (Here we use a shrtcut in the sample cmmand; see?sample fr details.) We then use the subset ptin in lm() t fit a linear regressin using nly the bservatins crrespnding t the training set. > lm.fit=lm(mpg hrsepwer,data=aut,subset=train) We nw use the predict() functin t estimate the respnse fr all 392 bservatins, and we use the mean() functin t calculate the MSE f the 196 bservatins in the validatin set. Nte that the -train index belw selects nly the bservatins that are nt in the training set. > attach(aut) > mean((mpg-predict(lm.fit,aut))[-train]^2) [1] Therefre, the estimated test MSE fr the linear regressin fit is We can use the ply() functin t estimate the test errr fr the plynmial and cubic regressins. > lm.fit2=lm(mpg ply(hrsepwer,2),data=aut,subset=train) > mean((mpg-predict(lm.fit2,aut))[-train]^2) [1] > lm.fit3=lm(mpg ply(hrsepwer,3),data=aut,subset=train) > mean((mpg-predict(lm.fit3,aut))[-train]^2) [1] These errr rates are and 19.78, respectively. If we chse a different training set instead, then we will btain smewhat different errrs n the validatin set. > set.seed(2) > train=sample(392,196) > lm.fit=lm(mpg hrsepwer,subset=train)

207 Resampling Methds > mean((mpg-predict(lm.fit,aut))[-train]^2) [1] > lm.fit2=lm(mpg ply(hrsepwer,2),data=aut,subset=train) > mean((mpg-predict(lm.fit2,aut))[-train]^2) [1] > lm.fit3=lm(mpg ply(hrsepwer,3),data=aut,subset=train) > mean((mpg-predict(lm.fit3,aut))[-train]^2) [1] Using this split f the bservatins int a training set and a validatin set, we find that the validatin set errr rates fr the mdels with linear, quadratic, and cubic terms are 23.30, 18.90, and 19.26, respectively. These results are cnsistent with ur previus findings: a mdel that predicts mpg using a quadratic functin f hrsepwer perfrms better than a mdel that invlves nly a linear functin f hrsepwer, andthereis little evidence in favr f a mdel that uses a cubic functin f hrsepwer Leave-One-Out Crss-Validatin The LOOCV estimate can be autmatically cmputed fr any generalized linear mdel using the glm() and cv.glm() functins. In the lab fr Chap- cv.glm() ter 4, we used the glm() functin t perfrm lgistic regressin by passing in the family="binmial" argument. But if we use glm() t fit a mdel withut passing in the family argument, then it perfrms linear regressin, just like the lm() functin. S fr instance, > glm.fit=glm(mpg hrsepwer,data=aut) > cef(glm.fit) (Intercept ) hrsepwer and > lm.fit=lm(mpg hrsepwer,data=aut) > cef(lm.fit) (Intercept ) hrsepwer yield identical linear regressin mdels. In this lab, we will perfrm linear regressin using the glm() functin rather than the lm() functin because the latter can be used tgether with cv.glm(). The cv.glm() functin is part f the bt library. > library(bt) > glm.fit=glm(mpg hrsepwer,data=aut) > cv.err=cv.glm(aut,glm.fit) > cv.err$delta The cv.glm() functin prduces a list with several cmpnents. The tw numbers in the delta vectr cntain the crss-validatin results. In this

208 5.3 Lab: Crss-Validatin and the Btstrap 193 case the numbers are identical (up t tw decimal places) and crrespnd t the LOOCV statistic given in (5.1). Belw, we discuss a situatin in which the tw numbers differ. Our crss-validatin estimate fr the test errr is apprximately We can repeat this prcedure fr increasingly cmplex plynmial fits. T autmate the prcess, we use the fr() functin t initiate a fr lp fr() which iteratively fits plynmial regressins fr plynmials f rder i =1 t i = 5, cmputes the assciated crss-validatin errr, and stres it in the ith element f the vectr cv.errr. We begin by initializing the vectr. This cmmand will likely take a cuple f minutes t run. > cv.errr=rep(0,5) > fr (i in 1:5){ + glm.fit=glm(mpg ply(hrsepwer,i),data=aut) + cv.errr[i]=cv.glm(aut,glm.fit)$delta[1] + } > cv.errr [1] As in Figure 5.4, we see a sharp drp in the estimated test MSE between the linear and quadratic fits, but then n clear imprvement frm using higher-rder plynmials. fr lp k-fld Crss-Validatin The cv.glm() functin can als be used t implement k-fld CV. Belw we use k = 10, a cmmn chice fr k, ntheaut data set. We nce again set a randm seed and initialize a vectr in which we will stre the CV errrs crrespnding t the plynmial fits f rders ne t ten. > set.seed(17) > cv.errr.10=rep(0,10) > fr (i in 1:10){ + glm.fit=glm(mpg ply(hrsepwer,i),data=aut) + cv.errr.10[i]=cv.glm(aut,glm.fit,k=10)$delta[1] + } > cv.errr.10 [1] Ntice that the cmputatin time is much shrter than that f LOOCV. (In principle, the cmputatin time fr LOOCV fr a least squares linear mdel shuld be faster than fr k-fld CV, due t the availability f the frmula (5.2) fr LOOCV; hwever, unfrtunately the cv.glm() functin des nt make use f this frmula.) We still see little evidence that using cubic r higher-rder plynmial terms leads t lwer test errr than simply using a quadratic fit. We saw in Sectin that the tw numbers assciated with delta are essentially the same when LOOCV is perfrmed. When we instead perfrm k-fld CV, then the tw numbers assciated with delta differ slightly. The

209 Resampling Methds first is the standard k-fld CV estimate, as in (5.3). The secnd is a biascrrected versin. On this data set, the tw estimates are very similar t each ther The Btstrap We illustrate the use f the btstrap in the simple example f Sectin 5.2, as well as n an example invlving estimating the accuracy f the linear regressin mdel n the Aut data set. Estimating the Accuracy f a Statistic f Interest One f the great advantages f the btstrap apprach is that it can be applied in almst all situatins. N cmplicated mathematical calculatins are required. Perfrming a btstrap analysis in R entails nly tw steps. First, we must create a functin that cmputes the statistic f interest. Secnd, we use the bt() functin, which is part f the bt library, t bt() perfrm the btstrap by repeatedly sampling bservatins frm the data set with replacement. The Prtfli data set in the ISLR package is described in Sectin 5.2. T illustrate the use f the btstrap n this data, we must first create a functin, alpha.fn(), which takes as input the (X, Y ) dataaswellas a vectr indicating which bservatins shuld be used t estimate α. The functin then utputs the estimate fr α based n the selected bservatins. > alpha.fn=functin(data,index){ + X=data$X[index] + Y=data$Y[index] + return((var(y)-cv(x,y))/(var(x)+var(y)-2*cv(x,y))) + } This functin returns, r utputs, an estimate fr α based n applying (5.7) t the bservatins indexed by the argument index. Fr instance, the fllwing cmmand tells R t estimate α using all 100 bservatins. > alpha.fn(prtfli,1:100) [1] The next cmmand uses the sample() functin t randmly select 100 bservatins frm the range 1 t 100, with replacement. This is equivalent t cnstructing a new btstrap data set and recmputing ˆα based n the new data set. > set.seed(1) > alpha.fn(prtfli,sample(100,100, replace=t)) [1] We can implement a btstrap analysis by perfrming this cmmand many times, recrding all f the crrespnding estimates fr α, and cmputing

210 5.3 Lab: Crss-Validatin and the Btstrap 195 the resulting standard deviatin. Hwever, the bt() functin autmates bt() this apprach. Belw we prduce R =1, 000 btstrap estimates fr α. > bt(prtfli,alpha.fn,r=1000) ORDINARY NONPARAMETRIC BOOTSTRAP Call: bt(data = Prtfli, statistic = alpha.fn, R = 1000) Btstrap Statistics : riginal bias std. errr t1* e The final utput shws that using the riginal data, ˆα =0.5758, and that the btstrap estimate fr SE( ˆα) is Estimating the Accuracy f a Linear Regressin Mdel The btstrap apprach can be used t assess the variability f the cefficient estimates and predictins frm a statistical learning methd. Here we use the btstrap apprach in rder t assess the variability f the estimates fr β 0 and β 1, the intercept and slpe terms fr the linear regressin mdel that uses hrsepwer t predict mpg in the Aut data set. We will cmpare the estimates btained using the btstrap t thse btained using the frmulas fr SE( ˆβ 0 ) and SE( ˆβ 1 ) described in Sectin We first create a simple functin, bt.fn(), which takes in the Aut data set as well as a set f indices fr the bservatins, and returns the intercept and slpe estimates fr the linear regressin mdel. We then apply this functin t the full set f 392 bservatins in rder t cmpute the estimates f β 0 and β 1 n the entire data set using the usual linear regressin cefficient estimate frmulas frm Chapter 3. Nte that we d nt need the { and } at the beginning and end f the functin because it is nly ne line lng. > bt.fn=functin(data,index) + return(cef(lm(mpg hrsepwer,data=data,subset=index))) > bt.fn(aut,1:392) (Intercept ) hrsepwer The bt.fn() functin can als be used in rder t create btstrap estimates fr the intercept and slpe terms by randmly sampling frm amng the bservatins with replacement. Here we give tw examples. > set.seed(1) > bt.fn(aut,sample(392,392, replace=t)) (Intercept ) hrsepwer > bt.fn(aut,sample(392,392, replace=t)) (Intercept ) hrsepwer

211 Resampling Methds Next, we use the bt() functin t cmpute the standard errrs f 1,000 btstrap estimates fr the intercept and slpe terms. > bt(aut,bt.fn,1000) ORDINARY NONPARAMETRIC BOOTSTRAP Call: bt(data = Aut, statistic = bt.fn, R = 1000) Btstrap Statistics : riginal bias std. errr t1* t2* This indicates that the btstrap estimate fr SE( ˆβ 0 )is0.86, and that the btstrap estimate fr SE( ˆβ 1 )is As discussed in Sectin 3.1.2, standard frmulas can be used t cmpute the standard errrs fr the regressin cefficients in a linear mdel. These can be btained using the summary() functin. > summary(lm(mpg hrsepwer,data=aut))$cef Estimate Std. Errr t value Pr(>t) (Intercept ) e-187 hrsepwer e-81 The standard errr estimates fr ˆβ 0 and ˆβ 1 btained using the frmulas frm Sectin are fr the intercept and fr the slpe. Interestingly, these are smewhat different frm the estimates btained using the btstrap. Des this indicate a prblem with the btstrap? In fact, it suggests the ppsite. Recall that the standard frmulas given in Equatin 3.8 n page 66 rely n certain assumptins. Fr example, they depend n the unknwn parameter σ 2, the nise variance. We then estimate σ 2 using the RSS. Nw althugh the frmula fr the standard errrs d nt rely n the linear mdel being crrect, the estimate fr σ 2 des. We see in Figure 3.8 n page 91 that there is a nn-linear relatinship in the data, and s the residuals frm a linear fit will be inflated, and s will ˆσ 2. Secndly, the standard frmulas assume (smewhat unrealistically) that the x i are fixed, and all the variability cmes frm the variatin in the errrs ɛ i.the btstrap apprach des nt rely n any f these assumptins, and s it is likely giving a mre accurate estimate f the standard errrs f ˆβ 0 and ˆβ 1 than is the summary() functin. Belw we cmpute the btstrap standard errr estimates and the standard linear regressin estimates that result frm fitting the quadratic mdel t the data. Since this mdel prvides a gd fit t the data (Figure 3.8), there is nw a better crrespndence between the btstrap estimates and the standard estimates f SE( ˆβ 0 ), SE( ˆβ 1 ) and SE( ˆβ 2 ).

212 5.4 Exercises 197 > bt.fn=functin(data,index) + cefficients(lm(mpg hrsepwer +I(hrsepwer ^2),data=data, subset=index)) > set.seed(1) > bt(aut,bt.fn,1000) ORDINARY NONPARAMETRIC BOOTSTRAP Call: bt(data = Aut, statistic = bt.fn, R = 1000) Btstrap Statistics : riginal bias std. errr t1* e t2* e t3* e > summary(lm(mpg hrsepwer +I(hrsepwer ^2),data=Aut))$cef Estimate Std. Errr t value Pr(>t) (Intercept ) e-109 hrsepwer e-40 I(hrsepwer ^2) e Exercises Cnceptual 1. Using basic statistical prperties f the variance, as well as singlevariable calculus, derive (5.6). In ther wrds, prve that α given by (5.6) des indeed minimize Var(αX +(1 α)y ). 2. We will nw derive the prbability that a given bservatin is part f a btstrap sample. Suppse that we btain a btstrap sample frm a set f n bservatins. (a) What is the prbability that the first btstrap bservatin is nt the jth bservatin frm the riginal sample? Justify yur answer. (b) What is the prbability that the secnd btstrap bservatin is nt the jth bservatin frm the riginal sample? (c) Argue that the prbability that the jth bservatin is nt in the btstrap sample is (1 1/n) n. (d) When n = 5, what is the prbability that the jth bservatin is in the btstrap sample? (e) When n = 100, what is the prbability that the jth bservatin is in the btstrap sample?

213 Resampling Methds (f) When n =10, 000, what is the prbability that the jth bservatin is in the btstrap sample? (g) Create a plt that displays, fr each integer value f n frm 1 t 100, 000, the prbability that the jth bservatin is in the btstrap sample. Cmment n what yu bserve. (h) We will nw investigate numerically the prbability that a btstrap sample f size n = 100 cntains the jth bservatin. Here j = 4. We repeatedly create btstrap samples, and each time we recrd whether r nt the furth bservatin is cntained in the btstrap sample. > stre=rep(na, 10000) > fr(i in 1:10000){ stre[i]=sum(sample(1:100, } > mean(stre) Cmment n the results btained. 3. We nw review k-fld crss-validatin. rep=true)==4)>0 (a) Explain hw k-fld crss-validatin is implemented. (b) What are the advantages and disadvantages f k-fld crssvalidatin relative t: i. The validatin set apprach? ii. LOOCV? 4. Suppse that we use sme statistical learning methd t make a predictin fr the respnse Y fr a particular value f the predictr X. Carefully describe hw we might estimate the standard deviatin f ur predictin. Applied 5. In Chapter 4, we used lgistic regressin t predict the prbability f default using incme and balance n the Default data set. We will nw estimate the test errr f this lgistic regressin mdel using the validatin set apprach. D nt frget t set a randm seed befre beginning yur analysis. (a) Fit a lgistic regressin mdel that uses incme and balance t predict default. (b) Using the validatin set apprach, estimate the test errr f this mdel. In rder t d this, yu must perfrm the fllwing steps: i. Split the sample set int a training set and a validatin set.

214 5.4 Exercises 199 ii. Fit a multiple lgistic regressin mdel using nly the training bservatins. iii. Obtain a predictin f default status fr each individual in the validatin set by cmputing the psterir prbability f default fr that individual, and classifying the individual t the default categry if the psterir prbability is greater than 0.5. iv. Cmpute the validatin set errr, which is the fractin f the bservatins in the validatin set that are misclassified. (c) Repeat the prcess in (b) three times, using three different splits f the bservatins int a training set and a validatin set. Cmment n the results btained. (d) Nw cnsider a lgistic regressin mdel that predicts the prbability f default using incme, balance, and a dummy variable fr student. Estimate the test errr fr this mdel using the validatin set apprach. Cmment n whether r nt including a dummy variable fr student leads t a reductin in the test errr rate. 6. We cntinue t cnsider the use f a lgistic regressin mdel t predict the prbability f default using incme and balance n the Default data set. In particular, we will nw cmpute estimates fr the standard errrs f the incme and balance lgistic regressin cefficients in tw different ways: (1) using the btstrap, and (2) using the standard frmula fr cmputing the standard errrs in the glm() functin. D nt frget t set a randm seed befre beginning yur analysis. (a) Using the summary() and glm() functins, determine the estimated standard errrs fr the cefficients assciated with incme and balance in a multiple lgistic regressin mdel that uses bth predictrs. (b) Write a functin, bt.fn(), that takes as input the Default data set as well as an index f the bservatins, and that utputs the cefficient estimates fr incme and balance in the multiple lgistic regressin mdel. (c) Use the bt() functin tgether with yur bt.fn() functin t estimate the standard errrs f the lgistic regressin cefficients fr incme and balance. (d) Cmment n the estimated standard errrs btained using the glm() functin and using yur btstrap functin. 7. In Sectins and 5.3.3, we saw that the cv.glm() functin can be used in rder t cmpute the LOOCV test errr estimate. Alternatively, ne culd cmpute thse quantities using just the glm() and

215 Resampling Methds predict.glm() functins, and a fr lp. Yu will nw take this apprach in rder t cmpute the LOOCV errr fr a simple lgistic regressin mdel n the Weekly data set. Recall that in the cntext f classificatin prblems, the LOOCV errr is given in (5.4). (a) Fit a lgistic regressin mdel that predicts Directin using Lag1 and Lag2. (b) Fit a lgistic regressin mdel that predicts Directin using Lag1 and Lag2 using all but the first bservatin. (c) Use the mdel frm (b) t predict the directin f the first bservatin. Yu can d this by predicting that the first bservatin will g up if P (Directin="Up"Lag1, Lag2) > 0.5.Wasthisbservatin crrectly classified? (d) Write a fr lp frm i =1ti = n, wheren is the number f bservatins in the data set, that perfrms each f the fllwing steps: i. Fit a lgistic regressin mdel using all but the ith bservatin t predict Directin using Lag1 and Lag2. ii. Cmpute the psterir prbability f the market mving up fr the ith bservatin. iii. Use the psterir prbability fr the ith bservatin in rder t predict whether r nt the market mves up. iv. Determine whether r nt an errr was made in predicting the directin fr the ith bservatin. If an errr was made, then indicate this as a 1, and therwise indicate it as a 0. (e) Take the average f the n numbers btained in (d)iv in rder t btain the LOOCV estimate fr the test errr. Cmment n the results. 8. We will nw perfrm crss-validatin n a simulated data set. (a) Generate a simulated data set as fllws: > set.seed(1) > y=rnrm(100) > x=rnrm(100) > y=x-2*x^2+rnrm(100) In this data set, what is n and what is p? Write ut the mdel used t generate the data in equatin frm. (b) Create a scatterplt f X against Y. Cmment n what yu find. (c) Set a randm seed, and then cmpute the LOOCV errrs that result frm fitting the fllwing fur mdels using least squares:

216 5.4 Exercises 201 i. Y = β 0 + β 1 X + ɛ ii. Y = β 0 + β 1 X + β 2 X 2 + ɛ iii. Y = β 0 + β 1 X + β 2 X 2 + β 3 X 3 + ɛ iv. Y = β 0 + β 1 X + β 2 X 2 + β 3 X 3 + β 4 X 4 + ɛ. Nte yu may find it helpful t use the data.frame() functin t create a single data set cntaining bth X and Y. (d) Repeat (c) using anther randm seed, and reprt yur results. Are yur results the same as what yu gt in (c)? Why? (e) Which f the mdels in (c) had the smallest LOOCV errr? Is this what yu expected? Explain yur answer. (f) Cmment n the statistical significance f the cefficient estimates that results frm fitting each f the mdels in (c) using least squares. D these results agree with the cnclusins drawn based n the crss-validatin results? 9. We will nw cnsider the Bstn husing data set, frm the MASS library. (a) Based n this data set, prvide an estimate fr the ppulatin mean f medv. Call this estimate ˆμ. (b) Prvide an estimate f the standard errr f ˆμ. Interpret this result. Hint: We can cmpute the standard errr f the sample mean by dividing the sample standard deviatin by the square rt f the number f bservatins. (c) Nw estimate the standard errr f ˆμ using the btstrap. Hw des this cmpare t yur answer frm (b)? (d) Based n yur btstrap estimate frm (c), prvide a 95 % cnfidence interval fr the mean f medv. Cmpare it t the results btained using t.test(bstn$medv). Hint: Yu can apprximate a 95 % cnfidence interval using the frmula [ˆμ 2SE(ˆμ), ˆμ +2SE(ˆμ)]. (e) Based n this data set, prvide an estimate, ˆμ med, fr the median value f medv in the ppulatin. (f) We nw wuld like t estimate the standard errr f ˆμ med.unfrtunately, there is n simple frmula fr cmputing the standard errr f the median. Instead, estimate the standard errr f the median using the btstrap. Cmment n yur findings. (g) Based n this data set, prvide an estimate fr the tenth percentile f medv in Bstn suburbs. Call this quantity ˆμ 0.1.(Yu can use the quantile() functin.) (h) Use the btstrap t estimate the standard errr f ˆμ 0.1.Cmment n yur findings.


6 Linear Model Selection and Regularization

In the regression setting, the standard linear model

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon    (6.1)

is commonly used to describe the relationship between a response Y and a set of variables X_1, X_2, ..., X_p. We have seen in Chapter 3 that one typically fits this model using least squares.

In the chapters that follow, we consider some approaches for extending the linear model framework. In Chapter 7 we generalize (6.1) in order to accommodate non-linear, but still additive, relationships, while in Chapter 8 we consider even more general non-linear models. However, the linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive in relation to non-linear methods. Hence, before moving to the non-linear world, we discuss in this chapter some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Why might we want to use another fitting procedure instead of least squares? As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

Prediction Accuracy: Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If n >> p (that is, if n, the number of observations, is much larger than p, the number of variables), then the least squares estimates tend to also have low variance, and hence will perform well on test observations. However, if n is not much larger

219 Linear Mdel Selectin and Regularizatin than p, then there can be a lt f variability in the least squares fit, resulting in verfitting and cnsequently pr predictins n future bservatins nt used in mdel training. And if p>n, then there is n lnger a unique least squares cefficient estimate: the variance is infinite s the methd cannt be used at all. By cnstraining r shrinking the estimated cefficients, we can ften substantially reduce the variance at the cst f a negligible increase in bias. This can lead t substantial imprvements in the accuracy with which we can predict the respnse fr bservatins nt used in mdel training. Mdel Interpretability: It is ften the case that sme r many f the variables used in a multiple regressin mdel are in fact nt assciated with the respnse. Including such irrelevant variables leads t unnecessary cmplexity in the resulting mdel. By remving these variables that is, by setting the crrespnding cefficient estimates t zer we can btain a mdel that is mre easily interpreted. Nw least squares is extremely unlikely t yield any cefficient estimates that are exactly zer. In this chapter, we see sme appraches fr autmatically perfrming feature selectin r variable selectin that is, fr excluding irrelevant variables frm a multiple regressin mdel. There are many alternatives, bth classical and mdern, t using least squares t fit (6.1). In this chapter, we discuss three imprtant classes f methds. Subset Selectin. This apprach invlves identifying a subset f the p predictrs that we believe t be related t the respnse. We then fit a mdel using least squares n the reduced set f variables. Shrinkage. This apprach invlves fitting a mdel invlving all p predictrs. Hwever, the estimated cefficients are shrunken twards zer relative t the least squares estimates. This shrinkage (als knwn as regularizatin) has the effect f reducing variance. Depending n what type f shrinkage is perfrmed, sme f the cefficients may be estimated t be exactly zer. Hence, shrinkage methds can als perfrm variable selectin. Dimensin Reductin. This apprach invlves prjecting the p predictrs int a M-dimensinal subspace, where M<p.Thisisachieved by cmputing M different linear cmbinatins, r prjectins, f the variables. Then these M prjectins are used as predictrs t fit a linear regressin mdel by least squares. In the fllwing sectins we describe each f these appraches in greater detail, alng with their advantages and disadvantages. Althugh this chapter describes extensins and mdificatins t the linear mdel fr regressin seen in Chapter 3, the same cncepts apply t ther methds, such as the classificatin mdels seen in Chapter 4. feature selectin variable selectin

220 6.1 Subset Selectin Subset Selectin In this sectin we cnsider sme methds fr selecting subsets f predictrs. These include best subset and stepwise mdel selectin prcedures Best Subset Selectin T perfrm best subset selectin, we fit a separate least squares regressin best subset fr each pssible cmbinatin f the p predictrs. That is, we fit all p mdels selectin that cntain exactly ne predictr, all ( p 2) = p(p 1)/2 mdelsthatcntain exactly tw predictrs, and s frth. We then lk at all f the resulting mdels, with the gal f identifying the ne that is best. The prblem f selecting the best mdel frm amng the 2 p pssibilities cnsidered by best subset selectin is nt trivial. This is usually brken up int tw stages, as described in Algrithm 6.1. Algrithm 6.1 Best subset selectin 1. Let M 0 dente the null mdel, which cntains n predictrs. This mdel simply predicts the sample mean fr each bservatin. 2. Fr k =1, 2,...p: (a) Fit all ( p k) mdels that cntain exactly k predictrs. (b) Pick the best amng these ( p k) mdels, and call it Mk.Herebest is defined as having the smallest RSS, r equivalently largest R Select a single best mdel frm amng M 0,...,M p using crssvalidated predictin errr, C p (AIC), BIC, r adjusted R 2. In Algrithm 6.1, Step 2 identifies the best mdel (n the training data) fr each subset size, in rder t reduce the prblem frm ne f 2 p pssible mdels t ne f p + 1 pssible mdels. In Figure 6.1, these mdels frm the lwer frntier depicted in red. Nw in rder t select a single best mdel, we must simply chse amng these p + 1 ptins. This task must be perfrmed with care, because the RSS f these p + 1 mdels decreases mntnically, and the R 2 increases mntnically, as the number f features included in the mdels increases. Therefre, if we use these statistics t select the best mdel, then we will always end up with a mdel invlving all f the variables. The prblem is that a lw RSS r a high R 2 indicates a mdel with a lw training errr, whereas we wish t chse a mdel that has a lw test errr. (As shwn in Chapter 2 in Figures , training errr tends t be quite a bit smaller than test errr, and a lw training errr by n means guarantees a lw test errr.) Therefre, in Step 3, we use crss-validated predictin

221 Linear Mdel Selectin and Regularizatin 0.0 Residual Sum f Squares 2e+07 4e+07 6e+07 8e+07 R Number f Predictrs Number f Predictrs FIGURE 6.1. Fr each pssible mdel cntaining a subset f the ten predictrs in the Credit data set, the RSS and R 2 are displayed. The red frntier tracks the best mdel fr a given number f predictrs, accrding t RSS and R 2. Thugh the data set cntains nly ten predictrs, the x-axis ranges frm 1 t 11, sincene f the variables is categrical and takes n three values, leading t the creatin f tw dummy variables. errr, C p, BIC, r adjusted R 2 in rder t select amng M 0, M 1,...,M p. These appraches are discussed in Sectin An applicatin f best subset selectin is shwn in Figure 6.1. Each pltted pint crrespnds t a least squares regressin mdel fit using a different subset f the 11 predictrs in the Credit data set, discussed in Chapter 3. Here the variable ethnicity is a three-level qualitative variable, and s is represented by tw dummy variables, which are selected separately in this case. We have pltted the RSS and R 2 statistics fr each mdel, as a functin f the number f variables. The red curves cnnect the best mdels fr each mdel size, accrding t RSS r R 2. The figure shws that, as expected, these quantities imprve as the number f variables increases; hwever, frm the three-variable mdel n, there is little imprvement in RSS and R 2 as a result f including additinal predictrs. Althugh we have presented best subset selectin here fr least squares regressin, the same ideas apply t ther types f mdels, such as lgistic regressin. In the case f lgistic regressin, instead f rdering mdels by RSS in Step 2 f Algrithm 6.1, we instead use the deviance, ameasure deviance that plays the rle f RSS fr a brader class f mdels. The deviance is negative tw times the maximized lg-likelihd; the smaller the deviance, the better the fit. While best subset selectin is a simple and cnceptually appealing apprach, it suffers frm cmputatinal limitatins. The number f pssible mdels that must be cnsidered grws rapidly as p increases. In general, there are 2 p mdels that invlve subsets f p predictrs. S if p = 10, then there are apprximately 1,000 pssible mdels t be cnsidered, and if

222 6.1 Subset Selectin 207 p = 20, then there are ver ne millin pssibilities! Cnsequently, best subset selectin becmes cmputatinally infeasible fr values f p greater than arund 40, even with extremely fast mdern cmputers. There are cmputatinal shrtcuts s called branch-and-bund techniques fr eliminating sme chices, but these have their limitatins as p gets large. They als nly wrk fr least squares linear regressin. We present cmputatinally efficient alternatives t best subset selectin next Stepwise Selectin Fr cmputatinal reasns, best subset selectin cannt be applied with very large p. Best subset selectin may als suffer frm statistical prblems when p is large. The larger the search space, the higher the chance f finding mdels that lk gd n the training data, even thugh they might nt have any predictive pwer n future data. Thus an enrmus search space can lead t verfitting and high variance f the cefficient estimates. Fr bth f these reasns, stepwise methds, which explre a far mre restricted set f mdels, are attractive alternatives t best subset selectin. Frward Stepwise Selectin Frward stepwise selectin is a cmputatinally efficient alternative t best frward subset selectin. While the best subset selectin prcedure cnsiders all 2 p pssible mdels cntaining subsets f the p predictrs, frward stepwise cnsiders a much smaller set f mdels. Frward stepwise selectin begins with a mdel cntaining n predictrs, and then adds predictrs t the mdel, ne-at-a-time, until all f the predictrs are in the mdel. In particular, at each step the variable that gives the greatest additinal imprvement t the fit is added t the mdel. Mre frmally, the frward stepwise selectin prcedure is given in Algrithm 6.2. Algrithm 6.2 Frward stepwise selectin 1. Let M 0 dente the null mdel, which cntains n predictrs. 2. Fr k =0,...,p 1: (a) Cnsider all p k mdels that augment the predictrs in M k with ne additinal predictr. (b) Chse the best amng these p k mdels, and call it M k+1. Here best is defined as having smallest RSS r highest R Select a single best mdel frm amng M 0,...,M p using crssvalidated predictin errr, C p (AIC), BIC, r adjusted R 2. stepwise selectin
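Forward stepwise selection is implemented, for example, by regsubsets() in the leaps package. The sketch below is not part of this chapter's lab; the use of the Credit data from the ISLR package, with Balance as the response and the first column assumed to be an ID variable, is an illustrative assumption.

# Forward stepwise selection via leaps::regsubsets() (illustrative sketch).
library(ISLR)
library(leaps)

regfit.fwd <- regsubsets(Balance ~ ., data = Credit[, -1],   # drop the assumed ID column
                         nvmax = 11, method = "forward")
summary(regfit.fwd)    # which variables enter at each model size

Setting method = "exhaustive" instead would carry out best subset selection (Algorithm 6.1), at the computational cost discussed above.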

223 Linear Mdel Selectin and Regularizatin Unlike best subset selectin, which invlved fitting 2 p mdels, frward stepwise selectin invlves fitting ne null mdel, alng with p k mdels in the kth iteratin, fr k =0,...,p 1. This amunts t a ttal f 1 + p 1 k=0 (p k) =1+p(p+1)/2 mdels. This is a substantial difference: when p = 20, best subset selectin requires fitting 1,048,576 mdels, whereas frward stepwise selectin requires fitting nly 211 mdels. 1 In Step 2(b) f Algrithm 6.2, we must identify the best mdel frm amng thse p k that augment M k with ne additinal predictr. We can d this by simply chsing the mdel with the lwest RSS r the highest R 2. Hwever, in Step 3, we must identify the best mdel amng a set f mdels with different numbers f variables. This is mre challenging, and is discussed in Sectin Frward stepwise selectin s cmputatinal advantage ver best subset selectin is clear. Thugh frward stepwise tends t d well in practice, it is nt guaranteed t find the best pssible mdel ut f all 2 p mdels cntaining subsets f the p predictrs. Fr instance, suppse that in a given data set with p = 3 predictrs, the best pssible ne-variable mdel cntains X 1, and the best pssible tw-variable mdel instead cntains X 2 and X 3. Then frward stepwise selectin will fail t select the best pssible tw-variable mdel, because M 1 will cntain X 1,sM 2 must als cntain X 1 tgether with ne additinal variable. Table 6.1, which shws the first fur selected mdels fr best subset and frward stepwise selectin n the Credit data set, illustrates this phenmenn. Bth best subset selectin and frward stepwise selectin chse rating fr the best ne-variable mdel and then include incme and student fr the tw- and three-variable mdels. Hwever, best subset selectin replaces rating by cards in the fur-variable mdel, while frward stepwise selectin must maintain rating in its fur-variable mdel. In this example, Figure 6.1 indicates that there is nt much difference between the threeand fur-variable mdels in terms f RSS, s either f the fur-variable mdels will likely be adequate. Frward stepwise selectin can be applied even in the high-dimensinal setting where n<p; hwever, in this case, it is pssible t cnstruct submdels M 0,...,M n 1 nly, since each submdel is fit using least squares, which will nt yield a unique slutin if p n. Backward Stepwise Selectin Like frward stepwise selectin, backward stepwise selectin prvides an backward efficient alternative t best subset selectin. Hwever, unlike frward stepwise selectin 1 Thugh frward stepwise selectin cnsiders p(p +1)/2 +1mdels,itperfrmsa guided search ver mdel space, and s the effective mdel space cnsidered cntains substantially mre than p(p +1)/2 +1mdels.

224 6.1 Subset Selectin 209 #Variables Best subset Frward stepwise One rating rating Tw rating, incme rating, incme Three rating, incme, student rating, incme, student Fur cards, incme rating, incme, student, limit student, limit TABLE 6.1. The first fur selected mdels fr best subset selectin and frward stepwise selectin n the Credit data set. The first three mdels are identical but the furth mdels differ. stepwise selectin, it begins with the full least squares mdel cntaining all p predictrs, and then iteratively remves the least useful predictr, ne-at-a-time. Details are given in Algrithm 6.3. Algrithm 6.3 Backward stepwise selectin 1. Let M p dente the full mdel, which cntains all p predictrs. 2. Fr k = p, p 1,...,1: (a) Cnsider all k mdels that cntain all but ne f the predictrs in M k, fr a ttal f k 1predictrs. (b) Chse the best amng these k mdels, and call it M k 1.Here best is defined as having smallest RSS r highest R Select a single best mdel frm amng M 0,...,M p using crssvalidated predictin errr, C p (AIC), BIC, r adjusted R 2. Like frward stepwise selectin, the backward selectin apprach searches thrugh nly 1+p(p+1)/2 mdels, and s can be applied in settings where p is t large t apply best subset selectin. 2 Als like frward stepwise selectin, backward stepwise selectin is nt guaranteed t yield the best mdel cntaining a subset f the p predictrs. Backward selectin requires that the number f samples n is larger than the number f variables p (s that the full mdel can be fit). In cntrast, frward stepwise can be used even when n<p, and s is the nly viable subset methd when p is very large. 2 Like frward stepwise selectin, backward stepwise selectin perfrms a guided search ver mdel space, and s effectively cnsiders substantially mre than 1+p(p+1)/2 mdels.
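A corresponding backward stepwise sketch, under the same assumptions about the Credit data as in the forward stepwise sketch above, is shown below; recall that backward selection requires n > p so that the full model in Step 1 can be fit.

# Backward stepwise selection (illustrative sketch).
library(ISLR)
library(leaps)

regfit.bwd <- regsubsets(Balance ~ ., data = Credit[, -1],   # drop the assumed ID column
                         nvmax = 11, method = "backward")
coef(regfit.bwd, 4)    # coefficients of the best four-variable model found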

225 Linear Mdel Selectin and Regularizatin Hybrid Appraches The best subset, frward stepwise, and backward stepwise selectin appraches generally give similar but nt identical mdels. As anther alternative, hybrid versins f frward and backward stepwise selectin are available, in which variables are added t the mdel sequentially, in analgy t frward selectin. Hwever, after adding each new variable, the methd may als remve any variables that n lnger prvide an imprvement in the mdel fit. Such an apprach attempts t mre clsely mimic best subset selectin while retaining the cmputatinal advantages f frward and backward stepwise selectin Chsing the Optimal Mdel Best subset selectin, frward selectin, and backward selectin result in the creatin f a set f mdels, each f which cntains a subset f the p predictrs. In rder t implement these methds, we need a way t determine which f these mdels is best. As we discussed in Sectin 6.1.1, the mdel cntaining all f the predictrs will always have the smallest RSS and the largest R 2, since these quantities are related t the training errr. Instead, we wish t chse a mdel with a lw test errr. As is evident here, and as we shw in Chapter 2, the training errr can be a pr estimate f the test errr. Therefre, RSS and R 2 are nt suitable fr selecting the best mdel amng a cllectin f mdels with different numbers f predictrs. In rder t select the best mdel with respect t test errr, we need t estimate this test errr. There are tw cmmn appraches: 1. We can indirectly estimate test errr by making an adjustment t the training errr t accunt fr the bias due t verfitting. 2. We can directly estimate the test errr, using either a validatin set apprach r a crss-validatin apprach, as discussed in Chapter 5. We cnsider bth f these appraches belw. C p,aic,bic,andadjustedr 2 We shw in Chapter 2 that the training set MSE is generally an underestimate f the test MSE. (Recall that MSE = RSS/n.) This is because when we fit a mdel t the training data using least squares, we specifically estimate the regressin cefficients such that the training RSS (but nt the test RSS) is as small as pssible. In particular, the training errr will decrease as mre variables are included in the mdel, but the test errr may nt. Therefre, training set RSS and training set R 2 cannt be used t select frm amng a set f mdels with different numbers f variables. Hwever, a number f techniques fr adjusting the training errr fr the mdel size are available. These appraches can be used t select amng a set

of models with different numbers of variables. We now consider four such approaches: C_p, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted R^2. Figure 6.2 displays C_p, BIC, and adjusted R^2 for the best model of each size produced by best subset selection on the Credit data set.

FIGURE 6.2. C_p, BIC, and adjusted R^2 are shown for the best models of each size for the Credit data set (the lower frontier in Figure 6.1). C_p and BIC are estimates of test MSE. In the middle plot we see that the BIC estimate of test error shows an increase after four variables are selected. The other two plots are rather flat after four variables are included.

For a fitted least squares model containing d predictors, the C_p estimate of test MSE is computed using the equation

C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right),    (6.2)

where \hat{\sigma}^2 is an estimate of the variance of the error \epsilon associated with each response measurement in (6.1).^3 Essentially, the C_p statistic adds a penalty of 2 d \hat{\sigma}^2 to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error. Clearly, the penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS. Though it is beyond the scope of this book, one can show that if \hat{\sigma}^2 is an unbiased estimate of \sigma^2 in (6.2), then C_p is an unbiased estimate of test MSE. As a consequence, the C_p statistic tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest C_p value. In Figure 6.2, C_p selects the six-variable model containing the predictors income, limit, rating, cards, age and student.

^3 Mallow's C_p is sometimes defined as C'_p = \mathrm{RSS}/\hat{\sigma}^2 + 2d - n. This is equivalent to the definition given above in the sense that C_p = \frac{1}{n}\hat{\sigma}^2 (C'_p + n), and so the model with smallest C'_p also has smallest C_p.
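As a small illustration of equation (6.2), the sketch below computes C_p by hand for two models on the assumed Credit data, estimating \sigma^2 from the residual standard error of the largest model; that choice of estimate is a common convention rather than something specified in the text, and the variable names are assumptions.

# Computing the C_p statistic of (6.2) by hand (illustrative sketch).
library(ISLR)

n          <- nrow(Credit)
fit.full   <- lm(Balance ~ ., data = Credit[, -1])   # largest model, assumed ID column dropped
sigma2.hat <- summary(fit.full)$sigma^2               # estimate of sigma^2

cp.stat <- function(fit, sigma2, n) {
  d <- length(coef(fit)) - 1                          # number of predictors (dummy columns counted)
  (sum(resid(fit)^2) + 2 * d * sigma2) / n            # equation (6.2)
}

cp.stat(lm(Balance ~ Rating,          data = Credit), sigma2.hat, n)
cp.stat(lm(Balance ~ Rating + Income, data = Credit), sigma2.hat, n)

In practice, summary() applied to a regsubsets() fit reports cp, bic, and adjr2 for every model size, which is one way to reproduce Figure 6.2.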

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of the model (6.1) with Gaussian errors, maximum likelihood and least squares are the same thing. In this case AIC is given by

\mathrm{AIC} = \frac{1}{n \hat{\sigma}^2}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right),

where, for simplicity, we have omitted an additive constant. Hence for least squares models, C_p and AIC are proportional to each other, and so only C_p is displayed in Figure 6.2.

BIC is derived from a Bayesian point of view, but ends up looking similar to C_p (and AIC) as well. For the least squares model with d predictors, the BIC is, up to irrelevant constants, given by

\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d \hat{\sigma}^2\right).    (6.3)

Like C_p, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value. Notice that BIC replaces the 2 d \hat{\sigma}^2 used by C_p with a \log(n) d \hat{\sigma}^2 term, where n is the number of observations. Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than C_p. In Figure 6.2, we see that this is indeed the case for the Credit data set; BIC chooses a model that contains only the four predictors income, limit, cards, and student. In this case the curves are very flat and so there does not appear to be much difference in accuracy between the four-variable and six-variable models.

The adjusted R^2 statistic is another popular approach for selecting among a set of models that contain different numbers of variables. Recall from Chapter 3 that the usual R^2 is defined as 1 - RSS/TSS, where TSS = \sum_i (y_i - \bar{y})^2 is the total sum of squares for the response. Since RSS always decreases as more variables are added to the model, the R^2 always increases as more variables are added. For a least squares model with d variables, the adjusted R^2 statistic is calculated as

\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}.    (6.4)

Unlike C_p, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R^2 indicates a model with a small test error. Maximizing the adjusted R^2 is equivalent to minimizing RSS/(n - d - 1). While RSS always decreases as the number of variables in the model increases, RSS/(n - d - 1) may increase or decrease, due to the presence of d in the denominator.

The intuition behind the adjusted R^2 is that once all of the correct variables have been included in the model, adding additional noise variables

228 6.1 Subset Selectin 213 will lead t nly a very small decrease in RSS. Since adding nise variables leads t an increase in d, such variables will lead t an increase in RSS n d 1, and cnsequently a decrease in the adjusted R 2. Therefre, in thery, the mdel with the largest adjusted R 2 will have nly crrect variables and n nise variables. Unlike the R 2 statistic, the adjusted R 2 statistic pays apricefr the inclusin f unnecessary variables in the mdel. Figure 6.2 displays the adjusted R 2 fr the Credit data set. Using this statistic results in the selectin f a mdel that cntains seven variables, adding gender t the mdel selected by C p and AIC. C p, AIC, and BIC all have rigrus theretical justificatins that are beynd the scpe f this bk. These justificatins rely n asympttic arguments (scenaris where the sample size n is very large). Despite its ppularity, and even thugh it is quite intuitive, the adjusted R 2 is nt as well mtivated in statistical thery as AIC, BIC, and C p. All f these measures are simple t use and cmpute. Here we have presented the frmulas fr AIC, BIC, and C p in the case f a linear mdel fit using least squares; hwever, these quantities can als be defined fr mre general types f mdels. Validatin and Crss-Validatin As an alternative t the appraches just discussed, we can directly estimate the test errr using the validatin set and crss-validatin methds discussed in Chapter 5. We can cmpute the validatin set errr r the crss-validatin errr fr each mdel under cnsideratin, and then select the mdel fr which the resulting estimated test errr is smallest. This prcedure has an advantage relative t AIC, BIC, C p,andadjustedr 2,inthat it prvides a direct estimate f the test errr, and makes fewer assumptins abut the true underlying mdel. It can als be used in a wider range f mdel selectin tasks, even in cases where it is hard t pinpint the mdel degrees f freedm (e.g. the number f predictrs in the mdel) r hard t estimate the errr variance σ 2. In the past, perfrming crss-validatin was cmputatinally prhibitive fr many prblems with large p and/r large n, and s AIC, BIC, C p, and adjusted R 2 were mre attractive appraches fr chsing amng a set f mdels. Hwever, nwadays with fast cmputers, the cmputatins required t perfrm crss-validatin are hardly ever an issue. Thus, crssvalidatin is a very attractive apprach fr selecting frm amng a number f mdels under cnsideratin. Figure 6.3 displays, as a functin f d, the BIC, validatin set errrs, and crss-validatin errrs n the Credit data, fr the best d-variable mdel. The validatin errrs were calculated by randmly selecting three-quarters f the bservatins as the training set, and the remainder as the validatin set. The crss-validatin errrs were cmputed using k =10flds. In this case, the validatin and crss-validatin methds bth result in a

229 Linear Mdel Selectin and Regularizatin Square Rt f BIC Validatin Set Errr Crss Validatin Errr Number f Predictrs Number f Predictrs Number f Predictrs FIGURE 6.3. Fr the Credit data set, three quantities are displayed fr the best mdel cntaining d predictrs, fr d ranging frm 1 t 11. The verall best mdel, based n each f these quantities, is shwn as a blue crss. Left: Square rt f BIC. Center: Validatin set errrs. Right: Crss-validatin errrs. six-variable mdel. Hwever, all three appraches suggest that the fur-, five-, and six-variable mdels are rughly equivalent in terms f their test errrs. In fact, the estimated test errr curves displayed in the center and righthand panels f Figure 6.3 are quite flat. While a three-variable mdel clearly has lwer estimated test errr than a tw-variable mdel, the estimated test errrs f the 3- t 11-variable mdels are quite similar. Furthermre, if we repeated the validatin set apprach using a different split f the data int a training set and a validatin set, r if we repeated crss-validatin using a different set f crss-validatin flds, then the precise mdel with the lwest estimated test errr wuld surely change. In this setting, we can select a mdel using the ne-standard-errr rule. We first calculate the ne- standard errr f the estimated test MSE fr each mdel size, and then select the smallest mdel fr which the estimated test errr is within ne standard errr f the lwest pint n the curve. The ratinale here is that if a set f mdels appear t be mre r less equally gd, then we might as well chse the simplest mdel that is, the mdel with the smallest number f predictrs. In this case, applying the ne-standard-errr rule t the validatin set r crss-validatin apprach leads t selectin f the three-variable mdel. standard- errr rule 6.2 Shrinkage Methds The subset selectin methds described in Sectin 6.1 invlve using least squares t fit a linear mdel that cntains a subset f the predictrs. As an alternative, we can fit a mdel cntaining all p predictrs using a technique that cnstrains r regularizes the cefficient estimates, r equivalently, that shrinks the cefficient estimates twards zer. It may nt be immediately

obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

6.2.1 Ridge Regression

Recall from Chapter 3 that the least squares fitting procedure estimates \beta_0, \beta_1, ..., \beta_p using the values that minimize

\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2.

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates \hat{\beta}^R are the values that minimize

\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,    (6.5)

where \lambda \ge 0 is a tuning parameter, to be determined separately. Equation 6.5 trades off two different criteria. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, \lambda \sum_j \beta_j^2, called a shrinkage penalty, is small when \beta_1, ..., \beta_p are close to zero, and so it has the effect of shrinking the estimates of \beta_j towards zero. The tuning parameter \lambda serves to control the relative impact of these two terms on the regression coefficient estimates. When \lambda = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as \lambda \to \infty, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, \hat{\beta}^R_\lambda, for each value of \lambda. Selecting a good value for \lambda is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.

Note that in (6.5), the shrinkage penalty is applied to \beta_1, ..., \beta_p, but not to the intercept \beta_0. We want to shrink the estimated association of each variable with the response; however, we do not want to shrink the intercept, which is simply a measure of the mean value of the response when x_{i1} = x_{i2} = ... = x_{ip} = 0. If we assume that the variables (that is, the columns of the data matrix X) have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form \hat{\beta}_0 = \bar{y} = \sum_{i=1}^{n} y_i / n.
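One widely used implementation of ridge regression is glmnet(), where alpha = 0 selects the ridge penalty. The sketch below is not this chapter's lab code; the Credit data from the ISLR package and Balance as the response are illustrative assumptions, and model.matrix() is used because glmnet() requires a numeric predictor matrix.

# Ridge regression with glmnet (illustrative sketch).
library(ISLR)
library(glmnet)

x <- model.matrix(Balance ~ ., Credit[, -1])[, -1]   # build dummies, drop intercept column
y <- Credit$Balance

grid      <- 10^seq(4, -2, length = 100)             # a wide grid of lambda values
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # standardizes x by default

dim(coef(ridge.mod))      # one column of coefficient estimates per lambda
coef(ridge.mod)[, 50]     # coefficients at one value of lambda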

231 Linear Mdel Selectin and Regularizatin Standardized Cefficients e 02 1e+00 1e+02 1e+04 Incme Limit Rating Student Standardized Cefficients λ ˆβ R λ 2/ ˆβ 2 FIGURE 6.4. The standardized ridge regressin cefficients are displayed fr the Credit data set, as a functin f λ and ˆβ R λ 2/ ˆβ 2. An Applicatin t the Credit Data In Figure 6.4, the ridge regressin cefficient estimates fr the Credit data set are displayed. In the left-hand panel, each curve crrespnds t the ridge regressin cefficient estimate fr ne f the ten variables, pltted as a functin f λ. Fr example, the black slid line represents the ridge regressin estimate fr the incme cefficient, as λ is varied. At the extreme left-hand side f the plt, λ is essentially zer, and s the crrespnding ridge cefficient estimates are the same as the usual least squares estimates. But as λ increases, the ridge cefficient estimates shrink twards zer. When λ is extremely large, then all f the ridge cefficient estimates are basically zer; this crrespnds t the null mdel that cntains n predictrs. In this plt, the incme, limit, rating, andstudent variables are displayed in distinct clrs, since these variables tend t have by far the largest cefficient estimates. While the ridge cefficient estimates tend t decrease in aggregate as λ increases, individual cefficients, such as rating and incme, may ccasinally increase as λ increases. The right-hand panel f Figure 6.4 displays the same ridge cefficient estimates as the left-hand panel, but instead f displaying λ n the x-axis, we nw display ˆβ λ R 2/ ˆβ 2,where ˆβ dentes the vectr f least squares cefficient estimates. The ntatin β 2 dentes the l 2 nrm (prnunced p l2 nrm ell 2 ) f a vectr, and is defined as β 2 = j=1 β j 2.Itmeasures the distance f β frm zer. As λ increases, the l 2 nrm f ˆβ λ R will always decrease, and s will ˆβ λ R 2/ ˆβ 2. The latter quantity ranges frm 1 (when λ = 0, in which case the ridge regressin cefficient estimate is the same as the least squares estimate, and s their l 2 nrms are the same) t 0 (when λ =, in which case the ridge regressin cefficient estimate is a vectr f zers, with l 2 nrm equal t zer). Therefre, we can think f the x-axis in the right-hand panel f Figure 6.4 as the amunt that the ridge

232 6.2 Shrinkage Methds 217 regressin cefficient estimates have been shrunken twards zer; a small value indicates that they have been shrunken very clse t zer. The standard least squares cefficient estimates discussed in Chapter 3 by a cnstant c simply leads t a scale equivariant are scale equivariant: multiplying X j scaling f the least squares cefficient estimates by a factr f 1/c. Inther wrds, regardless f hw the jth predictr is scaled, X j ˆβj will remain the same. In cntrast, the ridge regressin cefficient estimates can change substantially when multiplying a given predictr by a cnstant. Fr instance, cnsider the incme variable, which is measured in dllars. One culd reasnably have measured incme in thusands f dllars, which wuld result in a reductin in the bserved values f incme by a factr f 1,000. Nw due t the sum f squared cefficients term in the ridge regressin frmulatin (6.5), such a change in scale will nt simply cause the ridge regressin cefficient estimate fr incme t change by a factr f 1,000. In ther wrds, X j ˆβR j,λ will depend nt nly n the value f λ, but als n the scaling f the jth predictr. In fact, the value f X j ˆβR j,λ may even depend n the scaling f the ther predictrs! Therefre, it is best t apply ridge regressin after standardizing the predictrs, usingthefrmula x ij = 1 n x ij n i=1 (x ij x j ) 2, (6.6) s that they are all n the same scale. In (6.6), the denminatr is the estimated standard deviatin f the jth predictr. Cnsequently, all f the standardized predictrs will have a standard deviatin f ne. As a result the final fit will nt depend n the scale n which the predictrs are measured. In Figure 6.4, the y-axis displays the standardized ridge regressin cefficient estimates that is, the cefficient estimates that result frm perfrming ridge regressin using standardized predictrs. Why Des Ridge Regressin Imprve Over Least Squares? Ridge regressin s advantage ver least squares is rted in the bias-variance trade-ff. Asλ increases, the flexibility f the ridge regressin fit decreases, leading t decreased variance but increased bias. This is illustrated in the left-hand panel f Figure 6.5, using a simulated data set cntaining p =45 predictrs and n = 50 bservatins. The green curve in the left-hand panel f Figure 6.5 displays the variance f the ridge regressin predictins as a functin f λ. At the least squares cefficient estimates, which crrespnd t ridge regressin with λ = 0, the variance is high but there is n bias. But as λ increases, the shrinkage f the ridge cefficient estimates leads t a substantial reductin in the variance f the predictins, at the expense f a slight increase in bias. Recall that the test mean squared errr (MSE), pltted in purple, is a functin f the variance plus the squared bias. Fr values

233 Linear Mdel Selectin and Regularizatin Mean Squared Errr Mean Squared Errr e 01 1e+01 1e λ ˆβR λ 2/ ˆβ 2 FIGURE 6.5. Squared bias (black), variance (green), and test mean squared errr (purple) fr the ridge regressin predictins n a simulated data set, as a functin f λ and ˆβ R λ 2/ ˆβ 2. The hrizntal dashed lines indicate the minimum pssible MSE. The purple crsses indicate the ridge regressin mdels fr which the MSE is smallest. f λ up t abut 10, the variance decreases rapidly, with very little increase in bias, pltted in black. Cnsequently, the MSE drps cnsiderably as λ increases frm 0 t 10. Beynd this pint, the decrease in variance due t increasing λ slws, and the shrinkage n the cefficients causes them t be significantly underestimated, resulting in a large increase in the bias. The minimum MSE is achieved at apprximately λ = 30. Interestingly, because f its high variance, the MSE assciated with the least squares fit, when λ = 0, is almst as high as that f the null mdel fr which all cefficient estimates are zer, when λ =. Hwever, fr an intermediate value f λ, the MSE is cnsiderably lwer. The right-hand panel f Figure 6.5 displays the same curves as the lefthand panel, this time pltted against the l 2 nrm f the ridge regressin cefficient estimates divided by the l 2 nrm f the least squares estimates. Nw as we mve frm left t right, the fits becme mre flexible, and s the bias decreases and the variance increases. In general, in situatins where the relatinship between the respnse and the predictrs is clse t linear, the least squares estimates will have lw bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares cefficient estimates. In particular, when the number f variables p is almst as large as the number f bservatins n, as in the example in Figure 6.5, the least squares estimates will be extremely variable. And if p>n, then the least squares estimates d nt even have a unique slutin, whereas ridge regressin can still perfrm well by trading ff a small increase in bias fr a large decrease in variance. Hence, ridge regressin wrks best in situatins where the least squares estimates have high variance. Ridge regressin als has substantial cmputatinal advantages ver best subset selectin, which requires searching thrugh 2 p mdels. As we

discussed previously, even for moderate values of p, such a search can be computationally infeasible. In contrast, for any fixed value of \lambda, ridge regression only fits a single model, and the model-fitting procedure can be performed quite quickly.

6.2.2 The Lasso

Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty \lambda \sum_j \beta_j^2 in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless \lambda = \infty). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing the value of \lambda will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \hat{\beta}^L_\lambda, minimize the quantity

\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|.    (6.7)

Comparing (6.7) to (6.5), we see that the lasso and ridge regression have similar formulations. The only difference is that the \beta_j^2 term in the ridge regression penalty (6.5) has been replaced by |\beta_j| in the lasso penalty (6.7). In statistical parlance, the lasso uses an \ell_1 (pronounced "ell 1") penalty instead of an \ell_2 penalty. The \ell_1 norm of a coefficient vector \beta is given by \|\beta\|_1 = \sum_j |\beta_j|.

As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the \ell_1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \lambda is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression. We say that the lasso yields sparse models, that is, models that involve only a subset of the variables. As in ridge regression, selecting a good value of \lambda for the lasso is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.
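The lasso is available through the same glmnet() interface with alpha = 1; the sketch below also anticipates the cross-validation choice of \lambda with cv.glmnet(). As before, this is not the chapter's lab code, and it assumes the x and y objects built for the ridge sketch (Credit data, Balance response).

# The lasso with glmnet, choosing lambda by cross-validation (illustrative sketch).
library(glmnet)

lasso.mod <- glmnet(x, y, alpha = 1)       # alpha = 1: lasso penalty
set.seed(1)
cv.out   <- cv.glmnet(x, y, alpha = 1)     # 10-fold CV over a default lambda grid
best.lam <- cv.out$lambda.min

predict(lasso.mod, type = "coefficients", s = best.lam)   # some coefficients are exactly zero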

235 Linear Mdel Selectin and Regularizatin Standardized Cefficients Standardized Cefficients Incme Limit Rating Student λ ˆβL λ 1/ ˆβ 1 FIGURE 6.6. The standardized lass cefficients n the Credit data set are shwn as a functin f λ and ˆβ L λ 1/ ˆβ 1. As an example, cnsider the cefficient plts in Figure 6.6, which are generated frm applying the lass t the Credit data set. When λ =0,then the lass simply gives the least squares fit, and when λ becmes sufficiently large, the lass gives the null mdel in which all cefficient estimates equal zer. Hwever, in between these tw extremes, the ridge regressin and lass mdels are quite different frm each ther. Mving frm left t right in the right-hand panel f Figure 6.6, we bserve that at first the lass results in a mdel that cntains nly the rating predictr. Then student and limit enter the mdel almst simultaneusly, shrtly fllwed by incme. Eventually, the remaining variables enter the mdel. Hence, depending n the value f λ, the lass can prduce a mdel invlving any number f variables. In cntrast, ridge regressin will always include all f the variables in the mdel, althugh the magnitude f the cefficient estimates will depend n λ. Anther Frmulatin fr Ridge Regressin and the Lass One can shw that the lass and ridge regressin cefficient estimates slve the prblems 2 n p p minimize y i β 0 β j x ij β i=1 j=1 subject t β j s j=1 (6.8) and 2 n p p minimize y i β 0 β j x ij β i=1 j=1 subject t βj 2 s, j=1 (6.9)

236 6.2 Shrinkage Methds 221 respectively. In ther wrds, fr every value f λ, thereissmes such that the Equatins (6.7) and (6.8) will give the same lass cefficient estimates. Similarly, fr every value f λ there is a crrespnding s such that Equatins (6.5) and (6.9) will give the same ridge regressin cefficient estimates. When p = 2, then (6.8) indicates that the lass cefficient estimates have the smallest RSS ut f all pints that lie within the diamnd defined by β 1 + β 2 s. Similarly, the ridge regressin estimates have the smallest RSS ut f all pints that lie within the circle defined by β β 2 2 s. We can think f (6.8) as fllws. When we perfrm the lass we are trying t find the set f cefficient estimates that lead t the smallest RSS, subject t the cnstraint that there is a budget s fr hw large p j=1 β j can be. When s is extremely large, then this budget is nt very restrictive, and s the cefficient estimates can be large. In fact, if s is large enugh that the least squares slutin falls within the budget, then (6.8) will simply yield the least squares slutin. In cntrast, if s is small, then p j=1 β j must be small in rder t avid vilating the budget. Similarly, (6.9) indicates that when we perfrm ridge regressin, we seek a set f cefficient estimates such that the RSS is as small as pssible, subject t the requirement that p j=1 β2 j nt exceed the budget s. The frmulatins (6.8) and (6.9) reveal a clse cnnectin between the lass, ridge regressin, and best subset selectin. Cnsider the prblem minimize β n y i β 0 i=1 2 p p β j x ij subject t I(β j 0) s. j=1 (6.10) Here I(β j 0) is an indicatr variable: it takes n a value f 1 if β j 0,and equals zer therwise. Then (6.10) amunts t finding a set f cefficient estimates such that RSS is as small as pssible, subject t the cnstraint that n mre than s cefficients can be nnzer. The prblem (6.10) is equivalent t best subset selectin. Unfrtunately, slving (6.10) is cmputatinally infeasible when p is large, since it requires cnsidering all ( p s) mdels cntaining s predictrs. Therefre, we can interpret ridge regressin and the lass as cmputatinally feasible alternatives t best subset selectin that replace the intractable frm f the budget in (6.10) with frms that are much easier t slve. Of curse, the lass is much mre clsely related t best subset selectin, since nly the lass perfrms feature selectin fr s sufficiently small in (6.8). The Variable Selectin Prperty f the Lass Why is it that the lass, unlike ridge regressin, results in cefficient estimates that are exactly equal t zer? The frmulatins (6.8) and (6.9) can be used t shed light n the issue. Figure 6.7 illustrates the situatin. The least squares slutin is marked as ˆβ, while the blue diamnd and j=1

237 Linear Mdel Selectin and Regularizatin β 2 ^ β 2 β β^ β 1 β 1 FIGURE 6.7. Cnturs f the errr and cnstraint functins fr the lass (left) and ridge regressin (right). The slid blue areas are the cnstraint regins, β 1 + β 2 s and β β 2 2 s, while the red ellipses are the cnturs f the RSS. circle represent the lass and ridge regressin cnstraints in (6.8) and (6.9), respectively. If s is sufficiently large, then the cnstraint regins will cntain ˆβ, and s the ridge regressin and lass estimates will be the same as the least squares estimates. (Such a large value f s crrespnds t λ =0 in (6.5) and (6.7).) Hwever, in Figure 6.7 the least squares estimates lie utside f the diamnd and the circle, and s the least squares estimates are nt the same as the lass and ridge regressin estimates. The ellipses that are centered arund ˆβ represent regins f cnstant RSS. In ther wrds, all f the pints n a given ellipse share a cmmn value f the RSS. As the ellipses expand away frm the least squares cefficient estimates, the RSS increases. Equatins (6.8) and (6.9) indicate that the lass and ridge regressin cefficient estimates are given by the first pint at which an ellipse cntacts the cnstraint regin. Since ridge regressin has a circular cnstraint with n sharp pints, this intersectin will nt generally ccur n an axis, and s the ridge regressin cefficient estimates will be exclusively nn-zer. Hwever, the lass cnstraint has crners at each f the axes, and s the ellipse will ften intersect the cnstraint regin at an axis. When this ccurs, ne f the cefficients will equal zer. In higher dimensins, many f the cefficient estimates may equal zer simultaneusly. In Figure 6.7, the intersectin ccurs at β 1 =0,andsthe resulting mdel will nly include β 2. In Figure 6.7, we cnsidered the simple case f p =2.Whenp =3, then the cnstraint regin fr ridge regressin becmes a sphere, and the cnstraint regin fr the lass becmes a plyhedrn. When p>3, the

238 6.2 Shrinkage Methds 223 Mean Squared Errr Mean Squared Errr λ R 2 n Training Data FIGURE 6.8. Left: Plts f squared bias (black), variance (green), and test MSE (purple) fr the lass n a simulated data set. Right: Cmparisn f squared bias, variance and test MSE between lass (slid) and ridge (dashed). Bth are pltted against their R 2 n the training data, as a cmmn frm f indexing. The crsses in bth plts indicate the lass mdel fr which the MSE is smallest. cnstraint fr ridge regressin becmes a hypersphere, and the cnstraint fr the lass becmes a plytpe. Hwever, the key ideas depicted in Figure 6.7 still hld. In particular, the lass leads t feature selectin when p>2 due t the sharp crners f the plyhedrn r plytpe. Cmparing the Lass and Ridge Regressin It is clear that the lass has a majr advantage ver ridge regressin, in that it prduces simpler and mre interpretable mdels that invlve nly a subset f the predictrs. Hwever, which methd leads t better predictin accuracy? Figure 6.8 displays the variance, squared bias, and test MSE f the lass applied t the same simulated data as in Figure 6.5. Clearly the lass leads t qualitatively similar behavir t ridge regressin, in that as λ increases, the variance decreases and the bias increases. In the right-hand panel f Figure 6.8, the dtted lines represent the ridge regressin fits. Here we plt bth against their R 2 n the training data. This is anther useful way t index mdels, and can be used t cmpare mdels with different types f regularizatin, as is the case here. In this example, the lass and ridge regressin result in almst identical biases. Hwever, the variance f ridge regressin is slightly lwer than the variance f the lass. Cnsequently, the minimum MSE f ridge regressin is slightly smaller than that f the lass. Hwever, the data in Figure 6.8 were generated in such a way that all 45 predictrs were related t the respnse that is, nne f the true cefficients β 1,...,β 45 equaled zer. The lass implicitly assumes that a number f the cefficients truly equal zer. Cnsequently, it is nt surprising that ridge regressin utperfrms the lass in terms f predictin errr in this setting. Figure 6.9 illustrates a similar situatin, except that nw the respnse is a

239 Linear Mdel Selectin and Regularizatin Mean Squared Errr Mean Squared Errr λ R 2 n Training Data FIGURE 6.9. Left: Plts f squared bias (black), variance (green), and test MSE (purple) fr the lass. The simulated data is similar t that in Figure 6.8, except that nw nly tw predictrs are related t the respnse. Right: Cmparisn f squared bias, variance and test MSE between lass (slid) and ridge (dashed). Bth are pltted against their R 2 n the training data, as a cmmn frm f indexing. The crsses in bth plts indicate the lass mdel fr which the MSE is smallest. functin f nly 2 ut f 45 predictrs. Nw the lass tends t utperfrm ridge regressin in terms f bias, variance, and MSE. These tw examples illustrate that neither ridge regressin nr the lass will universally dminate the ther. In general, ne might expect the lass t perfrm better in a setting where a relatively small number f predictrs have substantial cefficients, and the remaining predictrs have cefficients that are very small r that equal zer. Ridge regressin will perfrm better when the respnse is a functin f many predictrs, all with cefficients f rughly equal size. Hwever, the number f predictrs that is related t the respnse is never knwn apririfr real data sets. A technique such as crss-validatin can be used in rder t determine which apprach is better n a particular data set. As with ridge regressin, when the least squares estimates have excessively high variance, the lass slutin can yield a reductin in variance at the expense f a small increase in bias, and cnsequently can generate mre accurate predictins. Unlike ridge regressin, the lass perfrms variable selectin, and hence results in mdels that are easier t interpret. There are very efficient algrithms fr fitting bth ridge and lass mdels; in bth cases the entire cefficient paths can be cmputed with abut the same amunt f wrk as a single least squares fit. We will explre this further in the lab at the end f this chapter. A Simple Special Case fr Ridge Regressin and the Lass In rder t btain a better intuitin abut the behavir f ridge regressin and the lass, cnsider a simple special case with n = p, andx a diagnal matrix with 1 s n the diagnal and 0 s in all ff-diagnal elements.
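The cross-validation comparison just described is easy to carry out in practice. The sketch below is our own illustration rather than the code used to produce Figures 6.8 and 6.9: it simulates two small data sets in the same spirit, one in which all 45 coefficients are nonzero and one in which only two are, and compares ridge and lasso test errors with the glmnet package that is used in the lab at the end of this chapter. All simulation settings and object names here are invented for the example.

library(glmnet)   # used for ridge (alpha=0) and lasso (alpha=1) fits

set.seed(1)
n <- 50; p <- 45
x <- matrix(rnorm(2 * n * p), ncol = p)    # rows 1:n for training, the rest for testing
train <- 1:n
test  <- (n + 1):(2 * n)

beta.dense  <- rnorm(p, sd = 0.5)          # every predictor related to the response
beta.sparse <- c(3, 3, rep(0, p - 2))      # only two predictors related to the response

compare <- function(beta) {
  y <- drop(x %*% beta + rnorm(2 * n))
  test.mse <- function(alpha) {
    cv   <- cv.glmnet(x[train, ], y[train], alpha = alpha)   # CV chooses lambda
    pred <- predict(cv, s = "lambda.min", newx = x[test, ])
    mean((pred - y[test])^2)
  }
  c(ridge = test.mse(0), lasso = test.mse(1))
}

compare(beta.dense)    # ridge tends to do better here
compare(beta.sparse)   # lasso tends to do better here

We now return to the simple special case with n = p and a diagonal X introduced just above.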

To simplify the problem further, assume also that we are performing regression without an intercept. With these assumptions, the usual least squares problem simplifies to finding β_1, ..., β_p that minimize

\sum_{j=1}^{p} (y_j - \beta_j)^2.   (6.11)

In this case, the least squares solution is given by \hat{\beta}_j = y_j. And in this setting, ridge regression amounts to finding β_1, ..., β_p such that

\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2   (6.12)

is minimized, and the lasso amounts to finding the coefficients such that

\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|   (6.13)

is minimized. One can show that in this setting, the ridge regression estimates take the form

\hat{\beta}_j^R = y_j / (1 + \lambda),   (6.14)

and the lasso estimates take the form

\hat{\beta}_j^L = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2; \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2; \\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases}   (6.15)

Figure 6.10 displays the situation. We can see that ridge regression and the lasso perform two very different types of shrinkage. In ridge regression, each least squares coefficient estimate is shrunken by the same proportion. In contrast, the lasso shrinks each least squares coefficient towards zero by a constant amount, λ/2; the least squares coefficients that are less than λ/2 in absolute value are shrunken entirely to zero. The type of shrinkage performed by the lasso in this simple setting (6.15) is known as soft-thresholding. The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.

In the case of a more general data matrix X, the story is a little more complicated than what is depicted in Figure 6.10, but the main ideas still hold approximately: ridge regression more or less shrinks every dimension of the data by the same proportion, whereas the lasso more or less shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.
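The two shrinkage rules (6.14) and (6.15) are simple enough to code directly. The following sketch (our own illustration, with an arbitrary choice of λ) plots the ridge and lasso estimates against the least squares estimates y_j, reproducing the qualitative shapes of Figure 6.10.

lambda <- 2

ridge.est <- function(y, lambda) y / (1 + lambda)                        # proportional shrinkage, as in (6.14)
lasso.est <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)  # soft-thresholding, as in (6.15)

y.ls <- seq(-4, 4, length.out = 200)    # the least squares estimates in this special case
plot(y.ls, ridge.est(y.ls, lambda), type = "l", col = "red",
     xlab = "Least squares coefficient estimate", ylab = "Shrunken estimate")
lines(y.ls, lasso.est(y.ls, lambda), col = "blue")
abline(0, 1, lty = 2)    # 45-degree line: no shrinkage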

FIGURE 6.10. The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal. Left: The ridge regression coefficient estimates are shrunken proportionally towards zero, relative to the least squares estimates. Right: The lasso coefficient estimates are soft-thresholded towards zero.

Bayesian Interpretation for Ridge Regression and the Lasso

We now show that one can view ridge regression and the lasso through a Bayesian lens. A Bayesian viewpoint for regression assumes that the coefficient vector β has some prior distribution, say p(β), where β = (β_0, β_1, ..., β_p)^T. The likelihood of the data can be written as f(Y | X, β), where X = (X_1, ..., X_p). Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form

p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\, p(\beta \mid X) = f(Y \mid X, \beta)\, p(\beta),

where the proportionality above follows from Bayes' theorem, and the equality above follows from the assumption that X is fixed.

We assume the usual linear model, Y = β_0 + X_1 β_1 + ... + X_p β_p + ε, and suppose that the errors are independent and drawn from a normal distribution. Furthermore, assume that p(β) = \prod_{j=1}^{p} g(β_j), for some density function g. It turns out that ridge regression and the lasso follow naturally from two special cases of g:

If g is a Gaussian distribution with mean zero and standard deviation a function of λ, then it follows that the posterior mode for β (that is, the most likely value for β, given the data) is given by the ridge regression solution. (In fact, the ridge regression solution is also the posterior mean.)

242 6.2 Shrinkage Methds 227 (β j ) (βj) β j β j FIGURE Left: Ridge regressin is the psterir mde fr β under a Gaussian prir. Right: The lass is the psterir mde fr β under a duble-expnential prir. If g is a duble-expnential (Laplace) distributin with mean zer and scale parameter a functin f λ, then it fllws that the psterir mde fr β is the lass slutin. (Hwever, the lass slutin is nt the psterir mean, and in fact, the psterir mean des nt yield a sparse cefficient vectr.) The Gaussian and duble-expnential prirs are displayed in Figure Therefre, frm a Bayesian viewpint, ridge regressin and the lass fllw directly frm assuming the usual linear mdel with nrmal errrs, tgether with a simple prir distributin fr β. Ntice that the lass prir is steeply peaked at zer, while the Gaussian is flatter and fatter at zer. Hence, the lass expects a priri that many f the cefficients are (exactly) zer, while ridge assumes the cefficients are randmly distributed abut zer Selecting the Tuning Parameter Just as the subset selectin appraches cnsidered in Sectin 6.1 require a methd t determine which f the mdels under cnsideratin is best, implementing ridge regressin and the lass requires a methd fr selecting a value fr the tuning parameter λ in (6.5) and (6.7), r equivalently, the value f the cnstraint s in (6.9) and (6.8). Crss-validatin prvides a simple way t tackle this prblem. We chse a grid f λ values, and cmpute the crss-validatin errr fr each value f λ, as described in Chapter 5. We then select the tuning parameter value fr which the crss-validatin errr is smallest. Finally, the mdel is re-fit using all f the available bservatins and the selected value f the tuning parameter. Figure 6.12 displays the chice f λ that results frm perfrming leavene-ut crss-validatin n the ridge regressin fits frm the Credit data set. The dashed vertical lines indicate the selected value f λ. Inthiscase the value is relatively small, indicating that the ptimal fit nly invlves a

243 Linear Mdel Selectin and Regularizatin Crss Validatin Errr e 03 5e 02 5e 01 5e+00 λ Standardized Cefficients e 03 5e 02 5e 01 5e+00 λ FIGURE Left: Crss-validatin errrs that result frm applying ridge regressin t the Credit data set with varius value f λ. Right: The cefficient estimates as a functin f λ. The vertical dashed lines indicate the value f λ selected by crss-validatin. small amunt f shrinkage relative t the least squares slutin. In additin, the dip is nt very prnunced, s there is rather a wide range f values that wuld give very similar errr. In a case like this we might simply use the least squares slutin. Figure 6.13 prvides an illustratin f ten-fld crss-validatin applied t the lass fits n the sparse simulated data frm Figure 6.9. The left-hand panel f Figure 6.13 displays the crss-validatin errr, while the right-hand panel displays the cefficient estimates. The vertical dashed lines indicate the pint at which the crss-validatin errr is smallest. The tw clred lines in the right-hand panel f Figure 6.13 represent the tw predictrs that are related t the respnse, while the grey lines represent the unrelated predictrs; these are ften referred t as signal and nise variables, signal respectively. Nt nly has the lass crrectly given much larger cefficient estimates t the tw signal predictrs, but als the minimum crssvalidatin errr crrespnds t a set f cefficient estimates fr which nly the signal variables are nn-zer. Hence crss-validatin tgether with the lass has crrectly identified the tw signal variables in the mdel, even thugh this is a challenging setting, with p = 45 variables and nly n =50 bservatins. In cntrast, the least squares slutin displayed n the far right f the right-hand panel f Figure 6.13 assigns a large cefficient estimate t nly ne f the tw signal variables. 6.3 Dimensin Reductin Methds The methds that we have discussed s far in this chapter have cntrlled variance in tw different ways, either by using a subset f the riginal variables, r by shrinking their cefficients tward zer. All f these methds

are defined using the original predictors, X_1, X_2, ..., X_p. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

FIGURE 6.13. Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Figure 6.9. Right: The corresponding lasso coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.

Let Z_1, Z_2, ..., Z_M represent M < p linear combinations of our original p predictors. That is,

Z_m = \sum_{j=1}^{p} \phi_{jm} X_j   (6.16)

for some constants φ_{1m}, φ_{2m}, ..., φ_{pm}, m = 1, ..., M. We can then fit the linear regression model

y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i,   i = 1, ..., n,   (6.17)

using least squares. Note that in (6.17), the regression coefficients are given by θ_0, θ_1, ..., θ_M. If the constants φ_{1m}, φ_{2m}, ..., φ_{pm} are chosen wisely, then such dimension reduction approaches can often outperform least squares regression. In other words, fitting (6.17) using least squares can lead to better results than fitting (6.1) using least squares.

The term dimension reduction comes from the fact that this approach reduces the problem of estimating the p + 1 coefficients β_0, β_1, ..., β_p to the simpler problem of estimating the M + 1 coefficients θ_0, θ_1, ..., θ_M, where M < p. In other words, the dimension of the problem has been reduced from p + 1 to M + 1.

Notice that from (6.16),

\sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{jm} x_{ij} = \sum_{j=1}^{p} \sum_{m=1}^{M} \theta_m \phi_{jm} x_{ij} = \sum_{j=1}^{p} \beta_j x_{ij},

245 Linear Mdel Selectin and Regularizatin Ad Spending Ppulatin FIGURE The ppulatin size (pp) and ad spending (ad) fr 100 different cities are shwn as purple circles. The green slid line indicates the first principal cmpnent, and the blue dashed line indicates the secnd principal cmpnent. where β j = M θ m φ jm. (6.18) m=1 Hence (6.17) can be thught f as a special case f the riginal linear regressin mdel given by (6.1). Dimensin reductin serves t cnstrain the estimated β j cefficients, since nw they must take the frm (6.18). This cnstraint n the frm f the cefficients has the ptential t bias the cefficient estimates. Hwever, in situatins where p is large relative t n, selecting a value f M p can significantly reduce the variance f the fitted cefficients. If M = p, andallthez m are linearly independent, then (6.18) pses n cnstraints. In this case, n dimensin reductin ccurs, and s fitting (6.17) is equivalent t perfrming least squares n the riginal p predictrs. All dimensin reductin methds wrk in tw steps. First, the transfrmed predictrs Z 1,Z 2,...,Z M are btained. Secnd, the mdel is fit using these M predictrs. Hwever, the chice f Z 1,Z 2,...,Z M, r equivalently, the selectin f the φ jm s, can be achieved in different ways. In this chapter, we will cnsider tw appraches fr this task: principal cmpnents and partial least squares Principal Cmpnents Regressin Principal cmpnents analysis (PCA) is a ppular apprach fr deriving principal a lw-dimensinal set f features frm a large set f variables. PCA is discussed in greater detail as a tl fr unsupervised learning in Chapter 10. Here we describe its use as a dimensin reductin technique fr regressin. cmpnents analysis
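The bookkeeping in (6.16)-(6.18) can be checked numerically. The sketch below is our own illustration with an arbitrary (random) choice of the constants φ_jm: it builds Z_1, ..., Z_M, fits (6.17) by least squares, and confirms that the implied coefficients (6.18) on the original predictors reproduce the same fitted values.

set.seed(1)
n <- 100; p <- 6; M <- 2
X <- matrix(rnorm(n * p), ncol = p)
y <- rnorm(n)

Phi <- matrix(rnorm(p * M), nrow = p)   # column m holds the constants phi_{1m}, ..., phi_{pm}
Z <- X %*% Phi                          # Z_m = sum_j phi_{jm} X_j, as in (6.16)

fit   <- lm(y ~ Z)                      # the regression (6.17); coefficients theta_0, ..., theta_M
theta <- coef(fit)[-1]

beta.implied <- drop(Phi %*% theta)     # beta_j = sum_m theta_m phi_{jm}, as in (6.18)
all.equal(unname(fitted(fit)),
          drop(coef(fit)[1] + X %*% beta.implied))   # TRUE: identical fitted values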

246 6.3 Dimensin Reductin Methds 231 An Overview f Principal Cmpnents Analysis PCA is a technique fr reducing the dimensin f a n p data matrix X. The first principal cmpnent directin f the data is that alng which the bservatins vary the mst. Fr instance, cnsider Figure 6.14, which shws ppulatin size (pp) in tens f thusands f peple, and ad spending fr a particular cmpany (ad) in thusands f dllars, fr 100 cities. The green slid line represents the first principal cmpnent directin f the data. We can see by eye that this is the directin alng which there is the greatest variability in the data. That is, if we prjected the 100 bservatins nt this line (as shwn in the left-hand panel f Figure 6.15), then the resulting prjected bservatins wuld have the largest pssible variance; prjecting the bservatins nt any ther line wuld yield prjected bservatins with lwer variance. Prjecting a pint nt a line simply invlves finding the lcatin n the line which is clsest t the pint. The first principal cmpnent is displayed graphically in Figure 6.14, but hw can it be summarized mathematically? It is given by the frmula Z 1 =0.839 (pp pp) (ad ad). (6.19) Here φ 11 =0.839 and φ 21 =0.544 are the principal cmpnent ladings, which define the directin referred t abve. In (6.19), pp indicates the mean f all pp values in this data set, and ad indicates the mean f all advertising spending. The idea is that ut f every pssible linear cmbinatin f pp and ad such that φ φ2 21 = 1, this particular linear cmbinatin yields the highest variance: i.e. this is the linear cmbinatin fr which Var(φ 11 (pp pp)+φ 21 (ad ad)) is maximized. It is necessary t cnsider nly linear cmbinatins f the frm φ φ 2 21 = 1, since therwise we culd increase φ 11 and φ 21 arbitrarily in rder t blw up the variance. In (6.19), the tw ladings are bth psitive and have similar size, and s Z 1 is almst an average f the tw variables. Since n = 100, pp and ad are vectrs f length 100, and s is Z 1 in (6.19). Fr instance, z i1 =0.839 (pp i pp) (ad i ad). (6.20) The values f z 11,...,z n1 are knwn as the principal cmpnent scres, and can be seen in the right-hand panel f Figure There is als anther interpretatin fr PCA: the first principal cmpnent vectr defines the line that is as clse as pssible t the data. Fr instance, in Figure 6.14, the first principal cmpnent line minimizes the sum f the squared perpendicular distances between each pint and the line. These distances are pltted as dashed line segments in the left-hand panel f Figure 6.15, in which the crsses represent the prjectin f each pint nt the first principal cmpnent line. The first principal cmpnent has been chsen s that the prjected bservatins are as clse as pssible t the riginal bservatins.
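In practice the loadings and scores are not computed by hand. The sketch below simulates two correlated variables standing in for pop and ad (the numbers are invented; this is not the data behind Figure 6.14) and extracts the first principal component with R's built-in prcomp() function. The returned loadings play the role of φ_11 and φ_21 in (6.19), and the scores play the role of the z_i1 in (6.20).

set.seed(1)
pop <- rnorm(100, mean = 30, sd = 10)
ad  <- 0.6 * pop + rnorm(100, sd = 4)    # roughly linear relationship, as in Figure 6.14
dat <- cbind(pop, ad)

pc <- prcomp(dat, center = TRUE, scale. = FALSE)
pc$rotation[, 1]    # the loadings (phi_11, phi_21); they satisfy phi_11^2 + phi_21^2 = 1
head(pc$x[, 1])     # the first principal component scores z_i1

# The scores can also be built directly from the loadings, as in (6.20):
z1 <- (pop - mean(pop)) * pc$rotation["pop", 1] +
      (ad  - mean(ad))  * pc$rotation["ad",  1]
all.equal(unname(pc$x[, 1]), z1)    # TRUE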

247 Linear Mdel Selectin and Regularizatin Ad Spending nd Principal Cmpnent Ppulatin st Principal Cmpnent FIGURE A subset f the advertising data. The mean pp and ad budgets are indicated with a blue circle. Left: The first principal cmpnent directin is shwn in green. It is the dimensin alng which the data vary the mst, and it als defines the line that is clsest t all n f the bservatins. The distances frm each bservatin t the principal cmpnent are represented using the black dashed line segments. The blue dt represents (pp, ad). Right: The left-hand panel has been rtated s that the first principal cmpnent directin cincides with the x-axis. In the right-hand panel f Figure 6.15, the left-hand panel has been rtated s that the first principal cmpnent directin cincides with the x-axis. It is pssible t shw that the first principal cmpnent scre fr the ith bservatin, given in (6.20), is the distance in the x-directin f the ith crss frm zer. S fr example, the pint in the bttm-left crner f the left-hand panel f Figure 6.15 has a large negative principal cmpnent scre, z i1 = 26.1, while the pint in the tp-right crner has a large psitive scre, z i1 =18.7. These scres can be cmputed directly using (6.20). We can think f the values f the principal cmpnent Z 1 as singlenumber summaries f the jint pp and ad budgets fr each lcatin. In this example, if z i1 = (pp i pp) (ad i ad) < 0, then this indicates a city with belw-average ppulatin size and belwaverage ad spending. A psitive scre suggests the ppsite. Hw well can a single number represent bth pp and ad? In this case, Figure 6.14 indicates that pp and ad have apprximately a linear relatinship, and s we might expect that a single-number summary will wrk well. Figure 6.16 displays z i1 versus bth pp and ad. The plts shw a strng relatinship between the first principal cmpnent and the tw features. In ther wrds, the first principal cmpnent appears t capture mst f the infrmatin cntained in the pp and ad predictrs. S far we have cncentrated n the first principal cmpnent. In general, ne can cnstruct up t p distinct principal cmpnents. The secnd principal cmpnent Z 2 is a linear cmbinatin f the variables that is uncrrelated with Z 1, and has largest variance subject t this cnstraint. The secnd principal cmpnent directin is illustrated as a dashed blue line in Figure It turns ut that the zer crrelatin cnditin f Z 1 with Z 2

248 6.3 Dimensin Reductin Methds 233 Ppulatin Ad Spending st Principal Cmpnent 1st Principal Cmpnent FIGURE Plts f the first principal cmpnent scres z i1 versus pp and ad. The relatinships are strng. is equivalent t the cnditin that the directin must be perpendicular, r perpendicular rthgnal, t the first principal cmpnent directin. The secnd principal rthgnal cmpnent is given by the frmula Z 2 =0.544 (pp pp) (ad ad). Since the advertising data has tw predictrs, the first tw principal cmpnents cntain all f the infrmatin that is in pp and ad. Hwever, by cnstructin, the first cmpnent will cntain the mst infrmatin. Cnsider, fr example, the much larger variability f z i1 (the x-axis) versus z i2 (the y-axis) in the right-hand panel f Figure The fact that the secnd principal cmpnent scres are much clser t zer indicates that this cmpnent captures far less infrmatin. As anther illustratin, Figure 6.17 displays z i2 versus pp and ad. There is little relatinship between the secnd principal cmpnent and these tw predictrs, again suggesting that in this case, ne nly needs the first principal cmpnent in rder t accurately represent the pp and ad budgets. With tw-dimensinal data, such as in ur advertising example, we can cnstruct at mst tw principal cmpnents. Hwever, if we had ther predictrs, such as ppulatin age, incme level, educatin, and s frth, then additinal cmpnents culd be cnstructed. They wuld successively maximize variance, subject t the cnstraint f being uncrrelated with the preceding cmpnents. The Principal Cmpnents Regressin Apprach The principal cmpnents regressin (PCR) apprach invlves cnstructing principal the first M principal cmpnents, Z 1,...,Z M, and then using these cmpnents as the predictrs in a linear regressin mdel that is fit using least squares. The key idea is that ften a small number f principal cmpnents suffice t explain mst f the variability in the data, as well as the relatinship with the respnse. In ther wrds, we assume that the directins in which X 1,...,X p shw the mst variatin are the directins that are assciated with Y. While this assumptin is nt guaranteed cmpnents regressin

249 Linear Mdel Selectin and Regularizatin Ppulatin Ad Spending nd Principal Cmpnent 2nd Principal Cmpnent FIGURE Plts f the secnd principal cmpnent scres z i2 versus pp and ad. The relatinships are weak. Mean Squared Errr Mean Squared Errr Squared Bias Test MSE Variance Number f Cmpnents Number f Cmpnents FIGURE PCR was applied t tw simulated data sets. Left: Simulated data frm Figure 6.8. Right: Simulated data frm Figure 6.9. t be true, it ften turns ut t be a reasnable enugh apprximatin t give gd results. If the assumptin underlying PCR hlds, then fitting a least squares mdel t Z 1,...,Z M will lead t better results than fitting a least squares mdel t X 1,...,X p, since mst r all f the infrmatin in the data that relates t the respnse is cntained in Z 1,...,Z M, and by estimating nly M p cefficients we can mitigate verfitting. In the advertising data, the first principal cmpnent explains mst f the variance in bth pp and ad, s a principal cmpnent regressin that uses this single variable t predict sme respnse f interest, such as sales, will likely perfrm quite well. Figure 6.18 displays the PCR fits n the simulated data sets frm Figures 6.8 and 6.9. Recall that bth data sets were generated using n = 50 bservatins and p = 45 predictrs. Hwever, while the respnse in the first data set was a functin f all the predictrs, the respnse in the secnd data set was generated using nly tw f the predictrs. The curves are pltted as a functin f M, the number f principal cmpnents used as predictrs in the regressin mdel. As mre principal cmpnents are used in

250 6.3 Dimensin Reductin Methds 235 PCR Ridge Regressin and Lass Mean Squared Errr Squared Bias Test MSE Variance Mean Squared Errr Number f Cmpnents Shrinkage Factr FIGURE PCR, ridge regressin, and the lass were applied t a simulated data set in which the first five principal cmpnents f X cntain all the infrmatin abut the respnse Y. In each panel, the irreducible errr Var(ɛ) is shwn as a hrizntal dashed line. Left: Results fr PCR. Right: Results fr lass (slid) and ridge regressin (dtted). The x-axis displays the shrinkage factr f the cefficient estimates, defined as the l 2 nrm f the shrunken cefficient estimates divided by the l 2 nrm f the least squares estimate. the regressin mdel, the bias decreases, but the variance increases. This results in a typical U-shape fr the mean squared errr. When M = p = 45, then PCR amunts simply t a least squares fit using all f the riginal predictrs. The figure indicates that perfrming PCR with an apprpriate chice f M can result in a substantial imprvement ver least squares, especially in the left-hand panel. Hwever, by examining the ridge regressin and lass results in Figures 6.5, 6.8, and 6.9, we see that PCR des nt perfrm as well as the tw shrinkage methds in this example. The relatively wrse perfrmance f PCR in Figure 6.18 is a cnsequence f the fact that the data were generated in such a way that many principal cmpnents are required in rder t adequately mdel the respnse. In cntrast, PCR will tend t d well in cases when the first few principal cmpnents are sufficient t capture mst f the variatin in the predictrs as well as the relatinship with the respnse. The left-hand panel f Figure 6.19 illustrates the results frm anther simulated data set designed t be mre favrable t PCR. Here the respnse was generated in such a way that it depends exclusively n the first five principal cmpnents. Nw the bias drps t zer rapidly as M, the number f principal cmpnents used in PCR, increases. The mean squared errr displays a clear minimum at M = 5. The right-hand panel f Figure 6.19 displays the results n these data using ridge regressin and the lass. All three methds ffer a significant imprvement ver least squares. Hwever, PCR and ridge regressin slightly utperfrm the lass. We nte that even thugh PCR prvides a simple way t perfrm regressin using M < p predictrs, it is nt a feature selectin methd. This is because each f the M principal cmpnents used in the regressin

251 Linear Mdel Selectin and Regularizatin Standardized Cefficients Incme Limit Rating Student Crss Validatin MSE Number f Cmpnents Number f Cmpnents FIGURE Left: PCR standardized cefficient estimates n the Credit data set fr different values f M. Right: The ten-fld crss validatin MSE btained using PCR, as a functin f M. is a linear cmbinatin f all p f the riginal features. Fr instance, in (6.19), Z 1 was a linear cmbinatin f bth pp and ad. Therefre, while PCR ften perfrms quite well in many practical settings, it des nt result in the develpment f a mdel that relies upn a small set f the riginal features. In this sense, PCR is mre clsely related t ridge regressin than t the lass. In fact, ne can shw that PCR and ridge regressin are very clsely related. One can even think f ridge regressin as a cntinuus versin f PCR! 4 In PCR, the number f principal cmpnents, M, is typically chsen by crss-validatin. The results f applying PCR t the Credit data set are shwn in Figure 6.20; the right-hand panel displays the crss-validatin errrs btained, as a functin f M. On these data, the lwest crssvalidatin errr ccurs when there are M = 10 cmpnents; this crrespnds t almst n dimensin reductin at all, since PCR with M =11 is equivalent t simply perfrming least squares. When perfrming PCR, we generally recmmend standardizing each predictr, using (6.6), prir t generating the principal cmpnents. This standardizatin ensures that all variables are n the same scale. In the absence f standardizatin, the high-variance variables will tend t play a larger rle in the principal cmpnents btained, and the scale n which the variables are measured will ultimately have an effect n the final PCR mdel. Hwever, if the variables are all measured in the same units (say, kilgrams, r inches), then ne might chse nt t standardize them. 4 Mre details can be fund in Sectin 3.5 f Elements f Statistical Learning by Hastie, Tibshirani, and Friedman.
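PCR is straightforward to run in R. One convenient implementation is the pcr() function in the pls package; the following is a minimal sketch on the Hitters data (assuming the rows with missing Salary have been removed, as in Section 6.5.1), with scale=TRUE performing the standardization recommended above and validation="CV" computing the ten-fold cross-validation error for each value of M.

library(pls)     # provides pcr() for principal components regression
library(ISLR)
Hitters <- na.omit(Hitters)    # remove players with missing Salary

set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")

summary(pcr.fit)                             # CV error and % variance explained for each M
validationplot(pcr.fit, val.type = "MSEP")   # cross-validation MSE as a function of M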

252 6.3 Dimensin Reductin Methds 237 Ad Spending Ppulatin FIGURE Fr the advertising data, the first PLS directin (slid line) and first PCR directin (dtted line) are shwn Partial Least Squares The PCR apprach that we just described invlves identifying linear cmbinatins, r directins, that best represent the predictrs X 1,...,X p.these directins are identified in an unsupervised way, since the respnse Y is nt used t help determine the principal cmpnent directins. That is, the respnse des nt supervise the identificatin f the principal cmpnents. Cnsequently, PCR suffers frm a drawback: there is n guarantee that the directins that best explain the predictrs will als be the best directins t use fr predicting the respnse. Unsupervised methds are discussed further in Chapter 10. We nw present partial least squares (PLS), a supervised alternative t partial least PCR. Like PCR, PLS is a dimensin reductin methd, which first identifies squares a new set f features Z 1,...,Z M that are linear cmbinatins f the riginal features, and then fits a linear mdel via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way that is, it makes use f the respnse Y in rder t identify new features that nt nly apprximate the ld features well, but als that are relatedttherespnse. Rughly speaking, the PLS apprach attempts t find directins that help explain bth the respnse and the predictrs. We nw describe hw the first PLS directin is cmputed. After standardizing the p predictrs, PLS cmputes the first directin Z 1 by setting each φ j1 in (6.16) equal t the cefficient frm the simple linear regressin f Y nt X j. One can shw that this cefficient is prprtinal t the crrelatin between Y and X j. Hence, in cmputing Z 1 = p j=1 φ j1x j,pls places the highest weight n the variables that are mst strngly related t the respnse. Figure 6.21 displays an example f PLS n the advertising data. The slid green line indicates the first PLS directin, while the dtted line shws the first principal cmpnent directin. PLS has chsen a directin that has less change in the ad dimensin per unit change in the pp dimensin, relative

253 Linear Mdel Selectin and Regularizatin t PCA. This suggests that pp is mre highly crrelated with the respnse than is ad. The PLS directin des nt fit the predictrs as clsely as des PCA, but it des a better jb explaining the respnse. T identify the secnd PLS directin we first adjust each f the variables fr Z 1, by regressing each variable n Z 1 and taking residuals. These residuals can be interpreted as the remaining infrmatin that has nt been explained by the first PLS directin. We then cmpute Z 2 using this rthgnalized data in exactly the same fashin as Z 1 was cmputed based n the riginal data. This iterative apprach can be repeated M times t identify multiple PLS cmpnents Z 1,...,Z M. Finally, at the end f this prcedure, we use least squares t fit a linear mdel t predict Y using Z 1,...,Z M in exactly the same fashin as fr PCR. As with PCR, the number M f partial least squares directins used in PLS is a tuning parameter that is typically chsen by crss-validatin. We generally standardize the predictrs and respnse befre perfrming PLS. PLS is ppular in the field f chemmetrics, where many variables arise frm digitized spectrmetry signals. In practice it ften perfrms n better than ridge regressin r PCR. While the supervised dimensin reductin f PLS can reduce bias, it als has the ptential t increase variance, s that the verall benefit f PLS relative t PCR is a wash. 6.4 Cnsideratins in High Dimensins High-Dimensinal Data Mst traditinal statistical techniques fr regressin and classificatin are intended fr the lw-dimensinal setting in which n, the number f b- lwdimensinal servatins, is much greater than p, the number f features. This is due in part t the fact that thrughut mst f the field s histry, the bulk f scientific prblems requiring the use f statistics have been lw-dimensinal. Fr instance, cnsider the task f develping a mdel t predict a patient s bld pressure n the basis f his r her age, gender, and bdy mass index (BMI). There are three predictrs, r fur if an intercept is included in the mdel, and perhaps several thusand patients fr whm bld pressure and age, gender, and BMI are available. Hence n p, and s the prblem is lw-dimensinal. (By dimensin here we are referring t the size f p.) In the past 20 years, new technlgies have changed the way that data are cllected in fields as diverse as finance, marketing, and medicine. It is nw cmmnplace t cllect an almst unlimited number f feature measurements (p very large). While p can be extremely large, the number f bservatins n is ften limited due t cst, sample availability, r ther cnsideratins. Tw examples are as fllws: 1. Rather than predicting bld pressure n the basis f just age, gender, and BMI, ne might als cllect measurements fr half a millin

254 6.4 Cnsideratins in High Dimensins 239 single nucletide plymrphisms (SNPs; these are individual DNA mutatins that are relatively cmmn in the ppulatin) fr inclusin in the predictive mdel. Then n 200 and p 500, A marketing analyst interested in understanding peple s nline shpping patterns culd treat as features all f the search terms entered by users f a search engine. This is smetimes knwn as the bag-fwrds mdel. The same researcher might have access t the search histries f nly a few hundred r a few thusand search engine users wh have cnsented t share their infrmatin with the researcher. Fr a given user, each f the p search terms is scred present (0) r absent (1), creating a large binary feature vectr. Then n 1,000 and p is much larger. Data sets cntaining mre features than bservatins are ften referred t as high-dimensinal. Classical appraches such as least squares linear highdimensinal regressin are nt apprpriate in this setting. Many f the issues that arise in the analysis f high-dimensinal data were discussed earlier in this bk, since they apply als when n>p: these include the rle f the bias-variance trade-ff and the danger f verfitting. Thugh these issues are always relevant, they can becme particularly imprtant when the number f features is very large relative t the number f bservatins. We have defined the high-dimensinal setting as the case where the number f features p is larger than the number f bservatins n. Butthecnsideratins that we will nw discuss certainly als apply if p is slightly smaller than n, and are best always kept in mind when perfrming supervised learning What Ges Wrng in High Dimensins? In rder t illustrate the need fr extra care and specialized techniques fr regressin and classificatin when p>n, we begin by examining what can g wrng if we apply a statistical technique nt intended fr the highdimensinal setting. Fr this purpse, we examine least squares regressin. But the same cncepts apply t lgistic regressin, linear discriminant analysis, and ther classical statistical appraches. When the number f features p is as large as, r larger than, the number f bservatins n, least squares as described in Chapter 3 cannt (r rather, shuld nt) be perfrmed. The reasn is simple: regardless f whether r nt there truly is a relatinship between the features and the respnse, least squares will yield a set f cefficient estimates that result in a perfect fit t the data, such that the residuals are zer. An example is shwn in Figure 6.22 with p = 1 feature (plus an intercept) in tw cases: when there are 20 bservatins, and when there are nly tw bservatins. When there are 20 bservatins, n>pand the least

255 Linear Mdel Selectin and Regularizatin Y Y X X FIGURE Left: Least squares regressin in the lw-dimensinal setting. Right: Least squares regressin with n =2bservatins and tw parameters t be estimated (an intercept and a cefficient). squares regressin line des nt perfectly fit the data; instead, the regressin line seeks t apprximate the 20 bservatins as well as pssible. On the ther hand, when there are nly tw bservatins, then regardless f the values f thse bservatins, the regressin line will fit the data exactly. This is prblematic because this perfect fit will almst certainly lead t verfitting f the data. In ther wrds, thugh it is pssible t perfectly fit the training data in the high-dimensinal setting, the resulting linear mdel will perfrm extremely prly n an independent test set, and therefre des nt cnstitute a useful mdel. In fact, we can see that this happened in Figure 6.22: the least squares line btained in the right-hand panel will perfrm very prly n a test set cmprised f the bservatins in the lefthand panel. The prblem is simple: when p>nr p n, a simple least squares regressin line is t flexible and hence verfits the data. Figure 6.23 further illustrates the risk f carelessly applying least squares when the number f features p is large. Data were simulated with n =20 bservatins, and regressin was perfrmed with between 1 and 20 features, each f which was cmpletely unrelated t the respnse. As shwn in the figure, the mdel R 2 increases t 1 as the number f features included in the mdel increases, and crrespndingly the training set MSE decreases t 0 as the number f features increases, even thugh the features are cmpletely unrelatedttherespnse. On the ther hand, the MSE n an independent test set becmes extremely large as the number f features included in the mdel increases, because including the additinal predictrs leads t a vast increase in the variance f the cefficient estimates. Lking at the test set MSE, it is clear that the best mdel cntains at mst a few variables. Hwever, smene wh carelessly examines nly the R 2 r the training set MSE might errneusly cnclude that the mdel with the greatest number f variables is best. This indicates the imprtance f applying extra care

256 6.4 Cnsideratins in High Dimensins 241 R Training MSE Test MSE Number f Variables Number f Variables Number f Variables FIGURE On a simulated example with n = 20 training bservatins, features that are cmpletely unrelated t the utcme are added t the mdel. Left: The R 2 increases t 1 as mre features are included. Center: The training set MSE decreases t 0 as mre features are included. Right: The test set MSE increases as mre features are included. when analyzing data sets with a large number f variables, and f always evaluating mdel perfrmance n an independent test set. In Sectin 6.1.3, we saw a number f appraches fr adjusting the training set RSS r R 2 in rder t accunt fr the number f variables used t fit a least squares mdel. Unfrtunately, the C p, AIC, and BIC appraches are nt apprpriate in the high-dimensinal setting, because estimating ˆσ 2 is prblematic. (Fr instance, the frmula fr ˆσ 2 frm Chapter 3 yields an estimate ˆσ 2 = 0 in this setting.) Similarly, prblems arise in the applicatin f adjusted R 2 in the high-dimensinal setting, since ne can easily btain a mdel with an adjusted R 2 value f 1. Clearly, alternative appraches that are better-suited t the high-dimensinal setting are required Regressin in High Dimensins It turns ut that many f the methds seen in this chapter fr fitting less flexible least squares mdels, such as frward stepwise selectin, ridge regressin, the lass, and principal cmpnents regressin, are particularly useful fr perfrming regressin in the high-dimensinal setting. Essentially, these appraches avid verfitting by using a less flexible fitting apprach than least squares. Figure 6.24 illustrates the perfrmance f the lass in a simple simulated example. There are p = 20, 50, r 2,000 features, f which 20 are truly assciated with the utcme. The lass was perfrmed n n = 100 training bservatins, and the mean squared errr was evaluated n an independent test set. As the number f features increases, the test set errr increases. When p = 20, the lwest validatin set errr was achieved when λ in (6.7) was small; hwever, when p was larger then the lwest validatin set errr was achieved using a larger value f λ. In each bxplt, rather than reprting the values f λ used, the degrees f freedm f the resulting

257 Linear Mdel Selectin and Regularizatin p = 20 p = 50 p = Degrees f Freedm Degrees f Freedm Degrees f Freedm FIGURE The lass was perfrmed with n = 100 bservatins and three values f p, the number f features. Of the p features, 20 were assciated with the respnse. The bxplts shw the test MSEs that result using three different values f the tuning parameter λ in (6.7). Fr ease f interpretatin, rather than reprting λ, the degrees f freedm are reprted; fr the lass this turns ut t be simply the number f estimated nn-zer cefficients. When p =20,the lwest test MSE was btained with the smallest amunt f regularizatin. When p =50, the lwest test MSE was achieved when there is a substantial amunt f regularizatin. When p = 2,000 the lass perfrmed prly regardless f the amunt f regularizatin, due t the fact that nly 20 f the 2,000 features truly are assciated with the utcme. lass slutin is displayed; this is simply the number f nn-zer cefficient estimates in the lass slutin, and is a measure f the flexibility f the lass fit. Figure 6.24 highlights three imprtant pints: (1) regularizatin r shrinkage plays a key rle in high-dimensinal prblems, (2) apprpriate tuning parameter selectin is crucial fr gd predictive perfrmance, and (3) the test errr tends t increase as the dimensinality f the prblem (i.e. the number f features r predictrs) increases, unless the additinal features are truly assciated with the respnse. The third pint abve is in fact a key principle in the analysis f highdimensinal data, which is knwn as the curse f dimensinality. Onemight curse f dimensinality think that as the number f features used t fit a mdel increases, the quality f the fitted mdel will increase as well. Hwever, cmparing the left-hand and right-hand panels in Figure 6.24, we see that this is nt necessarily the case: in this example, the test set MSE almst dubles as p increases frm 20 t 2,000. In general, adding additinal signal features that are truly assciated with the respnse will imprve the fitted mdel, in the sense f leading t a reductin in test set errr. Hwever, adding nise features that are nt truly assciated with the respnse will lead t a deteriratin in the fitted mdel, and cnsequently an increased test set errr. This is because nise features increase the dimensinality f the

258 6.4 Cnsideratins in High Dimensins 243 prblem, exacerbating the risk f verfitting (since nise features may be assigned nnzer cefficients due t chance assciatins with the respnse n the training set) withut any ptential upside in terms f imprved test set errr. Thus, we see that new technlgies that allw fr the cllectin f measurements fr thusands r millins f features are a duble-edged swrd: they can lead t imprved predictive mdels if these features are in fact relevant t the prblem at hand, but will lead t wrse results if the features are nt relevant. Even if they are relevant, the variance incurred in fitting their cefficients may utweigh the reductin in bias that they bring Interpreting Results in High Dimensins When we perfrm the lass, ridge regressin, r ther regressin prcedures in the high-dimensinal setting, we must be quite cautius in the way that we reprt the results btained. In Chapter 3, we learned abut multicllinearity, the cncept that the variables in a regressin might be crrelated with each ther. In the high-dimensinal setting, the multicllinearity prblem is extreme: any variable in the mdel can be written as a linear cmbinatin f all f the ther variables in the mdel. Essentially, this means that we can never knw exactly which variables (if any) truly are predictive f the utcme, and we can never identify the best cefficients fr use in the regressin. At mst, we can hpe t assign large regressin cefficients t variables that are crrelated with the variables that truly are predictive f the utcme. Fr instance, suppse that we are trying t predict bld pressure n the basis f half a millin SNPs, and that frward stepwise selectin indicates that 17 f thse SNPs lead t a gd predictive mdel n the training data. It wuld be incrrect t cnclude that these 17 SNPs predict bld pressure mre effectively than the ther SNPs nt included in the mdel. There are likely t be many sets f 17 SNPs that wuld predict bld pressure just as well as the selected mdel. If we were t btain an independent data set and perfrm frward stepwise selectin n that data set, we wuld likely btain a mdel cntaining a different, and perhaps even nn-verlapping, set f SNPs. This des nt detract frm the value f the mdel btained fr instance, the mdel might turn ut t be very effective in predicting bld pressure n an independent set f patients, and might be clinically useful fr physicians. But we must be careful nt t verstate the results btained, and t make it clear that what we have identified is simply ne f many pssible mdels fr predicting bld pressure, and that it must be further validated n independent data sets. It is als imprtant t be particularly careful in reprting errrs and measures f mdel fit in the high-dimensinal setting. We have seen that when p>n, it is easy t btain a useless mdel that has zer residuals. Therefre, ne shuld never use sum f squared errrs, p-values, R 2

259 Linear Mdel Selectin and Regularizatin statistics, r ther traditinal measures f mdel fit n the training data as evidence f a gd mdel fit in the high-dimensinal setting. Fr instance, as we saw in Figure 6.23, ne can easily btain a mdel with R 2 =1when p>n. Reprting this fact might mislead thers int thinking that a statistically valid and useful mdel has been btained, whereas in fact this prvides abslutely n evidence f a cmpelling mdel. It is imprtant t instead reprt results n an independent test set, r crss-validatin errrs. Fr instance, the MSE r R 2 n an independent test set is a valid measure f mdel fit, but the MSE n the training set certainly is nt. 6.5 Lab 1: Subset Selectin Methds Best Subset Selectin Here we apply the best subset selectin apprach t the Hitters data. We wish t predict a baseball player s Salary n the basis f varius statistics assciated with perfrmance in the previus year. First f all, we nte that the Salary variable is missing fr sme f the players. The is.na() functin can be used t identify the missing bserva- is.na() tins. It returns a vectr f the same length as the input vectr, with a TRUE fr any elements that are missing, and a FALSE fr nn-missing elements. The sum() functin can then be used t cunt all f the missing elements. sum() > library(islr) > fix(hitters) > names(hitters) [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun" [11] "CRuns" "CRBI" "CWalks" "League" "Divisin" [16] "PutOuts" "Assists" "Errrs" "Salary" "NewLeague " > dim(hitters) [1] > sum(is.na(hitters$salary)) [1] 59 Hence we see that Salary is missing fr 59 players. The na.mit() functin remves all f the rws that have missing values in any variable. > Hitters=na.mit(Hitters) > dim(hitters) [1] > sum(is.na(hitters)) [1] 0 The regsubsets() functin (part f the leaps library) perfrms best sub- regsubsets() set selectin by identifying the best mdel that cntains a given number f predictrs, where best is quantified using RSS. The syntax is the same as fr lm(). The summary() cmmand utputs the best set f variables fr each mdel size.

260 6.5 Lab 1: Subset Selectin Methds 245 > library(leaps) > regfit.full=regsubsets (Salary.,Hitters) > summary(regfit.full) Subset selectin bject Call: regsubsets.frmula(salary., Hitters) 19 Variables (and intercept )... 1 subsets f each size up t 8 Selectin Algrithm : exhaustive AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits 1 ( 1 ) " " " " " " " " " " " " " " " " " " 2 ( 1 ) " " "*" " " " " " " " " " " " " " " 3 ( 1 ) " " "*" " " " " " " " " " " " " " " 4 ( 1 ) " " "*" " " " " " " " " " " " " " " 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " CHmRun CRuns CRBI CWalks LeagueN DivisinW PutOuts 1 ( 1 ) " " " " "*" " " " " " " " " 2 ( 1 ) " " " " "*" " " " " " " " " 3 ( 1 ) " " " " "*" " " " " " " "*" 4 ( 1 ) " " " " "*" " " " " "*" "*" 5 ( 1 ) " " " " "*" " " " " "*" "*" 6 ( 1 ) " " " " "*" " " " " "*" "*" 7 ( 1 ) "*" " " " " " " " " "*" "*" 8 ( 1 ) "*" "*" " " "*" " " "*" "*" Assists Errrs NewLeagueN 1 ( 1 ) " " " " " " 2 ( 1 ) " " " " " " 3 ( 1 ) " " " " " " 4 ( 1 ) " " " " " " 5 ( 1 ) " " " " " " 6 ( 1 ) " " " " " " 7 ( 1 ) " " " " " " 8 ( 1 ) " " " " " " An asterisk indicates that a given variable is included in the crrespnding mdel. Fr instance, this utput indicates that the best tw-variable mdel cntains nly Hits and CRBI. By default, regsubsets() nly reprts results up t the best eight-variable mdel. But the nvmax ptin can be used in rder t return as many variables as are desired. Here we fit up t a 19-variable mdel. > regfit.full=regsubsets (Salary.,data=Hitters,nvmax=19) > reg.summary=summary(regfit.full) The summary() functin als returns R 2, RSS, adjusted R 2, C p,andbic. We can examine these t try t select the best verall mdel. > names(reg.summary) [1] "which" "rsq" "rss" "adjr2" "cp" "bic" [7] "utmat" "bj"

261 Linear Mdel Selectin and Regularizatin Fr instance, we see that the R 2 statistic increases frm 32 %, when nly ne variable is included in the mdel, t almst 55 %, when all variables are included. As expected, the R 2 statistic increases mntnically as mre variables are included. > reg.summary$rsq [1] [10] [19] Pltting RSS, adjusted R 2, C p, and BIC fr all f the mdels at nce will help us decide which mdel t select. Nte the type="l" ptin tells R t cnnect the pltted pints with lines. > par(mfrw=c(2,2)) > plt(reg.summary$rss,xlab="number f Variables ",ylab="rss", type="l") > plt(reg.summary$adjr2,xlab="number f Variables ", ylab="adjusted RSq",type="l") The pints() cmmand wrks like the plt() cmmand, except that it pints() puts pints n a plt that has already been created, instead f creating a new plt. The which.max() functin can be used t identify the lcatin f the maximum pint f a vectr. We will nw plt a red dt t indicate the mdel with the largest adjusted R 2 statistic. > which.max(reg.summary$adjr2) [1] 11 > pints(11,reg.summary$adjr2[11], cl="red",cex=2,pch=20) In a similar fashin we can plt the C p and BIC statistics, and indicate the mdels with the smallest statistic using which.min(). > plt(reg.summary$cp,xlab="number f Variables ",ylab="cp", type= l ) > which.min(reg.summary$cp ) [1] 10 > pints(10,reg.summary$cp [10],cl="red",cex=2,pch=20) > which.min(reg.summary$bic ) [1] 6 > plt(reg.summary$bic,xlab="number f Variables ",ylab="bic", type= l ) > pints(6,reg.summary$bic [6],cl="red",cex=2,pch=20) which.min() The regsubsets() functin has a built-in plt() cmmand which can be used t display the selected variables fr the best mdel with a given number f predictrs, ranked accrding t the BIC, C p, adjusted R 2,r AIC. T find ut mre abut this functin, type?plt.regsubsets. > plt(regfit.full,scale="r2") > plt(regfit.full,scale="adjr2") > plt(regfit.full,scale="cp") > plt(regfit.full,scale="bic")

262 6.5 Lab 1: Subset Selectin Methds 247 The tp rw f each plt cntains a black square fr each variable selected accrding t the ptimal mdel assciated with that statistic. Fr instance, we see that several mdels share a BIC clse t 150. Hwever, the mdel with the lwest BIC is the six-variable mdel that cntains nly AtBat, Hits, Walks, CRBI, DivisinW, andputouts. We can use the cef() functin t see the cefficient estimates assciated with this mdel. > cef(regfit.full,6) (Intercept ) AtBat Hits Walks CRBI DivisinW PutOuts Frward and Backward Stepwise Selectin We can als use the regsubsets() functin t perfrm frward stepwise r backward stepwise selectin, using the argument methd="frward" r methd="backward". > regfit.fwd=regsubsets (Salary.,data=Hitters,nvmax=19, methd="frward") > summary(regfit.fwd) > regfit.bwd=regsubsets (Salary.,data=Hitters,nvmax=19, methd="backward ") > summary(regfit.bwd) Fr instance, we see that using frward stepwise selectin, the best nevariable mdel cntains nly CRBI, and the best tw-variable mdel additinally includes Hits. Fr this data, the best ne-variable thrugh sixvariable mdels are each identical fr best subset and frward selectin. Hwever, the best seven-variable mdels identified by frward stepwise selectin, backward stepwise selectin, and best subset selectin are different. > cef(regfit.full,7) (Intercept ) Hits Walks CAtBat CHits CHmRun DivisinW PutOuts > cef(regfit.fwd,7) (Intercept ) AtBat Hits Walks CRBI CWalks DivisinW PutOuts > cef(regfit.bwd,7) (Intercept ) AtBat Hits Walks CRuns CWalks DivisinW PutOuts
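A quick way to compare the three seven-variable models shown above is to look only at which variables each one includes (a small convenience sketch, not part of the original lab):

> sapply(list(full=regfit.full, fwd=regfit.fwd, bwd=regfit.bwd),
+   function(fit) names(coef(fit,7)))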

263 Linear Mdel Selectin and Regularizatin Chsing Amng Mdels Using the Validatin Set Apprach and Crss-Validatin We just saw that it is pssible t chse amng a set f mdels f different sizes using C p,bic,andadjustedr 2. We will nw cnsider hw t d this using the validatin set and crss-validatin appraches. In rder fr these appraches t yield accurate estimates f the test errr, we must use nly the training bservatins t perfrm all aspects f mdel-fitting including variable selectin. Therefre, the determinatin f which mdel f a given size is best must be made using nly the training bservatins. This pint is subtle but imprtant. If the full data set is used t perfrm the best subset selectin step, the validatin set errrs and crss-validatin errrs that we btain will nt be accurate estimates f the test errr. In rder t use the validatin set apprach, we begin by splitting the bservatins int a training set and a test set. We d this by creating a randm vectr, train, f elements equal t TRUE if the crrespnding bservatin is in the training set, and FALSE therwise. The vectr test has a TRUE if the bservatin is in the test set, and a FALSE therwise. Nte the! in the cmmand t create test causes TRUEs tbeswitchedtfalses and vice versa. We als set a randm seed s that the user will btain the same training set/test set split. > set.seed(1) > train=sample(c(true,false), nrw(hitters),rep=true) > test=(!train) Nw, we apply regsubsets() t the training set in rder t perfrm best subset selectin. > regfit.best=regsubsets (Salary.,data=Hitters[train,], nvmax=19) Ntice that we subset the Hitters data frame directly in the call in rder t access nly the training subset f the data, using the expressin Hitters[train,]. We nw cmpute the validatin set errr fr the best mdel f each mdel size. We first make a mdel matrix frm the test data. test.mat=mdel.matrix(salary.,data=hitters[test,]) The mdel.matrix() functin is used in many regressin packages fr build- mdel. ing an X matrix frm data. Nw we run a lp, and fr each size i, we matrix() extract the cefficients frm regfit.best fr the best mdel f that size, multiply them int the apprpriate clumns f the test mdel matrix t frm the predictins, and cmpute the test MSE. > val.errrs=rep(na,19) > fr(i in 1:19){ + cefi=cef(regfit.best,id=i)

264 6.5 Lab 1: Subset Selectin Methds pred=test.mat[,names(cefi)]%*%cefi + val.errrs[i]=mean((hitters$salary[test]-pred)^2) } We find that the best mdel is the ne that cntains ten variables. > val.errrs [1] [9] [17] > which.min(val.errrs) [1] 10 > cef(regfit.best,10) (Intercept ) AtBat Hits Walks CAtBat CHits CHmRun CWalks LeagueN DivisinW PutOuts This was a little tedius, partly because there is n predict() methd fr regsubsets(). Since we will be using this functin again, we can capture ur steps abve and write ur wn predict methd. > predict.regsubsets =functin(bject,newdata,id,...){ + frm=as.frmula(bject$call [[2]]) + mat=mdel.matrix(frm,newdata) + cefi=cef(bject,id=id) + xvars=names(cefi) + mat[,xvars]%*%cefi + } Our functin pretty much mimics what we did abve. The nly cmplex part is hw we extracted the frmula used in the call t regsubsets(). We demnstrate hw we use this functin belw, when we d crss-validatin. Finally, we perfrm best subset selectin n the full data set, and select the best ten-variable mdel. It is imprtant that we make use f the full data set in rder t btain mre accurate cefficient estimates. Nte that we perfrm best subset selectin n the full data set and select the best tenvariable mdel, rather than simply using the variables that were btained frm the training set, because the best ten-variable mdel n the full data set may differ frm the crrespnding mdel n the training set. > regfit.best=regsubsets (Salary.,data=Hitters,nvmax=19) > cef(regfit.best,10) (Intercept ) AtBat Hits Walks CAtBat CRuns CRBI CWalks DivisinW PutOuts Assists 0.283

265 Linear Mdel Selectin and Regularizatin In fact, we see that the best ten-variable mdel n the full data set has a different set f variables than the best ten-variable mdel n the training set. We nw try t chse amng the mdels f different sizes using crssvalidatin. This apprach is smewhat invlved, as we must perfrm best subset selectin within each f the k training sets. Despite this, we see that with its clever subsetting syntax, R makes this jb quite easy. First, we create a vectr that allcates each bservatin t ne f k = 10 flds, and we create a matrix in which we will stre the results. > k=10 > set.seed(1) > flds=sample(1:k,nrw(hitters),replace=true) > cv.errrs=matrix(na,k,19, dimnames=list(null, paste(1:19))) Nw we write a fr lp that perfrms crss-validatin. In the jth fld, the elements f flds that equal j are in the test set, and the remainder are in the training set. We make ur predictins fr each mdel size (using ur new predict() methd), cmpute the test errrs n the apprpriate subset, and stre them in the apprpriate slt in the matrix cv.errrs. > fr(j in 1:k){ + best.fit=regsubsets (Salary.,data=Hitters[flds!=j,], nvmax=19) + fr(i in 1:19){ + pred=predict(best.fit,hitters[flds==j,],id=i) + cv.errrs[j,i]=mean( (Hitters$Salary[flds==j]-pred)^2) + } + } This has given us a matrix, f which the (i, j)th element crrespnds t the test MSE fr the ith crss-validatin fld fr the best j-variable mdel. We use the apply() functin t average ver the clumns f this apply() matrix in rder t btain a vectr fr which the jth element is the crssvalidatin errr fr the j-variable mdel. > mean.cv.errrs=apply(cv.errrs,2,mean) > mean.cv.errrs [1] [9] [17] > par(mfrw=c(1,1)) > plt(mean.cv.errrs,type= b ) We see that crss-validatin selects an 11-variable mdel. We nw perfrm best subset selectin n the full data set in rder t btain the 11-variable mdel. > reg.best=regsubsets (Salary.,data=Hitters, nvmax=19) > cef(reg.best,11) (Intercept ) AtBat Hits Walks CAtBat

266 6.6 Lab 2: Ridge Regressin and the Lass 251 CRuns CRBI CWalks LeagueN DivisinW PutOuts Assists Lab 2: Ridge Regressin and the Lass We will use the glmnet package in rder t perfrm ridge regressin and the lass. The main functin in this package is glmnet(), whichcanbeused glmnet() t fit ridge regressin mdels, lass mdels, and mre. This functin has slightly different syntax frm ther mdel-fitting functins that we have encuntered thus far in this bk. In particular, we must pass in an x matrix as well as a y vectr, and we d nt use the y x syntax. We will nw perfrm ridge regressin and the lass in rder t predict Salary n the Hitters data. Befre prceeding ensure that the missing values have been remved frm the data, as described in Sectin 6.5. > x=mdel.matrix(salary.,hitters)[,-1] > y=hitters$salary The mdel.matrix() functin is particularly useful fr creating x; nt nly des it prduce a matrix crrespnding t the 19 predictrs but it als autmatically transfrms any qualitative variables int dummy variables. The latter prperty is imprtant because glmnet() can nly take numerical, quantitative inputs Ridge Regressin The glmnet() functin has an alpha argument that determines what type f mdel is fit. If alpha=0 then a ridge regressin mdel is fit, and if alpha=1 then a lass mdel is fit. We first fit a ridge regressin mdel. > library(glmnet) > grid=10^seq(10,-2,length=100) > ridge.md=glmnet(x,y,alpha=0,lambda=grid) By default the glmnet() functin perfrms ridge regressin fr an autmatically selected range f λ values. Hwever, here we have chsen t implement the functin ver a grid f values ranging frm λ =10 10 t λ =10 2,essentially cvering the full range f scenaris frm the null mdel cntaining nly the intercept, t the least squares fit. As we will see, we can als cmpute mdel fits fr a particular value f λ that is nt ne f the riginal grid values. Nte that by default, the glmnet() functin standardizes the variables s that they are n the same scale. T turn ff this default setting, use the argument standardize=false. Assciated with each value f λ is a vectr f ridge regressin cefficients, stred in a matrix that can be accessed by cef(). In this case, it is a

267 Linear Mdel Selectin and Regularizatin matrix, with 20 rws (ne fr each predictr, plus an intercept) and 100 clumns (ne fr each value f λ). > dim(cef(ridge.md)) [1] We expect the cefficient estimates t be much smaller, in terms f l 2 nrm, when a large value f λ is used, as cmpared t when a small value f λ is used. These are the cefficients when λ =11,498, alng with their l 2 nrm: > ridge.md$lambda [50] [1] > cef(ridge.md)[,50] (Intercept ) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisinW PutOuts Assists Errrs NewLeagueN > sqrt(sum(cef(ridge.md)[-1,50]^2)) [1] 6.36 In cntrast, here are the cefficients when λ = 705, alng with their l 2 nrm. Nte the much larger l 2 nrm f the cefficients assciated with this smaller value f λ. > ridge.md$lambda [60] [1] 705 > cef(ridge.md)[,60] (Intercept ) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisinW PutOuts Assists Errrs NewLeagueN > sqrt(sum(cef(ridge.md)[-1,60]^2)) [1] 57.1 We can use the predict() functin fr a number f purpses. Fr instance, we can btain the ridge regressin cefficients fr a new value f λ, say 50: > predict(ridge.md,s=50,type="cefficients")[1:20,] (Intercept ) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisinW PutOuts Assists Errrs NewLeagueN
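One optional way to see this shrinkage pattern all at once, rather than inspecting individual columns of coef(), is to plot the entire coefficient path. This is a minimal sketch, not part of the original lab; it assumes the ridge.mod object fit above.

> # plot.glmnet() draws one curve per predictor; with xvar="lambda" the x-axis is
> # log(lambda), so the coefficients can be seen shrinking toward zero as lambda grows.
> plot(ridge.mod,xvar="lambda",label=TRUE)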

268 6.6 Lab 2: Ridge Regressin and the Lass 253 We nw split the samples int a training set and a test set in rder t estimate the test errr f ridge regressin and the lass. There are tw cmmn ways t randmly split a data set. The first is t prduce a randm vectr f TRUE, FALSE elements and select the bservatins crrespnding t TRUE fr the training data. The secnd is t randmly chse a subset f numbers between 1 and n; these can then be used as the indices fr the training bservatins. The tw appraches wrk equally well. We used the frmer methd in Sectin Here we demnstrate the latter apprach. We first set a randm seed s that the results btained will be reprducible. > set.seed(1) > train=sample(1:nrw(x), nrw(x)/2) > test=(-train) > y.test=y[test] Next we fit a ridge regressin mdel n the training set, and evaluate its MSE n the test set, using λ = 4. Nte the use f the predict() functin again. This time we get predictins fr a test set, by replacing type="cefficients" with the newx argument. > ridge.md=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh=1e-12) > ridge.pred=predict(ridge.md,s=4,newx=x[test,]) > mean((ridge.pred-y.test)^2) [1] The test MSE is Nte that if we had instead simply fit a mdel with just an intercept, we wuld have predicted each test bservatin using the mean f the training bservatins. In that case, we culd cmpute the test set MSE like this: > mean((mean(y[train])-y.test)^2) [1] We culd als get the same result by fitting a ridge regressin mdel with a very large value f λ. Ntethat1e10 means > ridge.pred=predict(ridge.md,s=1e10,newx=x[test,]) > mean((ridge.pred-y.test)^2) [1] S fitting a ridge regressin mdel with λ = 4 leads t a much lwer test MSE than fitting a mdel with just an intercept. We nw check whether there is any benefit t perfrming ridge regressin with λ = 4 instead f just perfrming least squares regressin. Recall that least squares is simply ridge regressin with λ = In rder fr glmnet() t yield the exact least squares cefficients when λ =0, we use the argument exact=t when calling the predict() functin. Otherwise, the predict() functin will interplate ver the grid f λ values used in fitting the

269 Linear Mdel Selectin and Regularizatin > ridge.pred=predict(ridge.md,s=0,newx=x[test,],exact=t) > mean((ridge.pred-y.test)^2) [1] > lm(y x, subset=train) > predict(ridge.md,s=0,exact=t,type="cefficients")[1:20,] In general, if we want t fit a (unpenalized) least squares mdel, then we shuld use the lm() functin, since that functin prvides mre useful utputs, such as standard errrs and p-values fr the cefficients. In general, instead f arbitrarily chsing λ = 4, it wuld be better t use crss-validatin t chse the tuning parameter λ. We can d this using the built-in crss-validatin functin, cv.glmnet(). By default, the functin cv.glmnet() perfrms ten-fld crss-validatin, thugh this can be changed using the argument nflds. Nte that we set a randm seed first s ur results will be reprducible, since the chice f the crss-validatin flds is randm. > set.seed(1) > cv.ut=cv.glmnet(x[train,],y[train],alpha=0) > plt(cv.ut) > bestlam=cv.ut$lambda.min > bestlam [1] 212 Therefre, we see that the value f λ that results in the smallest crssvalidatin errr is 212. What is the test MSE assciated with this value f λ? > ridge.pred=predict(ridge.md,s=bestlam,newx=x[test,]) > mean((ridge.pred-y.test)^2) [1] This represents a further imprvement ver the test MSE that we gt using λ = 4. Finally, we refit ur ridge regressin mdel n the full data set, using the value f λ chsen by crss-validatin, and examine the cefficient estimates. > ut=glmnet(x,y,alpha=0) > predict(ut,type="cefficients",s=bestlam)[1:20,] (Intercept ) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisinW PutOuts Assists Errrs NewLeagueN glmnet() mdel, yielding apprximate results. When we use exact=t, there remains a slight discrepancy in the third decimal place between the utput f glmnet() when λ = 0 and the utput f lm(); this is due t numerical apprximatin n the part f glmnet().

270 6.6 Lab 2: Ridge Regressin and the Lass 255 As expected, nne f the cefficients are zer ridge regressin des nt perfrm variable selectin! The Lass We saw that ridge regressin with a wise chice f λ can utperfrm least squares as well as the null mdel n the Hitters data set. We nw ask whether the lass can yield either a mre accurate r a mre interpretable mdel than ridge regressin. In rder t fit a lass mdel, we nce again use the glmnet() functin; hwever, this time we use the argument alpha=1. Other than that change, we prceed just as we did in fitting a ridge mdel. > lass.md=glmnet(x[train,],y[train],alpha=1,lambda=grid) > plt(lass.md) We can see frm the cefficient plt that depending n the chice f tuning parameter, sme f the cefficients will be exactly equal t zer. We nw perfrm crss-validatin and cmpute the assciated test errr. > set.seed(1) > cv.ut=cv.glmnet(x[train,],y[train],alpha=1) > plt(cv.ut) > bestlam=cv.ut$lambda.min > lass.pred=predict(lass.md,s=bestlam,newx=x[test,]) > mean((lass.pred-y.test)^2) [1] This is substantially lwer than the test set MSE f the null mdel and f least squares, and very similar t the test MSE f ridge regressin with λ chsen by crss-validatin. Hwever, the lass has a substantial advantage ver ridge regressin in that the resulting cefficient estimates are sparse. Here we see that 12 f the 19 cefficient estimates are exactly zer. S the lass mdel with λ chsen by crss-validatin cntains nly seven variables. > ut=glmnet(x,y,alpha=1,lambda=grid) > lass.cef=predict(ut,type="cefficients",s=bestlam)[1:20,] > lass.cef (Intercept ) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks LeagueN DivisinW PutOuts Assists Errrs NewLeagueN > lass.cef[lass.cef!=0] (Intercept ) Hits Walks CRuns CRBI LeagueN DivisinW PutOuts
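A quick way to verify the sparsity claim above, rather than counting by eye, is to tally the zero and non-zero entries of the coefficient vector. A minimal sketch, assuming the lasso.coef vector computed above:

> sum(lasso.coef==0)      # number of coefficient estimates that are exactly zero
> sum(lasso.coef[-1]!=0)  # number of non-zero coefficients, excluding the intercept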

271 Linear Mdel Selectin and Regularizatin 6.7 Lab 3: PCR and PLS Regressin Principal Cmpnents Regressin Principal cmpnents regressin (PCR) can be perfrmed using the pcr() pcr() functin, which is part f the pls library. We nw apply PCR t the Hitters data, in rder t predict Salary. Again, ensure that the missing values have been remved frm the data, as described in Sectin 6.5. > library(pls) > set.seed(2) > pcr.fit=pcr(salary., data=hitters,scale=true, validatin ="CV") The syntax fr the pcr() functin is similar t that fr lm(), with a few additinal ptins. Setting scale=true has the effect f standardizing each predictr, using (6.6), prir t generating the principal cmpnents, s that the scale n which each variable is measured will nt have an effect. Setting validatin="cv" causes pcr() t cmpute the ten-fld crss-validatin errr fr each pssible value f M, the number f principal cmpnents used. The resulting fit can be examined using summary(). > summary(pcr.fit) Data: X dimensin : Y dimensin : Fit methd: svdpc Number f cmpnents cnsidered : 19 VALIDATION : RMSEP Crss -validated using 10 randm segments. (Intercept ) 1 cmps 2 cmps 3 cmps 4 cmps CV adjcv TRAINING : % variance explained 1 cmps 2 cmps 3 cmps 4 cmps 5 cmps 6 cmps X Salary The CV scre is prvided fr each pssible number f cmpnents, ranging frm M = 0 nwards. (We have printed the CV utput nly up t M =4.) Nte that pcr() reprts the rt mean squared errr; in rder t btain the usual MSE, we must square this quantity. Fr instance, a rt mean squared errr f crrespnds t an MSE f = 124,468. One can als plt the crss-validatin scres using the validatinplt() validatin functin. Using val.type="msep" will cause the crss-validatin MSE t be plt() pltted. > validatinplt(pcr.fit,val.type="msep")
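As an optional aside (not part of the original lab), the pls package can also report the cross-validated MSE directly, which is simply the square of the RMSEP values printed by summary(); the sketch below assumes the pcr.fit object fit above.

> MSEP(pcr.fit)   # cross-validated MSE for each number of components (RMSEP squared)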

272 6.7 Lab 3: PCR and PLS Regressin 257 We see that the smallest crss-validatin errr ccurs when M =16cmpnents are used. This is barely fewer than M = 19, which amunts t simply perfrming least squares, because when all f the cmpnents are used in PCR n dimensin reductin ccurs. Hwever, frm the plt we als see that the crss-validatin errr is rughly the same when nly ne cmpnent is included in the mdel. This suggests that a mdel that uses just a small number f cmpnents might suffice. The summary() functin als prvides the percentage f variance explained in the predictrs and in the respnse using different numbers f cmpnents. This cncept is discussed in greater detail in Chapter 10. Briefly, we can think f this as the amunt f infrmatin abut the predictrs r the respnse that is captured using M principal cmpnents. Fr example, setting M = 1 nly captures % f all the variance, r infrmatin, in the predictrs. In cntrast, using M = 6 increases the value t %. If we were t use all M = p = 19 cmpnents, this wuld increase t 100 %. We nw perfrm PCR n the training data and evaluate its test set perfrmance. > set.seed(1) > pcr.fit=pcr(salary., data=hitters,subset=train,scale=true, validatin ="CV") > validatinplt(pcr.fit,val.type="msep") Nw we find that the lwest crss-validatin errr ccurs when M =7 cmpnent are used. We cmpute the test MSE as fllws. > pcr.pred=predict(pcr.fit,x[test,],ncmp=7) > mean((pcr.pred-y.test)^2) [1] This test set MSE is cmpetitive with the results btained using ridge regressin and the lass. Hwever, as a result f the way PCR is implemented, the final mdel is mre difficult t interpret because it des nt perfrm any kind f variable selectin r even directly prduce cefficient estimates. Finally, we fit PCR n the full data set, using M =7,thenumberf cmpnents identified by crss-validatin. > pcr.fit=pcr(y x,scale=true,ncmp=7) > summary(pcr.fit) Data: X dimensin : Y dimensin : Fit methd: svdpc Number f cmpnents cnsidered : 7 TRAINING : % variance explained 1 cmps 2 cmps 3 cmps 4 cmps 5 cmps 6 cmps X y cmps X y 46.69

273 Linear Mdel Selectin and Regularizatin Partial Least Squares We implement partial least squares (PLS) using the plsr() functin, als plsr() in the pls library. The syntax is just like that f the pcr() functin. > set.seed(1) > pls.fit=plsr(salary., data=hitters,subset=train,scale=true, validatin ="CV") > summary(pls.fit) Data: X dimensin : Y dimensin : Fit methd: kernelpls Number f cmpnents cnsidered : 19 VALIDATION : RMSEP Crss -validated using 10 randm segments. (Intercept ) 1 cmps 2 cmps 3 cmps 4 cmps CV adjcv TRAINING : % variance explained 1 cmps 2 cmps 3 cmps 4 cmps 5 cmps 6 cmps X Salary > validatinplt(pls.fit,val.type="msep") The lwest crss-validatin errr ccurs when nly M =2partialleast squares directins are used. We nw evaluate the crrespnding test set MSE. > pls.pred=predict(pls.fit,x[test,],ncmp=2) > mean((pls.pred-y.test)^2) [1] The test MSE is cmparable t, but slightly higher than, the test MSE btained using ridge regressin, the lass, and PCR. Finally, we perfrm PLS using the full data set, using M =2,thenumber f cmpnents identified by crss-validatin. > pls.fit=plsr(salary., data=hitters,scale=true,ncmp=2) > summary(pls.fit) Data: X dimensin : Y dimensin : Fit methd: kernelpls Number f cmpnents cnsidered : 2 TRAINING : % variance explained 1 cmps 2 cmps X Salary Ntice that the percentage f variance in Salary that the tw-cmpnent PLS fit explains, %, is almst as much as that explained using the

274 6.8 Exercises 259 final seven-cmpnent mdel PCR fit, %. This is because PCR nly attempts t maximize the amunt f variance explained in the predictrs, while PLS searches fr directins that explain variance in bth the predictrs and the respnse. 6.8 Exercises Cnceptual 1. We perfrm best subset, frward stepwise, and backward stepwise selectin n a single data set. Fr each apprach, we btain p +1 mdels, cntaining 0, 1, 2,..., p predictrs. Explain yur answers: (a) Which f the three mdels with k predictrs has the smallest training RSS? (b) Which f the three mdels with k predictrs has the smallest test RSS? (c) True r False: i. The predictrs in the k-variable mdel identified by frward stepwise are a subset f the predictrs in the (k+1)-variable mdel identified by frward stepwise selectin. ii. The predictrs in the k-variable mdel identified by backward stepwise are a subset f the predictrs in the (k +1)- variable mdel identified by backward stepwise selectin. iii. The predictrs in the k-variable mdel identified by backward stepwise are a subset f the predictrs in the (k +1)- variable mdel identified by frward stepwise selectin. iv. The predictrs in the k-variable mdel identified by frward stepwise are a subset f the predictrs in the (k+1)-variable mdel identified by backward stepwise selectin. v. The predictrs in the k-variable mdel identified by best subset are a subset f the predictrs in the (k +1)-variable mdel identified by best subset selectin. 2. Fr parts (a) thrugh (c), indicate which f i. thrugh iv. is crrect. Justify yur answer. (a) The lass, relative t least squares, is: i. Mre flexible and hence will give imprved predictin accuracy when its increase in bias is less than its decrease in variance. ii. Mre flexible and hence will give imprved predictin accuracy when its increase in variance is less than its decrease in bias.

275 Linear Mdel Selectin and Regularizatin iii. Less flexible and hence will give imprved predictin accuracy when its increase in bias is less than its decrease in variance. iv. Less flexible and hence will give imprved predictin accuracy when its increase in variance is less than its decrease in bias. (b) Repeat (a) fr ridge regressin relative t least squares. (c) Repeat (a) fr nn-linear methds relative t least squares. 3. Suppse we estimate the regressin cefficients in a linear regressin mdel by minimizing n y i β 0 i=1 p j=1 2 β j x ij subject t p β j s j=1 fr a particular value f s. Fr parts (a) thrugh (e), indicate which f i. thrugh v. is crrect. Justify yur answer. (a) As we increase s frm 0, the training RSS will: i. Increase initially, and then eventually start decreasing in an inverted U shape. ii. Decrease initially, and then eventually start increasing in a Ushape. iii. Steadily increase. iv. Steadily decrease. v. Remain cnstant. (b) Repeat (a) fr test RSS. (c) Repeat (a) fr variance. (d) Repeat (a) fr (squared) bias. (e) Repeat (a) fr the irreducible errr. 4. Suppse we estimate the regressin cefficients in a linear regressin mdel by minimizing 2 n p p y i β 0 β j x ij + λ i=1 j=1 fr a particular value f λ. Fr parts (a) thrugh (e), indicate which f i. thrugh v. is crrect. Justify yur answer. j=1 β 2 j

276 6.8 Exercises 261 (a) As we increase λ frm 0, the training RSS will: i. Increase initially, and then eventually start decreasing in an inverted U shape. ii. Decrease initially, and then eventually start increasing in a Ushape. iii. Steadily increase. iv. Steadily decrease. v. Remain cnstant. (b) Repeat (a) fr test RSS. (c) Repeat (a) fr variance. (d) Repeat (a) fr (squared) bias. (e) Repeat (a) fr the irreducible errr. 5. It is well-knwn that ridge regressin tends t give similar cefficient values t crrelated variables, whereas the lass may give quite different cefficient values t crrelated variables. We will nw explre this prperty in a very simple setting. Suppse that n =2,p =2,x 11 = x 12, x 21 = x 22.Furthermre, suppse that y 1 + y 2 =0andx 11 + x 21 =0andx 12 + x 22 =0,sthat the estimate fr the intercept in a least squares, ridge regressin, r lass mdel is zer: ˆβ 0 =0. (a) Write ut the ridge regressin ptimizatin prblem in this setting. (b) Argue that in this setting, the ridge cefficient estimates satisfy ˆβ 1 = ˆβ 2. (c) Write ut the lass ptimizatin prblem in this setting. (d) Argue that in this setting, the lass cefficients ˆβ 1 and ˆβ 2 are nt unique in ther wrds, there are many pssible slutins t the ptimizatin prblem in (c). Describe these slutins. 6. We will nw explre (6.12) and (6.13) further. (a) Cnsider (6.12) with p = 1. Fr sme chice f y 1 and λ>0, plt (6.12) as a functin f β 1. Yur plt shuld cnfirm that (6.12) is slved by (6.14). (b) Cnsider (6.13) with p = 1. Fr sme chice f y 1 and λ>0, plt (6.13) as a functin f β 1. Yur plt shuld cnfirm that (6.13) is slved by (6.15).

277 Linear Mdel Selectin and Regularizatin 7. We will nw derive the Bayesian cnnectin t the lass and ridge regressin discussed in Sectin (a) Suppse that y i = β 0 + p j=1 x ijβ j +ɛ i where ɛ 1,...,ɛ n are independent and identically distributed frm a N(0,σ 2 ) distributin. Write ut the likelihd fr the data. (b) Assume the fllwing prir fr β: β 1,...,β p are independent and identically distributed accrding t a duble-expnential distributin with mean 0 and cmmn scale parameter b: i.e. p(β) = 1 2b exp( β/b). Write ut the psterir fr β in this setting. (c) Argue that the lass estimate is the mde fr β under this psterir distributin. (d) Nw assume the fllwing prir fr β: β 1,...,β p are independent and identically distributed accrding t a nrmal distributin with mean zer and variance c. Write ut the psterir fr β in this setting. (e) Argue that the ridge regressin estimate is bth the mde and the mean fr β under this psterir distributin. Applied 8. In this exercise, we will generate simulated data, and will then use this data t perfrm best subset selectin. (a) Use the rnrm() functin t generate a predictr X f length n = 100, as well as a nise vectr ɛ f length n = 100. (b) Generate a respnse vectr Y f length n = 100 accrding t the mdel Y = β 0 + β 1 X + β 2 X 2 + β 3 X 3 + ɛ, where β 0, β 1, β 2,andβ 3 are cnstants f yur chice. (c) Use the regsubsets() functin t perfrm best subset selectin in rder t chse the best mdel cntaining the predictrs X, X 2,...,X 10. What is the best mdel btained accrding t C p,bic,andadjustedr 2? Shw sme plts t prvide evidence fr yur answer, and reprt the cefficients f the best mdel btained. Nte yu will need t use the data.frame() functin t create a single data set cntaining bth X and Y.

278 6.8 Exercises 263 (d) Repeat (c), using frward stepwise selectin and als using backwards stepwise selectin. Hw des yur answer cmpare t the results in (c)? (e) Nw fit a lass mdel t the simulated data, again using X, X 2,...,X 10 as predictrs. Use crss-validatin t select the ptimal value f λ. Create plts f the crss-validatin errr as a functin f λ. Reprt the resulting cefficient estimates, and discuss the results btained. (f) Nw generate a respnse vectr Y accrding t the mdel Y = β 0 + β 7 X 7 + ɛ, and perfrm best subset selectin and the lass. Discuss the results btained. 9. In this exercise, we will predict the number f applicatins received using the ther variables in the Cllege data set. (a) Split the data set int a training set and a test set. (b) Fit a linear mdel using least squares n the training set, and reprt the test errr btained. (c) Fit a ridge regressin mdel n the training set, with λ chsen by crss-validatin. Reprt the test errr btained. (d) Fit a lass mdel n the training set, with λ chsen by crssvalidatin. Reprt the test errr btained, alng with the number f nn-zer cefficient estimates. (e) Fit a PCR mdel n the training set, with M chsen by crssvalidatin. Reprt the test errr btained, alng with the value f M selected by crss-validatin. (f) Fit a PLS mdel n the training set, with M chsen by crssvalidatin. Reprt the test errr btained, alng with the value f M selected by crss-validatin. (g) Cmment n the results btained. Hw accurately can we predict the number f cllege applicatins received? Is there much difference amng the test errrs resulting frm these five appraches? 10. We have seen that as the number f features used in a mdel increases, the training errr will necessarily decrease, but the test errr may nt. We will nw explre this in a simulated data set. (a) Generate a data set with p =20features,n =1,000 bservatins, and an assciated quantitative respnse vectr generated accrding t the mdel Y = Xβ + ɛ, where β has sme elements that are exactly equal t zer.

279 Linear Mdel Selectin and Regularizatin (b) Split yur data set int a training set cntaining 100 bservatins and a test set cntaining 900 bservatins. (c) Perfrm best subset selectin n the training set, and plt the training set MSE assciated with the best mdel f each size. (d) Plt the test set MSE assciated with the best mdel f each size. (e) Fr which mdel size des the test set MSE take n its minimum value? Cmment n yur results. If it takes n its minimum value fr a mdel cntaining nly an intercept r a mdel cntaining all f the features, then play arund with the way that yu are generating the data in (a) until yu cme up with a scenari in which the test set MSE is minimized fr an intermediate mdel size. (f) Hw des the mdel at which the test set MSE is minimized cmpare t the true mdel used t generate the data? Cmment n the cefficient values. p (g) Create a plt displaying j=1 (β j ˆβ j r)2 fr a range f values f r, whereˆβ j r is the jth cefficient estimate fr the best mdel cntaining r cefficients. Cmment n what yu bserve. Hw des this cmpare t the test MSE plt frm (d)? 11. We will nw try t predict per capita crime rate in the Bstn data set. (a) Try ut sme f the regressin methds explred in this chapter, such as best subset selectin, the lass, ridge regressin, and PCR. Present and discuss results fr the appraches that yu cnsider. (b) Prpse a mdel (r set f mdels) that seem t perfrm well n this data set, and justify yur answer. Make sure that yu are evaluating mdel perfrmance using validatin set errr, crssvalidatin, r sme ther reasnable alternative, as ppsed t using training errr. (c) Des yur chsen mdel invlve all f the features in the data set? Why r why nt?

280 7 Mving Beynd Linearity S far in this bk, we have mstly fcused n linear mdels. Linear mdels are relatively simple t describe and implement, and have advantages ver ther appraches in terms f interpretatin and inference. Hwever, standard linear regressin can have significant limitatins in terms f predictive pwer. This is because the linearity assumptin is almst always an apprximatin, and smetimes a pr ne. In Chapter 6 we see that we can imprve upn least squares using ridge regressin, the lass, principal cmpnents regressin, and ther techniques. In that setting, the imprvement is btained by reducing the cmplexity f the linear mdel, and hence the variance f the estimates. But we are still using a linear mdel, which can nly be imprved s far! In this chapter we relax the linearity assumptin while still attempting t maintain as much interpretability as pssible. We d this by examining very simple extensins f linear mdels like plynmial regressin and step functins, as well as mre sphisticated appraches such as splines, lcal regressin, and generalized additive mdels. Plynmial regressin extends the linear mdel by adding extra predictrs, btained by raising each f the riginal predictrs t a pwer. Fr example, a cubic regressin uses three variables, X, X 2,andX 3, as predictrs. This apprach prvides a simple way t prvide a nnlinear fit t data. Step functins cut the range f a variable int K distinct regins in rder t prduce a qualitative variable. This has the effect f fitting a piecewise cnstant functin. G. James et al., An Intrductin t Statistical Learning: with Applicatins in R, Springer Texts in Statistics, DOI / , Springer Science+Business Media New Yrk

Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.

Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.

Generalized additive models allow us to extend the methods above to deal with multiple predictors.

In Sections 7.1-7.6, we present a number of approaches for modeling the relationship between a response Y and a single predictor X in a flexible way. In Section 7.7, we show that these approaches can be seamlessly integrated in order to model a response Y as a function of several predictors X_1,...,X_p.

7.1 Polynomial Regression

Historically, the standard way to extend linear regression to settings in which the relationship between the predictors and the response is non-linear has been to replace the standard linear model

y_i = β_0 + β_1 x_i + ε_i

with a polynomial function

y_i = β_0 + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + ... + β_d x_i^d + ε_i,   (7.1)

where ε_i is the error term. This approach is known as polynomial regression, and in fact we saw an example of this method in Section 3.3.2. For large enough degree d, a polynomial regression allows us to produce an extremely non-linear curve. Notice that the coefficients in (7.1) can be easily estimated using least squares linear regression because this is just a standard linear model with predictors x_i, x_i^2, x_i^3, ..., x_i^d. Generally speaking, it is unusual to use d greater than 3 or 4 because for large values of d, the polynomial curve can become overly flexible and can take on some very strange shapes. This is especially true near the boundary of the X variable.
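As a concrete illustration (a minimal sketch, not part of the original text; the code assumes the Wage data from the ISLR package used throughout this chapter), the degree-4 polynomial fit discussed next can be obtained directly with lm() and the poly() helper:

> library(ISLR)
> fit=lm(wage~poly(age,4),data=Wage)   # degree-4 polynomial regression of wage on age
> coef(summary(fit))
> # poly() uses orthogonal polynomials by default; poly(age,4,raw=TRUE) fits the raw
> # powers in (7.1) and yields the same fitted values, just different coefficients.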

FIGURE 7.1. The Wage data. Left: The solid blue curve is a degree-4 polynomial of wage (in thousands of dollars) as a function of age, fit by least squares. The dotted curves indicate an estimated 95 % confidence interval. Right: We model the binary event wage>250 using logistic regression, again with a degree-4 polynomial. The fitted posterior probability of wage exceeding $250,000 is shown in blue, along with an estimated 95 % confidence interval.

The left-hand panel in Figure 7.1 is a plot of wage against age for the Wage data set, which contains income and demographic information for males who reside in the central Atlantic region of the United States. We see the results of fitting a degree-4 polynomial using least squares (solid blue curve). Even though this is a linear regression model like any other, the individual coefficients are not of particular interest. Instead, we look at the entire fitted function across a grid of 62 values for age from 18 to 80 in order to understand the relationship between age and wage.

In Figure 7.1, a pair of dotted curves accompanies the fit; these are (2×) standard error curves. Let's see how these arise. Suppose we have computed the fit at a particular value of age, x_0:

ˆf(x_0) = ˆβ_0 + ˆβ_1 x_0 + ˆβ_2 x_0^2 + ˆβ_3 x_0^3 + ˆβ_4 x_0^4.   (7.2)

What is the variance of the fit, i.e. Var ˆf(x_0)? Least squares returns variance estimates for each of the fitted coefficients ˆβ_j, as well as the covariances between pairs of coefficient estimates. We can use these to compute the estimated variance of ˆf(x_0).¹ The estimated pointwise standard error of ˆf(x_0) is the square root of this variance. This computation is repeated

¹ If Ĉ is the 5×5 covariance matrix of the ˆβ_j, and if l_0^T = (1, x_0, x_0^2, x_0^3, x_0^4), then Var[ˆf(x_0)] = l_0^T Ĉ l_0.

at each reference point x_0, and we plot the fitted curve, as well as twice the standard error on either side of the fitted curve. We plot twice the standard error because, for normally distributed error terms, this quantity corresponds to an approximate 95 % confidence interval.

It seems like the wages in Figure 7.1 are from two distinct populations: there appears to be a high earners group earning more than $250,000 per annum, as well as a low earners group. We can treat wage as a binary variable by splitting it into these two groups. Logistic regression can then be used to predict this binary response, using polynomial functions of age as predictors. In other words, we fit the model

Pr(y_i > 250 | x_i) = exp(β_0 + β_1 x_i + β_2 x_i^2 + ... + β_d x_i^d) / (1 + exp(β_0 + β_1 x_i + β_2 x_i^2 + ... + β_d x_i^d)).   (7.3)

The result is shown in the right-hand panel of Figure 7.1. The gray marks on the top and bottom of the panel indicate the ages of the high earners and the low earners. The solid blue curve indicates the fitted probabilities of being a high earner, as a function of age. The estimated 95 % confidence interval is shown as well. We see that here the confidence intervals are fairly wide, especially on the right-hand side. Although the sample size for this data set is substantial (n = 3,000), there are only 79 high earners, which results in a high variance in the estimated coefficients and consequently wide confidence intervals.

7.2 Step Functions

Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X. We can instead use step functions in order to avoid imposing such a global structure. Here we break the range of X into bins, and fit a different constant in each bin. This amounts to converting a continuous variable into an ordered categorical variable.

In greater detail, we create cutpoints c_1, c_2, ..., c_K in the range of X, and then construct K + 1 new variables

C_0(X) = I(X < c_1),
C_1(X) = I(c_1 ≤ X < c_2),
C_2(X) = I(c_2 ≤ X < c_3),
...
C_{K-1}(X) = I(c_{K-1} ≤ X < c_K),
C_K(X) = I(c_K ≤ X),   (7.4)

where I(·) is an indicator function that returns a 1 if the condition is true, and returns a 0 otherwise. For example, I(c_K ≤ X) equals 1 if c_K ≤ X, and

284 7.2 Step Functins 269 Piecewise Cnstant Wage Pr(Wage>250 Age) Age Age FIGURE 7.2. The Wage data. Left: The slid curve displays the fitted value frm a least squares regressin f wage (in thusands f dllars) using step functins f age. The dtted curves indicate an estimated 95 % cnfidence interval. Right: We mdel the binary event wage>250 using lgistic regressin, again using step functins f age. The fitted psterir prbability f wage exceeding $250,000 is shwn, alng with an estimated 95 % cnfidence interval. equals 0 therwise. These are smetimes called dummy variables. Ntice that fr any value f X, C 0 (X)+C 1 (X)+...+ C K (X) =1,sinceX must be in exactly ne f the K + 1 intervals. We then use least squares t fit a linear mdel using C 1 (X),C 2 (X),...,C K (X) aspredictrs 2 : y i = β 0 + β 1 C 1 (x i )+β 2 C 2 (x i )+...+ β K C K (x i )+ɛ i. (7.5) Fr a given value f X, at mst ne f C 1,C 2,...,C K can be nn-zer. Nte that when X<c 1, all f the predictrs in (7.5) are zer, s β 0 can be interpreted as the mean value f Y fr X<c 1. By cmparisn, (7.5) predicts a respnse f β 0 +β j fr c j X<c j+1,sβ j represents the average increase in the respnse fr X in c j X<c j+1 relative t X<c 1. An example f fitting step functins t the Wage data frm Figure 7.1 is shwn in the left-hand panel f Figure 7.2. We als fit the lgistic regressin mdel 2 We exclude C 0 (X) as a predictr in (7.5) because it is redundant with the intercept. This is similar t the fact that we need nly tw dummy variables t cde a qualitative variable with three levels, prvided that the mdel will cntain an intercept. The decisin t exclude C 0 (X) instead f sme ther C k (X) in (7.5) is arbitrary. Alternatively, we culd include C 0 (X),C 1 (X),...,C K (X), and exclude the intercept.

285 Mving Beynd Linearity Pr(y i > 250x i )= exp(β 0 + β 1 C 1 (x i )+...+ β K C K (x i )) 1+exp(β 0 + β 1 C 1 (x i )+...+ β K C K (x i )) (7.6) in rder t predict the prbability that an individual is a high earner n the basis f age. The right-hand panel f Figure 7.2 displays the fitted psterir prbabilities btained using this apprach. Unfrtunately, unless there are natural breakpints in the predictrs, piecewise-cnstant functins can miss the actin. Fr example, in the lefthand panel f Figure 7.2, the first bin clearly misses the increasing trend f wage with age. Nevertheless, step functin appraches are very ppular in bistatistics and epidemilgy, amng ther disciplines. Fr example, 5-year age grups are ften used t define the bins. 7.3 Basis Functins Plynmial and piecewise-cnstant regressin mdels are in fact special cases f a basis functin apprach. The idea is t have at hand a fam- basis ily f functins r transfrmatins that can be applied t a variable X: functin b 1 (X),b 2 (X),...,b K (X). Instead f fitting a linear mdel in X, wefitthe mdel y i = β 0 + β 1 b 1 (x i )+β 2 b 2 (x i )+β 3 b 3 (x i )+...+ β K b K (x i )+ɛ i. (7.7) Nte that the basis functins b 1 ( ),b 2 ( ),...,b K ( ) are fixed and knwn. (In ther wrds, we chse the functins ahead f time.) Fr plynmial regressin, the basis functins are b j (x i )=x j i, and fr piecewise cnstant functins they are b j (x i )=I(c j x i <c j+1 ). We can think f (7.7) as a standard linear mdel with predictrs b 1 (x i ),b 2 (x i ),...,b K (x i ). Hence, we can use least squares t estimate the unknwn regressin cefficients in (7.7). Imprtantly, this means that all f the inference tls fr linear mdels that are discussed in Chapter 3, such as standard errrs fr the cefficient estimates and F-statistics fr the mdel s verall significance, are available in this setting. Thus far we have cnsidered the use f plynmial functins and piecewise cnstant functins fr ur basis functins; hwever, many alternatives are pssible. Fr instance, we can use wavelets r Furier series t cnstruct basis functins. In the next sectin, we investigate a very cmmn chice fr a basis functin: regressin splines. regressin spline
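As a short illustration of the step-function model (7.5) (a hedged sketch, not part of the original text; it assumes the Wage data), the cut() function chooses the cutpoints and returns an ordered factor, which lm() then expands into the dummy variables C_1(X), ..., C_K(X):

> table(cut(Wage$age,4))             # cut() picks the bin boundaries; table() shows the counts per bin
> fit=lm(wage~cut(age,4),data=Wage)  # piecewise-constant fit: the intercept is the mean wage in the first bin
> coef(summary(fit))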

286 7.4 Regressin Splines Regressin Splines Nw we discuss a flexible class f basis functins that extends upn the plynmial regressin and piecewise cnstant regressin appraches that we have just seen Piecewise Plynmials Instead f fitting a high-degree plynmial ver the entire range f X, piecewise plynmial regressin invlves fitting separate lw-degree plynmials piecewise ver different regins f X. Fr example, a piecewise cubic plynmial wrks by fitting a cubic regressin mdel f the frm y i = β 0 + β 1 x i + β 2 x 2 i + β 3 x 3 i + ɛ i, (7.8) where the cefficients β 0, β 1, β 2,andβ 3 differ in different parts f the range f X. The pints where the cefficients change are called knts. Fr example, a piecewise cubic with n knts is just a standard cubic plynmial, as in (7.1) with d = 3. A piecewise cubic plynmial with a single knt at a pint c takes the frm { β 01 + β 11 x i + β 21 x 2 i y i = + β 31x 3 i + ɛ i if x i <c; β 02 + β 12 x i + β 22 x 2 i + β 32x 3 i + ɛ i if x i c. In ther wrds, we fit tw different plynmial functins t the data, ne n the subset f the bservatins with x i <c, and ne n the subset f the bservatins with x i c. The first plynmial functin has cefficients β 01,β 11,β 21,β 31, and the secnd has cefficients β 02,β 12,β 22,β 32.Eachf these plynmial functins can be fit using least squares applied t simple functins f the riginal predictr. Using mre knts leads t a mre flexible piecewise plynmial. In general, if we place K different knts thrughut the range f X, thenwe willendupfittingk + 1 different cubic plynmials. Nte that we d nt need t use a cubic plynmial. Fr example, we can instead fit piecewise linear functins. In fact, ur piecewise cnstant functins f Sectin 7.2 are piecewise plynmials f degree 0! The tp left panel f Figure 7.3 shws a piecewise cubic plynmial fit t a subset f the Wage data, with a single knt at age=50. We immediately see a prblem: the functin is discntinuus and lks ridiculus! Since each plynmial has fur parameters, we are using a ttal f eight degrees f freedm in fitting this piecewise plynmial mdel Cnstraints and Splines The tp left panel f Figure 7.3 lks wrng because the fitted curve is just t flexible. T remedy this prblem, we can fit a piecewise plynmial plynmial regressin knt degrees f freedm

287 Mving Beynd Linearity Piecewise Cubic Cntinuus Piecewise Cubic Wage Wage Age Age Cubic Spline Linear Spline Wage Wage Age Age FIGURE 7.3. Varius piecewise plynmials are fit t a subset f the Wage data, with a knt at age=50. Tp Left: The cubic plynmials are uncnstrained. Tp Right: The cubic plynmials are cnstrained t be cntinuus at age=50. Bttm Left: The cubic plynmials are cnstrained t be cntinuus, and t have cntinuus first and secnd derivatives. Bttm Right: A linear spline is shwn, which is cnstrained t be cntinuus. under the cnstraint that the fitted curve must be cntinuus. In ther wrds, there cannt be a jump when age=50. The tp right plt in Figure 7.3 shws the resulting fit. This lks better than the tp left plt, but the V- shaped jin lks unnatural. In the lwer left plt, we have added tw additinal cnstraints: nw bth the first and secnd derivatives f the piecewise plynmials are cntinuus derivative at age=50. In ther wrds, we are requiring that the piecewise plynmial be nt nly cntinuus when age=50, but als very smth. Each cnstraint that we impse n the piecewise cubic plynmials effectively frees up ne degree f freedm, by reducing the cmplexity f the resulting piecewise plynmial fit. S in the tp left plt, we are using eight degrees f freedm, but in the bttm left plt we impsed three cnstraints (cntinuity, cntinuity f the first derivative, and cntinuity f the secnd derivative) and s are left with five degrees f freedm. The curve in the bttm left

plot is called a cubic spline.³ In general, a cubic spline with K knots uses a total of 4 + K degrees of freedom.

In Figure 7.3, the lower right plot is a linear spline, which is continuous at age=50. The general definition of a degree-d spline is that it is a piecewise degree-d polynomial, with continuity in derivatives up to degree d−1 at each knot. Therefore, a linear spline is obtained by fitting a line in each region of the predictor space defined by the knots, requiring continuity at each knot. In Figure 7.3, there is a single knot at age=50. Of course, we could add more knots, and impose continuity at each.

The Spline Basis Representation

The regression splines that we just saw in the previous section may have seemed somewhat complex: how can we fit a piecewise degree-d polynomial under the constraint that it (and possibly its first d−1 derivatives) be continuous? It turns out that we can use the basis model (7.7) to represent a regression spline. A cubic spline with K knots can be modeled as

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + ... + β_{K+3} b_{K+3}(x_i) + ε_i,   (7.9)

for an appropriate choice of basis functions b_1, b_2, ..., b_{K+3}. The model (7.9) can then be fit using least squares.

Just as there were several ways to represent polynomials, there are also many equivalent ways to represent cubic splines using different choices of basis functions in (7.9). The most direct way to represent a cubic spline using (7.9) is to start off with a basis for a cubic polynomial, namely x, x^2, x^3, and then add one truncated power basis function per knot. A truncated power basis function is defined as

h(x, ξ) = (x − ξ)^3_+ = (x − ξ)^3 if x > ξ, and 0 otherwise,   (7.10)

where ξ is the knot. One can show that adding a term of the form β_4 h(x, ξ) to the model (7.8) for a cubic polynomial will lead to a discontinuity in only the third derivative at ξ; the function will remain continuous, with continuous first and second derivatives, at each of the knots.

In other words, in order to fit a cubic spline to a data set with K knots, we perform least squares regression with an intercept and 3 + K predictors, of the form X, X^2, X^3, h(X, ξ_1), h(X, ξ_2), ..., h(X, ξ_K), where ξ_1, ..., ξ_K are the knots. This amounts to estimating a total of K + 4 regression coefficients; for this reason, fitting a cubic spline with K knots uses K + 4 degrees of freedom.

³ Cubic splines are popular because most human eyes cannot detect the discontinuity at the knots.
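In practice the truncated power basis is rarely constructed by hand. As a hedged sketch (assuming the splines package and the Wage data; the specific knot locations are illustrative), the bs() function generates the basis matrix in (7.9) for a cubic spline with a chosen set of knots:

> library(splines)
> dim(bs(Wage$age,knots=c(25,40,60)))              # K=3 knots give a basis with K+3=6 columns
> fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) # with the intercept, the fit uses K+4=7 degrees of freedom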

289 Mving Beynd Linearity Wage Natural Cubic Spline Cubic Spline Age FIGURE 7.4. A cubic spline and a natural cubic spline, with three knts, fit t asubsetfthewage data. Unfrtunately, splines can have high variance at the uter range f the predictrs that is, when X takes n either a very small r very large value. Figure 7.4 shws a fit t the Wage data with three knts. We see that the cnfidence bands in the bundary regin appear fairly wild. A natural spline is a regressin spline with additinal bundary cnstraints: the natural functin is required t be linear at the bundary (in the regin where X is spline smaller than the smallest knt, r larger than the largest knt). This additinal cnstraint means that natural splines generally prduce mre stable estimates at the bundaries. In Figure 7.4, a natural cubic spline is als displayed as a red line. Nte that the crrespnding cnfidence intervals are narrwer Chsing the Number and Lcatins f the Knts When we fit a spline, where shuld we place the knts? The regressin spline is mst flexible in regins that cntain a lt f knts, because in thse regins the plynmial cefficients can change rapidly. Hence, ne ptin is t place mre knts in places where we feel the functin might vary mst rapidly, and t place fewer knts where it seems mre stable. While this ptin can wrk well, in practice it is cmmn t place knts in a unifrm fashin. One way t d this is t specify the desired degrees f freedm, and then have the sftware autmatically place the crrespnding number f knts at unifrm quantiles f the data. Figure 7.5 shws an example n the Wage data. As in Figure 7.4, we have fit a natural cubic spline with three knts, except this time the knt lcatins were chsen autmatically as the 25th, 50th, and 75th percentiles

290 7.4 Regressin Splines 275 Natural Cubic Spline Wage Pr(Wage>250 Age) Age Age FIGURE 7.5. A natural cubic spline functin with fur degrees f freedm is fit t the Wage data. Left: Asplineisfittwage (in thusands f dllars) as a functin f age. Right: Lgistic regressin is used t mdel the binary event wage>250 as a functin f age. The fitted psterir prbability f wage exceeding $250,000 is shwn. f age. This was specified by requesting fur degrees f freedm. The argument by which fur degrees f freedm leads t three interir knts is smewhat technical. 4 Hw many knts shuld we use, r equivalently hw many degrees f freedm shuld ur spline cntain? One ptin is t try ut different numbers f knts and see which prduces the best lking curve. A smewhat mre bjective apprach is t use crss-validatin, as discussed in Chapters 5 and 6. With this methd, we remve a prtin f the data (say 10 %), fit a spline with a certain number f knts t the remaining data, and then use the spline t make predictins fr the held-ut prtin. We repeat this prcess multiple times until each bservatin has been left ut nce, and then cmpute the verall crss-validated RSS. This prcedure can be repeated fr different numbers f knts K. Then the value f K giving the smallest RSS is chsen. 4 There are actually five knts, including the tw bundary knts. A cubic spline with five knts wuld have nine degrees f freedm. But natural cubic splines have tw additinal natural cnstraints at each bundary t enfrce linearity, resulting in 9 4 = 5 degrees f freedm. Since this includes a cnstant, which is absrbed in the intercept, we cunt it as fur degrees f freedm.
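A brief sketch of this uniform-quantile strategy (not part of the original text; it assumes the splines package and the Wage data): supplying only the degrees of freedom to ns() lets the software place the interior knots at quantiles of age.

> library(splines)
> attr(ns(Wage$age,df=4),"knots")    # df=4 places three interior knots, at the 25th, 50th and 75th percentiles of age
> fit=lm(wage~ns(age,df=4),data=Wage)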

291 Mving Beynd Linearity Mean Squared Errr Mean Squared Errr Degrees f Freedm f Natural Spline Degrees f Freedm f Cubic Spline FIGURE 7.6. Ten-fld crss-validated mean squared errrs fr selecting the degrees f freedm when fitting splines t the Wage data. The respnse is wage and the predictr age. Left: A natural cubic spline. Right: Acubicspline. Figure 7.6 shws ten-fld crss-validated mean squared errrs fr splines with varius degrees f freedm fit t the Wage data. The left-hand panel crrespnds t a natural spline and the right-hand panel t a cubic spline. The tw methds prduce almst identical results, with clear evidence that a ne-degree fit (a linear regressin) is nt adequate. Bth curves flatten ut quickly, and it seems that three degrees f freedm fr the natural spline and fur degrees f freedm fr the cubic spline are quite adequate. In Sectin 7.7 we fit additive spline mdels simultaneusly n several variables at a time. This culd ptentially require the selectin f degrees f freedm fr each variable. In cases like this we typically adpt a mre pragmatic apprach and set the degrees f freedm t a fixed number, say fur, fr all terms Cmparisn t Plynmial Regressin Regressin splines ften give superir results t plynmial regressin. This is because unlike plynmials, which must use a high degree (expnent in the highest mnmial term, e.g. X 15 ) t prduce flexible fits, splines intrduce flexibility by increasing the number f knts but keeping the degree fixed. Generally, this apprach prduces mre stable estimates. Splines als allw us t place mre knts, and hence flexibility, ver regins where the functin f seems t be changing rapidly, and fewer knts where f appears mre stable. Figure 7.7 cmpares a natural cubic spline with 15 degrees f freedm t a degree-15 plynmial n the Wage data set. The extra flexibility in the plynmial prduces undesirable results at the bundaries, while the natural cubic spline still prvides a reasnable fit t the data.
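The comparison in Figure 7.7 can be sketched as follows (a hedged example, not part of the original text): both fits use roughly 15 degrees of freedom, and it is the predictions near the boundaries of age that differ wildly.

> library(splines)
> fit.ns=lm(wage~ns(age,df=15),data=Wage)    # natural cubic spline with 15 degrees of freedom
> fit.poly=lm(wage~poly(age,15),data=Wage)   # degree-15 polynomial
> age.grid=seq(min(Wage$age),max(Wage$age))
> preds.ns=predict(fit.ns,newdata=list(age=age.grid))
> preds.poly=predict(fit.poly,newdata=list(age=age.grid))  # compare the two curves, especially at the ends of the age range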

FIGURE 7.7. On the Wage data set, a natural cubic spline with 15 degrees of freedom is compared to a degree-15 polynomial. Polynomials can show wild behavior, especially near the tails.

7.5 Smoothing Splines

An Overview of Smoothing Splines

In the last section we discussed regression splines, which we create by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients. We now introduce a somewhat different approach that also produces a spline.

In fitting a smooth curve to a set of data, what we really want to do is find some function, say g(x), that fits the observed data well: that is, we want RSS = Σ_{i=1}^{n} (y_i − g(x_i))^2 to be small. However, there is a problem with this approach. If we don't put any constraints on g(x_i), then we can always make RSS zero simply by choosing g such that it interpolates all of the y_i. Such a function would woefully overfit the data; it would be far too flexible. What we really want is a function g that makes RSS small, but that is also smooth.

How might we ensure that g is smooth? There are a number of ways to do this. A natural approach is to find the function g that minimizes

Σ_{i=1}^{n} (y_i − g(x_i))^2 + λ ∫ g''(t)^2 dt   (7.11)

where λ is a nonnegative tuning parameter. The function g that minimizes (7.11) is known as a smoothing spline.

What does (7.11) mean? Equation 7.11 takes the "Loss+Penalty" formulation that we encounter in the context of ridge regression and the lasso in Chapter 6. The term Σ_{i=1}^{n} (y_i − g(x_i))^2 is a loss function that encourages g to fit the data well, and the term λ ∫ g''(t)^2 dt is a penalty term

293 Mving Beynd Linearity that penalizes the variability in g. The ntatin g (t) indicates the secnd derivative f the functin g. The first derivative g (t) measures the slpe f a functin at t, and the secnd derivative crrespnds t the amunt by which the slpe is changing. Hence, bradly speaking, the secnd derivative f a functin is a measure f its rughness: it is large in abslute value if g(t) isverywigglyneart, and it is clse t zer therwise. (The secnd derivative f a straight line is zer; nte that a line is perfectly smth.) The ntatin is an integral, which we can think f as a summatin ver the range f t. Intherwrds, g (t) 2 dt is simply a measure f the ttal change in the functin g (t), ver its entire range. If g is very smth, then g (t) will be clse t cnstant and g (t) 2 dt will take n a small value. Cnversely, if g is jumpy and variable then g (t) will vary significantly and g (t) 2 dt will take n a large value. Therefre, in (7.11), λ g (t) 2 dt encurages g t be smth. The larger the value f λ, thesmtherg will be. When λ = 0, then the penalty term in (7.11) has n effect, and s the functin g will be very jumpy and will exactly interplate the training bservatins. When λ, g will be perfectly smth it will just be a straight line that passes as clsely as pssible t the training pints. In fact, in this case, g will be the linear least squares line, since the lss functin in (7.11) amunts t minimizing the residual sum f squares. Fr an intermediate value f λ, g will apprximate the training bservatins but will be smewhat smth. We see that λ cntrls the bias-variance trade-ff f the smthing spline. The functin g(x) that minimizes (7.11) can be shwn t have sme special prperties: it is a piecewise cubic plynmial with knts at the unique values f x 1,...,x n, and cntinuus first and secnd derivatives at each knt. Furthermre, it is linear in the regin utside f the extreme knts. In ther wrds, the functin g(x) that minimizes (7.11) is a natural cubic spline with knts at x 1,...,x n! Hwever, it is nt the same natural cubic spline that ne wuld get if ne applied the basis functin apprach described in Sectin with knts at x 1,...,x n rather, it is a shrunken versin f such a natural cubic spline, where the value f the tuning parameter λ in (7.11) cntrls the level f shrinkage Chsing the Smthing Parameter λ We have seen that a smthing spline is simply a natural cubic spline with knts at every unique value f x i. It might seem that a smthing spline will have far t many degrees f freedm, since a knt at each data pint allws a great deal f flexibility. But the tuning parameter λ cntrls the rughness f the smthing spline, and hence the effective degrees f freedm. It is pssible t shw that as λ increases frm 0 t, the effective effective degrees f freedm, which we write df λ, decrease frm n t 2. In the cntext f smthing splines, why d we discuss effective degrees f freedm instead f degrees f freedm? Usually degrees f freedm refer degrees f freedm

to the number of free parameters, such as the number of coefficients fit in a polynomial or cubic spline. Although a smoothing spline has n parameters and hence n nominal degrees of freedom, these n parameters are heavily constrained or shrunk down. Hence df_λ is a measure of the flexibility of the smoothing spline: the higher it is, the more flexible (and the lower-bias but higher-variance) the smoothing spline.

The definition of effective degrees of freedom is somewhat technical. We can write

ĝ_λ = S_λ y,   (7.12)

where ĝ_λ is the solution to (7.11) for a particular choice of λ; that is, it is an n-vector containing the fitted values of the smoothing spline at the training points x_1, ..., x_n. Equation 7.12 indicates that the vector of fitted values when applying a smoothing spline to the data can be written as an n × n matrix S_λ (for which there is a formula) times the response vector y. Then the effective degrees of freedom is defined to be

df_λ = Σ_{i=1}^{n} {S_λ}_ii,   (7.13)

the sum of the diagonal elements of the matrix S_λ.

In fitting a smoothing spline, we do not need to select the number or location of the knots; there will be a knot at each training observation, x_1, ..., x_n. Instead, we have another problem: we need to choose the value of λ. It should come as no surprise that one possible solution to this problem is cross-validation. In other words, we can find the value of λ that makes the cross-validated RSS as small as possible. It turns out that the leave-one-out cross-validation error (LOOCV) can be computed very efficiently for smoothing splines, with essentially the same cost as computing a single fit, using the following formula:

RSS_cv(λ) = Σ_{i=1}^{n} (y_i − ĝ_λ^(−i)(x_i))^2 = Σ_{i=1}^{n} [ (y_i − ĝ_λ(x_i)) / (1 − {S_λ}_ii) ]^2.

The notation ĝ_λ^(−i)(x_i) indicates the fitted value for this smoothing spline evaluated at x_i, where the fit uses all of the training observations except for the ith observation (x_i, y_i). In contrast, ĝ_λ(x_i) indicates the smoothing spline function fit to all of the training observations and evaluated at x_i. This remarkable formula says that we can compute each of these leave-one-out fits using only ĝ_λ, the original fit to all of the data!⁵ We have a very similar formula (5.2) on page 180 in Chapter 5 for least squares linear regression. Using (5.2), we can very quickly perform LOOCV for the regression splines discussed earlier in this chapter, as well as for least squares regression using arbitrary basis functions.

⁵ The exact formulas for computing ĝ(x_i) and S_λ are very technical; however, efficient algorithms are available for computing these quantities.
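As a sketch of how this works in practice (assuming the Wage data; the code is not part of the original text), smooth.spline() accepts either a target effective degrees of freedom or cv=TRUE, in which case λ is selected by the LOOCV formula above. With tied x values R may print a warning that leave-one-out cross-validation is questionable; a fit is still returned.

> fit=smooth.spline(Wage$age,Wage$wage,df=16)     # smoothing spline with 16 effective degrees of freedom
> fit2=smooth.spline(Wage$age,Wage$wage,cv=TRUE)  # lambda chosen by leave-one-out cross-validation
> fit2$df                                         # effective degrees of freedom implied by the selected lambda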

295 Mving Beynd Linearity Smthing Spline Wage Degrees f Freedm 6.8 Degrees f Freedm (LOOCV) Age FIGURE 7.8. Smthing spline fits t the Wage data. The red curve results frm specifying 16 effective degrees f freedm. Fr the blue curve, λ was fund autmatically by leave-ne-ut crss-validatin, which resulted in 6.8 effective degrees f freedm. Figure 7.8 shws the results frm fitting a smthing spline t the Wage data. The red curve indicates the fit btained frm pre-specifying that we wuld like a smthing spline with 16 effective degrees f freedm. The blue curve is the smthing spline btained when λ is chsen using LOOCV; in this case, the value f λ chsen results in 6.8 effective degrees f freedm (cmputed using (7.13)). Fr this data, there is little discernible difference between the tw smthing splines, beynd the fact that the ne with 16 degrees f freedm seems slightly wigglier. Since there is little difference between the tw fits, the smthing spline fit with 6.8 degrees f freedm is preferable, since in general simpler mdels are better unless the data prvides evidence in supprt f a mre cmplex mdel. 7.6 Lcal Regressin Lcal regressin is a different apprach fr fitting flexible nn-linear func- lcal regressin tins, which invlves cmputing the fit at a target pint x 0 using nly the nearby training bservatins. Figure 7.9 illustrates the idea n sme simulated data, with ne target pint near 0.4, and anther near the bundary at In this figure the blue line represents the functin f(x) frmwhich the data were generated, and the light range line crrespnds t the lcal regressin estimate ˆf(x). Lcal regressin is described in Algrithm 7.1. Nte that in Step 3 f Algrithm 7.1, the weights K i0 will differ fr each value f x 0. In ther wrds, in rder t btain the lcal regressin fit at a new pint, we need t fit a new weighted least squares regressin mdel by

FIGURE 7.9. Local regression illustrated on some simulated data, where the blue curve represents f(x) from which the data were generated, and the light orange curve corresponds to the local regression estimate ˆf(x). The orange colored points are local to the target point x_0, represented by the orange vertical line. The yellow bell-shape superimposed on the plot indicates weights assigned to each point, decreasing to zero with distance from the target point. The fit ˆf(x_0) at x_0 is obtained by fitting a weighted linear regression (orange line segment), and using the fitted value at x_0 (orange solid dot) as the estimate ˆf(x_0).

minimizing (7.14) for a new set of weights. Local regression is sometimes referred to as a memory-based procedure, because like nearest-neighbors, we need all the training data each time we wish to compute a prediction. We will avoid getting into the technical details of local regression here; there are books written on the topic.

In order to perform local regression, there are a number of choices to be made, such as how to define the weighting function K, and whether to fit a linear, constant, or quadratic regression in Step 3 above. (Equation 7.14 corresponds to a linear regression.) While all of these choices make some difference, the most important choice is the span s, defined in Step 1 above. The span plays a role like that of the tuning parameter λ in smoothing splines: it controls the flexibility of the non-linear fit. The smaller the value of s, the more local and wiggly will be our fit; alternatively, a very large value of s will lead to a global fit to the data using all of the training observations. We can again use cross-validation to choose s, or we can specify it directly. Figure 7.10 displays local linear regression fits on the Wage data, using two values of s: 0.7 and 0.2. As expected, the fit obtained using s = 0.7 is smoother than that obtained using s = 0.2.

The idea of local regression can be generalized in many different ways. In a setting with multiple features X_1, X_2, ..., X_p, one very useful generalization involves fitting a multiple linear regression model that is global in some variables, but local in another, such as time. Such varying coefficient models are a useful way of adapting a model to the most recently gathered data.
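A minimal sketch of the two fits in Figure 7.10 (assuming the Wage data; not part of the original text): loess() takes the span s directly as its span argument.

> fit.2=loess(wage~age,span=.2,data=Wage)        # s = 0.2: a wiggly, very local fit
> fit.7=loess(wage~age,span=.7,data=Wage)        # s = 0.7: a smoother, more global fit
> age.grid=seq(min(Wage$age),max(Wage$age))
> preds=predict(fit.7,data.frame(age=age.grid))  # fitted curve evaluated on a grid of ages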

Algorithm 7.1 Local Regression At X = x0

1. Gather the fraction s = k/n of training points whose x_i are closest to x0.
2. Assign a weight K_i0 = K(x_i, x0) to each point in this neighborhood, so that the point furthest from x0 has weight zero, and the closest has the highest weight. All but these k nearest neighbors get weight zero.
3. Fit a weighted least squares regression of the y_i on the x_i using the aforementioned weights, by finding β̂_0 and β̂_1 that minimize

\sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2.    (7.14)

4. The fitted value at x0 is given by f̂(x0) = β̂_0 + β̂_1 x0.

The idea of local regression can be generalized in many different ways. In a setting with multiple features X1, X2, ..., Xp, one very useful generalization involves fitting a multiple linear regression model that is global in some variables, but local in another, such as time. Such varying coefficient models are a useful way of adapting a model to the most recently gathered data. Local regression also generalizes very naturally when we want to fit models that are local in a pair of variables X1 and X2, rather than one. We can simply use two-dimensional neighborhoods, and fit bivariate linear regression models using the observations that are near each target point in two-dimensional space. Theoretically the same approach can be implemented in higher dimensions, using linear regressions fit to p-dimensional neighborhoods. However, local regression can perform poorly if p is much larger than about 3 or 4 because there will generally be very few training observations close to x0. Nearest-neighbors regression, discussed in Chapter 3, suffers from a similar problem in high dimensions.

7.7 Generalized Additive Models

In Sections 7.1-7.6, we present a number of approaches for flexibly predicting a response Y on the basis of a single predictor X. These approaches can be seen as extensions of simple linear regression. Here we explore the problem of flexibly predicting Y on the basis of several predictors, X1, ..., Xp. This amounts to an extension of multiple linear regression.

Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. Just like linear models, GAMs can be applied with both quantitative and qualitative responses. We first examine GAMs for a quantitative response in Section 7.7.1, and then for a qualitative response in Section 7.7.2.

FIGURE 7.10. Local linear fits to the Wage data. The span specifies the fraction of the data used to compute the fit at each target point. (The two panels correspond to spans of 0.2 and 0.7, or 16.4 and 5.3 degrees of freedom, respectively.)

7.7.1 GAMs for Regression Problems

A natural way to extend the multiple linear regression model

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i

in order to allow for non-linear relationships between each feature and the response is to replace each linear component β_j x_ij with a (smooth) non-linear function f_j(x_ij). We would then write the model as

y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \epsilon_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i.    (7.15)

This is an example of a GAM. It is called an additive model because we calculate a separate f_j for each X_j, and then add together all of their contributions.

In Sections 7.1-7.6, we discuss many methods for fitting functions to a single variable. The beauty of GAMs is that we can use these methods as building blocks for fitting an additive model. In fact, for most of the methods that we have seen so far in this chapter, this can be done fairly trivially. Take, for example, natural splines, and consider the task of fitting the model

wage = \beta_0 + f_1(year) + f_2(age) + f_3(education) + \epsilon    (7.16)

on the Wage data.
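Because the natural-spline version of (7.16) is just one big least squares fit, it can be sketched directly with lm(); this anticipates the lab at the end of the chapter, assumes the Wage data from the ISLR package, and uses four and five degrees of freedom for year and age, matching Figure 7.11 (the object name gam.ns is ours).

> library(ISLR)
> library(splines)
> # natural spline bases for year and age, dummy variables for education
> gam.ns=lm(wage~ns(year,4)+ns(age,5)+education,data=Wage)
> summary(gam.ns)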

FIGURE 7.11. For the Wage data, plots of the relationship between each feature and the response, wage, in the fitted model (7.16). Each plot displays the fitted function and pointwise standard errors. The first two functions are natural splines in year and age, with four and five degrees of freedom, respectively. The third function is a step function, fit to the qualitative variable education.

Here year and age are quantitative variables, and education is a qualitative variable with five levels: <HS, HS, <Coll, Coll, >Coll, referring to the amount of high school or college education that an individual has completed. We fit the first two functions using natural splines. We fit the third function using a separate constant for each level, via the usual dummy variable approach of Chapter 3.

Figure 7.11 shows the results of fitting the model (7.16) using least squares. This is easy to do, since as discussed in Section 7.4, natural splines can be constructed using an appropriately chosen set of basis functions. Hence the entire model is just a big regression onto spline basis variables and dummy variables, all packed into one big regression matrix.

Figure 7.11 can be easily interpreted. The left-hand panel indicates that holding age and education fixed, wage tends to increase slightly with year; this may be due to inflation. The center panel indicates that holding education and year fixed, wage tends to be highest for intermediate values of age, and lowest for the very young and very old. The right-hand panel indicates that holding year and age fixed, wage tends to increase with education: the more educated a person is, the higher their salary, on average. All of these findings are intuitive.

Figure 7.12 shows a similar triple of plots, but this time f1 and f2 are smoothing splines with four and five degrees of freedom, respectively. Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM with a natural spline, since in the case of smoothing splines, least squares cannot be used. However, standard software such as the gam() function in R can be used to fit GAMs using smoothing splines, via an approach known as backfitting. This method fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed. The beauty of this approach is that each time we update a function, we simply apply the fitting method for that variable to a partial residual. (A partial residual for X3, for example, has the form r_i = y_i − f_1(x_i1) − f_2(x_i2); if we know f_1 and f_2, then we can fit f_3 by treating this residual as a response in a non-linear regression on X3.)
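To make the partial-residual idea concrete, here is a toy backfitting loop for a model with two smoothing-spline terms. The vectors x1, x2, and y are placeholders, the 20 iterations and df=4 are arbitrary choices, and this is only a sketch of the idea, not the actual algorithm used by gam().

> f1=rep(0,length(y)); f2=rep(0,length(y)); beta0=mean(y)
> for(iter in 1:20){
+   r1=y-beta0-f2                              # partial residual for x1
+   f1=predict(smooth.spline(x1,r1,df=4),x1)$y
+   f1=f1-mean(f1)                             # center so the intercept stays identifiable
+   r2=y-beta0-f1                              # partial residual for x2
+   f2=predict(smooth.spline(x2,r2,df=4),x2)$y
+   f2=f2-mean(f2)
+ }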

FIGURE 7.12. Details are as in Figure 7.11, but now f1 and f2 are smoothing splines with four and five degrees of freedom, respectively.

The fitted functions in Figures 7.11 and 7.12 look rather similar. In most situations, the differences in the GAMs obtained using smoothing splines versus natural splines are small.

We do not have to use splines as the building blocks for GAMs: we can just as well use local regression, polynomial regression, or any combination of the approaches seen earlier in this chapter in order to create a GAM. GAMs are investigated in further detail in the lab at the end of this chapter.

Pros and Cons of GAMs

Before we move on, let us summarize the advantages and limitations of a GAM.

- GAMs allow us to fit a non-linear f_j to each X_j, so that we can automatically model non-linear relationships that standard linear regression will miss. This means that we do not need to manually try out many different transformations on each variable individually.
- The non-linear fits can potentially make more accurate predictions for the response Y.
- Because the model is additive, we can still examine the effect of each X_j on Y individually while holding all of the other variables fixed. Hence if we are interested in inference, GAMs provide a useful representation.

- The smoothness of the function f_j for the variable X_j can be summarized via degrees of freedom.
- The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed. However, as with linear regression, we can manually add interaction terms to the GAM model by including additional predictors of the form X_j × X_k. In addition we can add low-dimensional interaction functions of the form f_jk(X_j, X_k) into the model; such terms can be fit using two-dimensional smoothers such as local regression, or two-dimensional splines (not covered here).

For fully general models, we have to look for even more flexible approaches such as random forests and boosting, described in Chapter 8. GAMs provide a useful compromise between linear and fully nonparametric models.

7.7.2 GAMs for Classification Problems

GAMs can also be used in situations where Y is qualitative. For simplicity, here we will assume Y takes on values zero or one, and let p(X) = Pr(Y = 1 | X) be the conditional probability (given the predictors) that the response equals one. Recall the logistic regression model (4.6):

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.    (7.17)

This logit is the log of the odds of P(Y = 1 | X) versus P(Y = 0 | X), which (7.17) represents as a linear function of the predictors. A natural way to extend (7.17) to allow for non-linear relationships is to use the model

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p).    (7.18)

Equation 7.18 is a logistic regression GAM. It has all the same pros and cons as discussed in the previous section for quantitative responses.

We fit a GAM to the Wage data in order to predict the probability that an individual's income exceeds $250,000 per year. The GAM that we fit takes the form

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 \times year + f_2(age) + f_3(education),    (7.19)

where p(X) = Pr(wage > 250 | year, age, education).
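A fit of (7.19) can be sketched with the gam library that is discussed in the lab at the end of this chapter; this assumes the Wage data from the ISLR package, and the object name gam.lr is ours.

> library(ISLR); library(gam)
> gam.lr=gam(I(wage>250)~year+s(age,df=5)+education,family=binomial,data=Wage)
> par(mfrow=c(1,3))
> plot(gam.lr,se=TRUE)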

302 7.8 Lab: Nn-linear Mdeling 287 <HS HS <Cll Cll >Cll f1(year) f2(age) f3(educatin) year age educatin FIGURE Fr the Wage data, the lgistic regressin GAM given in (7.19) is fit t the binary respnse I(wage>250). Each plt displays the fitted functin and pintwise standard errrs. The first functin is linear in year, the secnd functin a smthing spline with five degrees f freedm in age, and the third a step functin fr educatin. There are very wide standard errrs fr the first level <HS f educatin. Once again f 2 is fit using a smthing spline with five degrees f freedm, and f 3 is fit as a step functin, by creating dummy variables fr each f the levels f educatin. The resulting fit is shwn in Figure The last panel lks suspicius, with very wide cnfidence intervals fr level <HS. In fact, there are n nes fr that categry: n individuals with less than a high schl educatin make mre than $250,000peryear.Hencewerefit the GAM, excluding the individuals with less than a high schl educatin. The resulting mdel is shwn in Figure As in Figures 7.11 and 7.12, all three panels have the same vertical scale. This allws us t visually assess the relative cntributins f each f the variables. We bserve that age and educatin have a much larger effect than year n the prbability f being a high earner. 7.8 Lab: Nn-linear Mdeling In this lab, we re-analyze the Wage data cnsidered in the examples thrughut this chapter, in rder t illustrate the fact that many f the cmplex nn-linear fitting prcedures discussed can be easily implemented in R. We begin by lading the ISLR library, which cntains the data. > library(islr) > attach(wage)

303 Mving Beynd Linearity HS <Cll Cll >Cll f1(year) f2(age) f3(educatin) year age educatin FIGURE The same mdel is fit as in Figure 7.13, this time excluding the bservatins fr which educatin is <HS. Nw we see that increased educatin tends t be assciated with higher salaries Plynmial Regressin and Step Functins We nw examine hw Figure 7.1 was prduced. We first fit the mdel using the fllwing cmmand: > fit=lm(wage ply(age,4),data=wage) > cef(summary(fit)) Estimate Std. Errr t value Pr(>t) (Intercept ) <2e-16 ply(age, 4) <2e-16 ply(age, 4) <2e-16 ply(age, 4) ply(age, 4) This syntax fits a linear mdel, using the lm() functin, in rder t predict wage using a furth-degree plynmial in age: ply(age,4). Theply() cmmand allws us t avid having t write ut a lng frmula with pwers f age. The functin returns a matrix whse clumns are a basis f rthgnal plynmials, which essentially means that each clumn is a linear rthgnal cmbinatin f the variables age, age^2, age^3 and age^4. plynmial Hwever, we can als use ply() t btain age, age^2, age^3 and age^4 directly, if we prefer. We can d this by using the raw=true argument t the ply() functin. Later we see that this des nt affect the mdel in a meaningful way thugh the chice f basis clearly affects the cefficient estimates, it des nt affect the fitted values btained. > fit2=lm(wage ply(age,4,raw=t),data=wage) > cef(summary(fit2)) Estimate Std. Errr t value Pr(>t) (Intercept ) -1.84e e ply(age, 4, raw = T)1 2.12e e ply(age, 4, raw = T)2-5.64e e

304 7.8 Lab: Nn-linear Mdeling 289 ply(age, 4, raw = T)3 6.81e e ply(age, 4, raw = T)4-3.20e e There are several ther equivalent ways f fitting this mdel, which shwcase the flexibility f the frmula language in R. Fr example > fit2a=lm(wage age+i(age^2)+i(age^3)+i(age^4),data=wage) > cef(fit2a) (Intercept ) age I(age^2) I(age^3) I(age^4) -1.84e e e e e-05 This simply creates the plynmial basis functins n the fly, taking care t prtect terms like age^2 via the wrapper functin I() (the ^ symbl has wrapper a special meaning in frmulas). > fit2b=lm(wage cbind(age,age^2,age^3,age^4),data=wage) This des the same mre cmpactly, using the cbind() functin fr building a matrix frm a cllectin f vectrs; any functin call such as cbind() inside a frmula als serves as a wrapper. We nw create a grid f values fr age at which we want predictins, and then call the generic predict() functin, specifying that we want standard errrs as well. > agelims=range(age) > age.grid=seq(frm=agelims[1],t=agelims [2]) > preds=predict(fit,newdata=list(age=age.grid),se=true) > se.bands=cbind(preds$fit +2*preds$se.fit,preds$fit -2*preds$se. fit) Finally, we plt the data and add the fit frm the degree-4 plynmial. > par(mfrw=c(1,2),mar=c(4.5,4.5,1,1),ma=c(0,0,4,0)) > plt(age,wage,xlim=agelims,cex=.5,cl="darkgrey ") > title("degree -4 Plynmial ",uter=t) > lines(age.grid,preds$fit,lwd=2,cl="blue") > matlines(age.grid,se.bands,lwd=1,cl="blue",lty=3) Here the mar and ma arguments t par() allw us t cntrl the margins f the plt, and the title() functin creates a figure title that spans bth title() subplts. We mentined earlier that whether r nt an rthgnal set f basis functinsisprducedintheply() functin will nt affect the mdel btained in a meaningful way. What d we mean by this? The fitted values btained in either case are identical: > preds2=predict(fit2,newdata=list(age=age.grid),se=true) > max(abs(preds$fit -preds2$fit )) [1] 7.39e-13 In perfrming a plynmial regressin we must decide n the degree f the plynmial t use. One way t d this is by using hypthesis tests. We nw fit mdels ranging frm linear t a degree-5 plynmial and seek t determine the simplest mdel which is sufficient t explain the relatinship

305 Mving Beynd Linearity between wage and age. Weusethe anva() functin, which perfrms an anva() analysis f variance (ANOVA, using an F-test) in rder t test the null analysis f variance hypthesis that a mdel M 1 is sufficient t explain the data against the alternative hypthesis that a mre cmplex mdel M 2 is required. In rder t use the anva() functin, M 1 and M 2 must be nested mdels: the predictrs in M 1 must be a subset f the predictrs in M 2.Inthiscase, we fit five different mdels and sequentially cmpare the simpler mdel t the mre cmplex mdel. > fit.1=lm(wage age,data=wage) > fit.2=lm(wage ply(age,2),data=wage) > fit.3=lm(wage ply(age,3),data=wage) > fit.4=lm(wage ply(age,4),data=wage) > fit.5=lm(wage ply(age,5),data=wage) > anva(fit.1,fit.2,fit.3,fit.4,fit.5) Analysis f Variance Table Mdel 1: wage age Mdel 2: wage ply(age, 2) Mdel 3: wage ply(age, 3) Mdel 4: wage ply(age, 4) Mdel 5: wage ply(age, 5) Res.Df RSS Df Sum f Sq F Pr(>F) <2e-16 *** ** Signif. cdes: 0 *** ** 0.01 * The p-value cmparing the linear Mdel 1 t the quadratic Mdel 2 is essentially zer (<10 15 ), indicating that a linear fit is nt sufficient. Similarly the p-value cmparing the quadratic Mdel 2 t the cubic Mdel 3 is very lw (0.0017), s the quadratic fit is als insufficient. The p-value cmparing the cubic and degree-4 plynmials, Mdel 3 and Mdel 4, isapprximately 5 % while the degree-5 plynmial Mdel 5 seems unnecessary because its p-value is Hence, either a cubic r a quartic plynmial appear t prvide a reasnable fit t the data, but lwer- r higher-rder mdels are nt justified. In this case, instead f using the anva() functin, we culd have btained these p-values mre succinctly by expliting the fact that ply() creates rthgnal plynmials. > cef(summary(fit.5)) Estimate Std. Errr t value Pr(>t) (Intercept ) e+00 ply(age, 5) e-28 ply(age, 5) e-32 ply(age, 5) e-03

306 7.8 Lab: Nn-linear Mdeling 291 ply(age, 5) e-02 ply(age, 5) e-01 Ntice that the p-values are the same, and in fact the square f the t-statistics are equal t the F-statistics frm the anva() functin; fr example: > ( )^2 [1] Hwever, the ANOVA methd wrks whether r nt we used rthgnal plynmials; it als wrks when we have ther terms in the mdel as well. Fr example, we can use anva() t cmpare these three mdels: > fit.1=lm(wage educatin +age,data=wage) > fit.2=lm(wage educatin +ply(age,2),data=wage) > fit.3=lm(wage educatin +ply(age,3),data=wage) > anva(fit.1,fit.2,fit.3) As an alternative t using hypthesis tests and ANOVA, we culd chse the plynmial degree using crss-validatin, as discussed in Chapter 5. Next we cnsider the task f predicting whether an individual earns mre than $250,000 per year. We prceed much as befre, except that first we create the apprpriate respnse vectr, and then apply the glm() functin using family="binmial" in rder t fit a plynmial lgistic regressin mdel. > fit=glm(i(wage >250) ply(age,4),data=wage,family=binmial) Nte that we again use the wrapper I() t create this binary respnse variable n the fly. The expressin wage>250 evaluates t a lgical variable cntaining TRUEs andfalses, which glm() cerces t binary by setting the TRUEs t1andthefalses t0. Once again, we make predictins using the predict() functin. > preds=predict(fit,newdata=list(age=age.grid),se=t) Hwever, calculating the cnfidence intervals is slightly mre invlved than in the linear regressin case. The default predictin type fr a glm() mdel is type="link", which is what we use here. This means we get predictins fr the lgit: that is, we have fit a mdel f the frm ( ) Pr(Y =1X) lg = Xβ, 1 Pr(Y =1X) and the predictins given are f the frm X ˆβ. The standard errrs given are als f this frm. In rder t btain cnfidence intervals fr Pr(Y = 1X), we use the transfrmatin Pr(Y =1X) = exp(xβ) 1+exp(Xβ).

307 Mving Beynd Linearity > pfit=exp(preds$fit )/(1+exp(preds$fit )) > se.bands.lgit = cbind(preds$fit +2*preds$se.fit, preds$fit -2* preds$se.fit) > se.bands = exp(se.bands.lgit)/(1+exp(se.bands.lgit)) Nte that we culd have directly cmputed the prbabilities by selecting the type="respnse" ptin in the predict() functin. > preds=predict(fit,newdata=list(age=age.grid),type="respnse", se=t) Hwever, the crrespnding cnfidence intervals wuld nt have been sensible because we wuld end up with negative prbabilities! Finally, the right-hand plt frm Figure 7.1 was made as fllws: > plt(age,i(wage >250),xlim=agelims,type="n",ylim=c(0,.2)) > pints(jitter(age), I((wage >250)/5),cex=.5,pch="", cl="darkgrey ") > lines(age.grid,pfit,lwd=2, cl="blue") > matlines(age.grid,se.bands,lwd=1,cl="blue",lty=3) We have drawn the age values crrespnding t the bservatins with wage values abve 250 as gray marks n the tp f the plt, and thse with wage values belw 250 are shwn as gray marks n the bttm f the plt. We used the jitter() functin t jitter the age values a bit s that bservatins jitter() with the same age value d nt cver each ther up. This is ften called a rug plt. In rder t fit a step functin, as discussed in Sectin 7.2, we use the rug plt cut() functin. cut() > table(cut(age,4)) (17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1] > fit=lm(wage cut(age,4),data=wage) > cef(summary(fit)) Estimate Std. Errr t value Pr(>t) (Intercept ) e+00 cut(age, 4)(33.5,49] e-38 cut(age, 4)(49,64.5] e-29 cut(age, 4)(64.5,80.1] e-01 Here cut() autmatically picked the cutpints at 33.5, 49, and 64.5 years f age. We culd als have specified ur wn cutpints directly using the breaks ptin. The functin cut() returns an rdered categrical variable; the lm() functin then creates a set f dummy variables fr use in the regressin. The age<33.5 categry is left ut, s the intercept cefficient f $94,160 can be interpreted as the average salary fr thse under 33.5 years f age, and the ther cefficients can be interpreted as the average additinal salary fr thse in the ther age grups. We can prduce predictins and plts just as we did in the case f the plynmial fit.

308 7.8.2 Splines 7.8 Lab: Nn-linear Mdeling 293 In rder t fit regressin splines in R, we use the splines library. In Sectin 7.4, we saw that regressin splines can be fit by cnstructing an apprpriate matrix f basis functins. The bs() functin generates the entire matrix f bs() basis functins fr splines with the specified set f knts. By default, cubic splines are prduced. Fitting wage t age using a regressin spline is simple: > library(splines) > fit=lm(wage bs(age,knts=c(25,40,60) ),data=wage) > pred=predict(fit,newdata=list(age=age.grid),se=t) > plt(age,wage,cl="gray") > lines(age.grid,pred$fit,lwd=2) > lines(age.grid,pred$fit +2*pred$se,lty="dashed") > lines(age.grid,pred$fit -2*pred$se,lty="dashed") Here we have prespecified knts at ages 25, 40, and 60. This prduces a spline with six basis functins. (Recall that a cubic spline with three knts has seven degrees f freedm; these degrees f freedm are used up by an intercept, plus six basis functins.) We culd als use the df ptin t prduce a spline with knts at unifrm quantiles f the data. > dim(bs(age,knts=c(25,40,60))) [1] > dim(bs(age,df=6)) [1] > attr(bs(age,df=6),"knts") 25% 50% 75% In this case R chses knts at ages 33.8, 42.0, and 51.0, which crrespnd t the 25th, 50th, and 75th percentiles f age. The functin bs() als has a degree argument, s we can fit splines f any degree, rather than the default degree f 3 (which yields a cubic spline). In rder t instead fit a natural spline, we use the ns() functin. Here ns() we fit a natural spline with fur degrees f freedm. > fit2=lm(wage ns(age,df=4),data=wage) > pred2=predict(fit2,newdata=list(age=age.grid),se=t) > lines(age.grid, pred2$fit,cl="red",lwd=2) As with the bs() functin, we culd instead specify the knts directly using the knts ptin. In rder t fit a smthing spline, we use the smth.spline() functin. Figure 7.8 was prduced with the fllwing cde: > plt(age,wage,xlim=agelims,cex=.5,cl="darkgrey ") > title("smthing Spline") > fit=smth.spline(age,wage,df=16) > fit2=smth.spline(age,wage,cv=true) > fit2$df [1] 6.8 > lines(fit,cl="red",lwd=2) smth. spline()

309 Mving Beynd Linearity > lines(fit2,cl="blue",lwd=2) > legend("tpright",legend=c("16 DF","6.8 DF"), cl=c("red","blue"),lty=1,lwd=2,cex=.8) Ntice that in the first call t smth.spline(), we specified df=16. The functin then determines which value f λ leads t 16 degrees f freedm. In the secnd call t smth.spline(), we select the smthness level by crssvalidatin; this results in a value f λ that yields 6.8 degrees f freedm. In rder t perfrm lcal regressin, we use the less() functin. > plt(age,wage,xlim=agelims,cex=.5,cl="darkgrey ") > title("lcal Regressin ") > fit=less(wage age,span=.2,data=wage) > fit2=less(wage age,span=.5,data=wage) > lines(age.grid,predict(fit,data.frame(age=age.grid)), cl="red",lwd=2) > lines(age.grid,predict(fit2,data.frame(age=age.grid)), cl="blue",lwd=2) > legend("tpright",legend=c("span=0.2"," Span=0.5"), cl=c("red","blue"),lty=1,lwd=2,cex=.8) Here we have perfrmed lcal linear regressin using spans f 0.2 and0.5: that is, each neighbrhd cnsists f 20 % r 50 % f the bservatins. The larger the span, the smther the fit. The lcfit library can als be used fr fitting lcal regressin mdels in R. less() GAMs We nw fit a GAM t predict wage using natural spline functins f year and age, treating educatin as a qualitative predictr, as in (7.16). Since this is just a big linear regressin mdel using an apprpriate chice f basis functins, we can simply d this using the lm() functin. > gam1=lm(wage ns(year,4)+ns(age,5)+educatin,data=wage) We nw fit the mdel (7.16) using smthing splines rather than natural splines. In rder t fit mre general srts f GAMs, using smthing splines r ther cmpnents that cannt be expressed in terms f basis functins and then fit using least squares regressin, we will need t use the gam library in R. The s() functin, which is part f the gam library, is used t indicate that s() we wuld like t use a smthing spline. We specify that the functin f year shuld have 4 degrees f freedm, and that the functin f age will have 5 degrees f freedm. Since educatin is qualitative, we leave it as is, and it is cnverted int fur dummy variables. We use the gam() functin in gam() rder t fit a GAM using these cmpnents. All f the terms in (7.16) are fit simultaneusly, taking each ther int accunt t explain the respnse. > library(gam) > gam.m3=gam(wage s(year,4)+s(age,5)+educatin,data=wage)

310 7.8 Lab: Nn-linear Mdeling 295 In rder t prduce Figure 7.12, we simply call the plt() functin: > par(mfrw=c(1,3)) > plt(gam.m3, se=true,cl="blue") The generic plt() functin recgnizes that gam.m3 is an bject f class gam, and invkes the apprpriate plt.gam() methd. Cnveniently, even thugh plt.gam() gam1 is nt f class gam but rather f class lm, wecanstill use plt.gam() n it. Figure 7.11 was prduced using the fllwing expressin: > plt.gam(gam1, se=true, cl="red") Ntice here we had t use plt.gam() rather than the generic plt() functin. In these plts, the functin f year lks rather linear. We can perfrm a series f ANOVA tests in rder t determine which f these three mdels is best: a GAM that excludes year (M 1 ), a GAM that uses a linear functin f year (M 2 ), r a GAM that uses a spline functin f year (M 3 ). > gam.m1=gam(wage s(age,5)+educatin,data=wage) > gam.m2=gam(wage year+s(age,5)+educatin,data=wage) > anva(gam.m1,gam.m2,gam.m3,test="f") Analysis f Deviance Table Mdel 1: wage s(age, 5) + educatin Mdel 2: wage year + s(age, 5) + educatin Mdel 3: wage s(year, 4) + s(age, 5) + educatin Resid. Df Resid. Dev Df Deviance F Pr(>F) *** Signif. cdes: 0 *** ** 0.01 * We find that there is cmpelling evidence that a GAM with a linear functin f year is better than a GAM that des nt include year at all (p-value = ). Hwever, there is n evidence that a nn-linear functin f year is needed (p-value = 0.349). In ther wrds, based n the results f this ANOVA, M 2 is preferred. The summary() functin prduces a summary f the gam fit. > summary(gam.m3) Call: gam(frmula = wage s(year, 4) + s(age, 5) + educatin, data = Wage) Deviance Residuals : Min 1Q Median 3Q Max (Dispersin Parameter fr gaussian family taken t be 1236) Null Deviance: n 2999 degrees f freedm Residual Deviance: n 2986 degrees f freedm

311 Mving Beynd Linearity AIC: Number f Lcal Scring Iteratins : 2 DF fr Terms and F-values fr Nnparametric Effects Df Npar Df Npar F Pr(F) (Intercept ) 1 s(year, 4) s(age, 5) <2e-16 *** educatin Signif. cdes: 0 *** ** 0.01 * The p-values fr year and age crrespnd t a null hypthesis f a linear relatinship versus the alternative f a nn-linear relatinship. The large p-value fr year reinfrces ur cnclusin frm the ANOVA test that a linear functin is adequate fr this term. Hwever, there is very clear evidence that a nn-linear term is required fr age. We can make predictins frm gam bjects, just like frm lm bjects, using the predict() methd fr the class gam. Herewemakepredictinsn the training set. > preds=predict(gam.m2,newdata=wage) We can als use lcal regressin fits as building blcks in a GAM, using the l() functin. > gam.l=gam(wage s(year,df=4)+l(age,span=0.7)+educatin, data=wage) > plt.gam(gam.l, se=true, cl="green") Here we have used lcal regressin fr the age term, with a span f 0.7. We can als use the l() functin t create interactins befre calling the gam() functin. Fr example, > gam.l.i=gam(wage l(year,age,span=0.5)+educatin, data=wage) fits a tw-term mdel, in which the first term is an interactin between year and age, fit by a lcal regressin surface. We can plt the resulting tw-dimensinal surface if we first install the akima package. > library(akima) > plt(gam.l.i) In rder t fit a lgistic regressin GAM, we nce again use the I() functin in cnstructing the binary respnse variable, and set family=binmial. > gam.lr=gam(i(wage >250) year+s(age,df=5)+educatin, family=binmial,data=wage) > par(mfrw=c(1,3)) > plt(gam.lr,se=t,cl="green") l()

312 7.9 Exercises 297 It is easy t see that there are n high earners in the <HS categry: > table(educatin,i(wage >250)) educatin FALSE TRUE 1. < HS Grad HS Grad Sme Cllege Cllege Grad Advanced Degree Hence, we fit a lgistic regressin GAM using all but this categry. This prvides mre sensible results. > gam.lr.s=gam(i(wage >250) year+s(age,df=5)+educatin,family= binmial,data=wage,subset=(educatin!="1. < HS Grad")) > plt(gam.lr.s,se=t,cl="green") 7.9 Exercises Cnceptual 1. It was mentined in the chapter that a cubic regressin spline with ne knt at ξ can be btained using a basis f the frm x, x 2, x 3, (x ξ) 3 +,where(x ξ)3 + =(x ξ)3 if x>ξand equals 0 therwise. We will nw shw that a functin f the frm f(x) =β 0 + β 1 x + β 2 x 2 + β 3 x 3 + β 4 (x ξ) 3 + is indeed a cubic regressin spline, regardless f the values f β 0,β 1,β 2, β 3,β 4. (a) Find a cubic plynmial f 1 (x) =a 1 + b 1 x + c 1 x 2 + d 1 x 3 such that f(x) =f 1 (x) fr all x ξ. Expressa 1,b 1,c 1,d 1 in terms f β 0,β 1,β 2,β 3,β 4. (b) Find a cubic plynmial f 2 (x) =a 2 + b 2 x + c 2 x 2 + d 2 x 3 such that f(x) =f 2 (x) fr all x>ξ.expressa 2,b 2,c 2,d 2 in terms f β 0,β 1,β 2,β 3,β 4. We have nw established that f(x) is a piecewise plynmial. (c) Shw that f 1 (ξ) =f 2 (ξ). That is, f(x) is cntinuus at ξ. (d) Shw that f 1 (ξ) =f 2 (ξ). That is, f (x) is cntinuus at ξ.

313 Mving Beynd Linearity (e) Shw that f 1 (ξ) =f 2 (ξ). That is, f (x) is cntinuus at ξ. Therefre, f(x) is indeed a cubic spline. Hint: Parts (d) and (e) f this prblem require knwledge f singlevariable calculus. As a reminder, given a cubic plynmial the first derivative takes the frm f 1 (x) =a 1 + b 1 x + c 1 x 2 + d 1 x 3, f 1(x) =b 1 +2c 1 x +3d 1 x 2 and the secnd derivative takes the frm f 1 (x) =2c 1 +6d 1 x. 2. Suppse that a curve ĝ is cmputed t smthly fit a set f n pints using the fllwing frmula: ( n ) [ ] 2 ĝ =argmin (y i g(x i )) 2 + λ g (m) (x) dx, g i=1 where g (m) represents the mth derivative f g (and g (0) = g). Prvide example sketches f ĝ in each f the fllwing scenaris. (a) λ =,m=0. (b) λ =,m=1. (c) λ =,m=2. (d) λ =,m=3. (e) λ =0,m=3. 3. Suppse we fit a curve with basis functins b 1 (X) =X, b 2 (X) = (X 1) 2 I(X 1). (Nte that I(X 1) equals 1 fr X 1and0 therwise.) We fit the linear regressin mdel Y = β 0 + β 1 b 1 (X)+β 2 b 2 (X)+ɛ, and btain cefficient estimates ˆβ 0 =1, ˆβ 1 =1, ˆβ 2 = 2. Sketch the estimated curve between X = 2 andx = 2. Nte the intercepts, slpes, and ther relevant infrmatin. 4. Suppse we fit a curve with basis functins b 1 (X) =I(0 X 2) (X 1)I(1 X 2), b 2 (X) =(X 3)I(3 X 4) + I(4 <X 5). We fit the linear regressin mdel Y = β 0 + β 1 b 1 (X)+β 2 b 2 (X)+ɛ, and btain cefficient estimates ˆβ 0 =1, ˆβ 1 =1, ˆβ 2 =3.Sketchthe estimated curve between X = 2 andx = 2. Nte the intercepts, slpes, and ther relevant infrmatin.

314 7.9 Exercises Cnsider tw curves, ĝ 1 and ĝ 2, defined by ( n ) [ ] 2 ĝ 1 =argmin (y i g(x i )) 2 + λ g (3) (x) dx, g Applied ĝ 2 =argmin g i=1 ( n ) [ ] 2 (y i g(x i )) 2 + λ g (4) (x) dx, i=1 where g (m) represents the mth derivative f g. (a) As λ, will ĝ 1 r ĝ 2 have the smaller training RSS? (b) As λ, will ĝ 1 r ĝ 2 have the smaller test RSS? (c) Fr λ = 0, will ĝ 1 r ĝ 2 have the smaller training and test RSS? 6. In this exercise, yu will further analyze the Wage data set cnsidered thrughut this chapter. (a) Perfrm plynmial regressin t predict wage using age. Use crss-validatin t select the ptimal degree d fr the plynmial. What degree was chsen, and hw des this cmpare t the results f hypthesis testing using ANOVA? Make a plt f the resulting plynmial fit t the data. (b) Fit a step functin t predict wage using age, and perfrm crssvalidatin t chse the ptimal number f cuts. Make a plt f the fit btained. 7. The Wage data set cntains a number f ther features nt explred in this chapter, such as marital status (maritl), jb class (jbclass), and thers. Explre the relatinships between sme f these ther predictrs and wage, and use nn-linear fitting techniques in rder t fit flexible mdels t the data. Create plts f the results btained, and write a summary f yur findings. 8. Fit sme f the nn-linear mdels investigated in this chapter t the Aut data set. Is there evidence fr nn-linear relatinships in this data set? Create sme infrmative plts t justify yur answer. 9. This questin uses the variables dis (the weighted mean f distances t five Bstn emplyment centers) and nx (nitrgen xides cncentratin in parts per 10 millin) frm the Bstn data. We will treat dis as the predictr and nx as the respnse. (a) Use the ply() functin t fit a cubic plynmial regressin t predict nx using dis. Reprt the regressin utput, and plt the resulting data and plynmial fits.

315 Mving Beynd Linearity (b) Plt the plynmial fits fr a range f different plynmial degrees (say, frm 1 t 10), and reprt the assciated residual sum f squares. (c) Perfrm crss-validatin r anther apprach t select the ptimal degree fr the plynmial, and explain yur results. (d) Use the bs() functin t fit a regressin spline t predict nx using dis. Reprt the utput fr the fit using fur degrees f freedm. Hw did yu chse the knts? Plt the resulting fit. (e) Nw fit a regressin spline fr a range f degrees f freedm, and plt the resulting fits and reprt the resulting RSS. Describe the results btained. (f) Perfrm crss-validatin r anther apprach in rder t select the best degrees f freedm fr a regressin spline n this data. Describe yur results. 10. This questin relates t the Cllege data set. (a) Split the data int a training set and a test set. Using ut-f-state tuitin as the respnse and the ther variables as the predictrs, perfrm frward stepwise selectin n the training set in rder t identify a satisfactry mdel that uses just a subset f the predictrs. (b) Fit a GAM n the training data, using ut-f-state tuitin as the respnse and the features selected in the previus step as the predictrs. Plt the results, and explain yur findings. (c) Evaluate the mdel btained n the test set, and explain the results btained. (d) Fr which variables, if any, is there evidence f a nn-linear relatinship with the respnse? 11. In Sectin 7.7, it was mentined that GAMs are generally fit using a backfitting apprach. The idea behind backfitting is actually quite simple. We will nw explre backfitting in the cntext f multiple linear regressin. Suppse that we wuld like t perfrm multiple linear regressin, but we d nt have sftware t d s. Instead, we nly have sftware t perfrm simple linear regressin. Therefre, we take the fllwing iterative apprach: we repeatedly hld all but ne cefficient estimate fixed at its current value, and update nly that cefficient estimate using a simple linear regressin. The prcess is cntinued until cnvergence that is, until the cefficient estimates stp changing. We nw try this ut n a ty example.

316 7.9 Exercises 301 (a) Generate a respnse Y and tw predictrs X 1 and X 2, with n = 100. (b) Initialize ˆβ 1 t take n a value f yur chice. It des nt matter what value yu chse. (c) Keeping ˆβ 1 fixed, fit the mdel Yu can d this as fllws: > a=y-beta1*x1 > beta2=lm(a x2)$cef[2] (d) Keeping ˆβ 2 fixed, fit the mdel Yu can d this as fllws: > a=y-beta2*x2 > beta1=lm(a x1)$cef[2] Y ˆβ 1 X 1 = β 0 + β 2 X 2 + ɛ. Y ˆβ 2 X 2 = β 0 + β 1 X 1 + ɛ. (e) Write a fr lp t repeat (c) and (d) 1,000 times. Reprt the estimates f ˆβ 0, ˆβ1,and ˆβ 2 at each iteratin f the fr lp. Create a plt in which each f these values is displayed, with ˆβ 0, ˆβ 1,and ˆβ 2 each shwn in a different clr. (f) Cmpare yur answer in (e) t the results f simply perfrming multiple linear regressin t predict Y using X 1 and X 2.Use the abline() functin t verlay thse multiple linear regressin cefficient estimates n the plt btained in (e). (g) On this data set, hw many backfitting iteratins were required in rder t btain a gd apprximatin t the multiple regressin cefficient estimates? 12. This prblem is a cntinuatin f the previus exercise. In a ty example with p = 100, shw that ne can apprximate the multiple linear regressincefficientestimates by repeatedly perfrming simple linear regressin in a backfitting prcedure. Hw many backfitting iteratins are required in rder t btain a gd apprximatin t the multiple regressin cefficient estimates? Create a plt t justify yur answer.


8 Tree-Based Methods

In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.

Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, such as those seen in Chapters 6 and 7, in terms of prediction accuracy. Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

8.1 The Basics of Decision Trees

Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification.

FIGURE 8.1. For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year. At a given internal node, the label (of the form X_j < t_k) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to X_j ≥ t_k. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years<4.5, and the right-hand branch corresponds to Years>=4.5. The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.

8.1.1 Regression Trees

In order to motivate regression trees, we begin with a simple example.

Predicting Baseball Players' Salaries Using Regression Trees

We use the Hitters data set to predict a baseball player's Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

Figure 8.1 shows a regression tree fit to this data. It consists of a series of splitting rules, starting at the top of the tree. The top split assigns observations having Years<4.5 to the left branch. (Both Years and Hits are integers in these data; the tree() function in R labels the splits at the midpoint between two adjacent values.) The predicted salary

FIGURE 8.2. The three-region partition for the Hitters data set from the regression tree illustrated in Figure 8.1.

for these players is given by the mean response value for the players in the data set with Years<4.5. For such players, the mean log salary is 5.107, and so we make a prediction of e^5.107 thousands of dollars, i.e. $165,174, for these players. Players with Years>=4.5 are assigned to the right branch, and then that group is further subdivided by Hits. Overall, the tree stratifies or segments the players into three regions of predictor space: players who have played for four or fewer years, players who have played for five or more years and who made fewer than 118 hits last year, and players who have played for five or more years and who made at least 118 hits last year. These three regions can be written as R1 = {X | Years<4.5}, R2 = {X | Years>=4.5, Hits<117.5}, and R3 = {X | Years>=4.5, Hits>=117.5}. Figure 8.2 illustrates the regions as a function of Years and Hits. The predicted salaries for these three groups are $1,000 × e^5.107 = $165,174, $1,000 × e^5.999 = $402,834, and $1,000 × e^6.740 = $845,346, respectively.

In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal nodes or leaves of the tree. As is the case for Figure 8.1, decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. The points along the tree where the predictor space is split are referred to as internal nodes. In Figure 8.1, the two internal nodes are indicated by the text Years<4.5 and Hits<117.5. We refer to the segments of the trees that connect the nodes as branches.

We might interpret the regression tree displayed in Figure 8.1 as follows: Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players. Given that a player is less experienced, the number of hits that he made in the previous year seems to play little role in his salary. But among players who have been in the major leagues for five or more years, the number of hits made in the previous year does affect salary, and players who made more hits last year tend to have higher salaries.
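A tree like the one in Figure 8.1 can be grown with the tree package; this is a minimal sketch assuming the Hitters data from the ISLR package, and the unpruned tree it produces will typically have more than the three leaves shown in Figure 8.1 (pruning is discussed later in this section).

> library(ISLR); library(tree)
> Hitters=na.omit(Hitters)                  # remove observations with missing values
> tree.hitters=tree(log(Salary)~Years+Hits,data=Hitters)
> plot(tree.hitters)
> text(tree.hitters,pretty=0)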

The regression tree shown in Figure 8.1 is likely an over-simplification of the true relationship between Hits, Years, and Salary. However, it has advantages over other types of regression models (such as those seen in Chapters 3 and 6): it is easier to interpret, and has a nice graphical representation.

Prediction via Stratification of the Feature Space

We now discuss the process of building a regression tree. Roughly speaking, there are two steps.

1. We divide the predictor space (that is, the set of possible values for X1, X2, ..., Xp) into J distinct and non-overlapping regions, R1, R2, ..., RJ.
2. For every observation that falls into the region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.

For instance, suppose that in Step 1 we obtain two regions, R1 and R2, and that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20. Then for a given observation X = x, if x ∈ R1 we will predict a value of 10, and if x ∈ R2 we will predict a value of 20.

We now elaborate on Step 1 above. How do we construct the regions R1, ..., RJ? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes R1, ..., RJ that minimize the RSS, given by

\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,    (8.1)

where ŷ_Rj is the mean response for the training observations within the jth box. Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
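To see what one greedy split involves, here is a minimal sketch of the cutpoint search for a single numeric predictor, written out directly; the criterion it minimizes is made precise in (8.2) and (8.3) below, and best.split is our own helper name, not a library function.

> best.split=function(x,y){
+   cuts=sort(unique(x))
+   cuts=(cuts[-1]+cuts[-length(cuts)])/2     # candidate cutpoints: midpoints between adjacent values
+   rss=sapply(cuts,function(s){
+     left=y[x<s]; right=y[x>=s]
+     sum((left-mean(left))^2)+sum((right-mean(right))^2)
+   })
+   cuts[which.min(rss)]                      # cutpoint giving the lowest two-region RSS
+ }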

In order to perform recursive binary splitting, we first select the predictor X_j and the cutpoint s such that splitting the predictor space into the regions {X | X_j < s} and {X | X_j ≥ s} leads to the greatest possible reduction in RSS. (The notation {X | X_j < s} means the region of predictor space in which X_j takes on a value less than s.) That is, we consider all predictors X1, ..., Xp, and all possible values of the cutpoint s for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS. In greater detail, for any j and s, we define the pair of half-planes

R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \geq s\},    (8.2)

and we seek the value of j and s that minimize the equation

\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2,    (8.3)

where ŷ_R1 is the mean response for the training observations in R_1(j, s), and ŷ_R2 is the mean response for the training observations in R_2(j, s). Finding the values of j and s that minimize (8.3) can be done quite quickly, especially when the number of features p is not too large.

Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.

Once the regions R1, ..., RJ have been created, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs. A five-region example of this approach is shown in Figure 8.3.

Tree Pruning

The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex. A smaller tree with fewer splits (that is, fewer regions R1, ..., RJ) might lead to lower variance and better interpretation at the cost of a little bias. One possible alternative to the process described above is to build the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy will result in smaller trees, but is too short-sighted, since a seemingly worthless split early on in the tree might be followed by a very good split, that is, a split that leads to a large reduction in RSS later on.

FIGURE 8.3. Top Left: A partition of two-dimensional feature space that could not result from recursive binary splitting. Top Right: The output of recursive binary splitting on a two-dimensional example. Bottom Left: A tree corresponding to the partition in the top right panel. Bottom Right: A perspective plot of the prediction surface corresponding to that tree.

Therefore, a better strategy is to grow a very large tree T_0, and then prune it back in order to obtain a subtree. How do we determine the best way to prune the tree? Intuitively, our goal is to select a subtree that leads to the lowest test error rate. Given a subtree, we can estimate its test error using cross-validation or the validation set approach. However, estimating the cross-validation error for every possible subtree would be too cumbersome, since there is an extremely large number of possible subtrees. Instead, we need a way to select a small set of subtrees for consideration.

Cost complexity pruning, also known as weakest link pruning, gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.

Algorithm 8.1 Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
   (a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
   (b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
   Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.

For each value of α there corresponds a subtree T ⊂ T_0 such that

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|    (8.4)

is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, R_m is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and ŷ_Rm is the predicted response associated with R_m, that is, the mean of the training observations in R_m. The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data. When α = 0, then the subtree T will simply equal T_0, because then (8.4) just measures the training error. However, as α increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree. Equation 8.4 is reminiscent of the lasso (6.7) from Chapter 6, in which a similar formulation was used in order to control the complexity of a linear model.

It turns out that as we increase α from zero in (8.4), branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy. We can select a value of α using a validation set or using cross-validation. We then return to the full data set and obtain the subtree corresponding to α. This process is summarized in Algorithm 8.1.
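With the tree package, Steps 2-4 of Algorithm 8.1 can be sketched with cv.tree() and prune.tree(). Here tree.hitters stands for a large tree grown beforehand (for example the one sketched earlier), and the choice of six folds mirrors the example that follows; the object names are ours.

> cv.hitters=cv.tree(tree.hitters,K=6)            # cost complexity sequence plus K-fold CV
> best.size=cv.hitters$size[which.min(cv.hitters$dev)]
> prune.hitters=prune.tree(tree.hitters,best=best.size)
> plot(prune.hitters); text(prune.hitters,pretty=0)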

FIGURE 8.4. Regression tree analysis for the Hitters data. The unpruned tree that results from top-down greedy splitting on the training data is shown.

Figures 8.4 and 8.5 display the results of fitting and pruning a regression tree on the Hitters data, using nine of the features. First, we randomly divided the data set in half, yielding 132 observations in the training set and 131 observations in the test set. We then built a large regression tree on the training data and varied α in (8.4) in order to create subtrees with different numbers of terminal nodes. Finally, we performed six-fold cross-validation in order to estimate the cross-validated MSE of the trees as a function of α. (We chose to perform six-fold cross-validation because 132 is an exact multiple of six.) The unpruned regression tree is shown in Figure 8.4. The green curve in Figure 8.5 shows the CV error as a function of the number of leaves (although CV error is computed as a function of α, it is convenient to display the result as a function of |T|, the number of leaves; this is based on the relationship between α and |T| in the original tree grown to all the training data), while the orange curve indicates the test error. Also shown are standard error bars around the estimated errors. For reference, the training error curve is shown in black. The CV error is a reasonable approximation of the test error: the CV error takes on its

FIGURE 8.5. Regression tree analysis for the Hitters data. The training, cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree. Standard error bands are displayed. The minimum cross-validation error occurs at a tree size of three.

minimum for a three-node tree, while the test error also dips down at the three-node tree (though it takes on its lowest value at the ten-node tree). The pruned tree containing three terminal nodes is shown in Figure 8.1.

8.1.2 Classification Trees

A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. Recall that for a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node. In contrast, for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.

The task of growing a classification tree is quite similar to the task of growing a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class:

E = 1 - \max_k(\hat{p}_{mk}).    (8.5)

Here p̂_mk represents the proportion of training observations in the mth region that are from the kth class. However, it turns out that classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable.

The Gini index is defined by

G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}),    (8.6)

a measure of total variance across the K classes. It is not hard to see that the Gini index takes on a small value if all of the p̂_mk's are close to zero or one. For this reason the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.

An alternative to the Gini index is cross-entropy, given by

D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.    (8.7)

Since 0 ≤ p̂_mk ≤ 1, it follows that 0 ≤ −p̂_mk log p̂_mk. One can show that the cross-entropy will take on a value near zero if the p̂_mk's are all near zero or near one. Therefore, like the Gini index, the cross-entropy will take on a small value if the mth node is pure. In fact, it turns out that the Gini index and the cross-entropy are quite similar numerically.

When building a classification tree, either the Gini index or the cross-entropy are typically used to evaluate the quality of a particular split, since these two approaches are more sensitive to node purity than is the classification error rate. Any of these three approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
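The two impurity measures are simple functions of the class proportions in a node; here is a quick sketch, where gini and entropy are our own helper names.

> gini=function(p) sum(p*(1-p))                       # (8.6)
> entropy=function(p) -sum(ifelse(p>0,p*log(p),0))    # (8.7), treating 0*log(0) as 0
> gini(c(.5,.5)); gini(c(.9,.1))                      # 0.5 versus 0.18: the purer node scores lower
> entropy(c(.5,.5)); entropy(c(.9,.1))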

Figure 8.6 shows an example on the Heart data set. These data contain a binary outcome HD for 303 patients who presented with chest pain. An outcome value of Yes indicates the presence of heart disease based on an angiographic test, while No means no heart disease. There are 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements. Cross-validation results in a tree with six terminal nodes.

In our discussion thus far, we have assumed that the predictor variables take on continuous values. However, decision trees can be constructed even in the presence of qualitative predictor variables. For instance, in the Heart data, some of the predictors, such as Sex, Thal (Thallium stress test), and ChestPain, are qualitative. Therefore, a split on one of these variables amounts to assigning some of the qualitative values to one branch and assigning the remaining to the other branch.

FIGURE 8.6. Heart data. Top: The unpruned tree. Bottom Left: Cross-validation error, training, and test error, for different sizes of the pruned tree. Bottom Right: The pruned tree corresponding to the minimal cross-validation error.

In Figure 8.6, some of the internal nodes correspond to splitting qualitative variables. For instance, the top internal node corresponds to splitting Thal. The text Thal:a indicates that the left-hand branch coming out of that node consists of observations with the first value of the Thal variable (normal), and the right-hand node consists of the remaining observations (fixed or reversible defects). The text ChestPain:bc two splits down the tree on the left indicates that the left-hand branch coming out of that node consists of observations with the second and third values of the ChestPain variable, where the possible values are typical angina, atypical angina, non-anginal pain, and asymptomatic.

Figure 8.6 has a surprising characteristic: some of the splits yield two terminal nodes that have the same predicted value. For instance, consider the split RestECG<1 near the bottom right of the unpruned tree. Regardless of the value of RestECG, a response value of Yes is predicted for those observations. Why, then, is the split performed at all? The split is performed because it leads to increased node purity. That is, all 9 of the observations corresponding to the right-hand leaf have a response value of Yes, whereas 7/11 of those corresponding to the left-hand leaf have a response value of Yes. Why is node purity important? Suppose that we have a test observation that belongs to the region given by that right-hand leaf. Then we can be pretty certain that its response value is Yes. In contrast, if a test observation belongs to the region given by the left-hand leaf, then its response value is probably Yes, but we are much less certain. Even though the split RestECG<1 does not reduce the classification error, it improves the Gini index and the cross-entropy, which are more sensitive to node purity.

8.1.3 Trees Versus Linear Models

Regression and classification trees have a very different flavor from the more classical approaches for regression and classification presented in Chapters 3 and 4. In particular, linear regression assumes a model of the form

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j,    (8.8)

whereas regression trees assume a model of the form

f(X) = \sum_{m=1}^{M} c_m \cdot 1_{(X \in R_m)},    (8.9)

where R1, ..., RM represent a partition of feature space, as in Figure 8.3.

Which model is better? It depends on the problem at hand. If the relationship between the features and the response is well approximated by a linear model as in (8.8), then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure. If instead there is a highly non-linear and complex relationship between the features and the response as indicated by model (8.9), then decision trees may outperform classical approaches. An illustrative example is displayed in Figure 8.7. The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach (Chapter 5). Of course, other considerations beyond simply test error may come into play in selecting a statistical learning method; for instance, in certain settings, prediction using a tree may be preferred for the sake of interpretability and visualization.

FIGURE 8.7. Top Row: A two-dimensional classification example in which the true decision boundary is linear, and is indicated by the shaded regions. A classical approach that assumes a linear boundary (left) will outperform a decision tree that performs splits parallel to the axes (right). Bottom Row: Here the true decision boundary is non-linear. Here a linear model is unable to capture the true decision boundary (left), whereas a decision tree is successful (right).

8.1.4 Advantages and Disadvantages of Trees

Decision trees for regression and classification have a number of advantages over the more classical approaches seen in Chapters 3 and 4:

- Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!

- Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.

- Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).

- Trees can easily handle qualitative predictors without the need to create dummy variables.

- Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.

However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved. We introduce these concepts in the next section.

8.2 Bagging, Random Forests, Boosting

Bagging, random forests, and boosting use trees as building blocks to construct more powerful prediction models.

8.2.1 Bagging

The bootstrap, introduced in Chapter 5, is an extremely powerful idea. It is used in many situations in which it is hard or even impossible to directly compute the standard deviation of a quantity of interest. We see here that the bootstrap can be used in a completely different context, in order to improve statistical learning methods such as decision trees.

The decision trees discussed in Section 8.1 suffer from high variance. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance, if the ratio of n to p is moderately large. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.

Recall that given a set of n independent observations Z_1, ..., Z_n, each with variance \sigma^2, the variance of the mean \bar{Z} of the observations is given by \sigma^2/n. In other words, averaging a set of observations reduces variance. Hence a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. In other words, we could calculate \hat{f}^1(x), \hat{f}^2(x), ..., \hat{f}^B(x) using B separate training sets, and average them in order to obtain a single low-variance statistical learning model, given by

\hat{f}_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x).

Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated

samples from the (single) training data set. In this approach we generate B different bootstrapped training data sets. We then train our method on the bth bootstrapped training set in order to get \hat{f}^{*b}(x), and finally average all the predictions, to obtain

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).

This is called bagging.

While bagging can improve predictions for many regression methods, it is particularly useful for decision trees. To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. Hence each individual tree has high variance, but low bias. Averaging these B trees reduces the variance. Bagging has been demonstrated to give impressive improvements in accuracy by combining together hundreds or even thousands of trees into a single procedure.

Thus far, we have described the bagging procedure in the regression context, to predict a quantitative outcome Y. How can bagging be extended to a classification problem where Y is qualitative? In that situation, there are a few possible approaches, but the simplest is as follows. For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

Figure 8.8 shows the results from bagging trees on the Heart data. The test error rate is shown as a function of B, the number of trees constructed using bootstrapped training data sets. We see that the bagging test error rate is slightly lower in this case than the test error rate obtained from a single tree. The number of trees B is not a critical parameter with bagging; using a very large value of B will not lead to overfitting. In practice we use a value of B sufficiently large that the error has settled down. Using B = 100 is sufficient to achieve good performance in this example.

Out-of-Bag Error Estimation

It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach. Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations.³ The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in

³ This relates to Exercise 2 of Chapter 5.

FIGURE 8.8. Bagging and random forest results for the Heart data. The test error (black and orange) is shown as a function of B, the number of bootstrapped training sets used. Random forests were applied with m = √p. The dashed line indicates the test error resulting from a single classification tree. The green and blue traces show the OOB error, which in this case is considerably lower.

which that observation was OOB. This will yield around B/3 predictions for the ith observation. In order to obtain a single prediction for the ith observation, we can average these predicted responses (if regression is the goal) or can take a majority vote (if classification is the goal). This leads to a single OOB prediction for the ith observation. An OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation. Figure 8.8 displays the OOB error on the Heart data. It can be shown that with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error. The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.

Variable Importance Measures

As we have discussed, bagging typically results in improved accuracy over prediction using a single tree. Unfortunately, however, it can be difficult to interpret the resulting model. Recall that one of the advantages of decision

FIGURE 8.9. A variable importance plot for the Heart data. Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum.

trees is the attractive and easily interpreted diagram that results, such as the one displayed in Figure 8.1. However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree, and it is no longer clear which variables are most important to the procedure. Thus, bagging improves prediction accuracy at the expense of interpretability.

Although the collection of bagged trees is much more difficult to interpret than a single tree, one can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees). In the case of bagging regression trees, we can record the total amount that the RSS (8.1) is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor. Similarly, in the context of bagging classification trees, we can add up the total amount that the Gini index (8.6) is decreased by splits over a given predictor, averaged over all B trees.

A graphical representation of the variable importances in the Heart data is shown in Figure 8.9. We see the mean decrease in Gini index for each variable, relative to the largest. The variables with the largest mean decrease in Gini index are Thal, Ca, and ChestPain.
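Before turning to random forests, here is a bare-bones sketch of the bagging average \hat{f}_{bag} described above, written out by hand rather than with the randomForest package used later in the lab; the choice of B = 100 and the use of the Boston data are illustrative, not taken from the book.

library(MASS)   # for the Boston data used later in this chapter's lab
library(tree)
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)
B <- 100
preds <- matrix(NA, nrow(Boston) - length(train), B)
for (b in 1:B) {
  boot <- sample(train, replace = TRUE)                  # bootstrap sample of the training indices
  fit  <- tree(medv ~ ., data = Boston, subset = boot)   # an unpruned tree on that sample
  preds[, b] <- predict(fit, newdata = Boston[-train, ]) # its predictions on the held-out data
}
yhat.bag <- rowMeans(preds)                              # the average \hat{f}_bag(x) over the B trees
mean((yhat.bag - Boston$medv[-train])^2)                 # test-set MSE of the bagged trees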

8.2.2 Random Forests

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p — that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors (4 out of the 13 for the Heart data).

In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors. This may sound crazy, but it has a clever rationale. Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.

Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.

The main difference between bagging and random forests is the choice of predictor subset size m. For instance, if a random forest is built using m = p, then this amounts simply to bagging. On the Heart data, random forests using m = √p leads to a reduction in both test error and OOB error over bagging (Figure 8.8).

Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors. We applied random forests to a high-dimensional biological data set consisting of expression measurements of 4,718 genes measured on tissue samples from 349 patients. There are around 20,000 genes in humans, and individual genes have different levels of activity, or expression, in particular cells, tissues, and biological conditions. In this data set, each of the patient samples has a qualitative label with 15 different levels: either normal or 1 of 14 different types of cancer. Our goal was to use random forests to predict cancer type based on the 500 genes that have the largest variance in the training set.
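The m ≈ √p recipe is easy to try out with the randomForest package; the toy comparison below uses R's built-in iris data (not the gene-expression data analyzed in the text) simply to contrast bagging (mtry = p) with a random forest using m = √p.

library(randomForest)
set.seed(1)
p <- ncol(iris) - 1                     # four predictors, so sqrt(p) = 2
train <- sample(nrow(iris), 100)
bag.fit <- randomForest(Species ~ ., data = iris, subset = train, mtry = p)
rf.fit  <- randomForest(Species ~ ., data = iris, subset = train,
                        mtry = floor(sqrt(p)))
mean(predict(bag.fit, iris[-train, ]) != iris$Species[-train])  # bagging test error
mean(predict(rf.fit,  iris[-train, ]) != iris$Species[-train])  # random forest test error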

FIGURE 8.10. Results from random forests for the 15-class gene expression data set with p = 500 predictors. The test error is displayed as a function of the number of trees. Each colored line corresponds to a different value of m, the number of predictors available for splitting at each interior tree node. Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7 %.

We randomly divided the observations into a training and a test set, and applied random forests to the training set for three different values of the number of splitting variables m. The results are shown in Figure 8.10. The error rate of a single tree is 45.7 %, and the null rate is 75.4 %.⁴ We see that using 400 trees is sufficient to give good performance, and that the choice m = √p gave a small improvement in test error over bagging (m = p) in this example. As with bagging, random forests will not overfit if we increase B, so in practice we use a value of B sufficiently large for the error rate to have settled down.

8.2.3 Boosting

We now discuss boosting, yet another approach for improving the predictions resulting from a decision tree. Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Here we restrict our discussion of boosting to the context of decision trees.

Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent

⁴ The null rate results from simply classifying each observation to the dominant class overall, which is in this case the normal class.

of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

Algorithm 8.2 Boosting for Regression Trees

1. Set \hat{f}(x) = 0 and r_i = y_i for all i in the training set.

2. For b = 1, 2, ..., B, repeat:

   (a) Fit a tree \hat{f}^b with d splits (d + 1 terminal nodes) to the training data (X, r).

   (b) Update \hat{f} by adding in a shrunken version of the new tree:

       \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x).    (8.10)

   (c) Update the residuals,

       r_i \leftarrow r_i - \lambda \hat{f}^b(x_i).    (8.11)

3. Output the boosted model,

       \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x).    (8.12)

Consider first the regression setting. Like bagging, boosting involves combining a large number of decision trees, \hat{f}^1, ..., \hat{f}^B. Boosting is described in Algorithm 8.2.

What is the idea behind this procedure? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, we slowly improve \hat{f} in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and different shaped trees to attack the residuals. In general, statistical learning approaches that learn slowly tend to perform well. Note that in boosting, unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.
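Algorithm 8.2 can be transcribed almost line by line in R. The sketch below is not the book's implementation (the lab later uses the gbm package); it boosts depth-one trees on the Boston data, and the values of λ, B, and d, as well as the use of rpart() to grow the small trees, are illustrative choices.

library(MASS)     # Boston data
library(rpart)    # rpart() makes it easy to request trees with a fixed depth
set.seed(1)
train  <- sample(nrow(Boston), nrow(Boston) / 2)
x.tr   <- Boston[train,  setdiff(names(Boston), "medv")]
x.te   <- Boston[-train, setdiff(names(Boston), "medv")]
lambda <- 0.01; B <- 1000; d <- 1
r    <- Boston$medv[train]            # step 1: \hat{f}(x) = 0, so r_i = y_i
yhat <- rep(0, nrow(x.te))            # the boosted prediction on the test set
for (b in 1:B) {
  fb <- rpart(r ~ ., data = data.frame(x.tr, r = r),
              control = rpart.control(maxdepth = d, cp = 0))   # step 2(a): a small tree on (X, r)
  r    <- r    - lambda * predict(fb, newdata = x.tr)          # step 2(c), as in (8.11)
  yhat <- yhat + lambda * predict(fb, newdata = x.te)          # accumulating the sum in (8.12)
}
mean((yhat - Boston$medv[-train])^2)  # test MSE of the boosted stumps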

FIGURE 8.11. Results from performing boosting and random forests on the 15-class gene expression data set in order to predict cancer versus normal. The test error is displayed as a function of the number of trees. For the two boosted models, λ = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both outperform the random forest, although the standard errors are around 0.02, making none of these differences significant. The test error rate for a single tree is 24 %.

We have just described the process of boosting regression trees. Boosting classification trees proceeds in a similar but slightly more complex way, and the details are omitted here.

Boosting has three tuning parameters:

1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.

2. The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small λ can require using a very large value of B in order to achieve good performance.

3. The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.

In Figure 8.11, we applied boosting to the 15-class cancer gene expression data set, in order to develop a classifier that can distinguish the normal class from the 14 cancer classes. We display the test error as a function of the total number of trees and the interaction depth d. We see that simple

stumps with an interaction depth of one perform well if enough of them are included. This model outperforms the depth-two model, and both outperform a random forest. This highlights one difference between boosting and random forests: in boosting, because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient. Using smaller trees can aid in interpretability as well; for instance, using stumps leads to an additive model.

8.3 Lab: Decision Trees

8.3.1 Fitting Classification Trees

The tree library is used to construct classification and regression trees.

> library(tree)

We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable. We use the ifelse() function to create a variable, called High, which takes on a value of Yes if the Sales variable exceeds 8, and takes on a value of No otherwise.

> library(ISLR)
> attach(Carseats)
> High=ifelse(Sales<=8,"No","Yes")

Finally, we use the data.frame() function to merge High with the rest of the Carseats data.

> Carseats=data.frame(Carseats,High)

We now use the tree() function to fit a classification tree in order to predict High using all variables but Sales. The syntax of the tree() function is quite similar to that of the lm() function.

> tree.carseats=tree(High~.-Sales,Carseats)

The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate.

> summary(tree.carseats)
Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"
[5] "Population"  "Advertising" "Age"         "US"
Number of terminal nodes: 27
Residual mean deviance: 0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400

We see that the training error rate is 9 %. For classification trees, the deviance reported in the output of summary() is given by

-2 \sum_{m} \sum_{k} n_{mk} \log \hat{p}_{mk},

where n_{mk} is the number of observations in the mth terminal node that belong to the kth class. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by n − |T_0|, which in this case is 400 − 27 = 373.

One of the most attractive properties of trees is that they can be graphically displayed. We use the plot() function to display the tree structure, and the text() function to display the node labels. The argument pretty=0 instructs R to include the category names for any qualitative predictors, rather than simply displaying a letter for each category.

> plot(tree.carseats)
> text(tree.carseats,pretty=0)

The most important indicator of Sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations.

If we just type the name of the tree object, R prints output corresponding to each branch of the tree. R displays the split criterion (e.g. Price<92.5), the number of observations in that branch, the deviance, the overall prediction for the branch (Yes or No), and the fraction of observations in that branch that take on values of Yes and No. Branches that lead to terminal nodes are indicated using asterisks.

> tree.carseats
node), split, n, deviance, yval, (yprob)
      * denotes terminal node
  1) root  No (  )
    2) ShelveLoc: Bad,Medium  No (  )
      4) Price <  Yes (  )
        8) Income <  No (  )

In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. The predict() function can be used for this purpose. In the case of a classification tree, the argument type="class" instructs R to return the actual class prediction. This approach leads to correct predictions for around 71.5 % of the locations in the test data set.

> set.seed(2)
> train=sample(1:nrow(Carseats), 200)
> Carseats.test=Carseats[-train,]
> High.test=High[-train]

> tree.carseats=tree(High~.-Sales,Carseats,subset=train)
> tree.pred=predict(tree.carseats,Carseats.test,type="class")
> table(tree.pred,High.test)
         High.test
tree.pred  No Yes
       No  86
      Yes      57
> (86+57)/200
[1] 0.715

Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration. We use the argument FUN=prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance. The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to α in (8.4)).

> set.seed(3)
> cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass)
> names(cv.carseats)
[1] "size"   "dev"    "k"      "method"
> cv.carseats
$size
[1]
$dev
[1]
$k
[1] -Inf
[8]
$method
[1] "misclass"
attr(,"class")
[1] "prune"         "tree.sequence"

Note that, despite the name, dev corresponds to the cross-validation error rate in this instance. The tree with 9 terminal nodes results in the lowest cross-validation error rate, with 50 cross-validation errors. We plot the error rate as a function of both size and k.

> par(mfrow=c(1,2))
> plot(cv.carseats$size,cv.carseats$dev,type="b")
> plot(cv.carseats$k,cv.carseats$dev,type="b")
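As a small aside (not in the book's text), the size that minimizes the cross-validation error can also be extracted from cv.carseats programmatically rather than read off the plot; this uses only the size and dev components returned by cv.tree() above.

> best.size=cv.carseats$size[which.min(cv.carseats$dev)]   # should give 9 for the run above
> best.size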

We now apply the prune.misclass() function in order to prune the tree to obtain the nine-node tree.

> prune.carseats=prune.misclass(tree.carseats,best=9)
> plot(prune.carseats)
> text(prune.carseats,pretty=0)

How well does this pruned tree perform on the test data set? Once again, we apply the predict() function.

> tree.pred=predict(prune.carseats,Carseats.test,type="class")
> table(tree.pred,High.test)
         High.test
tree.pred  No Yes
       No  94
      Yes      60
> (94+60)/200
[1] 0.77

Now 77 % of the test observations are correctly classified, so not only has the pruning process produced a more interpretable tree, but it has also improved the classification accuracy.

If we increase the value of best, we obtain a larger pruned tree with lower classification accuracy:

> prune.carseats=prune.misclass(tree.carseats,best=15)
> plot(prune.carseats)
> text(prune.carseats,pretty=0)
> tree.pred=predict(prune.carseats,Carseats.test,type="class")
> table(tree.pred,High.test)
         High.test
tree.pred  No Yes
       No  86
      Yes      62
> (86+62)/200
[1] 0.74

8.3.2 Fitting Regression Trees

Here we fit a regression tree to the Boston data set. First, we create a training set, and fit the tree to the training data.

> library(MASS)
> set.seed(1)
> train=sample(1:nrow(Boston), nrow(Boston)/2)
> tree.boston=tree(medv~.,Boston,subset=train)
> summary(tree.boston)
Regression tree:
tree(formula = medv ~ ., data = Boston, subset = train)
Variables actually used in tree construction:
[1] "lstat" "rm"    "dis"
Number of terminal nodes: 8

Residual mean deviance: 12.65 = 3099 / 245
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

Notice that the output of summary() indicates that only three of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree. We now plot the tree.

> plot(tree.boston)
> text(tree.boston,pretty=0)

The variable lstat measures the percentage of individuals with lower socioeconomic status. The tree indicates that lower values of lstat correspond to more expensive houses. The tree predicts a median house price of $46,400 for larger homes in suburbs in which residents have high socioeconomic status (rm>=7.437 and lstat<9.715).

Now we use the cv.tree() function to see whether pruning the tree will improve performance.

> cv.boston=cv.tree(tree.boston)
> plot(cv.boston$size,cv.boston$dev,type='b')

In this case, the most complex tree is selected by cross-validation. However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function:

> prune.boston=prune.tree(tree.boston,best=5)
> plot(prune.boston)
> text(prune.boston,pretty=0)

In keeping with the cross-validation results, we use the unpruned tree to make predictions on the test set.

> yhat=predict(tree.boston,newdata=Boston[-train,])
> boston.test=Boston[-train,"medv"]
> plot(yhat,boston.test)
> abline(0,1)
> mean((yhat-boston.test)^2)
[1] 25.05

In other words, the test set MSE associated with the regression tree is 25.05. The square root of the MSE is therefore around 5.005, indicating that this model leads to test predictions that are within around $5,005 of the true median home value for the suburb.

8.3.3 Bagging and Random Forests

Here we apply bagging and random forests to the Boston data, using the randomForest package in R. The exact results obtained in this section may depend on the version of R and the version of the randomForest package

installed on your computer. Recall that bagging is simply a special case of a random forest with m = p. Therefore, the randomForest() function can be used to perform both random forests and bagging. We perform bagging as follows:

> library(randomForest)
> set.seed(1)
> bag.boston=randomForest(medv~.,data=Boston,subset=train,
   mtry=13,importance=TRUE)
> bag.boston
Call:
 randomForest(formula = medv ~ ., data = Boston, mtry = 13,
     importance = TRUE, subset = train)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 13
          Mean of squared residuals:
                    % Var explained:

The argument mtry=13 indicates that all 13 predictors should be considered for each split of the tree — in other words, that bagging should be done. How well does this bagged model perform on the test set?

> yhat.bag=predict(bag.boston,newdata=Boston[-train,])
> plot(yhat.bag,boston.test)
> abline(0,1)
> mean((yhat.bag-boston.test)^2)
[1] 13.16

The test set MSE associated with the bagged regression tree is 13.16, almost half that obtained using an optimally-pruned single tree. We could change the number of trees grown by randomForest() using the ntree argument:

> bag.boston=randomForest(medv~.,data=Boston,subset=train,
   mtry=13,ntree=25)
> yhat.bag=predict(bag.boston,newdata=Boston[-train,])
> mean((yhat.bag-boston.test)^2)
[1]

Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the mtry argument. By default, randomForest() uses p/3 variables when building a random forest of regression trees, and √p variables when building a random forest of classification trees. Here we use mtry=6.

> set.seed(1)
> rf.boston=randomForest(medv~.,data=Boston,subset=train,
   mtry=6,importance=TRUE)
> yhat.rf=predict(rf.boston,newdata=Boston[-train,])
> mean((yhat.rf-boston.test)^2)
[1] 11.31

The test set MSE is 11.31; this indicates that random forests yielded an improvement over bagging in this case.

Using the importance() function, we can view the importance of each variable.

> importance(rf.boston)
        %IncMSE IncNodePurity
crim
zn
indus
chas
nox
rm
age
dis
rad
tax
ptratio
black
lstat

Two measures of variable importance are reported. The former is based upon the mean decrease of accuracy in predictions on the out of bag samples when a given variable is excluded from the model. The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees (this was plotted in Figure 8.9). In the case of regression trees, the node impurity is measured by the training RSS, and for classification trees by the deviance. Plots of these importance measures can be produced using the varImpPlot() function.

> varImpPlot(rf.boston)

The results indicate that across all of the trees considered in the random forest, the wealth level of the community (lstat) and the house size (rm) are by far the two most important variables.

8.3.4 Boosting

Here we use the gbm package, and within it the gbm() function, to fit boosted regression trees to the Boston data set. We run gbm() with the option distribution="gaussian" since this is a regression problem; if it were a binary classification problem, we would use distribution="bernoulli". The argument n.trees=5000 indicates that we want 5000 trees, and the option interaction.depth=4 limits the depth of each tree.

> library(gbm)
> set.seed(1)
> boost.boston=gbm(medv~.,data=Boston[train,],distribution=
   "gaussian",n.trees=5000,interaction.depth=4)

The summary() function produces a relative influence plot and also outputs the relative influence statistics.

> summary(boost.boston)
        var  rel.inf
1     lstat
2        rm
3       dis
4      crim
5       nox
6   ptratio
7     black
8       age
9       tax
10    indus
11     chas
12      rad
13       zn

We see that lstat and rm are by far the most important variables. We can also produce partial dependence plots for these two variables. These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. In this case, as we might expect, median house prices are increasing with rm and decreasing with lstat.

> par(mfrow=c(1,2))
> plot(boost.boston,i="rm")
> plot(boost.boston,i="lstat")

We now use the boosted model to predict medv on the test set:

> yhat.boost=predict(boost.boston,newdata=Boston[-train,],
   n.trees=5000)
> mean((yhat.boost-boston.test)^2)
[1] 11.8

The test MSE obtained is 11.8; similar to the test MSE for random forests and superior to that for bagging. If we want to, we can perform boosting with a different value of the shrinkage parameter λ in (8.10). The default value is 0.001, but this is easily modified. Here we take λ = 0.2.

> boost.boston=gbm(medv~.,data=Boston[train,],distribution=
   "gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,
   verbose=F)
> yhat.boost=predict(boost.boston,newdata=Boston[-train,],
   n.trees=5000)
> mean((yhat.boost-boston.test)^2)
[1] 11.5

In this case, using λ = 0.2 leads to a slightly lower test MSE than λ = 0.001.
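A possible extension of this lab (not in the book): gbm() can also perform its own cross-validation via the cv.folds argument, after which gbm.perf() returns the number of trees with the lowest CV error, in the spirit of selecting B by cross-validation as discussed in Section 8.2.3. The parameter values below are illustrative.

> set.seed(1)
> boost.cv=gbm(medv~.,data=Boston[train,],distribution="gaussian",
   n.trees=5000,interaction.depth=4,shrinkage=0.01,cv.folds=5)
> best.B=gbm.perf(boost.cv,method="cv")   # CV-selected number of trees
> yhat.cv=predict(boost.cv,newdata=Boston[-train,],n.trees=best.B)
> mean((yhat.cv-boston.test)^2)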

347 Tree-Based Methds 8.4 Exercises Cnceptual 1. Draw an example (f yur wn inventin) f a partitin f twdimensinal feature space that culd result frm recursive binary splitting. Yur example shuld cntain at least six regins. Draw a decisin tree crrespnding t this partitin. Be sure t label all aspects f yur figures, including the regins R 1,R 2,...,thecutpints t 1,t 2,..., and s frth. Hint: Yur result shuld lk smething like Figures 8.1 and It is mentined in Sectin that bsting using depth-ne trees (r stumps) leadstanadditive mdel: that is, a mdel f the frm f(x) = p f j (X j ). j=1 Explain why this is the case. Yu can begin with (8.12) in Algrithm Cnsider the Gini index, classificatin errr, and crss-entrpy in a simple classificatin setting with tw classes. Create a single plt that displays each f these quantities as a functin f ˆp m1.thexaxis shuld display ˆp m1, ranging frm 0 t 1, and the y-axis shuld display the value f the Gini index, classificatin errr, and entrpy. Hint: In a setting with tw classes, ˆp m1 =1 ˆp m2. Yu culd make this plt by hand, but it will be much easier t make in R. 4. This questin relates t the plts in Figure (a) Sketch the tree crrespnding t the partitin f the predictr space illustrated in the left-hand panel f Figure The numbers inside the bxes indicate the mean f Y within each regin. (b) Create a diagram similar t the left-hand panel f Figure 8.12, using the tree illustrated in the right-hand panel f the same figure. Yu shuld divide up the predictr space int the crrect regins, and indicate the mean fr each regin. 5. Suppse we prduce ten btstrapped samples frm a data set cntaining red and green classes. We then apply a classificatin tree t each btstrapped sample and, fr a specific value f X, prduce 10 estimates f P (Class is RedX): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.

348 8.4 Exercises 333 X2 < 1 15 X X1 < X X2 < 2 X1 < FIGURE Left: A partitin f the predictr space crrespnding t Exercise 4a. Right: A tree crrespnding t Exercise 4b. There are tw cmmn ways t cmbine these results tgether int a single class predictin. One is the majrity vte apprach discussed in this chapter. The secnd apprach is t classify based n the average prbability. In this example, what is the final classificatin under each f these tw appraches? 6. Prvide a detailed explanatin f the algrithm that is used t fit a regressin tree. Applied 7. In the lab, we applied randm frests t the Bstn data using mtry=6 and using ntree=25 and ntree=500. Create a plt displaying the test errr resulting frm randm frests n this data set fr a mre cmprehensive range f values fr mtry and ntree. Yu can mdel yur plt after Figure Describe the results btained. 8. In the lab, a classificatintreewas applied t the Carseats data set after cnverting Sales int a qualitative respnse variable. Nw we will seek t predict Sales using regressin trees and related appraches, treating the respnse as a quantitative variable. (a) Split the data set int a training set and a test set. (b) Fit a regressin tree t the training set. Plt the tree, and interpret the results. What test MSE d yu btain? (c) Use crss-validatin in rder t determine the ptimal level f tree cmplexity. Des pruning the tree imprve the test MSE? (d) Use the bagging apprach in rder t analyze this data. What test MSE d yu btain? Use the imprtance() functin t determine which variables are mst imprtant.

349 Tree-Based Methds (e) Use randm frests t analyze this data. What test MSE d yu btain? Use the imprtance() functin t determine which variables are mst imprtant. Describe the effect f m,thenumberf variables cnsidered at each split, n the errr rate btained. 9. This prblem invlves the OJ data set which is part f the ISLR package. (a) Create a training set cntaining a randm sample f 800 bservatins, and a test set cntaining the remaining bservatins. (b) Fit a tree t the training data, with Purchase as the respnse and the ther variables except fr Buy as predictrs. Use the summary() functin t prduce summary statistics abut the tree, and describe the results btained. What is the training errr rate? Hw many terminal ndes des the tree have? (c) Type in the name f the tree bject in rder t get a detailed text utput. Pick ne f the terminal ndes, and interpret the infrmatin displayed. (d) Create a plt f the tree, and interpret the results. (e) Predict the respnse n the test data, and prduce a cnfusin matrix cmparing the test labels t the predicted test labels. What is the test errr rate? (f) Apply the cv.tree() functin t the training set in rder t determine the ptimal tree size. (g) Prduce a plt with tree size n the x-axis and crss-validated classificatin errr rate n the y-axis. (h) Which tree size crrespnds t the lwest crss-validated classificatin errr rate? (i) Prduce a pruned tree crrespnding t the ptimal tree size btained using crss-validatin. If crss-validatin des nt lead t selectin f a pruned tree, then create a pruned tree with five terminal ndes. (j) Cmpare the training errr rates between the pruned and unpruned trees. Which is higher? (k) Cmpare the test errr rates between the pruned and unpruned trees. Which is higher? 10. We nw use bsting t predict Salary in the Hitters data set. (a) Remve the bservatins fr whm the salary infrmatin is unknwn, and then lg-transfrm the salaries.

350 8.4 Exercises 335 (b) Create a training set cnsisting f the first 200 bservatins, and a test set cnsisting f the remaining bservatins. (c) Perfrm bsting n the training set with 1,000 trees fr a range f values f the shrinkage parameter λ. Prduce a plt with different shrinkage values n the x-axis and the crrespnding training set MSE n the y-axis. (d) Prduce a plt with different shrinkage values n the x-axis and the crrespnding test set MSE n the y-axis. (e) Cmpare the test MSE f bsting t the test MSE that results frm applying tw f the regressin appraches seen in Chapters 3 and 6. (f) Which variables appear t be the mst imprtant predictrs in the bsted mdel? (g) Nw apply bagging t the training set. What is the test set MSE fr this apprach? 11. This questin uses the Caravan data set. (a) Create a training set cnsisting f the first 1,000 bservatins, and a test set cnsisting f the remaining bservatins. (b) Fit a bsting mdel t the training set with Purchase as the respnse and the ther variables as predictrs. Use 1,000 trees, and a shrinkage value f Which predictrs appear t be the mst imprtant? (c) Use the bsting mdel t predict the respnse n the test data. Predict that a persn will make a purchase if the estimated prbability f purchase is greater than 20 %. Frm a cnfusin matrix. What fractin f the peple predicted t make a purchase d in fact make ne? Hw des this cmpare with the results btained frm applying KNN r lgistic regressin t this data set? 12. Apply bsting, bagging, and randm frests t a data set f yur chice. Be sure t fit the mdels n a training set and t evaluate their perfrmance n a test set. Hw accurate are the results cmpared t simple methds like linear r lgistic regressin? Which f these appraches yields the best perfrmance?


352 9 Supprt Vectr Machines In this chapter, we discuss the supprt vectr machine (SVM), an apprach fr classificatin that was develped in the cmputer science cmmunity in the 1990s and that has grwn in ppularity since then. SVMs have been shwn t perfrm well in a variety f settings, and are ften cnsidered ne f the best ut f the bx classifiers. The supprt vectr machine is a generalizatin f a simple and intuitive classifier called the maximal margin classifier, which we intrduce in Sectin 9.1. Thugh it is elegant and simple, we will see that this classifier unfrtunately cannt be applied t mst data sets, since it requires that the classes be separable by a linear bundary. In Sectin 9.2, we intrduce the supprt vectr classifier, an extensin f the maximal margin classifier thatcanbeappliedinabraderrangef cases. Sectin 9.3 intrduces the supprt vectr machine, which is a further extensin f the supprt vectr classifier in rder t accmmdate nn-linear class bundaries. Supprt vectr machines are intended fr the binary classificatin setting in which there are tw classes; in Sectin 9.4 we discuss extensins f supprt vectr machines t the case f mre than tw classes. In Sectin 9.5 we discuss the clse cnnectins between supprt vectr machines and ther statistical methds such as lgistic regressin. Peple ften lsely refer t the maximal margin classifier, the supprt vectr classifier, and the supprt vectr machine as supprt vectr machines. T avid cnfusin, we will carefully distinguish between these three ntins in this chapter. G. James et al., An Intrductin t Statistical Learning: with Applicatins in R, Springer Texts in Statistics, DOI / , Springer Science+Business Media New Yrk

353 Supprt Vectr Machines 9.1 Maximal Margin Classifier In this sectin, we define a hyperplane and intrduce the cncept f an ptimal separating hyperplane What Is a Hyperplane? In a p-dimensinal space, a hyperplane is a flat affine subspace f hyperplane dimensin p 1. 1 Fr instance, in tw dimensins, a hyperplane is a flat ne-dimensinal subspace in ther wrds, a line. In three dimensins, a hyperplane is a flat tw-dimensinal subspace that is, a plane. In p>3 dimensins, it can be hard t visualize a hyperplane, but the ntin f a (p 1)-dimensinal flat subspace still applies. The mathematical definitin f a hyperplane is quite simple. In tw dimensins, a hyperplane is defined by the equatin β 0 + β 1 X 1 + β 2 X 2 = 0 (9.1) fr parameters β 0,β 1,andβ 2. When we say that (9.1) defines the hyperplane, we mean that any X =(X 1,X 2 ) T fr which (9.1) hlds is a pint n the hyperplane. Nte that (9.1) is simply the equatin f a line, since indeed in tw dimensins a hyperplane is a line. Equatin 9.1 can be easily extended t the p-dimensinal setting: β 0 + β 1 X 1 + β 2 X β p X p = 0 (9.2) defines a p-dimensinal hyperplane, again in the sense that if a pint X = (X 1,X 2,...,X p ) T in p-dimensinal space (i.e. a vectr f length p) satisfies (9.2), then X lies n the hyperplane. Nw, suppse that X des nt satisfy (9.2); rather, β 0 + β 1 X 1 + β 2 X β p X p > 0. (9.3) Then this tells us that X lies t ne side f the hyperplane. On the ther hand, if β 0 + β 1 X 1 + β 2 X β p X p < 0, (9.4) then X lies n the ther side f the hyperplane. S we can think f the hyperplane as dividing p-dimensinal space int tw halves. One can easily determine n which side f the hyperplane a pint lies by simply calculating the sign f the left hand side f (9.2). A hyperplane in tw-dimensinal space is shwn in Figure The wrd affine indicates that the subspace need nt pass thrugh the rigin.

354 9.1 Maximal Margin Classifier 339 X X 1 FIGURE 9.1. The hyperplane 1+2X 1 +3X 2 =0isshwn.Thebluereginis the set f pints fr which 1+2X 1 +3X 2 > 0, and the purple regin is the set f pints fr which 1+2X 1 +3X 2 < Classificatin Using a Separating Hyperplane Nw suppse that we have a n p data matrix X that cnsists f n training bservatins in p-dimensinal space, x 1 = x 11. x 1p,...,x n = x n1. x np, (9.5) and that these bservatins fall int tw classes that is, y 1,...,y n { 1, 1} where 1 represents ne class and 1 the ther class. We als have a test bservatin, a p-vectr f bserved features x = ( x 1... xp) T.Our gal is t develp a classifier based n the training data that will crrectly classify the test bservatin using its feature measurements. We have seen a number f appraches fr this task, such as linear discriminant analysis and lgistic regressin in Chapter 4, and classificatin trees, bagging, and bsting in Chapter 8. We will nw see a new apprach that is based upn the cncept f a separating hyperplane. Suppse that it is pssible t cnstruct a hyperplane that separates the training bservatins perfectly accrding t their class labels. Examples f three such separating hyperplanes are shwn in the left-hand panel f Figure 9.2. We can label the bservatins frm the blue class as y i =1and separating hyperplane

355 Supprt Vectr Machines X X X X 1 FIGURE 9.2. Left: There are tw classes f bservatins, shwn in blue and in purple, each f which has measurements n tw variables. Three separating hyperplanes, ut f many pssible, are shwn in black. Right: A separating hyperplane is shwn in black. The blue and purple grid indicates the decisin rule made by a classifier based n this separating hyperplane: a test bservatin that falls in the blue prtin f the grid will be assigned t the blue class, and a test bservatin that falls int the purple prtin f the grid will be assigned t the purple class. thse frm the purple class as y i = 1. Then a separating hyperplane has the prperty that β 0 + β 1 x i1 + β 2 x i β p x ip > 0ify i =1, (9.6) and β 0 + β 1 x i1 + β 2 x i β p x ip < 0ify i = 1. (9.7) Equivalently, a separating hyperplane has the prperty that y i (β 0 + β 1 x i1 + β 2 x i β p x ip ) > 0 (9.8) fr all i =1,...,n. If a separating hyperplane exists, we can use it t cnstruct a very natural classifier: a test bservatin is assigned a class depending n which side f the hyperplane it is lcated. The right-hand panel f Figure 9.2 shws an example f such a classifier. That is, we classify the test bservatin x based n the sign f f(x )=β 0 +β 1 x 1+β 2 x β p x p.iff(x ) is psitive, then we assign the test bservatin t class 1, and if f(x ) is negative, then we assign it t class 1. We can als make use f the magnitude f f(x ). If f(x ) is far frm zer, then this means that x lies far frm the hyperplane, and s we can be cnfident abut ur class assignment fr x. On the ther

356 9.1 Maximal Margin Classifier 341 hand, if f(x ) is clse t zer, then x is lcated near the hyperplane, and s we are less certain abut the class assignment fr x. Nt surprisingly, and as we see in Figure 9.2, a classifier that is based n a separating hyperplane leads t a linear decisin bundary The Maximal Margin Classifier In general, if ur data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number f such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up r dwn, r rtated, withut cming int cntact with any f the bservatins. Three pssible separating hyperplanes are shwn in the left-hand panel f Figure 9.2. In rder t cnstruct a classifier based upn a separating hyperplane, we must have a reasnable way t decide which f the infinite pssible separating hyperplanes t use. A natural chice is the maximal margin hyperplane (als knwn as the maximal ptimal separating hyperplane), which is the separating hyperplane that is farthest frm the training bservatins. That is, we can cmpute the (perpendicular) distance frm each training bservatin t a given separating hyperplane; the smallest such distance is the minimal distance frm the bservatins t the hyperplane, and is knwn as the margin. The maximal margin hyperplane ptimal separating hyperplane margin margin hyperplane is the separating hyperplane fr which the margin is largest that is, it is the hyperplane that has the farthest minimum distance t the training bservatins. We can then classify a test bservatin based n which side f the maximal margin hyperplane it lies. This is knwn as the maximal margin classifier. We hpe that a classifier that has a large maximal margin n the training data will als have a large margin n the test data, and hence will classify the test bservatins crrectly. Althugh the maximal margin classifier is ften successful, it can als lead t verfitting when p is large. If β 0,β 1,...,β p are the cefficients f the maximal margin hyperplane, then the maximal margin classifier classifies the test bservatin x based n the sign f f(x )=β 0 + β 1 x 1 + β 2x β px p. Figure 9.3 shws the maximal margin hyperplane n the data set f Figure 9.2. Cmparing the right-hand panel f Figure 9.2 t Figure 9.3, we see that the maximal margin hyperplane shwn in Figure 9.3 des indeed result in a greater minimal distance between the bservatins and the separating hyperplane that is, a larger margin. In a sense, the maximal margin hyperplane represents the mid-line f the widest slab that we can insert between the tw classes. Examining Figure 9.3, we see that three training bservatins are equidistant frm the maximal margin hyperplane and lie alng the dashed lines indicating the width f the margin. These three bservatins are knwn as margin classifier

357 Supprt Vectr Machines X X 1 FIGURE 9.3. There are tw classes f bservatins, shwn in blue and in purple. The maximal margin hyperplane is shwn as a slid line. The margin is the distance frm the slid line t either f the dashed lines. The tw blue pints and the purple pint that lie n the dashed lines are the supprt vectrs, and the distance frm thse pints t the margin is indicated by arrws. The purple and blue grid indicates the decisin rule made by a classifier based n this separating hyperplane. supprt vectrs, since they are vectrs in p-dimensinal space (in Figure 9.3, supprt p = 2) and they supprt the maximal margin hyperplane in the sense vectr that if these pints were mved slightly then the maximal margin hyperplane wuld mve as well. Interestingly, the maximal margin hyperplane depends directly n the supprt vectrs, but nt n the ther bservatins: a mvement t any f the ther bservatins wuld nt affect the separating hyperplane, prvided that the bservatin s mvement des nt cause it t crss the bundary set by the margin. The fact that the maximal margin hyperplane depends directly n nly a small subset f the bservatins is an imprtant prperty that will arise later in this chapter when we discuss the supprt vectr classifier and supprt vectr machines Cnstructin f the Maximal Margin Classifier We nw cnsider the task f cnstructing the maximal margin hyperplane based n a set f n training bservatins x 1,...,x n R p and assciated class labels y 1,...,y n { 1, 1}. Briefly, the maximal margin hyperplane is the slutin t the ptimizatin prblem

358 9.1 Maximal Margin Classifier 343 maximize M (9.9) β 0,β 1,...,β p p subject t βj 2 =1, (9.10) j=1 y i (β 0 + β 1 x i1 + β 2 x i β p x ip ) M i =1,...,n.(9.11) This ptimizatin prblem (9.9) (9.11) is actually simpler than it lks. First f all, the cnstraint in (9.11) that y i (β 0 + β 1 x i1 + β 2 x i β p x ip ) M i =1,...,n guarantees that each bservatin will be n the crrect side f the hyperplane, prvided that M is psitive. (Actually, fr each bservatin t be n the crrect side f the hyperplane we wuld simply need y i (β 0 + β 1 x i1 + β 2 x i β p x ip ) > 0, s the cnstraint in (9.11) in fact requires that each bservatin be n the crrect side f the hyperplane, with sme cushin, prvided that M is psitive.) Secnd, nte that (9.10) is nt really a cnstraint n the hyperplane, since if β 0 + β 1 x i1 + β 2 x i β p x ip = 0 defines a hyperplane, then s des k(β 0 + β 1 x i1 + β 2 x i β p x ip ) = 0 fr any k 0. Hwever, (9.10) adds meaning t (9.11); ne can shw that with this cnstraint the perpendicular distance frm the ith bservatin t the hyperplane is given by y i (β 0 + β 1 x i1 + β 2 x i β p x ip ). Therefre, the cnstraints (9.10) and (9.11) ensure that each bservatin is n the crrect side f the hyperplane and at least a distance M frm the hyperplane. Hence, M represents the margin f ur hyperplane, and the ptimizatin prblem chses β 0,β 1,...,β p t maximize M.Thisisexactly the definitin f the maximal margin hyperplane! The prblem (9.9) (9.11) can be slved efficiently, but details f this ptimizatin are utside f the scpe f this bk The Nn-separable Case The maximal margin classifier is a very natural way t perfrm classificatin, if a separating hyperplane exists. Hwever, as we have hinted, in many cases n separating hyperplane exists, and s there is n maximal margin classifier. In this case, the ptimizatin prblem (9.9) (9.11) has n slutin with M>0. An example is shwn in Figure 9.4. In this case, we cannt exactly separate the tw classes. Hwever, as we will see in the next sectin, we can extend the cncept f a separating hyperplane in rder t develp a hyperplane that almst separates the classes, using a s-called sft margin. The generalizatin f the maximal margin classifier t the nn-separable case is knwn as the supprt vectr classifier.

359 Supprt Vectr Machines X X 1 FIGURE 9.4. There are tw classes f bservatins, shwn in blue and in purple. In this case, the tw classes are nt separable by a hyperplane, and s the maximal margin classifier cannt be used. 9.2 Supprt Vectr Classifiers Overview f the Supprt Vectr Classifier In Figure 9.4, we see that bservatins that belng t tw classes are nt necessarily separable by a hyperplane. In fact, even if a separating hyperplane des exist, then there are instances in which a classifier based n a separating hyperplane might nt be desirable. A classifier based n a separating hyperplane will necessarily perfectly classify all f the training bservatins; this can lead t sensitivity t individual bservatins. An example is shwn in Figure 9.5. The additin f a single bservatin in the right-hand panel f Figure 9.5 leads t a dramatic change in the maximal margin hyperplane. The resulting maximal margin hyperplane is nt satisfactry fr ne thing, it has nly a tiny margin. This is prblematic because as discussed previusly, the distance f an bservatin frm the hyperplane can be seen as a measure f ur cnfidence that the bservatin was crrectly classified. Mrever, the fact that the maximal margin hyperplane is extremely sensitive t a change in a single bservatin suggests that it may have verfit the training data. In this case, we might be willing t cnsider a classifier based n a hyperplane that des nt perfectly separate the tw classes, in the interest f

360 9.2 Supprt Vectr Classifiers 345 X X X X 1 FIGURE 9.5. Left: Tw classes f bservatins are shwn in blue and in purple, alng with the maximal margin hyperplane. Right: An additinal blue bservatin has been added, leading t a dramatic shift in the maximal margin hyperplane shwn as a slid line. The dashed line indicates the maximal margin hyperplane that was btained in the absence f this additinal pint. Greater rbustness t individual bservatins, and Better classificatin f mst f the training bservatins. That is, it culd be wrthwhile t misclassify a few training bservatins in rder t d a better jb in classifying the remaining bservatins. The supprt vectr classifier, smetimes called a sft margin classifier, des exactly this. Rather than seeking the largest pssible margin s that every bservatin is nt nly n the crrect side f the hyperplane but als n the crrect side f the margin, we instead allw sme bservatins t be n the incrrect side f the margin, r even the incrrect side f the hyperplane. (The margin is sft because it can be vilated by sme f the training bservatins.) An example is shwn in the left-hand panel f Figure 9.6. Mst f the bservatins are n the crrect side f the margin. Hwever, a small subset f the bservatins are n the wrng side f the margin. An bservatin can be nt nly n the wrng side f the margin, but als n the wrng side f the hyperplane. In fact, when there is n separating hyperplane, such a situatin is inevitable. Observatins n the wrng side f the hyperplane crrespnd t training bservatins that are misclassified by the supprt vectr classifier. The right-hand panel f Figure 9.6 illustrates such a scenari. supprt vectr classifier sft margin classifier Details f the Supprt Vectr Classifier The supprt vectr classifier classifies a test bservatin depending n which side f a hyperplane it lies. The hyperplane is chsen t crrectly

361 Supprt Vectr Machines X X X X 1 FIGURE 9.6. Left: A supprt vectr classifier was fit t a small data set. The hyperplane is shwn as a slid line and the margins are shwn as dashed lines. Purple bservatins: Observatins 3, 4, 5, and 6 are n the crrect side f the margin, bservatin 2 is n the margin, and bservatin 1 is n the wrng side f the margin. Blue bservatins: Observatins 7 and 10 are n the crrect side f the margin, bservatin 9 is n the margin, and bservatin 8 is n the wrng side f the margin. N bservatins are n the wrng side f the hyperplane. Right: Same as left panel with tw additinal pints, 11 and 12. These tw bservatins are n the wrng side f the hyperplane and the wrng side f the margin. separate mst f the training bservatins int the tw classes, but may misclassify a few bservatins. It is the slutin t the ptimizatin prblem maximize β 0,β 1,...,β p,ɛ 1,...,ɛ n M (9.12) p subject t βj 2 =1, (9.13) j=1 y i (β 0 + β 1 x i1 + β 2 x i β p x ip ) M(1 ɛ i ), (9.14) n ɛ i 0, ɛ i C, (9.15) i=1 where C is a nnnegative tuning parameter. As in (9.11), M is the width f the margin; we seek t make this quantity as large as pssible. In (9.14), ɛ 1,...,ɛ n are slack variables that allw individual bservatins t be n slack the wrng side f the margin r the hyperplane; we will explain them in variable greater detail mmentarily. Once we have slved (9.12) (9.15), we classify a test bservatin x as befre, by simply determining n which side f the hyperplane it lies. That is, we classify the test bservatin based n the sign f f(x )=β 0 + β 1 x β p x p. The prblem (9.12) (9.15) seems cmplex, but insight int its behavir can be made thrugh a series f simple bservatins presented belw. First f all, the slack variable ɛ i tells us where the ith bservatin is lcated, relative t the hyperplane and relative t the margin. If ɛ i = 0 then the ith

362 9.2 Supprt Vectr Classifiers 347 bservatin is n the crrect side f the margin, as we saw in Sectin If ɛ i > 0 then the ith bservatin is n the wrng side f the margin, and we say that the ith bservatin has vilated the margin. If ɛ i > 1thenit is n the wrng side f the hyperplane. We nw cnsider the rle f the tuning parameter C. In (9.14), C bunds the sum f the ɛ i s, and s it determines the number and severity f the vilatins t the margin (and t the hyperplane) that we will tlerate. We can think f C as a budget fr the amunt that the margin can be vilated by the n bservatins. If C = 0 then there is n budget fr vilatins t the margin, and it must be the case that ɛ 1 =... = ɛ n =0,inwhichcase (9.12) (9.15) simply amunts t the maximal margin hyperplane ptimizatin prblem (9.9) (9.11). (Of curse, a maximal margin hyperplane exists nly if the tw classes are separable.) Fr C>0nmrethanC bservatins can be n the wrng side f the hyperplane, because if an bservatin is n the wrng side f the hyperplane then ɛ i > 1, and (9.14) requires that n i=1 ɛ i C. As the budget C increases, we becme mre tlerant f vilatins t the margin, and s the margin will widen. Cnversely, as C decreases, we becme less tlerant f vilatins t the margin and s the margin narrws. An example in shwn in Figure 9.7. In practice, C is treated as a tuning parameter that is generally chsen via crss-validatin. As with the tuning parameters that we have seen thrughut this bk, C cntrls the bias-variance trade-ff f the statistical learning technique. When C is small, we seek narrw margins that are rarely vilated; this amunts t a classifier that is highly fit t the data, which may have lw bias but high variance. On the ther hand, when C is larger, the margin is wider and we allw mre vilatins t it; this amunts t fitting the data less hard and btaining a classifier that is ptentially mre biased but may have lwer variance. The ptimizatin prblem (9.12) (9.15) has a very interesting prperty: it turns ut that nly bservatins that either lie n the margin r that vilate the margin will affect the hyperplane, and hence the classifier btained. In ther wrds, an bservatin that lies strictly n the crrect side f the margin des nt affect the supprt vectr classifier! Changing the psitin f that bservatin wuld nt change the classifier at all, prvided that its psitin remains n the crrect side f the margin. Observatins that lie directly n the margin, r n the wrng side f the margin fr their class, are knwn as supprt vectrs. These bservatins d affect the supprt vectr classifier. The fact that nly supprt vectrs affect the classifier is in line with ur previus assertin that C cntrls the bias-variance trade-ff f the supprt vectr classifier. When the tuning parameter C is large, then the margin is wide, many bservatins vilate the margin, and s there are many supprt vectrs. In this case, many bservatins are invlved in determining the hyperplane. The tp left panel in Figure 9.7 illustrates this setting: this classifier has lw variance (since many bservatins are supprt vectrs)

363 Supprt Vectr Machines X X 1 X X X2 X X X 1 FIGURE 9.7. A supprt vectr classifier was fit using fur different values f the tuning parameter C in (9.12) (9.15). The largest value f C was used in the tp left panel, and smaller values were used in the tp right, bttm left, and bttm right panels. When C is large, then there is a high tlerance fr bservatins being n the wrng side f the margin, and s the margin will be large. As C decreases, the tlerance fr bservatins being n the wrng side f the margin decreases, and the margin narrws. but ptentially high bias. In cntrast, if C is small, then there will be fewer supprt vectrs and hence the resulting classifier will have lw bias but high variance. The bttm right panel in Figure 9.7 illustrates this setting, with nly eight supprt vectrs. The fact that the supprt vectr classifier s decisin rule is based nly n a ptentially small subset f the training bservatins (the supprt vectrs) means that it is quite rbust t the behavir f bservatins that are far away frm the hyperplane. This prperty is distinct frm sme f the ther classificatin methds that we have seen in preceding chapters, such as linear discriminant analysis. Recall that the LDA classificatin rule

364 9.3 Supprt Vectr Machines 349 X X X X 1 FIGURE 9.8. Left: The bservatins fall int tw classes, with a nn-linear bundary between them. Right: The supprt vectr classifier seeks a linear bundary, and cnsequently perfrms very prly. depends n the mean f all f the bservatins within each class, as well as the within-class cvariance matrix cmputed using all f the bservatins. In cntrast, lgistic regressin, unlike LDA, has very lw sensitivity t bservatins far frm the decisin bundary. In fact we will see in Sectin 9.5 that the supprt vectr classifier and lgistic regressin are clsely related. 9.3 Supprt Vectr Machines We first discuss a general mechanism fr cnverting a linear classifier int ne that prduces nn-linear decisin bundaries. We then intrduce the supprt vectr machine, which des this in an autmatic way Classificatin with Nn-linear Decisin Bundaries The supprt vectr classifier is a natural apprach fr classificatin in the tw-class setting, if the bundary between the tw classes is linear. Hwever, in practice we are smetimes faced with nn-linear class bundaries. Fr instance, cnsider the data in the left-hand panel f Figure 9.8. It is clear that a supprt vectr classifier r any linear classifier will perfrm prly here. Indeed, the supprt vectr classifier shwn in the right-hand panel f Figure 9.8 is useless here. In Chapter 7, we are faced with an analgus situatin. We see there that the perfrmance f linear regressin can suffer when there is a nnlinear relatinship between the predictrs and the utcme. In that case, we cnsider enlarging the feature space using functins f the predictrs,

such as quadratic and cubic terms, in order to address this non-linearity. In the case of the support vector classifier, we could address the problem of possibly non-linear boundaries between classes in a similar way, by enlarging the feature space using quadratic, cubic, and even higher-order polynomial functions of the predictors. For instance, rather than fitting a support vector classifier using p features

X_1, X_2, \ldots, X_p,

we could instead fit a support vector classifier using 2p features

X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2.

Then (9.12)–(9.15) would become

\underset{\beta_0,\beta_{11},\beta_{12},\ldots,\beta_{p1},\beta_{p2},\,\epsilon_1,\ldots,\epsilon_n}{\text{maximize}}\; M    (9.16)

\text{subject to } y_i\Bigl(\beta_0 + \sum_{j=1}^{p}\beta_{j1}x_{ij} + \sum_{j=1}^{p}\beta_{j2}x_{ij}^2\Bigr) \ge M(1-\epsilon_i),

\sum_{i=1}^{n}\epsilon_i \le C, \quad \epsilon_i \ge 0, \quad \sum_{j=1}^{p}\sum_{k=1}^{2}\beta_{jk}^2 = 1.

Why does this lead to a non-linear decision boundary? In the enlarged feature space, the decision boundary that results from (9.16) is in fact linear. But in the original feature space, the decision boundary is of the form q(x) = 0, where q is a quadratic polynomial, and its solutions are generally non-linear. One might additionally want to enlarge the feature space with higher-order polynomial terms, or with interaction terms of the form X_j X_{j'} for j ≠ j'. Alternatively, other functions of the predictors could be considered rather than polynomials. It is not hard to see that there are many possible ways to enlarge the feature space, and that unless we are careful, we could end up with a huge number of features. Then computations would become unmanageable. The support vector machine, which we present next, allows us to enlarge the feature space used by the support vector classifier in a way that leads to efficient computations.

9.3.2 The Support Vector Machine

The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. We will now discuss this extension, the details of which are somewhat complex and beyond the scope of this book. However, the main idea is described in Section 9.3.1: we may want to enlarge our feature space

in order to accommodate a non-linear boundary between the classes. The kernel approach that we describe here is simply an efficient computational approach for enacting this idea.

We have not discussed exactly how the support vector classifier is computed because the details become somewhat technical. However, it turns out that the solution to the support vector classifier problem (9.12)–(9.15) involves only the inner products of the observations (as opposed to the observations themselves). The inner product of two r-vectors a and b is defined as \langle a, b\rangle = \sum_{i=1}^{r} a_i b_i. Thus the inner product of two observations x_i, x_{i'} is given by

\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij}x_{i'j}.    (9.17)

It can be shown that the linear support vector classifier can be represented as

f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle,    (9.18)

where there are n parameters \alpha_i, i = 1,\ldots,n, one per training observation. To estimate the parameters \alpha_1,\ldots,\alpha_n and \beta_0, all we need are the \binom{n}{2} inner products \langle x_i, x_{i'} \rangle between all pairs of training observations. (The notation \binom{n}{2} means n(n-1)/2, and gives the number of pairs among a set of n items.)

Notice that in (9.18), in order to evaluate the function f(x), we need to compute the inner product between the new point x and each of the training points x_i. However, it turns out that \alpha_i is nonzero only for the support vectors in the solution; that is, if a training observation is not a support vector, then its \alpha_i equals zero. So if S is the collection of indices of these support points, we can rewrite any solution function of the form (9.18) as

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle,    (9.19)

which typically involves far fewer terms than in (9.18).^2

To summarize, in representing the linear classifier f(x), and in computing its coefficients, all we need are inner products.

Now suppose that every time the inner product (9.17) appears in the representation (9.18), or in a calculation of the solution for the support

^2 By expanding each of the inner products in (9.19), it is easy to see that f(x) is a linear function of the coordinates of x. Doing so also establishes the correspondence between the \alpha_i and the original parameters \beta_j.

vector classifier, we replace it with a generalization of the inner product of the form

K(x_i, x_{i'}),    (9.20)

where K is some function that we will refer to as a kernel. A kernel is a function that quantifies the similarity of two observations. For instance, we could simply take

K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij}x_{i'j},    (9.21)

which would just give us back the support vector classifier. Equation 9.21 is known as a linear kernel because the support vector classifier is linear in the features; the linear kernel essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation. But one could instead choose another form for (9.20). For instance, one could replace every instance of \sum_{j=1}^{p} x_{ij}x_{i'j} with the quantity

K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij}x_{i'j}\Bigr)^{d}.    (9.22)

This is known as a polynomial kernel of degree d, where d is a positive integer. Using such a kernel with d > 1, instead of the standard linear kernel (9.21), in the support vector classifier algorithm leads to a much more flexible decision boundary. It essentially amounts to fitting a support vector classifier in a higher-dimensional space involving polynomials of degree d, rather than in the original feature space. When the support vector classifier is combined with a non-linear kernel such as (9.22), the resulting classifier is known as a support vector machine. Note that in this case the (non-linear) function has the form

f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i).    (9.23)

The left-hand panel of Figure 9.9 shows an example of an SVM with a polynomial kernel applied to the non-linear data from Figure 9.8. The fit is a substantial improvement over the linear support vector classifier. When d = 1, then the SVM reduces to the support vector classifier seen earlier in this chapter.

The polynomial kernel shown in (9.22) is one example of a possible non-linear kernel, but alternatives abound. Another popular choice is the radial kernel, which takes the form

K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr).    (9.24)
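To make (9.21), (9.22), and (9.24) concrete, the following is a minimal sketch, not taken from the text, of how each kernel could be written as a plain R function; the names linear.kernel, poly.kernel, and radial.kernel, and the example values of d and gamma, are illustrative choices only.

> # Illustrative kernel functions corresponding to (9.21), (9.22), and (9.24)
> linear.kernel <- function(xi, xip) sum(xi * xip)
> poly.kernel   <- function(xi, xip, d = 3) (1 + sum(xi * xip))^d
> radial.kernel <- function(xi, xip, gamma = 1) exp(-gamma * sum((xi - xip)^2))
> # Evaluate each kernel on a pair of observations with p = 2 features
> x1 <- c(1, 2); x2 <- c(0.5, -1)
> linear.kernel(x1, x2)
> poly.kernel(x1, x2, d = 2)
> radial.kernel(x1, x2, gamma = 0.5)

Each function depends on the pair of observations only through simple componentwise operations, which is what makes the substitution of a kernel for the inner product in (9.20) so convenient.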

368 9.3 Supprt Vectr Machines 353 X X X X 1 FIGURE 9.9. Left: An SVM with a plynmial kernel f degree 3 is applied t the nn-linear data frm Figure 9.8, resulting in a far mre apprpriate decisin rule. Right: An SVM with a radial kernel is applied. In this example, either kernel is capable f capturing the decisin bundary. In (9.24), γ is a psitive cnstant. The right-hand panel f Figure 9.9 shws an example f an SVM with a radial kernel n this nn-linear data; it als des a gd jb in separating the tw classes. Hw des the radial kernel (9.24) actually wrk? If a given test bservatin x =(x 1...x p )T is far frm a training bservatin x i in terms f Euclidean distance, then p j=1 (x j x ij) 2 will be large, and s K(x,x i )= exp( γ p j=1 (x j x ij) 2 ) will be very tiny. This means that in (9.23), x i will play virtually n rle in f(x ). Recall that the predicted class label fr the test bservatin x is based n the sign f f(x ). In ther wrds, training bservatins that are far frm x will play essentially n rle in the predicted class label fr x. This means that the radial kernel has very lcal behavir, in the sense that nly nearby training bservatins have an effect n the class label f a test bservatin. What is the advantage f using a kernel rather than simply enlarging the feature space using functins f the riginal features, as in (9.16)? One advantage is cmputatinal, and it amunts t the fact that using kernels, ne need nly cmpute K(x i,x i ) fr all ( n 2) distinct pairs i, i.thiscanbe dne withut explicitly wrking in the enlarged feature space. This is imprtant because in many applicatins f SVMs, the enlarged feature space is s large that cmputatins are intractable. Fr sme kernels, such as the radial kernel (9.24), the feature space is implicit and infinite-dimensinal, s we culd never d the cmputatins there anyway!
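As a small illustration of this computational point, the sketch below (not part of the text) forms the full n x n matrix of radial kernel values for a toy data set; only pairwise evaluations of (9.24) are needed, and the implicit enlarged feature space is never constructed. The variable names and the value gamma = 0.5 are arbitrary.

> set.seed(1)
> x <- matrix(rnorm(10 * 2), ncol = 2)   # n = 10 observations, p = 2 features
> n <- nrow(x)
> K <- matrix(0, n, n)
> for (i in 1:n) {
+   for (ip in 1:n) {
+     K[i, ip] <- exp(-0.5 * sum((x[i, ] - x[ip, ])^2))  # radial kernel, gamma = 0.5
+   }
+ }
> dim(K)   # 10 x 10, regardless of the dimension of the implicit feature space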

369 Supprt Vectr Machines True psitive rate Supprt Vectr Classifier LDA True psitive rate Supprt Vectr Classifier SVM: γ=10 3 SVM: γ=10 2 SVM: γ= False psitive rate False psitive rate FIGURE ROC curves fr the Heart data training set. Left: The supprt vectr classifier and LDA are cmpared. Right: The supprt vectr classifier is cmpared t an SVM using a radial basis kernel with γ =10 3, 10 2,and An Applicatin t the Heart Disease Data In Chapter 8 we apply decisin trees and related methds t the Heart data. The aim is t use 13 predictrs such as Age, Sex, andchl in rder t predict whether an individual has heart disease. We nw investigate hw an SVM cmpares t LDA n this data. After remving 6 missing bservatins, the data cnsist f 297 subjects, which we randmly split int 207 training and 90 test bservatins. We first fit LDA and the supprt vectr classifier t the training data. Nte that the supprt vectr classifier is equivalent t a SVM using a plynmial kernel f degree d = 1. The left-hand panel f Figure 9.10 displays ROC curves (described in Sectin 4.4.3) fr the training set predictins fr bth LDA and the supprt vectr classifier. Bth classifiers cmpute scres f the frm ˆf(X) = ˆβ 0 + ˆβ 1 X 1 + ˆβ 2 X ˆβ p X p fr each bservatin. Fr any given cutff t, we classify bservatins int the heart disease r n heart disease categries depending n whether ˆf(X) <tr ˆf(X) t. The ROC curve is btained by frming these predictins and cmputing the false psitive and true psitive rates fr a range f values f t. Anptimal classifier will hug the tp left crner f the ROC plt. In this instance LDA and the supprt vectr classifier bth perfrm well, thugh there is a suggestin that the supprt vectr classifier may be slightly superir. The right-hand panel f Figure 9.10 displays ROC curves fr SVMs using a radial kernel, with varius values f γ. Asγ increases and the fit becmes mre nn-linear, the ROC curves imprve. Using γ =10 1 appears t give an almst perfect ROC curve. Hwever, these curves represent training errr rates, which can be misleading in terms f perfrmance n new test data. Figure 9.11 displays ROC curves cmputed n the 90 test bserva-

370 9.4 SVMs with Mre than Tw Classes 355 True psitive rate Supprt Vectr Classifier LDA True psitive rate Supprt Vectr Classifier SVM: γ=10 3 SVM: γ=10 2 SVM: γ= False psitive rate False psitive rate FIGURE ROC curves fr the test set f the Heart data. Left: The supprt vectr classifier and LDA are cmpared. Right: The supprt vectr classifier is cmpared t an SVM using a radial basis kernel with γ =10 3, 10 2,and10 1. tins. We bserve sme differences frm the training ROC curves. In the left-hand panel f Figure 9.11, the supprt vectr classifier appears t have a small advantage ver LDA (althugh these differences are nt statistically significant). In the right-hand panel, the SVM using γ =10 1,which shwed the best results n the training data, prduces the wrst estimates n the test data. This is nce again evidence that while a mre flexible methd will ften prduce lwer training errr rates, this des nt necessarily lead t imprved perfrmance n test data. The SVMs with γ =10 2 and γ =10 3 perfrm cmparably t the supprt vectr classifier, and all three utperfrm the SVM with γ = SVMs with Mre than Tw Classes S far, ur discussin has been limited t the case f binary classificatin: that is, classificatin in the tw-class setting. Hw can we extend SVMs t the mre general case where we have sme arbitrary number f classes? It turns ut that the cncept f separating hyperplanes upn which SVMs are based des nt lend itself naturally t mre than tw classes. Thugh a number f prpsals fr extending SVMs t the K-class case have been made, the tw mst ppular are the ne-versus-ne and ne-versus-all appraches. We briefly discuss thse tw appraches here One-Versus-One Classificatin Suppse that we wuld like t perfrm classificatin using SVMs, and there are K>2 classes. A ne-versus-ne r all-pairs apprach cnstructs ( ) K 2 ne-versusne

371 Supprt Vectr Machines SVMs, each f which cmpares a pair f classes. Fr example, ne such SVM might cmpare the kth class, cded as +1, t the k th class, cded as 1. We classify a test bservatin using each f the ( K 2 ) classifiers, and we tally the number f times that the test bservatin is assigned t each f the K classes. The final classificatin is perfrmed by assigning the test bservatin ( t the class t which it was mst frequently assigned in these K ) 2 pairwise classificatins One-Versus-All Classificatin The ne-versus-all apprach is an alternative prcedure fr applying SVMs ne-versusall in the case f K>2 classes. We fit K SVMs, each time cmparing ne f the K classes t the remaining K 1 classes. Let β 0k,β 1k,...,β pk dente the parameters that result frm fitting an SVM cmparing the kth class (cded as +1) t the thers (cded as 1). Let x dente a test bservatin. We assign the bservatin t the class fr which β 0k +β 1k x 1 +β 2kx β pk x p is largest, as this amunts t a high level f cnfidence that the test bservatin belngs t the kth class rather than t any f the ther classes. 9.5 Relatinship t Lgistic Regressin When SVMs were first intrduced in the mid-1990s, they made quite a splash in the statistical and machine learning cmmunities. This was due in part t their gd perfrmance, gd marketing, and als t the fact that the underlying apprach seemed bth nvel and mysterius. The idea f finding a hyperplane that separates the data as well as pssible, while allwing sme vilatins t this separatin, seemed distinctly different frm classical appraches fr classificatin, such as lgistic regressin and linear discriminant analysis. Mrever, the idea f using a kernel t expand the feature space in rder t accmmdate nn-linear class bundaries appeared t be a unique and valuable characteristic. Hwever, since that time, deep cnnectins between SVMs and ther mre classical statistical methds have emerged. It turns ut that ne can rewrite the criterin (9.12) (9.15) fr fitting the supprt vectr classifier f(x) =β 0 + β 1 X β p X p as n minimize max [0, 1 y i f(x i )] + λ β 0,β 1,...,β p i=1 p j=1 β 2 j, (9.25)

where λ is a nonnegative tuning parameter. When λ is large then \beta_1,\ldots,\beta_p are small, more violations to the margin are tolerated, and a low-variance but high-bias classifier will result. When λ is small then few violations to the margin will occur; this amounts to a high-variance but low-bias classifier. Thus, a small value of λ in (9.25) amounts to a small value of C in (9.15). Note that the \lambda\sum_{j=1}^{p}\beta_j^2 term in (9.25) is the ridge penalty term from Section 6.2.1, and plays a similar role in controlling the bias-variance trade-off for the support vector classifier.

Now (9.25) takes the Loss + Penalty form that we have seen repeatedly throughout this book:

\underset{\beta_0,\beta_1,\ldots,\beta_p}{\text{minimize}}\; \bigl\{ L(\mathbf{X}, \mathbf{y}, \beta) + \lambda P(\beta) \bigr\}.    (9.26)

In (9.26), L(\mathbf{X}, \mathbf{y}, \beta) is some loss function quantifying the extent to which the model, parametrized by \beta, fits the data (\mathbf{X}, \mathbf{y}), and P(\beta) is a penalty function on the parameter vector \beta whose effect is controlled by a nonnegative tuning parameter \lambda. For instance, ridge regression and the lasso both take this form with

L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i=1}^{n}\Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2

and with P(\beta) = \sum_{j=1}^{p}\beta_j^2 for ridge regression and P(\beta) = \sum_{j=1}^{p}|\beta_j| for the lasso. In the case of (9.25) the loss function instead takes the form

L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i=1}^{n} \max\bigl[0,\; 1 - y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})\bigr].

This is known as hinge loss, and is depicted in Figure 9.12. However, it turns out that the hinge loss function is closely related to the loss function used in logistic regression, also shown in Figure 9.12.

An interesting characteristic of the support vector classifier is that only support vectors play a role in the classifier obtained; observations on the correct side of the margin do not affect it. This is due to the fact that the loss function shown in Figure 9.12 is exactly zero for observations for which y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge 1; these correspond to observations that are on the correct side of the margin.^3 In contrast, the loss function for logistic regression shown in Figure 9.12 is not exactly zero anywhere. But it is very small for observations that are far from the decision boundary. Due to the similarities between their loss functions, logistic regression and the support vector classifier often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.

^3 With this hinge-loss + penalty representation, the margin corresponds to the value one, and the width of the margin is determined by \sum_{j=1}^{p}\beta_j^2.
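The comparison in Figure 9.12 is easy to reproduce. The following sketch, not part of the text, plots the hinge loss max[0, 1 - t] against the logistic regression (negative log-likelihood) loss log(1 + e^{-t}) as functions of t = y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}); the variable names are illustrative only.

> t <- seq(-3, 3, length.out = 200)
> hinge.loss <- pmax(0, 1 - t)        # SVM (hinge) loss
> logistic.loss <- log(1 + exp(-t))   # logistic regression loss
> plot(t, hinge.loss, type = "l", xlab = "y * f(x)", ylab = "Loss")
> lines(t, logistic.loss, lty = 2)
> legend("topright", legend = c("SVM Loss", "Logistic Regression Loss"), lty = c(1, 2))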

373 Supprt Vectr Machines Lss SVM Lss Lgistic Regressin Lss y i (β 0 + β 1 x i β p x ip ) FIGURE The SVM and lgistic regressin lss functins are cmpared, as a functin f y i(β 0 + β 1x i β px ip). Wheny i(β 0 + β 1x i β px ip) is greater than 1, then the SVM lss is zer, since this crrespnds t an bservatin that is n the crrect side f the margin. Overall, the tw lss functins have quite similar behavir. When the supprt vectr classifier and SVM were first intrduced, it was thught that the tuning parameter C in (9.15) was an unimprtant nuisance parameter that culd be set t sme default value, like 1. Hwever, the Lss + Penalty frmulatin (9.25) fr the supprt vectr classifier indicates that this is nt the case. The chice f tuning parameter is very imprtant and determines the extent t which the mdel underfits r verfits the data, as illustrated, fr example, in Figure 9.7. We have established that the supprt vectr classifier is clsely related t lgistic regressin and ther preexisting statistical methds. Is the SVM unique in its use f kernels t enlarge the feature space t accmmdate nn-linear class bundaries? The answer t this questin is n. We culd just as well perfrm lgistic regressin r many f the ther classificatin methds seen in this bk using nn-linear kernels; this is clsely related t sme f the nn-linear appraches seen in Chapter 7. Hwever, fr histrical reasns, the use f nn-linear kernels is much mre widespread in the cntext f SVMs than in the cntext f lgistic regressin r ther methds. Thugh we have nt addressed it here, there is in fact an extensin f the SVM fr regressin (i.e. fr a quantitative rather than a qualitative respnse), called supprt vectr regressin. In Chapter 3, we saw that supprt least squares regressin seeks cefficients β 0,β 1,...,β p such that the sum f squared residuals is as small as pssible. (Recall frm Chapter 3 that residuals are defined as y i β 0 β 1 x i1 β p x ip.) Supprt vectr regressin instead seeks cefficients that minimize a different type f lss, where nly residuals larger in abslute value than sme psitive cnstant vectr regressin

374 9.6 Lab: Supprt Vectr Machines 359 cntribute t the lss functin. This is an extensin f the margin used in supprt vectr classifiers t the regressin setting. 9.6 Lab: Supprt Vectr Machines We use the e1071 library in R t demnstrate the supprt vectr classifier and the SVM. Anther ptin is the LiblineaR library, which is useful fr very large linear prblems Supprt Vectr Classifier The e1071 library cntains implementatins fr a number f statistical learning methds. In particular, the svm() functin can be used t fit a svm() supprt vectr classifier when the argument kernel="linear" is used. This functin uses a slightly different frmulatin frm (9.14) and (9.25) fr the supprt vectr classifier. A cst argument allws us t specify the cst f a vilatin t the margin. When the cst argument is small, then the margins will be wide and many supprt vectrs will be n the margin r will vilate the margin. When the cst argument is large, then the margins will be narrw and there will be few supprt vectrs n the margin r vilating the margin. We nw use the svm() functin t fit the supprt vectr classifier fr a given value f the cst parameter. Here we demnstrate the use f this functin n a tw-dimensinal example s that we can plt the resulting decisin bundary. We begin by generating the bservatins, which belng t tw classes. > set.seed(1) > x=matrix(rnrm(20*2), ncl=2) > y=c(rep(-1,10), rep(1,10)) > x[y==1,]=x[y==1,] + 1 We begin by checking whether the classes are linearly separable. > plt(x, cl=(3-y)) They are nt. Next, we fit the supprt vectr classifier. Nte that in rder fr the svm() functin t perfrm classificatin (as ppsed t SVM-based regressin), we must encde the respnse as a factr variable. We nw create a data frame with the respnse cded as a factr. > dat=data.frame(x=x, y=as.factr(y)) > library(e1071) > svmfit=svm(y., data=dat, kernel="linear", cst=10, scale=false)

375 Supprt Vectr Machines The argument scale=false tells the svm() functin nt t scale each feature t have mean zer r standard deviatin ne; depending n the applicatin, ne might prefer t use scale=true. We can nw plt the supprt vectr classifier btained: > plt(svmfit, dat) Nte that the tw arguments t the plt.svm() functin are the utput f the call t svm(), as well as the data used in the call t svm(). The regin f feature space that will be assigned t the 1 class is shwn in light blue, and the regin that will be assigned t the +1 class is shwn in purple. The decisin bundary between the tw classes is linear (because we used the argument kernel="linear"), thugh due t the way in which the pltting functin is implemented in this library the decisin bundary lks smewhat jagged in the plt. We see that in this case nly ne bservatin is misclassified. (Nte that here the secnd feature is pltted n the x-axis and the first feature is pltted n the y-axis, in cntrast t the behavir f the usual plt() functin in R.) The supprt vectrs are pltted as crsses and the remaining bservatins are pltted as circles; we see here that there are seven supprt vectrs. We can determine their identities as fllws: > svmfit$index [1] We can btain sme basic infrmatin abut the supprt vectr classifier fit using the summary() cmmand: > summary(svmfit) Call: svm(frmula = y., data = dat, kernel = "linear", cst = 10, scale = FALSE) Parameters : SVM-Type: C-classificatin SVM-Kernel: linear cst: 10 gamma: 0.5 Number f Supprt Vectrs: 7 ( 4 3 ) Number f Classes: 2 Levels: -1 1 This tells us, fr instance, that a linear kernel was used with cst=10, and that there were seven supprt vectrs, fur in ne class and three in the ther. What if we instead used a smaller value f the cst parameter? > svmfit=svm(y., data=dat, kernel="linear", cst=0.1, scale=false) > plt(svmfit, dat) > svmfit$index [1]

376 9.6 Lab: Supprt Vectr Machines 361 Nw that a smaller value f the cst parameter is being used, we btain a larger number f supprt vectrs, because the margin is nw wider. Unfrtunately, the svm() functin des nt explicitly utput the cefficients f the linear decisin bundary btained when the supprt vectr classifier is fit, nr des it utput the width f the margin. The e1071 library includes a built-in functin, tune(), t perfrm crss- tune() validatin. By default, tune() perfrms ten-fld crss-validatin n a set f mdels f interest. In rder t use this functin, we pass in relevant infrmatin abut the set f mdels that are under cnsideratin. The fllwing cmmand indicates that we want t cmpare SVMs with a linear kernel, using a range f values f the cst parameter. > set.seed(1) > tune.ut=tune(svm,y.,data=dat,kernel="linear", ranges=list(cst=c(0.001, 0.01, 0.1, 1,5,10,100) )) We can easily access the crss-validatin errrs fr each f these mdels using the summary() cmmand: > summary(tune.ut) Parameter tuning f svm : - sampling methd: 10-fld crss validatin - best parameters : cst best perfrmance : Detailed perfrmance results: cst errr dispersin 1 1e e e e e e e We see that cst=0.1 results in the lwest crss-validatin errr rate. The tune() functin stres the best mdel btained, which can be accessed as fllws: > bestmd=tune.ut$best.mdel > summary(bestmd) The predict() functin can be used t predict the class label n a set f test bservatins, at any given value f the cst parameter. We begin by generating a test data set. > xtest=matrix(rnrm(20*2), ncl=2) > ytest=sample(c(-1,1), 20, rep=true) > xtest[ytest==1,]=xtest[ytest==1,] + 1 > testdat=data.frame(x=xtest, y=as.factr(ytest)) Nw we predict the class labels f these test bservatins. Here we use the best mdel btained thrugh crss-validatin in rder t make predictins.

377 Supprt Vectr Machines > ypred=predict(bestmd,testdat) > table(predict=ypred, truth=testdat$y ) truth predict Thus, with this value f cst, 19 f the test bservatins are crrectly classified. What if we had instead used cst=0.01? > svmfit=svm(y., data=dat, kernel="linear", cst=.01, scale=false) > ypred=predict(svmfit,testdat) > table(predict=ypred, truth=testdat$y ) truth predict In this case ne additinal bservatin is misclassified. Nw cnsider a situatin in which the tw classes are linearly separable. Then we can find a separating hyperplane using the svm() functin. We first further separate the tw classes in ur simulated data s that they are linearly separable: > x[y==1,]=x[y==1,]+0.5 > plt(x, cl=(y+5)/2, pch=19) Nw the bservatins are just barely linearly separable. We fit the supprt vectr classifier and plt the resulting hyperplane, using a very large value f cst s that n bservatins are misclassified. > dat=data.frame(x=x,y=as.factr(y)) > svmfit=svm(y., data=dat, kernel="linear", cst=1e5) > summary(svmfit) Call: svm(frmula = y., data = dat, kernel = "linear", cst = 1e +05) Parameters : SVM-Type: C-classificatin SVM-Kernel: linear cst: 1e+05 gamma: 0.5 Number f Supprt Vectrs: 3 ( 1 2 ) Number f Classes: 2 Levels: -1 1 > plt(svmfit, dat) N training errrs were made and nly three supprt vectrs were used. Hwever, we can see frm the figure that the margin is very narrw (because the bservatins that are nt supprt vectrs, indicated as circles, are very

378 9.6 Lab: Supprt Vectr Machines 363 clse t the decisin bundary). It seems likely that this mdel will perfrm prly n test data. We nw try a smaller value f cst: > svmfit=svm(y., data=dat, kernel="linear", cst=1) > summary(svmfit) > plt(svmfit,dat) Using cst=1, we misclassify a training bservatin, but we als btain a much wider margin and make use f seven supprt vectrs. It seems likely that this mdel will perfrm better n test data than the mdel with cst=1e Supprt Vectr Machine In rder t fit an SVM using a nn-linear kernel, we nce again use the svm() functin. Hwever, nw we use a different value f the parameter kernel. T fit an SVM with a plynmial kernel we use kernel="plynmial", and t fit an SVM with a radial kernel we use kernel="radial". Inthefrmer case we als use the degree argument t specify a degree fr the plynmial kernel (this is d in (9.22)), and in the latter case we use gamma t specify a value f γ fr the radial basis kernel (9.24). We first generate sme data with a nn-linear class bundary, as fllws: > set.seed(1) > x=matrix(rnrm(200*2), ncl=2) > x[1:100,]=x[1:100,]+2 > x[101:150,]=x[101:150,]-2 > y=c(rep(1,150),rep(2,50)) > dat=data.frame(x=x,y=as.factr(y)) Pltting the data makes it clear that the class bundary is indeed nnlinear: > plt(x, cl=y) The data is randmly split int training and testing grups. We then fit the training data using the svm() functin with a radial kernel and γ =1: > train=sample(200,100) > svmfit=svm(y., data=dat[train,], kernel="radial", gamma=1, cst=1) > plt(svmfit, dat[train,]) The plt shws that the resulting SVM has a decidedly nn-linear bundary. The summary() functin can be used t btain sme infrmatin abut the SVM fit: > summary(svmfit) Call: svm(frmula = y., data = dat, kernel = "radial", gamma = 1, cst = 1) Parameters : SVM-Type: C-classificatin

379 Supprt Vectr Machines SVM-Kernel: radial cst: 1 gamma: 1 Number f Supprt Vectrs: 37 ( ) Number f Classes: 2 Levels: 1 2 We can see frm the figure that there are a fair number f training errrs in this SVM fit. If we increase the value f cst, we can reduce the number f training errrs. Hwever, this cmes at the price f a mre irregular decisin bundary that seems t be at risk f verfitting the data. > svmfit=svm(y., data=dat[train,], kernel="radial",gamma=1, cst=1e5) > plt(svmfit,dat[train,]) We can perfrm crss-validatin using tune() t select the best chice f γ and cst fr an SVM with a radial kernel: > set.seed(1) > tune.ut=tune(svm, y., data=dat[train,], kernel="radial", ranges=list(cst=c(0.1,1,10,100,1000), gamma=c(0.5,1,2,3,4))) > summary(tune.ut) Parameter tuning f svm : - sampling methd: 10-fld crss validatin - best parameters : cst gamma best perfrmance : Detailed perfrmance results: cst gamma errr dispersin 1 1e e e e e e e Therefre, the best chice f parameters invlves cst=1 and gamma=2. We can view the test set predictins fr this mdel by applying the predict() functin t the data. Ntice that t d this we subset the dataframe dat using -train as an index set. > table(true=dat[-train,"y"], pred=predict(tune.ut$best.mdel, newx=dat[-train,])) 39 % f test bservatins are misclassified by this SVM.
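As a side note, predict() for svm objects takes its test observations through the newdata argument. A minimal sketch, assuming the tuning object returned by tune() and the dat/train split created above, of how the confusion matrix and the corresponding test error rate could be computed is:

> pred <- predict(tune.out$best.model, newdata = dat[-train, ])
> conf <- table(true = dat[-train, "y"], pred = pred)
> conf
> 1 - sum(diag(conf)) / sum(conf)   # fraction of test observations misclassified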

380 9.6.3 ROC Curves 9.6 Lab: Supprt Vectr Machines 365 The ROCR package can be used t prduce ROC curves such as thse in Figures 9.10 and We first write a shrt functin t plt an ROC curve given a vectr cntaining a numerical scre fr each bservatin, pred, and a vectr cntaining the class label fr each bservatin, truth. > library(rocr) > rcplt=functin(pred, truth,...){ + predb = predictin (pred, truth) + perf = perfrmance (predb, "tpr", "fpr") + plt(perf,...)} SVMs and supprt vectr classifiers utput class labels fr each bservatin. Hwever, it is als pssible t btain fitted values fr each bservatin, which are the numerical scres used t btain the class labels. Fr instance, in the case f a supprt vectr classifier, the fitted value fr an bservatin X =(X 1,X 2,...,X p ) T takes the frm ˆβ 0 + ˆβ 1 X 1 + ˆβ 2 X ˆβ p X p. Fr an SVM with a nn-linear kernel, the equatin that yields the fitted value is given in (9.23). In essence, the sign f the fitted value determines n which side f the decisin bundary the bservatin lies. Therefre, the relatinship between the fitted value and the class predictin fr a given bservatin is simple: if the fitted value exceeds zer then the bservatin is assigned t ne class, and if it is less than zer than it is assigned t the ther. In rder t btain the fitted values fr a given SVM mdel fit, we use decisin.values=true when fitting svm(). Then the predict() functin will utput the fitted values. > svmfit.pt=svm(y., data=dat[train,], kernel="radial", gamma=2, cst=1,decisin.values=t) > fitted=attributes (predict(svmfit.pt,dat[train,],decisin. values=true))$decisin.values Nw we can prduce the ROC plt. > par(mfrw=c(1,2)) > rcplt(fitted,dat[train,"y"],main="training Data") SVM appears t be prducing accurate predictins. By increasing γ we can prduce a mre flexible fit and generate further imprvements in accuracy. > svmfit.flex=svm(y., data=dat[train,], kernel="radial", gamma=50, cst=1, decisin.values=t) > fitted=attributes (predict(svmfit.flex,dat[train,],decisin. values=t))$decisin.values > rcplt(fitted,dat[train,"y"],add=t,cl="red") Hwever, these ROC curves are all n the training data. We are really mre interested in the level f predictin accuracy n the test data. When we cmpute the ROC curves n the test data, the mdel with γ = 2 appears t prvide the mst accurate results.

381 Supprt Vectr Machines > fitted=attributes (predict(svmfit.pt,dat[-train,],decisin. values=t))$decisin.values > rcplt(fitted,dat[-train,"y"],main="test Data") > fitted=attributes (predict(svmfit.flex,dat[-train,],decisin. values=t))$decisin.values > rcplt(fitted,dat[-train,"y"],add=t,cl="red") SVM with Multiple Classes If the respnse is a factr cntaining mre than tw levels, then the svm() functin will perfrm multi-class classificatin using the ne-versus-ne apprach. We explre that setting here by generating a third class f bservatins. > set.seed(1) > x=rbind(x, matrix(rnrm(50*2), ncl=2)) > y=c(y, rep(0,50)) > x[y==0,2]= x[y==0,2]+2 > dat=data.frame(x=x, y=as.factr(y)) > par(mfrw=c(1,1)) > plt(x,cl=(y+1)) We nw fit an SVM t the data: > svmfit=svm(y., data=dat, kernel="radial", cst=10, gamma=1) > plt(svmfit, dat) The e1071 library can als be used t perfrm supprt vectr regressin, if the respnse vectr that is passed in t svm() is numerical rather than a factr Applicatin t Gene Expressin Data We nw examine the Khan data set, which cnsists f a number f tissue samples crrespnding t fur distinct types f small rund blue cell tumrs. Fr each tissue sample, gene expressin measurements are available. The data set cnsists f training data, xtrain and ytrain, and testing data, xtest and ytest. We examine the dimensin f the data: > library(islr) > names(khan) [1] "xtrain" "xtest" "ytrain" "ytest" > dim(khan$xtrain ) [1] > dim(khan$xtest ) [1] > length(khan$ytrain ) [1] 63 > length(khan$ytest ) [1] 20

382 9.6 Lab: Supprt Vectr Machines 367 This data set cnsists f expressin measurements fr 2,308 genes. The training and test sets cnsist f 63 and 20 bservatins respectively. > table(khan$ytrain ) > table(khan$ytest ) We will use a supprt vectr apprach t predict cancer subtype using gene expressin measurements. In this data set, there are a very large number f features relative t the number f bservatins. This suggests that we shuld use a linear kernel, because the additinal flexibility that will result frm using a plynmial r radial kernel is unnecessary. > dat=data.frame(x=khan$xtrain, y=as.factr(khan$ytrain )) > ut=svm(y., data=dat, kernel="linear",cst=10) > summary(ut) Call: svm(frmula = y., data = dat, kernel = "linear", cst = 10) Parameters : SVM-Type: C-classificatin SVM-Kernel: linear cst: 10 gamma: Number f Supprt Vectrs: 58 ( ) Number f Classes: 4 Levels: > table(ut$fitted, dat$y) We see that there are n training errrs. In fact, this is nt surprising, because the large number f variables relative t the number f bservatins implies that it is easy t find hyperplanes that fully separate the classes. We are mst interested nt in the supprt vectr classifier s perfrmance n the training bservatins, but rather its perfrmance n the test bservatins. > dat.te=data.frame(x=khan$xtest, y=as.factr(khan$ytest )) > pred.te=predict(ut, newdata=dat.te) > table(pred.te, dat.te$y) pred.te

383 Supprt Vectr Machines We see that using cst=10 yields tw test set errrs n this data. 9.7 Exercises Cnceptual 1. This prblem invlves hyperplanes in tw dimensins. (a) Sketch the hyperplane 1 + 3X 1 X 2 = 0. Indicate the set f pintsfrwhich1+3x 1 X 2 > 0, as well as the set f pints frwhich1+3x 1 X 2 < 0. (b) On the same plt, sketch the hyperplane 2+X 1 +2X 2 =0. Indicate the set f pints fr which 2+X 1 +2X 2 > 0, as well as the set f pints fr which 2+X 1 +2X 2 < We have seen that in p = 2 dimensins, a linear decisin bundary takes the frm β 0 +β 1 X 1 +β 2 X 2 = 0. We nw investigate a nn-linear decisin bundary. (a) Sketch the curve (1 + X 1 ) 2 +(2 X 2 ) 2 =4. (b) On yur sketch, indicate the set f pints fr which (1 + X 1 ) 2 +(2 X 2 ) 2 > 4, as well as the set f pints fr which (1 + X 1 ) 2 +(2 X 2 ) 2 4. (c) Suppse that a classifier assigns an bservatin t the blue class if (1 + X 1 ) 2 +(2 X 2 ) 2 > 4, and t the red class therwise. T what class is the bservatin (0, 0) classified? ( 1, 1)? (2, 2)? (3, 8)? (d) Argue that while the decisin bundary in (c) is nt linear in terms f X 1 and X 2, it is linear in terms f X 1, X1 2, X 2,and X Here we explre the maximal margin classifier n a ty data set. (a) We are given n = 7 bservatins in p = 2 dimensins. Fr each bservatin, there is an assciated class label.

384 Obs. X 1 X 2 Y Red Red Red Red Blue Blue Blue 9.7 Exercises 369 Sketch the bservatins. (b) Sketch the ptimal separating hyperplane, and prvide the equatin fr this hyperplane (f the frm (9.1)). (c) Describe the classificatin rule fr the maximal margin classifier. It shuld be smething alng the lines f Classify t Red if β 0 + β 1 X 1 + β 2 X 2 > 0, and classify t Blue therwise. Prvide the values fr β 0, β 1,andβ 2. (d) On yur sketch, indicate the margin fr the maximal margin hyperplane. (e) Indicate the supprt vectrs fr the maximal margin classifier. (f) Argue that a slight mvement f the seventh bservatin wuld nt affect the maximal margin hyperplane. (g) Sketch a hyperplane that is nt the ptimal separating hyperplane, and prvide the equatin fr this hyperplane. (h) Draw an additinal bservatin n the plt s that the tw classes are n lnger separable by a hyperplane. Applied 4. Generate a simulated tw-class data set with 100 bservatins and tw features in which there is a visible but nn-linear separatin between the tw classes. Shw that in this setting, a supprt vectr machine with a plynmial kernel (with degree greater than 1) r a radial kernel will utperfrm a supprt vectr classifier n the training data. Which technique perfrms best n the test data? Make plts and reprt training and test errr rates in rder t back up yur assertins. 5. We have seen that we can fit an SVM with a nn-linear kernel in rder t perfrm classificatin using a nn-linear decisin bundary. We will nw see that we can als btain a nn-linear decisin bundary by perfrming lgistic regressin using nn-linear transfrmatins f the features.

385 Supprt Vectr Machines (a) Generate a data set with n = 500 and p = 2, such that the bservatins belng t tw classes with a quadratic decisin bundary between them. Fr instance, yu can d this as fllws: > x1=runif(500) -0.5 > x2=runif(500) -0.5 > y=1*(x1^2-x2^2 > 0) (b) Plt the bservatins, clred accrding t their class labels. Yur plt shuld display X 1 n the x-axis, and X 2 n the y- axis. (c) Fit a lgistic regressin mdel t the data, using X 1 and X 2 as predictrs. (d) Apply this mdel t the training data in rder t btain a predicted class label fr each training bservatin. Plt the bservatins, clred accrding t the predicted class labels. The decisin bundary shuld be linear. (e) Nw fit a lgistic regressin mdel t the data using nn-linear functins f X 1 and X 2 as predictrs (e.g. X1 2, X 1 X 2,lg(X 2 ), and s frth). (f) Apply this mdel t the training data in rder t btain a predicted class label fr each training bservatin. Plt the bservatins, clred accrding t the predicted class labels. The decisin bundary shuld be bviusly nn-linear. If it is nt, then repeat (a)-(e) until yu cme up with an example in which the predicted class labels are bviusly nn-linear. (g) Fit a supprt vectr classifier t the data with X 1 and X 2 as predictrs. Obtain a class predictin fr each training bservatin. Plt the bservatins, clred accrding t the predicted class labels. (h) Fit a SVM using a nn-linear kernel t the data. Obtain a class predictin fr each training bservatin. Plt the bservatins, clred accrding t the predicted class labels. (i) Cmment n yur results. 6. At the end f Sectin 9.6.1, it is claimed that in the case f data that is just barely linearly separable, a supprt vectr classifier with a small value f cst that misclassifies a cuple f training bservatins may perfrm better n test data than ne with a huge value f cst that des nt misclassify any training bservatins. Yu will nw investigate this claim. (a) Generate tw-class data with p = 2 in such a waythat the classes are just barely linearly separable.

386 9.7 Exercises 371 (b) Cmpute the crss-validatin errr rates fr supprt vectr classifiers with a range f cst values. Hw many training errrs are misclassified fr each value f cst cnsidered, and hw des this relate t the crss-validatin errrs btained? (c) Generate an apprpriate test data set, and cmpute the test errrs crrespnding t each f the values f cst cnsidered. Which value f cst leads t the fewest test errrs, and hw des this cmpare t the values f cst that yield the fewest training errrs and the fewest crss-validatin errrs? (d) Discuss yur results. 7. In this prblem, yu will use supprt vectr appraches in rder t predict whether a given car gets high r lw gas mileage based n the Aut data set. (a) Create a binary variable that takes n a 1 fr cars with gas mileage abve the median, and a 0 fr cars with gas mileage belw the median. (b) Fit a supprt vectr classifier t the data with varius values f cst, in rder t predict whether a car gets high r lw gas mileage. Reprt the crss-validatin errrs assciated with different values f this parameter. Cmment n yur results. (c) Nw repeat (b), this time using SVMs with radial and plynmial basis kernels, with different values f gamma and degree and cst. Cmment n yur results. (d) Make sme plts t back up yur assertins in (b) and (c). Hint: In the lab, we used the plt() functin fr svm bjects nly in cases with p =2.Whenp>2, yucanusetheplt() functin t create plts displaying pairs f variables at a time. Essentially, instead f typing > plt(svmfit, dat) where svmfit cntains yur fitted mdel and dat is a data frame cntaining yur data, yu can type > plt(svmfit, dat, x1 x4) in rder t plt just the first and furth variables. Hwever, yu must replace x1 and x4 with the crrect variable names. T find ut mre, type?plt.svm. 8. This prblem invlves the OJ data set which is part f the ISLR package.

387 Supprt Vectr Machines (a) Create a training set cntaining a randm sample f 800 bservatins, and a test set cntaining the remaining bservatins. (b) Fit a supprt vectr classifier t the training data using cst=0.01, with Purchase as the respnse and the ther variables as predictrs. Use the summary() functin t prduce summary statistics, and describe the results btained. (c) What are the training and test errr rates? (d) Use the tune() functin t select an ptimal cst. Cnsider values in the range 0.01 t 10. (e) Cmpute the training and test errr rates using this new value fr cst. (f) Repeat parts (b) thrugh (e) using a supprt vectr machine with a radial kernel. Use the default value fr gamma. (g) Repeat parts (b) thrugh (e) using a supprt vectr machine with a plynmial kernel. Set degree=2. (h) Overall, which apprach seems t give the best results n this data?

388 10 Unsupervised Learning Mst f this bk cncerns supervised learning methds such as regressin and classificatin. In the supervised learning setting, we typically have access t a set f p features X 1,X 2,...,X p,measurednn bservatins, and a respnse Y als measured n thse same n bservatins. The gal is then t predict Y using X 1,X 2,...,X p. This chapter will instead fcus n unsupervised learning, asetfstatistical tls intended fr the setting in which we have nly a set f features X 1,X 2,...,X p measured n n bservatins. We are nt interested in predictin, because we d nt have an assciated respnse variable Y. Rather, the gal is t discver interesting things abut the measurements n X 1,X 2,...,X p. Is there an infrmative way t visualize the data? Can we discver subgrups amng the variables r amng the bservatins? Unsupervised learning refers t a diverse set f techniques fr answering questins such as these. In this chapter, we will fcus n tw particular types f unsupervised learning: principal cmpnents analysis, atl used fr data visualizatin r data pre-prcessing befre supervised techniques are applied, and clustering, a brad class f methds fr discvering unknwn subgrups in data The Challenge f Unsupervised Learning Supervised learning is a well-understd area. In fact, if yu have read the preceding chapters in this bk, then yu shuld by nw have a gd G. James et al., An Intrductin t Statistical Learning: with Applicatins in R, Springer Texts in Statistics, DOI / , Springer Science+Business Media New Yrk

389 Unsupervised Learning grasp f supervised learning. Fr instance, if yu are asked t predict a binary utcme frm a data set, yu have a very well develped set f tls at yur dispsal (such as lgistic regressin, linear discriminant analysis, classificatin trees, supprt vectr machines, and mre) as well as a clear understanding f hw t assess the quality f the results btained (using crss-validatin, validatin n an independent test set, and s frth). In cntrast, unsupervised learning is ften much mre challenging. The exercise tends t be mre subjective, and there is n simple gal fr the analysis, such as predictin f a respnse. Unsupervised learning is ften perfrmed as part f an explratry data analysis. Furthermre,itcanbe explratry hard t assess the results btained frm unsupervised learning methds, data analysis since there is n universally accepted mechanism fr perfrming crssvalidatin r validating results n an independent data set. The reasn fr this difference is simple. If we fit a predictive mdel using a supervised learning technique, then it is pssible t check ur wrk by seeing hw well ur mdel predicts the respnse Y n bservatins nt used in fitting the mdel. Hwever, in unsupervised learning, there is n way t check ur wrk because we dn t knw the true answer the prblem is unsupervised. Techniques fr unsupervised learning are f grwing imprtance in a number f fields. A cancer researcher might assay gene expressin levels in 100 patients with breast cancer. He r she might then lk fr subgrups amng the breast cancer samples, r amng the genes, in rder t btain a better understanding f the disease. An nline shpping site might try t identify grups f shppers with similar brwsing and purchase histries, as well as items that are f particular interest t the shppers within each grup. Then an individual shpper can be preferentially shwn the items in which he r she is particularly likely t be interested, based n the purchase histries f similar shppers. A search engine might chse what search results t display t a particular individual based n the click histries f ther individuals with similar search patterns. These statistical learning tasks, and many mre, can be perfrmed via unsupervised learning techniques Principal Cmpnents Analysis Principal cmpnents are discussed in Sectin in the cntext f principal cmpnents regressin. When faced with a large set f crrelated variables, principal cmpnents allw us t summarize this set with a smaller number f representative variables that cllectively explain mst f the variability in the riginal set. The principal cmpnent directins are presented in Sectin as directins in feature space alng which the riginal data are highly variable. These directins als define lines and subspaces that are as clse as pssible t the data clud. T perfrm

390 10.2 Principal Cmpnents Analysis 375 principal cmpnents regressin, we simply use principal cmpnents as predictrs in a regressin mdel in place f the riginal larger set f variables. Principal cmpnent analysis (PCA) refers t the prcess by which prin- principal cipal cmpnents are cmputed, and the subsequent use f these cmpnents in understanding the data. PCA is an unsupervised apprach, since it invlves nly a set f features X 1,X 2,...,X p, and n assciated respnse Y. Apart frm prducing derived variables fr use in supervised learning prblems, PCA als serves as a tl fr data visualizatin (visualizatin f the bservatins r visualizatin f the variables). We nw discuss PCA in greater detail, fcusing n the use f PCA as a tl fr unsupervised data explratin, in keeping with the tpic f this chapter. cmpnent analysis What Are Principal Cmpnents? Suppse that we wish t visualize n bservatins with measurements n a set f p features, X 1,X 2,...,X p, as part f an explratry data analysis. We culd d this by examining tw-dimensinal scatterplts f the data, each f which cntains the n bservatins measurements n tw f the features. Hwever, there are ( p 2) = p(p 1)/2 such scatterplts; fr example, with p =10thereare45plts!Ifp is large, then it will certainly nt be pssible t lk at all f them; mrever, mst likely nne f them will be infrmative since they each cntain justasmallfractinfthettal infrmatin present in the data set. Clearly, a better methd is required t visualize the n bservatins when p is large. In particular, we wuld like t find a lw-dimensinal representatin f the data that captures as much f the infrmatin as pssible. Fr instance, if we can btain a tw-dimensinal representatin f the data that captures mst f the infrmatin, then we can plt the bservatins in this lw-dimensinal space. PCA prvides a tl t d just this. It finds a lw-dimensinal representatin f a data set that cntains as much as pssible f the variatin. The idea is that each f the n bservatins lives in p-dimensinal space, but nt all f these dimensins are equally interesting. PCA seeks a small number f dimensins that are as interesting as pssible, where the cncept f interesting is measured by the amunt that the bservatins vary alng each dimensin. Each f the dimensins fund by PCA is a linear cmbinatin f the p features. We nw explain the manner in which these dimensins, r principal cmpnents, are fund. The first principal cmpnent f a set f features X 1,X 2,...,X p is the nrmalized linear cmbinatin f the features Z 1 = φ 11 X 1 + φ 21 X φ p1 X p (10.1) that has the largest variance. By nrmalized, wemeanthat p j=1 φ2 j1 =1. We refer t the elements φ 11,...,φ p1 as the ladings f the first principal lading

component; together, the loadings make up the principal component loading vector, \phi_1 = (\phi_{11}\ \phi_{21}\ \ldots\ \phi_{p1})^T. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

Given an n \times p data set \mathbf{X}, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in \mathbf{X} has been centered to have mean zero (that is, the column means of \mathbf{X} are zero). We then look for the linear combination of the sample feature values of the form

z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \cdots + \phi_{p1}x_{ip}    (10.2)

that has largest sample variance, subject to the constraint that \sum_{j=1}^{p}\phi_{j1}^2 = 1. In other words, the first principal component loading vector solves the optimization problem

\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}}\; \frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^2 = 1.    (10.3)

From (10.2) we can write the objective in (10.3) as \frac{1}{n}\sum_{i=1}^{n} z_{i1}^2. Since \frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0, the average of the z_{11},\ldots,z_{n1} will be zero as well. Hence the objective that we are maximizing in (10.3) is just the sample variance of the n values of z_{i1}. We refer to z_{11},\ldots,z_{n1} as the scores of the first principal component. Problem (10.3) can be solved via an eigen decomposition, a standard technique in linear algebra, but details are outside of the scope of this book.

There is a nice geometric interpretation for the first principal component. The loading vector \phi_1 with elements \phi_{11},\phi_{21},\ldots,\phi_{p1} defines a direction in feature space along which the data vary the most. If we project the n data points x_1,\ldots,x_n onto this direction, the projected values are the principal component scores z_{11},\ldots,z_{n1} themselves. For instance, Figure 6.14 on page 230 displays the first principal component loading vector (green solid line) on an advertising data set. In these data, there are only two features, and so the observations as well as the first principal component loading vector can be easily displayed. As can be seen from (6.19), in that data set \phi_{11} = 0.839 and \phi_{21} = 0.544.

After the first principal component Z_1 of the features has been determined, we can find the second principal component Z_2. The second principal component is the linear combination of X_1,\ldots,X_p that has maximal variance out of all linear combinations that are uncorrelated with Z_1. The second principal component scores z_{12}, z_{22},\ldots,z_{n2} take the form

z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \cdots + \phi_{p2}x_{ip},    (10.4)

392 10.2 Principal Cmpnents Analysis 377 PC1 PC2 Murder Assault UrbanPp Rape TABLE The principal cmpnent lading vectrs, φ 1 USArrests data. These are als displayed in Figure and φ 2, fr the where φ 2 is the secnd principal cmpnent lading vectr, with elements φ 12,φ 22,...,φ p2. It turns ut that cnstraining Z 2 t be uncrrelated with Z 1 is equivalent t cnstraining the directin φ 2 t be rthgnal (perpendicular) t the directin φ 1. In the example in Figure 6.14, the bservatins lie in tw-dimensinal space (since p = 2), and s nce we have fund φ 1, there is nly ne pssibility fr φ 2, which is shwn as a blue dashed line. (Frm Sectin 6.3.1, we knw that φ 12 =0.544 and φ 22 = ) But in a larger data set with p>2 variables, there are multiple distinct principal cmpnents, and they are defined in a similar manner. T find φ 2,weslve a prblem similar t (10.3) with φ 2 replacing φ 1, and with the additinal cnstraint that φ 2 is rthgnal t φ 1. 1 Once we have cmputed the principal cmpnents, we can plt them against each ther in rder t prduce lw-dimensinal views f the data. Fr instance, we can plt the scre vectr Z 1 against Z 2, Z 1 against Z 3, Z 2 against Z 3, and s frth. Gemetrically, this amunts t prjecting the riginal data dwn nt the subspace spanned by φ 1, φ 2,andφ 3,and pltting the prjected pints. We illustrate the use f PCA n the USArrests data set. Fr each f the 50 states in the United States, the data set cntains the number f arrests per 100, 000 residents fr each f three crimes: Assault, Murder, andrape. We als recrd UrbanPp (the percent f the ppulatin in each state living in urban areas). The principal cmpnent scre vectrs have length n = 50, and the principal cmpnent lading vectrs have length p =4.PCAwas perfrmed after standardizing each variable t have mean zer and standard deviatin ne. Figure 10.1 plts the first tw principal cmpnents f these data. The figure represents bth the principal cmpnent scres and the lading vectrs in a single biplt display. The ladings are als given in biplt Table In Figure 10.1, we see that the first lading vectr places apprximately equal weight n Assault, Murder, and Rape, with much less weight n 1 On a technical nte, the principal cmpnent directins φ 1, φ 2, φ 3,... are the rdered sequence f eigenvectrs f the matrix X T X, and the variances f the cmpnents are the eigenvalues. There are at mst min(n 1,p) principal cmpnents.

393 Unsupervised Learning Secnd Principal Cmpnent UrbanPp Hawaii Rhde Massachusetts Island Utah New Jersey Califrnia Cnnecticut Washingtn Clrad New Yrk Ohi Illinis Arizna Nevada Wiscnsin Minnesta Pennsylvania Oregn Rape Texas Kansas Oklahma Delaware Nebraska Missuri Iwa Indiana Michigan New Hampshire Idah Virginia New Mexic Maine Wyming Maryland Flrida rth Dakta Mntana Assault Suth Dakta Kentucky Tennessee Luisiana Arkansas Alabama Alaska Gergia VermntWest Virginia Murder Suth Carlina Nrth Carlina Mississippi First Principal Cmpnent FIGURE The first tw principal cmpnents fr the USArrests data. The blue state names represent the scres fr the first tw principal cmpnents. The range arrws indicate the first tw principal cmpnent lading vectrs (with axes n the tp and right). Fr example, the lading fr Rape n the first cmpnent is 0.54, and its lading n the secnd principal cmpnent 0.17 (the wrd Rape is centered at the pint (0.54, 0.17)). This figure is knwn as a biplt, because it displays bth the principal cmpnent scres and the principal cmpnent ladings. UrbanPp. Hence this cmpnent rughly crrespnds t a measure f verall rates f serius crimes. The secnd lading vectr places mst f its weight n UrbanPp and much less weight n the ther three features. Hence, this cmpnent rughly crrespnds t the level f urbanizatin f the state. Overall, we see that the crime-related variables (Murder, Assault, andrape) are lcated clse t each ther, and that the UrbanPp variable is far frm the ther three. This indicates that the crime-related variables are crrelated with each ther states with high murder rates tend t have high assault and rape rates and that the UrbanPp variable is less crrelated with the ther three.
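For readers who want to reproduce a figure like 10.1, the following is a minimal sketch, not part of the text at this point, using the base R function prcomp() on the built-in USArrests data; setting scale = TRUE standardizes each variable to have standard deviation one, matching the analysis described above.

> pr.out <- prcomp(USArrests, scale = TRUE)
> pr.out$rotation            # the loading vectors (columns are phi_1, ..., phi_4)
> head(pr.out$x)             # principal component scores for the first few states
> biplot(pr.out, scale = 0)  # biplot of the first two principal components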

We can examine differences between the states via the two principal component score vectors shown in Figure 10.1. Our discussion of the loading vectors suggests that states with large positive scores on the first component, such as California, Nevada, and Florida, have high crime rates, while states like North Dakota, with negative scores on the first component, have low crime rates. California also has a high score on the second component, indicating a high level of urbanization, while the opposite is true for states like Mississippi. States close to zero on both components, such as Indiana, have approximately average levels of both crime and urbanization.

10.2.2 Another Interpretation of Principal Components

The first two principal component loading vectors in a simulated three-dimensional data set are shown in the left-hand panel of Figure 10.2; these two loading vectors span a plane along which the observations have the highest variance.

In the previous section, we describe the principal component loading vectors as the directions in feature space along which the data vary the most, and the principal component scores as projections along these directions. However, an alternative interpretation for principal components can also be useful: principal components provide low-dimensional linear surfaces that are closest to the observations. We expand upon that interpretation here.

The first principal component loading vector has a very special property: it is the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness). This interpretation can be seen in the left-hand panel of Figure 6.15; the dashed lines indicate the distance between each observation and the first principal component loading vector. The appeal of this interpretation is clear: we seek a single dimension of the data that lies as close as possible to all of the data points, since such a line will likely provide a good summary of the data.

The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component. For instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance. An example is shown in the left-hand panel of Figure 10.2. The first three principal components of a data set span the three-dimensional hyperplane that is closest to the n observations, and so forth.

Using this interpretation, together the first M principal component score vectors and the first M principal component loading vectors provide the best M-dimensional approximation (in terms of Euclidean distance) to the ith observation x_ij. This representation can be written

x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}    (10.5)

(assuming the original data matrix X is column-centered). In other words, together the M principal component score vectors and M principal component loading vectors can give a good approximation to the data when M is sufficiently large. When M = min(n - 1, p), then the representation is exact: x_{ij} = \sum_{m=1}^{M} z_{im} \phi_{jm}.

FIGURE 10.2. Ninety observations simulated in three dimensions. Left: the first two principal component directions span the plane that best fits the data. It minimizes the sum of squared distances from each point to the plane. Right: the first two principal component score vectors give the coordinates of the projection of the 90 observations onto the plane. The variance in the plane is maximized.
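A short numerical check of (10.5), ours rather than the book's, is easy to carry out in R: reconstruct the centered and scaled USArrests matrix from the first M score and loading vectors and see how close the approximation is, and verify that it becomes exact when M = min(n - 1, p). Object names below are arbitrary.

# Rank-M approximation of the (centered, scaled) data matrix (a sketch)
X  <- scale(USArrests)                               # column-centered, scaled data
pr <- prcomp(X)                                      # PCA of the already-standardized data
M  <- 2
approx_M <- pr$x[, 1:M] %*% t(pr$rotation[, 1:M])    # sum over m of z_im * phi_jm
mean((X - approx_M)^2)                               # small, but not zero, for M = 2
M  <- min(nrow(X) - 1, ncol(X))                      # here M = 4 = p
exact <- pr$x[, 1:M] %*% t(pr$rotation[, 1:M])
max(abs(X - exact))                                  # essentially zero: the representation is exact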

10.2.3 More on PCA

Scaling the Variables

We have already mentioned that before PCA is performed, the variables should be centered to have mean zero. Furthermore, the results obtained when we perform PCA will also depend on whether the variables have been individually scaled (each multiplied by a different constant). This is in contrast to some other supervised and unsupervised learning techniques, such as linear regression, in which scaling the variables has no effect. (In linear regression, multiplying a variable by a factor of c will simply lead to multiplication of the corresponding coefficient estimate by a factor of 1/c, and thus will have no substantive effect on the model obtained.) For instance, Figure 10.1 was obtained after scaling each of the variables to have standard deviation one. This is reproduced in the left-hand plot in Figure 10.3.

Why does it matter that we scaled the variables? In these data, the variables are measured in different units; Murder, Rape, and Assault are reported as the number of occurrences per 100,000 people, and UrbanPop is the percentage of the state's population that lives in an urban area. These four variables have variance 18.97, 87.73, 6945.16, and 209.5, respectively. Consequently, if we perform PCA on the unscaled variables, then the first principal component loading vector will have a very large loading for Assault, since that variable has by far the highest variance. The right-hand plot in Figure 10.3 displays the first two principal components for the USArrests data set, without scaling the variables to have standard deviation one. As predicted, the first principal component loading vector places almost all of its weight on Assault, while the second principal component loading vector places almost all of its weight on UrbanPop. Comparing this to the left-hand plot, we see that scaling does indeed have a substantial effect on the results obtained.

However, this result is simply a consequence of the scales on which the variables were measured. For instance, if Assault were measured in units of the number of occurrences per 100 people (rather than number of occurrences per 100,000 people), then this would amount to dividing all of the elements of that variable by 1,000. Then the variance of the variable would be tiny, and so the first principal component loading vector would have a very small value for that variable. Because it is undesirable for the principal components obtained to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before we perform PCA.

FIGURE 10.3. Two principal component biplots for the USArrests data. Left: the same as Figure 10.1, with the variables scaled to have unit standard deviations. Right: principal components using unscaled data. Assault has by far the largest loading on the first principal component because it has the highest variance among the four variables. In general, scaling the variables to have standard deviation one is recommended.
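The effect described above is easy to see directly from prcomp() output. The following brief sketch (ours, not the book's) compares the first loading vector with and without scaling.

# Scaled versus unscaled PCA on USArrests (a sketch)
apply(USArrests, 2, var)                                    # Assault's variance dwarfs the others
round(prcomp(USArrests, scale = TRUE)$rotation[, 1], 2)     # roughly equal weights on the crime variables
round(prcomp(USArrests, scale = FALSE)$rotation[, 1], 2)    # nearly all weight on Assault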

In certain settings, however, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA. For instance, suppose that the variables in a given data set correspond to expression levels for p genes. Then since expression is measured in the same units for each gene, we might choose not to scale the genes to each have standard deviation one.

Uniqueness of the Principal Components

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. (Consider Figure 6.14: the principal component loading vector is a line that extends in either direction, and flipping its sign would have no effect.) Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of -Z. It is worth noting that when we use (10.5) to approximate x_ij we multiply z_im by φ_jm. Hence, if the sign is flipped on both the loading and score vectors, the final product of the two quantities is unchanged.

The Proportion of Variance Explained

In Figure 10.2, we performed PCA on a three-dimensional data set (left-hand panel) and projected the data onto the first two principal component loading vectors in order to obtain a two-dimensional view of the data (i.e. the principal component score vectors; right-hand panel). We see that this two-dimensional representation of the three-dimensional data does successfully capture the major pattern in the data: the orange, green, and cyan observations that are near each other in three-dimensional space remain nearby in the two-dimensional representation. Similarly, we have seen on the USArrests data set that we can summarize the 50 observations and 4 variables using just the first two principal component score vectors and the first two principal component loading vectors.

We can now ask a natural question: how much of the information in a given data set is lost by projecting the observations onto the first few principal components? That is, how much of the variance in the data is not contained in the first few principal components? More generally, we are interested in knowing the proportion of variance explained (PVE) by each principal component. The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,    (10.6)

and the variance explained by the mth principal component is

\frac{1}{n} \sum_{i=1}^{n} z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2.    (10.7)

Therefore, the PVE of the mth principal component is given by

\frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.    (10.8)

The PVE of each principal component is a positive quantity. In order to compute the cumulative PVE of the first M principal components, we can simply sum (10.8) over each of the first M PVEs. In total, there are min(n - 1, p) principal components, and their PVEs sum to one.

In the USArrests data, the first principal component explains 62.0% of the variance in the data, and the next principal component explains 24.7% of the variance. Together, the first two principal components explain almost 87% of the variance in the data, and the last two principal components explain only 13% of the variance. This means that Figure 10.1 provides a pretty accurate summary of the data using just two dimensions. The PVE of each principal component, as well as the cumulative PVE, is shown in Figure 10.4. The left-hand panel is known as a scree plot, and will be discussed next.

FIGURE 10.4. Left: a scree plot depicting the proportion of variance explained by each of the four principal components in the USArrests data. Right: the cumulative proportion of variance explained by the four principal components in the USArrests data.
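Formula (10.8) can be evaluated directly in R. The following minimal sketch (ours, not part of the text) computes the PVE by hand for the USArrests data and compares it with the squared standard deviations reported by prcomp(); the object names are arbitrary.

# Proportion of variance explained, computed from (10.8) (a sketch)
X  <- scale(USArrests)                       # centered, scaled data
pr <- prcomp(X)
scores <- X %*% pr$rotation                  # z_im = sum over j of phi_jm * x_ij
pve <- colSums(scores^2) / sum(X^2)          # numerator and denominator of (10.8)
pve                                          # about 0.62, 0.25, 0.09, 0.04
pr$sdev^2 / sum(pr$sdev^2)                   # the same proportions, from prcomp()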

Deciding How Many Principal Components to Use

In general, an n × p data matrix X has min(n - 1, p) distinct principal components. However, we usually are not interested in all of them; rather, we would like to use just the first few principal components in order to visualize or interpret the data. In fact, we would like to use the smallest number of principal components required to get a good understanding of the data. How many principal components are needed? Unfortunately, there is no single (or simple!) answer to this question.

We typically decide on the number of principal components required to visualize the data by examining a scree plot, such as the one shown in the left-hand panel of Figure 10.4. We choose the smallest number of principal components that are required in order to explain a sizable amount of the variation in the data. This is done by eyeballing the scree plot, and looking for a point at which the proportion of variance explained by each subsequent principal component drops off. This is often referred to as an elbow in the scree plot. For instance, by inspection of Figure 10.4, one might conclude that a fair amount of variance is explained by the first two principal components, and that there is an elbow after the second component. After all, the third principal component explains less than ten percent of the variance in the data, and the fourth principal component explains less than half that and so is essentially worthless.

However, this type of visual analysis is inherently ad hoc. Unfortunately, there is no well-accepted objective way to decide how many principal components are enough. In fact, the question of how many principal components are enough is inherently ill-defined, and will depend on the specific area of application and the specific data set. In practice, we tend to look at the first few principal components in order to find interesting patterns in the data. If no interesting patterns are found in the first few principal components, then further principal components are unlikely to be of interest. Conversely, if the first few principal components are interesting, then we typically continue to look at subsequent principal components until no further interesting patterns are found. This is admittedly a subjective approach, and is reflective of the fact that PCA is generally used as a tool for exploratory data analysis.

On the other hand, if we compute principal components for use in a supervised analysis, such as the principal components regression presented in Section 6.3.1, then there is a simple and objective way to determine how many principal components to use: we can treat the number of principal component score vectors to be used in the regression as a tuning parameter to be selected via cross-validation or a related approach. The comparative simplicity of selecting the number of principal components for a supervised analysis is one manifestation of the fact that supervised analyses tend to be more clearly defined and more objectively evaluated than unsupervised analyses.
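As a hedged illustration of the supervised route just described, the sketch below (ours, not the book's lab) fits principal components regression with the number of components chosen by cross-validation. It assumes the pls package is installed and uses the Hitters data from the ISLR package purely as an example response/predictor set; any regression data set would do.

# Choosing the number of components by cross-validation in PCR (a sketch)
library(pls)
library(ISLR)
Hitters <- na.omit(Hitters)                  # drop rows with missing Salary
set.seed(1)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")   # pick the number of components near the minimum CV error
summary(pcr.fit)                             # CV error for each number of components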

Other Uses for Principal Components

We saw in Section 6.3.1 that we can perform regression using the principal component score vectors as features. In fact, many statistical techniques, such as regression, classification, and clustering, can be easily adapted to use the n × M matrix whose columns are the first M ≪ p principal component score vectors, rather than using the full n × p data matrix. This can lead to less noisy results, since it is often the case that the signal (as opposed to the noise) in a data set is concentrated in its first few principal components.

10.3 Clustering Methods

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Of course, to make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

For instance, suppose that we have a set of n observations, each with p features. The n observations could correspond to tissue samples for patients with breast cancer, and the p features could correspond to measurements collected for each tissue sample; these could be clinical measurements, such as tumor stage or grade, or they could be gene expression measurements. We may have a reason to believe that there is some heterogeneity among the n tissue samples; for instance, perhaps there are a few different unknown subtypes of breast cancer. Clustering could be used to find these subgroups. This is an unsupervised problem because we are trying to discover structure (in this case, distinct clusters) on the basis of a data set. The goal in supervised problems, on the other hand, is to try to predict some outcome vector such as survival time or response to drug treatment.

Both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different: PCA looks to find a low-dimensional representation of the observations that explain a good fraction of the variance, while clustering looks to find homogeneous subgroups among the observations.

Another application of clustering arises in marketing. We may have access to a large number of measurements (e.g. median household income, occupation, distance from nearest urban area, and so forth) for a large number of people.

Our goal is to perform market segmentation by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product. The task of performing market segmentation amounts to clustering the people in the data set.

Since clustering is popular in many fields, there exist a great number of clustering methods. In this section we focus on perhaps the two best-known clustering approaches: K-means clustering and hierarchical clustering. In K-means clustering, we seek to partition the observations into a pre-specified number of clusters. On the other hand, in hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n. There are advantages and disadvantages to each of these clustering approaches, which we highlight in this chapter.

In general, we can cluster observations on the basis of the features in order to identify subgroups among the observations, or we can cluster features on the basis of the observations in order to discover subgroups among the features. In what follows, for simplicity we will discuss clustering observations on the basis of the features, though the converse can be performed by simply transposing the data matrix.

10.3.1 K-Means Clustering

K-means clustering is a simple and elegant approach for partitioning a data set into K distinct, non-overlapping clusters. To perform K-means clustering, we must first specify the desired number of clusters K; then the K-means algorithm will assign each observation to exactly one of the K clusters. Figure 10.5 shows the results obtained from performing K-means clustering on a simulated example consisting of 150 observations in two dimensions, using three different values of K.

The K-means clustering procedure results from a simple and intuitive mathematical problem. We begin by defining some notation. Let C1, ..., CK denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

1. C1 ∪ C2 ∪ ... ∪ CK = {1, ..., n}. In other words, each observation belongs to at least one of the K clusters.

2. Ck ∩ Ck′ = ∅ for all k ≠ k′. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

For instance, if the ith observation is in the kth cluster, then i ∈ Ck. The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.

FIGURE 10.5. A simulated data set with 150 observations in two-dimensional space. Panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary. These cluster labels were not used in clustering; instead, they are the outputs of the clustering procedure.

The within-cluster variation for cluster Ck is a measure W(Ck) of the amount by which the observations within a cluster differ from each other. Hence we want to solve the problem

\underset{C_1,\ldots,C_K}{\operatorname{minimize}} \left\{ \sum_{k=1}^{K} W(C_k) \right\}.    (10.9)

In words, this formula says that we want to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.

Solving (10.9) seems like a reasonable idea, but in order to make it actionable we need to define the within-cluster variation. There are many possible ways to define this concept, but by far the most common choice involves squared Euclidean distance. That is, we define

W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,    (10.10)

where |Ck| denotes the number of observations in the kth cluster. In other words, the within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster. Combining (10.9) and (10.10) gives the optimization problem that defines K-means clustering,

\underset{C_1,\ldots,C_K}{\operatorname{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}.    (10.11)
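As a concrete check of (10.10), the small sketch below (ours, not from the text) computes W(Ck) for one cluster of a kmeans() fit on simulated data. It comes out equal to twice the within-cluster sum of squares that kmeans() reports, a fact explained by the identity (10.12) that follows. Object names are arbitrary.

# Within-cluster variation W(C_k) computed from pairwise distances (a sketch)
set.seed(2)
x  <- matrix(rnorm(50 * 2), ncol = 2)
km <- kmeans(x, 3, nstart = 20)
k  <- 1
obs <- x[km$cluster == k, , drop = FALSE]
W_k <- sum(as.matrix(dist(obs))^2) / nrow(obs)   # (1/|C_k|) * sum of pairwise squared distances
c(W_k, 2 * km$withinss[k])                       # the two quantities agree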

Now, we would like to find an algorithm to solve (10.11), that is, a method to partition the observations into K clusters such that the objective of (10.11) is minimized. This is in fact a very difficult problem to solve precisely, since there are almost K^n ways to partition n observations into K clusters. This is a huge number unless K and n are tiny! Fortunately, a very simple algorithm can be shown to provide a local optimum (a pretty good solution) to the K-means optimization problem (10.11). This approach is laid out in Algorithm 10.1.

Algorithm 10.1 K-Means Clustering

1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.

2. Iterate until the cluster assignments stop changing:

(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.

(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).

Algorithm 10.1 is guaranteed to decrease the value of the objective (10.11) at each step. To understand why, the following identity is illuminating:

\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,    (10.12)

where \bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij} is the mean for feature j in cluster Ck. In Step 2(a) the cluster means for each feature are the constants that minimize the sum-of-squared deviations, and in Step 2(b), reallocating the observations can only improve (10.12). This means that as the algorithm is run, the clustering obtained will continually improve until the result no longer changes; the objective of (10.11) will never increase. When the result no longer changes, a local optimum has been reached. Figure 10.6 shows the progression of the algorithm on the toy example from Figure 10.5. K-means clustering derives its name from the fact that in Step 2(a), the cluster centroids are computed as the mean of the observations assigned to each cluster.

Because the K-means algorithm finds a local rather than a global optimum, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1 of Algorithm 10.1. For this reason, it is important to run the algorithm multiple times from different random initial configurations.
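For intuition only, here is a bare-bones sketch of Algorithm 10.1 written by us (it is not the book's code). It does not handle empty clusters or other edge cases, and in practice one would simply call kmeans(); the function and object names are made up.

# A minimal implementation of Algorithm 10.1 (sketch; no handling of empty clusters)
simple_kmeans <- function(x, K) {
  cluster <- sample(1:K, nrow(x), replace = TRUE)          # Step 1: random initial assignment
  repeat {
    centroids <- apply(x, 2, tapply, cluster, mean)        # Step 2(a): K x p matrix of cluster means
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:K), 1:K] # observation-to-centroid distances
    new_cluster <- apply(d, 1, which.min)                  # Step 2(b): reassign to the closest centroid
    if (all(new_cluster == cluster)) break                 # stop when assignments no longer change
    cluster <- new_cluster
  }
  cluster
}

set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
table(simple_kmeans(x, 3))                                 # cluster sizes for one random start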

FIGURE 10.6. The progress of the K-means algorithm on the example of Figure 10.5 with K = 3. Top left: the observations are shown. Top center: in Step 1 of the algorithm, each observation is randomly assigned to a cluster. Top right: in Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random. Bottom left: in Step 2(b), each observation is assigned to the nearest centroid. Bottom center: Step 2(a) is once again performed, leading to new cluster centroids. Bottom right: the results obtained after ten iterations.

Then one selects the best solution, i.e. that for which the objective (10.11) is smallest. Figure 10.7 shows the local optima obtained by running K-means clustering six times using six different initial cluster assignments, using the toy data from Figure 10.5. In this case, the best clustering is the one with the smallest objective value.

As we have seen, to perform K-means clustering, we must decide how many clusters we expect in the data. The problem of selecting K is far from simple. This issue, along with other practical considerations that arise in performing K-means clustering, is addressed in Section 10.3.3.

FIGURE 10.7. K-means clustering performed six times on the data from Figure 10.5 with K = 3, each time with a different random assignment of the observations in Step 1 of the K-means algorithm. Above each plot is the value of the objective (10.11). Three different local optima were obtained, one of which resulted in a smaller value of the objective and provides better separation between the clusters. Those labeled in red all achieved the same best solution, with the smallest value of the objective.

10.3.2 Hierarchical Clustering

One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters K. Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K. Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.

In this section, we describe bottom-up or agglomerative clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram (generally depicted as an upside-down tree; see Figure 10.9) is built starting from the leaves and combining clusters up to the trunk.

FIGURE 10.8. Forty-five observations generated in two-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.

We will begin with a discussion of how to interpret a dendrogram and then discuss how hierarchical clustering is actually performed, that is, how the dendrogram is built.

Interpreting a Dendrogram

We begin with the simulated data set shown in Figure 10.8, consisting of 45 observations in two-dimensional space. The data were generated from a three-class model; the true class labels for each observation are shown in distinct colors. However, suppose that the data were observed without the class labels, and that we wanted to perform hierarchical clustering of the data. Hierarchical clustering (with complete linkage, to be discussed later) yields the result shown in the left-hand panel of Figure 10.9. How can we interpret this dendrogram?

In the left-hand panel of Figure 10.9, each leaf of the dendrogram represents one of the 45 observations in Figure 10.8. However, as we move up the tree, some leaves begin to fuse into branches. These correspond to observations that are similar to each other. As we move higher up the tree, branches themselves fuse, either with leaves or other branches. The earlier (lower in the tree) fusions occur, the more similar the groups of observations are to each other. On the other hand, observations that fuse later (near the top of the tree) can be quite different. In fact, this statement can be made precise: for any two observations, we can look for the point in the tree where branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different the two observations are.

FIGURE 10.9. Left: dendrogram obtained from hierarchically clustering the data from Figure 10.8 with complete linkage and Euclidean distance. Center: the dendrogram from the left-hand panel, cut at a height of nine (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors. Right: the dendrogram from the left-hand panel, now cut at a height of five. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.

Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree will tend to be quite different.

This highlights a very important point in interpreting dendrograms that is often misunderstood. Consider the left-hand panel of Figure 10.10, which shows a simple dendrogram obtained from hierarchically clustering nine observations. One can see that observations 5 and 7 are quite similar to each other, since they fuse at the lowest point on the dendrogram. Observations 1 and 6 are also quite similar to each other. However, it is tempting but incorrect to conclude from the figure that observations 9 and 2 are quite similar to each other on the basis that they are located near each other on the dendrogram. In fact, based on the information contained in the dendrogram, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7. (This can be seen from the right-hand panel of Figure 10.10, in which the raw data are displayed.) To put it mathematically, there are 2^{n-1} possible reorderings of the dendrogram, where n is the number of leaves. This is because at each of the n - 1 points where fusions occur, the positions of the two fused branches could be swapped without affecting the meaning of the dendrogram. Therefore, we cannot draw conclusions about the similarity of two observations based on their proximity along the horizontal axis. Rather, we draw conclusions about the similarity of two observations based on the location on the vertical axis where branches containing those two observations first are fused.

FIGURE 10.10. An illustration of how to properly interpret a dendrogram with nine observations in two-dimensional space. Left: a dendrogram generated using Euclidean distance and complete linkage. Observations 5 and 7 are quite similar to each other, as are observations 1 and 6. However, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, even though observations 9 and 2 are close together in terms of horizontal distance. This is because observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, approximately 1.8. Right: the raw data used to generate the dendrogram can be used to confirm that indeed, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7.

Now that we understand how to interpret the left-hand panel of Figure 10.9, we can move on to the issue of identifying clusters on the basis of a dendrogram. In order to do this, we make a horizontal cut across the dendrogram, as shown in the center and right-hand panels of Figure 10.9. The distinct sets of observations beneath the cut can be interpreted as clusters. In the center panel of Figure 10.9, cutting the dendrogram at a height of nine results in two clusters, shown in distinct colors. In the right-hand panel, cutting the dendrogram at a height of five results in three clusters. Further cuts can be made as one descends the dendrogram in order to obtain any number of clusters, between 1 (corresponding to no cut) and n (corresponding to a cut at height 0, so that each observation is in its own cluster). In other words, the height of the cut to the dendrogram serves the same role as the K in K-means clustering: it controls the number of clusters obtained.

Figure 10.9 therefore highlights a very attractive aspect of hierarchical clustering: one single dendrogram can be used to obtain any number of clusters. In practice, people often look at the dendrogram and select by eye a sensible number of clusters, based on the heights of the fusions and the number of clusters desired. In the case of Figure 10.9, one might choose to select either two or three clusters. However, often the choice of where to cut the dendrogram is not so clear.

The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height. However, on an arbitrary data set, this assumption of hierarchical structure might be unrealistic. For instance, suppose that our observations correspond to a group of people with a 50-50 split of males and females, evenly split among Americans, Japanese, and French. We can imagine a scenario in which the best division into two groups might split these people by gender, and the best division into three groups might split them by nationality. In this case, the true clusters are not nested, in the sense that the best division into three groups does not result from taking the best division into two groups and splitting up one of those groups. Consequently, this situation could not be well-represented by hierarchical clustering. Due to situations such as this one, hierarchical clustering can sometimes yield worse (i.e. less accurate) results than K-means clustering for a given number of clusters.

The Hierarchical Clustering Algorithm

The hierarchical clustering dendrogram is obtained via an extremely simple algorithm. We begin by defining some sort of dissimilarity measure between each pair of observations. Most often, Euclidean distance is used; we will discuss the choice of dissimilarity measure later in this chapter. The algorithm proceeds iteratively. Starting out at the bottom of the dendrogram, each of the n observations is treated as its own cluster. The two clusters that are most similar to each other are then fused so that there now are n - 1 clusters. Next the two clusters that are most similar to each other are fused again, so that there now are n - 2 clusters. The algorithm proceeds in this fashion until all of the observations belong to one single cluster, and the dendrogram is complete. Figure 10.11 depicts the first few steps of the algorithm, for the data from Figure 10.10. To summarize, the hierarchical clustering algorithm is given in Algorithm 10.2.

This algorithm seems simple enough, but one issue has not been addressed. Consider the bottom right panel in Figure 10.11. How did we determine that the cluster {5, 7} should be fused with the cluster {8}? We have a concept of the dissimilarity between pairs of observations, but how do we define the dissimilarity between two clusters if one or both of the clusters contains multiple observations? The concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations. This extension is achieved by developing the notion of linkage, which defines the dissimilarity between two groups of observations. The four most common types of linkage (complete, average, single, and centroid) are briefly described in Table 10.2. Average, complete, and single linkage are most popular among statisticians.

Algorithm 10.2 Hierarchical Clustering

1. Begin with n observations and a measure (such as Euclidean distance) of all the \binom{n}{2} = n(n - 1)/2 pairwise dissimilarities. Treat each observation as its own cluster.

2. For i = n, n - 1, ..., 2:

(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.

(b) Compute the new pairwise inter-cluster dissimilarities among the i - 1 remaining clusters.

Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.

Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.

TABLE 10.2. A summary of the four most commonly-used types of linkage in hierarchical clustering.

Average and complete linkage are generally preferred over single linkage, as they tend to yield more balanced dendrograms. Centroid linkage is often used in genomics, but suffers from a major drawback in that an inversion can occur, whereby two clusters are fused at a height below either of the individual clusters in the dendrogram. This can lead to difficulties in visualization as well as in interpretation of the dendrogram. The dissimilarities computed in Step 2(b) of the hierarchical clustering algorithm will depend on the type of linkage used, as well as on the choice of dissimilarity measure. Hence, the resulting dendrogram typically depends quite strongly on the type of linkage used, as is shown in Figure 10.12.

FIGURE 10.11. An illustration of the first few steps of the hierarchical clustering algorithm, using the data from Figure 10.10, with complete linkage and Euclidean distance. Top Left: initially, there are nine distinct clusters, {1}, {2}, ..., {9}. Top Right: the two clusters that are closest together, {5} and {7}, are fused into a single cluster. Bottom Left: the two clusters that are closest together, {6} and {1}, are fused into a single cluster. Bottom Right: the two clusters that are closest together using complete linkage, {8} and the cluster {5, 7}, are fused into a single cluster.

Choice of Dissimilarity Measure

Thus far, the examples in this chapter have used Euclidean distance as the dissimilarity measure. But sometimes other dissimilarity measures might be preferred. For example, correlation-based distance considers two observations to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.

FIGURE 10.12. Average, complete, and single linkage applied to an example data set. Average and complete linkage tend to yield more balanced clusters.

This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations. Figure 10.13 illustrates the difference between Euclidean and correlation-based distance. Correlation-based distance focuses on the shapes of observation profiles rather than their magnitudes.

The choice of dissimilarity measure is very important, as it has a strong effect on the resulting dendrogram. In general, careful attention should be paid to the type of data being clustered and the scientific question at hand. These considerations should determine what type of dissimilarity measure is used for hierarchical clustering.

For instance, consider an online retailer interested in clustering shoppers based on their past shopping histories. The goal is to identify subgroups of similar shoppers, so that shoppers within each subgroup can be shown items and advertisements that are particularly likely to interest them. Suppose the data takes the form of a matrix where the rows are the shoppers and the columns are the items available for purchase; the elements of the data matrix indicate the number of times a given shopper has purchased a given item (i.e. a 0 if the shopper has never purchased this item, a 1 if the shopper has purchased it once, etc.). What type of dissimilarity measure should be used to cluster the shoppers? If Euclidean distance is used, then shoppers who have bought very few items overall (i.e. infrequent users of the online shopping site) will be clustered together. This may not be desirable. On the other hand, if correlation-based distance is used, then shoppers with similar preferences (e.g. shoppers who have bought items A and B but never items C or D) will be clustered together, even if some shoppers with these preferences are higher-volume shoppers than others. Therefore, for this application, correlation-based distance may be a better choice.

FIGURE 10.13. Three observations with measurements on 20 variables are shown. Observations 1 and 3 have similar values for each variable and so there is a small Euclidean distance between them. But they are very weakly correlated, so they have a large correlation-based distance. On the other hand, observations 1 and 2 have quite different values for each variable, and so there is a large Euclidean distance between them. But they are highly correlated, so there is a small correlation-based distance between them.

In addition to carefully selecting the dissimilarity measure used, one must also consider whether or not the variables should be scaled to have standard deviation one before the dissimilarity between the observations is computed. To illustrate this point, we continue with the online shopping example just described. Some items may be purchased more frequently than others; for instance, a shopper might buy ten pairs of socks a year, but a computer very rarely. High-frequency purchases like socks therefore tend to have a much larger effect on the inter-shopper dissimilarities, and hence on the clustering ultimately obtained, than rare purchases like computers. This may not be desirable. If the variables are scaled to have standard deviation one before the inter-observation dissimilarities are computed, then each variable will in effect be given equal importance in the hierarchical clustering performed. We might also want to scale the variables to have standard deviation one if they are measured on different scales; otherwise, the choice of units (e.g. centimeters versus kilometers) for a particular variable will greatly affect the dissimilarity measure obtained. It should come as no surprise that whether or not it is a good decision to scale the variables before computing the dissimilarity measure depends on the application at hand. An example is shown in Figure 10.14. We note that the issue of whether or not to scale the variables before performing clustering applies to K-means clustering as well.
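A tiny sketch of this point in R, ours rather than the book's, using made-up purchase counts for three hypothetical shoppers: on the raw counts the sock column dominates the Euclidean dissimilarities, while after scale() the rare computer purchase also matters.

# Effect of scaling on inter-observation dissimilarities (a sketch with invented numbers)
purchases <- rbind(shopper1 = c(socks = 8,  computers = 0),
                   shopper2 = c(socks = 11, computers = 0),
                   shopper3 = c(socks = 7,  computers = 2))
dist(purchases)          # shopper 3 appears closest to shopper 1, driven by similar sock counts
dist(scale(purchases))   # after scaling, the computer purchase pushes shopper 3 farther away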

FIGURE 10.14. An eclectic online retailer sells two items: socks and computers. Left: the number of pairs of socks, and computers, purchased by eight online shoppers is displayed. Each shopper is shown in a different color. If inter-observation dissimilarities are computed using Euclidean distance on the raw variables, then the number of socks purchased by an individual will drive the dissimilarities obtained, and the number of computers purchased will have little effect. This might be undesirable, since (1) computers are more expensive than socks and so the online retailer may be more interested in encouraging shoppers to buy computers than socks, and (2) a large difference in the number of socks purchased by two shoppers may be less informative about the shoppers' overall shopping preferences than a small difference in the number of computers purchased. Center: the same data is shown, after scaling each variable by its standard deviation. Now the number of computers purchased will have a much greater effect on the inter-observation dissimilarities obtained. Right: the same data are displayed, but now the y-axis represents the number of dollars spent by each online shopper on socks and on computers. Since computers are much more expensive than socks, now computer purchase history will drive the inter-observation dissimilarities obtained.

10.3.3 Practical Issues in Clustering

Clustering can be a very useful tool for data analysis in the unsupervised setting. However, there are a number of issues that arise in performing clustering. We describe some of these issues here.

Small Decisions with Big Consequences

In order to perform clustering, some decisions must be made.

- Should the observations or features first be standardized in some way? For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one.

- In the case of hierarchical clustering:
  - What dissimilarity measure should be used?
  - What type of linkage should be used?
  - Where should we cut the dendrogram in order to obtain clusters?

- In the case of K-means clustering, how many clusters should we look for in the data?

Each of these decisions can have a strong impact on the results obtained. In practice, we try several different choices, and look for the one with the most useful or interpretable solution. With these methods, there is no single right answer: any solution that exposes some interesting aspects of the data should be considered.

Validating the Clusters Obtained

Any time clustering is performed on a data set we will find clusters. But we really want to know whether the clusters that have been found represent true subgroups in the data, or whether they are simply a result of clustering the noise. For instance, if we were to obtain an independent set of observations, then would those observations also display the same set of clusters? This is a hard question to answer. There exist a number of techniques for assigning a p-value to a cluster in order to assess whether there is more evidence for the cluster than one would expect due to chance. However, there has been no consensus on a single best approach. More details can be found in Hastie et al. (2009).

Other Considerations in Clustering

Both K-means and hierarchical clustering will assign each observation to a cluster. However, sometimes this might not be appropriate. For instance, suppose that most of the observations truly belong to a small number of (unknown) subgroups, and a small subset of the observations are quite different from each other and from all other observations. Then since K-means and hierarchical clustering force every observation into a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster. Mixture models are an attractive approach for accommodating the presence of such outliers. These amount to a soft version of K-means clustering, and are described in Hastie et al. (2009).

In addition, clustering methods generally are not very robust to perturbations to the data. For instance, suppose that we cluster n observations, and then cluster the observations again after removing a subset of the n observations at random. One would hope that the two sets of clusters obtained would be quite similar, but often this is not the case!
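A rough sketch of this robustness check, ours and not from the text: cluster a data set, recluster after dropping a random subset of observations, and cross-tabulate the assignments of the observations that were kept. Object names and the simulated data are purely illustrative.

# Reclustering a random subset to probe robustness (a sketch)
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
full <- kmeans(x, 3, nstart = 20)
keep <- sample(1:100, 80)                 # drop 20 observations at random
sub  <- kmeans(x[keep, ], 3, nstart = 20)
table(full$cluster[keep], sub$cluster)    # cluster labels are arbitrary, so look for rows
                                          # and columns that nearly match up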

A Tempered Approach to Interpreting the Results of Clustering

We have described some of the issues associated with clustering. However, clustering can be a very useful and valid statistical tool if used properly. We mentioned that small decisions in how clustering is performed, such as how the data are standardized and what type of linkage is used, can have a large effect on the results. Therefore, we recommend performing clustering with different choices of these parameters, and looking at the full set of results in order to see what patterns consistently emerge. Since clustering can be non-robust, we recommend clustering subsets of the data in order to get a sense of the robustness of the clusters obtained. Most importantly, we must be careful about how the results of a clustering analysis are reported. These results should not be taken as the absolute truth about a data set. Rather, they should constitute a starting point for the development of a scientific hypothesis and further study, preferably on an independent data set.

10.4 Lab 1: Principal Components Analysis

In this lab, we perform PCA on the USArrests data set, which is part of the base R package. The rows of the data set contain the 50 states, in alphabetical order.

> states=row.names(USArrests)
> states

The columns of the data set contain the four variables.

> names(USArrests)
[1] "Murder"   "Assault"  "UrbanPop" "Rape"

We first briefly examine the data. We notice that the variables have vastly different means.

> apply(USArrests, 2, mean)
  Murder  Assault UrbanPop     Rape
   7.788  170.760   65.540   21.232

Note that the apply() function allows us to apply a function, in this case the mean() function, to each row or column of the data set. The second input here denotes whether we wish to compute the mean of the rows, 1, or the columns, 2. We see that there are on average three times as many rapes as murders, and more than eight times as many assaults as rapes. We can also examine the variances of the four variables using the apply() function.

> apply(USArrests, 2, var)
    Murder    Assault   UrbanPop       Rape
  18.97047 6945.16571  209.51878   87.72916

Not surprisingly, the variables also have vastly different variances: the UrbanPop variable measures the percentage of the population in each state living in an urban area, which is not a comparable number to the number of rapes in each state per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the Assault variable, since it has by far the largest mean and variance. Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.

We now perform principal components analysis using the prcomp() function, which is one of several functions in R that perform PCA.

> pr.out=prcomp(USArrests, scale=TRUE)

By default, the prcomp() function centers the variables to have mean zero. By using the option scale=TRUE, we scale the variables to have standard deviation one. The output from prcomp() contains a number of useful quantities.

> names(pr.out)
[1] "sdev"     "rotation" "center"   "scale"    "x"

The center and scale components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA.

> pr.out$center
  Murder  Assault UrbanPop     Rape
   7.788  170.760   65.540   21.232
> pr.out$scale
   Murder   Assault  UrbanPop      Rape
 4.355510 83.337661 14.474763  9.366385

The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector.²

> pr.out$rotation
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

We see that there are four distinct principal components. This is to be expected because there are in general min(n - 1, p) informative principal components in a data set with n observations and p variables.

² This function names it the rotation matrix, because when we matrix-multiply the X matrix by pr.out$rotation, it gives us the coordinates of the data in the rotated coordinate system. These coordinates are the principal component scores.

Using the prcomp() function, we do not need to explicitly multiply the data by the principal component loading vectors in order to obtain the principal component score vectors. Rather, the 50 × 4 matrix x has as its columns the principal component score vectors. That is, the kth column is the kth principal component score vector.

> dim(pr.out$x)
[1] 50  4

We can plot the first two principal components as follows:

> biplot(pr.out, scale=0)

The scale=0 argument to biplot() ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations. Notice that this figure is a mirror image of Figure 10.1. Recall that the principal components are only unique up to a sign change, so we can reproduce Figure 10.1 by making a few small changes:

> pr.out$rotation=-pr.out$rotation
> pr.out$x=-pr.out$x
> biplot(pr.out, scale=0)

The prcomp() function also outputs the standard deviation of each principal component. For instance, on the USArrests data set, we can access these standard deviations as follows:

> pr.out$sdev
[1] 1.5748783 0.9948694 0.5971291 0.4164494

The variance explained by each principal component is obtained by squaring these:

> pr.var=pr.out$sdev^2
> pr.var
[1] 2.4802416 0.9897652 0.3565632 0.1734301

To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:

> pve=pr.var/sum(pr.var)
> pve
[1] 0.62006039 0.24744129 0.08914080 0.04335752

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE explained by each component, as well as the cumulative PVE, as follows:

> plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type="b")
> plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type="b")

The result is shown in Figure 10.4. Note that the function cumsum() computes the cumulative sum of the elements of a numeric vector. For instance:

> a=c(1,2,8,-3)
> cumsum(a)
[1]  1  3 11  8

10.5 Lab 2: Clustering

10.5.1 K-Means Clustering

The function kmeans() performs K-means clustering in R. We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

> set.seed(2)
> x=matrix(rnorm(50*2), ncol=2)
> x[1:25,1]=x[1:25,1]+3
> x[1:25,2]=x[1:25,2]-4

We now perform K-means clustering with K = 2.

> km.out=kmeans(x,2,nstart=20)

The cluster assignments of the 50 observations are contained in km.out$cluster.

> km.out$cluster

The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment.

> plot(x, col=(km.out$cluster+1), main="K-Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)

Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables then we could instead perform PCA and plot the first two principal components score vectors.

In this example, we knew that there really were two clusters because we generated the data. However, for real data, in general we do not know the true number of clusters. We could instead have performed K-means clustering on this example with K = 3.

> set.seed(4)
> km.out=kmeans(x,3,nstart=20)
> km.out
K-means clustering with 3 clusters of sizes 10, 23, 17

Cluster means:
        [,1]    [,2]

Clustering vector:

Within cluster sum of squares by cluster:
 (between_SS / total_SS =  79.3 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"

> plot(x, col=(km.out$cluster+1), main="K-Means Clustering Results with K=3", xlab="", ylab="", pch=20, cex=2)

When K = 3, K-means clustering splits up the two clusters.

To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering will be performed using multiple random assignments in Step 1 of Algorithm 10.1, and the kmeans() function will report only the best results. Here we compare using nstart=1 to nstart=20.

> set.seed(3)
> km.out=kmeans(x,3,nstart=1)
> km.out$tot.withinss
> km.out=kmeans(x,3,nstart=20)
> km.out$tot.withinss

Note that km.out$tot.withinss is the total within-cluster sum of squares, which we seek to minimize by performing K-means clustering (Equation 10.11). The individual within-cluster sums of squares are contained in the vector km.out$withinss.

We strongly recommend always running K-means clustering with a large value of nstart, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.

When performing K-means clustering, in addition to using multiple initial cluster assignments, it is also important to set a random seed using the set.seed() function. This way, the initial cluster assignments in Step 1 can be replicated, and the K-means output will be fully reproducible.

10.5.2 Hierarchical Clustering

The hclust() function implements hierarchical clustering in R. In the following example we use the data from Section 10.5.1 to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure. We begin by clustering observations using complete linkage. The dist() function is used to compute the inter-observation Euclidean distance matrix.

> hc.complete=hclust(dist(x), method="complete")

We could just as easily perform hierarchical clustering with average or single linkage instead:

> hc.average=hclust(dist(x), method="average")
> hc.single=hclust(dist(x), method="single")

We can now plot the dendrograms obtained using the usual plot() function. The numbers at the bottom of the plot identify each observation.

> par(mfrow=c(1,3))
> plot(hc.complete, main="Complete Linkage", xlab="", sub="", cex=.9)
> plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
> plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function:

> cutree(hc.complete, 2)
> cutree(hc.average, 2)
> cutree(hc.single, 2)

For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.

> cutree(hc.single, 4)

To scale the variables before performing hierarchical clustering of the observations, we use the scale() function:

> xsc=scale(x)
> plot(hclust(dist(xsc), method="complete"), main="Hierarchical Clustering with Scaled Features")
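Since these simulated data have a known two-group structure (the first 25 observations were shifted), a quick check of ours, not in the text, is to tabulate each cut against the true group labels; the true.groups name below is our own.

# Comparing dendrogram cuts to the known simulated groups (a sketch)
true.groups <- rep(1:2, c(25, 25))          # first 25 observations were mean-shifted
table(cutree(hc.complete, 2), true.groups)  # complete linkage recovers the two groups
table(cutree(hc.single, 2), true.groups)    # single linkage isolates a single point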

Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix. However, this only makes sense for data with at least three features since the absolute correlation between any two observations with measurements on two features is always 1. Hence, we will cluster a three-dimensional data set.

> x=matrix(rnorm(30*3), ncol=3)
> dd=as.dist(1-cor(t(x)))
> plot(hclust(dd, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")

10.6 Lab 3: NCI60 Data Example

Unsupervised techniques are often used in the analysis of genomic data. In particular, PCA and hierarchical clustering are popular tools. We illustrate these techniques on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines.

> library(ISLR)
> nci.labs=NCI60$labs
> nci.data=NCI60$data

Each cell line is labeled with a cancer type. We do not make use of the cancer types in performing PCA and clustering, as these are unsupervised techniques. But after performing PCA and clustering, we will check to see the extent to which these cancer types agree with the results of these unsupervised techniques.

The data has 64 rows and 6,830 columns.

> dim(nci.data)
[1]   64 6830

We begin by examining the cancer types for the cell lines.

> nci.labs[1:4]
[1] "CNS"   "CNS"   "CNS"   "RENAL"
> table(nci.labs)
nci.labs
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA MCF7A-repro
          7           5           7           1           1           6           1
MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE       RENAL     UNKNOWN
          1           8           9           6           2           9           1

10.6.1 PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes.

> pr.out=prcomp(nci.data, scale=TRUE)

We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other. We first create a simple function that assigns a distinct color to each element of a numeric vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.

> Cols=function(vec){
+    cols=rainbow(length(unique(vec)))
+    return(cols[as.numeric(as.factor(vec))])
+ }

Note that the rainbow() function takes as its argument a positive integer, and returns a vector containing that number of distinct colors. We now can plot the principal component score vectors.

> par(mfrow=c(1,2))
> plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19, xlab="Z1", ylab="Z2")
> plot(pr.out$x[,c(1,3)], col=Cols(nci.labs), pch=19, xlab="Z1", ylab="Z3")

The resulting plots are shown in Figure 10.15.

FIGURE 10.15. Projections of the NCI60 cancer cell lines onto the first three principal components (in other words, the scores for the first three principal components). On the whole, observations belonging to a single cancer type tend to lie near each other in this low-dimensional space. It would not have been possible to visualize the data without using a dimension reduction method such as PCA, since based on the full data set there are (6,830 choose 2) possible scatterplots, none of which would have been particularly informative.

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.
We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the summary() method for a prcomp object (we have truncated the printout):

> summary(pr.out)
Importance of components:
                         PC1    PC2    PC3    PC4    PC5
Standard deviation
Proportion of Variance
Cumulative Proportion

Using the plot() function, we can also plot the variance explained by the first few principal components.

> plot(pr.out)

Note that the height of each bar in the bar plot is given by squaring the corresponding element of pr.out$sdev. However, it is more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little work.

> pve=100*pr.out$sdev^2/sum(pr.out$sdev^2)
> par(mfrow=c(1,2))
> plot(pve, type="o", ylab="PVE", xlab="Principal Component", col="blue")
> plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="brown3")

(Note that the elements of pve can also be computed directly from the summary, summary(pr.out)$importance[2,], and the elements of cumsum(pve) are given by summary(pr.out)$importance[3,].) The resulting plots are shown in Figure 10.16.

FIGURE 10.16. The PVE of the principal components of the NCI60 cancer cell line microarray data set. Left: the PVE of each principal component is shown. Right: the cumulative PVE of the principal components is shown. Together, all principal components explain 100 % of the variance.

We see that together, the first seven principal components explain around 40 % of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explain a substantial amount of variance, there is a marked decrease in the variance explained by further principal components. That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (though even examining seven principal components may be difficult).
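The parenthetical note above is easy to confirm numerically; a minimal sketch, using only objects already created in this lab:

> # Check that pve agrees with the "Proportion of Variance" row of summary(),
> # up to any rounding that summary() applies when storing these values.
> max(abs(pve - 100*summary(pr.out)$importance[2,]))
> max(abs(cumsum(pve) - 100*summary(pr.out)$importance[3,]))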

10.6.2 Clustering the Observations of the NCI60 Data

We now proceed to hierarchically cluster the cell lines in the NCI60 data, with the goal of finding out whether or not the observations cluster into distinct types of cancer. To begin, we standardize the variables to have mean zero and standard deviation one. As mentioned earlier, this step is optional and should be performed only if we want each gene to be on the same scale.

> sd.data=scale(nci.data)

We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure.

> par(mfrow=c(1,3))
> data.dist=dist(sd.data)
> plot(hclust(data.dist), labels=nci.labs, main="Complete Linkage", xlab="", sub="", ylab="")
> plot(hclust(data.dist, method="average"), labels=nci.labs, main="Average Linkage", xlab="", sub="", ylab="")
> plot(hclust(data.dist, method="single"), labels=nci.labs, main="Single Linkage", xlab="", sub="", ylab="")

The results are shown in Figure 10.17.

FIGURE 10.17. The NCI60 cancer cell line microarray data, clustered with average, complete, and single linkage, and using Euclidean distance as the dissimilarity measure. Complete and average linkage tend to yield evenly sized clusters whereas single linkage tends to yield extended clusters to which single leaves are fused one by one.

We see that the choice of linkage certainly does affect the results obtained. Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one. On the other hand, complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage. Clearly cell lines within a single cancer type do tend to cluster together, although the clustering is not perfect. We will use complete linkage hierarchical clustering for the analysis that follows.
We can cut the dendrogram at the height that will yield a particular number of clusters, say four:

> hc.out=hclust(dist(sd.data))
> hc.clusters=cutree(hc.out,4)
> table(hc.clusters,nci.labs)

There are some clear patterns. All the leukemia cell lines fall in cluster 3, while the breast cancer cell lines are spread out over three different clusters. We can plot the cut on the dendrogram that produces these four clusters:

> par(mfrow=c(1,1))
> plot(hc.out, labels=nci.labs)
> abline(h=139, col="red")

The abline() function draws a straight line on top of any existing plot in R. The argument h=139 plots a horizontal line at height 139 on the dendrogram; this is the height that results in four distinct clusters. It is easy to verify that the resulting clusters are the same as the ones we obtained using cutree(hc.out,4).
Printing the output of hclust gives a useful brief summary of the object:

> hc.out

Call:
hclust(d = dist(dat))

Cluster method   : complete
Distance         : euclidean
Number of objects: 64

We claimed earlier in the chapter that K-means clustering and hierarchical clustering with the dendrogram cut to obtain the same number of clusters can yield very different results. How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?

> set.seed(2)
> km.out=kmeans(sd.data, 4, nstart=20)
> km.clusters=km.out$cluster
> table(km.clusters,hc.clusters)
           hc.clusters
km.clusters

We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different. Cluster 2 in K-means clustering is identical to cluster 3 in hierarchical clustering.

However, the other clusters differ: for instance, cluster 4 in K-means clustering contains a portion of the observations assigned to cluster 1 by hierarchical clustering, as well as all of the observations assigned to cluster 2 by hierarchical clustering.
Rather than performing hierarchical clustering on the entire data matrix, we can simply perform hierarchical clustering on the first few principal component score vectors, as follows:

> hc.out=hclust(dist(pr.out$x[,1:5]))
> plot(hc.out, labels=nci.labs, main="Hier. Clust. on First Five Score Vectors")
> table(cutree(hc.out,4), nci.labs)

Not surprisingly, these results are different from the ones that we obtained when we performed hierarchical clustering on the full data set. Sometimes performing clustering on the first few principal component score vectors can give better results than performing clustering on the full data. In this situation, we might view the principal component step as one of denoising the data. We could also perform K-means clustering on the first few principal component score vectors rather than the full data set.

10.7 Exercises

Conceptual

1. This problem involves the K-means clustering algorithm.
(a) Prove (10.12).
(b) On the basis of this identity, argue that the K-means clustering algorithm (Algorithm 10.1) decreases the objective (10.11) at each iteration.

2. Suppose that we have four observations, for which we compute a dissimilarity matrix. For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?
(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.

3. In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p = 2 features. The observations are as follows.

Obs.   X1   X2

(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the cluster labels obtained.

4. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

(b) At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?

5. In words, describe the results that you would expect if you performed K-means clustering of the eight shoppers in Figure 10.14, on the basis of their sock and computer purchases, with K = 2. Give three answers, one for each of the variable scalings displayed. Explain.

6. A researcher collects expression measurements for 1,000 genes in 100 tissue samples. The data can be written as a 1,000 × 100 matrix, which we call X, in which each row represents a gene and each column a tissue sample. Each tissue sample was processed on a different day, and the columns of X are ordered so that the samples that were processed earliest are on the left, and the samples that were processed later are on the right. The tissue samples belong to two groups: control (C) and treatment (T). The C and T samples were processed in a random order across the days. The researcher wishes to determine whether each gene's expression measurements differ between the treatment and control groups.
As a pre-analysis (before comparing T versus C), the researcher performs a principal component analysis of the data, and finds that the first principal component (a vector of length 100) has a strong linear trend from left to right, and explains 10 % of the variation. The researcher now remembers that each patient sample was run on one of two machines, A and B, and machine A was used more often in the earlier times while B was used more often later. The researcher has a record of which sample was run on which machine.
(a) Explain what it means that the first principal component explains 10 % of the variation.
(b) The researcher decides to replace the (i, j)th element of X with x_ij − z_i1 φ_j1, where z_i1 is the ith score, and φ_j1 is the jth loading, for the first principal component. He will then perform a two-sample t-test on each gene in this new data set in order to determine whether its expression differs between the two conditions. Critique this idea, and suggest a better approach. (A short numerical sketch of this replacement operation follows this exercise.)
(c) Design and run a small simulation experiment to demonstrate the superiority of your idea.
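For concreteness, the replacement described in 6(b) can be expressed with prcomp() output. The following is a minimal sketch on a small random matrix standing in for the expression data (the names X, pr, X1, and Xnew are illustrative, and PCA is run without centering so that the algebra matches the formula exactly):

> set.seed(1)
> X=matrix(rnorm(100*10), nrow=100)      # toy stand-in: 100 "genes" by 10 "samples"
> pr=prcomp(X, center=FALSE)
> X1=outer(pr$x[,1], pr$rotation[,1])    # the rank-one piece z_i1 * phi_j1
> Xnew=X - X1                            # replace x_ij by x_ij - z_i1 phi_j1
> max(abs(Xnew %*% pr$rotation[,1]))     # essentially zero: no variation left along PC1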

Applied

7. In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let r_ij denote the correlation between the ith and jth observations, then the quantity 1 − r_ij is proportional to the squared Euclidean distance between the ith and jth observations.
On the USArrests data, show that this proportionality holds. (A minimal sketch of one way to carry out this check appears after Exercise 9 below.)
Hint: The Euclidean distance can be calculated using the dist() function, and correlations can be calculated using the cor() function.

8. In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function.
On the USArrests data, calculate PVE in two ways:
(a) Using the sdev output of the prcomp() function, as was done in Section 10.2.3.
(b) By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.
These two approaches should give the same results.
Hint: You will only obtain the same results in (a) and (b) if the same data is used in both cases. For instance, if in (a) you performed prcomp() using centered and scaled variables, then you must center and scale the variables before applying Equation 10.3 in (b).

9. Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
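The check in Exercise 7 can be set up in a few lines. The sketch below is one possible approach (the object names dat, r, d2, and ratio are illustrative): standardize each observation, then compare 1 − r_ij with the squared Euclidean distances.

> # Standardize each observation (row) of USArrests; scale() works on columns,
> # so transpose, scale, and transpose back.
> dat=t(scale(t(USArrests)))
> r=cor(t(dat))                       # correlations between observations (rows)
> d2=as.matrix(dist(dat))^2           # squared Euclidean distances between rows
> ratio=(1-r)/d2                      # should be (nearly) constant off the diagonal
> summary(ratio[upper.tri(ratio)])    # essentially constant (1/6 here, since p = 4)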

10. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
(d) Perform K-means clustering with K = 2. Describe your results.
(e) Now perform K-means clustering with K = 4, and describe your results.
(f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

11. On the book website, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

(a) Load in the data using read.csv(). You will need to select header=F.
(b) Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used? (A starting sketch for parts (a) and (b) appears after part (c).)
(c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
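A minimal sketch of one way to begin parts (a) and (b), assuming Ch10Ex11.csv has been downloaded from the book website into the working directory (the object names genes and dd are illustrative):

> genes=read.csv("Ch10Ex11.csv", header=F)   # 1,000 genes (rows) by 40 samples (columns)
> dd=as.dist(1-cor(genes))                   # correlation-based distance between samples
> plot(hclust(dd, method="complete"), main="Complete Linkage", xlab="", sub="")
> # Re-running the last two lines with method="average" or method="single"
> # shows whether the two-group separation depends on the linkage used.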

434 Index C p, 78, 205, 206, R 2, 68 71, 79 80, 103, 212 l 2 nrm, 216 l 1 nrm, 219 additive, 12, 86 90, 104 additivity, 282, 283 adjusted R 2, 78, 205, 206, Advertising data set, 15, 16, 20, 59, 61 63, 68, 69, 71 76, 79, 81, 82, 87, 88, agglmerative clustering, 390 Akaike infrmatin criterin, 78, 205, 206, alternative hypthesis, 67 analysis f variance, 290 area under the curve, 147 argument, 42 AUC, 147 Aut data set, 14, 48, 49, 56, 90 93, 121, 122, 171, , 180, 182, 191, , 299, 371 backfitting, 284, 300 backward stepwise selectin, 79, , 247 bagging, 12, 26, 303, , baseline, 86 basis functin, 270, 273 Bayes classifier, 37 40, 139 decisin bundary, 140 errr, Bayes therem, 138, 139, 226 Bayesian, Bayesian infrmatin criterin, 78, 205, 206, best subset selectin, 205, 221, bias, 33 36, 65, 82 bias-variance decmpsitin, 34 trade-ff, 33 37, 42, 105, 149, 217, 230, 239, 243, 278, 307, 347, 357 binary, 28, 130 biplt, 377, 378 G. James et al., An Intrductin t Statistical Learning: with Applicatins in R, Springer Texts in Statistics, DOI / , Springer Science+Business Media New Yrk

435 420 Index Blean, 159 bsting, 12, 25, 26, 303, 316, , btstrap, 12, 175, , 316 Bstn data set, 14, 56, 110, 113, 126, 173, 201, 264, 299, 327, 328, 330, 333 bttm-up clustering, 390 bxplt, 50 branch, 305 Caravan data set, 14, 165, 335 Carseats data set, 14, 117, 123, 324, 333 categrical, 3, 28 classificatin, 3, 12, 28 29, 37 42, , errr rate, 311 tree, , classifier, 127 cluster analysis, clustering, 4, 26 28, K-means, 12, agglmerative, 390 bttm-up, 390 hierarchical, 386, cefficient, 61 Cllege data set, 14, 54, 263, 300 cllinearity, cnditinal prbability, 37 cnfidence interval, 66 67, 81, 82, 103, 268 cnfunding, 136 cnfusin matrix, 145, 158 cntinuus, 3 cntur plt, 46 cntrast, 86 crrelatin, 70, 74, 396 Credit data set, 83, 84, 86, 89, 90, crss-entrpy, , 332 crss-validatin, 12, 33, 36, , 205, 227, k-fld, leave-ne-ut, curse f dimensinality, 108, 168, data frame, 48 Data sets Advertising, 15, 16, 20, 59, 61 63, 68, 69, 71 76, 79, 81, 82, 87, 88, Aut, 14, 48, 49, 56, 90 93, 121, 122, 171, , 180, 182, 191, , 299, 371 Bstn, 14, 56, 110, 113, 126, 173, 201, 264, 299, 327, 328, 330, 333 Caravan, 14, 165, 335 Carseats, 14, 117, 123, 324, 333 Cllege, 14, 54, 263, 300 Credit, 83, 84, 86, 89, 90, Default, 14, , , 198, 199 Heart, 312, 313, , 354, 355 Hitters, 14, 244, 251, 255, 256, 304, 305, 310, 311, 334 Incme, 16 18, Khan, 14, 366 NCI60, 4, 5, 14, 407, OJ, 14, 334, 371 Prtfli, 14, 194 Smarket, 3, 14, 154, 161, 163, 171 USArrests, 14, 377, 378,

436 Index 421 Wage, 1, 2, 9, 10, 14, 267, 269, 271, 272, , 280, 281, 283, 284, 286, 287, 299 Weekly, 14, 171, 200 decisin tree, 12, Default data set, 14, , , 198, 199 degrees f freedm, 32, 241, 271, 272, 278 dendrgram, 386, density functin, 138 dependent variable, 15 derivative, 272, 278 deviance, 206 dimensin reductin, 204, discriminant functin, 141 dissimilarity, distance crrelatin-based, , 416 Euclidean, 379, 387, 388, 394, duble-expnential distributin, 227 dummy variable, 82 86, 130, 134, 269 effective degrees f freedm, 278 elbw, 409 errr irreducible, 18, 32 rate, 37 reducible, 18 term, 16 Euclidean distance, 379, 387, 388, 394, , 416 expected value, 19 explratry data analysis, 374 F-statistic, 75 factr, 84 false discvery prprtin, 147 false negative, 147 false psitive, 147 false psitive rate, 147, 149, 354 feature, 15 feature selectin, 204 Fisher s linear discriminant, 141 fit, 21 fitted value, 93 flexible, 22 fr lp, 193 frward stepwise selectin, 78, , 247 functin, 42 Gaussian (nrmal) distributin, 138, 139, generalized additive mdel, 6, 26, 265, 266, , 294 generalized linear mdel, 6, 156, 192 Gini index, , 319, 332 Heart data set, 312, 313, , 354, 355 heatmap, 47 Msticity,95 heterscedasticity, hierarchical clustering, dendrgram, inversin, 395 linkage, hierarchical principle, 89 high-dimensinal, 78, 208, 239 hinge lss, 357 histgram, 50 Hitters data set, 14, 244, 251, 255, 256, 304, 305, 310, 311, 334 hld-ut set, 176 hyperplane, hypthesis test, 67 68, 75, 95 Incme data set, 16 18, independent variable, 15 indicatr functin, 268

437 422 Index inference, 17, 19 inner prduct, 351 input variable, 15 integral, 278 interactin, 60, 81, 87 90, 104, 286 intercept, 61, 63 interpretability, 203 inversin, 395 irreducible errr, 18, 39, 82, 103 K-means clustering, 12, K-nearest neighbrs classifier, 12, 38 40, 127 regressin, kernel, , 356, 367 linear, 352 nn-linear, plynmial, 352, 354 radial, , 363 kernel trick, 351 Khan data set, 14, 366 knt, 266, 271, Laplace distributin, 227 lass, 12, 25, , , 309, 357 leaf, 305, 391 least squares, 6, 21, 61 63, 133, 203 line, 63 weighted, 96 level, 84 leverage, likelihd functin, 133 linear, 2, 86 linear cmbinatin, 121, 204, 229, 375 linear discriminant analysis, 6, 12, 127, 130, , 348, 354 linear kernel, 352 linear mdel, 20, 21, 59 linear regressin, 6, 12 multiple, simple, linkage, , 410 average, centrid, cmplete, 391, single, lcal regressin, 266, 294 lgistic functin, 132 lgistic regressin, 6, 12, 26, 127, , , 349, multiple, lgit, 132, 286, 291 lss functin, 277, 357 lw-dimensinal, 238 main effects, 88, 89 majrity vte, 317 Mallw s C p, 78, 205, 206, margin, 341, 357 matrix multiplicatin, 12 maximal margin classifier, hyperplane, 341 maximum likelihd, , 135 mean squared errr, 29 misclassificatin errr, 37 missing data, 49 mixed selectin, 79 mdel assessment, 175 mdel selectin, 175 M cllinearity, 101 multicllinearity, 243 multivariate Gaussian, multivariate nrmal, natural spline, 274, 278, 293 NCI60 data set, 4, 5, 14, 407, negative predictive value, 147, 149 nde

438 Index 423 internal, 305 purity, terminal, 305 nise, 22, 228 nn-linear, 2, 12, decisin bundary, kernel, nn-parametric, 21, 23 24, , 168 nrmal (Gaussian) distributin, 138, 139, null, 145 hypthesis, 67 mdel, 78, 205, 220 dds, 132, 170 OJ data set, 14, 334, 371 ne-standard-errr rule, 214 ne-versus-all, 356 ne-versus-ne, 355 ptimal separating hyperplane, 341 ptimism f training errr, 32 rdered categrical variable, 292 rthgnal, 233, 377 basis, 288 ut-f-bag, utlier, utput variable, 15 verfitting, 22, 24, 26, 32, 80, 144, 207, 341 p-value, 67 68, 73 parameter, 61 parametric, 21 23, partial least squares, 12, 230, , 258, 259 path algrithm, 224 perpendicular, 233 plynmial kernel, 352, 354 regressin, 90 92, , 271 ppulatin regressin line, 63 Prtfli data set, 14, 194 psitive predictive value, 147, 149 psterir distributin, 226 mde, 226 prbability, 139 pwer, 101, 147 precisin, 147 predictin, 17 interval, 82, 103 predictr, 15 principal cmpnents, 375 analysis, 12, , lading vectr, 375, 376 prprtin f variance explained, , 408 regressin, 12, , , , 385 scre vectr, 376 scree plt, prir distributin, 226 prbability, 138 prjectin, 204 pruning, cst cmplexity, weakest link, quadratic, 91 quadratic discriminant analysis, 4, qualitative, 3, 28, 127, 176 variable, quantitative, 3, 28, 127, 176 R functins x 2, 125 abline(), 112, 122, 301, 412 anva(), 116, 290, 291 apply(), 250, 401 as.dist(), 407 as.factr(), 50 attach(), 50

439 424 Index biplt(), 403 bt(), , 199 bs(), 293, 300 c(), 43 cbind(), 164, 289 cef(), 111, 157, 247, 251 cnfint(), 111 cntur(), 46 cntrasts(), 118, 157 cr(), 44, 122, 155, 416 cumsum(), 404 cut(), 292 cutree(), 406 cv.glm(), 192, 193, 199 cv.glmnet(), 254 cv.tree(), 326, 328, 334 data.frame(), 171, 201, 262, 324 dev.ff(), 46 dim(), 48, 49 dist(), 406, 416 fix(), 48, 54 fr(), 193 gam(), 284, 294, 296 gbm(), 330 glm(), 156, 161, 192, 199, 291 glmnet(), 251, hatvalues(), 113 hclust(), 406, 407 hist(), 50, 55 I(), 115, 289, 291, 296 identify(), 50 ifelse(), 324 image(), 46 imprtance(), 330, 333, 334 is.na(), 244 jitter(), 292 jpeg(), 46 kmeans(), 404, 405 knn(), 163, 164 lda(), 161, 163 legend(), 125 length(), 43 library(), 109, 110 lines(), 112 lm(), 110, 112, 113, 115, 116, 121, 122, 156, 161, 191, 192, 254, 256, 288, 294, 324 l(), 296 ladhistry(), 51 less(), 294 ls(), 43 matrix(), 44 mean(), 45, 158, 191, 401 median(), 171 mdel.matrix(), 251 na.mit(), 49, 244 names(), 49, 111 ns(), 293 pairs(), 50, 55 par(), 112, 289 pcr(), 256, 258 pdf(), 46 persp(), 47 plt(), 45, 46, 49, 55, 112, 122, 246, 295, 325, 360, 371, 406, 408 plt.gam(), 295 plt.svm(), 360 plsr(), 258 pints(), 246 ply(), 116, 191, , 299 prcmp(), 402, 403, 416 predict(), 111, 157, , 191, 249, 250, 252, 253, 289, 291, 292, 296, 325, 327, 361, 364, 365 print(), 172 prune.misclass(), 327 prune.tree(), 328 q(), 51 qda(), 163 quantile(), 201 rainbw(), 408 randmfrest(), 329

440 Index 425 range(), 56 read.csv(), 49, 54, 418 read.table(), 48, 49 regsubsets(), , 262 residuals(), 112 return(), 172 rm(), 43 rnrm(), 44, 45, 124, 262, 417 rstudent(), 112 runif(), 417 s(), 294 sample(), 191, 194, 414 savehistry(), 51 scale(), 165, 406, 417 sd(), 45 seq(), 46 set.seed(), 45, 191, 405 smth.spline(), 293, 294 sqrt(), 44, 45 sum(), 244 summary(), 51, 55, 113, 121, 122, 157, 196, 199, 244, 245, 256, 257, 295, 324, 325, 328, 330, 334, 360, 361, 363, 372, 408 svm(), , 365, 366 table(), 158, 417 text(), 325 title(), 289 tree(), 304, 324 tune(), 361, 364, 372 update(), 114 var(), 45 varimpplt(), 330 vif(), 114 which.max(), 113, 246 which.min(), 246 write.table(), 48 radial kernel, , 363 randm frest, 12, 303, 316, , recall, 147 receiver perating characteristic (ROC), 147, recursive binary splitting, 306, 309, 311 reducible errr, 18, 81 regressin, 3, 12, lcal, 265, 266, piecewise plynmial, 271 plynmial, , spline, 266, 270, 293 tree, , regularizatin, 204, 215 replacement, 189 resampling, residual, 62, 72 plt, 92 standard errr, 66, 68 69, 79 80, 102 studentized, 97 sum f squares, 62, 70, 72 residuals, 239, 322 respnse, 15 ridge regressin, 12, , 357 rbust, 345, 348, 400 ROC curve, 147, rug plt, 292 scale equivariant, 217 scatterplt, 49 scatterplt matrix, 50 scree plt, , 409 elbw, 384 seed, 191 semi-supervised learning, 28 sensitivity, 145, 147 separating hyperplane, shrinkage, 204, 215 penalty, 215 signal, 228 slack variable, 346 slpe, 61, 63 Smarket data set, 3, 14, 154, 161, 163, 171 smther, 286

441 426 Index smthing spline, 266, , 293 sft margin classifier, sft-threshlding, 225 sparse, 219, 228 sparsity, 219 specificity, 145, 147, 148 spline, 265, cubic, 273 linear, 273 natural, 274, 278 regressin, 266, smthing, 31, 266, thin-plate, 23 standard errr, 65, 93 standardize, 165 statistical mdel, 1 step functin, 105, 265, stepwise mdel selectin, 12, 205, 207 stump, 323 subset selectin, subtree, 308 supervised learning, 26 28, 237 supprt vectr, 342, 347, 357 classifier, 337, machine, 12, 26, regressin, 358 synergy, 60, 81, 87 90, 104 systematic, 16 t-distributin, 67, 153 t-statistic, 67 test errr, 37, 40, 158 MSE, bservatins, 30 set, 32 time series, 94 ttal sum f squares, 70 tracking, 94 train, 21 training data, 21 errr, 37, 40, 158 MSE, tree, tree-based methd, 303 true negative, 147 true psitive, 147 true psitive rate, 147, 149, 354 truncated pwer basis, 273 tuning parameter, 215 Type I errr, 147 Type II errr, 147 unsupervised learning, 26 28, 230, 237, USArrests data set, 14, 377, 378, validatin set, 176 apprach, variable, 15 dependent, 15 dummy, 82 86, imprtance, 319, 330 independent, 15 indicatr, 37 input, 15 utput, 15 qualitative, 82 86, selectin, 78, 204, 219 variance, 19, inflatin factr, , 114 varying cefficient mdel, 282 vectr, 43 Wage data set, 1, 2, 9, 10, 14, 267, 269, 271, 272, , 280, 281, 283, 284, 286, 287, 299 weakest link pruning, 308 Weekly data set, 14, 171, 200 weighted least squares, 96, 282 within class cvariance, 143 wrkspace, 51 wrapper, 289


More information

Discovering the Arts Masters of Color Teacher s Guide

Discovering the Arts Masters of Color Teacher s Guide Teacher s Guide Grade Level: 9 12 Curriculum Fcus: Fine Arts Lessn Duratin: Tw class perids Prgram Descriptin Henri Matisse s and Pabl Picass s visinary interpretatins f the visual arts changed ur understanding

More information

POLISH STANDARDS ON HEALTH AND SAFETY AS A TOOL FOR IMPLEMENTING REQUIREMENTS OF THE EUROPEAN DIRECTIVES INTO THE PRACTICE OF ENTERPRISES

POLISH STANDARDS ON HEALTH AND SAFETY AS A TOOL FOR IMPLEMENTING REQUIREMENTS OF THE EUROPEAN DIRECTIVES INTO THE PRACTICE OF ENTERPRISES POLISH STANDARDS ON HEALTH AND SAFETY AS A TOOL FOR IMPLEMENTING REQUIREMENTS OF THE EUROPEAN DIRECTIVES INTO THE PRACTICE OF ENTERPRISES M. PĘCIŁŁO Central Institute fr Labur Prtectin ul. Czerniakwska

More information