3 Supervised Learning


Supervised learning has been a great success in real-world applications. It is used in almost every domain, including text and Web domains. Supervised learning is also called classification or inductive learning in machine learning. This type of learning is analogous to human learning from past experiences to gain new knowledge in order to improve our ability to perform real-world tasks. However, since computers do not have experiences, machine learning learns from data, which are collected in the past and represent past experiences in some real-world applications.

There are several types of supervised learning tasks. In this chapter, we focus on one particular type, namely, learning a target function that can be used to predict the values of a discrete class attribute. This type of learning has been the focus of machine learning research and is perhaps also the most widely used learning paradigm in practice. This chapter introduces a number of such supervised learning techniques. They are used in almost every Web mining application. We will see their uses in Chaps. 6-12.

3.1 Basic Concepts

A data set used in the learning task consists of a set of data records, which are described by a set of attributes A = {A1, A2, ..., A|A|}, where |A| denotes the number of attributes or the size of the set A. The data set also has a special target attribute C, which is called the class attribute. In our subsequent discussions, we consider C separately from the attributes in A due to its special status, i.e., we assume that C is not in A. The class attribute C has a set of discrete values, i.e., C = {c1, c2, ..., c|C|}, where |C| is the number of classes and |C| >= 2. A class value is also called a class label. A data set for learning is simply a relational table. Each data record describes a piece of past experience. In the machine learning and data mining literature, a data record is also called an example, an instance, a case or a vector. A data set basically consists of a set of examples or instances.

Given a data set D, the objective of learning is to produce a classification/prediction function to relate values of attributes in A and classes in C. The function can be used to predict the class values/labels of the future data.

The function is also called a classification model, a predictive model or simply a classifier. We will use these terms interchangeably in this book. It should be noted that the function/model can be in any form, e.g., a decision tree, a set of rules, a Bayesian model or a hyperplane.

Example 1: Table 3.1 shows a small loan application data set. It has four attributes. The first attribute is Age, which has three possible values, young, middle and old. The second attribute is Has_job, which indicates whether an applicant has a job. Its possible values are true (has a job) and false (does not have a job). The third attribute is Own_house, which shows whether an applicant owns a house. The fourth attribute is Credit_rating, which has three possible values, fair, good and excellent. The last column is the Class attribute, which shows whether each loan application was approved (denoted by Yes) or not (denoted by No) in the past.

Table 3.1. A loan application data set

ID  Age     Has_job  Own_house  Credit_rating  Class
1   young   false    false      fair           No
2   young   false    false      good           No
3   young   true     false      good           Yes
4   young   true     true       fair           Yes
5   young   false    false      fair           No
6   middle  false    false      fair           No
7   middle  false    false      good           No
8   middle  true     true       good           Yes
9   middle  false    true       excellent      Yes
10  middle  false    true       excellent      Yes
11  old     false    true       excellent      Yes
12  old     false    true       good           Yes
13  old     true     false      good           Yes
14  old     true     false      excellent      Yes
15  old     false    false      fair           No

We want to learn a classification model from this data set that can be used to classify future loan applications. That is, when a new customer comes into the bank to apply for a loan, after inputting his/her age, whether he/she has a job, whether he/she owns a house, and his/her credit rating, the classification model should predict whether his/her loan application should be approved.
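
For readers who want to experiment with the examples in this chapter, the data of Table 3.1 can be written down directly in code. The sketch below (plain Python; the variable names are ours, not part of any library) stores the 15 examples as attribute-value records.

# The loan application data of Table 3.1, one record per training example.
ATTRIBUTES = ["Age", "Has_job", "Own_house", "Credit_rating"]

LOAN_DATA = [
    # (Age, Has_job, Own_house, Credit_rating, Class)
    ("young",  False, False, "fair",      "No"),
    ("young",  False, False, "good",      "No"),
    ("young",  True,  False, "good",      "Yes"),
    ("young",  True,  True,  "fair",      "Yes"),
    ("young",  False, False, "fair",      "No"),
    ("middle", False, False, "fair",      "No"),
    ("middle", False, False, "good",      "No"),
    ("middle", True,  True,  "good",      "Yes"),
    ("middle", False, True,  "excellent", "Yes"),
    ("middle", False, True,  "excellent", "Yes"),
    ("old",    False, True,  "excellent", "Yes"),
    ("old",    False, True,  "good",      "Yes"),
    ("old",    True,  False, "good",      "Yes"),
    ("old",    True,  False, "excellent", "Yes"),
    ("old",    False, False, "fair",      "No"),
]

# Turn each tuple into a dictionary keyed by attribute name plus the class.
examples = [dict(zip(ATTRIBUTES + ["Class"], row)) for row in LOAN_DATA]
print(len(examples), "examples,",
      sum(e["Class"] == "Yes" for e in examples), "labeled Yes")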

Our learning task is called supervised learning because the class labels (e.g., the Yes and No values of the class attribute in Table 3.1) are provided in the data. It is as if some teacher tells us the classes. This is in contrast to unsupervised learning, where the classes are not known and the learning algorithm needs to automatically generate classes. Unsupervised learning is the topic of the next chapter.

The data set used for learning is called the training data (or the training set). After a model is learned or built from the training data by a learning algorithm, it is evaluated using a set of test data (or unseen data) to assess the model accuracy. It is important to note that the test data is not used in learning the classification model. The examples in the test data usually also have class labels. That is why the test data can be used to assess the accuracy of the learned model: we can check whether the class predicted for each test case by the model is the same as the actual class of the test case. In order to learn and also to test, the available data (which has classes) for learning is usually split into two disjoint subsets, the training set (for learning) and the test set (for testing). We will discuss this further in Sect. 3.3. The accuracy of a classification model on a test set is defined as:

    Accuracy = (Number of correct classifications) / (Total number of test cases),    (1)

where a correct classification means that the learned model predicts the same class as the original class of the test case. There are also other measures that can be used. We will discuss them in Sect. 3.3.

We pause here to raise two important questions:
1. What do we mean by learning by a computer system?
2. What is the relationship between the training and the test data?

We answer the first question first. Given a data set D representing past experiences, a task T and a performance measure M, a computer system is said to learn from the data to perform the task T if after learning the system's performance on the task T improves as measured by M. In other words, the learned model or knowledge helps the system to perform the task better as compared to no learning. Learning is the process of building the model or extracting the knowledge.

We use the data set in Example 1 to explain the idea. The task is to predict whether a loan application should be approved. The performance measure M is the accuracy in Equation (1). With the data set in Table 3.1, if there is no learning, all we can do is to guess randomly or to simply take the majority class (which is the Yes class). Suppose we use the majority class and announce that every future instance or case belongs to the class Yes. If the future data are drawn from the same distribution as the existing training data in Table 3.1, the estimated classification/prediction accuracy on the future data is 9/15 = 0.6, as there are 9 Yes class examples out of the total of 15 examples in Table 3.1.
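
As a small illustration of Equation (1) and of the no-learning baseline just discussed, the following sketch (plain Python; the helper names are ours) computes the accuracy of a list of predictions, and evaluates the majority-class guess on the Table 3.1 labels.

from collections import Counter

def accuracy(predicted, actual):
    """Equation (1): correct classifications / total test cases."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Class labels of the 15 examples in Table 3.1.
labels = ["No", "No", "Yes", "Yes", "No", "No", "No", "Yes",
          "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# No-learning baseline: always predict the majority class.
majority = Counter(labels).most_common(1)[0][0]   # -> "Yes"
baseline = [majority] * len(labels)
print(accuracy(baseline, labels))                 # 9/15 = 0.6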

The question is: can we do better with learning? If the learned model can indeed improve the accuracy, then the learning is said to be effective.

The second question in fact touches the fundamental assumption of machine learning, especially the theoretical study of machine learning. The assumption is that the distribution of training examples is identical to the distribution of test examples (including future unseen examples). In practical applications, this assumption is often violated to a certain degree. Strong violations will clearly result in poor classification accuracy, which is quite intuitive because if the test data behave very differently from the training data then the learned model will not perform well on the test data. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

We now illustrate the steps of learning in Fig. 3.1 based on the preceding discussions. In step 1, a learning algorithm uses the training data to generate a classification model. This step is also called the training step or training phase. In step 2, the learned model is tested using the test set to obtain the classification accuracy. This step is called the testing step or testing phase. If the accuracy of the learned model on the test data is satisfactory, the model can be used in real-world tasks to predict classes of new cases (which do not have classes). If the accuracy is not satisfactory, we need to go back and choose a different learning algorithm and/or do some further processing of the data (this step is called data pre-processing, not shown in the figure). A practical learning task typically involves many iterations of these steps before a satisfactory model is built. It is also possible that we are unable to build a satisfactory model due to a high degree of randomness in the data or limitations of current learning algorithms.

Fig. 3.1. The basic learning process: training and testing (Step 1: the learning algorithm builds a model from the training data; Step 2: the model is applied to the test data to obtain its accuracy)

From the next section onward, we study several supervised learning algorithms, except Sect. 3.3, which focuses on model/classifier evaluation. We note that throughout the chapter we assume that the training and test data are available for learning. However, in many text and Web page related learning tasks, this is not true. Usually, we need to collect the raw data, design attributes, and compute attribute values from the raw data.

The reason is that the raw data in text and Web applications are often not suitable for learning either because their formats are not right or because there are no obvious attributes in the raw text documents or Web pages.

3.2 Decision Tree Induction

Decision tree learning is one of the most widely used techniques for classification. Its classification accuracy is competitive with other learning methods, and it is very efficient. The learned classification model is represented as a tree, called a decision tree. The techniques presented in this section are based on the C4.5 system from Quinlan [49].

Example 2: Fig. 3.2 shows a possible decision tree learned from the data in Table 3.1. The tree has two types of nodes, decision nodes (which are internal nodes) and leaf nodes. A decision node specifies some test (i.e., asks a question) on a single attribute. A leaf node indicates a class.

Age?
  young  -> Has_job?
              true  -> Yes (2/2)
              false -> No  (3/3)
  middle -> Own_house?
              true  -> Yes (3/3)
              false -> No  (2/2)
  old    -> Credit_rating?
              fair      -> No  (1/1)
              good      -> Yes (2/2)
              excellent -> Yes (2/2)

Fig. 3.2. A decision tree for the data in Table 3.1

The root node of the decision tree in Fig. 3.2 is Age, which basically asks the question: what is the age of the applicant? It has three possible answers or outcomes, which are the three possible values of Age. These three values form three tree branches/edges. The other internal nodes have the same meaning. Each leaf node gives a class value (Yes or No). (x/y) beside each class means that x out of the y training examples that reach this leaf node have the class of the leaf. For instance, the class of the left-most leaf node is Yes. Two training examples (examples 3 and 4 in Table 3.1) reach here and both of them are of class Yes.

To use the decision tree in testing, we traverse the tree top-down according to the attribute values of the given test instance until we reach a leaf node. The class of the leaf is the predicted class of the test instance.
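
To make the traversal concrete, here is a minimal sketch (our own representation, not code from C4.5 or from this book) that stores the tree of Fig. 3.2 as nested dictionaries and classifies an instance by walking from the root to a leaf.

# The decision tree of Fig. 3.2: internal nodes test one attribute,
# leaves carry a plain class label.
TREE = {
    "attribute": "Age",
    "branches": {
        "young":  {"attribute": "Has_job",
                   "branches": {True: "Yes", False: "No"}},
        "middle": {"attribute": "Own_house",
                   "branches": {True: "Yes", False: "No"}},
        "old":    {"attribute": "Credit_rating",
                   "branches": {"fair": "No", "good": "Yes", "excellent": "Yes"}},
    },
}

def predict(tree, instance):
    """Traverse the tree top-down until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree

# Example 1 of Table 3.1: a young applicant, no job, no house, fair credit.
applicant = {"Age": "young", "Has_job": False,
             "Own_house": False, "Credit_rating": "fair"}
print(predict(TREE, applicant))   # -> "No"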

Example 3: We use the tree to predict the class of the following new instance, which describes a new loan applicant.

Age    Has_job  Own_house  Credit_rating  Class
young  false    false      good           ?

Going through the decision tree, we find that the predicted class is No, as we reach the second leaf node from the left.

A decision tree is constructed by partitioning the training data so that the resulting subsets are as pure as possible. A pure subset is one that contains only training examples of a single class. If we apply all the training data in Table 3.1 to the tree in Fig. 3.2, we will see that the training examples reaching each leaf node form a subset of examples that have the same class as the class of the leaf. In fact, we can see that from the x and y values in (x/y). We will discuss the decision tree building algorithm in Sect. 3.2.1.

An interesting question is: Is the tree in Fig. 3.2 unique for the data in Table 3.1? The answer is no. In fact, there are many possible trees that can be learned from the data. For example, Fig. 3.3 gives another decision tree, which is much smaller and is also able to partition the training data perfectly according to their classes.

Own_house?
  true  -> Yes (6/6)
  false -> Has_job?
             true  -> Yes (3/3)
             false -> No  (6/6)

Fig. 3.3. A smaller tree for the data set in Table 3.1

In practice, one wants to have a small and accurate tree for many reasons. A smaller tree is more general and also tends to be more accurate (we will discuss this later). It is also easier for human users to understand. In many applications, the user's understanding of the classifier is important. For example, in some medical applications, doctors want to understand the model that classifies whether a person has a particular disease. It is not satisfactory to simply produce a classification because, without understanding why the decision is made, the doctor may not trust the system and/or does not gain useful knowledge.

It is useful to note that in both Fig. 3.2 and Fig. 3.3, the training examples that reach each leaf node all have the same class (see the values of (x/y) at each leaf node).

However, for most real-life data sets, this is usually not the case. That is, the examples that reach a particular leaf node are not all of the same class, i.e., x < y. The value of x/y is, in fact, the confidence (conf) value used in association rule mining, and x is the support count. This suggests that a decision tree can be converted to a set of if-then rules.

Yes, indeed. The conversion is done as follows: Each path from the root to a leaf forms a rule. All the decision nodes along the path form the conditions of the rule and the leaf node or the class forms the consequent. For each rule, a support and confidence can be attached. Note that in most classification systems, these two values are not provided. We add them here to see the connection between association rules and decision trees.

Example 4: The tree in Fig. 3.3 generates three rules ("," means "and"):

Own_house = true -> Class = Yes [sup = 6/15, conf = 6/6]
Own_house = false, Has_job = true -> Class = Yes [sup = 3/15, conf = 3/3]
Own_house = false, Has_job = false -> Class = No [sup = 6/15, conf = 6/6]

We can see that these rules are of the same format as association rules. However, the rules above are only a small subset of the rules that can be found in the data of Table 3.1. For instance, the decision tree in Fig. 3.3 does not find the following rule:

Age = young, Has_job = false -> Class = No [sup = 3/15, conf = 3/3]

Thus, we say that a decision tree only finds a subset of the rules that exist in the data, which is sufficient for classification. The objective of association rule mining is to find all rules subject to some minimum support and minimum confidence constraints. Thus, the two methods have different objectives. We will discuss these issues again in Sect. 3.5 when we show that association rules can be used for classification as well, which is obvious.
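
The path-to-rule conversion can be written down directly. The sketch below (reusing the nested-dictionary tree representation introduced earlier; an illustration only, not the C4.5 code) enumerates every root-to-leaf path of the tree in Fig. 3.3 and prints one if-then rule per path.

# The decision tree of Fig. 3.3.
TREE = {
    "attribute": "Own_house",
    "branches": {
        True:  "Yes",
        False: {"attribute": "Has_job",
                "branches": {True: "Yes", False: "No"}},
    },
}

def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one rule: conditions -> class."""
    if not isinstance(tree, dict):          # a leaf: emit the accumulated rule
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules

for conds, cls in tree_to_rules(TREE):
    lhs = ", ".join(f"{a} = {v}" for a, v in conds)
    print(f"{lhs} -> Class = {cls}")
# Own_house = True -> Class = Yes
# Own_house = False, Has_job = True -> Class = Yes
# Own_house = False, Has_job = False -> Class = No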

An interesting and important property of a decision tree and its resulting set of rules is that the tree paths or the rules are mutually exclusive and exhaustive. This means that every data instance is covered by a single rule (a tree path) and a single rule only. By covering a data instance, we mean that the instance satisfies the conditions of the rule.

We also say that a decision tree generalizes the data, as a tree is a smaller (more compact) description of the data, i.e., it captures the key regularities in the data. Then, the problem becomes building the best tree that is small and accurate. It turns out that finding the best tree that models the data is an NP-complete problem [6]. All existing algorithms use heuristic methods for tree building. Below, we study one of the most successful techniques.

Algorithm decisionTree(D, A, T)
1   if D contains only training examples of the same class cj in C then
2       make T a leaf node labeled with class cj;
3   elseif A is empty then
4       make T a leaf node labeled with cj, which is the most frequent class in D
5   else  // D contains examples belonging to a mixture of classes. We select a single
6         // attribute to partition D into subsets so that each subset is purer
7       p0 = impurityEval-1(D);
8       for each attribute Ai in A (= {A1, A2, ..., Ak}) do
9           pi = impurityEval-2(Ai, D)
10      endfor
11      Select Ag from {A1, A2, ..., Ak} that gives the biggest impurity reduction, computed using p0 - pi;
12      if p0 - pg < threshold then   // Ag does not significantly reduce the impurity p0
13          make T a leaf node labeled with cj, the most frequent class in D.
14      else                          // Ag is able to reduce the impurity p0
15          Make T a decision node on Ag;
16          Let the possible values of Ag be v1, v2, ..., vm. Partition D into m disjoint subsets D1, D2, ..., Dm based on the m values of Ag.
17          for each Dj in {D1, D2, ..., Dm} do
18              if Dj is not empty then
19                  create a branch (edge) node Tj for vj as a child node of T;
20                  decisionTree(Dj, A - {Ag}, Tj)   // Ag is removed
21              endif
22          endfor
23      endif
24  endif

Fig. 3.4. A decision tree learning algorithm

3.2.1 Learning Algorithm

As indicated earlier, a decision tree T simply partitions the training data set D into disjoint subsets so that each subset is as pure as possible (of the same class). The learning of a tree is typically done using the divide-and-conquer strategy that recursively partitions the data to produce the tree. At the beginning, all the examples are at the root. As the tree grows, the examples are sub-divided recursively. A decision tree learning algorithm is given in Fig. 3.4. For now, we assume that every attribute in D takes discrete values. This assumption is not necessary, as we will see later.

The stopping criteria of the recursion are in lines 1-4 in Fig. 3.4. The algorithm stops when all the training examples in the current data are of the same class, or when every attribute has been used along the current tree path.

In tree learning, each successive recursion chooses the best attribute to partition the data at the current node according to the values of the attribute. The best attribute is selected based on a function that aims to minimize the impurity after the partitioning (lines 7-11). In other words, it maximizes the purity. The key in decision tree learning is thus the choice of the impurity function, which is used in lines 7, 9 and 11 in Fig. 3.4. The recursive call of the algorithm is in line 20, which takes the subset of training examples at the node for further partitioning to extend the tree.

This is a greedy algorithm with no backtracking. Once a node is created, it will not be revised or revisited no matter what happens subsequently.

3.2.2 Impurity Function

Before presenting the impurity function, we use an example to show intuitively what the impurity function aims to do.

Example 5: Fig. 3.5 shows two possible root nodes for the data in Table 3.1.

(A) Age?
      young:  No: 3, Yes: 2
      middle: No: 2, Yes: 3
      old:    No: 1, Yes: 4

(B) Own_house?
      true:  No: 0, Yes: 6
      false: No: 6, Yes: 3

Fig. 3.5. Two possible root nodes, or two possible attributes for the root node

Fig. 3.5(A) uses Age as the root node, and Fig. 3.5(B) uses Own_house as the root node. Their possible values (or outcomes) are the branches. At each branch, we list the number of training examples of each class (No or Yes) that land or reach there. Fig. 3.5(B) is obviously a better choice for the root. From a prediction or classification point of view, Fig. 3.5(B) makes fewer mistakes than Fig. 3.5(A). In Fig. 3.5(B), when Own_house = true every example has the class Yes. When Own_house = false, if we take the majority class (the most frequent class), which is No, we make three mistakes/errors. If we look at Fig. 3.5(A), the situation is worse. If we take the majority class for each branch, we make five mistakes. Thus, we say that the impurity of the tree in Fig. 3.5(A) is higher than that of the tree in Fig. 3.5(B). To learn a decision tree, we prefer Own_house to Age as the root node. Instead of counting the number of mistakes or errors, C4.5 uses a more principled approach to perform this evaluation on every attribute in order to choose the best attribute to build the tree.

The most popular impurity functions used for decision tree learning are information gain and information gain ratio, which are used in C4.5 as two options. Let us first discuss information gain, which can be extended slightly to produce information gain ratio. The information gain measure is based on the entropy function from information theory [55]:

    entropy(D) = - sum_{j=1}^{|C|} Pr(cj) * log2 Pr(cj),                (2)

where sum_{j=1}^{|C|} Pr(cj) = 1, and Pr(cj) is the probability of class cj in data set D, which is the number of examples of class cj in D divided by the total number of examples in D. In the entropy computation, we define 0*log2 0 = 0. The unit of entropy is bit. Let us use an example to get a feeling of what this function does.

Example 6: Assume we have a data set D with only two classes, positive and negative. Let us see the entropy values for three different compositions of positive and negative examples:

1. The data set D has 50% positive examples (Pr(positive) = 0.5) and 50% negative examples (Pr(negative) = 0.5):
   entropy(D) = -0.5*log2 0.5 - 0.5*log2 0.5 = 1.

2. The data set D has 20% positive examples (Pr(positive) = 0.2) and 80% negative examples (Pr(negative) = 0.8):
   entropy(D) = -0.2*log2 0.2 - 0.8*log2 0.8 = 0.722.

3. The data set D has 100% positive examples (Pr(positive) = 1) and no negative examples (Pr(negative) = 0):
   entropy(D) = -1*log2 1 - 0*log2 0 = 0.

We can see a trend: when the data becomes purer and purer, the entropy value becomes smaller and smaller. In fact, it can be shown that for this binary case (two classes), when Pr(positive) = 0.5 and Pr(negative) = 0.5 the entropy has the maximum value, i.e., 1 bit. When all the data in D belong to one class, the entropy has the minimum value, 0 bit. It is clear that the entropy measures the amount of impurity or disorder in the data. That is exactly what we need in decision tree learning.
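
The three cases in Example 6 are easy to verify programmatically. The function below is a direct transcription of Equation (2) into plain Python (our own code, not from C4.5), taking the class distribution as a list of probabilities.

import math

def entropy(probabilities):
    """Equation (2): -sum(Pr(c_j) * log2 Pr(c_j)), with 0*log2(0) defined as 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   (maximum impurity for two classes)
print(entropy([0.2, 0.8]))   # about 0.722
print(entropy([1.0, 0.0]))   # 0.0   (a pure data set)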

We now describe the information gain measure, which uses the entropy function.

Information Gain

The idea is the following:

1. Given a data set D, we first use the entropy function (Equation (2)) to compute the impurity value of D, which is entropy(D). The impurityEval-1 function in line 7 of Fig. 3.4 performs this task.

2. Then, we want to know which attribute can reduce the impurity most if it is used to partition D. To find out, every attribute is evaluated (lines 8-10 in Fig. 3.4). Let the number of possible values of the attribute Ai be v. If we are going to use Ai to partition the data D, we will divide D into v disjoint subsets D1, D2, ..., Dv. The entropy after the partition is

       entropy_Ai(D) = sum_{j=1}^{v} (|Dj| / |D|) * entropy(Dj).            (3)

   The impurityEval-2 function in line 9 of Fig. 3.4 performs this task.

3. The information gain of attribute Ai is computed with:

       gain(D, Ai) = entropy(D) - entropy_Ai(D).                            (4)

Clearly, the gain criterion measures the reduction in impurity or disorder. The gain measure is used in line 11 of Fig. 3.4, which chooses the attribute Ag resulting in the largest reduction in impurity. If the gain of Ag is too small, the algorithm stops for the branch (line 12). Normally a threshold is used here. If choosing Ag is able to reduce the impurity significantly, Ag is employed to partition the data to extend the tree further, and so on (lines 15-21 in Fig. 3.4). The process goes on recursively by building sub-trees using D1, D2, ..., Dm (line 20). For subsequent tree extensions, we do not need Ag any more, as all training examples in each branch have the same Ag value.

Example 7: Let us compute the gain values for the attributes Age, Own_house, Has_job and Credit_rating using the whole data set D in Table 3.1, i.e., we evaluate them for the root node of a decision tree.

First, we compute the entropy of D. Since D has 6 No class training examples and 9 Yes class training examples, we have

    entropy(D) = -(6/15)*log2(6/15) - (9/15)*log2(9/15) = 0.971.

We then try Age, which partitions the data into three subsets (as Age has three possible values): D1 (with Age = young), D2 (with Age = middle), and D3 (with Age = old). Each subset has five training examples. In Fig. 3.5, we also see the number of No class examples and the number of Yes class examples in each subset (or in each branch).

entropy_Age(D) = (5/15)*entropy(D1) + (5/15)*entropy(D2) + (5/15)*entropy(D3)
              = (5/15)*0.971 + (5/15)*0.971 + (5/15)*0.722 = 0.888.

Likewise, we compute for Own_house, which partitions D into two subsets, D1 (with Own_house = true) and D2 (with Own_house = false):

entropy_Own_house(D) = (6/15)*entropy(D1) + (9/15)*entropy(D2)
                     = (6/15)*0 + (9/15)*0.918 = 0.551.

Similarly, we obtain entropy_Has_job(D) = 0.647 and entropy_Credit_rating(D) = 0.608. The gains for the attributes are:

gain(D, Age) = 0.971 - 0.888 = 0.083
gain(D, Own_house) = 0.971 - 0.551 = 0.420
gain(D, Has_job) = 0.971 - 0.647 = 0.324
gain(D, Credit_rating) = 0.971 - 0.608 = 0.363

Own_house is the best attribute for the root node. Fig. 3.5(B) shows the root node using Own_house. Since the left branch has only one class (Yes) of data, it results in a leaf node (line 1 in Fig. 3.4). For Own_house = false, further extension is needed. The process is the same as above, but we only use the subset of the data with Own_house = false, i.e., D2.
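
The computations in Example 7 can be reproduced mechanically. The following sketch (plain Python over the Table 3.1 data; the helper names are ours) evaluates entropy_A(D) and gain(D, A) for every attribute and recovers the gain values above.

import math
from collections import Counter

# (Age, Has_job, Own_house, Credit_rating, Class) rows of Table 3.1.
DATA = [
    ("young", False, False, "fair", "No"),    ("young", False, False, "good", "No"),
    ("young", True,  False, "good", "Yes"),   ("young", True,  True,  "fair", "Yes"),
    ("young", False, False, "fair", "No"),    ("middle", False, False, "fair", "No"),
    ("middle", False, False, "good", "No"),   ("middle", True,  True,  "good", "Yes"),
    ("middle", False, True, "excellent", "Yes"), ("middle", False, True, "excellent", "Yes"),
    ("old", False, True, "excellent", "Yes"), ("old", False, True, "good", "Yes"),
    ("old", True, False, "good", "Yes"),      ("old", True, False, "excellent", "Yes"),
    ("old", False, False, "fair", "No"),
]
ATTRS = ["Age", "Has_job", "Own_house", "Credit_rating"]

def entropy(rows):
    counts = Counter(row[-1] for row in rows)       # class counts, Equation (2)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    """Equation (4): entropy(D) - entropy_A(D); Equation (3) for the second term."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row)
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return entropy(rows) - remainder

for i, name in enumerate(ATTRS):
    print(f"gain(D, {name}) = {gain(DATA, i):.3f}")
# Age 0.083, Has_job 0.324, Own_house 0.420, Credit_rating 0.363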

Information Gain Ratio

The gain criterion tends to favor attributes with many possible values. An extreme situation is that the data contain an ID attribute that is an identification of each example. If we consider using this ID attribute to partition the data, each training example will form a subset and have only one class, which results in entropy_ID(D) = 0. So the gain by using this attribute is maximal. From a prediction point of view, such a partition is useless.

Gain ratio (Equation (5)) remedies this bias by normalizing the gain using the entropy of the data with respect to the values of the attribute. Our previous entropy computations were done with respect to the class attribute:

    gainRatio(D, Ai) = gain(D, Ai) / ( - sum_{j=1}^{s} (|Dj| / |D|) * log2(|Dj| / |D|) ),        (5)

where s is the number of possible values of Ai, and Dj is the subset of data that has the jth value of Ai. |Dj| / |D| corresponds to the probability in Equation (2). Using Equation (5), we simply choose the attribute with the highest gainRatio value to extend the tree.

This method works because if Ai has too many values the denominator will be large. For instance, in our above example of the ID attribute, the denominator will be log2|D|. The denominator is called the split info in C4.5. One note is that the split info can be 0 or very small. Some heuristic solutions can be devised to deal with it (see [49]).

3.2.3 Handling of Continuous Attributes

It seems that the decision tree algorithm can only handle discrete attributes. In fact, continuous attributes can be dealt with easily as well. In a real-life data set, there are often both discrete attributes and continuous attributes. Handling both types in an algorithm is an important advantage.

To apply the decision tree building method, we can divide the value range of attribute Ai into intervals at a particular tree node. Each interval can then be considered a discrete value. Based on the intervals, gain or gainRatio is evaluated in the same way as in the discrete case. Clearly, we can divide Ai into any number of intervals at a tree node. However, two intervals are usually sufficient. This binary split is used in C4.5. We need to find a threshold value for the division.

Clearly, we should choose the threshold that maximizes the gain (or gainRatio). We need to examine all possible thresholds. This is not a problem because, although for a continuous attribute Ai the number of possible values that it can take is infinite, the number of actual values that appear in the data is always finite. Let the set of distinctive values of attribute Ai that occur in the data be {v1, v2, ..., vr}, sorted in ascending order. Clearly, any threshold value lying between vi and vi+1 will have the same effect of dividing the training examples into those whose value of attribute Ai lies in {v1, v2, ..., vi} and those whose value lies in {vi+1, vi+2, ..., vr}. There are thus only r-1 possible splits on Ai, which can all be evaluated. The threshold value can be the middle point between vi and vi+1, or just on the right side of value vi, which results in two intervals Ai <= vi and Ai > vi. This latter approach is used in C4.5. The advantage of this approach is that the values appearing in the tree actually occur in the data. The threshold value that maximizes the gain (or gainRatio) value is selected.
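
A minimal sketch of this binary-split search is given below (our own code, using the entropy-based gain of Equations (2)-(4)): candidate thresholds are taken between consecutive distinct values that actually occur in the data, and the threshold with the largest gain is kept. The attribute values used at the end are hypothetical, for illustration only.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, gain) of the best binary split: value <= t vs. value > t."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        v, v_next = pairs[i][0], pairs[i + 1][0]
        if v == v_next:
            continue                      # only split between distinct values
        t = (v + v_next) / 2              # midpoint (C4.5 instead uses v itself)
        left  = [lbl for val, lbl in pairs if val <= t]
        right = [lbl for val, lbl in pairs if val > t]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best[1]:
            best = (t, g)
    return best

# A hypothetical continuous attribute with its class labels.
ages = [22, 25, 31, 35, 42, 47, 53, 60]
cls  = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]
print(best_split(ages, cls))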

We can modify the algorithm in Fig. 3.4 (lines 8-11) easily to accommodate this computation so that both discrete and continuous attributes are considered. A change to line 20 of the algorithm in Fig. 3.4 is also needed. For a continuous attribute, we do not remove attribute Ag, because an interval can be further split recursively in subsequent tree extensions. Thus, the same continuous attribute may appear multiple times in a tree path (see Example 9), which does not happen for a discrete attribute.

From a geometric point of view, a decision tree built with only continuous attributes represents a partitioning of the data space. A series of splits from the root node to a leaf node represents a hyper-rectangle. Each side of the hyper-rectangle is an axis-parallel hyperplane.

Example 8: The hyper-rectangular regions in Fig. 3.6(A), which partition the space, are produced by the decision tree in Fig. 3.6(B). There are two classes in the data, represented by empty circles and filled rectangles.

Fig. 3.6. A partitioning of the data space and its corresponding decision tree ((A) a partition of the (X, Y) data space; (B) the decision tree, whose nodes split repeatedly on X and Y)

Handling of continuous (numeric) attributes has an impact on the efficiency of the decision tree algorithm. With only discrete attributes the algorithm grows linearly with the size of the data set D. However, sorting of a continuous attribute takes |D| log |D| time, which can dominate the tree learning process. Sorting is important as it ensures that gain or gainRatio can be computed in one pass of the data.

3.2.4 Some Other Issues

We now discuss several other issues in decision tree learning.

Tree Pruning and Overfitting: A decision tree algorithm recursively partitions the data until there is no impurity or there is no attribute left. This process may result in trees that are very deep and many tree leaves may cover very few training examples. If we use such a tree to predict the training set, the accuracy will be very high. However, when it is used to classify the unseen test set, the accuracy may be very low. The learning is thus not effective, i.e., the decision tree does not generalize the data well. This phenomenon is called overfitting.

More specifically, we say that a classifier f1 overfits the data if there is another classifier f2 such that f1 achieves a higher accuracy on the training data than f2, but a lower accuracy on the unseen test data than f2 [45]. Overfitting is usually caused by noise in the data, i.e., wrong class values/labels and/or wrong values of attributes, but it may also be due to the complexity and randomness of the application domain. These problems cause the decision tree algorithm to refine the tree by extending it very deeply using many attributes.

To reduce overfitting in the context of decision tree learning, we perform pruning of the tree, i.e., we delete some branches or sub-trees and replace them with leaves of majority classes. There are two main methods to do this: stopping early in tree building (which is also called pre-pruning) and pruning the tree after it is built (which is called post-pruning). Post-pruning has been shown to be more effective. Early stopping can be dangerous because it is not clear what will happen if the tree is extended further (without stopping). Post-pruning is more effective because, after we have extended the tree to the fullest, it becomes clearer which branches/sub-trees may not be useful (overfit the data). The general idea of post-pruning is to estimate the error of each tree node. If the estimated error for a node is less than the estimated error of its extended sub-tree, then the sub-tree is pruned. Most existing tree learning algorithms take this approach. See [49] for a technique called pessimistic error based pruning.

Example 9: In Fig. 3.6(B), the sub-tree representing the rectangular region X <= 2, Y > 2.5, Y <= 2.6 in Fig. 3.6(A) is very likely to be overfitting. The region is very small and contains only a single data point, which may be an error (or noise) in the data collection. If it is pruned, we obtain Fig. 3.7(A) and (B).

Fig. 3.7. The data space partition and the decision tree after pruning ((A) the pruned partition of the data space; (B) the pruned decision tree)

Another common approach to pruning is to use a separate set of data called the validation set, which is used neither in training nor in testing. After a tree is built, it is used to classify the validation set. Then, we can find the errors at each node on the validation set. This enables us to know what to prune based on the errors at each node.

Rule Pruning: We noted earlier that a decision tree can be converted to a set of rules. In fact, C4.5 also prunes the rules to simplify them and to reduce overfitting. First, the tree (C4.5 uses the unpruned tree) is converted to a set of rules in the way discussed in Example 4. Rule pruning is then performed by removing some conditions to make the rules shorter and fewer (after pruning, some rules may become redundant). In most cases, pruning results in a more accurate rule set, as shorter rules are less likely to overfit the training data. Pruning is also called generalization, as it makes rules more general (with fewer conditions). A rule with more conditions is more specific than a rule with fewer conditions.

Example 10: The sub-tree below X <= 2 in Fig. 3.6(B) produces these rules (O and ■ denote the two classes of Example 8, the empty circles and the filled rectangles):

Rule 1: X <= 2, Y > 2.5, Y > 2.6 -> ■
Rule 2: X <= 2, Y > 2.5, Y <= 2.6 -> O
Rule 3: X <= 2, Y <= 2.5 -> ■

Note that Y > 2.5 in Rule 1 is not useful because of Y > 2.6, and thus Rule 1 should be:

Rule 1: X <= 2, Y > 2.6 -> ■

In pruning, we may be able to delete the condition Y > 2.6 from Rule 1 to produce:

X <= 2 -> ■

Then Rule 2 and Rule 3 become redundant and can be removed.

A useful point to note is that after pruning the resulting set of rules may no longer be mutually exclusive and exhaustive. There may be data points that satisfy the conditions of more than one rule, and, if inaccurate rules are discarded, of no rules. An ordering of the rules is thus needed to ensure that when classifying a test case only one rule will be applied to determine the class of the test case. To deal with the situation where a test case does not satisfy the conditions of any rule, a default class is used, which is usually the majority class.

Handling Missing Attribute Values: In many practical data sets, some attribute values are missing or not available due to various reasons. There are many ways to deal with the problem. For example, we can fill each missing value with the special value "unknown" or with the most frequent value of the attribute if the attribute is discrete.

If the attribute is continuous, we can use the mean of the attribute for each missing value. The decision tree algorithm in C4.5 takes another approach. At a tree node, it distributes the training example with a missing value for the attribute to each branch of the tree proportionally, according to the distribution of the training examples that do have values for the attribute.

Handling Skewed Class Distribution: In many applications, the proportions of data for different classes can be very different. For instance, in a data set for intrusion detection in computer networks, the proportion of intrusion cases is extremely small (< 1%) compared with normal cases. Directly applying the decision tree algorithm for classification or prediction of intrusions is usually not effective. The resulting decision tree often consists of a single leaf node, "normal", which is useless for intrusion detection. One way to deal with the problem is to over-sample the intrusion examples to increase their proportion. Another solution is to rank the new cases according to how likely they are to be intrusions. The human users can then investigate the top-ranked cases.

3.3 Classifier Evaluation

After a classifier is constructed, it needs to be evaluated for accuracy. Effective evaluation is crucial because, without knowing the approximate accuracy of a classifier, it cannot be used in real-world tasks.

There are many ways to evaluate a classifier, and there are also many measures. The main measure is the classification accuracy (Equation (1)), which is the number of correctly classified instances in the test set divided by the total number of instances in the test set. Some researchers also use the error rate, which is 1 - accuracy. Clearly, if we have several classifiers, the one with the highest accuracy is preferred. Statistical significance tests may be used to check whether one classifier's accuracy is significantly better than that of another given the same training and test data sets. Below, we first present several common methods for classifier evaluation, and then introduce some other evaluation measures.

3.3.1 Evaluation Methods

Holdout Set: The available data D is divided into two disjoint subsets, the training set D_train and the test set D_test, with D = D_train U D_test and D_train and D_test sharing no examples. The test set is also called the holdout set. This method is mainly used when the data set D is large.

Note that the examples in the original data set D are all labeled with classes. As we discussed earlier, the training set is used for learning a classifier and the test set is used for evaluating the classifier. The training set should not be used in the evaluation, as the classifier is biased toward the training set. That is, the classifier may overfit the training data, which results in very high accuracy on the training set but low accuracy on the test set. Using the unseen test set gives an unbiased estimate of the classification accuracy. As for what percentage of the data should be used for training and what percentage for testing, it depends on the data set size. 50-50, and two thirds for training and one third for testing, are commonly used.

To partition D into training and test sets, we can use a few approaches:

1. We randomly sample a set of training examples from D for learning and use the rest for testing.
2. If the data is collected over time, then we can use the earlier part of the data for training/learning and the later part of the data for testing. In many applications, this is a more suitable approach because when the classifier is used in the real world the data are from the future. This approach thus better reflects the dynamic aspects of applications.

Multiple Random Sampling: When the available data set is small, using the above methods can be unreliable because the test set would be too small to be representative. One approach to deal with the problem is to perform the above random sampling n times. Each time a different training set and a different test set are produced. This produces n accuracies. The final estimated accuracy on the data is the average of the n accuracies.

Cross-Validation: When the data set is small, the n-fold cross-validation method is very commonly used. In this method, the available data is partitioned into n equal-size disjoint subsets. Each subset is then used as the test set and the remaining n-1 subsets are combined as the training set to learn a classifier. This procedure is then run n times, which gives n accuracies. The final estimated accuracy of learning from this data set is the average of the n accuracies. 10-fold and 5-fold cross-validations are often used.

A special case of cross-validation is the leave-one-out cross-validation. In this method, each fold of the cross-validation has only a single test example and all the rest of the data is used in training. That is, if the original data has m examples, then this is m-fold cross-validation. This method is normally used when the available data is very small. It is not efficient for a large data set, as m classifiers need to be built.
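
The n-fold procedure is simple to express in code. The sketch below (plain Python; train_and_evaluate is a placeholder for whatever learning algorithm and accuracy computation are being used) shuffles the data, forms n disjoint folds, and averages the n accuracies; the trivial majority-class learner at the end is only there to make the example runnable.

import random

def cross_validation(examples, n_folds, train_and_evaluate, seed=0):
    """n-fold cross-validation: each fold serves once as the test set."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]   # n disjoint subsets
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        accuracies.append(train_and_evaluate(train, test))
    return sum(accuracies) / n_folds

# Usage sketch with a trivial majority-class learner on (features, label) pairs.
def majority_learner(train, test):
    labels = [lbl for _, lbl in train]
    majority = max(set(labels), key=labels.count)
    return sum(lbl == majority for _, lbl in test) / len(test)

data = [((i,), "Yes" if i % 3 else "No") for i in range(30)]
print(cross_validation(data, 10, majority_learner))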

In Sect. 3.2.4, we mentioned that a validation set can be used to prune a decision tree or a set of rules. If a validation set is employed for that purpose, it should not be used in testing. In that case, the available data is divided into three subsets: a training set, a validation set and a test set. Apart from using a validation set to help tree or rule pruning, a validation set is also used frequently to estimate parameters in learning algorithms. In such cases, the values that give the best accuracy on the validation set are used as the final values of the parameters. Cross-validation can be used for parameter estimation as well. Then a separate validation set is not needed. Instead, the whole training set is used in cross-validation.

3.3.2 Precision, Recall, F-score and Breakeven Point

In some applications, we are only interested in one class. This is particularly true for text and Web applications. For example, we may be interested in only the documents or Web pages of a particular topic. Also, in classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are typically interested in only the minority class. The class that the user is interested in is commonly called the positive class, and the rest are negative classes (the negative classes may be combined into one negative class). Accuracy is not a suitable measure in such cases because we may achieve a very high accuracy, but may not identify a single intrusion. For instance, 99% of the cases are normal in an intrusion detection data set. Then a classifier can achieve 99% accuracy (without doing anything) by simply classifying every test case as "not intrusion". This is, however, useless.

Precision and recall are more suitable in such applications because they measure how precise and how complete the classification is on the positive class. It is convenient to introduce these measures using a confusion matrix (Table 3.2). A confusion matrix contains information about actual and predicted results given by a classifier.

Table 3.2. Confusion matrix of a classifier

                   Classified positive   Classified negative
Actual positive    TP                    FN
Actual negative    FP                    TN

where
TP: the number of correct classifications of positive examples (true positive)
FN: the number of incorrect classifications of positive examples (false negative)
FP: the number of incorrect classifications of negative examples (false positive)
TN: the number of correct classifications of negative examples (true negative)

Based on the confusion matrix, the precision (p) and recall (r) of the positive class are defined as follows:

    p = TP / (TP + FP),        r = TP / (TP + FN).                      (6)

In words, precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive. Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. The intuitive meanings of these two measures are quite obvious. However, it is hard to compare classifiers based on two measures, which are not functionally related. For a test set, the precision may be very high but the recall can be very low, and vice versa.

Example 11: A test data set has 100 positive examples and 1000 negative examples. After classification using a classifier, we have the following confusion matrix (Table 3.3):

Table 3.3. Confusion matrix of a classifier

                   Classified positive   Classified negative
Actual positive    1                     99
Actual negative    0                     1000

This confusion matrix gives the precision p = 100% and the recall r = 1% because we classified only one positive example correctly and classified no negative examples wrongly.

Although in theory precision and recall are not related, in practice high precision is achieved almost always at the expense of recall, and high recall is achieved at the expense of precision. In an application, which measure is more important depends on the nature of the application. If we need a single measure to compare different classifiers, the F-score is often used:

    F = 2pr / (p + r).                                                  (7)

The F-score (also called the F1-score) is the harmonic mean of precision and recall:

    F = 2 / (1/p + 1/r).                                                (8)

The harmonic mean of two numbers tends to be closer to the smaller of the two. Thus, for the F-score to be high, both p and r must be high.
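
These measures follow directly from the confusion-matrix counts. A minimal sketch (our own helper, not library code) that reproduces the numbers of Example 11:

def precision_recall_f1(tp, fp, fn):
    """Equations (6) and (7), computed from confusion-matrix counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example 11 (Table 3.3): TP = 1, FN = 99, FP = 0, TN = 1000.
p, r, f = precision_recall_f1(tp=1, fp=0, fn=99)
print(p, r, round(f, 4))   # 1.0, 0.01, 0.0198 -- high precision, very low recall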

There is also another measure, called the precision and recall breakeven point, which is used in the information retrieval community. The breakeven point is reached when the precision and the recall are equal. This measure assumes that the test cases can be ranked by the classifier based on their likelihoods of being positive. For instance, in decision tree classification, we can use the confidence of each leaf node as the value to rank test cases.

Example 12: We have a ranking of 20 test documents, where rank 1 is the highest rank and rank 20 the lowest; "+" ("-") represents an actual positive (negative) document. Assume that the test set has 10 positive examples and that, in the ranking, the top two documents are positive, six of the top nine are positive, and seven of the top ten are positive. Then:

At rank 1:  p = 1/1 = 100%    r = 1/10 = 10%
At rank 2:  p = 2/2 = 100%    r = 2/10 = 20%
At rank 9:  p = 6/9 = 66.7%   r = 6/10 = 60%
At rank 10: p = 7/10 = 70%    r = 7/10 = 70%

The breakeven point is p = r = 70%. Note that interpolation is needed if such a point cannot be found.
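
Given a ranked list of test cases with their actual labels, precision and recall at every rank, and hence the breakeven point, can be computed in a few lines. The label sequence used below is a hypothetical ordering consistent with the counts stated in Example 12; only the number of positives among the top ten matters for the breakeven value.

# Hypothetical ranked labels consistent with Example 12: 20 documents,
# 10 actually positive (True), ranked from most to least likely positive.
ranked = [True, True, True, False, True, False, True, False, True, True,
          False, False, True, False, False, True, False, True, False, False]

total_positives = sum(ranked)
breakeven = None
positives_so_far = 0
for k, is_positive in enumerate(ranked, start=1):
    positives_so_far += is_positive
    precision = positives_so_far / k
    recall = positives_so_far / total_positives
    if abs(precision - recall) < 1e-9:      # p == r at this rank
        breakeven = precision
print(breakeven)   # 0.7  (p = r = 70% at rank 10)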

3.3.3 Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It is also commonly used to evaluate classification results on the positive class in two-class classification problems. The classifier needs to rank the test cases according to their likelihoods of belonging to the positive class, with the most likely positive case ranked at the top. The true positive rate (TPR) is defined as the fraction of actual positive cases that are correctly classified,

    TPR = TP / (TP + FN).                                               (9)

The false positive rate (FPR) is defined as the fraction of actual negative cases that are classified to the positive class,

    FPR = FP / (TN + FP).                                               (10)

TPR is basically the recall of the positive class and is also called sensitivity in statistics. There is also another measure in statistics called specificity, which is the true negative rate (TNR), or the recall of the negative class. TNR is defined as follows:

    TNR = TN / (TN + FP).                                               (11)

From Equations (10) and (11), we can see the following relationship:

    FPR = 1 - specificity.                                              (12)

Fig. 3.8 shows the ROC curves of two example classifiers (C1 and C2) on the same test data. Each curve starts from (0, 0) and ends at (1, 1). (0, 0) represents the situation where every test case is classified as negative, and (1, 1) represents the situation where every test case is classified as positive. This is the case because we can treat the classification result as a ranking of the test cases in the positive class, and we can partition the ranked list at any point into two parts, with the upper part assigned to the positive class and the lower part assigned to the negative class. We will see shortly that an ROC curve is drawn based on such partitions. In Fig. 3.8, we also see the main diagonal line, which represents random guessing, i.e., predicting each case to be positive with a fixed probability. In this case, it is clear that for every FPR value, TPR has the same value, i.e., TPR = FPR.

Fig. 3.8. ROC curves for two classifiers (C1 and C2) on the same data

For classifier evaluation using the ROC curves in Fig. 3.8, we want to know which classifier is better. The answer is that when FPR is less than 0.43, C1 is better, and when FPR is greater than 0.43, C2 is better. However, sometimes this is not a satisfactory answer because we cannot say that either one of the classifiers is strictly better than the other. For an overall comparison, researchers often use the area under the ROC curve (AUC). If the AUC value for a classifier Ci is greater than that of another classifier Cj, it is said that Ci is better than Cj. If a classifier is perfect, its AUC value is 1. If a classifier makes all random guesses, its AUC value is 0.5.

Let us now describe how to draw an ROC curve given the classification result as a ranking of test cases. The ranking is obtained by sorting the test cases in decreasing order of the classifier's output values (e.g., posterior probabilities). We then partition the ranked list into two subsets (or parts) at every test case and regard every test case in the upper part (with higher classifier output value) as a positive case and every test case in the lower part as a negative case. For each such partition, we compute a pair of TPR and FPR values. When the upper part is empty, we obtain the point (0, 0) on the ROC curve, and when the lower part is empty, we obtain the point (1, 1). Finally, we simply connect the adjacent points.

Example 13: We have 10 test cases. A classifier has been built, and it has ranked the 10 test cases as shown in the second row of Table 3.4 (the numbers in row 1 are the rank positions, with 1 being the highest rank and 10 the lowest). The second row shows the actual class of each test case: "+" means that the test case is from the positive class, and "-" means that it is from the negative class. All the results needed for drawing the ROC curve are shown in rows 3-8 of Table 3.4; each column of numbers corresponds to the partition placed right after that rank, and the leftmost column of numbers is the partition with an empty upper part. The ROC curve is given in Fig. 3.9.

Table 3.4. Computations for drawing an ROC curve

Rank                 1     2     3     4     5     6     7     8     9     10
Actual class         +     +     -     -     +     -     -     +     -     -
TP             0     1     2     2     2     3     3     3     4     4     4
FP             0     0     0     1     2     2     3     4     4     5     6
TN             6     6     6     5     4     4     3     2     2     1     0
FN             4     3     2     2     2     1     1     1     0     0     0
TPR            0     0.25  0.5   0.5   0.5   0.75  0.75  0.75  1     1     1
FPR            0     0     0     0.17  0.33  0.33  0.50  0.67  0.67  0.83  1

Fig. 3.9. ROC curve for the data shown in Table 3.4
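
The (FPR, TPR) pairs in Table 3.4 can be generated directly from the ranked labels. The sketch below (plain Python, our own code) moves the partition point down the ranking one case at a time, emits one ROC point per partition, and also estimates the area under the curve with the trapezoidal rule.

# Actual classes of the 10 ranked test cases in Table 3.4 (rank 1 first).
ranked = [True, True, False, False, True, False, False, True, False, False]

P = sum(ranked)            # 4 actual positives
N = len(ranked) - P        # 6 actual negatives

points = [(0.0, 0.0)]      # empty upper part: everything classified negative
tp = fp = 0
for is_positive in ranked: # move the partition down one case at a time
    if is_positive:
        tp += 1
    else:
        fp += 1
    points.append((fp / N, tp / P))          # (FPR, TPR), Equations (9)-(10)

for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}  TPR = {tpr:.2f}")

# Area under the ROC curve by the trapezoidal rule (0.75 for this ranking).
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print("AUC =", round(auc, 3))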

3.3.4 Lift Curve

The lift curve (also called the lift chart) is similar to the ROC curve. It is also used for the evaluation of two-class classification tasks, where the positive class is the target of interest and usually the rare class. It is often used in direct marketing applications to link classification results to costs and profits. For example, a mail-order company wants to send promotional materials to potential customers to sell an expensive watch. Since printing and postage cost money, the company needs to build a classifier to identify likely buyers, and only send the promotional materials to them. The question is how many should be sent. To make the decision, the company needs to balance the cost and the profit (if a watch is sold, the company makes a certain profit, but sending each letter incurs a fixed cost). The lift curve provides a nice tool to enable the marketer to make the decision.

Like an ROC curve, to draw a lift curve, the classifier needs to produce a ranking of the test cases according to their likelihoods of belonging to the positive class, with the most likely positive case ranked at the top. After the ranking, the test cases are divided into N equal-sized bins (N is usually 10 to 20). The actual positive cases in each bin are then counted. A lift curve is drawn with the x-axis being the percentage of test data (or bins) and the y-axis being the percentage of cumulative positive cases from the first bin to the current bin. A lift curve usually also includes a line (called the baseline) along the main diagonal [from (0, 0) to (100, 100)], which represents the situation where the positive cases in the test set are uniformly (or randomly) distributed in the N bins (no learning), i.e., each bin contains 100/N percent of the positive cases. If the lift curve is above this baseline, learning is said to be effective. The greater the area between the lift curve and the baseline, the better the classifier.

Example 14: A company wants to send promotional materials to potential buyers to sell an expensive brand of watches. It builds a classification model and tests it on a test data set of 10,000 people (test cases) that it collected in the past. After classification and ranking, it decides to divide the test data into 10 bins, with each bin containing 10% of the test cases, or 1,000 cases. Out of the 1,000 cases in each bin, there is a certain number of positive cases (e.g., past buyers). The detailed results are listed in Table 3.5, which includes the number (#) of positive cases and the percentage (%) of positive cases in each bin, and the cumulative percentage for that bin. The cumulative percentages are used in drawing the lift curve, which is given in Fig. 3.10. We can see that the lift curve is way above the baseline, which means that the learning is highly effective.

Suppose printing and postage cost $1.00 for each letter, and the sale of each watch makes $100 (assuming that each buyer only buys one watch).