Chapter 6. Classification and Prediction


1 Chapter 6. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Classification by back propagation
- Lazy learners (or learning from your neighbors)
- Frequent-pattern-based classification
- Other classification methods
- Prediction
- Accuracy and error measures

2 Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

3 Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis: if a tumor is cancerous or benign
  - Fraud detection: if a transaction is fraudulent
  - Web page categorization: which category it is

4 Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

5 Process (1): Model Construction
The training data is fed to a classification algorithm, which produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
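To make the model-usage step concrete, here is a minimal Python sketch (the function name and the hard-coded rule are my own rendering of the classifier above), applied to the unseen tuple (Jeff, Professor, 4) from the next slide:

```python
def predict_tenured(rank, years):
    """Classifier learned from the training data:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Unseen data from the usage step: (Jeff, Professor, 4)
print(predict_tenured("Professor", 4))   # -> yes
```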

6 Process (2): Using the Model in Prediction
The classifier is first evaluated on testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) -> Tenured?

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

7 Issues: Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data

8 Issues: Evaluating Classification Methods
- Accuracy
  - classifier accuracy: predicting the class label
  - predictor accuracy: guessing the value of predicted attributes
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

9 Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3 (Playing Tennis).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

10 Output: A Decision Tree for buys_computer
age?
- <=30: student?
  - no: no
  - yes: yes
- 31..40: yes
- >40: credit_rating?
  - excellent: no
  - fair: yes

11 Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
  - There are no samples left

12 Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)

13 Attribute Selection: Information Gain
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
- Partitioning on age (counts from the training table on slide 9):

  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31..40  4    0    0
  >40     3    2    0.971

  Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
  Here \frac{5}{14} I(2,3) means that age <=30 has 5 out of 14 samples, with 2 yes'es and 3 no's.
  Hence Gain(age) = Info(D) - Info_age(D) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
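A short Python sketch that reproduces these gains from the training table on slide 9; the helper names (`info`, `gain`) and the tuple encoding are my own choices, not part of the slide:

```python
from math import log2

# buys_computer training data from slide 9: (age, income, student, credit_rating, class)
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def gain(data, attr_index):
    """Information gain of splitting on the attribute at attr_index."""
    labels = [row[-1] for row in data]
    values = set(row[attr_index] for row in data)
    info_a = sum(
        (len(part) / len(data)) * info([row[-1] for row in part])
        for part in ([row for row in data if row[attr_index] == v] for v in values)
    )
    return info(labels) - info_a

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(gain(data, i), 3))   # age 0.246, income 0.029, student 0.151, credit_rating 0.048
```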

14 Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split-point for A
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
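A sketch of this split-point search; the function names and the toy age/label values are mine, chosen only to illustrate the midpoint enumeration:

```python
from math import log2

def info(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_split_point(values, labels):
    """Pick the midpoint between adjacent sorted values that minimizes
    the expected information Info_A(D) of the resulting binary split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info_a = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if best is None or info_a < best[1]:
            best = (split, info_a)
    return best

ages = [23, 29, 35, 38, 41, 45, 52]
buys = ["no", "no", "yes", "yes", "yes", "no", "no"]
print(best_split_point(ages, buys))   # best split point and its expected information
```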

15 Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
  GainRatio(A) = Gain(A) / SplitInfo(A)
- Ex. Income splits the 14 tuples into partitions of sizes 4, 6, and 4:
  SplitInfo_income(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
  gain_ratio(income) = 0.029/1.557 = 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute
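A quick check of the gain-ratio normalization (the function name is mine), using the income partition sizes 4, 6, and 4 from the example and the Gain(income) value from the previous slide:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split producing partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])              # income splits D into low/medium/high of sizes 4, 6, 4
gain_income = 0.029                     # Gain(income) from the previous slide
print(round(si, 3), round(gain_income / si, 3))   # SplitInfo and GainRatio(income)
```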

16 Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index, gini(D), is defined as
  gini(D) = 1 - \sum_{j=1}^{n} p_j^2
  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
- Reduction in impurity:
  \Delta gini(A) = gini(D) - gini_A(D)
- The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

17 Gini Index (CART, IBM IntelligentMiner)
- Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  gini_{income \in \{low,medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2)
  but gini_{\{medium,high\}} is 0.30 and thus the best since it is the lowest
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
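A minimal sketch of the Gini computations (the helper names are mine); the class counts in the split example are read off the training table on slide 9:

```python
def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(part1, part2):
    """gini_A(D) for a binary split of D into two lists of class labels."""
    n = len(part1) + len(part2)
    return len(part1) / n * gini(part1) + len(part2) / n * gini(part2)

# D has 9 tuples with buys_computer = yes and 5 with no
print(round(gini(["yes"] * 9 + ["no"] * 5), 3))                      # 0.459

# income in {low, medium}: 7 yes / 3 no; income = high: 2 yes / 2 no (from slide 9)
print(round(gini_split(["yes"] * 7 + ["no"] * 3, ["yes"] * 2 + ["no"] * 2), 3))
```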

18 Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
- Gini index:
  - biased towards multivalued attributes
  - has difficulty when the number of classes is large
  - tends to favor tests that result in equal-sized partitions and purity in both partitions

19 Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

20 Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods

21 Classification by Backpropagation
- Backpropagation: a neural network learning algorithm
- Started by psychologists and neurobiologists to develop and test computational analogues of neurons
- A neural network: a set of connected input/output units where each connection has a weight associated with it
- During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
- Also referred to as connectionist learning due to the connections between units

22 Neural Network as a Classifier
- Weakness
  - Long training time
  - Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
  - Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
- Strength
  - High tolerance to noisy data
  - Ability to classify untrained patterns
  - Well-suited for continuous-valued inputs and outputs
  - Successful on a wide array of real-world data
  - Algorithms are inherently parallel
  - Techniques have recently been developed for the extraction of rules from trained neural networks

23 A Neuron (= a perceptron)
- The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
- For example, with input vector x, weight vector w, bias \mu_k, and sign activation function:
  y = sign\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)
(The slide shows the neuron diagram: inputs x_0 ... x_n with weights w_0 ... w_n feeding a weighted sum, an activation function f, and output y.)
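A minimal sketch of this single neuron; the input values, weights, and bias below are arbitrary illustrative numbers:

```python
def perceptron_output(x, w, mu_k):
    """Single neuron: weighted sum of inputs minus bias, passed through a sign activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return 1 if s >= 0 else -1

x = [0.5, -1.0, 2.0]       # input vector
w = [0.4, 0.3, 0.9]        # weight vector
print(perceptron_output(x, w, mu_k=1.0))   # -> 1, since 0.2 - 0.3 + 1.8 - 1.0 >= 0
```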

24 A Multi-Layer Feed-Forward Neural Network
- An input vector X is fed to the input layer, passed through the hidden layer(s) via weights w_{ij}, and produces the output vector at the output layer
- A typical weight-update step has the form
  w_j^{(k+1)} = w_j^{(k)} + \lambda (y - \hat{y}^{(k)}) x_j
(The slide shows the layered network diagram: input vector X, input layer, hidden layer, output layer, output vector.)

25 How a Multi-Layer Neural Network Works
- The inputs to the network correspond to the attributes measured for each training tuple
- Inputs are fed simultaneously into the units making up the input layer
- They are then weighted and fed simultaneously to a hidden layer
- The number of hidden layers is arbitrary, although usually only one is used
- The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction
- The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
- From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function

26 Defining a Network Topology
- First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
- Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
- One input unit per domain value, each initialized to 0
- Output: if used for classification with more than two classes, one output unit per class is used
- Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

27 Backpropagation
- Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
- For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
- Steps
  - Initialize weights (to small random numbers) and biases in the network
  - Propagate the inputs forward (by applying the activation function)
  - Backpropagate the error (by updating weights and biases)
  - Terminating condition (when the error is very small, etc.)
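The sketch below shows these steps for one training tuple on a tiny 2-input, 2-hidden, 1-output sigmoid network; the network size, learning rate, initialization, and training tuples are illustrative choices of mine, not fixed by the slide:

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, target, w_h, b_h, w_o, b_o, rate=0.5):
    """One backpropagation step: forward pass, error terms, then weight/bias updates
    made backwards from the output layer. w_h[j] holds the input weights of hidden
    unit j; w_o holds the hidden-to-output weights."""
    # propagate the inputs forward
    h = [sigmoid(sum(wji * xi for wji, xi in zip(w_h[j], x)) + b_h[j]) for j in range(2)]
    o = sigmoid(sum(wj * hj for wj, hj in zip(w_o, h)) + b_o)
    # error terms: output layer first, then hidden layer
    delta_o = o * (1 - o) * (target - o)
    delta_h = [h[j] * (1 - h[j]) * w_o[j] * delta_o for j in range(2)]
    # update weights and biases
    w_o = [w_o[j] + rate * delta_o * h[j] for j in range(2)]
    b_o = b_o + rate * delta_o
    w_h = [[w_h[j][i] + rate * delta_h[j] * x[i] for i in range(2)] for j in range(2)]
    b_h = [b_h[j] + rate * delta_h[j] for j in range(2)]
    return w_h, b_h, w_o, b_o, o

random.seed(0)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]   # small random weights
b_h = [0.0, 0.0]
w_o = [random.uniform(-0.5, 0.5) for _ in range(2)]
b_o = 0.0
for _ in range(5000):                       # repeatedly train on two labeled tuples
    for x, t in [([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0)]:
        w_h, b_h, w_o, b_o, out = train_step(x, t, w_h, b_h, w_o, b_o)
print(round(out, 2))                        # prediction for ([1.0, 1.0], target 0.0) is driven toward 0
```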

28 Backpropagation and Interpretability
- Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case
- Rule extraction from networks: network pruning
  - Simplify the network structure by removing weighted links that have the least effect on the trained network
  - Then perform link, unit, or activation value clustering
  - The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
- Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules

29 Lazy vs. Eager Learning
- Lazy vs. eager learning
  - Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  - Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
  - A lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
  - Eager: must commit to a single hypothesis that covers the entire instance space

30 Lazy Learner: Instance-Based Methods
- Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - Locally weighted regression: constructs a local approximation
  - Case-based reasoning: uses symbolic representations and knowledge-based inference

31 The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance, dist(X_1, X_2)
- The target function could be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
(The slide shows a sketch of positive and negative training points surrounding a query point x_q.)

32 Discussion on the k-NN Algorithm
- k-NN for real-valued prediction for a given unknown tuple
  - Returns the mean values of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to their distance to the query x_q
  - Give greater weight to closer neighbors, e.g., w \equiv \frac{1}{d(x_q, x_i)^2}
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch axes or eliminate the least relevant attributes
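A sketch of distance-weighted k-NN for real-valued prediction; the toy training tuples and the small constant added to avoid division by zero when a neighbor coincides with the query are my own choices:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    """Distance-weighted k-NN prediction: each of the k nearest neighbors
    contributes with weight 1 / d(x_q, x_i)^2."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
    weights = [1.0 / (euclidean(x, query) ** 2 + 1e-9) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# toy training tuples: (attribute vector, numeric target)
train = [([1.0, 1.0], 10.0), ([2.0, 1.5], 12.0), ([8.0, 8.0], 40.0), ([9.0, 7.5], 42.0)]
print(round(knn_predict(train, [1.5, 1.2], k=3), 1))   # dominated by the two nearby points
```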

33 Genetic Algorithms (GA)
- Genetic algorithm: based on an analogy to biological evolution
- An initial population is created consisting of randomly generated rules
  - Each rule is represented by a string of bits
  - E.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100
  - If an attribute has k > 2 values, k bits can be used
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
  - The fitness of a rule is represented by its classification accuracy on a set of training examples
  - Offspring are generated by crossover and mutation (see the sketch below)
- The process continues until a population P "evolves", i.e., each rule in P satisfies a prespecified fitness threshold
- Slow but easily parallelizable
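A minimal sketch of the two genetic operators on bit-string rules (the helper names are mine); fitness evaluation against training examples is omitted:

```python
import random

def crossover(rule1, rule2):
    """Single-point crossover: swap the tails of two parent bit strings."""
    point = random.randint(1, len(rule1) - 1)
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return "".join(("1" if b == "0" else "0") if random.random() < rate else b for b in rule)

random.seed(1)
parent1, parent2 = "100", "011"            # encoded rules, as in the slide's example
print(crossover(parent1, parent2), mutate(parent1))
```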

34 What Is Prediction?
- (Numerical) prediction is similar to classification
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - model the relationship between one or more independent or predictor variables and a dependent or response variable
- Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

35 Linear Regression
- Linear regression: involves a response variable y and a single predictor variable x
  y = w_0 + w_1 x
  where w_0 (y-intercept) and w_1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line
  w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
- Multiple linear regression: involves more than one predictor variable
  - Training data is of the form (X_1, y_1), (X_2, y_2), ..., (X_{|D|}, y_{|D|})
  - Ex. for 2-D data, we may have: y = w_0 + w_1 x_1 + w_2 x_2
  - Solvable by an extension of the least squares method or using SAS, S-Plus
  - Many nonlinear functions can be transformed into the above
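The least-squares formulas above translate directly into code; the sample data below (years of experience vs. salary in $1000s) is my own illustrative choice, not taken from the slide:

```python
def least_squares(xs, ys):
    """Method of least squares for simple linear regression y = w0 + w1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]    # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]   # salary in $1000s
w0, w1 = least_squares(xs, ys)
print(round(w0, 1), round(w1, 1))          # fitted intercept and slope (about 23.2 and 3.5)
```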

36 Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,
  y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
  is convertible to linear form with new variables x_2 = x^2, x_3 = x^3:
  y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
- Other functions, such as the power function, can also be transformed to a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms)
  - it is still possible to obtain least-square estimates through extensive calculation on more complex formulae
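A sketch of this transformation trick using NumPy: once x^2 and x^3 are added as extra columns, the cubic model becomes ordinary linear least squares. The synthetic data and its coefficients are mine, chosen so the fit is easy to verify:

```python
import numpy as np

# synthetic data generated from a known cubic polynomial
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * x**3

# design matrix with columns [1, x, x^2, x^3]; solving it is plain linear least squares
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 2))                      # recovers approximately [1.0, 2.0, -0.5, 0.1]
```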

37 Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees to predict continuous values rather than class labels

38 Prediction: Numerical Data (figure only)

39 Prediction: Categorical Data (figure only)

40 Classifier Accuracy Measures

Real class \ Predicted class   C1               ~C1
C1                             True positive    False negative
~C1                            False positive   True negative

(The slide also shows an example confusion matrix for buy_computer = yes/no with per-class and overall recognition rates.)

- Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
  - Error rate (misclassification rate) of M = 1 - acc(M)
  - Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
- Alternative accuracy measures (e.g., for cancer diagnosis)
  sensitivity = t-pos/pos            /* true positive recognition rate */
  specificity = t-neg/neg            /* true negative recognition rate */
  precision = t-pos/(t-pos + f-pos)
  accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
- This model can also be used for cost-benefit analysis
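These measures can be computed directly from the four confusion-matrix counts; the counts in the example call below are hypothetical:

```python
def accuracy_measures(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision, and overall accuracy from confusion-matrix counts."""
    pos, neg = t_pos + f_neg, t_neg + f_pos
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, accuracy

# hypothetical counts: 90 true positives, 10 false negatives, 30 false positives, 870 true negatives
print([round(v, 3) for v in accuracy_measures(90, 10, 30, 870)])
```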

41 Predictor Error Measures
- Measure predictor accuracy: how far off the predicted value is from the actual known value
- Loss function: measures the error between y_i and the predicted value y_i'
  - Absolute error: |y_i - y_i'|
  - Squared error: (y_i - y_i')^2
- Test error (generalization error): the average loss over the test set
  - Mean absolute error: \frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|
  - Mean squared error: \frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2
  - Relative absolute error: \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}
  - Relative squared error: \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
- The mean squared error exaggerates the presence of outliers
- Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
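A small helper (names mine) that computes the error measures above for a pair of actual/predicted value lists; the sample values are illustrative:

```python
from math import sqrt

def error_measures(actual, predicted):
    """Mean absolute, mean squared, relative absolute, relative squared, and root mean squared errors."""
    d = len(actual)
    y_bar = sum(actual) / d
    abs_err = sum(abs(y - yp) for y, yp in zip(actual, predicted))
    sq_err = sum((y - yp) ** 2 for y, yp in zip(actual, predicted))
    return {
        "MAE": abs_err / d,
        "MSE": sq_err / d,
        "RAE": abs_err / sum(abs(y - y_bar) for y in actual),
        "RSE": sq_err / sum((y - y_bar) ** 2 for y in actual),
        "RMSE": sqrt(sq_err / d),
    }

actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.5, 16.0, 18.0]
print({k: round(v, 3) for k, v in error_measures(actual, predicted).items()})
```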

42 Evaluating the Accuracy of a Classifier or Predictor (I)
- Holdout method
  - The given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout
    - Repeat holdout k times; accuracy = average of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the test set and the others as the training set
  - Leave-one-out: k folds where k = the number of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
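A plain-Python sketch of k-fold cross-validation; the train/predict callables and the majority-class toy model are placeholders of mine, not part of the slide:

```python
import random

def k_fold_accuracy(data, train_fn, predict_fn, k=10, seed=0):
    """k-fold cross-validation: partition the data into k folds, train on k-1 folds,
    test on the held-out fold, and average the k accuracies."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# toy usage: labeled tuples (features, label) and a majority-class "model"
data = [((i,), "yes" if i % 3 else "no") for i in range(60)]
train_fn = lambda rows: max(set(y for _, y in rows), key=[y for _, y in rows].count)
predict_fn = lambda model, x: model
print(round(k_fold_accuracy(data, train_fn, predict_fn, k=10), 2))
```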

43 Model Selection: ROC Curves
- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Shows the trade-off between the true positive rate and the false positive rate
- The area under the ROC curve is a measure of the accuracy of the model
- Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
- The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
- The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate; the plot also shows a diagonal line
- A model with perfect accuracy will have an area of 1.0
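A sketch that builds the ROC points by ranking test tuples by decreasing classifier score, as described above, and computes the area under the curve with the trapezoidal rule; the scores and labels are illustrative:

```python
def roc_points(scores, labels):
    """ROC curve points (FPR, TPR) obtained by ranking test tuples by decreasing score
    and sweeping the decision threshold through the ranked list."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(1 for _, y in ranked if y == 1)
    neg = len(ranked) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]   # classifier scores for the positive class
labels = [1, 1, 0, 1, 1, 0, 0, 0]                    # actual classes
print(round(auc(roc_points(scores, labels)), 2))     # about 0.88 here (1.0 = perfect, 0.5 = random)
```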
