Linear classifier. MAXIMUM ENTROPY. Linear regression. Logistic regression. 11/3/11

Transcription

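1 Linear classifier
A linear classifier predicts the label based on a weighted, linear combination of the features:
$\text{prediction} = w_0 + w_1 f_1 + w_2 f_2 + \cdots + w_m f_m$
For two classes, a linear classifier can be viewed as a plane (hyperplane) in the feature space.
[Figure: feature space with axes f_1, f_2, f_3]

MAXIMUM ENTROPY
David Kauchak
CS457, Spring 2011
Some material derived from Jason Eisner

Linear regression
Predict the response based on a weighted, linear combination of the features:
$h(f) = w_0 + w_1 f_1 + w_2 f_2 + \cdots + w_m f_m$
Here $h(f)$ is a real value and $w_0, \ldots, w_m$ are the weights.
Learn the weights by minimizing the squared error on the training data, with the features taken from training example $i$:
$\text{error}(h) = \sum_i \left( y_i - (w_0 + w_1 f_1 + w_2 f_2 + \cdots + w_m f_m) \right)^2$

Logistic regression
With features $x = (x_1, \ldots, x_m)$:
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
$\frac{P(1 \mid x)}{1 - P(1 \mid x)} = e^{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m}$
$P(1 \mid x) = (1 - P(1 \mid x)) \, e^{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m}$
$P(1 \mid x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m)}}$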
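The squared-error objective for linear regression on this page can be sketched in a few lines. This is only an illustration: the function names, the tiny one-feature training set, and the candidate weights are all made up here and do not come from the slides.

```python
def predict(weights, features):
    """h(f) = w_0 + w_1*f_1 + ... + w_m*f_m, with weights[0] playing the role of w_0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def squared_error(weights, examples):
    """Sum over training examples of (y_i - h(f_i))^2."""
    return sum((y - predict(weights, f)) ** 2 for f, y in examples)


# Hypothetical training set generated by y = 1 + 2*f_1 (illustration only).
examples = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]

perfect = squared_error([1.0, 2.0], examples)  # the generating weights: no error
worse = squared_error([0.0, 2.0], examples)    # intercept off by 1 on all three examples

print(perfect, worse)
```

Learning the weights then means searching for the weight vector that makes `squared_error` as small as possible.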

2 Logistic function
$\text{logistic}(x) = \frac{1}{1 + e^{-x}}$

Logistic regression
Find the best fit of the data based on a logistic function.

Logistic regression
How would we classify examples once we had a trained model?
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
If the sum > 0 then p(1)/p(0) > 1, so positive.
If the sum < 0 then p(1)/p(0) < 1, so negative.
Still a linear classifier (the decision boundary is a line).

3 views of logistic regression
Linear classifier:
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
Exponential model (log-linear model):
$P(1 \mid x) = \frac{e^{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m}}{1 + e^{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m}}$
Logistic:
$P(1 \mid x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m)}}$
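The three views and the sign-based decision rule can be checked against each other numerically. The sketch below assumes made-up weights and examples (none of these numbers are from the slides) and simply confirms that the exponential and logistic forms agree, that the log-odds recover the linear sum, and that "sum > 0" is the same decision as "probability > 0.5".

```python
import math


def linear_score(w, x):
    """w_0 + w_1*x_1 + ... + w_m*x_m (the 'linear classifier' view)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def p_exponential(w, x):
    """e^s / (1 + e^s) (the 'exponential / log-linear' view)."""
    e = math.exp(linear_score(w, x))
    return e / (1.0 + e)


def p_logistic(w, x):
    """1 / (1 + e^-s) (the 'logistic' view)."""
    return 1.0 / (1.0 + math.exp(-linear_score(w, x)))


def classify(w, x):
    """Positive (1) if the linear sum is above 0, negative (0) otherwise."""
    return 1 if linear_score(w, x) > 0 else 0


# Hypothetical trained weights and examples, for illustration only.
w = [-1.0, 0.8, 0.4]
examples = [[2.0, 1.0], [0.5, 0.5], [0.0, 3.0], [1.0, 0.0]]

labels = [classify(w, x) for x in examples]
log_odds = [math.log(p_logistic(w, x) / (1.0 - p_logistic(w, x))) for x in examples]

print(labels)
```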

3 Logistic regression
Find the best fit of the data based on a logistic function.

Training logistic regression models
How should we learn the parameters for logistic regression (i.e. the w's)?
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
$P(1 \mid x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m)}}$
The $w_0, \ldots, w_m$ are the parameters.

Training logistic regression models
Idea 1: minimize the squared error (like linear regression).
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
$\text{error}(h) = \sum_i (y_i - h(f_i))^2$
Any problems?
We don't know what the actual probability values are.

A digression
Why is this called Maximum Likelihood Estimation (MLE)?
Parsed sentences -> Learning/Training -> Grammar
$P(\alpha \to \beta \mid \alpha) = \frac{\text{count}(\alpha \to \beta)}{\text{count}(\alpha)}$
Grammar (English):
  S -> NP VP      0.9
  S -> VP         0.1
  NP -> Det A N   0.5
  NP -> NP PP     0.3
  NP -> PropN     0.2
  A -> ε          0.6
  A -> Adj A      0.4
  PP -> Prep NP   1.0
  VP -> V NP      0.7
  VP -> VP PP     0.3

4 MLE
Maximum likelihood estimation picks the values for the model parameters that maximize the likelihood of the training data.
The model ($\Theta$) consists of the parameters (the grammar rules) and their parameter values (the rule probabilities):
  S -> NP VP      0.9
  S -> VP         0.1
  NP -> Det A N   0.5
  NP -> NP PP     0.3
  NP -> PropN     0.2
  A -> ε          0.6
  A -> Adj A      0.4
  PP -> Prep NP   1.0
  VP -> V NP      0.7
  VP -> VP PP     0.3

MLE
Maximum likelihood estimation picks the values for the model parameters that maximize the likelihood of the training data.
$\text{MLE}(\text{data}) = \arg\max_\theta p_\theta(\text{data}) = \arg\max_\theta \prod_i p_\theta(\text{data}_i) = \arg\max_\theta \sum_i \log p_\theta(\text{data}_i)$
If this is what you want to optimize, you can do NO BETTER than MLE.

MLE example
You flip a coin 100 times. 60 times you get heads.
What is the MLE for heads?
p(head) = 0.60
What is the likelihood of the data under this model (each coin flip is a data point)?
$\text{likelihood}(\text{data}) = \prod_i p_\theta(\text{data}_i)$
$\log(0.60^{60} \cdot 0.40^{40}) = -67.3$

MLE example
Can we do any better?
$\text{likelihood}(\text{data}) = \prod_i p_\theta(\text{data}_i)$
p(heads) = 0.5: $\log(0.50^{60} \cdot 0.50^{40}) = -69.3$
p(heads) = 0.7: $\log(0.70^{60} \cdot 0.30^{40}) = -69.6$
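The coin example can be reproduced directly. This sketch uses natural logarithms (which is what matches the -69.3 quoted for p(heads) = 0.5); the function and variable names are chosen here for illustration.

```python
import math


def log_likelihood(p_heads, heads, tails):
    """log of p_heads^heads * (1 - p_heads)^tails, treating each flip as one data point."""
    return heads * math.log(p_heads) + tails * math.log(1.0 - p_heads)


heads, tails = 60, 40

# The MLE for a coin is the observed fraction of heads.
mle = heads / (heads + tails)

# Compare the MLE against the two alternatives from the slide.
scores = {p: log_likelihood(p, heads, tails) for p in (0.5, 0.6, 0.7)}
best = max(scores, key=scores.get)

for p, value in scores.items():
    print(f"p(heads) = {p}: log-likelihood = {value:.1f}")
```

Both alternatives give a lower log-likelihood than p(heads) = 0.6, which is the sense in which nothing does better than the MLE on this objective.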

5 Training logistic regression models
Idea 1: minimize the squared error (like linear regression).
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
$\text{error}(h) = \sum_i (y_i - h(f_i))^2$
We don't know what the actual probability values are.
Ideas?

Training logistic regression models
Idea 2: maximum likelihood training.
$\text{MLE}(\text{data}) = \arg\max_\theta p_\theta(\text{data}) = \arg\max_w \prod_i p_w(\text{label}_i \mid f_i) = \arg\max_w \sum_i \log p_w(\text{label}_i \mid f_i)$
How do we solve this?
1. Plug in our logistic equation.
2. Take partial derivatives and solve.
Unfortunately, there is no closed-form solution.

Convex functions
Convex functions look something like:
[Figure: plot of a convex function]
What are some nice properties of convex functions?
How can we find the minimum/maximum of a convex function?

6 Finding the minimum
You're blindfolded, but you can see out of the bottom of the blindfold to the ground right by your feet. I drop you off somewhere and tell you that you're in a convex-shaped valley and escape is at the bottom/minimum. How do you get out?

One approach: gradient descent
Partial derivatives give us the slope in that dimension.
Approach:
  pick a starting point (w)
  repeat until the likelihood can't increase in any dimension:
    pick a dimension
    move a small amount in that dimension towards increasing likelihood (using the derivative)

Gradient descent
  pick a starting point (w)
  repeat until the loss no longer decreases in any dimension:
    pick a dimension
    move a small amount in that dimension towards decreasing loss (using the derivative)
$w_i = w_i - \eta \frac{d}{dw_i} \text{error}(w)$
$\eta$ is the learning rate (how much we want to move in the error direction).

Solving convex functions
Gradient descent is just one approach.
A whole field called convex optimization.
Lots of well-known methods:
  Conjugate gradient
  Generalized Iterative Scaling (GIS)
  Improved Iterative Scaling (IIS)
  Limited-memory quasi-Newton (L-BFGS)
The key: if we get an error function that is convex, we can minimize/maximize it (eventually).
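A minimal sketch of gradient descent for logistic regression follows. Two things here are assumptions rather than the slides' method: the loss is taken to be the negative log-likelihood of the labels, and every weight is updated on each step using the full gradient (the common batch form) instead of picking one dimension at a time. The data set, learning rate, and step count are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_prob(w, x):
    """P(1 | x) under logistic regression; w[0] is the bias weight w_0."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def loss(w, data):
    """Negative log-likelihood of the labels (a convex function of w)."""
    total = 0.0
    for x, y in data:
        p = predict_prob(w, x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def gradient(w, data):
    """Partial derivative of the loss with respect to each weight."""
    grad = [0.0] * len(w)
    for x, y in data:
        diff = predict_prob(w, x) - y
        grad[0] += diff
        for j, xj in enumerate(x):
            grad[j + 1] += diff * xj
    return grad


def gradient_descent(data, num_weights, learning_rate=0.05, steps=500):
    """Start at w = 0 and repeatedly apply w_i = w_i - learning_rate * d loss / d w_i."""
    w = [0.0] * num_weights
    losses = [loss(w, data)]
    for _ in range(steps):
        grad = gradient(w, data)
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
        losses.append(loss(w, data))
    return w, losses


# Hypothetical one-feature training set: negatives on the left, positives on the right.
data = [([-2.0], 0), ([-1.5], 0), ([-1.0], 0), ([1.0], 1), ([1.5], 1), ([2.0], 1)]

w, losses = gradient_descent(data, num_weights=2)
predictions = [1 if predict_prob(w, x) > 0.5 else 0 for x, _ in data]

print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
```

In practice one would usually hand the same loss and gradient to an off-the-shelf optimizer such as L-BFGS rather than hand-tuning a learning rate.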

7 Another thought experiment
What is a 100,000-dimensional space like?
You're a 1-D creature, and you decide to buy a 2-unit apartment.
2 rooms (very skinny rooms)

Another thought experiment
What is a 100,000-dimensional space like?
Your job's going well and you're making good money. You upgrade to a 2-D apartment with 2 units per dimension.
4 rooms (very flat rooms)

Another thought experiment
What is a 100,000-dimensional space like?
You get promoted again and start having kids and decide to upgrade to another dimension.
8 rooms (very normal rooms)
Each time you add a dimension, the amount of space you have to work with goes up exponentially.

Another thought experiment
What is a 100,000-dimensional space like?
Larry Page steps down as CEO of Google and they ask you if you'd like the job. You decide to upgrade to a 100,000-dimensional apartment.
How much room do you have? Can you have a big party?
$2^{100{,}000}$ rooms (it's very quiet and lonely) = roughly $10^{30{,}000}$ rooms per person if you invited everyone on the planet
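The room counts in the thought experiment are easy to check. The sketch works in base-10 logarithms so that the huge number never has to be written out; the population figure is an assumption (roughly 7 billion people, about right for 2011).

```python
import math

units_per_dimension = 2

# Rooms in the 1-D, 2-D and 3-D apartments: 2, 4, 8.
small_apartments = [units_per_dimension ** d for d in (1, 2, 3)]

# For 100,000 dimensions, work with log10 of the room count instead of the count itself.
dimensions = 100_000
log10_rooms = dimensions * math.log10(units_per_dimension)

people = 7e9  # assumed world population, not a number from the slides
log10_rooms_per_person = log10_rooms - math.log10(people)

print(small_apartments)
print(f"rooms: about 10^{log10_rooms:.0f}")
print(f"rooms per person: about 10^{log10_rooms_per_person:.0f}")
```

Dividing by the whole population of the planet only removes about ten from an exponent of about thirty thousand, which is why the apartment stays lonely.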

8 The challenge
Because logistic regression has fewer constraints (than, say, NB) it has a lot more options.
We're trying to find 100,000 w values (or a point in a 100,000-dimensional space).
It's easy for logistic regression to fit to nuances in the data: overfitting.

Overfitting
Given these points as training data, which is a better line to learn to separate the points?
[Figure: training points with candidate separating lines]

Preventing overfitting
$\log \frac{P(1 \mid x)}{1 - P(1 \mid x)} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$
We want to avoid any single feature having too much weight.
Normal MLE:
$\text{MLE}(\text{data}) = \arg\max_w \log p_w(y \mid f)$
Ideas?

Preventing overfitting
We want to avoid any single feature having too much weight.
Normal MLE:
$\text{MLE}(\text{data}) = \arg\max_w \log p_w(y \mid f)$
Regularized MLE:
$\text{MLE}(\text{data}) = \arg\max_w \log p_w(y \mid f) - \lambda \sum_{j=1}^{m} w_j^2$
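The effect of the squared-weight penalty can be seen on a toy problem. The sketch below is an illustration under stated assumptions: a single weight with no bias term, the log-likelihood summed over the training examples, plain gradient ascent, and made-up data and penalty strength (`lam`). None of these specifics come from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, lam, learning_rate=0.1, steps=200):
    """Gradient ascent on  sum_i log p_w(y_i | x_i) - lam * w^2  for one weight w."""
    w = 0.0
    for _ in range(steps):
        # Derivative of the log-likelihood term.
        grad_loglik = sum((y - sigmoid(w * x)) * x for x, y in data)
        # Derivative of the penalty term lam * w^2.
        grad_penalty = 2.0 * lam * w
        w += learning_rate * (grad_loglik - grad_penalty)
    return w


# Hypothetical, perfectly separable data: without a penalty the weight keeps growing.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w_plain = train(data, lam=0.0)
w_regularized = train(data, lam=1.0)

print(f"no penalty: w = {w_plain:.3f}, with penalty: w = {w_regularized:.3f}")
```

On separable data like this the unpenalized weight grows for as long as training continues, while the penalized weight settles at a smaller value.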

9 Preventing overfitting: regularization
Regularized MLE:
$\text{MLE}(\text{data}) = \arg\max_w \log p_w(y \mid f) - \lambda \sum_{j=1}^{m} w_j^2$
What effect will this have on the learned weights, assuming a positive $\lambda$?
  penalize large weights
  encourage smaller weights
  still a convex problem
  equivalent to assuming your $w_j$ are distributed from a Gaussian with mean 0 (called a prior)

NB vs. Logistic regression
NB and logistic regression look very similar:
  both are probabilistic models
  both are linear
  both learn parameters that maximize the log-likelihood of the training data
How are they different?

NB vs. Logistic regression
NB:
$f_1 \log P(f_1 \mid l) + (1 - f_1) \log(1 - P(f_1 \mid l)) + \cdots + \log P(l)$
Estimates the weights under the strict assumption that the features are independent.
Naïve Bayes is called a generative model; it models the joint distribution p(features, labels).
Logistic regression:
$\frac{e^{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m}}{1 + e^{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m}}$
If the NB assumption doesn't hold, we can adjust the weights to compensate for this.
Logistic regression is called a discriminative model; it models the conditional distribution p(labels | features) directly.

Some historical perspective

10 Estimating the best chess state
Write a function that takes as input a state representation of tic tac toe and scores how good it is for you if you're X. How would you do it?
(Called a state evaluation function)

State evaluation function for chess

Old school optimization
Possible parses (or whatever) have scores.
Pick the one with the best score.
How do you define the score?
  Completely ad hoc
  Throw anything you want into the mix
  Add a bonus for this, a penalty for that, etc.

Old school optimization
"Learning": adjust bonuses and penalties by hand to improve performance. :)
Total kludge, but totally flexible too.
Can throw in any intuitions you might have.
But we're purists: we only use probabilities.

New revolution?
Probabilities

11 New revolution?
Probabilities
Exposé at 9
Probabilistic Revolution Not Really a Revolution, Critics Say
Probabilities no more than scores in disguise
"We're just adding stuff up like the old corrupt regime did," admits spokesperson

83% of Probabilists Rally Behind Paradigm
".2, .4, .6, .8, we're not gonna take your bait"
1. Can estimate our parameters automatically
   e.g., p(t7 | t5, t6) (trigram probability) from supervised or unsupervised data
2. Our results are more meaningful
   Can use probabilities to place bets, quantify risk
   e.g., how sure are we that this is the correct parse?
3. Our results can be meaningfully combined: modularity
   Multiply independent conditional probabilities (normalized, unlike scores)
   p(English text) * p(English phonemes | English text) * p(Jap. phonemes | English phonemes) * p(Jap. text | Jap. phonemes)
   p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology)

Probabilists Regret Being Bound by Principle
The ad-hoc approach does have one advantage.
Consider e.g. Naïve Bayes for spam categorization:
"Buy this supercalifragilistic Ginsu knife set for only $39 today"
Some useful features:
  Contains "Buy"
  Contains "supercalifragilistic"
  Contains a dollar amount under $100
  Contains an imperative sentence
  Reading level = 8th grade
  Mentions money (use word classes and/or regexp to detect this)
Any problem with these features for NB?

Probabilists Regret Being Bound by Principle
"Buy this supercalifragilistic Ginsu knife set for only $39 today"
Naïve Bayes features:
  Contains a dollar amount under $100
  Mentions money (use word classes and/or regexp to detect this)
[Table: probability of each feature ("<$100", "Money amount") under spam and not-spam]
How likely is it to see both features in either class using NB? Is this right?

12 Probabilists Regret Being Bound by Principle
"Buy this supercalifragilistic Ginsu knife set for only $39 today"
Naïve Bayes features:
  Contains a dollar amount under $100
  Mentions money (use word classes and/or regexp to detect this)
[Table: probability of each feature ("<$100", "Money amount") under spam and not-spam]
NB probability of seeing both features:
  spam: 0.5 * 0.9 = 0.45
  not-spam: 0.02 * 0.1 = 0.002
Overestimates!
The problem is that the features are not independent.

NB vs. Logistic regression
Logistic regression allows us to put in features that overlap and adjust the probabilities accordingly.
Which to use?
  NB is better for small data sets: strong model assumptions keep the model from overfitting.
  Logistic regression is better for larger data sets: it can exploit the fact that the NB assumption is rarely true.

NB vs. Logistic regression
[Figure]

NB vs. Logistic regression
[Figure]
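The overestimate can be made concrete with the numbers on this page. In the sketch below, 0.02 is the value implied by the product 0.02 * 0.1 = 0.002. The last step rests on an assumption that is not stated in the slides: that 0.5 and 0.02 belong to the "<$100" feature and that any message with a dollar amount under $100 also mentions money, so the two features overlap completely.

```python
import math

# Per-feature probabilities for the two overlapping features, by class.
spam_probs = [0.5, 0.9]
not_spam_probs = [0.02, 0.1]

# Naive Bayes multiplies the per-feature probabilities as if they were independent.
nb_both_spam = math.prod(spam_probs)
nb_both_not_spam = math.prod(not_spam_probs)

# How much more likely NB thinks "both features" is in spam than in not-spam.
nb_ratio = nb_both_spam / nb_both_not_spam

# Assumption for illustration: the first feature implies the second, so seeing both
# is exactly as likely as seeing the first one alone.
overlap_ratio = spam_probs[0] / not_spam_probs[0]

print(f"NB evidence ratio: {nb_ratio:.0f}x, ratio if features fully overlap: {overlap_ratio:.0f}x")
```

Under that assumption NB counts the same piece of evidence twice, which is the kind of error logistic regression can correct by adjusting the weights of overlapping features.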

13 Logistic regression with more classes
NB works on multiple classes.
Logistic regression only works on two classes.
Idea: something like logistic regression, but with more classes.
Like NB, one model per class.
The model is a weight vector:
$P(\text{class}_1 \mid x) = e^{w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2 + \cdots + w_{1,m} x_m}$
$P(\text{class}_2 \mid x) = e^{w_{2,0} + w_{2,1} x_1 + w_{2,2} x_2 + \cdots + w_{2,m} x_m}$
$P(\text{class}_3 \mid x) = e^{w_{3,0} + w_{3,1} x_1 + w_{3,2} x_2 + \cdots + w_{3,m} x_m}$
Anything wrong with this?

Challenge: probabilistic modeling
The class scores above are supposed to be probabilities, so they must sum to one:
$P(\text{class}_1 \mid x) + P(\text{class}_2 \mid x) + P(\text{class}_3 \mid x) + \cdots = 1$
Ideas?

Maximum Entropy Modeling aka Multinomial Logistic Regression
Normalize each class probability by the sum over all $C$ classes:
$P(\text{class}_1 \mid x) = \frac{e^{w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2 + \cdots + w_{1,m} x_m}}{\sum_{i=1}^{C} e^{w_{i,0} + w_{i,1} x_1 + w_{i,2} x_2 + \cdots + w_{i,m} x_m}}$
The denominator is the normalizing constant.

Log-linear model
$\log P(\text{class}_1 \mid x) = w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2 + \cdots + w_{1,m} x_m - \log \sum_{i=1}^{C} e^{w_{i,0} + w_{i,1} x_1 + w_{i,2} x_2 + \cdots + w_{i,m} x_m}$
  still just a linear combination of feature weightings
  class-specific features
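The normalization step can be sketched as follows. The weight vectors and the example are made up for illustration, and subtracting the largest score before exponentiating is a standard numerical-stability trick that is not part of the slides (it leaves the resulting probabilities unchanged).

```python
import math


def class_probabilities(weight_vectors, x):
    """One weight vector per class; returns P(class_i | x) for every class."""
    # Linear score per class: w_{i,0} + w_{i,1}*x_1 + ... + w_{i,m}*x_m.
    scores = [w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)) for w in weight_vectors]
    # Shift by the max score so exp() cannot overflow; the ratio below is unaffected.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    normalizer = sum(exps)  # the normalizing constant
    return [e / normalizer for e in exps]


# Hypothetical weights for three classes over two features.
weight_vectors = [
    [0.0, 1.0, -1.0],   # class 1
    [0.5, -0.5, 0.5],   # class 2
    [-1.0, 0.2, 0.3],   # class 3
]
x = [2.0, 1.0]

probs = class_probabilities(weight_vectors, x)
predicted_class = probs.index(max(probs)) + 1  # 1-based, to match class_1, class_2, ...

print(probs, predicted_class)
```

Unlike the raw exponentials, the normalized values always sum to one, so they can be read as probabilities.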

14 Training the model
Can still use maximum likelihood training:
$\text{MLE}(\text{data}) = \arg\max_\theta \sum_i \log p(\text{label}_i \mid f_i)$
Use regularization:
$\text{MLE}(\text{data}) = \arg\max_\theta \sum_i \log p(\text{label}_i \mid f_i) - \lambda R(\theta)$
Plug into a convex optimization package.
There are a few complications, but this is the basic idea.

Maximum Entropy
Suppose there are 10 classes, A through J.
I don't give you any other information.
Question: Given a new example m, what is your guess for p(C | m)?
Suppose I tell you that 55% of all examples are in class A.
Question: Now what is your guess for p(C | m)?
Suppose I also tell you that 10% of all examples contain "Buy" and 80% of these are in class A or C.
Question: Now what is your guess for p(C | m), if m contains "Buy"?

Maximum Entropy
[Table: probability of each class A B C D E F G H I J]
Qualitatively
  Maximum entropy principle: given the constraints, pick the probabilities as "equally as possible".
Quantitatively
  Maximum entropy: given the constraints, pick the probabilities so as to maximize the entropy.
$\text{Entropy}(\text{model}) = -\sum_c p(c) \log p(c)$
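The "as equally as possible" principle can be checked numerically for the first two questions. The three distributions below are assumptions made for illustration: a uniform guess over the 10 classes, class A fixed at 0.55 with the rest split evenly, and class A fixed at 0.55 with the rest split unevenly.

```python
import math


def entropy(probs):
    """Entropy(model) = -sum_c p(c) log p(c), using natural logs."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# No information at all: spread the probability evenly over classes A through J.
uniform = [0.1] * 10

# Constraint: 55% of examples are in class A. Remaining 0.45 split evenly over 9 classes.
a_fixed_even = [0.55] + [0.05] * 9

# Same constraint, but the remaining 0.45 is split unevenly (an arbitrary alternative).
a_fixed_uneven = [0.55, 0.25] + [0.025] * 8

for name, dist in [("uniform", uniform), ("even", a_fixed_even), ("uneven", a_fixed_uneven)]:
    print(f"{name}: entropy = {entropy(dist):.3f}")
```

Among the distributions that satisfy the constraint on class A, the even split has the higher entropy, so the maximum entropy guess for p(C | m) after the second piece of information would be 0.05 under these assumptions.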

15 Maximum Entropy
[Table: rows Buy and Other; columns A B C D E F G H I J]
Column A sums to 0.55 ("55% of all messages are in class A").

Maximum Entropy
[Table: rows Buy and Other; columns A B C D E F G H I J]
Column A sums to 0.55.
Row Buy sums to 0.1 ("10% of all messages contain Buy").

Maximum Entropy
[Table: rows Buy and Other; columns A B C D E F G H I J]
Column A sums to 0.55.
Row Buy sums to 0.1.
(Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%").
Given these constraints, fill in the cells "as equally as possible": maximize the entropy (related to cross-entropy, perplexity).
$\text{Entropy} = -p(\text{Buy}, A) \log p(\text{Buy}, A) - p(\text{Buy}, B) \log p(\text{Buy}, B) - p(\text{Buy}, C) \log p(\text{Buy}, C) - \cdots$
Largest if the probabilities are evenly distributed.

Maximum Entropy
[Table: rows Buy and Other; columns A B C D E F G H I J]
Column A sums to 0.55.
Row Buy sums to 0.1.
(Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%").
Given these constraints, fill in the cells "as equally as possible": maximize the entropy.
Now p(Buy, C) = .029 and p(C | Buy) = .29.
We got a compromise: p(C | Buy) < p(A | Buy) < .55

16 Generalizing to More Features
[Table: class columns A through H; rows Buy and Other, further split by <$100 and Other]

What we just did
For each feature ("contains Buy"), see what fraction of the training data has it.
Many distributions p(c, m) would predict these fractions.
Of these, pick the distribution that has maximum entropy.
Amazing Theorem: The maximum entropy model is the same as the maximum likelihood model.
If we calculate the maximum likelihood parameters, we're also calculating the maximum entropy model.

What to take home
Many learning approaches:
  Bayesian approaches (of which NB is just one)
  Linear regression
  Logistic regression
  Maximum Entropy (multinomial logistic regression)
  SVMs
  Decision trees
Different models have different strengths/weaknesses/uses.
  Understand what the model is doing.
  Understand what assumptions the model is making.
  Pick the model that makes the most sense for your problem/data.
Feature selection is important.

Articles discussion