Logistic Regression
Steve Kroon
Course notes sections: 24.3-24.4

Disclaimer: these notes do not explicitly indicate whether values are vectors or scalars, but expect the reader to discern this from the context.

Scenario: supervised classification

We are given training data {(x_i, y_i) : i = 1, ..., n} from some (mixture) distribution, where y_i indicates class membership. Aim: given a new x, predict the corresponding y. This situation is often not deterministic (e.g. given height and weight info, predicting gender).

Using class membership probabilities

Given prior probabilities for each class, and a generative model for each class, we can use the maximum likelihood estimate. This is the class the point has the highest probability of being in. To do this, we only need to know which class has highest P(y|x) at each x. More generally, we might not want to pick the class with highest probability (e.g. spam classification, cancer diagnosis, extreme sports). Deciding when this is the case, and what to do then, is the subject of decision theory. The theory makes use of a so-called loss function. The key insight is that the actual probabilities of each class are useful beyond just the maximum. However, we still only need to know P(y|x) at each x.

Generative vs discriminative models

We can get class probabilities if we have generative models. (A generative model is a full specification of P(x, y).) The key issue: the more parameters you have to estimate from data, the less sure you are of each estimate.
Since we usually don't actually know the model for each class, we must estimate it from the class data. Two phases: certain assumptions/prior knowledge, such as normality; followed by estimating parameters from the data. If we need the model, there is no problem with this approach. However, if we only want to classify, we don't need to know the marginal distribution P(x), even though generative models provide this information. Discriminative models are specifications of the conditional distribution P(y|x). Since generative models usually have more parameters than discriminative ones, discriminative models often outperform generative models for classification. Note that generative models can be used for tasks discriminative models can't perform.

What should a discriminative model look like?

We don't know a model for P(y|x), and have no intuition yet. To develop an intuition, let us look at what P(y|x) looks like when we do know the models generating the data. Assume we have two classes C1 and C2, with prior probabilities P(C1) and P(C2). Then

    P(C1|x) = P(C1, x) / P(x)
            = P(x|C1)P(C1) / [P(x|C1)P(C1) + P(x|C2)P(C2)]
            = 1 / [1 + P(x|C2)P(C2) / (P(x|C1)P(C1))]
            = 1 / (1 + exp(-a(x)))
            = σ(a(x))

where we conveniently define

    a(x) = ln [P(x|C1)P(C1) / (P(x|C2)P(C2))]

and the logistic function σ(y) = 1/(1 + exp(-y)). Note that σ(y) lies in (0, 1), and that a(x) is the so-called log-odds for class membership of x. (You should see a similar expression turning up in assignment 2.) Also it is worth verifying that the derivative of σ(y) is σ(y)(1 - σ(y)).

Next, we will assume the classes each have Gaussian distributions, with means µ1 and µ2 and covariance matrices Σ1 and Σ2. What is a(x) then?

    a(x) = ln P(x|C1) - ln P(x|C2) + ln [P(C1)/P(C2)]
         = -(1/2) ln |2πΣ1| - (1/2)(x - µ1)^T Σ1^{-1} (x - µ1)
           + (1/2) ln |2πΣ2| + (1/2)(x - µ2)^T Σ2^{-1} (x - µ2)
           + ln [P(C1)/P(C2)]
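As a quick sanity check of the identity P(C1|x) = σ(a(x)), the sketch below uses hypothetical 1-D Gaussian class models and priors (all values chosen purely for illustration) and computes the posterior twice: directly via Bayes' rule, and via the logistic of the log-odds.

```python
import math

def gauss_pdf(x, mu, s):
    # 1-D Gaussian density N(x; mu, s^2)
    return math.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def sigma(a):
    # the logistic function
    return 1.0 / (1.0 + math.exp(-a))

# hypothetical class-conditional models and priors, for illustration only
mu1, s1, p1 = 1.0, 1.5, 0.6
mu2, s2, p2 = -1.0, 0.8, 0.4

x = 0.3
num1 = p1 * gauss_pdf(x, mu1, s1)
num2 = p2 * gauss_pdf(x, mu2, s2)
posterior = num1 / (num1 + num2)   # Bayes' rule directly
a = math.log(num1 / num2)          # the log-odds a(x)
via_logistic = sigma(a)            # sigma(a(x)) -- should agree
```

The two routes agree to floating-point precision, since the derivation above is an algebraic identity.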
If we further assume that Σ1 = Σ2 = Σ, we get some cancellation, yielding

    a(x) = ln [P(C1)/P(C2)] - (1/2)[(x - µ1)^T Σ^{-1} (x - µ1) - (x - µ2)^T Σ^{-1} (x - µ2)]

Multiplying out, we get:

    -(1/2)[(µ2 - µ1)^T Σ^{-1} x + x^T Σ^{-1} (µ2 - µ1) + (µ1^T Σ^{-1} µ1 - µ2^T Σ^{-1} µ2)] + ln [P(C1)/P(C2)]
        = [Σ^{-1}(µ1 - µ2)]^T x + [-(1/2)µ1^T Σ^{-1} µ1 + (1/2)µ2^T Σ^{-1} µ2 + ln [P(C1)/P(C2)]]
        = w^T x + w0

where these equations define w and w0. Thus, we find that for 2 classes with equal covariances, but different means, the log-odds is a linear function of the observations. It follows that in this case, if we used the data to directly estimate the means and covariance matrix, we would estimate 2d + d(d+1)/2 parameters, while if we could directly estimate (w, w0), we would only be estimating d + 1 parameters.

The multivariate normal case

Let us now consider the same problem, but with k classes. Then

    P(Ci|x) = P(x|Ci)P(Ci) / Σ_j [P(x|Cj)P(Cj)]

We could go the same route as before (dividing the numerator and denominator by P(x|Ci)P(Ci)), but that leads to complications with more than 2 classes. Instead, we shall write a_i(x) = ln [P(x|Ci)P(Ci)], so that P(Ci|x) = exp(a_i(x)) / Σ_j exp(a_j(x)).[2] Again assuming Gaussians with shared covariance, we eventually conclude that a_i(x) = w_i^T x + w_i0, where

    w_i = Σ^{-1} µ_i    and    w_i0 = -(1/2) µ_i^T Σ^{-1} µ_i + ln P(Ci)

Comparing the number of parameters, we have kd + d(d+1)/2 for a generative approach versus k(d + 1) for the discriminative approach. If we restrict ourselves to using a diagonal covariance matrix in the generative approach, a la Naive Bayes, the number of parameters is reduced to k(d + 1). But now there is a higher chance the model is wrong.

Finding w

These examples motivate modelling P(y|x) by a logistic function of the log-odds of the observation, which we model using linear functions (for 2 classes); or a softmax function, using linear functions of the observations as exponents[1]

[1] If we assume different covariance matrices, we get a quadratic function of the observations.
[2] Thus a(x) in the two-class case is a_1(x) - a_2(x).
(for multi-class problems). More generally, we could use quadratic functions, or even more generally, a linear function of some transformation of the observation. The extension to transformations of the data is in the textbook; we will stick to the linear case here. However, we add a 1 to the feature vector for each observation to get rid of the inconvenient w0.

Let us try to select w using maximum likelihood on a training set. Thus, we try to identify which selection of w was most likely to generate the labels in the training set! We begin by writing down the likelihood of the training data as a function of w (binary case in notes).[3] We have P(X, Y|w) = P(Y|X, w)P(X|w), but since P(X|w) = P(X), this equals

    P(Y|X, w)P(X) = P(X) Π_i P(y_i|x_i, w)

This factorization assumes that the label of x_i is conditionally independent of other observations and labels, given the observation x_i.[4] For mathematical convenience define t_ij = 1 if y_i = Cj, and 0 for the other k - 1 classes.[5] Then the likelihood becomes

    P(X) Π_i Π_j P(y_i = Cj|x_i, w)^{t_ij}

In order to maximize this, we minimize the negative log-likelihood w.r.t. w. This equals

    -ln P(X) - Σ_i Σ_j t_ij ln P(y_i = Cj|x_i, w)

Writing P(y_i = Cj|x_i, w) = exp(a_j(x_i)) / Σ_r exp(a_r(x_i)), we get

    -ln P(X) - Σ_i Σ_j t_ij [a_j(x_i) - ln Σ_r exp(a_r(x_i))]

where the a_r are linear functions of x_i: a_r(x_i) = w_r^T x_i. To minimize, we take the gradient w.r.t. w_v:[6]

    ∇_{w_v} = -Σ_i [t_iv x_i - (exp(a_v(x_i)) / Σ_r exp(a_r(x_i))) x_i]
            = Σ_i [exp(a_v(x_i)) / Σ_r exp(a_r(x_i)) - t_iv] x_i

For an optimum all k of these gradients must simultaneously be zero. This is a non-linear system of k(d + 1) equations in k(d + 1) unknowns, so we will make use of a numerical optimization technique, Newton-Raphson optimization.

[3] Here X is the observation matrix and Y the vector of labels.
[4] A common setting for supervised learning is assuming IID data, which satisfies this. This assumption keeps things simple, even if often not quite true.
[5] This is known as a 1-of-K encoding.
[6] Note that ln P(X) is constant w.r.t. w, allowing this term to be removed from the optimization problem. With generative models, this is not the case.
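The softmax probabilities and the gradient Σ_i [y_iv - t_iv] x_i can be checked against a finite-difference approximation of the negative log-likelihood. A minimal sketch, with toy data and helper names (nll, nll_grad) invented for illustration:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def activations(W, x):
    # a_v(x) = w_v^T x for each class v
    return [sum(wd * xd for wd, xd in zip(wv, x)) for wv in W]

def nll(W, X, T):
    # negative log-likelihood, with the constant ln P(X) dropped
    total = 0.0
    for x, t in zip(X, T):
        p = softmax(activations(W, x))
        total -= sum(tj * math.log(pj) for tj, pj in zip(t, p))
    return total

def nll_grad(W, X, T):
    # gradient w.r.t. w_v: sum_i (y_iv - t_iv) x_i
    G = [[0.0] * len(X[0]) for _ in W]
    for x, t in zip(X, T):
        p = softmax(activations(W, x))
        for v in range(len(W)):
            for d in range(len(x)):
                G[v][d] += (p[v] - t[v]) * x[d]
    return G

# toy data: bias feature 1 prepended, 3 classes, 1-of-K targets
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
T = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W = [[0.1, -0.3], [0.2, 0.4], [-0.1, 0.0]]

G = nll_grad(W, X, T)

# finite-difference check of one gradient entry, (v, d) = (1, 1)
eps = 1e-6
W_hi = [row[:] for row in W]; W_hi[1][1] += eps
W_lo = [row[:] for row in W]; W_lo[1][1] -= eps
fd = (nll(W_hi, X, T) - nll(W_lo, X, T)) / (2 * eps)
```

The analytic and numerical gradients should agree to several decimal places, which is a useful debugging habit whenever you derive gradients by hand.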
Newton-Raphson for multi-class logistic regression

Recall that we wanted to find the elements of w minimizing the negative log-likelihood

    l(w) = -ln P(X) - Σ_i Σ_j t_ij [a_j(x_i) - ln Σ_r exp(a_r(x_i))]

with a_r(x_i) = w_r^T x_i. Setting the gradient ∇l(w) to zero directly yielded a large non-linear system of equations, which we could not solve analytically. To apply Newton-Raphson to logistic regression, we need not only l and ∇l, but also the Hessian H_l, so we must do further differentiation. To simplify this, let us define y_v(b) = exp(b_v) / Σ_r exp(b_r), with y_iv = y_v(a(x_i)). In this notation, the gradient of the negative log-likelihood (w.r.t. w_v) turns out to simply be Σ_i [y_iv - t_iv] x_i. Another advantage of this definition is to simplify the calculus. Let us first calculate ∇_b y_v. We have

    ∂y_v/∂b_v = exp(b_v)(Σ_r exp(b_r) - exp(b_v)) / (Σ_r exp(b_r))^2 = y_v(b)(1 - y_v(b))

similar to the derivative of the logistic function, while for j ≠ v we have

    ∂y_v/∂b_j = -exp(b_v) exp(b_j) / (Σ_r exp(b_r))^2 = -y_v(b) y_j(b)

Note that these results can be pooled as ∂y_v/∂b_j = y_v(b)(I_vj - y_j(b)), so that we do not need to handle the case j = v separately. Using this, we can find the entries of the Hessian as follows (where x_i(k) denotes the k-th component of x_i):[7]

    ∂²l / (∂w_{v1,d1} ∂w_{v2,d2}) = ∂/∂w_{v1,d1} Σ_i [y_iv2 - t_iv2] x_i(d2)
        = Σ_i x_i(d2) ∂y_v2(a(x_i)) / ∂w_{v1,d1}
        = Σ_i x_i(d2) Σ_j y_v2(a(x_i)) (I_{v2,j} - y_j(a(x_i))) ∂a_j(x_i)/∂w_{v1,d1}

where the last step follows from the chain rule. Now, ∂a_j(x_i)/∂w_{v1,d1} is zero for j ≠ v1, and x_i(d1) for j = v1, so that the above expression equals

    Σ_i y_v2(a(x_i)) (I_{v2,v1} - y_v1(a(x_i))) x_i(d2) x_i(d1)

so that we can write the block of the Hessian corresponding to w_v1 and w_v2 as

    Σ_i y_iv2 (I_{v2,v1} - y_iv1) x_i x_i^T

Now that we can calculate the Hessian and the gradient, we can start with an initial guess (for example, setting all the w's to zero initially), and then apply Newton-Raphson updates. We leave showing that the Hessian is positive semi-definite to the interested reader.

[7] Here w_{v,d_j} refers to the d_j-th component of w_v.
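The pooled derivative ∂y_v/∂b_j = y_v(b)(I_vj - y_j(b)) is easy to verify numerically. A small sketch (the name softmax_jacobian is mine, not from the notes):

```python
import math

def softmax(b):
    m = max(b)
    e = [math.exp(v - m) for v in b]
    z = sum(e)
    return [v / z for v in e]

def softmax_jacobian(b):
    # pooled derivative: dy_v/db_j = y_v(b) (I_vj - y_j(b))
    y = softmax(b)
    k = len(b)
    return [[y[v] * ((1.0 if v == j else 0.0) - y[j]) for j in range(k)]
            for v in range(k)]

b = [0.2, -0.5, 1.3]
J = softmax_jacobian(b)

# central finite-difference check of dy_0/db_2
eps = 1e-6
b_hi = b[:]; b_hi[2] += eps
b_lo = b[:]; b_lo[2] -= eps
fd = (softmax(b_hi)[0] - softmax(b_lo)[0]) / (2 * eps)
```

Note that the diagonal entries reproduce the logistic-style y(1 - y) form, while the off-diagonal entries are the negative products -y_v y_j from the derivation.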
Two-class logistic regression

It is worth noticing that the solution to the multi-class problem above is not unique: adding a constant to any component of all the w vectors yields the same solution. Thus, we can assume that the solution vector for one of the classes is the zero vector. This means that we only need to find the w vectors for k - 1 classes, rather than k. In the binary case, this simplifies things considerably.

It is left to the reader to verify that after the adjustment mentioned in the previous paragraph (setting w_1 = 0), the softmax function for the class probability for the first class reduces to the logistic function discussed earlier, where w now represents the adjusted weight w_2.[8]

The negative log-likelihood of the observations, as obtained earlier, is

    l = -ln P(X) - Σ_i Σ_j t_ij ln P(y_i = Cj|x_i, w)

In this case, we have P(y_i = C1|x_i, w) = σ(a(x_i)) and P(y_i = C2|x_i, w) = 1 - σ(a(x_i)), with a(x) = w^T x (again, an extra 1 feature has been added to the observations to cater for the bias term). For the binary case, it is more convenient to replace the 1-of-K encoding t_ij with a binary encoding: t_i = 1 if x_i is in class 1, and 0 otherwise. Then, the negative log-likelihood becomes

    -ln P(X) - Σ_i (t_i ln σ(a(x_i)) + (1 - t_i) ln(1 - σ(a(x_i))))

Next we derive the gradient and Hessian of l:

    ∂l/∂w_d = -Σ_i t_i [σ(a(x_i))]^{-1} σ(a(x_i))(1 - σ(a(x_i))) x_i(d)
              + Σ_i (1 - t_i)[1 - σ(a(x_i))]^{-1} σ(a(x_i))(1 - σ(a(x_i))) x_i(d)
            = -Σ_i [t_i(1 - σ(a(x_i))) - (1 - t_i)σ(a(x_i))] x_i(d)
            = Σ_i (σ(a(x_i)) - t_i) x_i(d)

so that ∇_w l = Σ_i (σ(a(x_i)) - t_i) x_i. Next

    ∂²l/(∂w_d1 ∂w_d2) = Σ_i x_i(d2) σ(a(x_i))[1 - σ(a(x_i))] x_i(d1)

so that H_l(w) = Σ_i σ(a(x_i))[1 - σ(a(x_i))] x_i x_i^T. In order to ensure that our optimization finds a minimum, we show that the Hessian matrix is positive semi-definite. First note that for any i, x_i x_i^T is positive semi-definite, since for any u,

    u^T x_i x_i^T u = (x_i^T u)^T (x_i^T u) = ‖x_i^T u‖² ≥ 0

[8] It should also be easy to verify that the new weight vector w equals the difference of the weight vectors obtained using the multi-class approach, i.e. w_2 - w_1.
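The binary gradient and Hessian just derived are all Newton-Raphson needs. The following sketch runs the update w ← w - H⁻¹∇l on a tiny non-separable toy set with d = 2 (a bias feature plus one input); the data and the function name are illustrative only.

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def newton_logreg(X, t, iters=20):
    # Newton-Raphson for the binary negative log-likelihood, d = 2:
    #   gradient: sum_i (sigma(a(x_i)) - t_i) x_i
    #   Hessian:  sum_i sigma(a(x_i)) (1 - sigma(a(x_i))) x_i x_i^T
    w = [0.0, 0.0]
    for _ in range(iters):
        g = [0.0, 0.0]
        H = [[0.0, 0.0], [0.0, 0.0]]
        for x, ti in zip(X, t):
            p = sigma(w[0] * x[0] + w[1] * x[1])
            for d in range(2):
                g[d] += (p - ti) * x[d]
                for e in range(2):
                    H[d][e] += p * (1 - p) * x[d] * x[e]
        # w <- w - H^{-1} g, using the explicit 2x2 inverse
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        w[0] -= (H[1][1] * g[0] - H[0][1] * g[1]) / det
        w[1] -= (-H[1][0] * g[0] + H[0][0] * g[1]) / det
    return w

# toy non-separable data; first feature is the constant 1 for the bias
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
t = [0, 1, 0, 1, 1]
w = newton_logreg(X, t)
```

At convergence the gradient Σ_i (σ(a(x_i)) - t_i) x_i should be essentially zero, which is a direct check that we have reached a stationary point of the negative log-likelihood.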
Next we note that since the range of σ is (0, 1), the coefficient of x_i x_i^T is always positive, so that the Hessian is a sum of positive semi-definite matrices, and is thus positive semi-definite.[9] Finally, we can apply Newton-Raphson optimization[10] to the log-likelihood to obtain the weight vector w.

A complication with logistic regression: overfitting

Suppose that a weight vector w leads to a perfect classification of the data set in the binary case, using classification by the class with highest probability, and where we consider all classes equally likely. In such a case, we say the data set is linearly separable. For this classification, the classification boundary (or decision surface) lies where P(C1|x) = 0.5. Since P(C1|x) = 1/(1 + exp(-w^T x)), we must have that w^T x = 0, i.e. the decision surface is a line passing through the origin.

Now, consider what happens if we rather classify with w' = 2w. In such a case, the decision boundary and all the point classifications remain the same. However, the likelihood associated with each point now becomes greater, yielding a higher likelihood solution than the original w. We can continue doubling w repeatedly in this way, leading in the limit to a situation where the predicted probabilities become step functions at the decision boundary. (To clarify this, draw the logistic function as its argument increases.)

Although the maths is different, this behaviour manifests to varying degrees even when there are multiple classes, prior probabilities for the classes are unequal, and the classes are not linearly separable. This is a common problem with many machine learning approaches that estimate parameters by optimization, and is known as overfitting. To see why this is a problem, note that your model is now very confident about its classification of future points close to the decision boundary, even though it has never observed data there! Essentially, the only guide available to the algorithm is the data it has been given, and the linear constraint on the decision surface. However, we usually do not expect the probabilities of the classes to change abruptly between 0 and 1.
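The doubling argument can be seen concretely: on a linearly separable toy set, scaling w up leaves every classification unchanged but strictly increases the log-likelihood. A minimal 1-D sketch (data chosen for illustration):

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_lik(w, X, t):
    # log-likelihood of the labels (the constant ln P(X) is dropped)
    total = 0.0
    for x, ti in zip(X, t):
        p = sigma(w * x)  # predicted P(C1 | x)
        total += math.log(p if ti == 1 else 1.0 - p)
    return total

# 1-D data, linearly separable with the decision boundary at the origin
X = [-2.0, -1.0, 1.0, 2.0]
t = [0, 0, 1, 1]

ll_w = log_lik(1.0, X, t)
ll_2w = log_lik(2.0, X, t)
ll_4w = log_lik(4.0, X, t)
```

Each doubling pushes every σ(w x_i) closer to its target of 0 or 1, so maximum likelihood has no finite optimum here; the weights grow without bound.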
Thus, we must find some way of taking this into account in our calculations. There are two major approaches to doing this, which are closely related: first, one can use a prior distribution on the weight vector w, which is then updated by the likelihood calculation to obtain a maximum a posteriori (MAP) estimate; second, one can penalize choices of w leading to undesirable behaviour in our classifier by adding an extra term to the likelihood function; this is known as regularization. The relationship between these approaches is that the size of the penalty, or regularization, term should depend on how likely you think certain values of w are in advance. Thus, the choice of regularization function effectively encodes a prior distribution on the parameters under investigation into the likelihood,

[9] Under fairly general conditions, the Hessian can in fact be shown to be positive definite, by noting that it is a weighted sum of the rank-1 matrices x_i x_i^T. However, we do not go into that here, since the regularization we apply later will easily lead to a positive definite Hessian, in any case.
[10] Because each quadratic approximation step in the Newton-Raphson optimization is effectively a weighted least squares fit to the data (a common approach for estimating parameters in statistics), this procedure is sometimes called iterative reweighted least squares (IRLS).
so that the optimum of this regularized likelihood function is actually the MAP estimate for the corresponding prior.[11]

Priors and regularization

Let us assume a normal prior distribution on w. We would prefer smaller w, so let us set the mean of the prior to 0. Also, we have no reason to expect that certain components of w must be larger than others, or that they should be correlated, so let us assume a diagonal covariance matrix, with equal entries on the diagonal (i.e. Σ = λI for some λ > 0). Given the data set (X, Y), what is the posterior distribution for w? We have

    p(w|X, Y) = p(w) p(X, Y|w) / p(X, Y)

where the denominator is not dependent on w. We can maximize this by minimizing the negative logarithm of the numerator,

    -log p(w) - log p(X, Y|w) = w^T w / (2λ) + l(w) + C

for a constant C, and in this formulation we see that the prior distribution on w has led to the regularization penalty (1/λ)J(w) = w^T w / (2λ). This particular form is very convenient from a calculus point of view, since ∇J(w) = w, and thus H_J(w) = I. The choice of λ determines how strong one wishes the penalty term to be, and must usually be determined empirically. To apply regularization in this context is a straightforward modification of the earlier approach: one still uses Newton-Raphson optimization, but rather than optimizing l(w), one optimizes l(w) + (1/λ)J(w), which has a slightly modified gradient and Hessian.[12]

Choice of λ

Let us next discuss the selection of λ: λ is an example of an algorithm parameter which we can adjust, or tune, in the hope of obtaining good performance for our classifier, although we have no guidance for our selection. One way to get an indication of a good choice is to keep some of our training data aside (let us call this part the validation set), and do parameter estimation on the remaining training set for various choices of λ. Our final choice of λ can then be obtained by comparing the performance of the various classifiers built with differing choices of λ on the validation set. Finally, we might re-estimate the parameters for this choice of λ using the whole original training set for our final classifier.[13]
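Under this prior, the Newton-Raphson update needs only the extra terms w/λ in the gradient and (1/λ)I in the Hessian. A 1-D sketch (illustrative data: the same kind of separable set that makes unregularized maximum likelihood diverge) showing that the MAP estimate stays finite:

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def map_newton_1d(X, t, lam, iters=50):
    # minimize l(w) + w^2/(2*lam):
    # the gradient gains w/lam, the Hessian gains 1/lam
    w = 0.0
    for _ in range(iters):
        g = w / lam
        h = 1.0 / lam
        for x, ti in zip(X, t):
            p = sigma(w * x)
            g += (p - ti) * x
            h += p * (1.0 - p) * x * x
        w -= g / h  # 1-D Newton-Raphson step
    return w

# linearly separable 1-D data; unregularized ML would drive |w| to infinity
X = [-2.0, -1.0, 1.0, 2.0]
t = [0, 0, 1, 1]
w = map_newton_1d(X, t, lam=1.0)
```

The penalty now balances the pressure to scale w up, so the iteration converges to a finite weight, and the regularized Hessian is bounded away from zero.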
[11] Reviewing the math, we see a general rule of thumb that the regularization penalty corresponds roughly to the negative logarithm of the prior distribution, since for MAP estimates, we typically minimize the negative log posterior, which equals the negative log-likelihood plus the negative log-prior.
[12] Note that the modified Hessian is positive definite now, rather than positive semi-definite.
[13] Many other approaches, such as cross-validation, are possible, and finding good approaches to handling parameter tuning is somewhat of an art. Much research has been done in this area, but it is fraught with difficulties.
An alternative view

If we consider l(w) for the binary case, and ignore the constant ln P(X), we have

    -Σ_i (t_i ln σ(a(x_i)) + (1 - t_i) ln(1 - σ(a(x_i))))

For each data point, this function calculates the predicted class probability p for the actual class of that point, and adds -ln p to a total. If the probability is close to one, the amount added is small, but for points which are badly misclassified, the amount added can be much larger. This interpretation helps us understand why maximum likelihood approaches overfit: there is pressure to reduce these penalties. However, when regularization is performed, an extra (1/λ)J(w) is added to this function, in such a way that overfitting to reduce the loss function is prevented by a compensating increase in this regularization term. Many other classification techniques can also be formulated in terms of a regularization function combined with a penalty function on classification of points.
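The growth of the per-point penalty -ln p is easy to see numerically (the probability values below are arbitrary illustrations):

```python
import math

# penalty -ln p added for a point whose true class receives probability p
confident_right = -math.log(0.99)  # well-classified point: tiny penalty
unsure = -math.log(0.5)            # point on the decision boundary
confident_wrong = -math.log(0.01)  # badly misclassified point: large penalty
```

A confidently correct prediction contributes almost nothing, while a confidently wrong one contributes a penalty hundreds of times larger, which is exactly the pressure the regularization term must counterbalance.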