Logistic Regression (CS434)
Bayes (Naïve or Not) Classifiers: The Generative Approach
What do we mean by a generative approach?
- Assume that each data point is generated following a generative process governed by p(y) and p(x|y)
- Learn p(y) and p(x|y), then apply Bayes rule to compute p(y|x) for making predictions
The generative approach is just one type of learning approach used in machine learning.
- Learning a correct generative model p(x|y) is difficult: density estimation is a challenging problem in its own right, and sometimes unnecessary.
- In contrast, LTU, KNN and DT are what we call discriminative methods. They are not concerned with any generative model; they only care about finding a good discriminative function.
- LTU, KNN and DT learn deterministic functions, not probabilistic ones.
- One can also take a probabilistic approach to learning discriminative functions, i.e., learn p(y|x) directly without learning p(x|y). Logistic regression is one such approach.
Logistic Regression
Recall the problem of regression: learn a mapping from an input vector x to a continuous output y. Logistic regression extends traditional regression to handle a binary output y. In particular, we assume that

    P(y=1 \mid x) = g(w_0 + w_1 x_1 + \dots + w_m x_m) = \frac{e^{w^\top x}}{1 + e^{w^\top x}}

where g(t) = \frac{e^t}{1 + e^t} is the sigmoid function.
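As a quick illustration, here is a minimal Python sketch of this model; the weight vector and input below are made-up values, with the bias w_0 folded in via a constant feature x_0 = 1:

```python
import numpy as np

def sigmoid(t):
    # g(t) = e^t / (1 + e^t) = 1 / (1 + e^{-t}); clipping keeps exp from overflowing
    return 1.0 / (1.0 + np.exp(-np.clip(t, -500, 500)))

# P(y=1 | x) under hypothetical weights; the bias w_0 rides on the constant x_0 = 1
w = np.array([0.5, -1.0, 2.0])   # made-up [w_0, w_1, w_2]
x = np.array([1.0, 0.3, 0.8])    # x_0 = 1 carries the bias term
p = sigmoid(w @ x)               # = e^{w^T x} / (1 + e^{w^T x}) ~= 0.86
```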
Logistic Regression
Equivalently, we have the following:

    \log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = w_0 + w_1 x_1 + \dots + w_m x_m

The left-hand side is the log odds of y = 1.
Side note: the odds in favor of an event are the quantity p / (1 - p), where p is the probability of the event. If I toss a fair die, what are the odds that I'll get a six?
In other words, LR assumes that the log odds is a linear function of the input features.
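To make the side note concrete, and to verify that this log-odds form is equivalent to the sigmoid form on the previous slide:

    p = \frac{1}{6} \;\Rightarrow\; \text{odds} = \frac{p}{1-p} = \frac{1/6}{5/6} = \frac{1}{5},
    \qquad
    \log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = w^\top x
    \;\Longleftrightarrow\;
    P(y=1 \mid x) = \frac{e^{w^\top x}}{1 + e^{w^\top x}}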
Learning for Logistic Regression
Given a set of training data points, we would like to find a weight vector w such that

    P(y=1 \mid x) = \frac{e^{w^\top x}}{1 + e^{w^\top x}}

is large (e.g., close to 1) for positive training examples, and small (e.g., close to 0) otherwise. In other words, a good weight vector w should satisfy the following:
- w^\top x should be a large negative value for - points
- w^\top x should be a large positive value for + points
Learning for Logistic Regression
This can be captured by the log likelihood function:

    l(w) = \sum_i \log P(y^i \mid x^i, w)
         = \sum_i \left[ y^i \log P(y=1 \mid x^i, w) + (1 - y^i) \log\bigl(1 - P(y=1 \mid x^i, w)\bigr) \right]

Note that the superscript i is an index to the examples in the training set. This is called the likelihood function of w, and by maximizing this objective function, we perform what we call maximum likelihood estimation of the parameter w.
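A minimal, self-contained Python sketch of this objective; the eps guard is an implementation detail to avoid log(0), not part of the slide's formula:

```python
import numpy as np

def log_likelihood(w, X, y):
    # l(w) = sum_i [ y^i log p^i + (1 - y^i) log(1 - p^i) ],
    # where p^i = P(y=1 | x^i, w) = sigmoid(w^T x^i)
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -500, 500)))
    eps = 1e-12   # guard against log(0) for saturated predictions
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```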
MLE for Logistic Regression

    l(w) = \log \prod_{i=1}^{n} P(y^i \mid x^i, w) = \sum_{i=1}^{n} \log P(y^i \mid x^i, w)

    w_{MLE} = \arg\max_w l(w)
            = \arg\max_w \sum_{i=1}^{n} \log P(y^i \mid x^i, w)
            = \arg\max_w \sum_{i=1}^{n} \left[ y^i \log P(y=1 \mid x^i, w) + (1 - y^i) \log\bigl(1 - P(y=1 \mid x^i, w)\bigr) \right]

Equivalently, given a set of training data points, we would like to find a weight vector w such that P(y=1 \mid x^i, w) is large (e.g., close to 1) for positive training examples, and small (e.g., close to 0) otherwise: the same as our intuition.
Optimizing l(w)
Unfortunately, this does not have a closed-form solution. Instead, we iteratively search for the optimal w:
- Start with a random w
- Iteratively improve (similar to Perceptron) by moving in the gradient direction (the fastest increasing direction)
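The slides do not derive the gradient, but the update rule below follows from it. A short derivation, writing p^i = P(y=1 \mid x^i, w) and using the sigmoid identity g'(t) = g(t)(1 - g(t)):

    \frac{\partial l(w)}{\partial w}
    = \sum_i \left( \frac{y^i}{p^i} - \frac{1 - y^i}{1 - p^i} \right) p^i (1 - p^i)\, x^i
    = \sum_i \left( y^i - p^i \right) x^i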
Gradient Descent/Ascent Example
- Start from a random initial point
- Iteratively move in the direction that improves the objective at the maximal rate
- Stop when reaching a local optimum (where the gradient is 0)
Batch Learning for Logistic Regression
Note: y takes values 0/1 here, not -1/+1.

    Given: training examples (x^i, y^i), i = 1, ..., N
    Let w = (0, 0, 0, ..., 0)
    Repeat until convergence:
        d = (0, 0, 0, ..., 0)
        For i = 1 to N do:
            error = y^i - 1 / (1 + e^{-w^T x^i})
            d = d + error * x^i        (gradient contribution from the i-th example)
        w = w + eta * d                (eta is the learning rate)
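A runnable NumPy sketch of this batch procedure; the learning rate, tolerance, iteration cap, and the toy data at the bottom are illustrative assumptions, not part of the slides:

```python
import numpy as np

def sigmoid(t):
    # g(t) = 1 / (1 + e^{-t}); clipping keeps exp from overflowing
    return 1.0 / (1.0 + np.exp(-np.clip(t, -500, 500)))

def train_logistic_regression(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    # X: (N, m) inputs with a leading column of 1s for the bias w_0
    # y: (N,) labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        error = y - sigmoid(X @ w)          # y^i - P(y=1 | x^i, w), all examples at once
        d = X.T @ error                     # summed per-example gradient contributions
        w = w + eta * d
        if np.linalg.norm(eta * d) < tol:   # convergence: the update became negligible
            break
    return w

# Toy usage: two 1-D clusters with labels 0 and 1, bias column prepended
rng = np.random.default_rng(0)
X = np.c_[np.ones(20), np.r_[rng.normal(-2, 1, 10), rng.normal(2, 1, 10)]]
y = np.r_[np.zeros(10), np.ones(10)]
w = train_logistic_regression(X, y)
print(w, sigmoid(X @ w).round(2))
```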
Logistic Regression vs. Perceptron
- Note the striking similarity between the two algorithms (see the sketch below)
- In fact, LR learns a linear decision boundary. How so? (Homework assignment)
- What are the differences?
  - Different ways to train the weights
  - LR produces a probability estimate
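To make the similarity concrete, here is a side-by-side sketch of the two per-example updates; note that the LR update shown is the stochastic (single-example) analogue of the batch rule above, an assumption of this sketch rather than something given on the slides:

```python
import numpy as np

def perceptron_update(w, x, y, eta=0.1):
    # Perceptron: y in {-1, +1}; update only when x is misclassified
    if y * (w @ x) <= 0:
        w = w + eta * y * x
    return w

def lr_update(w, x, y, eta=0.1):
    # Logistic regression: y in {0, 1}; always update, scaled by the
    # soft error y - P(y=1 | x, w) instead of a hard mistake indicator
    p = 1.0 / (1.0 + np.exp(-np.clip(w @ x, -500, 500)))
    return w + eta * (y - p) * x
```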
Logistic Regression vs. Naïve Bayes
If we use Naïve Bayes and assume a Gaussian distribution for p(x|y), we can show that p(y=1|X) takes the exact same functional form as logistic regression.
What are the differences here? Different ways of training:
- Naïve Bayes estimates θ by maximizing P(X | y = v, θ), and while doing so assumes conditional independence among the attributes
- Logistic regression estimates w by maximizing P(y | x, w) and makes no conditional independence assumption
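A sketch of why this holds, assuming class-conditional Gaussians N(\mu_{jv}, \sigma_j^2) for each attribute j, with variances shared across the two classes (this sharing is what makes the quadratic terms cancel):

    P(y=1 \mid x)
    = \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=1)\,P(y=1) + p(x \mid y=0)\,P(y=0)}
    = \frac{1}{1 + \exp\!\left( \ln\frac{P(y=0)}{P(y=1)} + \sum_j \ln\frac{p(x_j \mid y=0)}{p(x_j \mid y=1)} \right)}
    = \frac{1}{1 + e^{-(w_0 + w^\top x)}}

with w_j = (\mu_{j1} - \mu_{j0}) / \sigma_j^2, since the x_j^2 terms in the Gaussian log-ratios cancel when the variances match, leaving an exponent linear in x: exactly the logistic regression form.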
Comparatively
Naïve Bayes (generative model: P(x|y)):
- Makes a strong conditional independence assumption about the data attributes
- When the assumptions are OK, Naïve Bayes can use a small amount of training data and estimate a reasonable model
Logistic regression (discriminative model: directly learn p(y|X)):
- Has fewer parameters to estimate, but they are tied together, which makes learning harder
- Makes weaker assumptions
- May need a large number of training examples
Bottom line: if the Naïve Bayes assumption holds and the probabilistic models are accurate (i.e., x is Gaussian given y, etc.), NB would be a good choice; otherwise, logistic regression works better.
Summary
- We introduced the concept of generative vs. discriminative methods
  - Given a method that we discussed in class, you need to know which category it belongs to
- Logistic regression
  - Assumes that the log odds of y=1 is a linear function of x (i.e., w^T x)
  - The learning goal is to find a weight vector w such that examples with y=1 are predicted to have high P(y=1|x), and vice versa
  - Maximum likelihood estimation is an approach that achieves this
  - Iterative (gradient ascent) algorithm to learn w using MLE
- Similarities and differences between LR and Perceptron
- Logistic regression learns a linear decision boundary