Bayesian parameter estimation
Nuno Vasconcelos, UCSD
Maximum likelihood
parameter estimation in three steps:
1. choose a parametric model for the probabilities; to make this clear we denote the vector of parameters by Θ
   P_X(x; Θ)
   note that this means that Θ is NOT a random variable
2. assemble a dataset D = {x_1, ..., x_n} of examples drawn independently
3. select the parameters that maximize the probability of the data
   Θ* = arg max_Θ P_X(D; Θ) = arg max_Θ log P_X(D; Θ)
P_X(D; Θ) is the likelihood of the parameter Θ with respect to the data
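As a sketch of the three steps (this example is mine, not from the slides): for a Gaussian model P_X(x; Θ) with Θ = (μ, σ²), the maximization in step 3 has a closed-form solution, the sample mean and the (biased) sample variance.

```python
import numpy as np

# Hypothetical example: ML estimation for a Gaussian model P_X(x; theta),
# theta = (mu, sigma^2). The log-likelihood of an i.i.d. sample is maximized
# in closed form by the sample mean and the (biased) sample variance.
rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=10_000)  # dataset D = {x_1, ..., x_n}

mu_ml = D.mean()                     # arg max over mu of log P_X(D; theta)
var_ml = ((D - mu_ml) ** 2).mean()   # arg max over sigma^2 (note 1/n, not 1/(n-1))

print(mu_ml, var_ml)  # close to the true parameters (2.0, 2.25)
```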
Least squares
there are interesting connections between ML estimation and least squares methods
e.g. in a regression problem we have
- two random variables X and Y
- a dataset of examples D = {(x_1, y_1), ..., (x_n, y_n)}
- a parametric model of the form
  y = f(x; Θ) + ε
  where Θ is a parameter vector, and ε a random variable that accounts for noise, e.g. ε ~ N(0, σ²)
Least squares
assuming that the family of models is known, e.g. the polynomial family
f(x; Θ) = Σ_{i=0}^{K} θ_i x^i
this is really just a problem of parameter estimation where the data is distributed as
P_{Y|X}(y|x; Θ) = G(y, f(x; Θ), σ²)
note that x is always known, and the mean is a function of x and Θ
in the homework, you will show that
Θ* = (ΓᵀΓ)⁻¹ Γᵀ y
Least squares
where
Γ = [ 1  x_1  ...  x_1^K
      ...
      1  x_n  ...  x_n^K ]
conclusion: least squares estimation is really just ML estimation under the assumption of Gaussian noise
- independent, identically distributed sample, ε ~ N(0, σ²)
once again, probability makes the assumptions explicit
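A minimal sketch of this equivalence (example data and names are assumed, not from the slides): build the design matrix Γ with rows [1, x_i, ..., x_i^K] and solve the normal equations Θ* = (ΓᵀΓ)⁻¹Γᵀy, which is the ML estimate under Gaussian noise.

```python
import numpy as np

# Sketch: polynomial least squares as ML estimation under Gaussian noise.
# Gamma is the n x (K+1) design matrix with rows [1, x_i, ..., x_i^K], and
# the ML/least-squares solution is Theta* = (Gamma^T Gamma)^{-1} Gamma^T y.
rng = np.random.default_rng(1)
K = 2
theta_true = np.array([1.0, -2.0, 0.5])        # true polynomial coefficients
x = rng.uniform(-3, 3, size=500)
y = np.polynomial.polynomial.polyval(x, theta_true) + rng.normal(0, 0.3, size=x.shape)

Gamma = np.vander(x, K + 1, increasing=True)   # columns: x^0, x^1, ..., x^K
theta_ls = np.linalg.solve(Gamma.T @ Gamma, Gamma.T @ y)

print(theta_ls)   # close to theta_true
```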
Least squares solution
due to the connection to parameter estimation we can also talk about the quality of the least squares solution
in particular, we know that
- it is unbiased
- its variance goes to zero as the number of points increases
- it is the BLUE estimator for f(x; Θ)
under the statistical formulation we can also see how the optimal estimator changes with assumptions
ML estimation can also lead to (homework)
- weighted least squares
- minimization of L_p norms
- robust estimators
Bayesian parameter estimation
Bayesian parameter estimation is an alternative framework for parameter estimation
it turns out that the division between Bayesian and ML methods is quite fundamental; it stems from a different way of interpreting probabilities
- frequentist vs Bayesian
there is a long debate about which is best; this debate goes to the core of what probabilities mean
to understand it, we have to distinguish two components
- the definition of probability (this does not change)
- the assessment of probability (this changes)
let's start with a brief review of the part that does not change
Probability
probability is a language to deal with processes that are non-deterministic
examples:
- if I flip a coin 100 times, how many times can I expect to see heads?
- what is the weather going to be like tomorrow?
- are my stocks going to be up or down?
- am I in front of a classroom or is this just a picture of it?
Sample space
the most important concept is that of a sample space
our process defines a set of events; these are the outcomes or states of the process
example: we roll a pair of dice; call the value on the up face at the n-th toss x_n
note that possible events such as
- odd number on the second throw
- two sixes
- x_1 = 2 and x_2 = 6
can all be expressed as combinations of the sample space events
(figure: the 36 outcomes on a 6 × 6 grid with axes x_1 and x_2)
Sample space
is the list of possible events that satisfies the following properties:
- finest grain: all possible distinguishable events are listed separately
- mutually exclusive: if one event happens the others do not (if x_1 = 5 it cannot be anything else)
- collectively exhaustive: any possible outcome can be expressed as unions of sample space events
the mutually exclusive property simplifies the calculation of the probability of complex events
collectively exhaustive means that there is no possible outcome to which we cannot assign a probability
Probability measure
probability of an event: a number expressing the chance that the event will be the outcome of the process
probability measure: satisfies three axioms
- P(A) ≥ 0 for any event A
- P(universal event) = 1
- if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
all of this has to do with the definition of probability; it is the same under the Bayesian and frequentist views
what changes is how probabilities are assessed
Frequentist view
under the frequentist view probabilities are relative frequencies
- I throw my dice n times
- in m of those the sum is 5
- I say that P(sum = 5) = m/n
this is intimately connected with the ML method
- it is the ML estimate for the probability of a Bernoulli process with states (sum = 5, everything else)
makes sense when we have a lot of observations
- no bias; decreasing variance; converges to the true probability
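A quick sketch of this convergence (the simulation is mine, not from the slides): for a pair of fair dice the true probability of sum = 5 is 4/36 = 1/9, and the relative frequency m/n approaches it as n grows.

```python
import random

# Sketch: the frequentist/ML estimate P(sum = 5) ~ m/n for a pair of dice.
# The true probability is 4/36 = 1/9, and the relative frequency converges
# to it as the number of throws n grows (law of large numbers).
random.seed(0)

def relative_frequency(n):
    m = sum(1 for _ in range(n)
            if random.randint(1, 6) + random.randint(1, 6) == 5)
    return m / n

for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(n))   # approaches 1/9 ~ 0.111
```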
Problems
there are many instances where we do not have a large number of observations
consider the problem of crossing a street
this is a decision problem with two states
- Y = 0: I am going to get hurt
- Y = 1: I will make it safely
the optimal decision is computable by the Bayes decision rule
- collect some measurements that are informative, e.g. X = {size, distance, speed} of incoming cars
- collect examples under both states and estimate all probabilities
somehow this does not sound like a great idea!
Problems
under the frequentist view you need to repeat an experiment a large number of times to estimate any probabilities
yet, people are very good at estimating probabilities for problems in which it is impossible to set up such experiments
for example:
- will I die if I join the army?
- will Democrats or Republicans win the next election?
- is there a God?
- will I graduate in two years?
to the point where they make life-changing decisions based on these probability estimates (enlisting in the army, etc.)
Subjective probability
this motivates an alternative definition of probabilities
note that this has more to do with how probabilities are assessed than with the probability definition itself
- we still have a sample space, a probability measure, etc.
- however the probabilities are not equated to relative counts
this is usually referred to as subjective probability
- probabilities are degrees of belief in the outcomes of the experiment
- they are individual (vary from person to person)
- they are not ratios of experimental outcomes
e.g.
- for a very religious person P(god exists) ~ 1
- for a casual churchgoer P(god exists) ~ 0.8 (e.g. accepts evolution, etc.)
- for a non-religious person P(god exists) ~ 0
Problems
in practice, why do we care about this?
under the notion of subjective probability, the entire ML framework makes little sense
- there is a magic number that is estimated from the world and determines our beliefs
- to evaluate my estimates I have to run experiments over and over again and measure quantities like bias and variance
- this is not how people behave: when we make estimates we attach a degree of confidence to them, without further experiments
- there is only one model (the ML model) for the probability of the data, no multiple explanations
- there is no way to specify that some models are, a priori, better than others
Bayesian parameter estimation
the main difference with respect to ML is that in the Bayesian case Θ is a random variable
basic concepts:
- training set D = {x_1, ..., x_n} of examples drawn independently
- probability density for the observations given the parameter, P_{X|Θ}(x|θ)
- prior distribution for the parameter configurations, P_Θ(θ), that encodes prior beliefs about them
goal: to compute the posterior distribution P_{Θ|T}(θ|D)
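A minimal sketch of this posterior computation (the conjugate Gaussian model and all numbers are assumed for illustration): Gaussian observations with known variance and a Gaussian prior on the mean give a Gaussian posterior in closed form.

```python
import numpy as np

# Sketch (assumed conjugate model): Gaussian observations with known variance
# sigma2 and a Gaussian prior on the mean theta:
#   P_{X|Theta}(x|theta) = N(theta, sigma2),  P_Theta(theta) = N(mu0, tau0_2).
# The posterior P_{Theta|T}(theta|D) is again Gaussian, with closed-form
# mean and variance combining the prior belief and the data.
rng = np.random.default_rng(2)
sigma2 = 1.0
mu0, tau0_2 = 0.0, 10.0          # broad prior belief about theta
D = rng.normal(3.0, np.sqrt(sigma2), size=50)

n = len(D)
tau_n2 = 1.0 / (1.0 / tau0_2 + n / sigma2)          # posterior variance
mu_n = tau_n2 * (mu0 / tau0_2 + D.sum() / sigma2)   # posterior mean

print(mu_n, tau_n2)   # mean near 3.0, variance ~ sigma2/n for large n
```

Note how the posterior variance shrinks with n: this is the "complete characterization of the uncertainty" discussed on the next slide.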
Bayes vs ML
there are a number of significant differences between Bayesian and ML estimates
D1: ML produces a number, the best estimate
- to measure its goodness we need to measure bias and variance
- this can only be done with repeated experiments
Bayes produces a complete characterization of the parameter from the single dataset
- in addition to the most probable estimate, we obtain a characterization of the uncertainty
(figure: two posterior densities — a narrow one, lower uncertainty, and a broad one, higher uncertainty)
Bayes vs ML
D2: optimal estimate
- under ML there is one best estimate
- under Bayes there is no best estimate, only a random variable that takes different values with different probabilities
- technically speaking, it makes no sense to talk about the best estimate
D3: predictions
- remember that we do not really care about the parameters themselves
- they are needed only in the sense that they allow us to build models that can be used to make predictions (e.g. the BDR)
- unlike ML, Bayes uses ALL information in the training set to make predictions
Bayes vs ML
let's consider the BDR under the 0-1 loss and an independent sample D = {x_1, ..., x_n}
ML-BDR: pick i if
i*(x) = arg max_i P_{X|Y}(x|i; Θ_i*) P_Y(i)
where
Θ_i* = arg max_Θ P_{X|Y}(D|i; Θ)
two steps:
- find Θ_i*
- plug into the BDR
all information not captured by Θ_i* is lost, not used at decision time
Bayes vs ML
note that we know that information is lost
- e.g. we can't even know how good an estimate Θ_i* is unless we run multiple experiments and measure bias/variance
Bayesian BDR
- under the Bayesian framework, everything is conditioned on the training data
- denote by T = {X_1, ..., X_n} the set of random variables from which the training sample D = {x_1, ..., x_n} is drawn
B-BDR: pick i if
i*(x) = arg max_i P_{X|Y,T}(x|i, D) P_Y(i)
the decision is conditioned on the entire training set
Bayesian BDR
to compute the conditional probabilities, we use the marginalization equation
P_{X|Y,T}(x|i, D) = ∫ P_{X|Θ,Y,T}(x|θ, i, D) P_{Θ|Y,T}(θ|i, D) dθ
note 1: when the parameter value is known, x no longer depends on T, e.g. X|Θ ~ N(θ, σ²); we can simplify the equation above into
P_{X|Y,T}(x|i, D) = ∫ P_{X|Θ,Y}(x|θ, i) P_{Θ|Y,T}(θ|i, D) dθ
note 2: once again this can be done in two steps (per class)
- find P_{Θ|T}(θ|D)
- compute P_{X|Y,T}(x|i, D) and plug into the BDR
no training information is lost
Bayesian BDR
in summary, pick i if
i*(x) = arg max_i P_{X|Y,T}(x|i, D) P_Y(i)
where
P_{X|Y,T}(x|i, D) = ∫ P_{X|Y,Θ}(x|i, θ) P_{Θ|Y,T}(θ|i, D) dθ
note:
- as before, the bottom equation is repeated for each class
- hence, we can drop the dependence on the class and consider the more general problem of estimating
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) P_{Θ|T}(θ|D) dθ
The predictive distribution
the distribution
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) P_{Θ|T}(θ|D) dθ
is known as the predictive distribution
this follows from the fact that it allows us to predict the value of x given ALL the information available in the training set
note that it can also be written as
P_{X|T}(x|D) = E_{Θ|T}[ P_{X|Θ}(x|Θ) | T = D ]
since each parameter value defines a model, this is an expectation over all possible models
each model is weighted by its posterior probability, given the training data
The predictive distribution
suppose that
P_{X|Θ}(x|θ) ~ N(θ, 1)   and   P_{Θ|T}(θ|D) ~ N(μ, σ²)
(figure: the posterior P_{Θ|T}(θ|D) assigns weights π_1, π_2, ... to parameter values μ_1, μ_2, ...; each value defines a Gaussian P_{X|Θ}(x|θ_i) carrying that weight)
the predictive distribution is an average of all these Gaussians
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) P_{Θ|T}(θ|D) dθ
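For this Gaussian case the averaging integral can be done in closed form: the predictive distribution is N(μ, 1 + σ²). A small sketch (the numeric values are assumed) checks this by integrating the product of the two Gaussians on a grid.

```python
import numpy as np

# Sketch: for P_{X|Theta}(x|theta) = N(theta, 1) and posterior
# P_{Theta|T}(theta|D) = N(mu, s2), the predictive integral
#   P_{X|T}(x|D) = integral of N(x; theta, 1) N(theta; mu, s2) dtheta
# has the closed form N(x; mu, 1 + s2). We verify this numerically.

def gauss(x, m, v):
    # Gaussian density G(x, m, v) with mean m and variance v
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

mu, s2 = 2.0, 0.5                              # assumed posterior mean/variance
theta = np.linspace(mu - 10.0, mu + 10.0, 20_001)  # integration grid
dtheta = theta[1] - theta[0]

def predictive(x):
    return (gauss(x, theta, 1.0) * gauss(theta, mu, s2)).sum() * dtheta

x = 1.3
print(predictive(x), gauss(x, mu, 1.0 + s2))   # the two values agree
```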
The predictive distribution
Bayes vs ML
- ML: pick one model
- Bayes: average all models
are Bayesian predictions very different from those of ML?
- they can be, unless the posterior is narrow
(figure: a posterior P_{Θ|T}(θ|D) sharply peaked at its maximum — Bayes ~ ML; a broad posterior — very different)
The predictive distribution
hence, ML can be seen as a special case of Bayes
- when you are very confident about the model, picking one is good enough
in coming lectures we will see that if the sample is quite large, the posterior tends to be narrow
- intuitive: given a lot of training data, there is little uncertainty about what the model is
Bayes can make a difference when there is little data
- we have already seen that this is the important case, since the variance of ML tends to go down as the sample increases
overall, Bayes
- regularizes the ML estimate when this is uncertain
- converges to ML when there is a lot of certainty
MAP approximation
this sounds good; why use ML at all?
the main problem with Bayes is that the integral
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) P_{Θ|T}(θ|D) dθ
can be quite nasty
in practice one is frequently forced to use approximations
one possibility is to do something similar to ML, i.e. pick only one model
this can be made to account for the prior by picking the model that has the largest posterior probability given the training data
θ_MAP = arg max_θ P_{Θ|T}(θ|D)
MAP approximation
this can usually be computed since
θ_MAP = arg max_θ P_{Θ|T}(θ|D) = arg max_θ P_{T|Θ}(D|θ) P_Θ(θ)
and corresponds to approximating the posterior by a delta function centered at its maximum
P_{Θ|T}(θ|D) ≈ δ(θ − θ_MAP)
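A minimal sketch of this maximization (the Beta-Bernoulli model and numbers are assumed for illustration): for a coin with P(heads) = θ and a Beta(a, b) prior, arg max over θ of P_{T|Θ}(D|θ) P_Θ(θ) has a closed form, which visibly pulls the ML estimate toward the prior on small samples.

```python
# Sketch (assumed Beta-Bernoulli model): theta is the probability of heads,
# with prior P_Theta(theta) = Beta(a, b). The MAP estimate maximizes
# P_{T|Theta}(D|theta) P_Theta(theta), and has the closed form
#   theta_MAP = (m + a - 1) / (n + a + b - 1),   m = number of heads.
a, b = 5.0, 5.0          # prior belief: theta is probably near 0.5
D = [1, 1, 1, 0, 1]      # small sample: 4 heads in 5 tosses
n, m = len(D), sum(D)

theta_ml = m / n                             # ML: relative frequency, 0.8
theta_map = (m + a - 1) / (n + a + b - 1)    # MAP: pulled toward the prior

print(theta_ml, theta_map)   # 0.8 vs ~0.571: the prior regularizes ML
```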
MAP approximation
in this case
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) δ(θ − θ_MAP) dθ = P_{X|Θ}(x|θ_MAP)
the BDR becomes: pick i if
i*(x) = arg max_i P_{X|Y}(x|i; θ_i^MAP) P_Y(i)
where
θ_i^MAP = arg max_θ P_{T|Y,Θ}(D|i, θ) P_{Θ|Y}(θ|i)
when compared to ML, this has the advantage of still accounting for the prior (although only approximately)
MAP vs ML
ML-BDR: pick i if
i*(x) = arg max_i P_{X|Y}(x|i; Θ_i*) P_Y(i)
where
Θ_i* = arg max_Θ P_{X|Y}(D|i; Θ)
Bayes MAP-BDR: pick i if
i*(x) = arg max_i P_{X|Y}(x|i; θ_i^MAP) P_Y(i)
where
θ_i^MAP = arg max_θ P_{T|Y,Θ}(D|i, θ) P_{Θ|Y}(θ|i)
the difference is non-negligible only when the dataset is small
there are better alternative approximations
The Laplace approximation
this is a method for approximating any distribution P_X(x)
it consists of approximating P_X(x) by a Gaussian centered at its peak
let's assume that
P_X(x) = (1/Z) g(x)
where g(x) is an unnormalized distribution (g(x) > 0, for all x) and Z the normalization constant
Z = ∫ g(x) dx
we make a Taylor series approximation of g(x) at its maximum x_0
Laplace approximation
the Taylor expansion is
log g(x) = log g(x_0) − (c/2)(x − x_0)² + ...
(the first-order term is zero because x_0 is a maximum)
with
c = − d²/dx² log g(x) |_{x = x_0}
and we approximate g(x) by an unnormalized Gaussian
g'(x) = g(x_0) exp{ −(c/2)(x − x_0)² }
and then compute the normalization constant
Z ≈ g(x_0) √(2π/c)
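As a sketch (the target density is my choice, not from the slides): applying these formulas to the unnormalized density g(x) = x^k e^{-x}, whose true normalization constant is Z = k! for integer k, recovers Stirling's formula.

```python
import math

# Sketch: Laplace approximation of the normalization constant of the
# unnormalized density g(x) = x^k e^{-x} (true Z = k! for integer k).
# The mode of log g(x) = k log x - x is x0 = k, and the curvature is
# c = -d^2/dx^2 log g(x) |_{x0} = k / x0^2 = 1/k, so
#   Z ~ g(x0) sqrt(2*pi/c) = k^k e^{-k} sqrt(2*pi*k)   (Stirling's formula).
k = 10
x0 = k                       # maximum of log g(x)
c = k / x0**2                # negative second derivative of log g at x0
z_laplace = x0**k * math.exp(-x0) * math.sqrt(2 * math.pi / c)
z_true = math.factorial(k)

print(z_laplace, z_true)     # ~3.6e6 vs 3628800: within about 1%
```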
Laplace approximation
this can obviously be extended to the multivariate case
the approximation is
log g(x) = log g(x_0) − (1/2)(x − x_0)ᵀ A (x − x_0)
with A the negative Hessian of log g(x) at x_0
A_ij = − ∂²/∂x_i ∂x_j log g(x) |_{x = x_0}
and the normalization constant
Z ≈ g(x_0) (2π)^{d/2} |A|^{−1/2}
in physics this is also called a saddle-point approximation
Laplace approximation
note that the approximation can be made for the predictive distribution
P_{X|T}(x|D) ≈ G(x, x*, A_{X|T}⁻¹)
or for the parameter posterior
P_{Θ|T}(θ|D) ≈ G(θ, θ_MAP, A_{Θ|T}⁻¹)
in which case
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) G(θ, θ_MAP, A_{Θ|T}⁻¹) dθ
this is clearly superior to the MAP approximation
P_{X|T}(x|D) = ∫ P_{X|Θ}(x|θ) δ(θ − θ_MAP) dθ
Other methods
there are two other main alternatives, when this is not enough
- variational approximations
- sampling methods (Markov chain Monte Carlo)
variational approximations consist of
- bounding the intractable function
- searching for the best bound
sampling methods consist of
- designing a Markov chain that has the desired distribution as its equilibrium distribution
- sampling from this chain
sampling methods converge to the true distribution, but convergence is slow and hard to detect
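A minimal sketch of the sampling idea (the Metropolis algorithm, one MCMC method; the target and all settings here are assumed for illustration): the chain is designed so that its equilibrium distribution is the target, and only an unnormalized density is needed, so the intractable constant Z never appears.

```python
import math
import random

# Sketch: a minimal Metropolis sampler. The chain's equilibrium distribution
# is the target p(x) proportional to exp(log_g(x)); only the unnormalized
# log-density is needed, so the normalization constant Z never appears.
random.seed(0)

def log_g(x):
    return -0.5 * (x - 2.0) ** 2 / 0.25   # target: N(2, 0.25), unnormalized

x, samples = 0.0, []
for t in range(200_000):
    x_new = x + random.gauss(0.0, 0.5)    # symmetric random-walk proposal
    # accept with probability min(1, g(x_new)/g(x))
    if math.log(random.random()) < log_g(x_new) - log_g(x):
        x = x_new
    if t >= 10_000:                       # discard burn-in samples
        samples.append(x)

mean = sum(samples) / len(samples)
print(mean)   # close to the target mean 2.0
```

Note the practical caveat from the slide: deciding how much burn-in to discard is exactly the "convergence is hard to detect" problem.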