Machine Learning and Data Mining Lecture Notes


Machine Learning and Data Mining Lecture Notes

CSC 411/D11
Computer Science Department
University of Toronto

Version: February 6, 2012

Copyright © 2010 Aaron Hertzmann and David Fleet

Contents

Conventions and Notation

1 Introduction to Machine Learning
   1.1 Types of Machine Learning
   1.2 A simple problem

2 Linear Regression
   2.1 The 1D case
   2.2 Multidimensional inputs
   2.3 Multidimensional outputs

3 Nonlinear Regression
   3.1 Basis function regression
   3.2 Overfitting and Regularization
   3.3 Artificial Neural Networks
   3.4 K-Nearest Neighbors

4 Quadratics
   4.1 Optimizing a quadratic

5 Basic Probability Theory
   Classical logic
   Basic definitions and rules
   Discrete random variables
   Binomial and Multinomial distributions
   Mathematical expectation

6 Probability Density Functions (PDFs)
   Mathematical expectation, mean, and variance
   Uniform distributions
   Gaussian distributions
   Diagonalization
   Conditional Gaussian distribution

7 Estimation
   Learning a binomial distribution
   Bayes' Rule
   Parameter estimation
   MAP, ML, and Bayes' Estimates
   Learning Gaussians
   7.5 MAP nonlinear regression

8 Classification
   Class Conditionals
   Logistic Regression
   Artificial Neural Networks
   K-Nearest Neighbors Classification
   Generative vs. Discriminative models
   Classification by LS Regression
   Naïve Bayes
   Discrete Input Features
   Learning

9 Gradient Descent
   Finite differences

10 Cross Validation
   Cross-Validation

11 Bayesian Methods
   Bayesian Regression
   Hyperparameters
   Bayesian Model Selection

12 Monte Carlo Methods
   Sampling Gaussians
   Importance Sampling
   Markov Chain Monte Carlo (MCMC)

13 Principal Components Analysis
   The model and learning
   Reconstruction
   Properties of PCA
   Whitening
   Modeling
   Probabilistic PCA

14 Lagrange Multipliers
   Examples
   Least-Squares PCA in one-dimension
   Multiple constraints
   Inequality constraints

15 Clustering
   K-means Clustering
   K-medoids Clustering
   Mixtures of Gaussians
   Learning
   Numerical issues
   The Free Energy
   Proofs
   Relation to K-means
   Degeneracy
   Determining the number of clusters

16 Hidden Markov Models
   Markov Models
   Hidden Markov Models
   Viterbi Algorithm
   The Forward-Backward Algorithm
   EM: The Baum-Welch Algorithm
   Numerical issues: renormalization
   Free Energy
   Most likely state sequences

17 Support Vector Machines
   Maximizing the margin
   Slack Variables for Non-Separable Datasets
   Loss Functions
   The Lagrangian and the Kernel Trick
   Choosing parameters
   Software

18 AdaBoost
   Decision stumps
   Why does it work?
   Early stopping

Conventions and Notation

Scalars are written with lower-case italics, e.g., x. Column-vectors are written in bold, lower-case: x, and matrices are written in bold uppercase: B. The set of real numbers is represented by R; N-dimensional Euclidean space is written R^N.

Aside: Text in aside boxes provides extra background or information that you are not required to know for this course.

Acknowledgements

Graham Taylor and James Martens assisted with preparation of these notes.

1 Introduction to Machine Learning

Machine learning is a set of tools that, broadly speaking, allow us to "teach" computers how to perform tasks by providing examples of how they should be done. For example, suppose we wish to write a program to distinguish between valid email messages and unwanted spam. We could try to write a set of simple rules, for example, flagging messages that contain certain features (such as the word "viagra" or obviously-fake headers). However, writing rules to accurately distinguish which text is valid can actually be quite difficult to do well, resulting either in many missed spam messages, or, worse, many lost emails. Worse, the spammers will actively adjust the way they send spam in order to trick these strategies (e.g., writing "vi@gr@"). Writing effective rules, and keeping them up-to-date, quickly becomes an insurmountable task. Fortunately, machine learning has provided a solution. Modern spam filters are "learned" from examples: we provide the learning algorithm with example emails which we have manually labeled as "ham" (valid email) or "spam" (unwanted email), and the algorithms learn to distinguish between them automatically.

Machine learning is a diverse and exciting field, and there are multiple ways of defining it:

1. The Artificial Intelligence View. Learning is central to human knowledge and intelligence, and, likewise, it is also essential for building intelligent machines. Years of effort in AI have shown that trying to build intelligent computers by programming all the rules cannot be done; automatic learning is crucial. For example, we humans are not born with the ability to understand language; we learn it, and it makes sense to try to have computers learn language instead of trying to program it all in.

2. The Software Engineering View. Machine learning allows us to program computers by example, which can be easier than writing code the traditional way.

3. The Stats View. Machine learning is the marriage of computer science and statistics: computational techniques are applied to statistical problems.
Machine learning has been applied to a vast number of problems in many contexts, beyond the typical statistics problems. Machine learning is often designed with different considerations than statistics (e.g., speed is often more important than accuracy). Often, machine learning methods are broken into two phases:

1. Training: A model is learned from a collection of training data.

2. Application: The model is used to make decisions about some new test data.

For example, in the spam filtering case, the training data constitutes email messages labeled as ham or spam, and each new email message that we receive (and which we wish to classify) is test data. However, there are other ways in which machine learning is used as well.

1.1 Types of Machine Learning

Some of the main types of machine learning are:

1. Supervised Learning, in which the training data is labeled with the correct answers, e.g., "spam" or "ham." The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).

2. Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering.

3. Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.

There are many other types of machine learning as well, for example:

1. Semi-supervised learning, in which only a subset of the training data is labeled;

2. Time-series forecasting, such as in financial markets;

3. Anomaly detection, such as used for fault-detection in factories and in surveillance;

4. Active learning, in which obtaining data is expensive, and so an algorithm must determine which training data to acquire;

and many others.

1.2 A simple problem

Figure 1 shows a 1D regression problem. The goal is to fit a 1D curve to a few points. Which curve is best to fit these points? There are infinitely many curves that fit the data, and, because the data might be noisy, we might not even want to fit the data precisely. Hence, machine learning requires that we make certain choices:

1. How do we parameterize the model we fit? For the example in Figure 1, how do we parameterize the curve; should we try to explain the data with a linear function, a quadratic, or a sinusoidal curve?

2. What criteria (e.g., objective function) do we use to judge the quality of the fit? For example, when fitting a curve to noisy data, it is common to measure the quality of the fit in terms of the squared error between the data we are given and the fitted curve. When minimizing the squared error, the resulting fit is usually called a least-squares estimate.

3. Some types of models and some model parameters can be very expensive to optimize well. How long are we willing to wait for a solution, or can we use approximations (or hand-tuning) instead?

4. Ideally we want to find a model that will provide useful predictions in future situations. That is, although we might learn a model from training data, we ultimately care about how well it works on future test data. When a model fits training data well, but performs poorly on test data, we say that the model has overfit the training data; i.e., the model has fit properties of the input that are not particularly relevant to the task at hand (e.g., Figures 1 (top row and bottom left)). Such properties are referred to as noise. When this happens we say that the model does not generalize well to the test data. Rather, it produces predictions on the test data that are much less accurate than you might have hoped for given the fit to the training data.

Machine learning provides a wide selection of options by which to answer these questions, along with the vast experience of the community as to which methods tend to be successful on a particular class of data-set. Some more advanced methods provide ways of automating some of these choices, such as automatically selecting between alternative models, and there is some beautiful theory that assists in gaining a deeper understanding of learning. In practice, there is no single silver bullet for all learning. Using machine learning in practice requires that you make use of your own prior knowledge and experimentation to solve problems. But with the tools of machine learning, you can do amazing things!

Figure 1: A simple regression problem. The blue circles are measurements (the training data), and the red curves are possible fits to the data. There is no one right answer; the solution we prefer depends on the problem. Ideally we want to find a model that provides good predictions for new inputs (i.e., locations on the x-axis for which we had no training data). We will often prefer simple, smooth models like that in the lower right.

2 Linear Regression

In regression, our goal is to learn a mapping from one real-valued space to another. Linear regression is the simplest form of regression: it is easy to understand, often quite effective, and very efficient to learn and use.

2.1 The 1D case

We will start by considering linear regression in just 1 dimension. Here, our goal is to learn a mapping y = f(x), where x and y are both real-valued scalars (i.e., x \in R, y \in R). We will take f to be a linear function of the form:

    y = wx + b    (1)

where w is a weight and b is a bias. These two scalars are the parameters of the model, which we would like to learn from training data. In particular, we wish to estimate w and b from the N training pairs {(x_i, y_i)}_{i=1}^N. Then, once we have values for w and b, we can compute the y for a new x.

Given two data points (i.e., N = 2), we can exactly solve for the unknown slope w and offset b. (How would you formulate this solution?) Unfortunately, this approach is extremely sensitive to noise in the training data measurements, so you cannot usually trust the resulting model. Instead, we can find much better models when the two parameters are estimated from larger data sets. When N > 2 we will not be able to find unique parameter values for which y_i = wx_i + b for all i, since we have many more constraints than parameters. The best we can hope for is to find the parameters that minimize the residual errors, i.e., y_i - (wx_i + b).

The most commonly-used way to estimate the parameters is by least-squares regression. We define an energy function (a.k.a. objective function):

    E(w, b) = \sum_{i=1}^N (y_i - (wx_i + b))^2    (2)

To estimate w and b, we solve for the w and b that minimize this objective function. This can be done by setting the derivatives to zero and solving:

    dE/db = -2 \sum_i (y_i - (wx_i + b)) = 0    (3)

Solving for b gives us the estimate:

    b = \sum_i y_i / N - w \sum_i x_i / N    (4)
      = \bar{y} - w \bar{x}    (5)

Figure 2: An example of linear regression: the red line is fit to the blue data points.

where we define \bar{x} and \bar{y} as the averages of the x_i's and y_i's, respectively. This equation for b still depends on w, but we can nevertheless substitute it back into the energy function:

    E(w, b) = \sum_i ((y_i - \bar{y}) - w(x_i - \bar{x}))^2    (6)

Then:

    dE/dw = -2 \sum_i ((y_i - \bar{y}) - w(x_i - \bar{x}))(x_i - \bar{x})    (7)

Solving dE/dw = 0 then gives:

    w = \sum_i (y_i - \bar{y})(x_i - \bar{x}) / \sum_i (x_i - \bar{x})^2    (8)

The values w and b are the least-squares estimates for the parameters of the linear regression.

2.2 Multidimensional inputs

Now, suppose we wish to learn a mapping from D-dimensional inputs to scalar outputs: x \in R^D, y \in R. Now, we will learn a vector of weights w, so that the mapping will be:^1

    f(x) = w^T x + b = \sum_{j=1}^D w_j x_j + b.    (9)

^1 Above we used subscripts to index the training set, while here we are using the subscript to index the elements of the input and weight vectors. In what follows the context should make it clear what the index denotes.
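Aside: The 1D estimates in Section 2.1 (Eqs. 5 and 8) are easy to check numerically. Below is a minimal sketch in NumPy (standing in for the MATLAB used elsewhere in these notes); the toy data and variable names are our own.

```python
import numpy as np

# Noiseless toy data from the line y = 2x + 1 (our choice).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares estimates from Eqs. (8) and (5):
#   w = sum_i (y_i - ybar)(x_i - xbar) / sum_i (x_i - xbar)^2
#   b = ybar - w * xbar
xbar, ybar = x.mean(), y.mean()
w = np.sum((y - ybar) * (x - xbar)) / np.sum((x - xbar) ** 2)
b = ybar - w * xbar
```

With noiseless data the true slope and bias are recovered (up to floating-point error); adding noise to y perturbs the estimates, which is exactly the sensitivity discussed above.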

For convenience, we can fold the bias b into the weights, if we augment the inputs with an additional 1. In other words, if we define

    \tilde{w} = [w_1, ..., w_D, b]^T,    \tilde{x} = [x_1, ..., x_D, 1]^T    (10)

then the mapping can be written:

    f(x) = \tilde{w}^T \tilde{x}.    (11)

Given N training input-output pairs, the least-squares objective function is then:

    E(\tilde{w}) = \sum_{i=1}^N (y_i - \tilde{w}^T \tilde{x}_i)^2    (12)

If we stack the outputs in a vector and the inputs in a matrix, then we can also write this as:

    E(\tilde{w}) = ||y - \tilde{X} \tilde{w}||^2    (13)

where

    y = [y_1, ..., y_N]^T,    \tilde{X} = the N x (D+1) matrix whose i-th row is [x_i^T, 1]    (14)

and ||.|| is the usual Euclidean norm, i.e., ||v||^2 = \sum_i v_i^2. (You should verify for yourself that Equations 12 and 13 are equivalent.)

Equation 13 is known as a linear least-squares problem, and can be solved by methods from linear algebra. We can rewrite the objective function as:

    E(\tilde{w}) = (y - \tilde{X} \tilde{w})^T (y - \tilde{X} \tilde{w})    (15)
                 = \tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w} - 2 y^T \tilde{X} \tilde{w} + y^T y    (16)

We can optimize this by setting all values of dE/d\tilde{w}_i = 0 and solving the resulting system of equations (we will cover this in more detail later in Chapter 4; in the meantime, if this is unclear, start by reviewing your linear algebra and vector calculus). The solution is given by:

    \tilde{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y    (17)

(You may wish to verify for yourself that this reduces to the solution for the 1D case in Section 2.1; however, this takes quite a lot of linear algebra and a little cleverness.) The matrix \tilde{X}^+ = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T is called the pseudoinverse of \tilde{X}, and so the solution can also be written:

    \tilde{w} = \tilde{X}^+ y    (18)
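Aside: Eqs. (10) and (17) in code. A short NumPy sketch (toy weights of our own choosing): append a 1 to each input so the bias becomes one more weight, then solve the normal equations.

```python
import numpy as np

# Toy data: y = w^T x + b with true w = [1, -2] and b = 0.5 (our choice).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0]) + 0.5

# Fold the bias into the weights by appending a trailing 1 (Eq. 10).
Xt = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the normal equations (X~^T X~) w~ = X~^T y  (Eq. 17).
w_tilde = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)
```

On this noiseless data, w_tilde recovers [1, -2, 0.5] up to numerical precision; solving the linear system directly is preferable to forming the pseudoinverse explicitly, for the same numerical reasons the text mentions next.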

In MATLAB, one can directly solve the system of equations using the slash operator:

    w = X\y    (19)

There are some subtle differences between these two ways of solving the system of equations. We will not concern ourselves with these here except to say that I recommend using the slash operator rather than the pseudoinverse.

2.3 Multidimensional outputs

In the most general case, both the inputs and outputs may be multidimensional. For example, with D-dimensional inputs, and K-dimensional outputs y \in R^K, a linear mapping from input to output can be written as

    y = \tilde{W}^T \tilde{x}    (20)

where \tilde{W} \in R^{(D+1) x K}. It is convenient to express \tilde{W} in terms of its column vectors, i.e.,

    \tilde{W} = [\tilde{w}_1 ... \tilde{w}_K] = [ w_1 ... w_K ; b_1 ... b_K ].    (21)

In this way we can then express the mapping from the input \tilde{x} to the j-th element of y as y_j = \tilde{w}_j^T \tilde{x}. Now, given N training samples, denoted {\tilde{x}_i, y_i}_{i=1}^N, a natural energy function to minimize in order to estimate \tilde{W} is just the squared residual error over all training samples and all output dimensions, i.e.,

    E(\tilde{W}) = \sum_{i=1}^N \sum_{j=1}^K (y_{i,j} - \tilde{w}_j^T \tilde{x}_i)^2.    (22)

There are several ways to conveniently vectorize this energy function. One way is to express E solely as a sum over output dimensions. That is, let y'_j be the N-dimensional vector comprising the j-th component of each output training vector, i.e., y'_j = [y_{1,j}, y_{2,j}, ..., y_{N,j}]^T. Then we can write

    E(\tilde{W}) = \sum_{j=1}^K ||y'_j - \tilde{X} \tilde{w}_j||^2    (23)

where \tilde{X}^T = [\tilde{x}_1 \tilde{x}_2 ... \tilde{x}_N]. With a little thought you can see that this really amounts to K distinct estimation problems, the solutions for which are given by \tilde{w}_j = \tilde{X}^+ y'_j.

Another common convention is to stack up everything into a matrix equation, i.e.,

    E(\tilde{W}) = ||Y - \tilde{X} \tilde{W}||_F^2    (24)

where Y = [y'_1 ... y'_K], and ||.||_F denotes the Frobenius norm: ||Y||_F^2 = \sum_{i,j} Y_{i,j}^2. You should verify that Equations (23) and (24) are equivalent representations of the energy function in Equation (22). Finally, the solution is again provided by the pseudoinverse:

    \tilde{W} = \tilde{X}^+ Y    (25)

or, in MATLAB, W = X\Y.
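Aside: Eq. (25) in code. In NumPy, np.linalg.lstsq plays the role of MATLAB's slash operator, and it accepts a matrix of targets, so all K output columns are solved in one call; the toy weights below are our own.

```python
import numpy as np

# Multi-output least squares (Eq. 25): each output column is its own
# least-squares problem, solved jointly as W = X~^+ Y.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
Xt = np.hstack([X, np.ones((30, 1))])   # augmented inputs (rows [x_i^T, 1])
W_true = rng.normal(size=(4, 2))        # (D+1) x K true weights, our toy choice
Y = Xt @ W_true                          # noiseless targets

# lstsq ~ MATLAB's X\Y: returns the minimum-norm least-squares solution.
W_hat, *_ = np.linalg.lstsq(Xt, Y, rcond=None)
```

On noiseless data W_hat matches W_true; column j of W_hat is exactly what lstsq would return for the single-output problem with targets Y[:, j], illustrating the "K distinct estimation problems" remark above.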

3 Nonlinear Regression

Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic form, i.e.,

    y = f(x)    (26)

In linear regression, we have f(x) = W^T x + b; the parameters W and b must be fit to data. What nonlinear function do we choose? In principle, f(x) could be anything: it could involve linear functions, sines and cosines, summations, and so on. However, the form we choose will make a big difference on the effectiveness of the regression: a more general model will require more data to fit, and different models are more appropriate for different problems. Ideally, the form of the model would be matched exactly to the underlying phenomenon. If we're modeling a linear process, we'd use a linear regression; if we were modeling a physical process, we could, in principle, model f(x) by the equations of physics.

In many situations, we do not know much about the underlying nature of the process being modeled, or else modeling it precisely is too difficult. In these cases, we typically turn to a few models in machine learning that are widely-used and quite effective for many problems. These methods include basis function regression (including Radial Basis Functions), Artificial Neural Networks, and k-Nearest Neighbors.

There is one other important choice to be made, namely, the choice of objective function for learning, or, equivalently, the underlying noise model. In this section we extend the LS estimators introduced in the previous chapter to include one or more terms to encourage smoothness in the estimated models. It is hoped that smoother models will tend to overfit the training data less and therefore generalize somewhat better.

3.1 Basis function regression

A common choice for the function f(x) is a basis function representation:^2

    y = f(x) = \sum_k w_k b_k(x)    (27)

for the 1D case. The functions b_k(x) are called basis functions.
Often it will be convenient to express this model in vector form, for which we define b(x) = [b_1(x), ..., b_M(x)]^T and w = [w_1, ..., w_M]^T, where M is the number of basis functions. We can then rewrite the model as

    y = f(x) = b(x)^T w    (28)

Two common choices of basis functions are polynomials and Radial Basis Functions (RBF). A simple, common basis for polynomials are the monomials, i.e.,

    b_0(x) = 1,  b_1(x) = x,  b_2(x) = x^2,  b_3(x) = x^3,  ...    (29)

^2 In the machine learning and statistics literature, these representations are often referred to as linear regression, since they are linear functions of the "features" b_k(x).

Figure 3: The first three basis functions of a polynomial basis (x^0, x^1, x^2), and Radial Basis Functions.

With a monomial basis, the regression model has the form

    f(x) = \sum_k w_k x^k.    (30)

Radial Basis Functions, and the resulting regression model, are given by

    b_k(x) = e^{-(x - c_k)^2 / 2\sigma^2},    (31)

    f(x) = \sum_k w_k e^{-(x - c_k)^2 / 2\sigma^2},    (32)

where c_k is the center (i.e., the location) of the basis function and \sigma^2 determines the width of the basis function. Both of these are parameters of the model that must be determined somehow.

In practice there are many other possible choices for basis functions, including sinusoidal functions, and other types of polynomials. Also, basis functions from different families, such as monomials and RBFs, can be combined. We might, for example, form a basis using the first few polynomials and a collection of RBFs. In general we ideally want to choose a family of basis functions such that we get a good fit to the data with a small basis set, so that the number of weights to be estimated is not too large.

To fit these models, we can again use least-squares regression, by minimizing the sum of squared residual error between model predictions and the training data outputs:

    E(w) = \sum_i (y_i - f(x_i))^2 = \sum_i (y_i - \sum_k w_k b_k(x_i))^2    (33)

To minimize this function with respect to w, we note that this objective function has the same form as that for linear regression in the previous chapter, except that the inputs are now the b_k(x_i) values.

In particular, E is still quadratic in the weights w, and hence the weights w can be estimated the same way. That is, we can rewrite the objective function in matrix-vector form to produce

    E(w) = ||y - Bw||^2    (34)

where ||.|| denotes the Euclidean norm, and the elements of the matrix B are given by B_{i,j} = b_j(x_i) (for row i and column j). In MATLAB the least-squares estimate can be computed as w = B\y.

Picking the other parameters. The positions of the centers and the widths of the RBF basis functions cannot be solved for directly in closed form. So we need some other criteria to select them. If we optimize these parameters for the squared-error, then we will end up with one basis center at each data point, and with tiny widths that exactly fit the data. This is a problem, as such a model will not usually provide good predictions for inputs other than those in the training set.

The following heuristics instead are commonly used to determine these parameters without overfitting the training data. To pick the basis centers:

1. Place the centers uniformly spaced in the region containing the data. This is quite simple, but can lead to empty regions with basis functions, and will require an impractical number of basis functions in higher-dimensional input spaces.

2. Place one center at each data point. This is used more often, since it limits the number of centers needed, although it can also be expensive if the number of data points is large.

3. Cluster the data, and use one center for each cluster. We will cover clustering methods later in the course.

To pick the width parameter:

1. Manually try different values of the width and pick the best by trial-and-error.

2. Use the average squared distances (or median distances) to neighboring centers, scaled by a constant, to be the width. This approach also allows you to use different widths for different basis functions, and it allows the basis functions to be spaced non-uniformly.

In later chapters we will discuss other methods for determining these and other parameters of models.
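Aside: RBF regression (Eqs. 31-34) in code. A NumPy sketch using heuristic 2 for the centers (one per data point) and a hand-picked width; the data and parameter values are our own toy choices.

```python
import numpy as np

def rbf_design(x, centers, sigma):
    # B[i, k] = b_k(x_i) = exp(-(x_i - c_k)^2 / (2 sigma^2))  (Eq. 31)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))

# Toy 1D data from a sine curve (as in Figure 4).
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x)

# One center per data point, hand-picked width; then w = B \ y  (Eq. 34).
B = rbf_design(x, centers=x, sigma=0.2)
w, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ w
```

With as many centers as (noiseless) data points, B is square and the fit is essentially exact, which is precisely the overfitting risk the heuristics above are meant to manage once noise is present.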
3.2 Overfitting and Regularization

Directly minimizing squared-error can lead to an effect called overfitting, wherein we fit the training data extremely well (i.e., with low error), yet we obtain a model that produces very poor predictions on future test data whenever the test inputs differ from the training inputs (Figure 4(b)). Overfitting can be understood in many ways, all of which are variations on the same underlying pathology:

1. The problem is insufficiently constrained: for example, if we have ten measurements and ten model parameters, then we can often obtain a perfect fit to the data.

2. Fitting noise: overfitting can occur when the model is so powerful that it can fit the data and also the random noise in the data.

3. Discarding uncertainty: the posterior probability distribution of the unknowns is insufficiently peaked to pick a single estimate. (We will explain what this means in more detail later.)

There are two important solutions to the overfitting problem: adding prior knowledge and handling uncertainty. The latter one we will discuss later in the course.

In many cases, there is some sort of prior knowledge we can leverage. A very common assumption is that the underlying function is likely to be smooth, for example, having small derivatives. Smoothness distinguishes the examples in Figure 4. There is also a practical reason to prefer smoothness, in that assuming smoothness reduces model complexity: it is easier to estimate smooth models from small datasets. In the extreme, if we make no prior assumptions about the nature of the fit then it is impossible to learn and generalize at all; smoothness assumptions are one way of constraining the space of models so that we have any hope of learning from small datasets.

One way to add smoothness is to parameterize the model in a smooth way (e.g., making the width parameter for RBFs larger; using only low-order polynomial basis functions), but this limits the expressiveness of the model. In particular, when we have lots and lots of data, we would like the data to be able to overrule the smoothness assumptions. With large widths, it is impossible to get highly-curved models no matter what the data says. Instead, we can add regularization: an extra term to the learning objective function that prefers smooth models.
For example, for RBF regression with scalar outputs, and with many other types of basis functions or multi-dimensional outputs, this can be done with an objective function of the form:

    E(w) = \underbrace{||y - Bw||^2}_{\text{data term}} + \underbrace{\lambda ||w||^2}_{\text{smoothness term}}    (35)

This objective function has two terms. The first term, called the data term, measures the model fit to the training data. The second term, often called the smoothness term, penalizes non-smoothness (rapid changes in f(x)). This particular smoothness term (||w||^2) is called weight decay, because it tends to make the weights smaller.^3 Weight decay implicitly leads to smoothness with RBF basis functions because the basis functions themselves are smooth, so rapid changes in the slope of f (i.e., high curvature) can only be created in RBFs by adding and subtracting basis functions with large weights. (Ideally, we might directly penalize non-smoothness, e.g., using an objective term that directly penalizes the integral of the squared curvature of f(x), but this is usually impractical.)

^3 Estimation with this objective function is sometimes called Ridge Regression in Statistics.
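Aside: the minimizer of Eq. (35), worked out in Eq. (39) below, is w = (B^T B + \lambda I)^{-1} B^T y. A quick NumPy check (random toy data of our own) that the penalty really does decay the weights:

```python
import numpy as np

# Random toy regression problem (our choice; any full-rank B will do).
rng = np.random.default_rng(2)
B = rng.normal(size=(50, 5))
y = rng.normal(size=50)

# Regularized LS estimate w = (B^T B + lambda I)^{-1} B^T y  (Eq. 39).
lam = 10.0
w_ridge = np.linalg.solve(B.T @ B + lam * np.eye(5), B.T @ y)

# Unregularized LS estimate, for comparison.
w_ls, *_ = np.linalg.lstsq(B, y, rcond=None)

# Weight decay: the regularized solution has a smaller norm.
shrinks = np.linalg.norm(w_ridge) < np.linalg.norm(w_ls)
```

For any \lambda > 0 the ridge solution's norm is no larger than the unregularized one (each singular component is scaled down), which is the "weight decay" behavior named above.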

This regularized least-squares objective function is still quadratic with respect to w and can be optimized in closed-form. To see this, we can rewrite it as follows:

    E(w) = (y - Bw)^T (y - Bw) + \lambda w^T w    (36)
         = w^T B^T B w - 2 w^T B^T y + \lambda w^T w + y^T y    (37)
         = w^T (B^T B + \lambda I) w - 2 w^T B^T y + y^T y    (38)

To minimize E(w), as above, we solve the normal equations \nabla E(w) = 0 (i.e., \partial E / \partial w_i = 0 for all i). This yields the following regularized LS estimate for w:

    w = (B^T B + \lambda I)^{-1} B^T y    (39)

3.3 Artificial Neural Networks

Another choice of basis function is the sigmoid function. "Sigmoid" literally means s-shaped. The most common choice of sigmoid is:

    g(a) = 1 / (1 + e^{-a})    (40)

Sigmoids can be combined to create a model called an Artificial Neural Network (ANN). For regression with multi-dimensional inputs x \in R^{K_2}, and multidimensional outputs y \in R^{K_1}:

    y = f(x) = \sum_j w_j^{(1)} g( \sum_k w_{k,j}^{(2)} x_k + b_j^{(2)} ) + b^{(1)}    (41)

This equation describes a process whereby a linear regressor with weights w^{(2)} is applied to x. The output of this regressor is then put through the nonlinear sigmoid function, the outputs of which act as features to another linear regressor. Thus, note that the inner weights w^{(2)} are distinct parameters from the outer weights w_j^{(1)}. As usual, it is easiest to interpret this model in the 1D case, i.e.,

    y = f(x) = \sum_j w_j^{(1)} g( w_j^{(2)} x + b_j^{(2)} ) + b^{(1)}    (42)

Figure 5 (left) shows plots of g(wx) for different values of w, and Figure 5 (right) shows g(x + b) for different values of b. As can be seen from the figures, the sigmoid function acts more or less like a step function for large values of w, and more like a linear ramp for small values of w. The bias b shifts the function left or right. Hence, the neural network is a linear combination of shifted (smoothed) step functions, linear ramps, and the bias term.

To learn an artificial neural network, we can again write a regularized squared-error objective function:

    E(w, b) = ||y - f(x)||^2 + \lambda ||w||^2    (43)
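Aside: the forward pass of the 1D model in Eq. (42) is only a few lines of code. The network below is hand-built rather than learned, with weights of our own choosing: large inner weights make each sigmoid a near-step, so the output is a sum of shifted steps, as described above.

```python
import numpy as np

def sigmoid(a):
    # Eq. (40): g(a) = 1 / (1 + e^{-a})
    return 1.0 / (1.0 + np.exp(-a))

def ann_1d(x, w1, w2, b2, b1):
    # Eq. (42): f(x) = sum_j w1[j] * g(w2[j] * x + b2[j]) + b1
    return np.sum(w1 * sigmoid(w2 * x + b2)) + b1

# Two hidden units with steep (near-step) sigmoids stepping up near
# x = 0.2 and back down near x = 0.6 (our toy choice of parameters).
w1 = np.array([1.0, -1.0])
w2 = np.array([50.0, 50.0])
b2 = np.array([-10.0, -30.0])
b1 = 0.0
```

Evaluating ann_1d on a grid shows a bump: roughly 0 outside [0.2, 0.6] and roughly 1 inside, a simple illustration of how sums of shifted steps build up curves.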

Figure 4: Least-squares curve fitting of an RBF. In each panel, the blue circles are the training data points, the green dashed line is the original curve, and the red curve is the estimated curve. (a) Point data (blue circles) was taken from a sine curve, and a curve was fit to the points by a least-squares fit. The horizontal axis is x, the vertical axis is y, and the red curve is the estimated f(x). In this case, the fit is essentially perfect. The curve representation is a sum of Gaussian basis functions. (b) Overfitting. Random noise was added to the data points, and the curve was fit again. The curve exactly fits the data points, which does not reproduce the original curve (a green, dashed line) very well. (c) Underfitting. Adding a smoothness term makes the resulting curve too smooth. (In this case, weight decay was used, along with reducing the number of basis functions.) (d) Reducing the strength of the smoothness term yields a better fit.

Figure 5: Left: Sigmoids g(wx) = 1/(1 + e^{-wx}) for various values of w, ranging from linear ramps to smooth steps to nearly hard steps. Right: Sigmoids g(x + b) = 1/(1 + e^{-(x+b)}) with different shifts b (the plotted curves are g(x - 4), g(x), and g(x + 4)).

where w comprises the weights at both levels for all j. Note that we regularize by applying weight decay to the weights (both inner and outer), but not the biases, since only the weights affect the smoothness of the resulting function (why?). Unfortunately, this objective function cannot be optimized in closed-form, and numerical optimization procedures must be used. We will study one such method, gradient descent, in the next chapter.

3.4 K-Nearest Neighbors

At heart, many learning procedures (especially when our prior knowledge is weak) amount to smoothing the training data. RBF fitting is an example of this. However, many of these fitting procedures require making a number of decisions, such as the locations of the basis functions, and can be sensitive to these choices. This raises the question: why not cut out the middleman, and smooth the data directly? This is the idea behind K-Nearest Neighbors regression.

The idea is simple. We first select a parameter K, which is the only parameter to the algorithm. Then, for a new input x, we find the K nearest neighbors to x in the training set, based on their Euclidean distance ||x - x_i||^2. Then, our new output y is simply an average of the training outputs

for those nearest neighbors. This can be expressed as:

    y = (1/K) \sum_{i \in N_K(x)} y_i    (44)

where the set N_K(x) contains the indices of the K training points closest to x. Alternatively, we might take a weighted average of the K-nearest neighbors to give more influence to training points close to x than to those further away:

    y = \sum_{i \in N_K(x)} w(x_i) y_i / \sum_{i \in N_K(x)} w(x_i),    w(x_i) = e^{-||x - x_i||^2 / 2\sigma^2}    (45)

where \sigma^2 is an additional parameter to the algorithm. The parameters K and \sigma control the degree of smoothing performed by the algorithm. In the extreme case of K = 1, the algorithm produces a piecewise-constant function.

K-nearest neighbors is simple and easy to implement; it doesn't require us to muck about at all with different choices of basis functions or regularizations. However, it doesn't compress the data at all: we have to keep around the entire training set in order to use it, which could be very expensive, and we must search the whole data set to make predictions. (The cost of searching can be mitigated with spatial data-structures designed for searching, such as k-d trees and locality-sensitive hashing. We will not cover these methods here.)
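Aside: Eq. (44) in code. The unweighted version of K-NN regression fits in a few lines of NumPy; the tiny training set below is our own toy example.

```python
import numpy as np

def knn_regress(x_new, X, y, K):
    # Eq. (44): average the outputs of the K training points nearest x_new.
    dists = np.sum((X - x_new) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(dists)[:K]            # indices of the K neighbours
    return np.mean(y[nearest])

# Tiny 1D training set (inputs stored as 1-dimensional column vectors).
X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
```

For a query at x = 1.1 with K = 2, the two nearest training inputs are 1.0 and 2.0, so the prediction is their output average, 1.5; with K = 1 the predictor is piecewise-constant, as noted above. The brute-force distance computation here is exactly the whole-data-set search whose cost k-d trees and locality-sensitive hashing are designed to reduce.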

4 Quadratics

The objective functions used in linear least-squares and regularized least-squares are multidimensional quadratics. We now analyze multidimensional quadratics further. We will see many more uses of quadratics further in the course, particularly when dealing with Gaussian distributions.

The general form of a one-dimensional quadratic is given by:

    f(x) = w_2 x^2 + w_1 x + w_0    (46)

This can also be written in a slightly different way (called standard form):

    f(x) = a(x - b)^2 + c    (47)

where a = w_2, b = -w_1 / (2 w_2), c = w_0 - w_1^2 / (4 w_2). These two forms are equivalent, and it is easy to go back and forth between them (e.g., given a, b, c, what are w_0, w_1, w_2?). In the latter form, it is easy to visualize the shape of the curve: it is a bowl, with minimum (or maximum) at b, and the width of the bowl is determined by the magnitude of a; the sign of a tells us which direction the bowl points (positive a means a convex bowl, negative a means a concave bowl), and c tells us how high or low the bowl goes (at x = b). We will now generalize these intuitions for higher-dimensional quadratics.

The general form for a 2D quadratic function is:

    f(x_1, x_2) = w_{1,1} x_1^2 + w_{1,2} x_1 x_2 + w_{2,2} x_2^2 + w_1 x_1 + w_2 x_2 + w_0    (48)

and, for an N-D quadratic, it is:

    f(x_1, ..., x_N) = \sum_{1 \le i \le N, 1 \le j \le N} w_{i,j} x_i x_j + \sum_{1 \le i \le N} w_i x_i + w_0    (49)

Note that there are three sets of terms: the quadratic terms (\sum w_{i,j} x_i x_j), the linear terms (\sum w_i x_i), and the constant term (w_0). Dealing with these summations is rather cumbersome. We can simplify things by using matrix-vector notation. Let x be an N-dimensional column vector, written x = [x_1, ..., x_N]^T. Then we can write a quadratic as:

    f(x) = x^T A x + b^T x + c    (50)

where A is the N x N matrix with entries A_{i,j} = w_{i,j},

    A = [ w_{1,1} ... w_{1,N} ; ... ; w_{N,1} ... w_{N,N} ]    (51)

    b = [w_1, ..., w_N]^T    (52)

    c = w_0    (53)
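Aside: the conversion between Eqs. (46) and (47) is easy to check numerically. A short NumPy sketch (the helper function and example coefficients are ours):

```python
import numpy as np

def to_standard_form(w2, w1, w0):
    # Convert w2 x^2 + w1 x + w0 into a (x - b)^2 + c  (Eq. 47):
    #   a = w2,  b = -w1 / (2 w2),  c = w0 - w1^2 / (4 w2)
    a = w2
    b = -w1 / (2.0 * w2)
    c = w0 - w1 ** 2 / (4.0 * w2)
    return a, b, c

# Example: 3x^2 - 6x + 5 = 3(x - 1)^2 + 2, a convex bowl with minimum at x = 1.
a, b, c = to_standard_form(3.0, -6.0, 5.0)
xs = np.linspace(-2.0, 2.0, 9)
same = np.allclose(3.0 * xs ** 2 - 6.0 * xs + 5.0, a * (xs - b) ** 2 + c)
```

Expanding a(x - b)^2 + c and matching coefficients against Eq. (46) is also a good way to answer the inverse question posed above (given a, b, c, what are w_0, w_1, w_2?).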

You should verify for yourself that these different forms are equivalent: multiply out all the elements of $f(x)$, either in the 2D case or, using summations, the general N-D case.

For many manipulations we will want to do later, it is helpful for $A$ to be symmetric, i.e., to have $w_{i,j} = w_{j,i}$. In fact, it should be clear that these off-diagonal entries are redundant. So, if we are given a quadratic for which $A$ is asymmetric, we can symmetrize it as:

$$f(x) = x^T \left(\tfrac{1}{2}(A + A^T)\right) x + b^T x + c = x^T \tilde{A} x + b^T x + c \qquad (54)$$

and use $\tilde{A} = \tfrac{1}{2}(A + A^T)$ instead. You should confirm for yourself that this is equivalent to the original quadratic.

As before, we can convert the quadratic to a form that leads to clearer interpretation:

$$f(x) = (x - \mu)^T A (x - \mu) + d \qquad (55)$$

where $\mu = -\tfrac{1}{2} A^{-1} b$ and $d = c - \mu^T A \mu$, assuming that $A^{-1}$ exists. Note the similarity here to the 1D case. As before, this function is a bowl shape in N dimensions, with curvature specified by the matrix $A$, and with a single stationary point $\mu$.⁴ However, fully understanding the shape of $f(x)$ is a bit more subtle and interesting.

4.1 Optimizing a quadratic

Suppose we wish to find the stationary points (minima or maxima) of a quadratic

$$f(x) = x^T A x + b^T x + c. \qquad (56)$$

The stationary points occur where all partial derivatives are zero, i.e., $\partial f / \partial x_i = 0$ for all $i$. The gradient of a function is the vector comprising the partial derivatives of the function, i.e.,

$$\nabla f \equiv [\partial f / \partial x_1,\ \partial f / \partial x_2,\ \dots,\ \partial f / \partial x_N]^T. \qquad (57)$$

At stationary points it must therefore be true that $\nabla f = [0, \dots, 0]^T$. Let us assume that $A$ is symmetric (if it is not, then we can symmetrize it as above). Equation 56 is a very common form of cost function (e.g., the log probability of a Gaussian, as we will later see), and so the form of its gradient is important to examine. Due to the linearity of the differentiation operator, we can look at each of the three terms of Eq. 56 separately. The last (constant) term does not depend on $x$, and so we can ignore it because its derivative is zero. Let us examine the first term.
⁴ A stationary point means a setting of $x$ where the gradient is zero.

If we write out the individual terms within the

vectors/matrices, we get:

$$x^T A x = (x_1 \cdots x_N) \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NN} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \qquad (58)$$

$$= (x_1 a_{11} + x_2 a_{21} + \dots + x_N a_{N1},\ \ x_1 a_{12} + x_2 a_{22} + \dots,\ \ \dots,\ \ x_1 a_{1N} + x_2 a_{2N} + \dots + x_N a_{NN}) \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \qquad (59{-}60)$$

$$= x_1^2 a_{11} + x_1 x_2 a_{21} + \dots + x_1 x_N a_{N1} + x_1 x_2 a_{12} + x_2^2 a_{22} + \dots + x_N x_2 a_{N2} + \dots + x_1 x_N a_{1N} + x_2 x_N a_{2N} + \dots + x_N^2 a_{NN} \qquad (61{-}62)$$

$$= \sum_{ij} a_{ij} x_i x_j \qquad (63)$$

The $i$th element of the gradient corresponds to $\partial f / \partial x_i$. So in the expression above, for the terms in the gradient corresponding to each $x_i$, we only need to consider the terms involving $x_i$ (others will have derivative zero), namely

$$x_i^2 a_{ii} + \sum_{j \ne i} x_i x_j (a_{ij} + a_{ji}) \qquad (64)$$

The gradient then has a very simple form:

$$\frac{\partial\, x^T A x}{\partial x_i} = 2 x_i a_{ii} + \sum_{j \ne i} x_j (a_{ij} + a_{ji}). \qquad (65)$$

We can write a single expression for all of the $x_i$ using matrix/vector form:

$$\frac{\partial\, x^T A x}{\partial x} = (A + A^T) x. \qquad (66)$$

You should multiply this out for yourself to see that this corresponds to the individual terms above. If we assume that $A$ is symmetric, then we have

$$\frac{\partial\, x^T A x}{\partial x} = 2 A x. \qquad (67)$$

This is also a very helpful rule that you should remember. The next term in the cost function, $b^T x$, has an even simpler gradient. Note that this is simply a dot product, and the result is a scalar:

$$b^T x = b_1 x_1 + b_2 x_2 + \dots + b_N x_N. \qquad (68)$$

Only one term corresponds to each $x_i$, and so $\partial f / \partial x_i = b_i$. We can again express this in matrix/vector form:

$$\frac{\partial\, b^T x}{\partial x} = b. \qquad (69)$$

This is another helpful rule that you will encounter again. If we use both of the expressions we have just derived, and set the gradient of the cost function to zero, we get:

$$\frac{\partial f(x)}{\partial x} = 2 A x + b = [0, \dots, 0]^T \qquad (70)$$

The optimum is given by the solution to this system of equations (called the normal equations):

$$x = -\frac{1}{2} A^{-1} b \qquad (71)$$

In the case of scalar $x$, this reduces to $x = -b/(2a)$. For linear regression with multidimensional inputs above (see Equation 18): $A = XX^T$ and $b = -2Xy^T$. As an exercise, convince yourself that this is true.
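Equations 70 and 71 are easy to verify numerically; the following sketch uses an arbitrary symmetric, positive definite A (made-up numbers, not from the notes):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # symmetric, positive definite
b = np.array([1.0, -2.0])
c = 4.0

x_star = -0.5 * np.linalg.solve(A, b)    # Eq. (71): x* = -(1/2) A^{-1} b
grad = 2 * A @ x_star + b                # Eq. (70): should be the zero vector
print(x_star, grad)

# Since A is positive definite, x* is a minimum: f is no smaller at nearby points
f = lambda x: x @ A @ x + b @ x + c
for d in [np.array([0.1, 0.0]), np.array([0.0, -0.1]), np.array([0.05, 0.05])]:
    assert f(x_star + d) >= f(x_star)
```

Using `np.linalg.solve` rather than explicitly forming A⁻¹ is the standard numerically stable way to solve the normal equations.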

5 Basic Probability Theory

Probability theory addresses the following fundamental question: how do we reason? Reasoning is central to many areas of human endeavor, including philosophy (what is the best way to make decisions?), cognitive science (how does the mind work?), artificial intelligence (how do we build reasoning machines?), and science (how do we test and develop theories based on experimental data?). In nearly all real-world situations, our data and knowledge about the world is incomplete, indirect, and noisy; hence, uncertainty must be a fundamental part of our decision-making process. Bayesian reasoning provides a formal and consistent way of reasoning in the presence of uncertainty; probabilistic inference is an embodiment of common-sense reasoning.

The approach we focus on here is Bayesian. Bayesian probability theory is distinguished by defining probabilities as degrees of belief. This is in contrast to Frequentist statistics, where the probability of an event is defined as its frequency in the limit of an infinite number of repeated trials.

5.1 Classical logic

Perhaps the most famous attempt to describe a formal system of reasoning is classical logic, originally developed by Aristotle. In classical logic, we have some statements that may be true or false, and we have a set of rules which allow us to determine the truth or falsity of new statements. For example, suppose we introduce two statements, named A and B:

A ≡ "My car was stolen"
B ≡ "My car is not in the parking spot where I remember leaving it"

Moreover, let us assert the rule "A implies B", which we will write as A → B. Then, if A is known to be true, we may deduce logically that B must also be true (if my car is stolen then it won't be in the parking spot where I left it). Alternatively, if I find my car where I left it (B is false, written B̄), then I may infer that it was not stolen (Ā) by the contrapositive B̄ → Ā.

Classical logic provides a model of how humans might reason, and a model of how we might build an intelligent computer.
Unfortunately, classical logic has a significant shortcoming: it assumes that all knowledge is absolute. Logic requires that we know some facts about the world with absolute certainty, and then we may deduce only those facts which must follow with absolute certainty. In the real world, there are almost no facts that we know with absolute certainty — most of what we know about the world we acquire indirectly, through our five senses, or from dialogue with other people. One can therefore conclude that most of what we know about the world is uncertain. (Finding something that we know with certainty has occupied generations of philosophers.)

For example, suppose I discover that my car is not where I remember leaving it (B). Does this mean that it was stolen? No, there are many other explanations — maybe I have forgotten where I left it or maybe it was towed. However, the knowledge of B makes A more plausible — even though I do not know it to be stolen, it becomes a more likely scenario than before. The

actual degree of plausibility depends on other contextual information — did I park it in a safe neighborhood?, did I park it in a handicapped zone?, etc.

Predicting the weather is another task that requires reasoning with uncertain information. While we can make some predictions with great confidence (e.g., we can reliably predict that it will not snow in June, north of the equator), we are often faced with much more difficult questions (will it rain today?) which we must infer from unreliable sources of information (e.g., the weather report, clouds in the sky, yesterday's weather, etc.). In the end, we usually cannot determine for certain whether it will rain, but we do get a degree of certainty upon which to base decisions and decide whether or not to carry an umbrella.

Another important example of uncertain reasoning occurs whenever you meet someone new — at this time, you immediately make hundreds of inferences (mostly unconscious) about who this person is and what their emotions and goals are. You make these decisions based on the person's appearance, the way they are dressed, their facial expressions, their actions, the context in which you meet, and what you have learned from previous experience with other people. Of course, you have no conclusive basis for forming opinions (e.g., the panhandler you meet on the street may be a method actor preparing for a role). However, we need to be able to make judgements about other people based on incomplete information; otherwise, normal interpersonal interaction would be impossible (e.g., how do you really know that everyone isn't out to get you?).

What we need is a way of discussing not just true or false statements, but statements that have varying levels of certainty. In addition, we would like to be able to use our beliefs to reason about the world and interpret it. As we gain new information, our beliefs should change to reflect our greater knowledge. For example, for any two propositions A and B (that may be true or false), if A → B, then strong belief in A should increase our belief in B.
Moreover, strong belief in B may sometimes increase our belief in A as well.

5.2 Basic definitions and rules

The rules of probability theory provide a system for reasoning with uncertainty. There are a number of justifications for the use of probability theory to represent logic (such as Cox's Axioms) that show, for certain particular definitions of common-sense reasoning, that probability theory is the only system that is consistent with common-sense reasoning. We will not cover these here (see, for example, Wikipedia for discussion of the Cox Axioms).

The basic rules of probability theory are as follows.

The probability of a statement A — denoted P(A) — is a real number between 0 and 1, inclusive. P(A) = 1 indicates absolute certainty that A is true, P(A) = 0 indicates absolute certainty that A is false, and values between 0 and 1 correspond to varying degrees of certainty.

The joint probability of two statements A and B — denoted P(A, B) — is the probability that both statements are true (i.e., the probability that the statement "A ∧ B" is true). (Clearly, P(A, B) = P(B, A).)

The conditional probability of A given B — denoted P(A|B) — is the probability that we would assign to A being true, if we knew B to be true. The conditional probability is defined as P(A|B) = P(A, B)/P(B).

The Product Rule:

$$P(A, B) = P(A|B)\, P(B) \qquad (72)$$

In other words, the probability that A and B are both true is given by the probability that B is true, multiplied by the probability we would assign to A if we knew B to be true. Similarly, P(A, B) = P(B|A) P(A). This rule follows directly from the definition of conditional probability.

The Sum Rule:

$$P(A) + P(\bar{A}) = 1 \qquad (73)$$

In other words, the probability of a statement being true and the probability that it is false must sum to 1; our certainty that A is true is in inverse proportion to our certainty that it is not true. A consequence: given a set of mutually exclusive statements $A_i$, exactly one of which must be true, we have

$$\sum_i P(A_i) = 1 \qquad (74)$$

All of the above rules can be made conditional on additional information. For example, given an additional statement C, we can write the Sum Rule as:

$$\sum_i P(A_i \mid C) = 1 \qquad (75)$$

and the Product Rule as

$$P(A, B \mid C) = P(A \mid B, C)\, P(B \mid C) \qquad (76)$$

From these rules, we can further derive many more expressions to relate probabilities. For example, one important operation is called marginalization:

$$P(B) = \sum_i P(A_i, B) \qquad (77)$$

if $A_i$ are mutually exclusive statements, of which exactly one must be true. In the simplest case — where the statement A may be true or false — we can derive:

$$P(B) = P(A, B) + P(\bar{A}, B) \qquad (78)$$

The derivation of this formula is straightforward, using the basic rules of probability theory:

$$P(A) + P(\bar{A}) = 1, \quad \text{Sum Rule} \qquad (79)$$
$$P(A|B) + P(\bar{A}|B) = 1, \quad \text{Conditioning} \qquad (80)$$
$$P(A|B)P(B) + P(\bar{A}|B)P(B) = P(B), \quad \text{Algebra} \qquad (81)$$
$$P(A, B) + P(\bar{A}, B) = P(B), \quad \text{Product Rule} \qquad (82)$$

Marginalization gives us a useful way to compute the probability of a statement B that is intertwined with many other uncertain statements.

Another useful concept is the notion of independence. Two statements are independent if and only if P(A, B) = P(A)P(B). If A and B are independent, then it follows that P(A|B) = P(A) (by combining the Product Rule with the definition of independence). Intuitively, this means that whether or not B is true tells you nothing about whether A is true.

In the rest of these notes, I will always use probabilities as statements about variables. For example, suppose we have a variable x that indicates whether there are one, two, or three people in a room (i.e., the only possibilities are x = 1, x = 2, x = 3). Then, by the Sum Rule, we can derive P(x = 1) + P(x = 2) + P(x = 3) = 1. Probabilities can also describe the range of a real variable. For example, P(y < 5) is the probability that the variable y is less than 5. (We'll discuss continuous random variables and probability densities in more detail in the next chapter.)

To summarize, the basic rules of probability theory:
- $P(A) \in [0, 1]$
- Product Rule: $P(A, B) = P(A|B)\, P(B)$
- Sum Rule: $P(A) + P(\bar{A}) = 1$
- Two statements A and B are independent iff: $P(A, B) = P(A)\, P(B)$
- Marginalizing: $P(B) = \sum_i P(A_i, B)$
- Any basic rule can be made conditional on additional information. For example, it follows from the Product Rule that $P(A, B \mid C) = P(A \mid B, C)\, P(B \mid C)$

Once we have these rules — and a suitable model — we can derive any probability that we want. With some experience, you should be able to derive any desired probability (e.g., P(A|C)) given a basic model.

5.3 Discrete random variables

It is convenient to describe systems in terms of variables.
For example, to describe the weather, we might define a discrete variable w that can take on two values (sunny or rainy), and then try to determine P(w = sunny), i.e., the probability that it will be sunny today. Discrete distributions describe these types of probabilities.

As a concrete example, let's flip a coin. Let c be a variable that indicates the result of the flip: c = heads if the coin lands on its head, and c = tails otherwise. In this chapter and the rest of

these notes, I will use probabilities specifically to refer to values of variables, e.g., P(c = heads) is the probability that the coin lands heads.

What is the probability that the coin lands heads? This probability should be some real number θ, 0 ≤ θ ≤ 1. For most coins, we would say θ = .5. What does this number mean? The number θ is a representation of our belief about the possible values of c. Some examples:

θ = 0: we are absolutely certain the coin will land tails
θ = 1/3: we believe that tails is twice as likely as heads
θ = 1/2: we believe heads and tails are equally likely
θ = 1: we are absolutely certain the coin will land heads

Formally, we denote the probability of the coin coming up heads as P(c = heads), so P(c = heads) = θ. In general, we denote the probability of a specific event "event" as P(event). By the Sum Rule, we know P(c = heads) + P(c = tails) = 1, and thus P(c = tails) = 1 − θ.

Once we flip the coin and observe the result, then we can be pretty sure that we know the value of c; there is no practical need to model the uncertainty in this measurement. However, suppose we do not observe the coin flip, but instead hear about it from a friend, who may be forgetful or untrustworthy. Let f be a variable indicating how the friend claims the coin landed, i.e., f = heads means the friend says that the coin came up heads. Suppose the friend says the coin landed heads — do we believe him, and, if so, with how much certainty? As we shall see, probabilistic reasoning obtains quantitative values that, qualitatively, match our common sense very effectively.

Suppose we know something about our friend's behaviour. We can represent our beliefs with the following probabilities; for example, P(f = heads | c = heads) represents our belief that the friend says "heads" when the coin landed heads.
Because the friend can only say one thing, we can apply the Sum Rule to get:

$$P(f = \text{heads} \mid c = \text{heads}) + P(f = \text{tails} \mid c = \text{heads}) = 1 \qquad (83)$$
$$P(f = \text{heads} \mid c = \text{tails}) + P(f = \text{tails} \mid c = \text{tails}) = 1 \qquad (84)$$

If our friend always tells the truth, then we know P(f = heads | c = heads) = 1 and P(f = tails | c = heads) = 0. If our friend usually lies, then, for example, we might have a small value of P(f = heads | c = heads).

5.4 Binomial and Multinomial distributions

A binomial distribution is the distribution over the number of positive outcomes for a yes/no (binary) experiment, where on each trial the probability of a positive outcome is p ∈ [0, 1]. For example, for n tosses of a coin for which the probability of heads on a single trial is p, the distribution over the number of heads we might observe is a binomial distribution. The binomial distribution over the number of positive outcomes, denoted K, given n trials, each having a positive outcome with probability p, is given by

$$P(K = k) = \binom{n}{k} p^k (1-p)^{n-k} \qquad (85)$$

for k = 0, 1, ..., n, where

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \qquad (86)$$

A multinomial distribution is a natural extension of the binomial distribution to an experiment with k mutually exclusive outcomes, having probabilities $p_j$, for j = 1, ..., k. Of course, to be valid probabilities, $\sum_j p_j = 1$. For example, rolling a die can yield one of six values, each with probability 1/6 (assuming the die is fair). Given n trials, the multinomial distribution specifies the distribution over the number of each of the possible outcomes. Given n trials, and k possible outcomes with probabilities $p_j$, the distribution over the event that outcome j occurs $x_j$ times (where, of course, $\sum_j x_j = n$) is the multinomial distribution, given by

$$P(X_1 = x_1, X_2 = x_2, \dots, X_k = x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \qquad (87)$$

5.5 Mathematical expectation

Suppose each outcome $r_i$ has an associated real value $x_i \in \mathbb{R}$. Then the expected value of x is:

$$E[x] = \sum_i P(r_i)\, x_i. \qquad (88)$$

The expected value of f(x) is given by

$$E[f(x)] = \sum_i P(r_i)\, f(x_i). \qquad (89)$$
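Equations 85, 86, and 88 can be exercised directly with factorials; a small sketch (the values of n and p are arbitrary) that also checks the pmf sums to one and that the expected count is np:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(K = k) for n Bernoulli trials with success probability p (Eqs. 85-86)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12          # distribution sums to 1

# Expected value (Eq. 88): sum over outcomes k of k * P(K = k); for the
# binomial this equals n*p
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
assert abs(mean - n * p) < 1e-12

print(binomial_pmf(5, 10, 0.5))             # most likely number of heads for a fair coin
```

`math.comb` computes the binomial coefficient of Equation 86 exactly in integer arithmetic, which avoids overflow in the separate factorials.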

6 Probability Density Functions (PDFs)

In many cases, we wish to handle data that can be represented as a real-valued random variable, or a real-valued vector $x = [x_1, x_2, \dots, x_n]^T$. Most of the intuitions from discrete variables transfer directly to the continuous case, although there are some subtleties.

We describe the probabilities of a real-valued scalar variable x with a Probability Density Function (PDF), written p(x). Any real-valued function p(x) that satisfies:

$$p(x) \ge 0 \quad \text{for all } x \qquad (90)$$
$$\int_{-\infty}^{\infty} p(x)\, dx = 1 \qquad (91)$$

is a valid PDF. I will use the convention of upper-case P for discrete probabilities, and lower-case p for PDFs.

With the PDF we can specify the probability that the random variable x falls within a given range:

$$P(x_0 \le x \le x_1) = \int_{x_0}^{x_1} p(x)\, dx \qquad (92)$$

This can be visualized by plotting the curve p(x). Then, to determine the probability that x falls within a range, we compute the area under the curve for that range.

The PDF can be thought of as the infinite limit of a discrete distribution, i.e., a discrete distribution with an infinite number of possible outcomes. Specifically, suppose we create a discrete distribution with N possible outcomes, each corresponding to a range on the real number line. Then, suppose we increase N towards infinity, so that each outcome shrinks to a single real number; a PDF is defined as the limiting case of this discrete distribution.

There is an important subtlety here: a probability density is not a probability per se. For one thing, there is no requirement that p(x) ≤ 1. Moreover, the probability that x attains any one specific value out of the infinite set of possible values is always zero, e.g., $P(x = 5) = \int_5^5 p(x)\, dx = 0$ for any PDF p(x). People (myself included) are sometimes sloppy in referring to p(x) as a probability, but it is not a probability — rather, it is a function that can be used in computing probabilities.

Joint distributions are defined in a natural way.
For two variables x and y, the joint PDF p(x, y) defines the probability that (x, y) lies in a given domain D:

$$P((x, y) \in D) = \int_{(x,y) \in D} p(x, y)\, dx\, dy \qquad (93)$$

For example, the probability that a 2D coordinate (x, y) lies in the domain (0 ≤ x ≤ 1, 0 ≤ y ≤ 1) is $\int_{0 \le x \le 1} \int_{0 \le y \le 1} p(x, y)\, dx\, dy$. The PDF over a vector may also be written as a joint PDF of its variables. For example, for a 2D vector $a = [x, y]^T$, the PDF p(a) is equivalent to the PDF p(x, y).

Conditional distributions are defined as well: p(x|A) is the PDF over x, if the statement A is true. This statement may be an expression on a continuous value, e.g., "y = 5". As a short-hand,

we can write p(x|y), which provides a PDF for x for every value of y. (It must be the case that $\int p(x|y)\, dx = 1$, since p(x|y) is a PDF over values of x.)

In general, for all of the rules for manipulating discrete distributions there are analogous rules for continuous distributions. Probability rules for PDFs:
- $p(x) \ge 0$, for all $x$
- $\int_{-\infty}^{\infty} p(x)\, dx = 1$
- $P(x_0 \le x \le x_1) = \int_{x_0}^{x_1} p(x)\, dx$
- Sum Rule: $\int_{-\infty}^{\infty} p(x)\, dx = 1$
- Product Rule: $p(x, y) = p(x|y)\, p(y) = p(y|x)\, p(x)$
- Marginalization: $p(y) = \int_{-\infty}^{\infty} p(x, y)\, dx$
- We can also add conditional information, e.g., $p(y|z) = \int_{-\infty}^{\infty} p(x, y|z)\, dx$
- Independence: Variables x and y are independent if: $p(x, y) = p(x)\, p(y)$

6.1 Mathematical expectation, mean, and variance

Some very brief definitions of ways to describe a PDF:

Given a function f(x) of an unknown variable x, the expected value of the function with respect to a PDF p(x) is defined as:

$$E_{p(x)}[f(x)] \equiv \int f(x)\, p(x)\, dx \qquad (94)$$

Intuitively, this is the value that we roughly "expect" f(x) to have.

The mean µ of a distribution p(x) is the expected value of x:

$$\mu = E_{p(x)}[x] = \int x\, p(x)\, dx \qquad (95)$$

The variance of a scalar variable x is the expected squared deviation from the mean:

$$E_{p(x)}[(x - \mu)^2] = \int (x - \mu)^2\, p(x)\, dx \qquad (96)$$

The variance of a distribution tells us how uncertain, or "spread-out", the distribution is. For a very narrow distribution, $E_{p(x)}[(x - \mu)^2]$ will be small.

The covariance of a vector x is a matrix:

$$\Sigma = \operatorname{cov}(x) = E_{p(x)}[(x - \mu)(x - \mu)^T] = \int (x - \mu)(x - \mu)^T\, p(x)\, dx \qquad (97)$$

By inspection, we can see that the diagonal entries of the covariance matrix are the variances of the individual entries of the vector:

$$\Sigma_{ii} = \operatorname{var}(x_i) = E_{p(x)}[(x_i - \mu_i)^2] \qquad (98)$$

The off-diagonal terms are covariances:

$$\Sigma_{ij} = \operatorname{cov}(x_i, x_j) = E_{p(x)}[(x_i - \mu_i)(x_j - \mu_j)] \qquad (99)$$

between variables $x_i$ and $x_j$. If the covariance is a large positive number, then we expect $x_i$ to be larger than $\mu_i$ when $x_j$ is larger than $\mu_j$. If the covariance is zero and we know no other information, then knowing $x_i > \mu_i$ does not tell us whether or not it is likely that $x_j > \mu_j$.

One goal of statistics is to infer properties of distributions. In the simplest case, the sample mean of a collection of N data points $x_{1:N}$ is just their average: $\bar{x} = \frac{1}{N} \sum_i x_i$. The sample covariance of a set of data points is: $\frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$. The covariance of the data points tells us how "spread-out" the data points are.

6.2 Uniform distributions

The simplest PDF is the uniform distribution. Intuitively, this distribution states that all values within a given range $[x_0, x_1]$ are equally likely. Formally, the uniform distribution on the interval $[x_0, x_1]$ is:

$$p(x) = \begin{cases} \dfrac{1}{x_1 - x_0} & \text{if } x_0 \le x \le x_1 \\ 0 & \text{otherwise} \end{cases} \qquad (100)$$

It is easy to see that this is a valid PDF (because p(x) ≥ 0 and $\int p(x)\, dx = 1$).

We can also write this distribution with this alternative notation:

$$x \mid x_0, x_1 \sim U(x_0, x_1) \qquad (101)$$

Equations 100 and 101 are equivalent. The latter simply says: x is distributed uniformly in the range $x_0$ to $x_1$, and it is impossible that x lies outside of that range.

The mean of a uniform distribution $U(x_0, x_1)$ is $(x_1 + x_0)/2$. The variance is $(x_1 - x_0)^2 / 12$.

6.3 Gaussian distributions

Arguably the single most important PDF is the Normal (a.k.a. Gaussian) probability distribution function (PDF). Among the reasons for its popularity are that it is theoretically elegant, and arises naturally in a number of situations. It is the distribution that maximizes entropy, and it is also tied to the Central Limit Theorem: the distribution of a random variable which is the sum of a number of random variables approaches the Gaussian distribution as that number tends to infinity (Figure 6). Perhaps most importantly, it is the analytical properties of the Gaussian that make it so ubiquitous.
Gaussians are easy to manipulate, and their form is so well understood, that we often assume quantities are Gaussian distributed, even though they are not, in order to turn an intractable model, or problem, into something that is easier to work with.

Figure 6: Histogram plots of the mean of N uniformly distributed numbers for various values of N. The effect of the Central Limit Theorem is seen: as N increases, the distribution becomes more Gaussian. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

The simplest case is a Gaussian PDF over a scalar value x, in which case the PDF is:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \qquad (102)$$

(The notation exp(a) is the same as $e^a$.) The Gaussian has two parameters, the mean µ and the variance σ². The mean specifies the center of the distribution, and the variance tells us how "spread-out" the PDF is.

The PDF for a D-dimensional vector x, the elements of which are jointly distributed with the Gaussian density function, is given by

$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left(-(x - \mu)^T \Sigma^{-1} (x - \mu)/2\right) \qquad (103)$$

where µ is the mean vector, Σ is the D × D covariance matrix, and |A| denotes the determinant of matrix A. An important special case is when the Gaussian is isotropic (rotationally invariant). In this case the covariance matrix can be written as Σ = σ²I, where I is the identity matrix. This is called a spherical or isotropic covariance matrix. In this case, the PDF reduces to:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{(2\pi)^D}\, \sigma^D} \exp\left(-\frac{1}{2\sigma^2} \|x - \mu\|^2\right). \qquad (104)$$

The Gaussian distribution is used frequently enough that it is useful to denote its PDF in a simple way. We will define a function G to be the Gaussian density function, i.e.,

$$G(x;\, \mu, \Sigma) \equiv \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left(-(x - \mu)^T \Sigma^{-1} (x - \mu)/2\right) \qquad (105)$$

When formulating problems and manipulating PDFs this functional notation will be useful. When we want to specify that a random vector has a Gaussian PDF, it is common to use the notation:

$$x \mid \mu, \Sigma \sim \mathcal{N}(\mu, \Sigma) \qquad (106)$$
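A quick numerical check of these formulas (a sketch with arbitrary parameters, not from the notes): the 1D density of Equation 102 should integrate to one, and the isotropic form (104) should agree with the general form (103) when Σ = σ²I.

```python
import numpy as np

# 1D check of Eq. (102): the density integrates to 1
mu, sigma = 1.0, 2.0
xs = np.linspace(mu - 10*sigma, mu + 10*sigma, 100_001)
p = np.exp(-(xs - mu)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)
dx = xs[1] - xs[0]
assert abs(np.sum(p) * dx - 1.0) < 1e-6

# 2D check: general form (103) vs. isotropic form (104) with Sigma = sigma^2 I
D = 2
mu2 = np.array([0.5, -0.5])
Sigma = sigma**2 * np.eye(D)
x = np.array([1.0, 2.0])
general = np.exp(-0.5 * (x - mu2) @ np.linalg.inv(Sigma) @ (x - mu2)) \
          / np.sqrt((2*np.pi)**D * np.linalg.det(Sigma))
iso = np.exp(-np.sum((x - mu2)**2) / (2*sigma**2)) / ((2*np.pi)**(D/2) * sigma**D)
assert abs(general - iso) < 1e-12
print(general)
```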

Equations 103 and 106 essentially say the same thing. Equation 106 says that x is Gaussian, and Equation 103 specifies (evaluates) the density for an input x.

The covariance matrix Σ of a Gaussian must be symmetric and positive definite — this is equivalent to requiring that |Σ| > 0. Otherwise, the formula does not correspond to a valid PDF, since Equation 103 is no longer real-valued if |Σ| < 0.

6.3.1 Diagonalization

A useful way to understand a Gaussian is to diagonalize the exponent. The exponent of the Gaussian is quadratic, and so its shape is essentially elliptical. Through diagonalization we find the major axes of the ellipse, and the variance of the distribution along those axes. Seeing the Gaussian this way often makes it easier to interpret the distribution.

As a reminder, the eigendecomposition of a real-valued symmetric matrix Σ yields a set of orthonormal vectors $u_i$ and scalars $\lambda_i$ such that

$$\Sigma u_i = \lambda_i u_i \qquad (107)$$

Equivalently, if we combine the eigenvalues and eigenvectors into matrices $U = [u_1, \dots, u_N]$ and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N)$, then we have

$$\Sigma U = U \Lambda \qquad (108)$$

Since U is orthonormal:

$$\Sigma = U \Lambda U^T \qquad (109)$$

The inverse of Σ is straightforward, since U is orthonormal, and hence $U^{-1} = U^T$:

$$\Sigma^{-1} = \left(U \Lambda U^T\right)^{-1} = U \Lambda^{-1} U^T \qquad (110)$$

(If any of these steps are not familiar to you, you should refresh your memory of them.)

Now, consider the negative log of the Gaussian (i.e., the exponent); i.e., let

$$f(x) = \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu). \qquad (111)$$

Substituting in the diagonalization gives:

$$f(x) = \frac{1}{2} (x - \mu)^T U \Lambda^{-1} U^T (x - \mu) \qquad (112)$$
$$= \frac{1}{2} z^T z \qquad (113)$$

where

$$z = \operatorname{diag}\left(\lambda_1^{-\frac{1}{2}}, \dots, \lambda_N^{-\frac{1}{2}}\right) U^T (x - \mu) \qquad (114)$$

This new function $f(z) = z^T z / 2 = \sum_i z_i^2 / 2$ is a quadratic, with new variables $z_i$. Given variables x, we can convert them to the z representation by applying Eq. 114, and, if all eigenvalues are

Figure 7: The red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space, on which the density is exp(−1/2) of its value at x = µ. The major axes of the ellipse are defined by the eigenvectors $u_i$ of the covariance matrix, with corresponding eigenvalues $\lambda_i$. (Figure from Pattern Recognition and Machine Learning by Chris Bishop. Note: $y_1$ and $y_2$ in the figure should read $z_1$ and $z_2$.)

nonzero, we can convert back by inverting Eq. 114. Hence, we can write our Gaussian in this new coordinate system as⁵:

$$\frac{1}{\sqrt{(2\pi)^N}} \exp\left(-\frac{1}{2} \|z\|^2\right) = \prod_i \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right) \qquad (115)$$

It is easy to see that for the quadratic form of f(z), its level sets (i.e., the surfaces f(z) = c for constant c) are hyperspheres. Equivalently, it is clear from (115) that z is a Gaussian random vector with an isotropic covariance, so the different elements of z are uncorrelated. In other words, the value of this transformation is that we have decomposed the original N-D quadratic, with its many interactions between the variables, into a much simpler Gaussian, composed of N independent variables. This convenient geometrical form can be seen in Figure 7. For example, if we consider an individual $z_i$ variable in isolation (i.e., consider a slice of the function f(z)), that slice will look like a 1D bowl.

We can also understand the local curvature of f with a slightly different diagonalization. Specifically, let $v = U^T (x - \mu)$. Then,

$$f(v) = \frac{1}{2} v^T \Lambda^{-1} v = \frac{1}{2} \sum_i \frac{v_i^2}{\lambda_i} \qquad (116)$$

If we plot a cross-section of this function, then we have a 1D bowl shape with variance given by $\lambda_i$. In other words, the eigenvalues tell us the variance of the Gaussian in different dimensions.

⁵ The normalizing $\sqrt{|\Sigma|}$ disappears due to the nature of change-of-variables in PDFs, which we won't discuss here.
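The change of variables in Equation 114 can be checked numerically; this sketch uses a made-up covariance matrix and verifies that ½(x−µ)ᵀΣ⁻¹(x−µ) equals ½zᵀz:

```python
import numpy as np

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                  # symmetric positive definite

lam, U = np.linalg.eigh(Sigma)                  # eigendecomposition: Sigma = U diag(lam) U^T
assert np.allclose(U @ np.diag(lam) @ U.T, Sigma)   # Eq. (109)

x = np.array([2.0, 0.5])                        # an arbitrary test point
z = np.diag(lam**-0.5) @ U.T @ (x - mu)         # Eq. (114)

f_x = 0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # Eq. (111)
f_z = 0.5 * z @ z                                        # Eq. (113)
assert abs(f_x - f_z) < 1e-10
print(f_x, f_z)
```

`np.linalg.eigh` is the right routine here because Σ is symmetric: it returns real eigenvalues and an orthonormal U, exactly the decomposition used in Equations 107–110.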

Figure 8: Left: The contours of a Gaussian distribution $p(x_a, x_b)$ over two variables. Right: The marginal distribution $p(x_a)$ (blue curve) and the conditional distribution $p(x_a \mid x_b)$ for $x_b = 0.7$ (red curve). (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

6.3.2 Conditional Gaussian distribution

In the case of the multivariate Gaussian where the random variables have been partitioned into two sets $x_a$ and $x_b$, the conditional distribution of one set conditioned on the other is Gaussian. The marginal distribution of either set is also Gaussian. When manipulating these expressions, it is easier to express the covariance matrix in inverse form, as a precision matrix, $\Lambda \equiv \Sigma^{-1}$. Given that x is a Gaussian random vector, with mean µ and covariance Σ, we can express x, µ, Σ and Λ all in block matrix form:

$$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}, \qquad (117)$$

Then one can show straightforwardly that the marginal PDFs for the components $x_a$ and $x_b$ are also Gaussian, i.e.,

$$x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa}), \qquad x_b \sim \mathcal{N}(\mu_b, \Sigma_{bb}). \qquad (118)$$

With a little more work one can also show that the conditional distributions are Gaussian. For example, the conditional distribution of $x_a$ given $x_b$ satisfies

$$x_a \mid x_b \sim \mathcal{N}(\mu_{a|b}, \Lambda_{aa}^{-1}) \qquad (119)$$

where $\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b)$. Note that $\Lambda_{aa}^{-1}$ is not simply $\Sigma_{aa}$. Figure 8 shows the marginal and conditional distributions applied to a two-dimensional Gaussian.

Finally, another important property of Gaussian functions is that the product of two Gaussian functions is another Gaussian function (although no longer normalized to be a proper density function):

$$G(x;\, \mu_1, \Sigma_1)\, G(x;\, \mu_2, \Sigma_2) \propto G(x;\, \mu, \Sigma), \qquad (120)$$

where

$$\mu = \Sigma \left( \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2 \right), \qquad (121)$$
$$\Sigma = \left( \Sigma_1^{-1} + \Sigma_2^{-1} \right)^{-1}. \qquad (122)$$

Note that the linear transformation of a Gaussian random variable is also Gaussian. For example, if we apply a transformation such that y = Ax where $x \sim \mathcal{N}(x \mid \mu, \Sigma)$, we have $y \sim \mathcal{N}(y \mid A\mu, A \Sigma A^T)$.
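The product formulas (121)–(122) can be verified numerically in 1D (a sketch with arbitrary means and variances; here G is the scalar version of Equation 105): the product of the two densities should be proportional to the Gaussian the formulas predict, i.e., their ratio should be constant in x.

```python
import numpy as np

def G(x, mu, var):
    """Scalar Gaussian density, the 1D case of Eq. (105)."""
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu1, var1 = 0.0, 1.0
mu2, var2 = 2.0, 0.5                     # arbitrary example parameters

# Eqs. (121)-(122) in 1D
var = 1.0 / (1.0 / var1 + 1.0 / var2)
mu = var * (mu1 / var1 + mu2 / var2)

# G(x;mu1,var1) G(x;mu2,var2) should be proportional to G(x;mu,var):
xs = np.linspace(-3, 5, 9)
ratio = G(xs, mu1, var1) * G(xs, mu2, var2) / G(xs, mu, var)
assert np.allclose(ratio, ratio[0])      # constant ratio => proportionality holds
print(mu, var)
```

Note the constant of proportionality is not 1, which is exactly the point made in the text: the product of two Gaussian densities is Gaussian-shaped but not normalized.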

7 Estimation

We now consider the problem of determining unknown parameters of the world based on measurements. The general problem is one of inference, which describes the probabilities of these unknown parameters. Given a model, these probabilities can be derived using Bayes' Rule. The simplest use of these probabilities is to perform estimation, in which we attempt to come up with single "best" estimates of the unknown parameters.

7.1 Learning a binomial distribution

For a simple example, we return to coin-flipping. We flip a coin N times, with the result of the i-th flip denoted by a variable $c_i$: $c_i = \text{heads}$ means that the i-th flip came up heads. The probability that the coin lands heads on any given trial is given by a parameter θ. We have no prior knowledge as to the value of θ, and so our prior distribution on θ is uniform.⁶ In other words, we describe θ as coming from a uniform distribution from 0 to 1, so p(θ) = 1; we believe that all values of θ are equally likely if we have not seen any data. We further assume that the individual coin flips are independent, i.e., $P(c_{1:N} \mid \theta) = \prod_i p(c_i \mid \theta)$. (The notation $c_{1:N}$ indicates the set of observations $\{c_1, \dots, c_N\}$.) We can summarize this model as follows:

Model: Coin-Flipping

$$\theta \sim U(0, 1)$$
$$P(c = \text{heads}) = \theta$$
$$P(c_{1:N} \mid \theta) = \prod_i p(c_i \mid \theta) \qquad (123)$$

Suppose we wish to learn about a coin by flipping it 1000 times and observing the results $c_{1:1000}$, in which the coin landed heads 750 times. What is our belief about θ, given this data? We now need to solve for $p(\theta \mid c_{1:1000})$, i.e., our belief about θ after seeing the 1000 coin flips. To do this, we apply the basic rules of probability theory, beginning with the Product Rule:

$$P(c_{1:1000}, \theta) = P(c_{1:1000} \mid \theta)\, p(\theta) = p(\theta \mid c_{1:1000})\, P(c_{1:1000}) \qquad (124)$$

Solving for the desired quantity gives:

$$p(\theta \mid c_{1:1000}) = \frac{P(c_{1:1000} \mid \theta)\, p(\theta)}{P(c_{1:1000})} \qquad (125)$$

The numerator may be written using

$$P(c_{1:1000} \mid \theta)\, p(\theta) = \prod_i P(c_i \mid \theta) = \theta^{750} (1 - \theta)^{250} \qquad (126)$$

⁶ We would usually expect a coin to be fair, i.e., the prior distribution for θ is peaked near 0.5.

Figure 9: Posterior probability of θ from two different experiments: one with a single coin flip (landing heads), and one with 1000 coin flips (750 of which land heads). Note that the latter distribution is much more peaked.

The denominator may be solved for by the marginalization rule:

    P(c_1:1000) = ∫₀¹ P(c_1:1000, θ) dθ = ∫₀¹ θ⁷⁵⁰ (1 − θ)²⁵⁰ dθ ≡ Z   (127)

where Z is a constant (evaluating it requires more advanced math, but this is not necessary for our purposes). Hence, the final probability distribution is:

    p(θ | c_1:1000) = θ⁷⁵⁰ (1 − θ)²⁵⁰ / Z   (128)

which is plotted in Figure 9. This form gives a probability distribution over θ that expresses our belief about θ after we've flipped the coin 1000 times. Suppose we just take the peak of this distribution; from the graph, it can be seen that the peak is at θ = 0.75. This makes sense: if a coin lands heads 75% of the time, then we would probably estimate that it will land heads 75% of the time in the future. More generally, suppose the coin lands heads H times out of N flips; we can compute the peak of the distribution as follows:

    argmax_θ p(θ | c_1:N) = H/N   (129)

(Deriving this is a good exercise to do on your own; hint: minimize the negative log of p(θ | c_1:N).)
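The claim that the posterior peaks at H/N is easy to check numerically. The following sketch (an illustration, not part of the notes; the grid resolution is an arbitrary choice) evaluates the unnormalized log-posterior of Eqn. (128) on a grid:

```python
import math

def log_posterior(theta, heads, flips):
    # Unnormalized log-posterior under the uniform prior (Eqn. 128):
    # ln p(theta | c_1:N) = H ln(theta) + (N - H) ln(1 - theta) + const.
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

# Grid search for the peak with H = 750 heads in N = 1000 flips.
grid = [i / 10000.0 for i in range(1, 10000)]
peak = max(grid, key=lambda t: log_posterior(t, 750, 1000))
print(peak)  # 0.75, i.e. H/N
```

Working in the log domain avoids underflow from θ⁷⁵⁰, a point the notes return to in the Naïve Bayes section.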

7.2 Bayes' Rule

In general, given that we have a model of the world described by some unknown variables, and we observe some data, our goal is to determine the model from the data. (In the coin-flip example, the model consisted of the likelihood of the coin landing heads and the prior over θ, while the data consisted of the results of N coin flips.) We describe the probability model as p(data | model); if we knew model, then this distribution would tell us what data to expect. Furthermore, we must have some prior beliefs as to what model is, p(model), even if these beliefs are completely non-committal (e.g., a uniform distribution). Given the data, what do we know about model? Applying the product rule as before gives:

    p(data, model) = p(data | model) p(model) = p(model | data) p(data)   (130)

Solving for the desired distribution gives a seemingly simple but powerful result, known widely as Bayes' Rule:

    Bayes' Rule:  p(model | data) = p(data | model) p(model) / p(data)

The different terms in Bayes' Rule are used so often that they all have names:

    posterior = likelihood × prior / evidence
    p(model | data) = P(data | model) p(model) / p(data)   (131)

The likelihood distribution describes the likelihood of data given model; it reflects our assumptions about how the data were generated. The prior distribution describes our assumptions about model before observing the data. The posterior distribution describes our knowledge of model, incorporating both the data and the prior. The evidence is useful in model selection, and will be discussed later. Here, its only role is to normalize the posterior PDF.

7.3 Parameter estimation

Quite often, we are interested in finding a single estimate of the value of an unknown parameter, even if this means discarding all uncertainty. This is called estimation: determining the values

of some unknown variables from observed data. In this chapter, we outline the problem, and describe some of the main ways to do this, including Maximum A Posteriori (MAP) and Maximum Likelihood (ML). Estimation is the most common form of learning: given some data from the world, we wish to learn how the world behaves, which we will describe in terms of a set of unknown variables. Strictly speaking, parameter estimation is not justified by Bayesian probability theory, and can lead to a number of problems, such as overfitting and nonsensical results in extreme cases. Nonetheless, it is widely used in many problems.

7.3.1 MAP, ML, and Bayes' Estimates

We can now define the MAP learning rule: choose the parameter value θ that maximizes the posterior, i.e.,

    θ̂ = argmax_θ p(θ | D)   (132)
      = argmax_θ P(D | θ) p(θ)   (133)

Note that we don't need to be able to evaluate the evidence term p(D) for MAP learning, since it does not depend on θ. Very often, we will assume that we have no prior assumptions about the value of θ, which we express as a uniform prior: p(θ) is a uniform distribution over some suitably large range. In this case, the p(θ) term can also be ignored in MAP learning, and we are left with only maximizing the likelihood. Hence, the Maximum Likelihood (ML) learning principle (i.e., estimator) is

    θ̂_ML = argmax_θ p(D | θ)   (134)

It often turns out to be more convenient to minimize the negative log of the objective function. Because −ln is a monotonically decreasing function, we can pose MAP estimation as:

    θ̂_MAP = argmax_θ p(D | θ) p(θ)   (135)
          = argmin_θ −ln ( p(D | θ) p(θ) )   (136)
          = argmin_θ −ln p(D | θ) − ln p(θ)   (137)

We can see that the objective conveniently breaks into a part corresponding to the likelihood and a part corresponding to the prior.

One problem with this approach is that all model uncertainty is ignored. We are choosing to put all our faith in the most probable model. This sometimes has surprising and undesirable consequences. For example, in the coin-tossing example above, if one were to flip a coin just once and see a head, then the estimator in Eqn.
(129) would tell us that the probability of the outcome being heads is 1. Sometimes a more suitable estimator is the expected value of the posterior distribution, rather than its maximum. This is called the Bayes' estimate.

In the coin-tossing case above, you can show that the expected value of θ under the posterior provides an estimate of the probability that is biased toward 1/2. That is:

    ∫₀¹ p(θ | c_1:N) θ dθ = (H + 1)/(N + 2)   (138)

You can see that this value is always somewhat biased toward 1/2, but converges to the MAP estimate as N increases. Interestingly, even when there is no data whatsoever, in which case the MAP estimate is undefined, the Bayes' estimate is simply 1/2.

7.4 Learning Gaussians

We now consider the problem of learning a Gaussian distribution from N training samples x_1:N. Maximum likelihood learning of the parameters µ and Σ entails maximizing the likelihood:

    p(x_1:N | µ, Σ)   (139)

We assume here that the data points come from a Gaussian. We further assume that they are drawn independently. We can therefore write the joint likelihood over the entire set of data as the product of the likelihoods for each individual datum, i.e.,

    p(x_1:N | µ, Σ) = ∏_{i=1}^N p(x_i | µ, Σ)   (140)
                    = ∏_{i=1}^N (1/√((2π)^M |Σ|)) exp( −(1/2)(x_i − µ)ᵀ Σ⁻¹ (x_i − µ) ),   (141)

where M is the dimensionality of the data x_i. It is somewhat more convenient to minimize the negative log-likelihood:

    L(µ, Σ) ≡ −ln p(x_1:N | µ, Σ)   (142)
            = −∑_i ln p(x_i | µ, Σ)   (143)
            = ∑_i (x_i − µ)ᵀ Σ⁻¹ (x_i − µ)/2 + (N/2) ln |Σ| + (NM/2) ln(2π)   (144)

Solving for µ and Σ by setting ∂L/∂µ = 0 and ∂L/∂Σ = 0 (subject to the constraint that Σ is symmetric) gives the maximum likelihood estimates⁷:

    µ* = (1/N) ∑_i x_i   (145)
    Σ* = (1/N) ∑_i (x_i − µ*)(x_i − µ*)ᵀ   (146)

⁷ Warning: the calculation for the optimal covariance matrix involves Lagrange multipliers and is not easy.
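The Bayes estimate (138) can likewise be checked numerically. This sketch (illustrative only; the step count is an arbitrary choice) integrates the unnormalized posterior on a grid:

```python
def posterior_mean(heads, flips, steps=100000):
    # Bayes estimate: numerically integrate E[theta] under the posterior,
    # using the unnormalized density theta^H (1 - theta)^(N - H).
    num = den = 0.0
    for i in range(1, steps):
        t = i / steps
        w = t ** heads * (1.0 - t) ** (flips - heads)
        num += t * w
        den += w
    return num / den

# One flip, one head: the MAP estimate is 1, but the Bayes estimate
# is (H + 1)/(N + 2) = 2/3, biased toward 1/2.
print(posterior_mean(1, 1))  # about 0.6667
```

Keep H and N small here so θ^H does not underflow; for large counts one would work in the log domain as in the previous sketch.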

The ML estimates make intuitive sense; we estimate the Gaussian's mean to be the sample mean of the data, and the Gaussian's covariance to be the sample covariance of the data. Maximum likelihood estimates usually make sense intuitively. This is very helpful when debugging your math: you can sometimes find bugs in derivations simply because the ML estimates do not look right.

7.5 MAP nonlinear regression

Let us revisit the nonlinear regression model from Section 3.1, but now admitting that there exists noise in measurements and modelling errors. We'll now write the model as

    y = wᵀ b(x) + n   (147)

where n is a Gaussian random variable, i.e.,

    n ~ N(0, σ²).   (148)

We add this random variable to the regression equation in (147) to represent the fact that most models and most measurements involve some degree of error. We'll refer to this error as noise. It is straightforward to show from basic probability theory that Equation (147) implies that, given x and w, y is also Gaussian (i.e., has a Gaussian density), i.e.,

    p(y | x, w) = G(y; wᵀ b(x), σ²) ≡ (1/√(2πσ²)) e^{−(y − wᵀ b(x))² / 2σ²}   (149)

(G is defined in the previous chapter.) It follows that, for a collection of N independent training points (y_1:N, x_1:N), the likelihood is given by

    p(y_1:N | w, x_1:N) = ∏_{i=1}^N G(y_i; wᵀ b(x_i), σ²)
                        = (1/(2πσ²)^{N/2}) exp( −∑_{i=1}^N (y_i − wᵀ b(x_i))² / 2σ² )   (150)

Furthermore, let us assume the following (weight decay) prior distribution over the unknown weights w:

    w ~ N(0, αI).   (151)

That is, for w ∈ R^M,

    p(w) = ∏_{k=1}^M (1/√(2πα)) e^{−w_k² / 2α} = (1/(2πα)^{M/2}) e^{−wᵀw / 2α}.   (152)
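In one dimension, the ML estimates (145)-(146) reduce to the familiar sample mean and sample variance. A minimal sketch (the data values are made up for illustration):

```python
def fit_gaussian_1d(data):
    # ML estimates (145)-(146) in one dimension: the sample mean and the
    # sample variance (dividing by N, not N - 1).
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var

mu, var = fit_gaussian_1d([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mu, var)  # 5.0 4.0
```

Note the division by N rather than N − 1: the ML variance estimate is biased, which is exactly the kind of detail the intuition check above helps you notice.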

Now, to estimate the model parameters (i.e., w), let's consider the posterior distribution over w conditioned on our N training pairs, (x_i, y_i). Based on the formulation above, assuming independent training samples, it follows that

    p(w | y_1:N, x_1:N) = p(y_1:N | w, x_1:N) p(w | x_1:N) / p(y_1:N | x_1:N)   (153)
                        = ( ∏_i p(y_i | w, x_i) ) p(w) / p(y_1:N | x_1:N).   (154)

Note that p(w | x_1:N) = p(w), since we can assume that x alone provides no information about w. In MAP estimation, we want to find the parameters w that maximize their posterior probability:

    w* = argmax_w p(w | y_1:N, x_1:N)   (155)
       = argmin_w −ln p(w | y_1:N, x_1:N)   (156)

The negative log-posterior is:

    L(w) = −ln p(w | y_1:N, x_1:N)   (157)
         = ∑_i (1/2σ²)(y_i − wᵀ b(x_i))² + (N/2) ln(2πσ²)   (158)
           + (1/2α) ‖w‖² + (M/2) ln(2πα) + ln p(y_1:N | x_1:N)   (159)

Now, we can discard terms that do not depend on w, since they are irrelevant for optimization:

    L(w) = ∑_i (1/2σ²)(y_i − wᵀ b(x_i))² + (1/2α) ‖w‖² + constants   (160)

Furthermore, we can multiply by a constant without changing where the optima are, so let us multiply the whole expression by 2σ². Then, if we define λ = σ²/α, we have exactly the same objective function as used in nonlinear regression with regularization. Hence, nonlinear least-squares with regularization is a form of MAP estimation, and can be optimized the same way. When the measurements are very reliable, then σ is small and we give the regularizer less influence on the estimate. But when the data are relatively noisy, so σ is larger, the regularizer has more influence.
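As a concrete illustration of the λ = σ²/α correspondence, consider a single basis function b(x) = x, for which the MAP estimate has a closed form. This sketch (the function name and data are ours, chosen for illustration) is just ridge regression in one dimension:

```python
def map_weight_1d(xs, ys, sigma2, alpha):
    # MAP estimate for y = w*x + noise, noise ~ N(0, sigma2), prior w ~ N(0, alpha).
    # Setting dL/dw = 0 in Eqn. (160) with b(x) = x gives
    #   w = sum(x*y) / (sum(x^2) + lambda),  lambda = sigma2 / alpha,
    # i.e. regularized least squares in 1D.
    lam = sigma2 / alpha
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(map_weight_1d(xs, ys, 1.0, 1e6))  # weak prior: close to the LS fit w = 2
print(map_weight_1d(xs, ys, 1.0, 0.1))  # strong prior: shrunk toward 0
```

A broad prior (large α) recovers ordinary least squares; a tight prior (small α) pulls the weight toward zero, exactly the σ-versus-regularizer tradeoff described above.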

8 Classification

In classification, we are trying to learn a map from an input space to some finite output space. In the simplest case we simply detect whether or not the input has some property. For example, we might want to determine whether or not an email is spam, or whether an image contains a face. A task in the health care field is to determine, given a set of observed symptoms, whether or not a person has a disease. These detection tasks are binary classification problems.

In multi-class classification problems we are interested in determining to which of multiple categories the input belongs. For example, given a recorded voice signal we might wish to recognize the identity of a speaker (perhaps from a set of people whose voice properties are given in advance). Another well-studied example is optical character recognition, the recognition of letters or numbers from images of handwritten or printed characters.

The input x might be a vector of real numbers, or a discrete feature vector. In the case of binary classification problems the output y might be an element of the set {−1, 1}, while for a multi-class classification problem with N categories the output might be an integer in {1, ..., N}.

The general goal of classification is to learn a decision boundary, often specified as the level set of a function, e.g., a(x) = 0. The purpose of the decision boundary is to identify the regions of the input space that correspond to each class. For binary classification the decision boundary is the surface in the feature space that separates the test inputs into two classes; points x for which a(x) < 0 are deemed to be in one class, while points for which a(x) > 0 are in the other. The points on the decision boundary, a(x) = 0, are those inputs for which the two classes are equally probable.

In this chapter we introduce several basic methods for classification. We focus mainly on binary classification problems for which the methods are conceptually straightforward, easy to implement, and often quite effective.
In subsequent chapters we discuss some of the more sophisticated methods that might be needed for more challenging problems.

8.1 Class Conditionals

One approach is to describe a generative model for each class. Suppose we have two mutually exclusive classes C₁ and C₂. The prior probability of a data vector coming from class C₁ is P(C₁), and P(C₂) = 1 − P(C₁). Each class has a distribution for its data: p(x | C₁) and p(x | C₂). In other words, to sample from this model, we would first randomly choose a class according to P(C₁), and then sample a data vector x from that class.

Given labeled training data {(x_i, y_i)}, we can estimate the distribution for each class by maximum likelihood, and estimate P(C₁) by computing the ratio of the number of elements of class 1 to the total number of elements.

Once we have trained the parameters of our generative model, we perform classification by comparing the posterior class probabilities:

    P(C₁ | x) > P(C₂ | x)?   (161)

That is, if the posterior probability of C₁ is larger than the probability of C₂, then we might classify the input as belonging to class 1. Equivalently, we can compare their ratio to 1:

    P(C₁ | x) / P(C₂ | x) > 1?   (162)

If this ratio is greater than 1 (i.e., P(C₁ | x) > P(C₂ | x)) then we classify x as belonging to class 1, and class 2 otherwise. The quantities P(C_i | x) can be computed using Bayes' Rule as:

    P(C_i | x) = p(x | C_i) P(C_i) / p(x)   (163)

so that the ratio is:

    p(x | C₁) P(C₁) / ( p(x | C₂) P(C₂) )   (164)

Note that the p(x) terms cancel and so do not need to be computed. Also, note that these computations are typically done in the logarithmic domain as this is often faster and more numerically stable.

Gaussian Class Conditionals. As a concrete example, consider a generative model in which the inputs associated with the i-th class (for i = 1, 2) are modeled with a Gaussian distribution, i.e.,

    p(x | C_i) = G(x; µ_i, Σ_i).   (165)

Also, let's assume that the prior class probabilities are equal:

    P(C_i) = 1/2.   (166)

The values of µ_i and Σ_i can be estimated by maximum likelihood on the individual classes in the training data. Given this model, you can show that the log of the posterior ratio (164) is given by

    a(x) = −(1/2)(x − µ₁)ᵀ Σ₁⁻¹ (x − µ₁) − (1/2) ln |Σ₁| + (1/2)(x − µ₂)ᵀ Σ₂⁻¹ (x − µ₂) + (1/2) ln |Σ₂|   (167)

The sign of this function determines the class of x, since the ratio of posterior class probabilities is greater than 1 when this log is greater than zero. Since a(x) is quadratic in x, the decision boundary (i.e., the set of points satisfying a(x) = 0) is a conic section (e.g., a parabola, an ellipse, a line, etc.). Furthermore, in the special case where Σ₁ = Σ₂, the decision boundary is linear (why?).
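In one dimension, the decision function (167) and its behaviour at the boundary are easy to verify directly. A sketch with equal priors (the function name and parameter values are ours, for illustration):

```python
import math

def gcc_score(x, mu1, var1, mu2, var2):
    # 1D version of a(x) in Eqn. (167) with equal class priors:
    # positive means class 1 is more probable, negative means class 2.
    return (-0.5 * (x - mu1) ** 2 / var1 - 0.5 * math.log(var1)
            + 0.5 * (x - mu2) ** 2 / var2 + 0.5 * math.log(var2))

# Equal variances: the decision boundary is the midpoint between the means.
print(gcc_score(0.0, 0.0, 1.0, 2.0, 1.0) > 0)   # True: x = 0 is nearer class 1
print(gcc_score(1.0, 0.0, 1.0, 2.0, 1.0))       # 0.0 at the boundary
```

With unequal variances the two quadratic terms no longer cancel, and the boundary is generally a pair of points in 1D (a conic section in higher dimensions, as stated above).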

Figure 10: GCC classification boundaries for two cases. Note that the decision boundary is linear when both classes have the same covariance.

8.2 Logistic Regression

Noting that p(x) can be written as (why?)

    p(x) = p(x, C₁) + p(x, C₂) = p(x | C₁) P(C₁) + p(x | C₂) P(C₂),   (168)

we can express the posterior class probability as

    P(C₁ | x) = p(x | C₁) P(C₁) / ( p(x | C₁) P(C₁) + p(x | C₂) P(C₂) ).   (169)

Dividing both the numerator and denominator by p(x | C₁) P(C₁) we obtain:

    P(C₁ | x) = 1 / (1 + e^{−a(x)})   (170)
              = g(a(x))   (171)

where a(x) = ln [ p(x | C₁) P(C₁) / ( p(x | C₂) P(C₂) ) ] and g(a) is the sigmoid function. Note that g(a) is monotonic, so that the probability of class C₁ grows as a grows, and is precisely 1/2 when a = 0. Since P(C₁ | x) = 1/2 represents equal probability for both classes, this is the boundary along which we wish to make decisions about class membership.

For the case of Gaussian class conditionals where both Gaussians have the same covariance, a is a linear function of x. In this case the classification probability can be written as

    P(C₁ | x) = 1 / (1 + e^{−wᵀx − b}) = g(wᵀx + b),   (172)

or, if we augment the data vector with a 1 and the weight vector with b,

    P(C₁ | x) = 1 / (1 + e^{−wᵀx}).   (173)

At this point, we can forget about the generative model (e.g., the Gaussian distributions) that we started with, and use this as our entire model. In other words, rather than learning a distribution over each class, we learn only the conditional probability of y given x. As a result, we have fewer parameters to learn, since the number of parameters in logistic regression is linear in the dimension of the input vector, while learning a Gaussian covariance requires a quadratic number of parameters. With fewer parameters we can learn models more effectively with less data. On the other hand, we cannot perform other tasks that we could with the generative model (e.g., sampling from the model; classifying data with noisy or missing measurements).

We can learn logistic regression with maximum likelihood. In particular, given data {x_i, y_i}, we minimize the negative log of:

    p({x_i, y_i} | w, b) ∝ p({y_i} | {x_i}, w, b)
                         = ∏_i p(y_i | x_i, w, b)
                         = ∏_{i: y_i = C₁} P(C₁ | x_i) ∏_{i: y_i = C₂} (1 − P(C₁ | x_i))   (174)

In the first step above we have assumed that the input features are independent of the weights in the logistic regressor, i.e., p({x_i}) = p({x_i} | w, b), so this term can be ignored in the likelihood, since it is constant with respect to the unknowns. In the second step we have assumed that the input-output pairs are independent, so the joint likelihood is the product of the likelihoods for each input-output pair.

The decision boundary for logistic regression is linear; in 2D, it is a line. To see this, recall that the decision boundary is the set of points for which P(C₁ | x) = 1/2. Solving for x gives the points wᵀx + b = 0, which is a line in 2D, or a hyperplane in higher dimensions.

Although this objective function cannot be optimized in closed form, it is convex, which means that it has a single minimum. Therefore, we can optimize it with gradient descent (or any other gradient-based search technique), which will be guaranteed to find the global minimum.
If the classes are linearly separable, this approach will lead to very large values of the weights w, since as the magnitude of w tends to infinity, the function g(a(x)) behaves more and more like a step function and thus assigns higher likelihood to the data. This can be prevented by placing a weight-decay prior on w: p(w) = G(w; 0, σ²).

Multiclass classification. Logistic regression can also be applied to multiclass classification, i.e., where we wish to classify a data point as belonging to one of K classes. In this case, the probability of data vector x being in class i is:

    P(C_i | x) = e^{w_iᵀ x} / ∑_{k=1}^K e^{w_kᵀ x}   (175)

You should be able to see that this is equivalent to the method described above in the two-class case. Furthermore, it is straightforward to show that this is a sensible choice of probability: 0 ≤ P(C_i | x) ≤ 1, and ∑_k P(C_k | x) = 1 (verify these for yourself).
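The maximum likelihood training just described can be sketched in a few lines with batch gradient descent. This is an illustration, not code from the notes: the data, learning rate, and iteration count are arbitrary choices, and labels are coded as 0/1 so that the gradient of the negative log-likelihood takes the form ∑(g(wᵀx) − y)x, plus the weight-decay term from the prior:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_logistic(xs, ys, lam=0.1, rate=0.1, steps=2000):
    # Batch gradient descent on the negative log-likelihood with weight decay.
    # xs: 1D inputs augmented with a constant 1 for the bias; ys in {0, 1}.
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [lam * wi for wi in w]  # weight-decay (prior) term
        for x, y in zip(xs, ys):
            err = sigmoid(w[0] * x[0] + w[1] * x[1]) - y
            grad[0] += err * x[0]
            grad[1] += err * x[1]
        w = [wi - rate * gi for wi, gi in zip(w, grad)]
    return w

xs = [(-2.0, 1.0), (-1.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [0, 0, 1, 1]
w = train_logistic(xs, ys)
print(sigmoid(w[0] * 1.5 + w[1]) > 0.5)  # True: x = 1.5 assigned to class 1
```

Even though this toy dataset is linearly separable, the weight-decay term (lam > 0) keeps the weights bounded, as discussed above.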

Figure 11: Classification boundary for logistic regression.

8.3 Artificial Neural Networks

Logistic regression works for linearly separable datasets, but may not be sufficient for more complex cases. We can generalize logistic regression by replacing the linear function wᵀx + b with any other function. If we replace it with a neural network, we get:

    P(C₁ | x) = g( ∑_j w_j⁽¹⁾ g( ∑_k w_{k,j}⁽²⁾ x_k + b_j⁽²⁾ ) + b⁽¹⁾ )   (176)

This representation is no longer connected to any particular choice of class-conditional model; it is purely a model of the class probability given the measurement.

8.4 K-Nearest Neighbors Classification

We can apply the KNN idea to classification as well. For class labels {−1, 1}, the classifier is:

    y_new = sign( ∑_{i ∈ N_K(x)} y_i )   (177)

where

    sign(z) = −1 if z ≤ 0,  1 if z > 0   (178)

Alternatively, we might take a weighted average of the K nearest neighbors:

    y = sign( ∑_{i ∈ N_K(x)} w(x_i) y_i ),   w(x_i) = e^{−‖x − x_i‖² / 2σ²}   (179)
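A minimal sketch of the unweighted KNN classifier (177)-(178) for 1D inputs (the dataset is made up; with k odd and labels in {−1, +1} the vote can never be zero):

```python
def knn_classify(train, x, k):
    # train: list of (input, label) pairs with labels in {-1, +1}.
    # Vote among the k nearest neighbours, Eqns. (177)-(178).
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    total = sum(label for _, label in neighbours)
    return 1 if total > 0 else -1

train = [(0.0, -1), (1.0, -1), (2.0, -1), (5.0, 1), (6.0, 1), (7.0, 1)]
print(knn_classify(train, 1.2, 3))   # -1: neighbours 1.0, 2.0, 0.0
print(knn_classify(train, 5.8, 3))   # 1:  neighbours 6.0, 5.0, 7.0
```

The weighted variant (179) would replace the unit votes with the Gaussian weights w(x_i), shrinking the influence of distant neighbours.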

Figure 12: For two classes and planar inputs, the decision boundary for a 1NN classifier (the bold black curve) is a subset of the perpendicular bisecting line segments (green) between pairs of neighbouring points (obtained with a Voronoi tessellation).

where σ² is an additional parameter of the algorithm.

For KNN the decision boundary will be a collection of hyperplane patches that are perpendicular bisectors of pairs of points drawn from the two classes. As illustrated in Figure 12, this is a set of bisecting line segments for 2D inputs. Figure 12 shows a simple case, but it is not hard to imagine that the decision surfaces can get very complex, e.g., if a point from class 1 lies somewhere in the middle of the points from class 2. By increasing the number of nearest neighbours (i.e., K) we are effectively smoothing the decision boundary, hopefully thereby improving generalization.

8.5 Generative vs. Discriminative models

The classifiers described here illustrate a distinction between two general types of models in machine learning:

1. Generative models, such as the GCC, describe the complete probability of the data p(x, y).

2. Discriminative models, such as LR, ANNs, and KNN, describe the conditional probability of the output given the input: p(y | x).

The same distinction occurs in regression and classification; e.g., KNN is a discriminative method that can be used for either classification or regression.

The distinction is clearest when comparing LR with GCC with equal covariances, since they are both linear classifiers, but the training algorithms are different. This is because they have different goals; LR is optimized for classification performance, whereas the GCC is a complete model of the probability of the data that is then pressed into service for classification. As a consequence, GCC may perform poorly with non-Gaussian data. Conversely, LR is not premised on any particular form of distribution for the two class distributions. On the other hand, LR can only be

Figure 13: In this example there are two classes, one with a small isotropic covariance, and one with an anisotropic covariance. One can clearly see that the data are linearly separable (i.e., a line exists that correctly separates the input training samples). Despite this, LS regression does not separate the training data well. Rather, the LS regression decision boundary produces 5 incorrectly classified training points.

used for classification, whereas the GCC can be used for other tasks, e.g., to sample new x data, to classify noisy inputs or inputs with outliers, and so on.

The distinctions between generative and discriminative models become more significant in more complex problems. Generative models allow us to put more prior knowledge into how we build the model, but classification may often involve difficult optimization of p(y | x); discriminative methods are typically more efficient and generic, but are harder to specialize to particular problems.

8.6 Classification by LS Regression

One tempting way to perform classification is with least-squares regression. That is, we could treat the class labels y ∈ {−1, 1} as real numbers, and estimate the weights by minimizing

    E(w) = ∑_i (y_i − x_iᵀ w)²,   (180)

for labeled training data {x_i, y_i}. Given the optimal regression weights, one could then perform regression on subsequent test inputs and use the sign of the output to determine the output class.

In simple cases this can perform well, but in general it will perform poorly. This is because the objective function in linear regression measures the distance from the modeled class labels (which can be any real number) to the true class labels, which may not provide an accurate measure of how well the model has classified the data. For example, a linear regression model will tend to produce predicted labels that lie outside the range of the class labels for extreme members of a given class (e.g., 5 when the class label is 1), causing the error to be measured as high even when the classification (given, say, by the sign of the predicted label) is correct.
In such a case the decision

boundary may be shifted towards such an extreme case, potentially reducing the number of correct classifications made by the model. Figure 13 demonstrates this with a simple example.

The problem arises from the fact that the constraint that y ∈ {−1, 1} is not built into the model (the regression algorithm knows nothing about it), and so the model wastes considerable representational power trying to reproduce this effect. It is much better to build this constraint into the model.

8.7 Naïve Bayes

One problem with class conditional models, as described above, concerns the large number of parameters required to learn the likelihood model, i.e., the distribution over the inputs conditioned on the class. In Gaussian Class Conditional models, with d-dimensional input vectors, we need to estimate the class mean and class covariance matrix for each class. The mean will be a d-dimensional vector, but the number of unknowns in the covariance matrix grows quadratically with d. That is, the covariance is a d × d matrix (although because it is symmetric we do not need to estimate all d² elements).

Naïve Bayes aims to simplify the estimation problem by assuming that the different input features (e.g., the different elements of the input vector) are conditionally independent. That is, they are assumed to be independent when conditioned on the class. Mathematically, for inputs x ∈ R^d, we express this as

    p(x | C) = ∏_{i=1}^d p(x_i | C).   (181)

With this assumption, rather than estimating one d-dimensional density, we instead estimate d 1-dimensional densities. This is important because each 1D Gaussian only has two parameters, its mean and variance, both of which are scalars. So the model has 2d unknowns. In the Gaussian case, the Naïve Bayes model effectively replaces the general d × d covariance matrix by a diagonal matrix. There are d entries along the diagonal of the covariance matrix; the i-th entry is the variance of x_i | C. This model is not as expressive, but it is much easier to estimate.

8.7.1 Discrete Input Features

Up to now, we have looked at algorithms for real-valued inputs.
We now consider the Naïve Bayes classification algorithm for discrete inputs. In discrete Naïve Bayes, the inputs are a discrete set of features, and each input either has or doesn't have each feature. For example, in document classification (including spam filtering), a feature might be the presence or absence of a particular word, and the feature vector for a document would be a list of which words the document does or doesn't have.

Each data vector is described by a list of discrete features F_1:D = [F₁, ..., F_D]. Each feature F_i has a set of possible values that it can take; to keep things simple, we'll assume that each feature is binary: F_i ∈ {0, 1}. In the case of document classification, each feature might correspond to the presence of a particular word in the email (e.g., if F₃ = 1, then the email contains the word

"business"), or another attribute (e.g., F₄ = 1 might mean that the mail headers appear forged). Similarly, a classifier to distinguish news stories between sports and financial news might be based on particular words and phrases such as "team", "baseball", and "mutual funds".

To understand the complexity of discrete class conditional models in general (i.e., without using the Naïve Bayes assumption), consider the distribution over 3 inputs, for class C = 1, i.e., P(F_1:3 | C = 1). (There will be another model for C = 0, but for our little thought experiment here we'll just consider the model for C = 1.) Using basic rules of probability, we find that

    P(F_1:3 | C = 1) = P(F₁ | C = 1, F₂, F₃) P(F₂, F₃ | C = 1)
                     = P(F₁ | C = 1, F₂, F₃) P(F₂ | C = 1, F₃) P(F₃ | C = 1)   (182)

Now, given C = 1 we know that F₃ is either 0 or 1 (i.e., it is like a coin toss), and to model it we simply want to know the probability P(F₃ = 1 | C = 1). Of course the probability that F₃ = 0 is simply 1 − P(F₃ = 1 | C = 1). In other words, with one parameter we can model the third factor above, P(F₃ | C = 1).

Now consider the second factor, P(F₂ | C = 1, F₃). In this case, because F₂ depends on F₃, and there are two possible states of F₃, there are two distributions we need to model, namely P(F₂ | C = 1, F₃ = 0) and P(F₂ | C = 1, F₃ = 1). Accordingly, we will need two parameters, one for P(F₂ = 1 | C = 1, F₃ = 0) and one for P(F₂ = 1 | C = 1, F₃ = 1). Using the same logic, to model P(F₁ | C = 1, F₂, F₃) will require one model parameter for each possible setting of (F₂, F₃), and of course there are 2² such settings. For D-dimensional binary inputs, there are O(2^{D−1}) parameters that one needs to learn. The number of parameters required grows prohibitively large as D increases.

The Naïve Bayes model, by comparison, only has D parameters to be learned. The assumption of Naïve Bayes is that the feature vectors are all conditionally independent given the class. The independence assumption is often very naïve, and yet the algorithm often works well nonetheless.
This means that the likelihood of a feature vector for a particular class j is given by

    P(F_1:D | C = j) = ∏_i P(F_i | C = j)   (183)

where C denotes a class, C ∈ {1, 2, ..., K}. The probabilities P(F_i | C) are parameters of the model:

    P(F_i = 1 | C = j) = a_{i,j}   (184)

We must also define class priors P(C = j) = b_j.

To classify a new feature vector using this model, we choose the class with maximum probability given the features. By Bayes' Rule this is:

    P(C = j | F_1:D) = P(F_1:D | C = j) P(C = j) / P(F_1:D)   (185)
                     = ( ∏_i P(F_i | C = j) ) P(C = j) / ∑_{l=1}^K P(F_1:D, C = l)   (186)
                     = ( ∏_{i: F_i = 1} a_{i,j} ) ( ∏_{i: F_i = 0} (1 − a_{i,j}) ) b_j
                       / ∑_{l=1}^K ( ∏_{i: F_i = 1} a_{i,l} ) ( ∏_{i: F_i = 0} (1 − a_{i,l}) ) b_l   (187)

If we wish to find the class with maximum posterior probability, we need only compute the numerator. The denominator in (187) is of course the same for all classes j. To compute the denominator one simply divides the numerator for each class by their sum.

The above computation involves the product of many numbers, some of which might be quite small. This can lead to underflow. For example, if you take the product a₁a₂...a_N, and all a_i ≪ 1, then the computation may evaluate to zero in floating point, even though the final computation after normalization should not be zero. If this happens for all classes, then the denominator will be zero, and you get a divide-by-zero error, even though, mathematically, the denominator cannot be zero. To avoid these problems, it is safer to perform the computations in the log-domain:

    α_j = ( ∑_{i: F_i = 1} ln a_{i,j} + ∑_{i: F_i = 0} ln(1 − a_{i,j}) ) + ln b_j   (188)
    γ = min_j α_j   (189)
    P(C = j | F_1:D) = exp(α_j − γ) / ∑_l exp(α_l − γ)   (190)

which, as you can see by inspection, is mathematically equivalent to the original form, but will not evaluate to zero for at least one class.

8.7.2 Learning

For a collection of N training vectors F_k, each with an associated class label C_k, we can learn the parameters by maximizing the data likelihood (i.e., the probability of the data given the model). This is equivalent to estimating multinomial distributions (in the case of binary features, binomial distributions), and reduces to simple counting of features.

Suppose there are N_k examples of each class, and N examples total. Then the prior estimate is simply:

    b_k = N_k / N   (191)

Similarly, if class k has N_{i,k} examples where F_i = 1, then

    a_{i,k} = N_{i,k} / N_k   (192)

With large numbers of features and small datasets, it is likely that some features will never be seen for a class, giving a class probability of zero for that feature. We might wish to regularize, to prevent this extreme model from occurring. We can modify the learning rule as follows:

    a_{i,k} = (N_{i,k} + α) / (N_k + 2α)   (193)

for some small value α.
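The log-domain classification rule (188)-(190) can be sketched as follows. The feature probabilities a and priors b below are made-up values; the code subtracts the largest α_j rather than the smallest, which is equally valid since any constant shift cancels in the normalization:

```python
import math

def naive_bayes_posterior(features, a, b):
    # a[j][i] = P(F_i = 1 | C = j); b[j] = P(C = j).
    # Compute alpha_j as in Eqn. (188), then normalize in the log domain,
    # shifting by max(alpha) so the exponentials cannot all underflow.
    alphas = []
    for j in range(len(b)):
        alpha = math.log(b[j])
        for i, f in enumerate(features):
            alpha += math.log(a[j][i] if f else 1.0 - a[j][i])
        alphas.append(alpha)
    m = max(alphas)
    exps = [math.exp(al - m) for al in alphas]
    z = sum(exps)
    return [e / z for e in exps]

a = [[0.9, 0.2], [0.1, 0.8]]   # made-up per-class feature probabilities
b = [0.5, 0.5]                 # equal class priors
post = naive_bayes_posterior([1, 0], a, b)
print(post[0] > post[1])  # True: this feature vector favours class 0
```

The shifted exponentials guarantee that at least one term equals 1, so the denominator can never evaluate to zero, which is exactly the failure mode described above.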
In the extreme case where there are no examples for which feature i is seen for class k, the probability a_{i,k} will be set to 1/2, corresponding to no knowledge. As the number of examples N_k becomes large, the role of α will become smaller and smaller.
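The regularized estimator (193) is one line of code. A sketch (the function name is ours) showing the two regimes just discussed:

```python
def estimate_feature_prob(n_ik, n_k, alpha=1.0):
    # Regularized counting estimate, Eqn. (193):
    # a_ik = (N_ik + alpha) / (N_k + 2*alpha).
    return (n_ik + alpha) / (n_k + 2 * alpha)

print(estimate_feature_prob(0, 0))     # 0.5: no examples, no knowledge
print(estimate_feature_prob(30, 100))  # near the raw ratio 0.3 once counts dominate
```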

In general, given a multinomial distribution with a large number of classes and a small training set, we might end up with estimates of the prior probability b_k being zero for some classes. This might be undesirable for various reasons, or be inconsistent with our prior beliefs. Again, to avoid this situation, we can regularize the maximum likelihood estimator with our prior belief that all classes should have a nonzero probability. In doing so we can estimate the class prior probabilities as

    b_k = (N_k + β) / (N + Kβ)   (194)

for some small value of β. When there are no observations whatsoever, all classes are given probability 1/K. When there are observations, the estimated probabilities will lie between N_k/N and 1/K (converging to N_k/N as N → ∞).

Derivation. Here we derive just the per-class probability assuming two classes, ignoring the feature vectors; this case reduces to estimating a binomial distribution. The full estimation can easily be derived in the same way. Suppose we observe N examples of class 0, and M examples of class 1; what is b₀, the probability of observing class 0? Using maximum likelihood estimation, we maximize:

    ∏_i P(C = c_i) = ( ∏_{i: c_i = 0} P(C = 0) ) ( ∏_{i: c_i = 1} P(C = 1) )   (195)
                   = b₀^N b₁^M   (196)

Furthermore, in order for the class probabilities to be a valid distribution, it is required that b₀ + b₁ = 1, and that b_k ≥ 0. In order to enforce the first constraint, we set b₁ = 1 − b₀:

    ∏_i P(C = c_i) = b₀^N (1 − b₀)^M   (197)

The log of this is:

    L(b₀) = N ln b₀ + M ln(1 − b₀)   (198)

To maximize, we compute the derivative and set it to zero:

    dL/db₀ = N/b₀ − M/(1 − b₀) = 0   (199)

Multiplying both sides by b₀(1 − b₀) and solving gives:

    b₀ = N/(N + M)   (200)

which, fortunately, is guaranteed to satisfy the constraint b₀ ≥ 0.

9 Gradient Descent

There are many situations in which we wish to minimize an objective function with respect to a parameter vector:

    w^* = \arg\min_w E(w)    (201)

but no closed-form solution for the minimum exists. In machine learning, this optimization is normally a data-fitting objective function, but similar problems arise throughout computer science, numerical analysis, physics, finance, and many other fields.

The solution we will use in this course is called gradient descent. It works for any differentiable energy function. However, it does not come with many guarantees: it is only guaranteed to find a local minimum in the limit of infinite computation time.

Gradient descent is iterative. First, we obtain an initial estimate w_1 of the unknown parameter vector. How we obtain this vector depends on the problem; one approach is to randomly sample values for the parameters. Then, from this initial estimate, we note that the direction of steepest descent from this point is given by the negative gradient −∇E of the objective function evaluated at w_1. The gradient is defined as a vector of derivatives with respect to each of the parameters:

    \nabla E \equiv \left[ \frac{dE}{dw_1}, \ldots, \frac{dE}{dw_N} \right]^T    (202)

The key point is that, if we follow the negative gradient direction for a small enough distance, the objective function is guaranteed to decrease. (This can be shown by considering a Taylor-series approximation to the objective function.)

It is easiest to visualize this process by considering E(w) as a surface parameterized by w; we are trying to find the deepest pit in the surface. We do so by taking small downhill steps in the negative gradient direction.

The entire process, in its simplest form, can be summarized as follows:

    pick initial value w_1
    i ← 1
    loop
        w_{i+1} ← w_i − λ ∇E(w_i)
        i ← i + 1
    end loop

Note that this process depends on three choices: the initialization, the termination conditions, and the step-size λ. For the termination condition, one can run until a preset number of steps has elapsed, or monitor convergence, i.e., terminate when

    |E(w_{i+1}) − E(w_i)| < ε    (203)

for some preselected constant ε, or terminate when either condition is met.

The simplest way to determine the step-size λ is to pick a single value in advance, and this approach is often taken in practice. However, it is somewhat unreliable: if we choose a step-size that is too large, then the objective function might actually get worse on some steps; if the step-size is too small, then the algorithm will take a very long time to make any progress. The solution is to use line search, namely, at each step, to search for a step-size that reduces the objective function as much as possible. For example, a simple gradient descent with line search procedure is:

    pick initial value w_1
    i ← 1
    loop
        d ← −∇E(w_i)
        λ ← 1
        while E(w_i + λd) ≥ E(w_i)
            λ ← λ/2
        end while
        w_{i+1} ← w_i + λd
        i ← i + 1
    end loop

A more sophisticated approach is to reuse step-sizes between iterations:

    pick initial value w_1
    i ← 1
    λ_0 ← 1
    loop
        d ← −∇E(w_i)
        λ ← 2λ_{i−1}
        while E(w_i + λd) ≥ E(w_i)
            λ ← λ/2
        end while
        λ_i ← λ
        w_{i+1} ← w_i + λd
        i ← i + 1
    end loop
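The step-size-reusing variant above can be sketched in Python as follows. This is an illustrative toy implementation, not code from the notes; the function names, the tolerance guard, and the test objective are invented.

```python
def gd_line_search(E, grad, w0, max_steps=200):
    """Gradient descent with the backtracking line search from the notes:
    start from twice the previous step size, halve until E decreases."""
    w = list(w0)
    lam = 1.0
    for _ in range(max_steps):
        g = grad(w)
        lam *= 2.0                      # reuse (and grow) the last step size

        def step(l):
            return [wi - l * gi for wi, gi in zip(w, g)]

        while E(step(lam)) >= E(w) and lam > 1e-12:
            lam /= 2.0                  # halve until the objective improves
        if lam <= 1e-12:
            break                       # no descent step found: at a minimum
        w = step(lam)
    return w

# toy quadratic objective with minimum at (3, -1)
E = lambda w: (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2
grad = lambda w: [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]
w = gd_line_search(E, grad, [0.0, 0.0])
```

On this quadratic the halving search finds an exact minimizing step along each gradient direction, so convergence is very fast; on general objectives the same loop only guarantees that each step decreases E.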

There are many, many more advanced methods for numerical optimization. For unconstrained optimization, I recommend the L-BFGS-B library, which is available for download on the web. It is written in Fortran, but there are wrappers for various languages out there. This method will be vastly superior to gradient descent for most problems.

9.1 Finite differences

The gradient of any function can be computed approximately by numerical computations. This is useful for debugging your gradient computations, and in situations where it is too difficult or tedious to implement the complete derivative. The numerical approximation follows directly from the definition of the derivative:

    \frac{dE}{dw_i} \approx \frac{E(w + h e_i) - E(w)}{h}    (204)

for some suitably small stepsize h, where e_i is the i-th standard basis vector. Computing this value for each element of the parameter vector gives you an approximate estimate of the gradient ∇E. It is strongly recommended that you use this method to debug your derivative computations; many errors can be detected this way! (This test is analogous to the use of assertions.)

Aside: The term backpropagation is sometimes used to refer to an efficient algorithm for computing derivatives for Artificial Neural Networks. Confusingly, this term is also used to refer to gradient descent (without line search) for ANNs.
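The finite-difference check of Equation (204) is short enough to write out in full. The snippet below is a sketch (the objective and tolerance are invented for illustration): it perturbs each parameter in turn and compares the result against a hand-derived gradient, exactly the debugging procedure the notes recommend.

```python
def numerical_gradient(E, w, h=1e-6):
    """Approximate dE/dw_i by forward differences, Equation (204)."""
    g = []
    for i in range(len(w)):
        w_step = list(w)
        w_step[i] += h                  # perturb only the i-th parameter
        g.append((E(w_step) - E(w)) / h)
    return g

# toy objective with a known analytic gradient
E = lambda w: w[0] ** 2 + 3.0 * w[0] * w[1]
grad_analytic = lambda w: [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]]

w = [1.0, 2.0]
g_num = numerical_gradient(E, w)
g_true = grad_analytic(w)
```

The forward-difference error is O(h), so with h = 1e-6 agreement to about four or five digits is expected; a large discrepancy almost always indicates a bug in the analytic gradient.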

10 Cross Validation

Suppose we must choose between two possible ways to fit some data. How do we choose between them? Simply measuring how well they fit the data would mean that we always try to fit the data as closely as possible; indeed, the best method for fitting the data is simply to memorize it in a big look-up table. However, fitting the data is no guarantee that we will be able to generalize to new measurements.

As another example, consider the use of polynomial regression to model a function given a set of data points. Higher-order polynomials will always fit the data as well as or better than a low-order polynomial; indeed, a degree N−1 polynomial will fit N data points exactly (to within numerical error). So just fitting the data as well as we can usually produces models with many parameters, and they are not going to generalize to new inputs in almost all cases of interest.

The general solution is to evaluate models by testing them on a new data set (the "test set"), distinct from the training set. This measures how predictive the model is: Is it useful in new situations?

More generally, we often wish to obtain empirical estimates of performance. This can be useful for finding errors in implementation, comparing competing models and learning algorithms, and detecting over- or under-fitting in a learned model.

10.1 Cross-Validation

The idea of empirical performance evaluation can also be used to determine model parameters that might otherwise be hard to determine. Examples of such model parameters include the constant K in the K-Nearest Neighbors approach or the σ parameter in the Radial Basis Function approach.

Hold-out Validation. In the simplest method, we first partition our data randomly into a training set and a validation set. Let K be the unknown model parameter. We pick a range of possible values for K (e.g., K = 1, ..., 5). For each possible value of K, we learn a model with that K on the training set, and compute that model's error on the validation set. For example, the error on the validation set might be just the squared error, \sum_i \|y_i - f(x_i)\|^2.
We then pick the K which has the smallest validation set error. The same idea can be applied if we have more model parameters (e.g., the σ in RBF regression); however, we must try many possible combinations of K and σ to find the best. There is a significant problem with this approach: we use less training data when fitting the other model parameters, and so we will only get good results if our initial training set is rather large. If large amounts of data are expensive or impossible to obtain, this can be a serious problem.

N-Fold Cross Validation. We can use the data much more efficiently by N-fold cross-validation. In this approach, we randomly partition the training data into N sets of equal size and run the learning algorithm N times. Each time, a different one of the N sets is deemed the test set, and the model is trained on the remaining N−1 sets. The value of K is scored by averaging the error across the N test errors. We can then pick the value of K that has the lowest score, and then learn model parameters for this K.
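Both procedures reduce to scoring candidate values of K on held-out data. A minimal hold-out version for a 1-D K-Nearest Neighbors classifier might look as follows; the dataset, candidate values, and function names are invented for illustration (the training set contains one mislabeled point, which is exactly the situation where K = 1 overfits).

```python
def knn_predict(x, train_x, train_y, K):
    """Classify x by majority vote among the K nearest 1-D training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:K]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

def holdout_error(train, val, K):
    """Fraction of validation points misclassified by K-NN on the training set."""
    tx, ty = zip(*train)
    return sum(knn_predict(x, tx, ty, K) != y for x, y in val) / len(val)

# class 0 clusters near 0, class 1 near 10; (0.3, 1) is a noisy training label
train = [(0.0, 0), (0.5, 0), (1.0, 0), (0.3, 1), (9.0, 1), (9.5, 1), (10.0, 1)]
val = [(0.25, 0), (0.7, 0), (9.2, 1), (9.8, 1)]

best_K = min([1, 3, 5], key=lambda K: holdout_error(train, val, K))
```

Here K = 1 is fooled by the mislabeled training point, while K = 3 votes it down, so the hold-out procedure selects K = 3.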

A good choice for N is N = M, where M is the number of data points, so that each fold contains a single point. This is called leave-one-out cross-validation.

Issues with Cross Validation. Cross validation is a very simple and empirical way of comparing models. However, there are a number of issues to keep in mind:

- The method can be very time-consuming, since many training runs may be needed. For models with more than a few parameters, cross validation may be too inefficient to be useful.

- Because a reduced dataset is used for training, there must be sufficient training data so that all relevant phenomena of the problem exist in both the training data and the test data.

- It is safest to use a random partition, to avoid the possibility that there are unmodeled correlations in the data. For example, if the data was collected over a period of a week, it is possible that data from the beginning of the week has a different structure than the data later in the week.

- Because cross-validation finds a minimum of an objective function, over- and under-fitting may still occur, although it is much less likely. For example, if the test set is very small, it may prefer a model that fits the random pattern in the test data.

Aside: Testing machine learning algorithms is very much like testing scientific theories: scientific theories must be predictive, that is, falsifiable. Scientific theories must also describe plausible models of reality, whereas machine learning methods need only be useful for making decisions. However, statistical inference and learning first arose as theories of scientific hypothesis testing, and remain closely related today.

One of the most famous examples is the case of planetary motion. Prior to Newton, astronomers described the motion of the planets through onerous tabulation of measurements (essentially, big lookup tables). These tables were not especially predictive, and needed to be updated constantly. Newton's equations of motion, which could describe the motion of the planets with only a few simple equations, were vastly simpler and yet also more effective at predicting motion, and became the accepted theory of motion.
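The N-fold procedure itself is only a few lines. The sketch below (invented names and data, not from the notes) uses 5-fold cross-validation to choose between a constant model and a least-squares line fit to noisy linear data; the fold construction and averaging follow the description above.

```python
import random

def n_fold_cv(data, n_folds, fit, error, params):
    """Score each candidate parameter by N-fold cross-validation
    and return the one with the lowest average held-out error."""
    data = data[:]
    random.Random(0).shuffle(data)      # random partition, as the notes advise
    folds = [data[i::n_folds] for i in range(n_folds)]

    def score(p):
        total = 0.0
        for i in range(n_folds):
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            total += error(fit(train, p), folds[i])
        return total / n_folds

    return min(params, key=score)

def fit_poly(train, degree):
    """Degree 0: constant (mean); degree 1: closed-form least-squares line."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    n = len(train)
    if degree == 0:
        return (sum(ys) / n, 0.0)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    return (my - b * mx, b)

def sq_error(model, test):
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in test) / len(test)

# deterministic "noisy line" data: y = 2x + 1 plus a small perturbation
data = [(x, 2.0 * x + 1.0 + 0.1 * ((x * 7) % 3 - 1)) for x in range(20)]
best_degree = n_fold_cv(data, 5, fit_poly, sq_error, [0, 1])
```

The line model wins by a wide margin here; with genuinely constant data the averaging over folds would instead favor degree 0.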
However, there remained some anomalies. Two astronomers, John Couch Adams and Urbain Le Verrier, thought that these discrepancies might be due to a new, as-yet-undiscovered planet. Using techniques similar to modern regression, but with laborious hand-calculation, they independently deduced the position, mass, and orbit of the new planet. By observing in the predicted directions, astronomers were

indeed able to observe a new planet, which was later named Neptune. This provided powerful validation for their models. Incidentally, Adams was an undergraduate working alone when he began his investigations.

Reference: Wikipedia article on the discovery of Neptune.

11 Bayesian Methods

So far, we have considered statistical methods which select a single "best" model given the data. This approach can have problems, such as over-fitting when there is not enough data to fully constrain the model fit. In contrast, in the "pure" Bayesian approach, as much as possible we only compute distributions over unknowns; we never maximize anything. For example, consider a model parameterized by some weight vector w, and some training data D that comprises input-output pairs (x_i, y_i), for i = 1...N. The posterior probability distribution over the parameters, conditioned on the data, is, using Bayes' rule, given by

    p(w | D) = \frac{p(D | w) p(w)}{p(D)}    (205)

The reason we want to fit the model in the first place is to allow us to make predictions with future test data. That is, given some future input x_new, we want to use the model to predict y_new. To accomplish this task through estimation in previous chapters, we used optimization to find ML or MAP estimates of w, e.g., by maximizing (205).

In a Bayesian approach, rather than estimating a single best value for w, we compute (or approximate) the entire posterior distribution p(w|D). Given the entire distribution, we can still make predictions with the following integral:

    p(y_{new} | D, x_{new}) = \int p(y_{new}, w | D, x_{new}) \, dw
                            = \int p(y_{new} | w, D, x_{new}) \, p(w | D, x_{new}) \, dw    (206)

The first step in this equality follows from the Sum Rule. The second follows from the Product Rule. Additionally, the outputs y_new and the training data D are independent conditioned on w, so p(y_new|w, D) = p(y_new|w). That is, given w, we have all available information about making predictions that we could possibly get from the training data D (according to the model). Finally, given D, it is safe to assume that x_new, in itself, provides no information about w. With these assumptions we have the following expression for our predictions:

    p(y_{new} | D, x_{new}) = \int p(y_{new} | w, x_{new}) \, p(w | D) \, dw    (207)

In the case of discrete parameters w, the integral becomes a summation.
The predictive distribution p(y_new|D, x_new) tells us everything there is to know about our beliefs about the new value y_new. There are many things we can do with this distribution. For example, we could pick the most likely prediction, i.e., \arg\max_y p(y_{new}|D, x_{new}), or we could compute the variance of this distribution to get a sense of how much confidence we have in the prediction. We could sample from this distribution in order to visualize the range of models that are plausible for this data.

The integral in (207) is rarely easy to compute, often involving intractable integrals or exponentially large summations. Thus, Bayesian methods often rely on numerical approximations, such as Monte Carlo sampling; MAP estimation can also be viewed as an approximation. However, in a few cases, the Bayesian computations can be done exactly, as in the regression case discussed below.

11.1 Bayesian Regression

Recall the statistical model used in basis-function regression:

    y = b(x)^T w + n,    n \sim N(0, \sigma^2)    (208)

for a fixed set of basis functions b(x) = [b_1(x), ..., b_M(x)]^T.

To complete the model, we also need to define a prior distribution over the weights w (denoted p(w)) which expresses what we believe about w in the absence of any training data. One might be tempted to assign a constant density over all possible weights. There are several problems with this. First, the result cannot be a valid probability distribution since no choice of the constant will give the density a finite integral. We could, instead, choose a uniform distribution with finite bounds; however, this will make the resulting computations more complex. More importantly, a uniform prior is often inappropriate; we often find that smoother functions are more likely in practice (at least for functions that we have any hope of learning), and so we should employ a prior that prefers smooth functions. A choice of prior that does so is a Gaussian prior:

    w \sim N(0, \alpha^{-1} I)    (209)

which expresses a prior belief that smooth functions are more likely. This prior also has the additional benefit that it will lead to tractable integrals later on. Note that this prior depends on a parameter α; we will see later in this chapter how this hyperparameter can be determined automatically as well.
As developed in previous chapters on regression, the data likelihood function that follows from the above model definition (with the input and output components of the training dataset denoted x_{1:N} and y_{1:N}) is

    p(y_{1:N} | x_{1:N}, w) = \prod_{i=1}^N p(y_i | x_i, w)    (210)

and so the posterior is:

    p(w | x_{1:N}, y_{1:N}) = \frac{\left( \prod_{i=1}^N p(y_i | x_i, w) \right) p(w)}{p(y_{1:N} | x_{1:N})}    (211)

In the negative log-domain, using Equations (208) and (209), the model is given by:

    -\ln p(w | x_{1:N}, y_{1:N}) = -\sum_i \ln p(y_i | x_i, w) - \ln p(w) + \ln p(y_{1:N} | x_{1:N})
                                 = \frac{1}{2\sigma^2} \sum_i (y_i - f(x_i))^2 + \frac{\alpha}{2} \|w\|^2 + \text{constants}

As above in the regression notes, it is useful if we collect the training outputs into a single vector, i.e., y = [y_1, ..., y_N]^T, and we collect all of the basis functions evaluated at each of the inputs into a matrix B with elements B_{i,j} = b_j(x_i). In doing so we can simplify the negative log posterior as follows:

    -\ln p(w | x_{1:N}, y_{1:N}) = \frac{1}{2\sigma^2} \|y - Bw\|^2 + \frac{\alpha}{2} \|w\|^2 + \text{constants}
        = \frac{1}{2\sigma^2} (y - Bw)^T (y - Bw) + \frac{\alpha}{2} w^T w + \text{constants}
        = \frac{1}{2} w^T (B^T B / \sigma^2 + \alpha I) w - \frac{1}{2} y^T B w / \sigma^2 - \frac{1}{2} w^T B^T y / \sigma^2 + \text{constants}
        = \frac{1}{2} (w - \bar{w})^T K^{-1} (w - \bar{w}) + \text{constants}    (212)

where

    K = \left( B^T B / \sigma^2 + \alpha I \right)^{-1}    (213)

    \bar{w} = K B^T y / \sigma^2    (214)

(The last step of the derivation uses the method of completing the square. It is easiest to verify the last step by going backwards, that is, by multiplying out (w - \bar{w})^T K^{-1} (w - \bar{w}).)

The derivation above tells us that the posterior distribution over the weight vector is a multidimensional Gaussian with mean \bar{w} and covariance matrix K, i.e.,

    p(w | x_{1:N}, y_{1:N}) = G(w; \bar{w}, K)    (215)

In other words, our belief about w once we have seen the data is specified by a Gaussian density. We believe that \bar{w} is the most probable value for w, but we have uncertainty about this estimate, as determined by the covariance K. The covariance expresses our uncertainty about these parameters. If the covariance is very small, then we have a lot of confidence in the MAP estimate. The nature of the posterior distribution is illustrated visually in Figure 14. Note that \bar{w} is the MAP estimate for regression, since it maximizes the posterior.

Prediction. For a new data point x_new, the predictive distribution for y_new is given by:

    p(y_{new} | x_{new}, D) = \int p(y_{new} | x_{new}, w) \, p(w | D) \, dw
                            = N(y_{new}; \, b(x_{new})^T \bar{w}, \, \sigma^2 + b(x_{new})^T K \, b(x_{new}))
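For a linear basis b(x) = [1, x], the posterior mean and covariance of Equations (213) and (214) can be computed in a few lines, since the required inverse is only 2x2. The sketch below is an illustrative example, not code from the notes; the function name, the noise level σ² = 0.25, and the toy dataset are invented.

```python
def posterior_linreg(xs, ys, sigma2=0.25, alpha=1.0):
    """Posterior over w for y = w0 + w1*x + noise, basis b(x) = [1, x].
    Returns the posterior mean w_bar and 2x2 covariance K from
    Equations (213)-(214); the 2x2 matrix inverse is written out by hand."""
    # A = B^T B / sigma2 + alpha*I  (2x2, symmetric)
    s00 = len(xs) / sigma2 + alpha
    s01 = sum(xs) / sigma2
    s11 = sum(x * x for x in xs) / sigma2 + alpha
    det = s00 * s11 - s01 * s01
    K = [[s11 / det, -s01 / det], [-s01 / det, s00 / det]]   # K = A^{-1}
    # t = B^T y / sigma2
    t0 = sum(ys) / sigma2
    t1 = sum(x * y for x, y in zip(xs, ys)) / sigma2
    w_bar = [K[0][0] * t0 + K[0][1] * t1,
             K[1][0] * t0 + K[1][1] * t1]                    # w_bar = K B^T y / sigma2
    return w_bar, K

# noiseless data from y = 1 + 2x; the prior still shrinks the estimate slightly
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1.0 + 2.0 * x for x in xs]
w_bar, K = posterior_linreg(xs, ys)
```

With more data the posterior mean approaches the true weights and the diagonal of K shrinks, matching the intuition that the covariance measures our remaining uncertainty.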

Figure 14: Iterative posterior computation for a linear regression model: y = w_0 x + w_1. The top row shows the prior distribution, and several fair samples from the prior distribution. The second row shows the likelihood over w after observing a single data point (i.e., an (x, y) pair), along with the resulting posterior (the normalized product of the likelihood and the prior), and then several fair samples from the posterior. The third row shows the likelihood when a new observation is added to the previous observation, followed by the corresponding posterior and random samples from the posterior. The final row shows the result of 20 observations.

The predictive distribution may be viewed as a function from x_new to a distribution over values of y_new. An example of this for an RBF model is given in Figure 15.

This is the Bayesian way to do regression. To predict a new value y_new for an input x_new, we don't estimate a single model w. Instead we average over all possible models, weighting the different models according to their posterior probability.

11.2 Hyperparameters

There are often implicit parameters in our model that we hold fixed, such as the covariance constants in linear regression, or the parameters that govern the prior distribution over the weights. These are usually called "hyperparameters." For example, in the RBF model, the hyperparameters comprise the parameters α, σ², and the parameters of the basis functions (e.g., the width of the basis functions). Thus far we have assumed that the hyperparameters were "known" (which means that someone must set them by hand), or estimated by cross-validation (which has a number of pitfalls, including long computation times, especially for large numbers of hyperparameters). Instead of either of these approaches, we may apply the Bayesian approach in order to directly estimate these values as well.

To find a MAP estimate for the α parameter in the above linear regression example we compute:

    \alpha^* = \arg\max_\alpha \ln p(\alpha | x_{1:N}, y_{1:N})    (216)

where

    p(\alpha | x_{1:N}, y_{1:N}) = \frac{p(y_{1:N} | x_{1:N}, \alpha) \, p(\alpha)}{p(y_{1:N} | x_{1:N})}

and

    p(y_{1:N} | x_{1:N}, \alpha) = \int p(y_{1:N}, w | x_{1:N}, \alpha) \, dw
        = \int p(y_{1:N} | x_{1:N}, w, \alpha) \, p(w | \alpha) \, dw
        = \int \left( \prod_i p(y_i | x_i, w, \alpha) \right) p(w | \alpha) \, dw    (217)

For RBF regression, this objective function can be computed in closed form. However, depending on the form of the prior over the hyperparameters, it is often necessary to use some form of numerical optimization, such as gradient descent.

11.3 Bayesian Model Selection

How do we choose which model to use? For example, we might like to automatically choose the form of the basis functions or the number of basis functions. Cross-validation is one approach, but

Figure 15: Predictive distribution for an RBF model (with 9 basis functions), trained on noisy sinusoidal data. The green curve is the true underlying sinusoidal function. The blue circles are data points. The red curve is the mean prediction as a function of the input. The pink region represents 1 standard deviation. Note how this region shrinks close to where more data points are observed. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

it can be expensive and, more importantly, inaccurate if small amounts of data are available. In general, one intuition is that we want to choose simple models over complex models to avoid over-fitting, insofar as they provide equivalent fits to the data. Below we consider a Bayesian approach to model selection which provides just such a bias to simple models.

The goal of model selection is to choose the best model from some set of candidate models \{M_i\}_{i=1}^L based on some observed data D. This may be done either with a maximum likelihood approach (picking the model that assigns the largest likelihood to the data) or a MAP approach (picking the model with the highest posterior probability). If we take a uniform prior over models (i.e., p(M_i) is a constant for all i = 1...L) then these approaches can be seen to be equivalent since:

    p(M_i | D) = \frac{p(D | M_i) \, p(M_i)}{p(D)} \propto p(D | M_i)

In practice a uniform prior over models may not be appropriate, but the design of suitable priors in these cases will depend significantly on one's knowledge of the application domain. So here we will assume a uniform prior over models and focus on p(D|M_i).

In some sense, whenever we estimate a parameter in a model we are doing model selection, where the family of models is indexed by the different values of that parameter. However, the term "model selection" can also mean choosing the best model from some set of parametric models that are parameterized differently. A good example of this would be choosing the number of basis functions to use in an RBF regression model. Another simple example is choosing the polynomial degree for polynomial regression.

The key quantity for Bayesian model selection is p(D|M_i), often called the marginal data likelihood. Given two models, M_1 and M_2, we will choose the model M_1 when p(D|M_1) > p(D|M_2). To specify these quantities in more detail we need to take the model parameters into account. Different models may have different numbers of parameters (e.g., polynomials of different degrees), or entirely different parameterizations (e.g., RBFs and neural networks).
In what follows, let w_i be the vector of parameters for model M_i. In the case of regression, for example, w_i might comprise the regression weights and hyperparameters like the weight on the regularizer. The extent to which a model explains (or fits) the data depends on the choice of the right parameters. Using the sum rule and Bayes' rule, it follows that we can write the marginal data likelihood as

    p(D | M_i) = \int p(D, w_i | M_i) \, dw_i = \int p(D | w_i, M_i) \, p(w_i | M_i) \, dw_i    (218)

This tells us that the ideal model is one that assigns high prior probability p(w_i|M_i) to every weight vector that also yields a high value of the likelihood p(D|w_i, M_i) (i.e., to parameter vectors that fit the data well). One can also recognize that the product of the data likelihood and the prior in the integrand is proportional to the posterior over the parameters that we previously maximized to find MAP estimates of the model parameters.⁸

⁸ This is the same quantity we compute when optimizing hyperparameters (which is a type of model selection), and it also corresponds to the denominator p(D) in Bayes' rule for finding the posterior probability of a particular setting of the parameters w. Note that above we generally wrote p(D) and not p(D|M_i) because we were only considering a single model, and so it was not necessary to condition on it.

Typically, a complex model that assigns a significant posterior probability mass to complex data will be able to assign significantly less mass to simpler data than a simpler model would. This is because the integral of the probability mass must sum to 1, and so a complex model will have less mass to spend on simpler data. Also, since a complex model will require higher-dimensional parameterizations, mass must be spread over a higher-dimensional space, and hence more thinly. This phenomenon is visualized in Figure 17.

As an aid to intuition, to explain why this marginal data likelihood helps us choose good models, we consider a simple approximation to the marginal data likelihood p(D|M_i) (depicted in Figure 16 for a scalar parameter w). First, as is common in many problems of interest, we assume that the posterior distribution over the model parameters, p(w|D, M_i) \propto p(D|w, M_i) p(w|M_i), has a strong peak at the MAP parameter estimate w_{MAP}. Accordingly, we can approximate the integral in Equation (218) as the height of the peak, i.e., p(D|w_{MAP}, M_i) p(w_{MAP}|M_i), multiplied by its width \Delta w_{posterior}:

    \int p(D | w, M_i) \, p(w | M_i) \, dw \approx p(D | w_{MAP}, M_i) \, p(w_{MAP} | M_i) \, \Delta w_{posterior}

We then assume that the prior distribution over parameters p(w|M_i) is a relatively broad uniform with width \Delta w_{prior}, so p(w_{MAP}|M_i) \approx 1/\Delta w_{prior}. This yields a further approximation:

    \int p(D | w, M_i) \, p(w | M_i) \, dw \approx p(D | w_{MAP}, M_i) \, \frac{\Delta w_{posterior}}{\Delta w_{prior}}

Taking the logarithm, this becomes

    \ln p(D | w_{MAP}, M_i) + \ln \frac{\Delta w_{posterior}}{\Delta w_{prior}}

Intuitively, this approximation tells us that models with wider prior distributions on the parameters will tend to assign less likelihood to the data, because the wider prior captures a larger variety of data (so the density is spread more thinly over the data-space). Similarly, models that have a very narrow peak around their modes are generally less preferable, because they assign lower probability mass to the surrounding area (and so a slightly perturbed setting of the parameters would provide a poor fit to the data, suggesting that over-fitting has occurred).

From another perspective, note that in most cases of interest we can assume that \Delta w_{posterior} < \Delta w_{prior}.
That is, the posterior width will be less than the width of the prior, and so the log ratio is maximal when the prior and posterior widths are equal. For example, a complex model with many parameters, or a very broad prior over the parameters, will necessarily assign a small probability to any single value (including those under the posterior peak). A simpler model will assign a higher prior

Figure 16: A visualization of the width-based evidence approximation. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

probability to the useful parameter values (i.e., those under the posterior peak). When the model is too simple, the likelihood term in the integrand will be particularly low, which lowers the marginal data likelihood. So, as models become more complex, the data likelihood term increasingly fits the data better. But as the models become more and more complex, the log ratio \ln(\Delta w_{posterior} / \Delta w_{prior}) acts as a penalty on unnecessarily complex models. By selecting a model that assigns the highest posterior probability to the data, we are automatically balancing model complexity with the ability of the model to capture the data. This can be seen as a mathematical realization of Occam's Razor.

Model averaging. To be fully Bayesian, arguably, we shouldn't select a single "best" model, but should instead combine estimates from all models according to their respective posterior probabilities:

    p(y_{new} | D, x_{new}) = \sum_i p(y_{new} | M_i, D, x_{new}) \, p(M_i | D)    (219)

but this is often impractical, and so we resort to model selection instead.

Figure 17: The x-axis is data complexity, from simplest to most complex, and models M_i are indexed in order of increasing complexity. Note that in this example M_2 is the best model choice for data D_0, since it is simultaneously complex enough to assign mass to D_0 but not so complex that it must spread its mass too thinly. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

12 Monte Carlo Methods

Monte Carlo is an umbrella term referring to a set of numerical techniques for solving one or both of these problems:

1. Approximating expected values that cannot be computed in closed form
2. Sampling from distributions for which a simple sampling algorithm is not available

Recall that the expectation of a function φ(x) of a continuous variable x with respect to a distribution p(x) is defined as:

    E_{p(x)}[\phi(x)] \equiv \int p(x) \, \phi(x) \, dx    (220)

Monte Carlo methods approximate this integral by drawing N samples from p(x),

    x_i \sim p(x)    (221)

and then approximating the integral by the sample average:

    E_{p(x)}[\phi(x)] \approx \frac{1}{N} \sum_{i=1}^N \phi(x_i)    (222)

Estimator properties. This estimate is unbiased:

    E_{p(x_{1:N})}\left[ \frac{1}{N} \sum_i \phi(x_i) \right] = \frac{1}{N} \sum_i E_{p(x_i)}[\phi(x_i)] = \frac{1}{N} N \, E_{p(x)}[\phi(x)] = E_{p(x)}[\phi(x)]    (223)

Furthermore, the variance of this estimate is inversely proportional to the number of samples:

    \mathrm{var}_{p(x_{1:N})}\left[ \frac{1}{N} \sum_i \phi(x_i) \right] = \frac{1}{N^2} \sum_i \mathrm{var}_{p(x_i)}[\phi(x_i)] = \frac{1}{N^2} N \, \mathrm{var}_{p(x)}[\phi(x)] = \frac{1}{N} \mathrm{var}_{p(x)}[\phi(x)]    (224)

Hence, the more samples we get, the better our estimate will be; in the limit, the estimator will converge to the true value.

Dealing with unnormalized distributions. We often wish to compute the expected value under a distribution for which evaluating the normalization constant is difficult. For example, the posterior distribution over parameters w given data D is:

    p(w | D) = \frac{p(D | w) \, p(w)}{p(D)}    (225)
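The basic estimator of Equation (222) is one line of code. The snippet below is an invented illustration (function names and the target quantity are not from the notes): it estimates E[x²] for x ~ N(0, 1), whose true value is the variance, exactly 1.

```python
import random

def mc_expectation(phi, sampler, N):
    """Approximate E_p[phi(x)] by the average of phi over N samples, Eq. (222)."""
    return sum(phi(sampler()) for _ in range(N)) / N

rng = random.Random(0)                      # seeded for reproducibility
# E[x^2] under N(0, 1) is exactly 1
est = mc_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 50000)
```

By Equation (224) the standard deviation of this estimate scales as 1/sqrt(N), so with 50,000 samples the estimate should sit well within a few hundredths of 1.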

The posterior mean and covariance (\bar{w} = E[w] and E[(w - \bar{w})(w - \bar{w})^T]) can be useful for understanding this posterior, i.e., what we believe the parameter values are on average, and how much uncertainty there is in the parameters. The numerator of p(w|D) is typically easy to compute, but p(D) entails an integral which is often intractable, and thus must be handled numerically.

Most generally, we can write the problem as computing an expected value with respect to a distribution p(x) defined as

    p(x) \equiv \frac{1}{Z} \tilde{P}(x),    Z = \int \tilde{P}(x) \, dx    (226)

Monte Carlo methods will allow us to handle distributions of this form.

12.1 Sampling Gaussians

We begin with algorithms for sampling from a Gaussian distribution. For the simple 1-dimensional case, x ~ N(0, 1), there is a well-known algorithm called the Box-Muller method, based on an approach called rejection sampling. It is implemented in Matlab in the command randn.

For a general 1D Gaussian, x ~ N(µ, σ²), we sample a variable z ~ N(0, 1), and then set x = σz + µ. You should be able to show that x has the desired mean and variance.

For the multi-dimensional case, x ~ N(0, I), each element is independent and Gaussian, x_i ~ N(0, 1), and so each element can be sampled with randn.

To sample from a Gaussian with general mean vector µ and covariance matrix Σ, we first sample z ~ N(0, I), and then set x = Lz + µ, where Σ = LL^T. We can compute L from Σ by the Cholesky factorization of Σ, which must be positive definite. Then we have

    E[x] = E[Lz + \mu] = L \, E[z] + \mu = \mu    (227)

and

    E[(x - \mu)(x - \mu)^T] = E[Lz (Lz)^T] = L \, E[z z^T] \, L^T = L L^T = \Sigma    (228)

12.2 Importance Sampling

In some situations, it may be difficult to sample from the desired distribution p(x); however, we can sample from a similar distribution q(x). Importance sampling is a technique that allows one to approximate an expectation with respect to p(x) by sampling from q(x). The only requirement on q is that it have the same support as p, i.e., q is nonzero everywhere that p is nonzero.
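The x = Lz + µ construction is easy to demonstrate for a 2x2 covariance, where the Cholesky factor can be written out by hand. The code below is a sketch with invented names and an invented covariance; the sample statistics should recover µ and Σ, matching Equations (227) and (228).

```python
import math
import random

def cholesky2(S):
    """Lower-triangular Cholesky factor L of a 2x2 positive definite S = L L^T."""
    l00 = math.sqrt(S[0][0])
    l10 = S[1][0] / l00
    l11 = math.sqrt(S[1][1] - l10 * l10)
    return [[l00, 0.0], [l10, l11]]

def sample_gaussian(mu, S, rng):
    """x = L z + mu with z ~ N(0, I), so E[x] = mu and Cov[x] = L L^T = S."""
    L = cholesky2(S)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mu[0] + L[0][0] * z[0],
            mu[1] + L[1][0] * z[0] + L[1][1] * z[1]]

rng = random.Random(1)
mu = [1.0, -2.0]
S = [[2.0, 0.6], [0.6, 1.0]]               # positive definite covariance
xs = [sample_gaussian(mu, S, rng) for _ in range(50000)]
```

Checking the sample mean and the off-diagonal sample covariance against µ and Σ is a useful sanity test for any multivariate Gaussian sampler.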
Importance sampling is based on the following equality:

    E_{q(x)}\left[ \frac{p(x)}{q(x)} \phi(x) \right] = \int \frac{p(x)}{q(x)} \phi(x) \, q(x) \, dx    (229)
        = \int \phi(x) \, p(x) \, dx    (230)
        = E_{p(x)}[\phi(x)]    (231)

In other words, we can compute the desired expectation by sampling values x_i from q(x), and then computing

    E_{q}\left[ \frac{p(x)}{q(x)} \phi(x) \right] \approx \frac{1}{N} \sum_i \frac{p(x_i)}{q(x_i)} \phi(x_i)    (232)

It often happens that p and/or q are known only up to multiplicative constants. That is,

    p(x) \equiv \frac{1}{Z_p} \tilde{P}(x)    (233)

    q(x) \equiv \frac{1}{Z_q} \tilde{Q}(x)    (234)

where \tilde{P} and \tilde{Q} are easy to evaluate but the constants Z_p and Z_q are not. Then we have:

    E_{p(x)}[\phi(x)] = \int \frac{\frac{1}{Z_p} \tilde{P}(x)}{\frac{1}{Z_q} \tilde{Q}(x)} \phi(x) \, q(x) \, dx = \frac{Z_q}{Z_p} E_{q(x)}\left[ \frac{\tilde{P}(x)}{\tilde{Q}(x)} \phi(x) \right]    (235)

and so it remains to approximate Z_q/Z_p. If we substitute φ(x) = 1, the above formula states that

    \frac{Z_q}{Z_p} E_{q(x)}\left[ \frac{\tilde{P}(x)}{\tilde{Q}(x)} \right] = 1    (236)

and so Z_p / Z_q = E_{q(x)}[\tilde{P}(x) / \tilde{Q}(x)]. Thus we have:

    E_{p(x)}[\phi(x)] = \frac{E_{q(x)}\left[ \frac{\tilde{P}(x)}{\tilde{Q}(x)} \phi(x) \right]}{E_{q(x)}\left[ \frac{\tilde{P}(x)}{\tilde{Q}(x)} \right]}    (237)

Hence, the importance sampling algorithm is:

1. Sample N values x_i ~ q(x)
2. Compute the weights

    w_i = \frac{\tilde{P}(x_i)}{\tilde{Q}(x_i)}    (238)

3. Estimate the expected value

    E[\phi(x)] \approx \frac{\sum_i w_i \phi(x_i)}{\sum_i w_i}    (239)

The importance sampling algorithm will only work well when q(x) is sufficiently similar to the function p(x)|φ(x)|. Put more concretely, the variance of the estimator grows as the dissimilarity between q(x) and p(x)|φ(x)| grows (and is minimized when they are equal). An alternative is to use the MCMC algorithm to draw samples directly from p(x), as described below.
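The three-step algorithm above, with the self-normalizing weights of Equations (238) and (239), can be sketched directly. The example below is invented for illustration: the target is N(2, 1) known only up to a constant, the proposal is a broad N(0, 3²), and we estimate the target mean (which should come out near 2).

```python
import math
import random

def importance_sample(phi, P_tilde, Q_tilde, sample_q, N, rng):
    """Self-normalized importance sampling, Equations (238)-(239):
    w_i = P~(x_i)/Q~(x_i); estimate = sum(w_i phi(x_i)) / sum(w_i)."""
    xs = [sample_q(rng) for _ in range(N)]
    ws = [P_tilde(x) / Q_tilde(x) for x in xs]
    return sum(w * phi(x) for w, x in zip(ws, xs)) / sum(ws)

# unnormalized target: p(x) proportional to exp(-(x-2)^2 / 2), i.e. N(2, 1)
P_tilde = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)
# unnormalized proposal: q = N(0, 3^2), broad enough to cover the target
Q_tilde = lambda x: math.exp(-0.5 * (x / 3.0) ** 2)

rng = random.Random(0)
est_mean = importance_sample(lambda x: x, P_tilde, Q_tilde,
                             lambda r: r.gauss(0.0, 3.0), 200000, rng)
```

Note that neither normalization constant is ever evaluated; the ratio of weight sums cancels them, exactly as in Equation (237). If the proposal had little mass near x = 2, the weights would become highly variable and the estimate unreliable, as the text warns.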

Figure 18: Importance sampling may be used to sample relatively complicated distributions, like this bimodal p(x), by instead sampling simpler distributions, like this unimodal q(x). Note that in this example, sampling from q(x) will produce many samples that will be given a very low weight, since q(x) has a lot of mass where p(x) is near zero (in the center of the plot). On the other hand, q(x) has ample mass around the two modes of p(x), and so it is a relatively good choice. If q(x) had very little mass around one of the modes of p(x), the estimate given by importance sampling would have a very high variance (unless φ(x) was small enough there to compensate for the difference). (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

12.3 Markov Chain Monte Carlo (MCMC)

MCMC is a very general algorithm for sampling from any distribution. For example, there is no simple method for sampling models w from the posterior distribution except in specialized cases (e.g., when the posterior is Gaussian). MCMC is an iterative algorithm that, given a sample x_t ∼ p(x), modifies that sample to produce a new sample x_{t+1} ∼ p(x). This modification is done using a proposal distribution q(x′|x) that, given an x, randomly selects a mutation to x. This proposal distribution may be almost anything, and it is up to the user of the algorithm to choose it; a common choice would be simply a Gaussian centered at x: q(x′|x) = N(x′; x, σ²I).

The entire algorithm is:

select initial point x_1
t ← 1
loop
    Sample x′ ∼ q(x′|x_t)
    α ← ( P*(x′) q(x_t|x′) ) / ( P*(x_t) q(x′|x_t) )
    Sample u ∼ Uniform[0, 1]
    if u ≤ α then
        x_{t+1} ← x′
    else
        x_{t+1} ← x_t
    end if
    t ← t + 1
end loop

Amazingly, it can be shown that, if x_1 is a sample from p(x), then every subsequent x_t is also a sample from p(x), if they are considered in isolation. The samples are correlated with each other via the Markov chain, but the marginal distribution of any individual sample is p(x).

So far we assumed that x_1 is a sample from the target distribution but, of course, obtaining this first sample is itself difficult. Instead, we must perform a process called burn-in: we initialize with any x_1, and then discard the first T samples obtained by the algorithm; if we pick a large enough value of T, we are guaranteed that the remaining samples are valid samples from the target distribution. However, there is no exact method for determining a sufficient T, and so heuristics and/or experimentation must be used.
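The loop above can be sketched in code. This is an illustrative example, not from the notes: the target P*(x) = exp(−(x−3)²/2) (i.e., N(3, 1) up to a constant), the proposal width σ, the burn-in length T, and the number of steps are all hypothetical choices. With a symmetric Gaussian proposal, the factor q(x_t|x′)/q(x′|x_t) in α cancels.

```python
import numpy as np

# Metropolis sampler sketch with a symmetric Gaussian proposal, so the
# q(x_t|x') / q(x'|x_t) factor in the acceptance ratio cancels.
# Hypothetical target: P*(x) = exp(-(x - 3)^2 / 2), i.e., N(3, 1) unnormalized.
rng = np.random.default_rng(1)

def P_star(x):
    return np.exp(-0.5 * (x - 3.0)**2)

sigma = 1.0        # proposal std; a tuning choice left to the user
T_burn = 1000      # burn-in length (a heuristic choice, as the text notes)
n_steps = 20000

x_t = 0.0          # arbitrary initial point
samples = []
for t in range(n_steps):
    x_prop = rng.normal(x_t, sigma)        # sample x' ~ q(x'|x_t)
    alpha = P_star(x_prop) / P_star(x_t)   # acceptance ratio
    if rng.uniform() <= alpha:             # accept with probability min(1, alpha)
        x_t = x_prop
    samples.append(x_t)                    # rejected steps repeat x_t

post_burn = np.array(samples[T_burn:])     # discard the first T samples
```

After burn-in, the empirical mean and standard deviation of the chain should approximate those of N(3, 1), up to Monte Carlo error from the correlated samples.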

Figure 19: MCMC applied to a 2D elliptical Gaussian with a proposal distribution consisting of a circular Gaussian centered on the previous sample. Green lines indicate accepted proposals, while red lines indicate rejected ones. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

13 Principal Components Analysis

We now discuss an unsupervised learning algorithm called Principal Components Analysis, or PCA. The method is unsupervised because we are learning a mapping without any examples of what the mapping looks like; all we see are the outputs, and we want to estimate both the mapping and the inputs.

PCA is primarily a tool for dealing with high-dimensional data. If our measurements are 17-dimensional, or 30-dimensional, or 10,000-dimensional, manipulating the data can be extremely difficult. Quite often, the actual data can be described by a much lower-dimensional representation that captures all of the structure of the data. PCA is perhaps the simplest approach for finding such a representation, and yet it is also very fast and effective, resulting in it being very widely used.

There are several ways in which PCA can help:

Visualization: PCA provides a way to visualize the data, by projecting the data down to two or three dimensions that you can plot, in order to get a better sense of the data. Furthermore, the principal component vectors sometimes provide insight as to the nature of the data as well.

Preprocessing: Learning complex models of high-dimensional data is often very slow, and also prone to overfitting; the number of parameters in a model is usually exponential in the number of dimensions, meaning that very large data sets are required for higher-dimensional models. This problem is generally called the curse of dimensionality. PCA can be used to first map the data to a low-dimensional representation before applying a more sophisticated algorithm to it. With PCA one can also whiten the representation, which rebalances the weights of the data to give better performance in some cases.

Modeling: PCA learns a representation that is sometimes used as an entire model, e.g., a prior distribution for new data.

Compression: PCA can be used to compress data, by replacing data with its low-dimensional representation.

13.1 The model and learning

In PCA, we assume we are given N data vectors {y_i}, where each vector is D-dimensional: y_i ∈ R^D.
Our goal is to replace these vectors with lower-dimensional vectors {x_i} with dimensionality C, where C < D. We assume that they are related by a linear transformation:

y = Wx + b = Σ_{j=1}^C w_j x_j + b    (240)

The matrix W can be viewed as containing a set of C basis vectors: W = [w_1, ..., w_C]. If we also assume Gaussian noise in the measurements, this model is the same as the linear regression model studied earlier, but now the x's are unknown in addition to the linear parameters.

To learn the model, we solve the following constrained least-squares problem:

arg min_{W,b,{x_i}} Σ_i ||y_i − (W x_i + b)||²    (241)
subject to W^T W = I    (242)

The constraint W^T W = I requires that we obtain an orthonormal mapping W; it is equivalent to saying that

w_i^T w_j = 1 if i = j, and 0 if i ≠ j.    (243)

This constraint is required to resolve an ambiguity in the mapping: if we did not require W to be orthonormal, then the objective function would be underconstrained (why?). Note that an ambiguity remains in the learning even with this constraint (which one?), but this ambiguity is not very important. The x coordinates are often called latent coordinates.

The algorithm for minimizing this objective function is as follows:

1. Let b = (1/N) Σ_i y_i
2. Let K = (1/N) Σ_i (y_i − b)(y_i − b)^T
3. Let VΛV^T = K be the eigenvector decomposition of K. Λ is a diagonal matrix of eigenvalues (Λ = diag(λ_1, ..., λ_D)). The matrix V contains the eigenvectors, V = [V_1, ..., V_D], and is orthonormal: V^T V = I.
4. Assume that the eigenvalues are sorted from largest to smallest (λ_i ≥ λ_{i+1}). If this is not the case, sort them (and their corresponding eigenvectors).
5. Let W be a matrix of the first C eigenvectors: W = [V_1, ..., V_C].
6. Let x_i = W^T(y_i − b), for all i.

13.2 Reconstruction

Suppose we have learned a PCA model and are given a new value y_new; how do we estimate its corresponding x_new? This can be done by minimizing

||y_new − (W x_new + b)||²    (244)

This is a linear least-squares problem, and can be solved with standard methods (in MATLAB, implemented by the backslash operator). However, W is orthonormal, and thus its transpose is its pseudoinverse, so the solution is given simply by:

x_new = W^T(y_new − b)    (245)
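The six numbered steps, plus reconstruction, can be sketched in a few lines of NumPy. This is an illustrative sketch on hypothetical synthetic data (points near a 2D subspace of R^5 with small noise), not code from the notes; `eigh` returns eigenvalues in ascending order, so step 4's sort is explicit.

```python
import numpy as np

# PCA via eigendecomposition of the sample covariance (steps 1-6 above).
# Hypothetical data: Y holds N D-dimensional points as columns, generated
# near a C-dimensional subspace with small added noise.
rng = np.random.default_rng(2)
N, D, C = 500, 5, 2
W_true = rng.normal(size=(D, C))
Y = W_true @ rng.normal(size=(C, N)) + 1.0   # points near a 2D subspace
Y += 0.01 * rng.normal(size=(D, N))          # small measurement noise

b = Y.mean(axis=1, keepdims=True)            # 1. mean
K = (Y - b) @ (Y - b).T / N                  # 2. sample covariance
lam, V = np.linalg.eigh(K)                   # 3. eigendecomposition
order = np.argsort(lam)[::-1]                # 4. sort largest-first
lam, V = lam[order], V[:, order]
W = V[:, :C]                                 # 5. first C eigenvectors
X = W.T @ (Y - b)                            # 6. latent coordinates

Y_hat = W @ X + b                            # reconstruction (Section 13.2)
```

Because the data lie close to a 2D subspace, the reconstruction error is tiny, and W^T W = I holds by construction.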

13.3 Properties of PCA

Mean-zero coefficients. One can show that the PCA coefficients that represent the training data, i.e., {x_i}_{i=1}^N, are mean zero:

mean(x) ≡ (1/N) Σ_i x_i = (1/N) Σ_i W^T(y_i − b)    (246)
= (1/N) W^T ( Σ_i y_i − Nb )    (247)
= 0    (248)

Variance maximization. PCA can also be defined in the following way; in fact, this is the original definition of PCA, and the one that is often meant when people discuss PCA. However, this formulation is exactly equivalent to the one discussed above. In this view, we wish to find the first principal component w_1 so as to maximize the variance of the first coordinate of the data:

var(x_1) = (1/N) Σ_i x_{1,i}² = (1/N) Σ_i (w_1^T(y_i − b))²    (249)

such that ||w_1||² = 1. Then, we wish to choose the second principal component to be a unit vector orthogonal to the first component, while maximizing the variance of x_2. The remaining principal components are defined in the same recursive way, so that each component w is a unit vector orthogonal to all previous basis vectors.

Uncorrelated coefficients. It is straightforward to show that the covariance matrix of the PCA coefficients is just the upper-left C × C submatrix of Λ (i.e., the diagonal matrix containing the C leading eigenvalues of K):

cov(x) ≡ (1/N) Σ_i (W^T(y_i − b))(W^T(y_i − b))^T    (250)
= (1/N) W^T ( Σ_i (y_i − b)(y_i − b)^T ) W    (251)
= W^T K W    (252)
= W^T VΛV^T W    (253)
= Λ̃    (254)

where Λ̃ is the diagonal matrix containing the C leading eigenvalues in Λ. This simple derivation also shows that the marginal variances of the PCA coefficients are given by the eigenvalues, i.e., var(x_j) = λ_j.

Out-of-Subspace Error. The total variance in the data is given by the sum of the eigenvalues of the sample covariance matrix K. The variance captured by the PCA subspace representation is the sum of the first C eigenvalues; the total amount of variance lost in the representation is given by the sum of the remaining eigenvalues. In fact, one can show that the least-squares error in the approximation to the original data provided by the optimal (ML) model parameters, W*, {x_i*}, and b*, is given by

Σ_i ||y_i − (W* x_i* + b*)||² = N Σ_{j=C+1}^D λ_j.    (255)

When learning a PCA model it is common to use the ratio of the total LS error and the total variance in the training data (i.e., the sum of all eigenvalues). One needs to choose C to be large enough that this ratio is small (often 0.1 or less).

13.4 Whitening

Whitening is a preprocess that replaces the data with a representation that has zero mean and unit covariance, and is often useful as a data preprocessing step. Given measurements {y_i}, we replace them with {z_i} given by

z_i = Λ̃^{−1/2} W^T(y_i − b) = Λ̃^{−1/2} x_i    (256)

where Λ̃ is the diagonal matrix of the first C eigenvalues. Then, the sample mean of the z's is equal to 0:

mean(z) = mean( Λ̃^{−1/2} x_i ) = Λ̃^{−1/2} mean(x_i) = 0    (257)

To derive the sample covariance, we will first compute the covariance of the untruncated values, z̄_i ≡ Λ^{−1/2} V^T(y_i − b):

cov(z̄) ≡ (1/N) Σ_i Λ^{−1/2} V^T (y_i − b)(y_i − b)^T V Λ^{−1/2}    (258)
= Λ^{−1/2} V^T ( (1/N) Σ_i (y_i − b)(y_i − b)^T ) V Λ^{−1/2}    (259)
= Λ^{−1/2} V^T K V Λ^{−1/2}    (260)
= Λ^{−1/2} V^T VΛV^T V Λ^{−1/2}    (261)
= I    (262)

Since z_i is just the first C elements of z̄_i, z also has sample covariance I.
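Equation (256) can be checked numerically. A minimal sketch on hypothetical correlated data, with C = D so that no variance is discarded and the whitened sample covariance comes out exactly the identity (up to floating-point error):

```python
import numpy as np

# Whitening sketch (Eq. 256). Hypothetical data: correlated, non-zero-mean
# points in R^3, whitened with the full set of C = D eigenvectors.
rng = np.random.default_rng(3)
N, D, C = 1000, 3, 3
A = rng.normal(size=(D, D))
Y = A @ rng.normal(size=(D, N)) + 2.0        # correlated, shifted data

b = Y.mean(axis=1, keepdims=True)
K = (Y - b) @ (Y - b).T / N                  # sample covariance
lam, V = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]                # sort eigenpairs largest-first
lam, V = lam[order], V[:, order]
W = V[:, :C]

Z = np.diag(lam[:C] ** -0.5) @ W.T @ (Y - b) # z_i = Lambda^{-1/2} W^T (y_i - b)
cov_Z = Z @ Z.T / N                          # should equal the identity
```

With C < D the same code applies; the sample covariance of the truncated z's is still the C × C identity, as the derivation above shows.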

13.5 Modeling

PCA is sometimes used to model data likelihood, e.g., we can use it as a form of prior. For example, suppose we have noisy measurements of some y values and wish to estimate their true values. If we parameterize the unknown y values by their corresponding x values instead, then we constrain the estimated values to lie in the low-dimensional subspace of the original data. However, this approach implies a uniform prior over x values, which may be inadequate, while being intolerant to deviations from the subspace. A better approach, with an inherently probabilistic model, is described below.

13.6 Probabilistic PCA

Probabilistic PCA is a way to estimate a probability distribution p(y); in fact, it is a form of Gaussian distribution. In particular, we assume the following probability distribution:

x ∼ N(0, I)    (263)
y = Wx + b + n,  n ∼ N(0, σ²I)    (264)

where x and n are assumed to be statistically independent. The model says that the low-dimensional coordinates x (i.e., the underlying causes) come from a unit Gaussian distribution, and the y measurements are a linear function of these low-dimensional causes, plus Gaussian noise. Note that we do not require that W be orthonormal anymore (in part because we now constrain the magnitude of the x variables).

Since any linear transformation of a Gaussian variable is itself Gaussian, y must also be Gaussian. This distribution is:

p(y) = ∫ p(x, y) dx = ∫ p(y|x) p(x) dx = ∫ G(y; Wx + b, σ²I) G(x; 0, I) dx    (265)

Evaluating this integral would give us p(y); however, there is a simpler way to solve for the Gaussian distribution. Since we know that y is Gaussian, all we need to do is derive its mean and covariance, which can be done as follows (using the fact that mathematical expectation is linear):

mean(y) = E[y] = E[Wx + b + n]    (266)
= W E[x] + b + E[n]    (267)
= b    (268)

cov(y) = E[(y − b)(y − b)^T]    (269)
= E[(Wx + b + n − b)(Wx + b + n − b)^T]    (270)
= E[(Wx + n)(Wx + n)^T]    (271)
= E[W x x^T W^T] + E[W x n^T] + E[n x^T W^T] + E[n n^T]    (272)
= W E[x x^T] W^T + W E[x] E[n^T] + E[n] E[x^T] W^T + σ²I    (273)
= W W^T + σ²I    (274)
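The mean and covariance derived in Eqs. (266-274) can be checked by simulation. A minimal sketch with hypothetical parameters W, b, and σ²: samples generated exactly as y = Wx + b + n should have empirical mean ≈ b and empirical covariance ≈ WW^T + σ²I.

```python
import numpy as np

# Monte Carlo check of Eqs. (266-274). Parameters below are hypothetical,
# chosen only for the illustration.
rng = np.random.default_rng(8)
D, C, sigma2, N = 3, 2, 0.05, 200_000
W = rng.normal(size=(D, C))
b = np.array([1.0, -2.0, 0.5])

x = rng.normal(size=(C, N))                      # x ~ N(0, I)
n = np.sqrt(sigma2) * rng.normal(size=(D, N))    # n ~ N(0, sigma^2 I)
Y = W @ x + b[:, None] + n                       # y = Wx + b + n (Eq. 264)

emp_mean = Y.mean(axis=1)
Yc = Y - emp_mean[:, None]
emp_cov = Yc @ Yc.T / N
model_cov = W @ W.T + sigma2 * np.eye(D)         # predicted covariance (Eq. 274)
```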

Figure 20: Visualization of the PPCA mapping for a 1D-to-2D model. A Gaussian in 1D is mapped to a line, and then blurred with 2D noise. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Hence

y ∼ N(b, W W^T + σ²I)    (275)

In other words, learning a PPCA model is equivalent to learning a particular form of Gaussian distribution. This is illustrated in Figure 20. The PPCA model is not as general as learning a full Gaussian model with a D × D covariance matrix; however, it uses fewer numbers to represent the Gaussian (CD + 1 versus D²/2 + D/2; why?). Because the representation is more compact, it can be estimated from smaller datasets, and requires less memory to store the model. These differences will be significant when D is large; e.g., if D = 100, the full covariance matrix would require 5050 parameters and thus require hundreds of thousands of data points to estimate reliably. However, if the effective dimensionality is, say, 2 or 3, then the PPCA representation will only have a few hundred parameters and require many fewer measurements.

Learning. The PPCA model can be learned by maximum likelihood, i.e., by minimizing:

L(W, b, σ²) = −ln Π_{i=1}^N G(y_i; b, W W^T + σ²I)    (276)
= (1/2) Σ_i (y_i − b)^T (W W^T + σ²I)^{−1} (y_i − b) + (N/2) ln (2π)^D |W W^T + σ²I|    (277)

This can be optimized in closed form. The solution is very similar to the conventional PCA case:

1. Let b = (1/N) Σ_i y_i
2. Let K = (1/N) Σ_i (y_i − b)(y_i − b)^T

3. Let VΛV^T = K be the eigenvector decomposition of K. Λ is a diagonal matrix of eigenvalues (Λ = diag(λ_1, ..., λ_D)). The matrix V contains the eigenvectors, V = [V_1, ..., V_D], and is orthonormal: V^T V = I.
4. Assume that the eigenvalues are sorted from largest to smallest (λ_i ≥ λ_{i+1}). If this is not the case, sort them (and their corresponding eigenvectors).
5. Let σ*² = (1/(D − C)) Σ_{j=C+1}^D λ_j. In words, the estimated noise variance is equal to the average marginal data variance over all directions that are orthogonal to the C principal directions (i.e., this is the average variance, per dimension, of the data that is lost in the approximation of the data in the C-dimensional subspace).
6. Let Ṽ be the matrix comprising the first C eigenvectors, Ṽ = [V_1, ..., V_C], and let Λ̃ be the diagonal matrix with the C leading eigenvalues, Λ̃ = diag(λ_1, ..., λ_C).
7. Let W* = Ṽ(Λ̃ − σ*²I)^{1/2}.
8. Let x_i = W*^T(y_i − b), for all i.

Note that this solution is similar to that in the conventional PCA case with whitening, except that (a) the noise variance is estimated, and (b) the noise is removed from the variances of the remaining eigenvalues.

An alternative optimization. In the above learning algorithm, we marginalized out x when estimating PPCA. In other words, we maximized

p(y_{1:N} | W, b, σ²) = ∫ p(y_{1:N}, x_{1:N} | W, b, σ²) dx_{1:N}    (278)
= ∫ p(y_{1:N} | x_{1:N}, W, b, σ²) p(x_{1:N}) dx_{1:N}    (279)
= Π_i ∫ p(y_i | x_i, W, b, σ²) p(x_i) dx_i    (280)

instead of maximizing

p(y_{1:N}, x_{1:N} | W, b, σ²) = Π_i p(y_i, x_i | W, b, σ²)    (281)
= Π_i p(y_i | x_i, W, b, σ²) p(x_i)    (282)

By integrating out x, we are estimating fewer parameters and thus can get better estimates. Loosely speaking, doing so might be viewed as being more Bayesian. Suppose we did instead try to

estimate the x's together with the model parameters:

L(x_{1:N}, W, b, σ²) = −ln p(y_{1:N}, x_{1:N} | W, b, σ²)    (283)
= Σ_i ( (1/(2σ²)) ||y_i − (W x_i + b)||² + (1/2) ||x_i||² ) + (ND/2) ln σ² + ND ln 2π    (284)

Now, suppose we are optimizing this objective function, and we have some estimates for W and x. We can always reduce the objective function by replacing

W ← 2W    (285)
x ← x/2    (286)

By doing this replacement arbitrarily many times, we can get infinitesimal values for x. This indicates that the objective function is degenerate; using it will yield very poor results. Note, however, that this arises using the same model as before, but without marginalizing out x. This illustrates a general principle: the more parameters you estimate (instead of marginalizing out), the greater the danger of biased and/or degenerate solutions.
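The degeneracy is easy to see numerically: the rescaling of Eqs. (285-286) leaves the data term unchanged while shrinking the prior term by a factor of four each time. A minimal sketch on hypothetical data (σ² held fixed, constants dropped since they do not depend on W or x):

```python
import numpy as np

# Numerical illustration of the degeneracy (Eqs. 285-286): the joint
# negative log-likelihood keeps decreasing under W <- 2W, x <- x/2.
# The data and sigma^2 below are hypothetical.
rng = np.random.default_rng(7)
N, D, C = 100, 3, 1
W = rng.normal(size=(D, C))
X = rng.normal(size=(C, N))
b = np.zeros((D, 1))
sigma2 = 0.1
Y = W @ X + b + np.sqrt(sigma2) * rng.normal(size=(D, N))

def joint_nll(W, X):
    # Data term + prior term on x (dropping terms constant in W and x).
    return (((Y - (W @ X + b)) ** 2).sum() / (2 * sigma2)
            + 0.5 * (X ** 2).sum())

vals = []
for _ in range(5):
    vals.append(joint_nll(W, X))
    W, X = 2 * W, X / 2        # the rescaling of Eqs. 285-286
```

Each rescaling strictly lowers the objective, so a joint optimizer would drive x toward zero and W toward infinity.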

14 Lagrange Multipliers

The Method of Lagrange Multipliers is a powerful technique for constrained optimization. While it has applications far beyond machine learning (it was originally developed to solve physics equations), it is used for several key derivations in machine learning.

The problem set-up is as follows: we wish to find extrema (i.e., maxima or minima) of a differentiable objective function

E(x) = E(x_1, x_2, ..., x_D)    (287)

If we have no constraints on the problem, then the extrema are solutions to the following system of equations:

∇E = 0    (288)

which is equivalent to writing dE/dx_i = 0 for all i. This equation says that there is no way to infinitesimally perturb x to get a different value for E; the objective function is locally flat.

Now, however, our goal will be to find extrema subject to a single constraint:

g(x) = 0    (289)

In other words, we want to find the extrema among the set of points x that satisfy g(x) = 0. It is sometimes possible to reparameterize the problem in order to eliminate the constraints (i.e., so that the new parameterization includes all possible solutions to g(x) = 0); however, this can be awkward in some cases, and impossible in others.

Given the constraint g(x) = 0, we are no longer looking for a point where no perturbation in any direction changes E. Instead, we need to find a point at which perturbations that satisfy the constraint do not change E. This can be expressed by the following condition:

∇E + λ∇g = 0    (290)

for some arbitrary scalar value λ. The expression ∇E = −λ∇g says that any perturbation to x that changes E also makes the constraint become violated. Hence, perturbations that do not change g do not change E either. Hence, our goal is to find a point x that satisfies this condition and also g(x) = 0.

In the Method of Lagrange Multipliers, we define a new objective function, called the Lagrangian:

L(x, λ) = E(x) + λ g(x)    (291)

Now we will instead find the extrema of L with respect to both x and λ. The key fact is that extrema of the unconstrained objective L are the extrema of the original constrained problem.
So we have eliminated the nasty constraints by changing the objective function and also introducing new unknowns.

Figure 21: The set of solutions to g(x) = 0, visualized as a curve. The gradient ∇g is always normal to the curve. At an extremal point, ∇E points parallel to ∇g. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

To see why, let's look at the extrema of L. The extrema of L occur when

dL/dλ = g(x) = 0    (292)
dL/dx = ∇E + λ∇g = 0    (293)

which are exactly the conditions given above. Using the Lagrangian is just a convenient way of combining these two constraints into one unconstrained optimization.

14.1 Examples

Minimizing on a circle. We begin with a simple geometric example. We have the following constrained optimization problem:

arg min_{x,y} x + y    (294)
subject to x² + y² = 1    (295)

In other words, we want to find the point on a circle that minimizes x + y; the problem is visualized in Figure 22. Here, E(x, y) = x + y and g(x, y) = x² + y² − 1. The Lagrangian for this problem is:

L(x, y, λ) = x + y + λ(x² + y² − 1)    (296)

Figure 22: Illustration of the maximization-on-a-circle problem. (Image from Wikipedia.)

Setting the gradient to zero gives this system of equations:

dL/dx = 1 + 2λx = 0    (297)
dL/dy = 1 + 2λy = 0    (298)
dL/dλ = x² + y² − 1 = 0    (299)

From the first two lines, we can see that x = y. Substituting this into the constraint and solving gives two solutions, x = y = ±1/√2. Substituting these two solutions into the objective, we see that the minimum is at x = y = −1/√2.

Estimating a multinomial distribution. In a multinomial distribution, we have an event e with K possible discrete, disjoint outcomes, where

P(e = k) = p_k    (300)

For example, coin-flipping is a binomial distribution, where K = 2 and e = 1 might indicate that the coin lands heads. Suppose we observe N events; the likelihood of the data is:

Π_i P(e_i | p) = Π_{k=1}^K p_k^{N_k}    (301)

where N_k is the number of times that e = k, i.e., the number of occurrences of the k-th event. To estimate this distribution, we can minimize the negative log-likelihood:

arg min − Σ_k N_k ln p_k    (302)
subject to Σ_k p_k = 1, p_k ≥ 0, for all k    (303)

The constraints are required in order to ensure that the p's form a valid probability distribution. One way to optimize this problem is to reparameterize: set p_K = 1 − Σ_{k=1}^{K−1} p_k, substitute in, and then optimize the unconstrained problem in closed form. While this method does work in this case, it breaks the natural symmetry of the problem, resulting in some messy calculations. Moreover, this method often cannot be generalized to other problems.

The Lagrangian for this problem is:

L(p, λ) = −Σ_k N_k ln p_k + λ( Σ_k p_k − 1 )    (304)

Here we omit the constraint that p_k ≥ 0, and hope that this constraint will be satisfied by the solution (it will). Setting the gradient to zero gives:

dL/dp_k = −N_k/p_k + λ = 0  for all k    (305)
dL/dλ = Σ_k p_k − 1 = 0    (306)

Multiplying dL/dp_k = 0 by p_k and summing over k gives:

0 = −Σ_{k=1}^K N_k + λ Σ_k p_k = −N + λ    (307)

since Σ_k N_k = N and Σ_k p_k = 1. Hence, the optimal λ* = N. Substituting this into dL/dp_k = 0 and solving gives:

p_k = N_k / N    (308)

which is the familiar maximum-likelihood estimator for a multinomial distribution.

Maximum variance PCA. In the original formulation of PCA, the goal is to find a low-dimensional projection of N data points y_i,

x_i = w^T(y_i − b)    (309)

such that the variance of the x_i's is maximized, subject to the constraint that w^T w = 1. The Lagrangian is:

L(w, b, λ) = (1/N) Σ_i ( x_i − (1/N) Σ_j x_j )² + λ(w^T w − 1)    (310)
= (1/N) Σ_i ( w^T(y_i − b) − (1/N) Σ_j w^T(y_j − b) )² + λ(w^T w − 1)    (311)
= (1/N) Σ_i ( w^T( (y_i − b) − (1/N) Σ_j (y_j − b) ) )² + λ(w^T w − 1)    (312)
= (1/N) Σ_i ( w^T(y_i − ȳ) )² + λ(w^T w − 1)    (313)
= (1/N) Σ_i w^T(y_i − ȳ)(y_i − ȳ)^T w + λ(w^T w − 1)    (314)
= w^T ( (1/N) Σ_i (y_i − ȳ)(y_i − ȳ)^T ) w + λ(w^T w − 1)    (315)

where ȳ = Σ_i y_i / N. Solving dL/dw = 0 gives:

( (1/N) Σ_i (y_i − ȳ)(y_i − ȳ)^T ) w = λw    (316)

This is just the eigenvector equation: in other words, w must be an eigenvector of the sample covariance of the y_i's, and λ must be the corresponding eigenvalue. In order to determine which one, we can substitute this equality into the Lagrangian to get:

L = w^T λ w + λ(w^T w − 1)    (317)
= λ    (318)

since w^T w = 1. Since our goal is to maximize the variance, we choose the eigenvector w which has the largest eigenvalue λ. We have not yet selected b, but it is clear that the value of the objective function does not depend on b, so we might as well set it to be the mean of the data, b = Σ_i y_i / N, which results in the x_i's having zero mean: Σ_i x_i / N = 0.

14.2 Least-Squares PCA in one dimension

We now derive PCA for the case of a one-dimensional projection, in terms of minimizing squared error. Specifically, we are given a collection of data vectors y_{1:N}, and wish to find a bias b, a single

unit vector w, and one-dimensional coordinates x_{1:N}, to minimize:

arg min_{w, x_{1:N}, b} Σ_i ||y_i − (w x_i + b)||²    (319)
subject to w^T w = 1    (320)

The vector w is called the first principal component. The Lagrangian is:

L(w, x_{1:N}, b, λ) = Σ_i ||y_i − (w x_i + b)||² + λ(||w||² − 1)    (321)

There are several sets of unknowns, and we derive their optimal values each in turn.

Projections (x_i). We first derive the projections:

dL/dx_i = −2 w^T(y_i − (w x_i + b)) = 0    (322)

Using w^T w = 1 and solving for x_i gives:

x_i = w^T(y_i − b)    (323)

Bias (b). We begin by differentiating:

dL/db = −2 Σ_i (y_i − (w x_i + b))    (324)

Substituting in Equation 323 gives:

dL/db = −2 Σ_i (y_i − (w w^T(y_i − b) + b))    (325)
= −2 Σ_i y_i + 2 w w^T Σ_i y_i − 2N w w^T b + 2N b    (326)
= −2(I − w w^T) Σ_i y_i + 2(I − w w^T) N b = 0    (327)

Dividing both sides by 2(I − w w^T)N and rearranging terms gives:

b = (1/N) Σ_i y_i    (328)

Basis vector (w). To make things simpler, we will define ỹ_i = (y_i − b) as the mean-subtracted data points; the reconstructions are then x_i = w^T ỹ_i, and the objective function is:

L = Σ_i ||ỹ_i − w x_i||² + λ(w^T w − 1)    (329)
= Σ_i ||ỹ_i − w w^T ỹ_i||² + λ(w^T w − 1)    (330)
= Σ_i (ỹ_i − w w^T ỹ_i)^T (ỹ_i − w w^T ỹ_i) + λ(w^T w − 1)    (331)
= Σ_i ( ỹ_i^T ỹ_i − 2 ỹ_i^T w w^T ỹ_i + ỹ_i^T w w^T w w^T ỹ_i ) + λ(w^T w − 1)    (332)
= Σ_i ỹ_i^T ỹ_i − Σ_i (ỹ_i^T w)² + λ(w^T w − 1)    (333)

where we have used w^T w = 1. We then differentiate and simplify:

dL/dw = −2 Σ_i ỹ_i ỹ_i^T w + 2λw = 0    (334)

We can rearrange this to get:

( Σ_i ỹ_i ỹ_i^T ) w = λw    (335)

This is exactly the eigenvector equation, meaning that extrema of L occur when w is an eigenvector of the matrix Σ_i ỹ_i ỹ_i^T, and λ is the corresponding eigenvalue. Multiplying both sides by 1/N, we see that this matrix has the same eigenvectors as the data covariance:

( (1/N) Σ_i (y_i − b)(y_i − b)^T ) w = (λ/N) w    (336)

Now we must determine which eigenvector to use. We rewrite Equation 333 as:

L = Σ_i ỹ_i^T ỹ_i − Σ_i w^T ỹ_i ỹ_i^T w + λ(w^T w − 1)    (337)
= Σ_i ỹ_i^T ỹ_i − w^T ( Σ_i ỹ_i ỹ_i^T ) w + λ(w^T w − 1)    (338)

and substitute in Equation 335:

L = Σ_i ỹ_i^T ỹ_i − λ w^T w + λ(w^T w − 1)    (340)
= Σ_i ỹ_i^T ỹ_i − λ    (341)

again using w^T w = 1. We must pick the eigenvalue λ that gives the smallest value of L. Hence, we pick the largest eigenvalue, and set w to be the corresponding eigenvector.

14.3 Multiple constraints

When we wish to optimize with respect to multiple constraints {g_k(x)}, i.e.,

arg min_x E(x)    (342)
subject to g_k(x) = 0 for k = 1...K    (343)

extrema occur when:

∇E + Σ_k λ_k ∇g_k = 0    (344)

where we have introduced K Lagrange multipliers λ_k. The constraints can be combined into a single Lagrangian:

L(x, λ_{1:K}) = E(x) + Σ_k λ_k g_k(x)    (345)

14.4 Inequality constraints

The method can be extended to inequality constraints of the form g(x) ≥ 0. For a solution to be valid and maximal, there are two possible cases:

The optimal solution is inside the constraint region and, hence, ∇E = 0 and g(x) > 0. In this region, the constraint is inactive, meaning that λ can be set to zero.

The optimal solution lies on the boundary g(x) = 0. In this case, the gradient ∇E must point in the opposite direction of the gradient of g; otherwise, following the gradient of E would cause g to become positive while also increasing E. Hence, we must have ∇E = −λ∇g for λ ≥ 0.

Note that, in both cases, λg(x) = 0. Hence, we can enforce that one of these cases is found with the following optimization problem:

max_{x,λ} E(x) + λg(x)    (346)
such that g(x) ≥ 0    (347)
λ ≥ 0    (348)
λg(x) = 0    (349)

These are called the Karush-Kuhn-Tucker (KKT) conditions, which generalize the Method of Lagrange Multipliers. When minimizing, we want ∇E to point in the same direction as ∇g when on the boundary, and so we minimize E − λg instead of E + λg.

Figure 23: Illustration of the condition for inequality constraints: the solution may lie on the boundary of the constraint region, or in the interior. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

15 Clustering

Clustering is an unsupervised learning problem in which our goal is to discover clusters in the data. A cluster is a collection of data that are similar in some way.

Clustering is often used for several different problems. For example, a market researcher might want to identify distinct groups of the population with similar preferences and desires. When working with documents you might want to find clusters of documents based on the occurrence frequency of certain words. For example, this might allow one to discover financial documents, legal documents, or email from friends. Working with image collections you might find clusters of images which are images of people versus images of buildings. Often when we are given large amounts of complicated data we want to look for some underlying structure in the data, which might reflect certain natural kinds within the training data. Clustering can also be used to compress data, by replacing all of the elements in a cluster with a single representative element.

15.1 K-means Clustering

We begin with a simple method called K-means. Given N input data vectors {y_i}_{i=1}^N, we wish to label each vector as belonging to one of K clusters. This labeling will be done via a binary matrix L, the elements of which are given by

L_{i,j} = 1 if data point i belongs to cluster j, and 0 otherwise.    (350)

The clustering is mutually exclusive: each data vector can be assigned to only one cluster, Σ_{j=1}^K L_{i,j} = 1. Along the way, we will also be estimating a center c_j for each cluster.

The full objective function for K-means clustering is:

E(c, L) = Σ_{i,j} L_{i,j} ||y_i − c_j||²    (351)

This objective function penalizes the distance between each data point and the center of the cluster to which it is assigned. Hence, to minimize this error, we want to bring the cluster centers close to the data they have been assigned, and we also want to assign the data to nearby centers. This objective function cannot be optimized in closed form, and so an iterative method is required. It includes discrete variables (the labels L), and so gradient-based methods aren't directly applicable.
Instead, we use a strategy called coordinate descent, in which we alternate between closed-form optimization of one set of variables while holding the other variables fixed. That is, we first pick initial values; then we alternate between updating the labels for the current centers, and updating the centers for the current labels.

Here is the K-means algorithm:

pick initial values for L and c_{1:K}
loop
    // Labeling update: set L ← arg min_L E(c, L)
    for each data point i do
        j ← arg min_j ||y_i − c_j||²
        L_{i,j} = 1
        L_{i,a} = 0 for all a ≠ j
    end for
    // Center update: set c ← arg min_c E(c, L)
    for each center j do
        c_j ← Σ_i L_{i,j} y_i / Σ_i L_{i,j}
    end for
end loop

Each step of the optimization is guaranteed to lower the objective function until the algorithm converges (you should be able to show that each step is optimal). However, there is no guarantee that the algorithm will find the global optimum, and indeed it may easily get trapped in a poor local minimum.

Initialization. The algorithm is sensitive to initialization, and poor initialization can sometimes lead to very poor results. Here are a few strategies that can be used to initialize the algorithm:

1. Random labeling. Initialize the labeling L randomly, and then run the center-update step to determine the initial centers. This approach is not recommended because the initial centers will likely end up just being very close to the mean of the data.
2. Random initial centers. We could try to place initial center locations randomly, e.g., by random sampling in the bounding box of the data. However, it is very likely that some of the centers will fall into empty regions of the feature space and will therefore be assigned no data. Getting a good initialization this way can be difficult.
3. Random data points as centers. This method works much better: use a random subset of the data as the initial center locations.
4. K-medoids clustering. This will be described below.
5. Multiple restarts. In multiple restarts, we run K-means multiple times, each time with a different random initialization (using one of the above methods). We then take the best clustering out of all of the runs, based on the value of the objective function in Equation (351).
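The alternating updates, together with initialization strategies 3 and 5 above, can be sketched as follows. The three tight 2D clusters are hypothetical data made up for the example; labels are stored as an integer vector rather than the binary matrix L, which encodes the same assignment.

```python
import numpy as np

# Coordinate-descent K-means sketch, using random data points as initial
# centers (strategy 3) with multiple restarts (strategy 5).
rng = np.random.default_rng(5)
Y = np.vstack([rng.normal(m, 0.1, size=(50, 2))      # hypothetical clusters
               for m in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])

def kmeans(Y, K, rng, n_iters=20):
    centers = Y[rng.choice(len(Y), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Labeling update: assign each point to its nearest center.
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Center update: each center becomes the mean of its assigned points.
        for j in range(K):
            if np.any(labels == j):      # guard against empty clusters
                centers[j] = Y[labels == j].mean(axis=0)
    objective = ((Y - centers[labels]) ** 2).sum()   # Eq. (351)
    return centers, labels, objective

# Multiple restarts: keep the run with the lowest objective value.
centers, labels, objective = min((kmeans(Y, 3, rng) for _ in range(30)),
                                 key=lambda r: r[2])
```

A single restart can land in the poor local minima the text warns about (e.g., two initial centers drawn from the same cluster); taking the best of many restarts makes that very unlikely.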

Figure 24: K-means applied to a dataset sampled from three Gaussian distributions. The data assigned to each cluster are drawn as circles with distinct colours. The cluster centers are shown as red stars.

Another key question is how one chooses the number of clusters, i.e., K. A common approach is to fix K based on some prior knowledge or computational constraints. One can also try different values of K, adding another term to the objective function to penalize model complexity.

15.2 K-medoids Clustering

(The material in this section is not required for this course.)

K-medoids clustering is a variant of K-means with the additional constraint that the cluster centers must be drawn from the data. The following algorithm, called Farthest First Traversal, or Hochbaum-Shmoys, is simple and effective:

Randomly select a data point y_i as the first cluster center: c_1 ← y_i
for j = 2 to K do
    Find the data point furthest from all existing centers: i ← arg max_i min_{k<j} ||y_i − c_k||²
    c_j ← y_i
end for
Label all remaining data points according to their nearest centers (as in K-means)

This algorithm provides a quality guarantee: it gives a clustering that is no worse than twice the error of the optimal clustering. K-medoids clustering can also be improved by coordinate descent. The labeling step is the same as in K-means; however, the center updates must be done by brute-force search over each candidate cluster-center update.
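The Farthest First Traversal loop above is short enough to sketch directly. The 1D data with three well-separated groups are hypothetical; on such data the traversal picks one center per group regardless of which point is chosen first.

```python
import numpy as np

# Farthest First Traversal (Hochbaum-Shmoys) sketch: centers are chosen
# from the data itself. Hypothetical 1D data with three obvious groups.
rng = np.random.default_rng(6)
Y = np.concatenate([rng.normal(0.0, 0.1, 30),
                    rng.normal(10.0, 0.1, 30),
                    rng.normal(20.0, 0.1, 30)])
K = 3

centers = [Y[rng.integers(len(Y))]]         # random first center
for _ in range(1, K):
    # For each point, its squared distance to the nearest existing center.
    d2 = np.min([(Y - c) ** 2 for c in centers], axis=0)
    centers.append(Y[np.argmax(d2)])        # furthest point becomes a center
centers = np.sort(np.array(centers))
```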

15.3 Mixtures of Gaussians

The Mixtures-of-Gaussians (MoG) model is a generalization of K-means clustering. Whereas K-means clustering works for clusters that are more or less spherical, the MoG model can handle oblong clusters and overlapping clusters. The K-means algorithm does an excellent job when clusters are well separated, but not when the clusters overlap. MoG algorithms compute a soft, probabilistic clustering, which allows the algorithm to better handle overlapping clusters. Finally, the MoG model is probabilistic, and so it can be used to learn probability distributions from data.

The MoG model consists of K Gaussian distributions, each with its own mean and covariance, {(µ_j, K_j)}. Each Gaussian also has an associated (prior) probability a_j, such that Σ_j a_j = 1. That is, the probabilities a_j represent the fraction of the data that are assigned to (or generated by) the different Gaussian components. As a shorthand, we will write all the model parameters with a single variable, i.e., θ = {a_{1:K}, µ_{1:K}, K_{1:K}}. When used for clustering, the idea is that each Gaussian component in the mixture should correspond to a single cluster.

The complete probabilistic model comprises the prior probability of each Gaussian component, and a Gaussian likelihood over the data (or feature) space for each component:

P(L = j | θ) = a_j    (352)
p(y | θ, L = j) = G(y; µ_j, K_j)    (353)

To sample a single data point from this (generative) model, we first randomly select a Gaussian component according to the prior probabilities {a_j}, and then we randomly sample from the corresponding Gaussian component. The likelihood of a single data point can be derived by the product rule and the sum rule as follows:

p(y | θ) = Σ_{j=1}^K p(y, L = j | θ)    (354)
= Σ_{j=1}^K p(y | L = j, θ) P(L = j | θ)    (355)
= Σ_{j=1}^K a_j (1/√((2π)^D |K_j|)) exp( −(1/2)(y − µ_j)^T K_j^{−1} (y − µ_j) )    (356)

where D is the dimension of the data vectors. This model can be interpreted as a linear combination (or blend) of Gaussians: we get a multimodal distribution by adding together unimodal Gaussians.
Interestingly, the MoG model is similar to the Gaussian Class-Conditional model that we used for classification; the difference is that the class labels are no longer included in the training set. More generally, the approach of building models by mixtures can be used for many other types of distributions as well; for example, we could build a mixture of Student-t distributions, or a mixture of a Gaussian and a uniform distribution, and so on.

Figure 25: Mixture of Gaussians model applied to a dataset generated from three Gaussians. The resulting γ is visualized on the right. The data points are shown as colored circles, where the color is determined by the cluster with the highest posterior assignment probability γ_{i,j}. One-standard-deviation ellipses are shown for each Gaussian. Note that the blue points are well isolated and there is little ambiguity in their assignments. The other two distributions overlap, and one can see how the orientation and eccentricity of the covariance structure (the ellipses) influence the assignment probabilities.

Learning

Given a data set y_{1:N}, where each data point is assumed to be drawn independently from the model, we learn the model parameters θ by minimizing the negative log-likelihood of the data:

    L(θ) = − ln p(y_{1:N} | θ)          (357)
         = − Σ_i ln p(y_i | θ)          (358)

Note that this is a constrained optimization, since we require a_j ≥ 0 and Σ_j a_j = 1. Furthermore, each K_j must be a symmetric, positive-definite matrix in order to be a covariance matrix. Unfortunately, this optimization cannot be performed in closed form.

One approach is to optimize by gradient descent. There are a few issues associated with doing so. First, some care is required to avoid numerical issues, as discussed below. Second, this learning is a constrained optimization, due to the constraints on the values of the a's. One solution is to project onto the constraints during optimization: at each gradient descent step (and inside the line search loop), we clamp all negative a values to zero and renormalize the a's so that they sum to one. Another option is to reparameterize the problem to be unconstrained. Specifically, we define new variables β_j, and define the a's as functions of the β's, e.g.,

    a_j(β) = e^{β_j} / Σ_{k=1}^K e^{β_k}          (359)

This definition ensures that, for any choice of the β's, the a's will satisfy the constraints. We substitute this expression into the model definition and then optimize for the β's instead of the a's with gradient descent. Similarly, we can enforce the constraints on the covariance matrices by reparameterization; this is normally done using an upper-triangular matrix U such that K = U^T U.

An alternative to gradient descent is the Expectation-Maximization algorithm, or EM. EM is a quite general algorithm for hidden variable problems; in this case, the labels L are hidden (or "unobserved"). In EM, we define a probabilistic labeling variable γ_{i,j}, corresponding to the probability that data point i came from cluster j: γ_{i,j} is meant to estimate P(L_i = j | y_i). In EM, we optimize both θ and γ together. The algorithm alternates between the E-step, which updates the γ's, and the M-step, which updates the model parameters θ:

    pick initial values for γ and θ
    loop
        E-step: for each data point i do
            γ_{i,j} ← P(L_i = j | y_i, θ)
        end for
        M-step: for each cluster j do
            a_j ← Σ_i γ_{i,j} / N
            µ_j ← Σ_i γ_{i,j} y_i / Σ_i γ_{i,j}
            K_j ← Σ_i γ_{i,j} (y_i − µ_j)(y_i − µ_j)^T / Σ_i γ_{i,j}
        end for
    end loop

Note that the E-step is the same as classification in the Gaussian Class-Conditional model. The EM algorithm is a local optimization algorithm, and so the results will depend on initialization. Initialization strategies similar to those used for K-means above can be used.

Numerical issues

Exponentiating very negative numbers can often lead to underflow when implemented in floating-point arithmetic; e.g., e^{−A} will give zero for large A, and ln e^{−A} will then give an error (or −Inf), whereas it should return −A. These issues will often cause machine learning algorithms to fail; MoG has several steps which are susceptible. Fortunately, there are some simple tricks that can be used.

1. Many computations can be performed directly in the log domain. For example, it may be

more stable to compute a e^b as

    a e^b                  (360)
    = e^{ln a + b}         (361)

This avoids issues where b is so negative that e^b evaluates to zero in floating point, even though a e^b is much greater than zero.

2. When computing an expression of the form

    e^{β_j} / Σ_j e^{β_j}          (362)

very negative values of β could lead to both the numerator and the denominator underflowing to zero for all j, even though the expression must sum to one over j. This may arise, for example, when computing the γ updates, which have the above form. The solution is to make use of the identity:

    e^{β_j} / Σ_j e^{β_j} = e^{β_j + C} / Σ_j e^{β_j + C}          (363)

for any value of C. We can choose C to prevent underflow; a suitable choice is C = −max_j β_j, which shifts the largest exponent to zero.

3. Underflow can also occur when evaluating

    ln Σ_j e^{β_j}          (364)

which can be fixed by using the identity

    ln Σ_j e^{β_j} = ln ( Σ_j e^{β_j + C} ) − C          (365)

The Free Energy

Amazingly, EM optimizes the log-likelihood, which doesn't even have a γ parameter. In order to understand the EM algorithm and why it works, it is helpful to introduce a quantity called the Free Energy:

    F(θ, γ) = − Σ_{i,j} γ_{i,j} ln p(y_i, L_i = j | θ) + Σ_{i,j} γ_{i,j} ln γ_{i,j}          (366)
            = (1/2) Σ_{i,j} γ_{i,j} (y_i − µ_j)^T K_j^{−1} (y_i − µ_j)                       (367)
              + (1/2) Σ_{i,j} γ_{i,j} ln (2π)^D |K_j| − Σ_{i,j} γ_{i,j} ln a_j              (368)
              + Σ_{i,j} γ_{i,j} ln γ_{i,j}                                                   (369)
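The second and third tricks above are usually packaged together as a "log-sum-exp" routine. A minimal sketch (the function names are ours; the shift C = −max_j β_j makes the largest exponent zero):

```python
import math

def log_sum_exp(betas):
    """ln(sum_j exp(beta_j)), computed stably via the identity (365)."""
    C = -max(betas)
    return math.log(sum(math.exp(b + C) for b in betas)) - C

def softmax(betas):
    """exp(beta_j) / sum_k exp(beta_k), computed stably via the identity (363)."""
    C = -max(betas)
    e = [math.exp(b + C) for b in betas]
    s = sum(e)
    return [ei / s for ei in e]
```

For example, with β = (−800, −801), the naive e^{β_j} terms all underflow to zero, but the shifted versions return the correct normalized probabilities and log-sum.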

The EM algorithm is a coordinate descent algorithm for optimizing the free energy, subject to the constraint that Σ_j γ_{i,j} = 1 and the constraints on a. In other words, EM can be written compactly as:

    pick initial values for γ and θ
    loop
        E-step: γ ← argmin_γ F(θ, γ)
        M-step: θ ← argmin_θ F(θ, γ)
    end loop

However, the free energy is different from the negative log-likelihood L(θ) that we initially set out to minimize. Fortunately, the free energy has the following important properties:

- When the value of γ is optimal, the free energy is equal to the negative log-likelihood:

    L(θ) = min_γ F(θ, γ)          (370)

  We can use this fact to evaluate the negative log-likelihood simply by running an E-step and then computing the free energy. In fact, this is often more efficient than directly computing the negative log-likelihood. The proof is given in the next section.

- The minima of the free energy are also minima of the negative log-likelihood:

    min_θ L(θ) = min_{θ,γ} F(θ, γ)          (371)

  This follows from the previous property. Hence, optimizing the free energy is the same as optimizing the negative log-likelihood.

- The free energy is an upper bound on the negative log-likelihood:

    F(θ, γ) ≥ L(θ)          (372)

  for all values of γ. This observation gives a sanity check for debugging the free energy computation.

The free energy also provides a very helpful tool for debugging: any step of an implementation that increases the free energy must be incorrect. The term "Free Energy" arises from its original definition in statistical physics.

Proofs

The content of this section is not required material for this course and you may skip it. Here we outline proofs for the key features of the free energy.

EM updates. The steps of the EM algorithm may be derived by solving argmin_γ F(θ, γ) and argmin_θ F(θ, γ). In most cases, the derivations generalize familiar ones, e.g., weighted least-squares. The a and γ parameters are multinomial distributions, and optimizing them requires Lagrange multipliers or reparameterization. One may ignore the positivity constraints, as they turn out to be automatically satisfied. The details are skipped here.

Equality after the E-step. The E-step computes the optimal value for γ:

    γ* ← argmin_γ F(θ, γ)          (373)

which is given by:

    γ*_{i,j} = P(L_i = j | y_i)          (374)

Substituting this into the free energy gives:

    F(θ, γ*) = − Σ_{i,j} P(L_i = j | y_i) ln [ p(y_i, L_i = j) / P(L_i = j | y_i) ]     (375)
             = − Σ_{i,j} P(L_i = j | y_i) ln p(y_i)                                     (376)
             = − Σ_i ln p(y_i) ( Σ_j P(L_i = j | y_i) )                                 (377)
             = − Σ_i ln p(y_i)                                                          (378)
             = L(θ)                                                                     (379)

Hence,

    L(θ) = min_γ F(θ, γ)          (380)

Bound. An important building block in proving that F(θ, γ) ≥ L(θ) is Jensen's Inequality, which applies since ln is a concave function and Σ_j b_j = 1, b_j ≥ 0:

    ln Σ_j b_j x_j ≥ Σ_j b_j ln x_j,   or equivalently          (381)
    − ln Σ_j b_j x_j ≤ − Σ_j b_j ln x_j                         (382)

We will not prove this here.

We can then derive the bound as follows, dropping the dependence on θ for brevity:

    L(θ) = − Σ_i ln Σ_j p(y_i, L_i = j)                           (383)
         = − Σ_i ln Σ_j γ_{i,j} [ p(y_i, L_i = j) / γ_{i,j} ]     (384)
         ≤ − Σ_{i,j} γ_{i,j} ln [ p(y_i, L_i = j) / γ_{i,j} ]     (385)
         = F(θ, γ)                                                (386)

Relation to K-means

It should be clear that the K-means algorithm is very closely related to EM. In fact, EM reduces to K-means if we make the following restrictions on the model:

- The class probabilities are equal: a_j = 1/K.
- The Gaussians are spherical with identical variances: K_j = σ² I for all j.
- The Gaussian variances are infinitesimal, i.e., we consider the algorithm in the limit as σ² → 0. This causes the optimal values for γ to be binary, since, if j is the nearest class, lim_{σ²→0} P(L_i = j | y_i) = 1.

With these modifications, the free energy becomes equivalent to the K-means objective function, up to constant values, and the EM algorithm becomes identical to K-means.

Degeneracy

There is a degeneracy in the MoG objective function. Suppose we center one Gaussian at one of the data points, so that µ_j = y_i. The error for this data point will be zero, and by reducing the variance of this Gaussian, we can always increase the likelihood of the data. In the limit as this Gaussian's variance goes to zero, the data likelihood goes to infinity. Hence, some effort may be required to avoid this situation. This degeneracy can also be avoided by using a more Bayesian form of the algorithm, e.g., marginalizing out the cluster centers rather than estimating them.

Determining the number of clusters

Determining the value of K is a model selection problem: we want to determine the most likely value of K given the data. Cross-validation is not appropriate here, since we do not have any supervision (e.g., correct labels for a subset of the data). Bayesian model selection can be employed, e.g., by maximizing

    K* = argmax_K P(K | y_{1:N}) = argmax_K ∫ p(K, θ | y_{1:N}) dθ          (387)

where θ are the model parameters. This evaluation is somewhat mathematically involved. A very coarse approximation to this computation is the Bayesian Information Criterion (BIC).
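To make the EM updates above concrete, here is an illustrative sketch for a 1D mixture of Gaussians in pure Python. Everything about it is our own minimal choice, not part of the notes: scalar variances, a crude range-based initialization, and a small variance floor as an ad-hoc guard against the degeneracy discussed above:

```python
import math

def em_mog_1d(y, K, iters=100):
    """EM for a 1D mixture of Gaussians (illustrative sketch only)."""
    N = len(y)
    lo, hi = min(y), max(y)
    # crude initialization: spread the means over the data range, uniform priors
    mu = [lo + (hi - lo) * (j + 0.5) / K for j in range(K)]
    var = [((hi - lo) / K) ** 2 + 1e-3] * K
    a = [1.0 / K] * K
    for _ in range(iters):
        # E-step: gamma[i][j] = P(L_i = j | y_i, theta)
        gamma = []
        for yi in y:
            p = [a[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-0.5 * (yi - mu[j]) ** 2 / var[j]) for j in range(K)]
            s = sum(p)
            gamma.append([pj / s for pj in p])
        # M-step: weighted mixing proportions, means, and variances
        for j in range(K):
            nj = sum(g[j] for g in gamma)
            a[j] = nj / N
            mu[j] = sum(g[j] * yi for g, yi in zip(gamma, y)) / nj
            var[j] = max(sum(g[j] * (yi - mu[j]) ** 2
                             for g, yi in zip(gamma, y)) / nj, 1e-6)   # floor
    return a, mu, var
```

On well-separated data the estimated means converge to the two cluster centers; without the variance floor, a component locked onto a single point could drive its variance toward zero and the likelihood toward infinity, exactly the degeneracy described above.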

Figure 26: Examples of time-series data: speech and language.

16 Hidden Markov Models

Until now, we have exclusively dealt with data sets in which each individual measurement is independent and identically distributed (IID). That is, for two points y_1 and y_2 in our data set, we have p(y_1) = p(y_2) and p(y_1, y_2) = p(y_1) p(y_2) (for a fixed model). Time-series data consist of sequences of data that are not IID: they arise from a process that varies over time, and modeling the dynamics of this process is important.

16.1 Markov Models

Markov models are time-series models that have the Markov property:

    P(s_t | s_{t−1}, s_{t−2}, ..., s_1) = P(s_t | s_{t−1})          (388)

where s_t is the state of the system at time t. Intuitively, this property says that the probability of a state at time t is completely determined by the system state at the previous time step. More generally, for any set A of indices less than t and any set B of indices greater than t we have:

    P(s_t | {s_i}_{i ∈ A∪B}) = P(s_t | s_{max(A)}, s_{min(B)})          (389)

which follows from the Markov property.

Another useful identity, which also follows directly from the Markov property, is:

    P(s_{t−1}, s_{t+1} | s_t) = P(s_{t−1} | s_t) P(s_{t+1} | s_t)          (390)

Discrete Markov Models. An important example of Markov chains are discrete Markov models. Each state s_t takes on one of a discrete set of values, and the probability of transitioning from one value to another is governed by a probability table that is fixed for the whole sequence of states. More concretely, s_t ∈ {1, ..., K} for some finite K and, for all times t,

    P(s_t = j | s_{t−1} = i) = A_{ij}

where A, a parameter of the model, is a fixed matrix of valid probabilities (so that A_{ij} ≥ 0 and Σ_{j=1}^K A_{ij} = 1). To fully characterize the model, we also require a distribution over states for the first time-step: P(s_1 = i) = a_i.

16.2 Hidden Markov Models

A Hidden Markov Model (HMM) models a time-series of observations y_{1:T} as being determined by a hidden discrete Markov chain s_{1:T}. In particular, the measurement y_t is assumed to be determined by an "emission" distribution that depends on the hidden state at time t: p(y_t | s_t = i). The Markov chain is called hidden because we do not measure it, but must reason about it indirectly. Typically, s_t encodes the underlying structure of the time-series, whereas the y_t correspond to the measurements that are actually observed. For example, in speech modeling applications, the measurements y might be waveforms measured from a microphone, and the hidden states might be the corresponding words that the speaker is uttering. In language modeling, the measurements might be discrete words, and the hidden states their underlying parts of speech. HMMs can be used for discrete or continuous data; in this course, we will focus solely on the continuous case, with Gaussian emission distributions.

The joint distribution over observed and hidden variables is:

    p(s_{1:T}, y_{1:T}) = p(y_{1:T} | s_{1:T}) P(s_{1:T})          (391)

where

    P(s_{1:T}) = P(s_1) Π_{t=2}^T P(s_t | s_{t−1})          (392)

and

    p(y_{1:T} | s_{1:T}) = Π_{t=1}^T p(y_t | s_t)          (393)

The Gaussian model says:

    p(y_t | s_t = i) = N(y_t; µ_i, Σ_i)          (394)

for some mean and covariance parameters µ_i and Σ_i.
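A discrete Markov model as defined above is easy to sample from; a sketch (the function name is ours; `random.choices` does the weighted draws from a and from the rows of A):

```python
import random

def sample_markov_chain(a, A, T, seed=0):
    """Sample a length-T state sequence from a discrete Markov model:
    P(s_1 = i) = a[i],  P(s_t = j | s_{t-1} = i) = A[i][j]."""
    rng = random.Random(seed)
    states = [rng.choices(range(len(a)), weights=a)[0]]
    for _ in range(T - 1):
        states.append(rng.choices(range(len(a)), weights=A[states[-1]])[0])
    return states
```

A hidden Markov model would add one more step per time-step: after drawing s_t, draw y_t from the emission distribution p(y_t | s_t).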
In other words, each state i has its own Gaussian with its own parameters. A complete HMM consists of the following parameters: a, A, µ_{1:K},

Figure 27: Illustration of the variables in an HMM, and their conditional dependencies. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Figure 28: The hidden states of an HMM correspond to a state machine. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Figure 29: Illustration of sampling a sequence of data points from an HMM. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

and Σ_{1:K}. As a shorthand, we will denote these parameters by a single variable θ = {a, A, µ_{1:K}, Σ_{1:K}}.

Note that, if A_{ij} = a_j for all i, then this model is equivalent to a Mixtures-of-Gaussians model with mixing proportions given by the a's, since the distribution over states at any instant does not depend on the previous state. In the remainder of this chapter, we will discuss algorithms for computing with HMMs.

16.3 Viterbi Algorithm

We begin by considering the problem of computing the most likely sequence of states given a data set y_{1:T} and a known HMM model. That is, we wish to compute

    s*_{1:T} = argmax_{s_{1:T}} P(s_{1:T} | θ, y_{1:T})          (395)

The naive approach is to simply enumerate every possible state sequence and choose the one that maximizes the above conditional probability. Since there are K^T possible state sequences, this approach is clearly infeasible for sequences of more than a few steps. Fortunately, we can take advantage of the Markov property to perform this computation much more efficiently.

The Viterbi algorithm is a dynamic programming approach to finding the most likely sequence of states s_{1:T} given θ and a sequence of observations y_{1:T}. We begin by defining the following quantity for each state and each time-step:

    δ_t(i) ≡ max_{s_{1:t−1}} p(s_{1:t−1}, s_t = i, y_{1:t})          (396)

(Henceforth, we omit θ from these equations for brevity.) This quantity is the likelihood of the most likely state sequence up to time t that ends in state i, jointly with the data up to time t. We compute it recursively. The base case is:

    δ_1(i) = p(s_1 = i, y_1) = p(y_1 | s_1 = i) P(s_1 = i)          (397)

for all i. The recursive case is:

    δ_t(i) = max_{s_{1:t−1}} p(s_{1:t−1}, s_t = i, y_{1:t})
           = max_{s_{1:t−2}, j} p(s_t = i | s_{t−1} = j) p(y_t | s_t = i) p(s_{1:t−2}, s_{t−1} = j, y_{1:t−1})
           = p(y_t | s_t = i) max_j [ P(s_t = i | s_{t−1} = j) max_{s_{1:t−2}} p(s_{1:t−2}, s_{t−1} = j, y_{1:t−1}) ]
           = p(y_t | s_t = i) max_j A_{ji} δ_{t−1}(j)

Once we have computed δ for all time-steps and all states, we can determine the final state of the most likely sequence as:

    s*_T = argmax_i P(s_T = i | y_{1:T})          (398)
         = argmax_i p(s_T = i, y_{1:T})          (399)
         = argmax_i δ_T(i)                       (400)

since p(y_{1:T}) does not depend on the state sequence. We can then backtrack through δ to determine the states of each previous time-step, by finding which state j was used to compute each maximum in the recursive step above. These states would normally be stored during the recursive process so that they can be looked up later.

16.4 The Forward-Backward Algorithm

We may be interested in computing quantities such as p(y_{1:T} | θ) or P(s_t | y_{1:T}, θ); these distributions are useful for learning and analyzing models. Again, the naive approach to computing these quantities involves summing over all possible sequences of hidden states, and thus is intractable. The Forward-Backward Algorithm allows us to compute these quantities in polynomial time, using dynamic programming.

In the Forward Recursion, we compute:

    α_t(i) ≡ p(y_{1:t}, s_t = i)          (401)

The base case is:

    α_1(i) = p(y_1 | s_1 = i) P(s_1 = i)          (402)

Figure 30: Illustration of the α_t(i) values computed during the Forward Recursion. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

and the recursive case is:

    α_t(i) = Σ_j p(y_{1:t}, s_t = i, s_{t−1} = j)                                 (403)
           = Σ_j p(y_t | s_t = i) P(s_t = i | s_{t−1} = j) p(y_{1:t−1}, s_{t−1} = j)    (404)
           = p(y_t | s_t = i) Σ_{j=1}^K A_{ji} α_{t−1}(j)                          (405)

Note that this is identical to the Viterbi algorithm, except that the maximization over j has been replaced by a summation.

In the Backward Recursion we compute:

    β_t(i) ≡ p(y_{t+1:T} | s_t = i)          (406)

The base case is:

    β_T(i) = 1          (407)

The recursive case is:

    β_t(i) = Σ_{j=1}^K A_{ij} p(y_{t+1} | s_{t+1} = j) β_{t+1}(j)          (408)

From these quantities, we can easily compute the following useful quantities.

Figure 31: Illustration of the steps of the Forward Recursion and the Backward Recursion. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

The probability that the hidden sequence had state i at time t is:

    γ_t(i) ≡ P(s_t = i | y_{1:T})                                                 (409)
           = p(y_{1:T} | s_t = i) P(s_t = i) / p(y_{1:T})                          (410)
           = p(y_{1:t} | s_t = i) p(y_{t+1:T} | s_t = i) P(s_t = i) / p(y_{1:T})   (411)
           = α_t(i) β_t(i) / p(y_{1:T})                                            (412)

The normalizing constant, which is also the likelihood of the entire sequence, p(y_{1:T}), can be computed by the following formula:

    p(y_{1:T}) = Σ_i p(s_t = i, y_{1:T})          (413)
               = Σ_i α_t(i) β_t(i)                (414)

The result of this summation will be the same regardless of which time-step t we choose to do the summation over.

The probability that the hidden sequence transitioned from state i at time t to state j at time

t+1 is:

    ξ_t(i,j) ≡ P(s_t = i, s_{t+1} = j | y_{1:T})                                                 (415)
             = α_t(i) A_{ij} p(y_{t+1} | s_{t+1} = j) β_{t+1}(j) / p(y_{1:T})                     (416)
             = α_t(i) A_{ij} p(y_{t+1} | s_{t+1} = j) β_{t+1}(j)
               / [ Σ_{i,j} α_t(i) A_{ij} p(y_{t+1} | s_{t+1} = j) β_{t+1}(j) ]                    (417)

Note that the denominator gives an expression for p(y_{1:T}) which can be computed for any value of t.

16.5 EM: The Baum-Welch Algorithm

Learning in HMMs is normally done by maximum likelihood, i.e., we wish to find the model parameters such that:

    θ* = argmax_θ p(y_{1:T} | θ)          (418)

As before, naively evaluating this objective function requires K^T steps, and methods like gradient descent are impractical. Instead, we can use the EM algorithm. Note that, since an HMM is a generalization of a Mixture-of-Gaussians, EM for HMMs is a generalization of EM for MoG. The EM algorithm applied to HMMs is also known as the Baum-Welch Algorithm. The algorithm alternates between the following two steps:

- The E-Step: The Forward-Backward Algorithm is performed, in order to compute γ and ξ.
- The M-Step: The parameters θ are updated as follows:

    a_i = γ_1(i)                                                          (419)
    µ_i = Σ_t γ_t(i) y_t / Σ_t γ_t(i)                                     (420)
    Σ_i = Σ_t γ_t(i) (y_t − µ_i)(y_t − µ_i)^T / Σ_t γ_t(i)                (421)
    A_{ij} = Σ_t ξ_t(i,j) / Σ_k Σ_t ξ_t(i,k)                              (422)

Numerical issues: renormalization

In practice, numerical issues are a problem for a straightforward implementation of the Forward-Backward Algorithm. Since the α_t's involve joint probabilities over the entire sequence up to time t, they will be very small. In fact, as t grows, the values of α_t tend to shrink exponentially

towards 0. Thus the limit of machine precision will quickly be reached and the computed values will underflow (evaluate to zero). The solution is to compute normalized terms in the Forward-Backward recursions:

    α̂_t(i) = α_t(i) / Π_{m=1}^t c_m          (423)
    β̂_t(i) = β_t(i) / Π_{m=t+1}^T c_m        (424)

Specifically, we use c_t = p(y_t | y_{1:t−1}). It can then be seen that, if we use α̂ and β̂ in the M-step instead of α and β, the c_t terms cancel out (you can see this by substituting the formulas for γ and ξ into the M-step). We must choose c_t to keep the scaling of α̂ and β̂ within machine precision. In the base case for the forward recursion, we set:

    c_1 = Σ_{i=1}^K p(y_1 | s_1 = i) a_i          (425)
    α̂_1(i) = p(y_1 | s_1 = i) a_i / c_1          (426)

(This may be implemented by first computing the numerator of α̂_1, and then summing it to get c_1.) The recursion for computing α̂ is:

    c_t = Σ_i p(y_t | s_t = i) Σ_{j=1}^K A_{ji} α̂_{t−1}(j)               (427)
    α̂_t(i) = p(y_t | s_t = i) Σ_{j=1}^K A_{ji} α̂_{t−1}(j) / c_t         (428)

In the backward step, the base case is:

    β̂_T(i) = 1          (429)

and the recursive case is:

    β̂_t(i) = Σ_{j=1}^K A_{ij} p(y_{t+1} | s_{t+1} = j) β̂_{t+1}(j) / c_{t+1}          (430)

using the same c_t values computed in the forward recursion. The γ and ξ variables can then be computed as:

    γ_t(i) = α̂_t(i) β̂_t(i)                                                           (431)
    ξ_t(i,j) = α̂_t(i) p(y_{t+1} | s_{t+1} = j) A_{ij} β̂_{t+1}(j) / c_{t+1}           (432)
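The scaled forward recursion (425)-(428) can be sketched as follows. This is our own illustrative function, for the discrete-emission case: a table B[i][y] stands in for the emission density p(y_t | s_t = i):

```python
def forward_normalized(a, A, B, obs):
    """Scaled forward recursion: returns the alpha-hat values and the scaling
    factors c_t = p(y_t | y_{1:t-1}), so ln p(y_{1:T}) = sum_t ln c_t."""
    K = len(a)
    alpha = [a[i] * B[i][obs[0]] for i in range(K)]   # numerator of alpha-hat_1
    c = [sum(alpha)]                                  # c_1, Eqn. (425)
    hats = [[x / c[0] for x in alpha]]
    for y in obs[1:]:
        alpha = [B[i][y] * sum(A[j][i] * hats[-1][j] for j in range(K))
                 for i in range(K)]
        ct = sum(alpha)                               # Eqn. (427)
        c.append(ct)
        hats.append([x / ct for x in alpha])          # Eqn. (428)
    return hats, c
```

Each α̂_t row sums to one by construction, so it never underflows, and the product of the c_t values recovers the sequence likelihood p(y_{1:T}).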

It can be shown that c_t = p(y_t | y_{1:t−1}). Hence, once the recursion is complete, we can compute the data likelihood as

    p(y_{1:T}) = Π_t c_t          (433)

or, in the log domain (which is more stable),

    ln p(y_{1:T}) = Σ_t ln c_t          (434)

This log-likelihood increases after every EM step until EM converges.

Free Energy

EM can be viewed as optimizing the model parameters θ together with the distribution ξ. The Free Energy for a Hidden Markov Model is:

    F(θ, ξ) = − Σ_i γ_1(i) ln a_i − Σ_{i,j} Σ_{t=1}^{T−1} ξ_t(i,j) ln A_{ij}
              − Σ_i Σ_{t=1}^T γ_t(i) ln p(y_t | s_t = i)
              + Σ_{i,j} Σ_{t=1}^{T−1} ξ_t(i,j) ln ξ_t(i,j) − Σ_i Σ_{t=2}^{T−1} γ_t(i) ln γ_t(i)          (435)

where γ is defined as a function of ξ as:

    γ_t(i) = Σ_k ξ_t(i,k) = Σ_k ξ_{t−1}(k,i)          (436)

Warning! Since we weren't able to find a formula for this free energy elsewhere, we derived it from scratch (see below). In our tests, it didn't precisely match the negative log-likelihood, so there might be a mistake here, although the free energy did decrease as expected.

Derivation. This material is very advanced and not required for the course. It is mainly here because we couldn't find it elsewhere. As a shorthand, we define s = s_{1:T} to be a variable representing an entire state sequence. The likelihood of a data sequence is:

    p(y_{1:T}) = Σ_s p(y_{1:T}, s)          (437)

where the summation is over all possible state sequences. In EM, we are really optimizing θ and a distribution q(s) over the possible state sequences. The variable ξ is just one way of representing this distribution by its marginals; the variable γ gives the

marginals of ξ:

    γ_t(i) = q(s_t = i) = Σ_{s \ {s_t}} q(s)                           (438)
    ξ_t(i,j) = q(s_t = i, s_{t+1} = j) = Σ_{s \ {s_t, s_{t+1}}} q(s)   (439)

We can also compute the full distribution from ξ and γ:

    q(s) = q(s_1) Π_{t=1}^{T−1} q(s_{t+1} | s_t)                        (440)
         = γ_1(s_1) Π_{t=1}^{T−1} [ ξ_t(s_t, s_{t+1}) / γ_t(s_t) ]      (441)
         = Π_{t=1}^{T−1} ξ_t(s_t, s_{t+1}) / Π_{t=2}^{T−1} γ_t(s_t)     (442)

The Free Energy is then:

    F(θ, q) = − Σ_s q(s) ln p(s, y_{1:T}) + Σ_s q(s) ln q(s)          (443)
            = F_1(q) + F_2(q)                                          (444)

The first term can be decomposed as:

    F_1(q) = − Σ_s q(s) ln p(s, y_{1:T})                                                          (445)
           = − Σ_s q(s) ln [ P(s_1) Π_{t=1}^{T−1} P(s_{t+1} | s_t) Π_{t=1}^T p(y_t | s_t) ]       (446)
           = − Σ_s q(s) ln P(s_1) − Σ_t Σ_s q(s) ln P(s_{t+1} | s_t) − Σ_t Σ_s q(s) ln p(y_t | s_t)    (447)
           = − Σ_i γ_1(i) ln P(s_1 = i) − Σ_{i,j,t} ξ_t(i,j) ln P(s_{t+1} = j | s_t = i)
             − Σ_{i,t} γ_t(i) ln p(y_t | s_t = i)                                                 (448)

The second term can be simplified as:

    F_2(q) = Σ_s q(s) ln q(s)                                                                  (449)
           = Σ_s q(s) ln [ Π_{t=1}^{T−1} ξ_t(s_t, s_{t+1}) / Π_{t=2}^{T−1} γ_t(s_t) ]          (450)
           = Σ_{t=1}^{T−1} Σ_s q(s) ln ξ_t(s_t, s_{t+1}) − Σ_{t=2}^{T−1} Σ_s q(s) ln γ_t(s_t)  (451)
           = Σ_{i,j} Σ_{t=1}^{T−1} ξ_t(i,j) ln ξ_t(i,j) − Σ_i Σ_{t=2}^{T−1} γ_t(i) ln γ_t(i)   (452)

16.6 Most likely state sequences

Suppose we wanted to compute the most likely state s_t for each time t in a sequence. There are two ways that we might do it: we could take the most likely state sequence:

    s*_{1:T} = argmax_{s_{1:T}} p(s_{1:T} | y_{1:T})          (453)

or we could take the sequence of most likely states:

    s*_t = argmax_{s_t} P(s_t | y_{1:T})          (454)

While these sequences may often be similar, they can be different as well. For example, it is possible that the most likely states for two consecutive time-steps do not have a valid transition between them; i.e., if s*_t = i and s*_{t+1} = j, it is possible (though unlikely) that A_{ij} = 0. This illustrates that these two ways to create sequences of states answer two different questions: what sequence is jointly most likely? And, for each time-step, what is the most likely state just for that time-step?
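The jointly most likely sequence in Eqn. (453) is exactly what the Viterbi recursion above computes. Here is an illustrative sketch for the discrete-emission case (our own function; the table B[i][y] stands in for p(y_t | s_t = i)):

```python
def viterbi(a, A, B, obs):
    """Most likely state sequence s*_{1:T} for a discrete-emission HMM."""
    K, T = len(a), len(obs)
    delta = [[a[i] * B[i][obs[0]] for i in range(K)]]   # base case, Eqn. (397)
    back = []                                           # argmax pointers for backtracking
    for t in range(1, T):
        row, ptr = [], []
        for i in range(K):
            # recursive case: delta_t(i) = p(y_t|s_t=i) * max_j A_ji delta_{t-1}(j)
            j = max(range(K), key=lambda j: A[j][i] * delta[-1][j])
            ptr.append(j)
            row.append(B[i][obs[t]] * A[j][i] * delta[-1][j])
        delta.append(row)
        back.append(ptr)
    # backtrack from the most likely final state, Eqn. (400)
    s = [max(range(K), key=lambda i: delta[-1][i])]
    for ptr in reversed(back):
        s.append(ptr[s[-1]])
    return list(reversed(s))
```

In a practical implementation the products would be replaced by sums of logs, for exactly the underflow reasons discussed in the renormalization section.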

17 Support Vector Machines

We now discuss an influential and effective classification algorithm called Support Vector Machines (SVMs). In addition to their successes in many classification problems, SVMs are responsible for introducing and/or popularizing several important ideas in machine learning, namely, kernel methods, maximum-margin methods, convex optimization, and sparsity/support vectors. Unlike the mostly-Bayesian treatment that we have given in this course, SVMs are based on some very sophisticated Frequentist arguments (based on a theory called Structural Risk Minimization and VC-dimension) which we will not discuss here, although there are many close connections to Bayesian formulations.

17.1 Maximizing the margin

Suppose we are given N training vectors {(x_i, y_i)}, where x_i ∈ R^D and y_i ∈ {−1, 1}. We want to learn a classifier

    f(x) = w^T φ(x) + b          (455)

so that the classifier's output for a new x is sign(f(x)).

Suppose that our training data are linearly separable in the feature space φ(x), i.e., as illustrated in Figure 32, the two classes of training exemplars are sufficiently well separated in the feature space that one can draw a hyperplane between them (e.g., a line in 2D, or a plane in 3D). If they are linearly separable then, in almost all cases, there will be many possible choices for the linear decision boundary, each one of which will produce no classification errors on the training data. Which one should we choose? If we place the boundary very close to some of the data, there seems to be a greater danger that we will misclassify some data, especially when the training data are almost certainly noisy. This motivates the idea of placing the boundary to maximize the margin, that is, the distance from the hyperplane to the closest data point in either class. This can be thought of as having the largest margin for error: if you are driving a fast car between a scattered set of obstacles, it is safest to find a path that stays as far from them as possible.
More precisely, in a maximum-margin method, we want to optimize the following objective function:

    max_{w,b} min_i dist(x_i, w, b)          (456)
    such that, for all i,  y_i (w^T φ(x_i) + b) ≥ 0          (457)

where dist(x, w, b) is the Euclidean distance from the feature point φ(x) to the hyperplane defined by w and b. With this objective function we are maximizing the distance from the decision boundary w^T φ(x) + b = 0 to the nearest point. The constraints force us to find a decision boundary that classifies all training data correctly. That is, for the classifier to classify a training point correctly, y_i and w^T φ(x_i) + b must have the same sign, in which case their product is positive.
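For intuition, the inner quantity of Eqn. (456), the margin of a candidate (w, b), is easy to compute; a sketch with identity features φ(x) = x (the function name is ours):

```python
import math

def margin(w, b, X, Y):
    """Smallest signed distance y_i (w^T x_i + b) / ||w|| over the training set.
    Positive iff (w, b) separates the data; the SVM maximizes this over (w, b)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(X, Y))
```

For two points at x = ±2 on a line, the centered boundary (b = 0) has margin 2, while a boundary shifted toward one class has a smaller margin; this is the quantity the SVM objective drives up.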

Figure 32: Left: the margin for a decision boundary is the distance to the nearest data point. Right: In SVMs, we find the boundary with maximum margin. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

It can be shown that the distance from a point φ(x_i) to the hyperplane w^T φ(x) + b = 0 is given by |w^T φ(x_i) + b| / ||w||, or, since y_i tells us the sign of f(x_i), y_i (w^T φ(x_i) + b) / ||w||. This can be seen intuitively by writing the hyperplane in the form f(x) = w^T (φ(x) − p), where p is a point on the hyperplane such that w^T p = −b. The vector from φ(x_i) to the hyperplane, projected onto w/||w||, gives a vector from the hyperplane to the point; the length of this vector is the desired distance.

Substituting this expression for the distance function into the above objective function, we get:

    max_{w,b} min_i [ y_i (w^T φ(x_i) + b) / ||w|| ]          (458)
    such that, for all i,  y_i (w^T φ(x_i) + b) ≥ 0           (459)

Note that, because of the normalization by ||w|| in (458), the scale of w is arbitrary in this objective function. That is, if we were to multiply w and b by some real scalar α, the factors of α in the numerator and denominator would cancel one another. Now, suppose that we choose the scale so that the nearest point to the hyperplane, x_i, satisfies y_i (w^T φ(x_i) + b) = 1. With this assumption the min in Eqn. (458) becomes redundant and can be removed. Thus we can rewrite the objective function and the constraints as

    max_{w,b} 1 / ||w||          (460)
    such that, for all i,  y_i (w^T φ(x_i) + b) ≥ 1          (461)

Finally, as a last step, since maximizing 1/||w|| is the same as minimizing ||w||²/2, we can re-express the optimization problem as

    min_{w,b} (1/2) ||w||²          (462)
    such that, for all i,  y_i (w^T φ(x_i) + b) ≥ 1          (463)

This optimization problem is a quadratic program, or QP: the objective function is quadratic in the unknowns and the constraints are linear. A QP has a single global minimum, which can be found efficiently with current optimization packages.

In order to understand this optimization problem, note that the constraints will be active for only a few data points. That is, only a few data points will be close to the margin, thereby constraining the solution. These points are called the support vectors. Small movements of the other data points have no effect on the decision boundary; indeed, the decision boundary is determined only by the support vectors. Of course, moving points to within the margin of the decision boundary will change which points are support vectors, and thus change the decision boundary. This is in contrast to the probabilistic methods we have seen earlier in the course, in which the positions of all data points affect the location of the decision boundary.

17.2 Slack Variables for Non-Separable Datasets

Many datasets will not be linearly separable. As a result, there will be no way to satisfy all the constraints in Eqn. (463). One way to cope with such datasets, and still learn useful classifiers, is to loosen some of the constraints by introducing slack variables.

Slack variables allow certain constraints to be violated: certain training points are permitted to lie within the margin. We want the number of points within the margin to be as small as possible, and, of course, we want their penetration of the margin to be as small as possible. To this end, we introduce a slack variable ξ_i, one for each data point. (ξ is the Greek letter xi, pronounced "ksi.") The slack variable enters the optimization problem in two ways. First, the slack variable ξ_i dictates the degree to which the constraint on the i-th data point can be violated. Second, by adding the slack variables to the objective function, we simultaneously aim to minimize their use.
Mathematically, the new optimization problem can be expressed as

    min_{w,b,ξ_{1:N}} Σ_i ξ_i + λ (1/2) ||w||²          (464)
    such that, for all i,  y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0          (465)

As discussed above, we aim both to maximize the margin and to minimize violation of the margin constraints. This objective function is still a QP, and so can be optimized with a QP library. However, it does have a much larger number of optimization variables, namely, one ξ_i must now be optimized for each data point. In practice, SVMs are normally optimized with special-purpose optimization procedures designed specifically for SVMs.
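For fixed (w, b), minimizing over ξ in Eqns. (464)-(465) gives each slack variable in closed form: ξ_i = max(0, 1 − y_i (w^T φ(x_i) + b)). A sketch, with identity features φ(x) = x assumed (the function name is ours):

```python
def slack_variables(w, b, X, Y):
    """Optimal slack values for a fixed (w, b): the constraints
    y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0, minimized over xi,
    give xi_i = max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
            for x, y in zip(X, Y)]
```

The values behave as in Figure 33 below: ξ_i = 0 for points outside the margin, 0 < ξ_i < 1 for points inside the margin but correctly classified, and ξ_i ≥ 1 for misclassified points.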

Figure 33: The slack variables ξ_i ≥ 1 for misclassified points, and 0 < ξ_i < 1 for points close to the decision boundary. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

17.3 Loss Functions

In order to better understand the behavior of SVMs, and how they compare to other methods, we will analyze them in terms of their loss functions.⁹ In some cases, this loss function might come from the problem being solved: for example, we might pay a certain dollar amount if we incorrectly classify a vector, and the penalty for a false positive might be very different from the penalty for a false negative. The rewards and losses due to correct and incorrect classification depend on the particular problem being solved. Here, we will simply attempt to minimize the total number of classification errors, using a penalty called the 0-1 Loss:

    L_{0-1}(x, y) = { 1   if y f(x) < 0
                    { 0   otherwise          (466)

(Note that y f(x) > 0 is the same as requiring that y and f(x) have the same sign.) This loss function says that we pay a penalty of 1 when we misclassify a new input, and a penalty of zero if we classify it correctly.

Ideally, we would choose the classifier to minimize the loss over the new test data that we are given; of course, we don't know the true labels of the test data, and instead we optimize the following surrogate objective function over the training data:

    E(w) = Σ_i L(x_i, y_i) + λ R(w)          (467)

⁹ A loss function specifies a measure of the quality of a solution to an optimization problem. It is the penalty function that tells us how badly we want to penalize errors in a model's ability to fit the data. In probabilistic methods it is typically the negative log-likelihood or the negative log-posterior.

Figure 34: Loss functions E(z) for learning, as functions of z = y f(x). Black: 0-1 loss. Red: LR loss. Green: quadratic loss ((z-1)^2). Blue: hinge loss. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

where R(w) is a regularizer meant to prevent overfitting (and thus improve performance on future data). The basic assumption is that loss on the training set should correspond to loss on the test set. If we can get the classifier to have small loss on the training data, while also being smooth, then the loss we pay on new data ought not to be too big either. This optimization framework is equivalent to MAP estimation as discussed previously^10; however, here we are not at all concerned with probabilities. We only care about whether the classifier gets the right answers or not.

Unfortunately, optimizing a classifier for the 0-1 loss is very difficult: it is not differentiable everywhere, and, where it is differentiable, the gradient is zero everywhere. There is a set of algorithms called Perceptron Learning which attempt to do this; of these, the Voted Perceptron algorithm is considered one of the best. However, these methods are somewhat complex to analyze and we will not discuss them further. Instead, we will use other loss functions that approximate the 0-1 loss.

We can see that maximum likelihood logistic regression is equivalent to optimization with the following loss function:

L_{LR} = \ln(1 + e^{-y f(x)})    (468)

which is the negative log-likelihood of a single data vector. This function is a poor approximation to the 0-1 loss, and, if all we care about is getting the labels right (and not the class probabilities), then we ought to search for a better approximation. SVMs minimize the slack variables, which, from the constraints, can be seen to give the hinge loss:

L_{hinge} = { 1 - y f(x)  if 1 - y f(x) > 0
            { 0           otherwise            (469)

^10 However, not all loss functions can be viewed as the negative log of a valid likelihood function, although all negative log-likelihoods can be viewed as loss functions for learning.
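The losses compared in Figure 34 are easy to evaluate numerically. A small sketch (function names are our own), writing each loss as a function of z = y f(x):

```python
import numpy as np

def loss_01(z):
    """0-1 loss (Eqn. 466): 1 if z = y f(x) < 0, else 0."""
    return np.where(z < 0, 1.0, 0.0)

def loss_lr(z):
    """Logistic-regression loss (Eqn. 468): ln(1 + e^{-z})."""
    return np.log1p(np.exp(-z))

def loss_hinge(z):
    """Hinge loss (Eqn. 469): 1 - z where that is positive, else 0."""
    return np.maximum(0.0, 1.0 - z)

z = np.linspace(-2.0, 2.0, 101)
# The hinge loss upper-bounds the 0-1 loss everywhere, and is exactly zero
# for confidently correct points (z >= 1), matching the discussion above.
```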

This loss function is zero for points that are classified correctly (with distance to the decision boundary at least 1); hence, it is insensitive to correctly-classified points far from the boundary. It increases linearly for misclassified points, though not nearly as quickly as the LR loss.

17.4 The Lagrangian and the Kernel Trick

We now use the Lagrangian to transform the SVM problem in a way that will lead to a powerful generalization. For simplicity here we assume that the dataset is linearly separable, and so we drop the slack variables. The Lagrangian allows us to take the constrained optimization problem above in Eqn. (463) and re-express it as an unconstrained problem. The Lagrangian for the SVM objective function in Eqn. (463), with Lagrange multipliers a_i \ge 0, is:

L(w, b, a_{1:N}) = \frac{1}{2} \|w\|^2 - \sum_i a_i ( y_i ( w^T \phi(x_i) + b ) - 1 )    (470)

The minus sign with the second term is used because we are minimizing with respect to the first term, but maximizing the second. Setting the derivatives dL/dw = 0 and dL/db = 0 gives the following constraints on the solution:

w = \sum_i a_i y_i \phi(x_i)    (471)

\sum_i y_i a_i = 0    (472)

Using (471) we can substitute for w in (470). Then simplifying the result, and making use of the next constraint (472), one can derive what is often called the dual Lagrangian:

L(a_{1:N}) = \sum_i a_i - \frac{1}{2} \sum_{i,j} a_i a_j y_i y_j \phi(x_i)^T \phi(x_j)    (473)

While this objective function is actually more expensive to evaluate than the primal Lagrangian (i.e., (470)), it does lead to the following modified form:

L(a_{1:N}) = \sum_i a_i - \frac{1}{2} \sum_{i,j} a_i a_j y_i y_j k(x_i, x_j)    (474)

where k(x_i, x_j) = \phi(x_i)^T \phi(x_j) is called a kernel function. For example, if we used the basic linear features, i.e., \phi(x) = x, then k(x_i, x_j) = x_i^T x_j.

The advantage of the kernel function representation is that it frees us from thinking about the features directly; the classifier can be specified solely in terms of the kernel. Any kernel that
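As a concrete sanity check of Eqn. (474), the dual objective can be evaluated directly from a Gram matrix. A sketch (names our own), using the linear kernel k(x_i, x_j) = x_i^T x_j:

```python
import numpy as np

def dual_objective(a, X, y, kernel):
    """L(a_{1:N}) = sum_i a_i - (1/2) sum_{i,j} a_i a_j y_i y_j k(x_i, x_j),
    i.e., Eqn. (474) evaluated via the Gram matrix K_{ij} = k(x_i, x_j)."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    ay = a * y
    return a.sum() - 0.5 * ay @ K @ ay

linear_kernel = lambda u, v: u @ v

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
a = np.array([0.5, 0.5])              # also satisfies constraint (472)
val = dual_objective(a, X, y, linear_kernel)   # 1.0 - 0.5*(0.25 + 0.25) = 0.75
```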

satisfies a specific technical condition^11 is a valid kernel. For example, one of the most commonly-used kernels is the RBF kernel:

k(x, z) = e^{-\gamma \|x - z\|^2}    (475)

which corresponds to a feature vector \phi(x) with infinite dimensionality! (Specifically, each element of \phi is a Gaussian basis function with vanishing variance.)

Note that, just as most constraints in Eqn. (463) are not active, the same will be true here. That is, only some constraints will be active (i.e., the support vectors), and for all other constraints, a_i = 0. Hence, once the model is learned, most of the training data can be discarded; only the support vectors and their a_i values matter.

The one final thing we need to do is estimate the bias b. We now know the values of a_i for all support vectors (i.e., for data constraints that are considered active), and hence we know w. Accordingly, for all support vectors we know, by assumption above, that

y_i ( w^T \phi(x_i) + b ) = 1.    (476)

From this one can easily solve for b.

Applying the SVM to new data. For the kernel representation to be useful, we need to be able to classify new data without needing to evaluate the weights explicitly. This can be done as follows:

f(x_{new}) = w^T \phi(x_{new}) + b    (477)
           = ( \sum_i a_i y_i \phi(x_i) )^T \phi(x_{new}) + b    (478)
           = \sum_i a_i y_i k(x_i, x_{new}) + b    (479)

Generalizing the kernel representation to non-separable datasets (i.e., with slack variables) is straightforward, but will not be covered in this course.

17.5 Choosing parameters

To determine an SVM classifier, one must select:

- The regularization weight \lambda
- The parameters of the kernel function
- The type of kernel function

These values are typically selected either by hand-tuning or cross-validation.

^11 Specifically, suppose one is given N input points x_{1:N}, and forms a matrix K such that K_{i,j} = k(x_i, x_j). This matrix must be positive semidefinite (i.e., all eigenvalues non-negative) for all possible input sets for k to be a valid kernel.
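The RBF kernel of Eqn. (475) and the kernel-form prediction rule of Eqn. (479) can be sketched directly. The support-vector lists here are illustrative placeholders, not the output of a real trained model:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma ||x - z||^2), Eqn. (475)."""
    d = x - z
    return np.exp(-gamma * (d @ d))

def svm_predict(x_new, sv_X, sv_a, sv_y, b, kernel):
    """f(x_new) = sum_i a_i y_i k(x_i, x_new) + b, Eqn. (479).
    Only the support vectors (the points with a_i > 0) are needed."""
    return sum(a_i * y_i * kernel(x_i, x_new)
               for a_i, y_i, x_i in zip(sv_a, sv_y, sv_X)) + b

# Two hypothetical support vectors, one per class.
sv_X = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
sv_a = [1.0, 1.0]
sv_y = [1.0, -1.0]
f = svm_predict(np.array([0.9, 1.1]), sv_X, sv_a, sv_y, b=0.0, kernel=rbf_kernel)
# A query near the positive support vector gets a positive decision value.
```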

Figure 35: Nonlinear classification boundary learned using a kernel SVM (with an RBF kernel). The circled points are the support vectors; the curves are isocontours of the decision function (e.g., the decision boundary f(x) = 0, etc.) (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

17.6 Software

Like many methods in machine learning, there is freely available software on the web. For SVM classification and regression there is well-known software developed by Thorsten Joachims, called SVMlight (URL: ).

18 AdaBoost

Boosting is a general strategy for learning classifiers by combining simpler ones. The idea of boosting is to take a "weak" classifier (that is, any classifier that will do at least slightly better than chance) and use it to build a much better classifier, thereby boosting the performance of the weak classification algorithm. This boosting is done by averaging the outputs of a collection of weak classifiers. The most popular boosting algorithm is AdaBoost, so-called because it is "adaptive."^12 AdaBoost is extremely simple to use and implement (far simpler than SVMs), and often gives very effective results. There is tremendous flexibility in the choice of weak classifier as well. Boosting is a specific example of a general class of learning algorithms called ensemble methods, which attempt to build better learning algorithms by combining multiple simpler algorithms.

Suppose we are given training data {(x_i, y_i)}_{i=1}^N, where x_i \in R^K and y_i \in {-1, 1}. And suppose we are given a (potentially large) number of weak classifiers, denoted f_m(x) \in {-1, 1}, and a 0-1 loss function I, defined as

I(f_m(x), y) = { 0  if f_m(x) = y
              { 1  if f_m(x) \ne y         (480)

Then the pseudocode of the AdaBoost algorithm is as follows:

    for i from 1 to N, w_i^{(1)} = 1
    for m = 1 to M do
        Fit weak classifier m to minimize the objective function
            \epsilon_m = \sum_i w_i^{(m)} I(f_m(x_i) \ne y_i) / \sum_i w_i^{(m)}
            where I(f_m(x_i) \ne y_i) = 1 if f_m(x_i) \ne y_i and 0 otherwise
        \alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}
        for all i do
            w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m I(f_m(x_i) \ne y_i)}
        end for
    end for

After learning, the final classifier is based on a linear combination of the weak classifiers:

g(x) = sgn( \sum_{m=1}^M \alpha_m f_m(x) )    (481)

Essentially, AdaBoost is a greedy algorithm that builds up a "strong classifier", i.e., g(x), incrementally, by optimizing the weights for, and adding, one weak classifier at a time.

^12 AdaBoost was called adaptive because, unlike previous boosting algorithms, it does not need to know error bounds on the weak classifiers, nor does it need to know the number of classifiers in advance.
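The pseudocode above translates almost line-for-line into code. This is an illustrative sketch (the names and the toy weak-classifier pool are our own): at each round, "fitting" the weak classifier just means picking, from a fixed pool, the classifier with the lowest weighted error, then reweighting the data:

```python
import numpy as np

def adaboost(X, y, pool, M):
    """AdaBoost, following the pseudocode above.  `pool` is a list of
    candidate weak classifiers f(X) -> array of {-1, +1}."""
    w = np.ones(len(y))
    alphas, chosen = [], []
    for m in range(M):
        # Weighted error eps_m for every candidate in the pool.
        errs = [(w * (f(X) != y)).sum() / w.sum() for f in pool]
        best = int(np.argmin(errs))
        eps = errs[best]
        if eps == 0.0:                        # perfect weak classifier: stop
            alphas.append(1.0); chosen.append(pool[best])
            break
        alphas.append(np.log((1.0 - eps) / eps))
        chosen.append(pool[best])
        # Upweight the points this round's classifier got wrong.
        w = w * np.exp(alphas[-1] * (pool[best](X) != y))
    def g(Xq):                                # final classifier, Eqn. (481)
        return np.sign(sum(a * f(Xq) for a, f in zip(alphas, chosen)))
    return g

# Three 1-D stumps; no single one classifies this data, but three rounds do.
fA = lambda X: np.where(X[:, 0] < 1.5, 1.0, -1.0)
fB = lambda X: np.where(X[:, 0] > 2.5, 1.0, -1.0)
fC = lambda X: np.ones(len(X))
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, -1.0, 1.0])
g = adaboost(X, y, [fA, fB, fC], M=3)
```

Note how the reweighting matters here: each stump misclassifies one point, yet the weighted vote g(x) classifies all three correctly.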

Figure 36: Illustration of the steps of AdaBoost. The decision boundary is shown in green for each step, and the decision stump for each step is shown as a dashed line. The results are shown after 1, 2, 3, 6, 10, and 150 steps of AdaBoost. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Figure 37: 50 steps of AdaBoost used to learn a classifier with decision stumps. (The panels show the training data, the classified data, the exponential and binary losses on the training set, f(x) = \sum_m \alpha_m f_m(x), and the decision boundary.)

18.1 Decision stumps

As an example of a weak classifier, we consider decision stumps, which are a trivial special case of decision trees. A decision stump has the following form:

f(x) = s(x_k > c)    (482)

where the value in the parentheses is 1 if the k-th element of the vector x is greater than c, and -1 otherwise. The scalar s is either -1 or 1, which allows the classifier to respond with class 1 when x_k \le c. Accordingly, there are three parameters to a decision stump:

- c \in R
- k \in {1, ..., K}, where K is the dimension of x
- s \in {-1, 1}

Because the number of possible parameter settings is relatively small, a decision stump is often trained by brute force: discretize the real numbers from the smallest to the largest value in the training set, enumerate all possible classifiers, and pick the one with the lowest training error. One can be more clever in the discretization: between each pair of data points, only one classifier need be tested (since any stump in this range will give the same value). More sophisticated methods, for example, based on binning the data, or building CDFs of the data, may also be possible.

18.2 Why does it work?

There are many different ways to analyze AdaBoost; none of them alone gives a full picture of why AdaBoost works so well. AdaBoost was first invented based on optimization of certain bounds on training, and, since then, a number of new theoretical properties have been discovered.

Loss function view. Here we discuss the loss function interpretation of AdaBoost. As was shown (decades after AdaBoost was first invented), AdaBoost can be viewed as greedy optimization of a particular loss function. We define f(x) = \frac{1}{2} \sum_m \alpha_m f_m(x), and rewrite the classifier as g(x) = sgn(f(x)) (the factor of 1/2 has no effect on the classifier output). AdaBoost can then be viewed as optimizing the exponential loss:

L_{exp}(x, y) = e^{-y f(x)}    (483)

so that the full learning objective function is

E = \sum_i e^{-\frac{1}{2} y_i \sum_{m=1}^M \alpha_m f_m(x_i)}    (484)

which must be optimized with respect to the weights \alpha_m and the parameters of the weak classifiers.
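Circling back briefly to Section 18.1: the brute-force stump training described there can be sketched concretely. This is an illustrative sketch (names our own); it uses the observation above that only one threshold between each pair of sorted data values needs to be tested, and it accepts per-point weights so it can serve as the weak learner inside AdaBoost:

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted brute-force decision-stump training: for each dimension k,
    test one threshold c between each pair of sorted values (plus one below
    all of them) and both signs s; keep the stump with lowest weighted error."""
    N, K = X.shape
    best = (np.inf, 0, 0.0, 1)                        # (error, k, c, s)
    for k in range(K):
        vals = np.unique(X[:, k])                     # sorted unique values
        cands = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for c in cands:
            for s in (-1, 1):
                pred = s * np.where(X[:, k] > c, 1, -1)
                err = w[pred != y].sum() / w.sum()
                if err < best[0]:
                    best = (err, k, c, s)
    err, k, c, s = best
    stump = lambda Xq: s * np.where(Xq[:, k] > c, 1, -1)
    return stump, err

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
stump, err = fit_stump(X, y, np.ones(len(y)))
```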
The optimization process is greedy and sequential: we add one weak classifier at a time, choosing it

and its \alpha_m to be optimal with respect to E, and then never change it again. Note that the exponential loss is an upper bound on the 0-1 loss:

L_{exp}(x, y) \ge L_{0-1}(x, y)    (485)

Hence, if an exponential loss of zero is achieved, then the 0-1 loss is zero as well, and all training points are correctly classified.

Consider the weak classifier f_m to be added at step m. The entire objective function can be written so as to separate out the contribution of this classifier:

E = \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(x_i) - \frac{1}{2} y_i \alpha_m f_m(x_i)}    (486)
  = \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(x_i)} e^{-\frac{1}{2} y_i \alpha_m f_m(x_i)}    (487)

Since we are holding constant the first m-1 terms, we can replace them with a single constant w_i^{(m)} = e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(x_i)}. Note that these are the same weights computed by the recursion used by AdaBoost: w_i^{(m)} \propto w_i^{(m-1)} e^{-\frac{1}{2} y_i \alpha_{m-1} f_{m-1}(x_i)}. (There is a proportionality constant that can be ignored.) Hence, we have

E = \sum_i w_i^{(m)} e^{-\frac{1}{2} y_i \alpha_m f_m(x_i)}    (488)

We can split this into two summations, one for data correctly classified by f_m, and one for those misclassified:

E = e^{-\frac{\alpha_m}{2}} \sum_{i: f_m(x_i) = y_i} w_i^{(m)} + e^{\frac{\alpha_m}{2}} \sum_{i: f_m(x_i) \ne y_i} w_i^{(m)}    (489)

Rearranging terms, we have

E = ( e^{\frac{\alpha_m}{2}} - e^{-\frac{\alpha_m}{2}} ) \sum_i w_i^{(m)} I(f_m(x_i) \ne y_i) + e^{-\frac{\alpha_m}{2}} \sum_i w_i^{(m)}    (490)

Optimizing this with respect to f_m is equivalent to optimizing \sum_i w_i^{(m)} I(f_m(x_i) \ne y_i), which is what AdaBoost does. The optimal value for \alpha_m can be derived by solving dE/d\alpha_m = 0:

\frac{dE}{d\alpha_m} = \frac{1}{2} ( e^{\frac{\alpha_m}{2}} + e^{-\frac{\alpha_m}{2}} ) \sum_i w_i^{(m)} I(f_m(x_i) \ne y_i) - \frac{1}{2} e^{-\frac{\alpha_m}{2}} \sum_i w_i^{(m)} = 0    (491)

Dividing both sides by \frac{1}{2} \sum_i w_i^{(m)}, we have

0 = e^{\frac{\alpha_m}{2}} \epsilon_m + e^{-\frac{\alpha_m}{2}} \epsilon_m - e^{-\frac{\alpha_m}{2}}    (492)

e^{\frac{\alpha_m}{2}} \epsilon_m = e^{-\frac{\alpha_m}{2}} (1 - \epsilon_m)    (493)

\frac{\alpha_m}{2} + \ln \epsilon_m = -\frac{\alpha_m}{2} + \ln(1 - \epsilon_m)    (494)

\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}    (495)
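Eqn. (495) is easy to verify numerically: fix the weights and the set of points the weak classifier gets wrong, and check that \alpha_m = ln((1 - \epsilon_m)/\epsilon_m) minimizes E(\alpha_m) from Eqn. (488). A small sketch (names our own):

```python
import numpy as np

def E_of_alpha(alpha, w, correct):
    """E(alpha) from Eqn. (488)/(489): each correctly classified point
    contributes w_i e^{-alpha/2}; each misclassified one, w_i e^{+alpha/2}."""
    return (w[correct].sum() * np.exp(-alpha / 2.0)
            + w[~correct].sum() * np.exp(alpha / 2.0))

w = np.array([1.0, 1.0, 1.0, 1.0])
correct = np.array([True, True, True, False])   # f_m errs on one of four points
eps = w[~correct].sum() / w.sum()               # weighted error eps_m = 0.25
alpha_star = np.log((1.0 - eps) / eps)          # Eqn. (495): ln 3
```

Since E(\alpha) is convex in \alpha, checking that alpha_star beats its neighbors on both sides confirms it is the minimizer.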

Figure 38: Loss functions E(z) for learning. Black: 0-1 loss. Blue: hinge loss. Red: logistic regression. Green: exponential loss. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Problems with the loss function view. The exponential loss is not a very good loss function to use in general. For example, if we directly optimize the exponential loss over all variables in the classifier (e.g., with gradient descent), we will often get terrible performance. So the loss-function interpretation of AdaBoost does not tell the whole story.

Margin view. One might expect that, when AdaBoost reaches zero training set error, adding any new weak classifiers would cause overfitting. In practice, the opposite often occurs: continuing to add weak classifiers actually improves test set performance in many situations. One explanation comes from looking at the margins: adding classifiers tends to increase the margin size. The formal details of this will not be discussed here.

18.3 Early stopping

It is nonetheless possible to overfit with AdaBoost by adding too many classifiers. The solution that is normally used in practice is a procedure called early stopping. The idea is as follows. We partition our data set into two pieces, a training set and a test set. The training set is used to train the algorithm normally. However, at each step of the algorithm, we also compute the 0-1 binary loss on the test set. During the course of the algorithm, the exponential loss on the training set is guaranteed to decrease, and the 0-1 binary loss will generally decrease as well. The errors on the test set will also generally decrease in the first steps of the algorithm; however, at some point the test error will begin to get noticeably worse. When this happens, we revert the classifier to the form that gave the best test error, and discard any subsequent changes (i.e., additional weak classifiers).

The intuition for the algorithm is as follows. When we begin learning, our initial classifier is

extremely simple and smooth. During the learning process, we add more and more complexity to the model to improve the fit to the data. At some point, adding additional complexity to the model overfits: we are no longer modeling the decision boundary we wish to fit, but are fitting the noise in the data instead. We use the test set to determine when overfitting begins, and stop learning at that point.

Early stopping can be used for most iterative learning algorithms. For example, suppose we use gradient descent to learn a regression algorithm. If we begin with weights w = 0, we are beginning with a very smooth curve. Each step of gradient descent will make the curve less smooth, as the entries of w get larger and larger; stopping early can prevent w from getting too large (and thus too non-smooth).

Early stopping is very simple and very general; however, it is heuristic, as the final result one gets will depend on the particulars of the optimization algorithm being used, and not just on the objective function. (However, AdaBoost's procedure is suboptimal anyway: once a weak classifier is added, it is never updated.)

An even more aggressive form of early stopping is to simply stop learning at a fixed number of iterations, or by some other criterion unrelated to test set error (e.g., when the result "looks good"). In fact, practitioners often use early stopping to regularize unintentionally, simply because they halt the optimizer before it has converged, e.g., because the convergence threshold is set too high, or because they are too impatient to wait.
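The early-stopping recipe above (track the test error each round, then revert to the best round) is generic across iterative learners. A minimal sketch, assuming the learner is exposed as a per-round step function; all names here are our own:

```python
import numpy as np

def train_with_early_stopping(run_round, n_rounds, test_error):
    """run_round(m) performs round m of training and returns the current
    model; test_error(model) is its 0-1 loss on a held-out test set.
    We keep whichever model achieved the lowest test error, mimicking
    'revert the classifier to the form that gave the best test error.'"""
    best_model, best_err = None, np.inf
    for m in range(n_rounds):
        model = run_round(m)
        err = test_error(model)
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err

# Simulated test-error curve: improves for a while, then overfits.
errs = [0.50, 0.30, 0.20, 0.25, 0.40]
best_model, best_err = train_with_early_stopping(lambda m: m, len(errs),
                                                 lambda m: errs[m])
```

Here the "model" is just the round index, so the sketch isolates the stopping logic itself: the returned model is the one from round 2, where the simulated test error bottomed out.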


More information

Statistical Methods to Develop Rating Models

Statistical Methods to Develop Rating Models Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING Matthew J. Lberatore, Department of Management and Operatons, Vllanova Unversty, Vllanova, PA 19085, 610-519-4390,

More information

HÜCKEL MOLECULAR ORBITAL THEORY

HÜCKEL MOLECULAR ORBITAL THEORY 1 HÜCKEL MOLECULAR ORBITAL THEORY In general, the vast maorty polyatomc molecules can be thought of as consstng of a collecton of two electron bonds between pars of atoms. So the qualtatve pcture of σ

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

Regression Models for a Binary Response Using EXCEL and JMP

Regression Models for a Binary Response Using EXCEL and JMP SEMATECH 997 Statstcal Methods Symposum Austn Regresson Models for a Bnary Response Usng EXCEL and JMP Davd C. Trndade, Ph.D. STAT-TECH Consultng and Tranng n Appled Statstcs San Jose, CA Topcs Practcal

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and POLYSA: A Polynomal Algorthm for Non-bnary Constrant Satsfacton Problems wth and Mguel A. Saldo, Federco Barber Dpto. Sstemas Informátcos y Computacón Unversdad Poltécnca de Valenca, Camno de Vera s/n

More information

Single and multiple stage classifiers implementing logistic discrimination

Single and multiple stage classifiers implementing logistic discrimination Sngle and multple stage classfers mplementng logstc dscrmnaton Hélo Radke Bttencourt 1 Dens Alter de Olvera Moraes 2 Vctor Haertel 2 1 Pontfíca Unversdade Católca do Ro Grande do Sul - PUCRS Av. Ipranga,

More information

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM BARRIOT Jean-Perre, SARRAILH Mchel BGI/CNES 18.av.E.Beln 31401 TOULOUSE Cedex 4 (France) Emal: [email protected] 1/Introducton The

More information

PERRON FROBENIUS THEOREM

PERRON FROBENIUS THEOREM PERRON FROBENIUS THEOREM R. CLARK ROBINSON Defnton. A n n matrx M wth real entres m, s called a stochastc matrx provded () all the entres m satsfy 0 m, () each of the columns sum to one, m = for all, ()

More information

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications CMSC828G Prncples of Data Mnng Lecture #9 Today s Readng: HMS, chapter 9 Today s Lecture: Descrptve Modelng Clusterng Algorthms Descrptve Models model presents the man features of the data, a global summary

More information

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

How To Understand The Results Of The German Meris Cloud And Water Vapour Product Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

Formulating & Solving Integer Problems Chapter 11 289

Formulating & Solving Integer Problems Chapter 11 289 Formulatng & Solvng Integer Problems Chapter 11 289 The Optonal Stop TSP If we drop the requrement that every stop must be vsted, we then get the optonal stop TSP. Ths mght correspond to a ob sequencng

More information

A Lyapunov Optimization Approach to Repeated Stochastic Games

A Lyapunov Optimization Approach to Repeated Stochastic Games PROC. ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING, OCT. 2013 1 A Lyapunov Optmzaton Approach to Repeated Stochastc Games Mchael J. Neely Unversty of Southern Calforna http://www-bcf.usc.edu/

More information

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance Calbraton Method Instances of the Cell class (one nstance for each FMS cell) contan ADC raw data and methods assocated wth each partcular FMS cell. The calbraton method ncludes event selecton (Class Cell

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

The Mathematical Derivation of Least Squares

The Mathematical Derivation of Least Squares Pscholog 885 Prof. Federco The Mathematcal Dervaton of Least Squares Back when the powers that e forced ou to learn matr algera and calculus, I et ou all asked ourself the age-old queston: When the hell

More information

Section 5.4 Annuities, Present Value, and Amortization

Section 5.4 Annuities, Present Value, and Amortization Secton 5.4 Annutes, Present Value, and Amortzaton Present Value In Secton 5.2, we saw that the present value of A dollars at nterest rate per perod for n perods s the amount that must be deposted today

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

Support vector domain description

Support vector domain description Pattern Recognton Letters 20 (1999) 1191±1199 www.elsever.nl/locate/patrec Support vector doman descrpton Davd M.J. Tax *,1, Robert P.W. Dun Pattern Recognton Group, Faculty of Appled Scence, Delft Unversty

More information

Activity Scheduling for Cost-Time Investment Optimization in Project Management

Activity Scheduling for Cost-Time Investment Optimization in Project Management PROJECT MANAGEMENT 4 th Internatonal Conference on Industral Engneerng and Industral Management XIV Congreso de Ingenería de Organzacón Donosta- San Sebastán, September 8 th -10 th 010 Actvty Schedulng

More information

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008 Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn

More information

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School Robust Desgn of Publc Storage Warehouses Yemng (Yale) Gong EMLYON Busness School Rene de Koster Rotterdam school of management, Erasmus Unversty Abstract We apply robust optmzaton and revenue management

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

IMPACT ANALYSIS OF A CELLULAR PHONE

IMPACT ANALYSIS OF A CELLULAR PHONE 4 th ASA & μeta Internatonal Conference IMPACT AALYSIS OF A CELLULAR PHOE We Lu, 2 Hongy L Bejng FEAonlne Engneerng Co.,Ltd. Bejng, Chna ABSTRACT Drop test smulaton plays an mportant role n nvestgatng

More information

Learning from Multiple Outlooks

Learning from Multiple Outlooks Learnng from Multple Outlooks Maayan Harel Department of Electrcal Engneerng, Technon, Hafa, Israel She Mannor Department of Electrcal Engneerng, Technon, Hafa, Israel [email protected] [email protected]

More information

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University Characterzaton of Assembly Varaton Analyss Methods A Thess Presented to the Department of Mechancal Engneerng Brgham Young Unversty In Partal Fulfllment of the Requrements for the Degree Master of Scence

More information

n + d + q = 24 and.05n +.1d +.25q = 2 { n + d + q = 24 (3) n + 2d + 5q = 40 (2)

n + d + q = 24 and.05n +.1d +.25q = 2 { n + d + q = 24 (3) n + 2d + 5q = 40 (2) MATH 16T Exam 1 : Part I (In-Class) Solutons 1. (0 pts) A pggy bank contans 4 cons, all of whch are nckels (5 ), dmes (10 ) or quarters (5 ). The pggy bank also contans a con of each denomnaton. The total

More information

An Algorithm for Data-Driven Bandwidth Selection

An Algorithm for Data-Driven Bandwidth Selection IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 25, NO. 2, FEBRUARY 2003 An Algorthm for Data-Drven Bandwdth Selecton Dorn Comancu, Member, IEEE Abstract The analyss of a feature space

More information

where the coordinates are related to those in the old frame as follows.

where the coordinates are related to those in the old frame as follows. Chapter 2 - Cartesan Vectors and Tensors: Ther Algebra Defnton of a vector Examples of vectors Scalar multplcaton Addton of vectors coplanar vectors Unt vectors A bass of non-coplanar vectors Scalar product

More information

Trade Adjustment and Productivity in Large Crises. Online Appendix May 2013. Appendix A: Derivation of Equations for Productivity

Trade Adjustment and Productivity in Large Crises. Online Appendix May 2013. Appendix A: Derivation of Equations for Productivity Trade Adjustment Productvty n Large Crses Gta Gopnath Department of Economcs Harvard Unversty NBER Brent Neman Booth School of Busness Unversty of Chcago NBER Onlne Appendx May 2013 Appendx A: Dervaton

More information

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007.

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. Inter-Ing 2007 INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. UNCERTAINTY REGION SIMULATION FOR A SERIAL ROBOT STRUCTURE MARIUS SEBASTIAN

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

Point cloud to point cloud rigid transformations. Minimizing Rigid Registration Errors

Point cloud to point cloud rigid transformations. Minimizing Rigid Registration Errors Pont cloud to pont cloud rgd transformatons Russell Taylor 600.445 1 600.445 Fall 000-014 Copyrght R. H. Taylor Mnmzng Rgd Regstraton Errors Typcally, gven a set of ponts {a } n one coordnate system and

More information

An Empirical Study of Search Engine Advertising Effectiveness

An Empirical Study of Search Engine Advertising Effectiveness An Emprcal Study of Search Engne Advertsng Effectveness Sanjog Msra, Smon School of Busness Unversty of Rochester Edeal Pnker, Smon School of Busness Unversty of Rochester Alan Rmm-Kaufman, Rmm-Kaufman

More information

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble 1 ECE544NA Fnal Project: Robust Machne Learnng Hardware va Classfer Ensemble Sa Zhang, [email protected] Dept. of Electr. & Comput. Eng., Unv. of Illnos at Urbana-Champagn, Urbana, IL, USA Abstract In

More information

Data Visualization by Pairwise Distortion Minimization

Data Visualization by Pairwise Distortion Minimization Communcatons n Statstcs, Theory and Methods 34 (6), 005 Data Vsualzaton by Parwse Dstorton Mnmzaton By Marc Sobel, and Longn Jan Lateck* Department of Statstcs and Department of Computer and Informaton

More information

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features

On-Line Fault Detection in Wind Turbine Transmission System using Adaptive Filter and Robust Statistical Features On-Lne Fault Detecton n Wnd Turbne Transmsson System usng Adaptve Flter and Robust Statstcal Features Ruoyu L Remote Dagnostcs Center SKF USA Inc. 3443 N. Sam Houston Pkwy., Houston TX 77086 Emal: [email protected]

More information

Using Series to Analyze Financial Situations: Present Value

Using Series to Analyze Financial Situations: Present Value 2.8 Usng Seres to Analyze Fnancal Stuatons: Present Value In the prevous secton, you learned how to calculate the amount, or future value, of an ordnary smple annuty. The amount s the sum of the accumulated

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000 Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

More information

7.5. Present Value of an Annuity. Investigate

7.5. Present Value of an Annuity. Investigate 7.5 Present Value of an Annuty Owen and Anna are approachng retrement and are puttng ther fnances n order. They have worked hard and nvested ther earnngs so that they now have a large amount of money on

More information

Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering Out-of-Sample Extensons for LLE, Isomap, MDS, Egenmaps, and Spectral Clusterng Yoshua Bengo, Jean-Franços Paement, Pascal Vncent Olver Delalleau, Ncolas Le Roux and Mare Oumet Département d Informatque

More information

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Research Note APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES * Iranan Journal of Scence & Technology, Transacton B, Engneerng, ol. 30, No. B6, 789-794 rnted n The Islamc Republc of Iran, 006 Shraz Unversty "Research Note" ALICATION OF CHARGE SIMULATION METHOD TO ELECTRIC

More information

Performance Analysis and Coding Strategy of ECOC SVMs

Performance Analysis and Coding Strategy of ECOC SVMs Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information