Mean Field Theory for Sigmoid Belief Networks. Abstract


Journal of Artificial Intelligence Research 4 (1996). Submitted 11/95; published 3/96.

Mean Field Theory for Sigmoid Belief Networks

Lawrence K. Saul
Tommi Jaakkola
Michael I. Jordan
Center for Biological and Computational Learning
Massachusetts Institute of Technology
79 Amherst Street, E Cambridge, MA

Abstract

We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition: the classification of handwritten digits.

1. Introduction

Bayesian belief networks (Pearl, 1988; Lauritzen & Spiegelhalter, 1988) provide a rich graphical representation of probabilistic models. The nodes in these networks represent random variables, while the links represent causal influences. These associations endow directed acyclic graphs (DAGs) with a precise probabilistic semantics. The ease of interpretation afforded by this semantics explains the growing appeal of belief networks, now widely used as models of planning, reasoning, and uncertainty.

Inference and learning in belief networks are possible insofar as one can efficiently compute (or approximate) the likelihood of observed patterns of evidence (Buntine, 1994; Russell, Binder, Koller, & Kanazawa, 1995). There exist provably efficient algorithms for computing likelihoods in belief networks with tree or chain-like architectures. In practice, these algorithms also tend to perform well on more general sparse networks. However, for networks in which nodes have many parents, the exact algorithms are too slow (Jensen, Kong, & Kjaerulff, 1995). Indeed, in large networks with dense or layered connectivity, exact methods are intractable as they require summing over an exponentially large number of hidden states.
One approach to dealing with such networks has been to use Gibbs sampling (Pearl, 1988), a stochastic simulation methodology with roots in statistical mechanics (Geman & Geman, 1984). Our approach in this paper relies on a different tool from statistical mechanics, namely mean field theory (Parisi, 1988). The mean field approximation is well known for probabilistic models that can be represented as undirected graphs, so-called Markov networks. For example, in Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985), mean field learning rules have been shown to yield tremendous savings in time and computation over sampling-based methods (Peterson & Anderson, 1987).

The main motivation for this work was to extend the mean field approximation for undirected graphical models to their directed counterparts. Since belief networks can be transformed to Markov networks, and mean field theories for Markov networks are well known, it is natural to ask why a new framework is required at all. The reason is that probabilistic models which have compact representations as DAGs may have unwieldy representations as undirected graphs. As we shall see, avoiding this complexity and working directly on DAGs requires an extension of existing methods.

In this paper we focus on sigmoid belief networks (Neal, 1992), for which the resulting mean field theory is most straightforward. These are networks of binary random variables whose local conditional distributions are based on log-linear models. We develop a mean field approximation for these networks and use it to compute a lower bound on the likelihood of evidence. Our method applies to arbitrary partial instantiations of the variables in these networks and makes no restrictions on the network topology. Note that once a lower bound is available, a learning procedure can maximize the lower bound; this is useful when the true likelihood itself cannot be computed efficiently. A similar approximation for models of continuous random variables is discussed by Jaakkola et al. (1995).

©1996 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

The idea of bounding the likelihood in sigmoid belief networks was introduced in a related architecture known as the Helmholtz machine (Hinton, Dayan, Frey, & Neal, 1995). A fundamental advance of this work was to establish a framework for approximation that is especially conducive to learning the parameters of layered belief networks. The close connection between this idea and the mean field approximation from statistical mechanics, however, was not developed.

In this paper we hope not only to elucidate this connection, but also to convey a sense of which approximations are likely to generate useful lower bounds while, at the same time, remaining analytically tractable. We develop here what is perhaps the simplest such approximation for belief networks, noting that more sophisticated methods (Jaakkola & Jordan, 1996a; Saul & Jordan, 1995) are also available. It should be emphasized that approximations of some form are required to handle the multilayer neural networks used in statistical pattern recognition. For these networks, exact algorithms are hopelessly intractable; moreover, Gibbs sampling methods are impractically slow.

The organization of this paper is as follows. Section 2 introduces the problems of inference and learning in sigmoid belief networks. Section 3 contains the main contribution of the paper: a tractable mean field theory.
Here we present the mean field approximation for sigmoid belief networks and derive a lower bound on the likelihood of instantiated patterns of evidence. Section 4 looks at a mean field algorithm for learning the parameters of sigmoid belief networks. For this algorithm, we give results on a benchmark problem in pattern recognition: the classification of handwritten digits. Finally, section 5 presents our conclusions, as well as future issues for research.

2. Sigmoid Belief Networks

The great virtue of belief networks is that they clearly exhibit the conditional dependencies of the underlying probability model. Consider a belief network defined over binary random variables $S = (S_1, S_2, \ldots, S_N)$. We denote the parents of $S_i$ by $\mathrm{pa}(S_i) \subseteq \{S_1, S_2, \ldots, S_{i-1}\}$; this is the smallest set of nodes for which

$$P(S_i \mid S_1, S_2, \ldots, S_{i-1}) = P(S_i \mid \mathrm{pa}(S_i)). \tag{1}$$

In sigmoid belief networks (Neal, 1992), the conditional distributions attached to each node are based on log-linear models. In particular, the probability that the $i$th node is activated is given by

$$P(S_i = 1 \mid \mathrm{pa}(S_i)) = \sigma\Big(\sum_j J_{ij} S_j + h_i\Big), \tag{2}$$

where $J_{ij}$ and $h_i$ are the weights and biases in the network, and

$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{3}$$

is the sigmoid function shown in Figure 1. In sigmoid belief networks, we have $J_{ij} = 0$ for $S_j \notin \mathrm{pa}(S_i)$; moreover, $J_{ij} = 0$ for $j \ge i$ since the network's structure is that of a directed acyclic graph.

Figure 1: Sigmoid function $\sigma(z) = [1 + e^{-z}]^{-1}$. If $z$ is the sum of weighted inputs to node $S$, then $P(S = 1 \mid z) = \sigma(z)$ is the conditional probability that node $S$ is activated.

The sigmoid function in eq. (3) provides a compact parametrization of the conditional probability distributions¹ in eq. (2) used to propagate beliefs. In particular, $P(S_i \mid \mathrm{pa}(S_i))$ depends on $\mathrm{pa}(S_i)$ only through a sum of weighted inputs, where the weights may be viewed as the parameters in a logistic regression (McCullagh & Nelder, 1983). The conditional probability distribution for $S_i$ may be summarized as:

$$P(S_i \mid \mathrm{pa}(S_i)) = \frac{\exp\big[\big(\sum_j J_{ij} S_j + h_i\big) S_i\big]}{1 + \exp\big[\sum_j J_{ij} S_j + h_i\big]}. \tag{4}$$

Note that substituting $S_i = 1$ in eq. (4) recovers the result in eq. (2). Combining eqs. (1) and (4), we may write the joint probability distribution over the variables in the network as:

$$P(S) = \prod_i P(S_i \mid \mathrm{pa}(S_i)) \tag{5}$$

$$\phantom{P(S)} = \prod_i \left\{ \frac{\exp\big[\big(\sum_j J_{ij} S_j + h_i\big) S_i\big]}{1 + \exp\big[\sum_j J_{ij} S_j + h_i\big]} \right\}. \tag{6}$$

The denominator in eq. (6) ensures that the probability distribution is normalized to unity.

1. The relation to noisy-OR models is discussed in appendix A.

We now turn to the problem of inference in sigmoid belief networks. Absorbing evidence divides the units in the belief network into two types, visible and hidden. The visible units (or "evidence nodes") are those for which we have instantiated values; the hidden units are those for which we do not. When there is no possible ambiguity, we will use $H$ and $V$ to denote the subsets of hidden and visible units. Using Bayes' rule, inference is done under the conditional distribution

$$P(H \mid V) = \frac{P(H, V)}{P(V)}, \tag{7}$$

where

$$P(V) = \sum_H P(H, V) \tag{8}$$

is the likelihood of the evidence $V$. In principle, the likelihood may be computed by summing over all $2^{|H|}$ configurations of the hidden units. Unfortunately, this calculation is intractable in large, densely connected networks. This intractability presents a major obstacle to learning parameters for these networks, as nearly all procedures for statistical estimation require frequent estimates of the likelihood. The calculations for exact probabilistic inference are beset by the same difficulties.
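As an illustration of eqs. (2), (6), and (8), the following minimal sketch evaluates the joint distribution of a tiny sigmoid belief network and computes a likelihood by brute-force enumeration of the hidden units. The three-unit network, its weights, and the function names are hypothetical choices made for this example; the enumeration loop is exactly the $2^{|H|}$ sum that becomes intractable in large networks.

```python
import itertools
import math

def sigmoid(z):
    """Sigmoid function of eq. (3)."""
    return 1.0 / (1.0 + math.exp(-z))

def joint_prob(s, J, h):
    """P(S) from eq. (6): a product of the conditionals in eq. (4)."""
    p = 1.0
    for i in range(len(s)):
        # Units are in topological order, so only j < i can be parents.
        z = h[i] + sum(J[i][j] * s[j] for j in range(i))
        p_on = sigmoid(z)
        p *= p_on if s[i] == 1 else 1.0 - p_on
    return p

def likelihood(visible, J, h):
    """P(V) from eq. (8): sum the joint over all 2^|H| hidden states.
    `visible` maps a unit index to its instantiated value."""
    n = len(h)
    hidden = [i for i in range(n) if i not in visible]
    total = 0.0
    for values in itertools.product((0, 1), repeat=len(hidden)):
        s = [0] * n
        for i, v in visible.items():
            s[i] = v
        for i, v in zip(hidden, values):
            s[i] = v
        total += joint_prob(s, J, h)
    return total

# Hypothetical toy network: units 0 and 1 are roots, unit 2 has both as parents.
J = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [1.5, -2.0, 0.0]]
h = [0.3, -0.4, 0.1]
p_evidence = likelihood({2: 1}, J, h)  # likelihood of observing S_2 = 1
```

The cost of `likelihood` doubles with every additional hidden unit, which is why the sketch is only usable for very small networks.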
Unable to compute $P(V)$ or work directly with $P(H \mid V)$, we will resort to an approximation from statistical physics known as mean field theory.

3. Mean Field Theory

The mean field approximation appears under a multitude of guises in the physics literature; indeed, it is "almost as old as statistical mechanics" (Itzykson & Drouffe, 1991). Let us briefly explain how it acquired its name and why it is so ubiquitous. In the physical models described by Markov networks, the variables $S_i$ represent localized magnetic moments (e.g., at the sites of a crystal lattice), and the sums $\sum_j J_{ij} S_j + h_i$ represent local magnetic fields. Roughly speaking, in certain cases a central limit theorem may be applied to these sums, and a useful approximation is to ignore the fluctuations in these fields and replace them by their mean value; hence the name, "mean field" theory. In some models, this is an excellent approximation; in others, a poor one. Because of its simplicity, however, it is widely used as a first step in understanding many types of physical phenomena.

Though this explains the philological origins of mean field theory, there are in fact many ways to derive what amounts to the same approximation (Parisi, 1988). In this paper we present the formulation most appropriate for inference and learning in graphical models. In particular, we view mean field theory as a principled method for approximating an intractable graphical model by a tractable one. This is done via a variational principle that chooses the parameters of the tractable model to minimize an entropic measure of error.

The basic framework of mean field theory remains the same for directed graphs, though we have found it necessary to introduce extra mean field parameters in addition to the usual ones. As in Markov networks, one finds a set of nonlinear equations for the mean field parameters that can be solved by iteration. In practice, we have found this iteration to converge fairly quickly and to scale well to large networks.

Let us now return to the problem posed at the end of the last section.
There we found that for many belief networks, it was intractable to decompose the joint distribution as $P(S) = P(H \mid V) P(V)$, where $P(V)$ was the likelihood of the evidence $V$. For the purposes of probabilistic modeling, mean field theory has two main virtues. First, it provides a tractable approximation, $Q(H \mid V) \approx P(H \mid V)$, to the conditional distributions required for inference. Second, it provides a lower bound on the likelihoods required for learning.

Let us first consider the origin of the lower bound. Clearly, for any approximating distribution $Q(H \mid V)$, we have the equality:

$$\ln P(V) = \ln \sum_H P(H, V) \tag{9}$$

$$\phantom{\ln P(V)} = \ln \sum_H Q(H \mid V) \cdot \left[\frac{P(H, V)}{Q(H \mid V)}\right]. \tag{10}$$

To obtain a lower bound, we now apply Jensen's inequality (Cover & Thomas, 1991), pushing the logarithm through the sum over hidden states and into the expectation:

$$\ln P(V) \ge \sum_H Q(H \mid V) \ln\left[\frac{P(H, V)}{Q(H \mid V)}\right]. \tag{11}$$

It is straightforward to verify that the difference between the left and right hand side of eq. (11) is the Kullback-Leibler divergence (Cover & Thomas, 1991):

$$\mathrm{KL}(Q \| P) = \sum_H Q(H \mid V) \ln\left[\frac{Q(H \mid V)}{P(H \mid V)}\right]. \tag{12}$$

Thus, the better the approximation to $P(H \mid V)$, the tighter the bound on $\ln P(V)$.
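The relationship between eqs. (11) and (12) can be checked numerically. The sketch below uses a hypothetical table of values $P(H, V)$ for two binary hidden units and an arbitrary factorized $Q(H \mid V)$; the numbers and function names are assumptions made for this example, not taken from the paper.

```python
import math

def lower_bound(joint, q):
    """Right-hand side of eq. (11): sum_H Q(H|V) ln[P(H,V) / Q(H|V)]."""
    return sum(q[c] * math.log(joint[c] / q[c]) for c in joint if q[c] > 0)

def kl_divergence(joint, q):
    """KL(Q||P) of eq. (12), with the posterior P(H|V) = P(H,V) / P(V)."""
    p_v = sum(joint.values())
    return sum(q[c] * math.log(q[c] * p_v / joint[c]) for c in joint if q[c] > 0)

# Hypothetical values of P(H, V) for two hidden units and fixed evidence V.
joint = {(0, 0): 0.10, (0, 1): 0.05, (1, 0): 0.02, (1, 1): 0.13}

# An arbitrary factorized Q with means 0.6 and 0.7 for the two hidden units.
mu = (0.6, 0.7)
q = {(a, b): (mu[0] if a else 1 - mu[0]) * (mu[1] if b else 1 - mu[1])
     for (a, b) in joint}

log_pv = math.log(sum(joint.values()))  # true log-likelihood ln P(V)
bound = lower_bound(joint, q)           # Jensen lower bound, eq. (11)
gap = kl_divergence(joint, q)           # should equal log_pv - bound
```

Setting `q` equal to the true posterior makes `gap` vanish, so the bound becomes an equality.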
Anticipating the connection to statistical mechanics, we will refer to $Q(H \mid V)$ as the mean field distribution. It is natural to divide the calculation of the bound into two components, both of which are particular averages over this approximating distribution. These components are the mean field entropy and energy; the overall bound is given by their difference:

$$\ln P(V) \ge \left(-\sum_H Q(H \mid V) \ln Q(H \mid V)\right) - \left(-\sum_H Q(H \mid V) \ln P(H, V)\right). \tag{13}$$

Both terms have physical interpretations. The first measures the amount of uncertainty in the mean field distribution and follows the standard definition of entropy. The second measures the average value² of $-\ln P(H, V)$; the name "energy" arises from interpreting the probability distributions in belief networks as Boltzmann distributions³ at unit temperature. In this case, the energy of each network configuration is given (up to a constant) by minus the logarithm of its probability under the Boltzmann distribution. In sigmoid belief networks, the energy has the form

$$-\ln P(H, V) = -\sum_{ij} J_{ij} S_i S_j - \sum_i h_i S_i + \sum_i \ln\Big[1 + \exp\Big(\sum_j J_{ij} S_j + h_i\Big)\Big], \tag{14}$$

as follows from eq. (6). The first two terms in this equation are familiar from Markov networks with pairwise interactions (Hertz, Krogh, & Palmer, 1991); the last term is peculiar to sigmoid belief networks. Note that the overall energy is neither a linear function of the weights nor a polynomial function of the units. This is the price we pay in sigmoid belief networks for identifying $P(H \mid V)$ as a Boltzmann distribution and the likelihood $P(V)$ as its partition function. Note that this identification was made implicitly in the form of eqs. (7) and (8).

2. A similar average is performed in the E-step of an EM algorithm (Dempster, Laird, & Rubin, 1977); the difference here is that the average is performed over the mean field distribution, $Q(H \mid V)$, rather than the true posterior, $P(H \mid V)$. For a related discussion, see Neal & Hinton (1993).

3. Our terminology is as follows. Let $S$ denote the degrees of freedom in a statistical mechanical system. The energy of the system, $E(S)$, is a real-valued function of these degrees of freedom, and the Boltzmann distribution

$$P(S) = \frac{e^{-\beta E(S)}}{\sum_S e^{-\beta E(S)}}$$

defines a probability distribution over the possible configurations of $S$. The parameter $\beta$ is the inverse temperature; it serves to calibrate the energy scale and will be fixed to unity in our discussion of belief networks. Finally, the sum in the denominator, known as the partition function, ensures that the Boltzmann distribution is normalized to unity.

The bound in eq. (11) is valid for any probability distribution $Q(H \mid V)$. To make use of it, however, we must choose a distribution that enables us to evaluate the right hand side of eq. (11). Consider the factorized distribution

$$Q(H \mid V) = \prod_{i \in H} \mu_i^{S_i} (1 - \mu_i)^{1 - S_i}, \tag{15}$$

in which the binary hidden units $\{S_i\}_{i \in H}$ appear as independent Bernoulli variables with adjustable means $\mu_i$. A mean field approximation is obtained by substituting the factorized distribution, eq. (15), for the true Boltzmann distribution, eq. (7). It may seem that this approximation replaces the rich probabilistic dependencies in $P(H \mid V)$ by an impoverished assumption of complete factorizability. Though this is true to some degree, the reader should keep in mind that the values we choose for $\{\mu_i\}_{i \in H}$ (and hence the statistics of the hidden units) will depend on the evidence $V$.

The best approximation of the form, eq. (15), is found by choosing the mean values, $\{\mu_i\}_{i \in H}$, that minimize the Kullback-Leibler divergence, $\mathrm{KL}(Q \| P)$. This is equivalent to minimizing the gap between the true log-likelihood, $\ln P(V)$, and the lower bound obtained from mean field theory. The mean field bound on the log-likelihood may be calculated by substituting eq. (15) into the right hand side of eq. (11). The result of this calculation is

$$\begin{aligned}
\ln P(V) \ge{}& \sum_{ij} J_{ij} \mu_i \mu_j + \sum_i h_i \mu_i - \sum_i \Big\langle \ln\Big[1 + e^{\sum_j J_{ij} S_j + h_i}\Big] \Big\rangle \\
& - \sum_i \big[\mu_i \ln \mu_i + (1 - \mu_i) \ln(1 - \mu_i)\big],
\end{aligned} \tag{16}$$

where $\langle \cdot \rangle$ indicates an expectation value over the mean field distribution, eq. (15). The terms in the first line of eq. (16) represent the mean field energy, derived from eq. (14); those in the second represent the mean field entropy. In a slight abuse of notation, we have defined mean values $\mu_i$ for the visible units; these of course are set to the instantiated values $\mu_i \in \{0, 1\}$.

Note that to compute the average energy in the mean field approximation, we must find the expected value $\langle \ln[1 + e^{z_i}] \rangle$, where $z_i = \sum_j J_{ij} S_j + h_i$ is the sum of weighted inputs to the $i$th unit in the belief network. Unfortunately, even under the mean field assumption that the hidden units are uncorrelated, this average does not have a simple closed form. This term does not arise in the mean field theory for Markov networks with pairwise interactions; again, it is peculiar to sigmoid belief networks.

In principle, the average may be performed by enumerating the possible states of $\mathrm{pa}(S_i)$. The result of this calculation, however, would be an extremely unwieldy function of the parameters in the belief network. This reflects the fact that in general, the sigmoid belief network defined by the weights $J_{ij}$ has an equivalent Markov network with $N$th order interactions and not pairwise ones. To avoid this complexity, we must develop a mean field theory that works directly on DAGs.

How we handle the expected value $\langle \ln[1 + e^{z_i}] \rangle$ is what distinguishes our mean field theory from previous ones. Unable to compute this term exactly, we resort to another bound.
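To make the difficulty concrete, the following sketch computes $\langle \ln[1 + e^{z_i}] \rangle$ exactly for a single unit by enumerating all states of its parents under the factorized distribution of eq. (15). The function name, weights, and means are hypothetical choices for this example. The loop visits $2^n$ states for $n$ parents, which is the cost the text describes as unwieldy.

```python
import itertools
import math

def exact_log_term(J_row, h_i, mu):
    """<ln(1 + e^{z_i})> with z_i = sum_j J_ij S_j + h_i, where the parents
    S_j are independent Bernoulli variables with means mu[j].
    J_row holds the weights J_ij for a fixed unit i. Cost: 2^len(mu) terms."""
    total = 0.0
    for states in itertools.product((0, 1), repeat=len(mu)):
        prob = 1.0
        for s, m in zip(states, mu):
            prob *= m if s == 1 else 1.0 - m
        z = h_i + sum(w * s for w, s in zip(J_row, states))
        total += prob * math.log1p(math.exp(z))
    return total

# Hypothetical unit with two parents.
weights, bias, means = [1.0, -2.0], 0.5, [0.3, 0.6]
value = exact_log_term(weights, bias, means)

# Because ln(1 + e^z) is convex in z, the exact average can never fall below
# the same function evaluated at the mean input <z_i>.
mean_z = bias + sum(w * m for w, m in zip(weights, means))
naive = math.log1p(math.exp(mean_z))
```

Replacing the average by `naive` would therefore underestimate the energy term; this is one way to see why a bound in the other direction is needed to preserve a lower bound on the log-likelihood.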
Note that for any random variable $z$ and any real number $\xi$, we have the equality:

$$\langle \ln[1 + e^z] \rangle = \big\langle \ln\big[e^{\xi z} e^{-\xi z} (1 + e^z)\big] \big\rangle \tag{17}$$

$$\phantom{\langle \ln[1 + e^z] \rangle} = \xi \langle z \rangle + \big\langle \ln\big[e^{-\xi z} + e^{(1 - \xi) z}\big] \big\rangle. \tag{18}$$

We can upper bound the right hand side by applying Jensen's inequality in the opposite direction as before, pulling the logarithm outside the expectation:

$$\langle \ln[1 + e^z] \rangle \le \xi \langle z \rangle + \ln \big\langle e^{-\xi z} + e^{(1 - \xi) z} \big\rangle. \tag{19}$$

Setting $\xi = 0$ in eq. (19) gives the standard bound: $\langle \ln(1 + e^z) \rangle \le \ln \langle 1 + e^z \rangle$. A tighter bound (Seung, 1995) can be obtained, however, by allowing nonzero values of $\xi$. This is illustrated in Figure 2 for the special case where $z$ is a Gaussian distributed random variable with zero mean and unit variance. The bound in eq. (19) has two useful properties which we state here without proof: (i) the right hand side is a convex function of $\xi$; (ii) the value of $\xi$ which minimizes this function occurs in the interval $\xi \in [0, 1]$. Thus, provided it is possible to evaluate eq. (19) for different values of $\xi$, the tightest bound of this form can be found by a simple one-dimensional minimization.

The above bound can be put to immediate use by attaching an extra mean field parameter $\xi_i$ to each unit in the belief network. We can then upper bound the intractable terms in the mean field energy by

$$\Big\langle \ln\Big[1 + e^{\sum_j J_{ij} S_j + h_i}\Big] \Big\rangle \le \xi_i \Big(\sum_j J_{ij} \mu_j + h_i\Big) + \ln \big\langle e^{-\xi_i z_i} + e^{(1 - \xi_i) z_i} \big\rangle, \tag{20}$$
where $z_i = \sum_j J_{ij} S_j + h_i$. The expectations inside the logarithm can be evaluated exactly for the factorial distribution, eq. (15); for example,

$$\langle e^{-\xi_i z_i} \rangle = e^{-\xi_i h_i} \prod_j \big[1 - \mu_j + \mu_j e^{-\xi_i J_{ij}}\big]. \tag{21}$$

A similar result holds for $\langle e^{(1 - \xi_i) z_i} \rangle$. Though these averages are tractable, we will tend not to write them out in what follows. The reader, however, should keep in mind that these averages do not present any difficulty; they are simply averages over products of independent random variables, as opposed to sums.

Figure 2: Bound in eq. (19), plotted against $\xi$ together with the exact value, for the case where $z$ is normally distributed with zero mean and unit variance. In this case, the exact result is $\langle \ln(1 + e^z) \rangle = 0.806$; the bound gives $\min_\xi \big\{ \ln\big[e^{\frac{1}{2}\xi^2} + e^{\frac{1}{2}(1 - \xi)^2}\big] \big\} = 0.818$. The standard bound from Jensen's inequality occurs at $\xi = 0$ and gives 0.974.

Assembling the terms in eqs. (16) and (20) gives a lower bound $\ln P(V) \ge L_V$,

$$\begin{aligned}
L_V ={}& \sum_{ij} J_{ij} \mu_i \mu_j + \sum_i h_i \mu_i - \sum_i \xi_i \Big(\sum_j J_{ij} \mu_j + h_i\Big) - \sum_i \ln \big\langle e^{-\xi_i z_i} + e^{(1 - \xi_i) z_i} \big\rangle \\
& - \sum_i \big[\mu_i \ln \mu_i + (1 - \mu_i) \ln(1 - \mu_i)\big],
\end{aligned} \tag{22}$$

on the log-likelihood of the evidence $V$. So far we have not specified the parameters $\{\mu_i\}_{i \in H}$ and $\{\xi_i\}$; in particular, the bound in eq. (22) is valid for any choice of parameters. We naturally seek the values that maximize the right hand side of eq. (22).

Suppose we fix the mean values $\{\mu_i\}_{i \in H}$ and ask for the parameters $\{\xi_i\}$ that yield the tightest possible bound. Note that the right hand side of eq. (22) does not couple terms with $\xi_i$ that belong to different units in the network. The minimization over $\{\xi_i\}$ therefore reduces to $N$ independent minimizations over the interval $[0, 1]$. These can be done by any number of standard methods (Press, Flannery, Teukolsky, & Vetterling, 1986).

To choose the means, we set the gradients of the bound with respect to $\{\mu_i\}_{i \in H}$ equal to zero. To this end, let us define the intermediate matrix:

$$K_{ij} \equiv -\frac{\partial}{\partial \mu_j} \ln \big\langle e^{-\xi_i z_i} + e^{(1 - \xi_i) z_i} \big\rangle, \tag{23}$$

where $z_i$ is the weighted sum of inputs to the $i$th unit. Note that $K_{ij}$ is zero unless $S_j$ is a parent of $S_i$; in other words, it has the same connectivity as the weight matrix $J_{ij}$. Within the mean field approximation, $K_{ij}$ measures the parental influence of $S_j$ on $S_i$ given the instantiated evidence $V$. The degree of correlation (positive or negative) is measured relative to the other parents of $S_i$.

The matrix elements of $K$ may be evaluated by expanding the expectations as in eq. (21); a full derivation is given in appendix B. Setting the gradient $\partial L_V / \partial \mu_i$ equal to zero gives the final mean field equation:

$$\mu_i = \sigma\Big(h_i + \sum_j \big[J_{ij} \mu_j + J_{ji} (\mu_j - \xi_j) + K_{ji}\big]\Big), \tag{24}$$

where $\sigma(\cdot)$ is the sigmoid function. The argument of the sigmoid function may be viewed as an effective input to the $i$th unit in the belief network. This effective input is composed of terms from the unit's Markov blanket (Pearl, 1988), shown in Figure 3; in particular, these terms take into account the unit's internal bias, the values of its parents and children, and, through the matrix $K_{ji}$, the values of its children's other parents. In solving these equations by iteration, the values of the instantiated units are propagated throughout the entire network. An analogous propagation of information occurs in exact algorithms (Lauritzen & Spiegelhalter, 1988) to compute likelihoods in belief networks.

Figure 3: The Markov blanket of unit $S_i$ includes its parents and children, as well as the other parents of its children.

While the factorized approximation to the true posterior is not exact, the mean field equations set the parameters $\{\mu_i\}_{i \in H}$ to values which make the approximation as accurate as possible. This in turn translates into the tightest mean field bound on the log-likelihood. The overall procedure for bounding the log-likelihood thus consists of two alternating steps: (i) update $\{\xi_i\}$ for fixed $\{\mu_i\}$; (ii) update $\{\mu_i\}_{i \in H}$ for fixed $\{\xi_i\}$. The first step involves $N$ independent minimizations over the interval $[0, 1]$; the second is done by iterating the mean field equations.
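The two alternating steps can be sketched in a few dozen lines for a network small enough to check against exact enumeration. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2x3x2 network, the random weights, the ternary search used for step (i), the single asynchronous pass used for step (ii), and all function names are choices made for this example. It uses the product formula of eq. (21), the bound of eq. (22), the matrix elements given in appendix B, and the mean field equation (24).

```python
import itertools
import math
import random

def sigmoid(z):
    return 0.5 * (1.0 + math.tanh(0.5 * z))  # numerically stable form of eq. (3)

def avg_exp(i, a, J, h, mu):
    """<exp(a * z_i)> under the factorized Q; a = -xi_i gives eq. (21)."""
    out = math.exp(a * h[i])
    for j in range(i):  # J[i][j] is nonzero only for parents j < i
        out *= 1.0 - mu[j] + mu[j] * math.exp(a * J[i][j])
    return out

def log_term_bound(i, x, J, h, mu):
    """Right-hand side of eq. (20) for unit i with xi_i = x."""
    mean_z = h[i] + sum(J[i][j] * mu[j] for j in range(i))
    return x * mean_z + math.log(avg_exp(i, -x, J, h, mu) + avg_exp(i, 1.0 - x, J, h, mu))

def best_xi(i, J, h, mu, iters=60):
    """Ternary search on [0, 1], relying on the convexity stated in the text."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if log_term_bound(i, a, J, h, mu) < log_term_bound(i, b, J, h, mu):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def bound(J, h, mu, xi):
    """Mean field lower bound L_V of eq. (22)."""
    total = 0.0
    for i in range(len(h)):
        total += mu[i] * (h[i] + sum(J[i][j] * mu[j] for j in range(i)))
        total -= log_term_bound(i, xi[i], J, h, mu)
        total -= sum(p * math.log(p) for p in (mu[i], 1.0 - mu[i]) if p > 0.0)
    return total

def K(i, j, J, h, mu, xi):
    """Matrix element K_ij (appendix B); zero unless S_j is a parent of S_i."""
    if J[i][j] == 0.0:
        return 0.0
    A, B = avg_exp(i, -xi[i], J, h, mu), avg_exp(i, 1.0 - xi[i], J, h, mu)
    phi = B / (A + B)
    ea, eb = math.exp(-xi[i] * J[i][j]), math.exp((1.0 - xi[i]) * J[i][j])
    return ((1.0 - phi) * (1.0 - ea) / (1.0 - mu[j] + mu[j] * ea)
            + phi * (1.0 - eb) / (1.0 - mu[j] + mu[j] * eb))

def mean_field(J, h, visible, sweeps=50):
    n = len(h)
    mu = [float(visible.get(i, 0.5)) for i in range(n)]
    xi = [0.5] * n
    for _ in range(sweeps):
        xi = [best_xi(i, J, h, mu) for i in range(n)]  # step (i)
        for i in (k for k in range(n) if k not in visible):  # step (ii), eq. (24)
            arg = h[i] + sum(J[i][j] * mu[j] + J[j][i] * (mu[j] - xi[j])
                             + K(j, i, J, h, mu, xi) for j in range(n))
            mu[i] = sigmoid(arg)
    xi = [best_xi(i, J, h, mu) for i in range(n)]
    return mu, xi, bound(J, h, mu, xi)

def exact_log_likelihood(J, h, visible):
    """ln P(V) by brute-force enumeration of the hidden units, eq. (8)."""
    n = len(h)
    hidden = [i for i in range(n) if i not in visible]
    total = 0.0
    for vals in itertools.product((0, 1), repeat=len(hidden)):
        s = dict(visible)
        s.update(zip(hidden, vals))
        p = 1.0
        for i in range(n):
            on = sigmoid(h[i] + sum(J[i][j] * s[j] for j in range(i)))
            p *= on if s[i] == 1 else 1.0 - on
        total += p
    return math.log(total)

# Hypothetical 2x3x2 layered network with weights and biases drawn from [-1, 1].
rng = random.Random(0)
layers, n = [[0, 1], [2, 3, 4], [5, 6]], 7
J = [[0.0] * n for _ in range(n)]
for upper, lower in zip(layers, layers[1:]):
    for i in lower:
        for j in upper:
            J[i][j] = rng.uniform(-1.0, 1.0)
h = [rng.uniform(-1.0, 1.0) for _ in range(n)]
visible = {5: 0, 6: 0}  # bottom layer instantiated to zero
mu, xi, L_V = mean_field(J, h, visible)
log_pv = exact_log_likelihood(J, h, visible)
```

Because eq. (22) is valid for any choice of parameters, `L_V` cannot exceed `log_pv` regardless of how many sweeps are run; the sweeps only affect how tight the bound is.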
In practice, the steps are repeated until the mean field bound on the log-likelihood converges⁴ to a desired degree of accuracy.

4. It can be shown that asynchronous updates of the mean field parameters lead to monotonic increases in the lower bound (just as in the case of Markov networks).

The quality of the bound depends on two approximations: the complete factorizability of the mean field distribution, eq. (15), and the logarithm bound, eq. (19). How reliable are these approximations in belief networks? To study this question, we performed numerical experiments on the three layer belief network shown in Figure 4. The advantage of working with such a small network (2x4x6) is that true likelihoods can be computed by exact enumeration. We considered the particular event that all the units in the bottom layer were instantiated to zero. For this event, we compared the mean field bound on the likelihood to its true value, obtained by enumerating the states in the top two layers. This was done for random networks whose weights and biases were uniformly distributed between -1 and 1.

Figure 4: Three layer belief network (2x4x6) with top-down propagation of beliefs. To model the images of handwritten digits in section 4, we used 8x24x64 networks where units in the bottom layer encoded pixel values in 8x8 bitmaps.

Figure 5: Histograms of relative error in log-likelihood over randomly generated three layer networks. At left (panel "mean field approximation"): the relative error from the mean field approximation; at right (panel "uniform approximation"): the relative error if all states in the bottom layer are assumed to occur with equal probability. The log-likelihood was computed for the event that all the nodes in the bottom layer were instantiated to zero.

Figure 5 (left) shows the histogram of the relative error in log-likelihood, computed as $L_V / \ln P(V) - 1$; for these networks, the mean relative error is 1.6%. Figure 5 (right) shows the histogram that results from assuming that all states in the bottom layer occur with equal probability; in this case the relative error was computed as $(\ln 2^{-6}) / \ln P(V) - 1$. For this "uniform" approximation, the root mean square relative error is 22.6%. The large discrepancy between these results suggests that mean field theory can provide a useful lower bound on the likelihood in certain belief networks. Of course, what ultimately matters is the behavior of mean field theory in networks that solve meaningful problems. This is the subject of the next section.

4. Learning

One attractive use of sigmoid belief networks is to perform density estimation in high dimensional input spaces. This is a problem in parameter estimation: given a set of patterns over particular units in the belief network, find the set of weights $J_{ij}$ and biases $h_i$ that assign high probability to these patterns. Clearly, the ability to compute likelihoods lies at the crux of any algorithm for learning the parameters in belief networks.
Figure 6: Relationship between the true log-likelihood and its lower bound during learning (both plotted against training time). One possibility (at left) is that both increase together. The other is that the true log-likelihood decreases, closing the gap between itself and the bound. The latter can be viewed as a form of regularization.

Mean field algorithms provide a strategy for discovering appropriate values of $J_{ij}$ and $h_i$ without resort to Gibbs sampling. Consider, for instance, the following procedure. For each pattern in the training set, solve the mean field equations for $\{\mu_i, \xi_i\}$ and compute the associated bound on the log-likelihood, $L_V$. Next, adapt the weights in the belief network by gradient ascent⁵ in the mean field bound,

$$\Delta J_{ij} = \eta \frac{\partial L_V}{\partial J_{ij}}, \tag{25}$$

$$\Delta h_i = \eta \frac{\partial L_V}{\partial h_i}, \tag{26}$$

where $\eta$ is a suitably chosen learning rate. Finally, cycle through the patterns in the training set, maximizing their likelihoods⁶ for a fixed number of iterations or until one detects the onset of overfitting (e.g., by cross-validation).

The above procedure uses a lower bound on the log-likelihood as a cost function for training belief networks (Hinton, Dayan, Frey, & Neal, 1995). The fact that we have a lower bound on the log-likelihood, rather than an upper bound, is of course crucial to the success of this learning algorithm. Adjusting the weights to maximize this lower bound can affect the true log-likelihood in two ways (see Figure 6). Either the true log-likelihood increases, moving in the same direction as the bound, or the true log-likelihood decreases, closing the gap between these two quantities. For the purposes of maximum likelihood estimation, the first outcome is clearly desirable; the second, though less desirable, can also be viewed in a positive light. In this case, the mean field approximation is acting as a regularizer, steering the network toward simple, factorial solutions even at the expense of lower likelihood estimates.

We tested this algorithm by building a maximum-likelihood classifier for images of handwritten digits.
The data consisted of examples of handwritten digits [0-9] compiled by the U.S. Postal Service Office of Advanced Technology. The examples were preprocessed to produce 8x8 binary images, as shown in Figure 7. For each digit, we divided the available data into a training set with 700 examples and a test set with 400 examples. We then trained a three layer network⁷ (see Figure 4) on each digit, sweeping through each training set five times with learning rate $\eta = 0.05$. The networks had 8 units in the top layer, 24 units in the middle layer, and 64 units in the bottom layer, making them far too large to be treated with exact methods.

5. Expressions for the gradients of $L_V$ are given in appendix B.

6. Of course, one can also incorporate prior distributions over the weights and biases and maximize an approximation to the log posterior probability of the training set.

7. There are many possible architectures that could be chosen for the purpose of density estimation; we used layered networks to permit a comparison with previous benchmarks on this data set.

Figure 7: Binary images of handwritten digits: two and five.

Table 1: Confusion matrix for digit classification. The entry in the $i$th row and $j$th column counts the number of times that digit $i$ was classified as digit $j$.

After training, we classified the digits in each test set by the network that assigned them the highest likelihood. Table 1 shows the confusion matrix in which the $ij$th entry counts the number of times digit $i$ was classified as digit $j$. There were 184 errors in classification (out of a possible 4000), yielding an overall error rate of 4.6%. Table 2 gives the performance of various other algorithms on the same partition of this data set. Table 3 shows the average log-likelihood score of each network on the digits in its test set. (Note that these scores are actually lower bounds.) These scores are normalized so that a network with zero weights and biases (i.e., one in which all 8x8 patterns are equally likely) would receive a score of -1. As expected, digits with relatively simple constructions (e.g., zeros, ones, and sevens) are more easily modeled than the rest.

Both measures of performance, error rate and log-likelihood score, are competitive with previously published results (Hinton, Dayan, Frey, & Neal, 1995) on this data set. The success of the algorithm affirms both the strategy of maximizing a lower bound and the utility of the mean field approximation. Though similar results can be obtained via Gibbs sampling, this seems to require considerably more computation than methods based on maximizing a lower bound (Frey, Dayan, & Hinton, 1995).
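The classification rule described above, one density model per class with each test pattern assigned to the model that scores it highest, can be sketched as follows. The independent-pixel Bernoulli scorers here are hypothetical stand-ins for the trained networks' mean field bounds, and the three-pixel patterns and labels are made up for the example.

```python
import math

def classify(pattern, scorers):
    """Return the label whose model assigns `pattern` the highest score.
    `scorers` maps a label to a function returning a log-likelihood (or bound)."""
    return max(scorers, key=lambda label: scorers[label](pattern))

def confusion_matrix(examples, scorers, labels):
    """counts[i][j] = number of times true label i was classified as label j."""
    index = {label: k for k, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for pattern, true_label in examples:
        counts[index[true_label]][index[classify(pattern, scorers)]] += 1
    return counts

def error_rate(counts):
    total = sum(sum(row) for row in counts)
    correct = sum(counts[k][k] for k in range(len(counts)))
    return (total - correct) / total

def bernoulli_scorer(p):
    """Stand-in density model: independent pixels with 'on' probabilities p."""
    return lambda x: sum(math.log(pk if xk else 1.0 - pk) for xk, pk in zip(x, p))

# Hypothetical two-class problem on three-pixel patterns.
scorers = {"a": bernoulli_scorer([0.9, 0.9, 0.1]),
           "b": bernoulli_scorer([0.1, 0.2, 0.8])}
examples = [((1, 1, 0), "a"), ((0, 0, 1), "b"), ((0, 1, 1), "a")]
counts = confusion_matrix(examples, scorers, ["a", "b"])
rate = error_rate(counts)

# The error rate quoted in the text: 184 errors out of 4000 test digits.
reported = 184 / 4000
```

Since each scorer in the paper returns a lower bound rather than the true log-likelihood, the comparison between classes is only as reliable as the bounds are uniformly tight.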
Table 2: Classification error rates for the data set of handwritten digits. The first three were reported by Hinton et al. (1995).

algorithm           classification error
nearest neighbor    6.7%
backpropagation     5.6%
wake-sleep          4.8%
mean field          4.6%

Table 3: Normalized log-likelihood score for each network on the digits in its test set (columns: digit, log-likelihood score). To obtain the raw score, multiply by $400 \times 64 \times \ln 2$. The last row shows the score averaged across all digits.

5. Discussion

Endowing networks with probabilistic semantics provides a unified framework for incorporating prior knowledge, handling missing data, and performing inference under uncertainty. Probabilistic calculations, however, can quickly become intractable, so it is important to develop techniques that approximate probability distributions in a flexible manner. This is especially true for networks with multilayer architectures and large numbers of hidden units. Exact algorithms and Gibbs sampling methods are not generally practical for such networks; approximations are required.

In this paper we have developed a mean field approximation for sigmoid belief networks. As a computational tool, our mean field theory has two main virtues: first, it provides a tractable approximation to the conditional distributions required for inference; second, it provides a lower bound on the likelihoods required for learning.

The problem of computing exact likelihoods in belief networks is NP-hard (Cooper, 1990); the same is true for approximating likelihoods to within a guaranteed degree of accuracy (Dagum & Luby, 1993). It follows that one cannot establish universal guarantees for the accuracy of the mean field approximation. For certain networks, clearly, the mean field approximation is bound to fail: it cannot capture logical constraints or strong correlations between fluctuating units. Our preliminary results, however, suggest that these worst-case results do not apply to all belief networks.
It is worth noting, moreover, that all the above qualifications apply to Markov networks, and that in this domain, mean field methods are already well-established.
The idea of bounding the likelihood in sigmoid belief networks was introduced in a related architecture known as the Helmholtz machine (Hinton, Dayan, Neal, & Zemel, 1995). The formalism in this paper differs in a number of respects from the Helmholtz machine. Most importantly, it enables one to compute a rigorous lower bound on the likelihood. This cannot be said for the wake-sleep algorithm (Frey, Hinton, & Dayan, 1995), which relies on sampling-based methods, or the heuristic approximation of Dayan et al. (1995), which does not guarantee a rigorous lower bound. Also, our mean field theory, which takes the place of the "recognition model" of the Helmholtz machine, applies generally to sigmoid belief networks with or without layered structure. Moreover, it places no restrictions on the locations of visible units; they may occur anywhere within the network, an important feature for handling problems with missing data. Of course, these advantages are not accrued without extra computational demands and more complicated learning rules.

In recent work that builds on the theory presented here, we have begun to relax the assumption of complete factorizability in eq. (15). In general, one would expect more sophisticated approximations to the Boltzmann distribution to yield tighter bounds on the log-likelihood. The challenge here is to find distributions that allow for correlations between hidden units while remaining computationally tractable. By tractable, we mean that the choice of $Q(H|V)$ must enable one to evaluate (or at least upper bound) the right hand side of eq. (13). Extensions of this kind include mixture models (Jaakkola & Jordan, 1996a) and/or partially factorized distributions (Saul & Jordan, 1995) that exploit the presence of tractable substructures in the original network. Our approach in this paper has been to work out the simplest mean field theory that is computationally tractable, but clearly better results will be obtained by tailoring the approximation to the problem at hand.

Appendix A. Sigmoid versus Noisy-OR

The semantics of the sigmoid function are similar, but not identical, to the noisy-OR gates (Pearl, 1988) more commonly found in the belief network literature. Noisy-OR gates use the weights in the network to represent independent causal events. In this case, the probability that unit $S_i$ is activated is given by

$$P(S_i = 1 \mid \mathrm{pa}(S_i)) = 1 - \prod_j (1 - p_{ij})^{S_j} \tag{27}$$

where $p_{ij}$ is the probability that $S_j = 1$ causes $S_i = 1$ in the absence of all other causal events. If we define the weights of a noisy-OR belief network by $\theta_{ij} = -\ln(1 - p_{ij})$, it follows that

$$P(S_i = 1 \mid \mathrm{pa}(S_i)) = \rho\Big(\sum_j \theta_{ij} S_j\Big), \tag{28}$$

where $\rho(z) = 1 - e^{-z}$ is the noisy-OR gating function. Comparing this to the sigmoid function, eq. (3), we see that both model $P(S_i = 1 \mid \mathrm{pa}(S_i))$ as a monotonically increasing function of a sum of weighted inputs. The main difference is that in noisy-OR networks, the weights $\theta_{ij}$ are constrained to be positive by an underlying set of probabilities, $p_{ij}$. Recently, Jaakkola and Jordan (1996b) have developed a mean field approximation for noisy-OR belief networks.

Appendix B. Gradients

Here we provide expressions for the gradients that appear in eqs. (23), (25) and (26). As usual, let $z_i = \sum_j J_{ij} S_j + h_i$ denote the sum of inputs into unit $S_i$. Under the factorial distribution, eq. (15), we can compute the averages:

$$\left\langle e^{-\xi_i z_i} \right\rangle = e^{-\xi_i h_i} \prod_j \left[ 1 - \mu_j + \mu_j e^{-\xi_i J_{ij}} \right], \tag{30}$$

$$\left\langle e^{(1-\xi_i) z_i} \right\rangle = e^{(1-\xi_i) h_i} \prod_j \left[ 1 - \mu_j + \mu_j e^{(1-\xi_i) J_{ij}} \right]. \tag{31}$$

For each unit in the network, let us define the quantity

$$\phi_i = \frac{\left\langle e^{(1-\xi_i) z_i} \right\rangle}{\left\langle e^{-\xi_i z_i} + e^{(1-\xi_i) z_i} \right\rangle}. \tag{32}$$

Note that $\phi_i$ lies between zero and one. With this definition, we can write the matrix elements in eq. (23) as:

$$K_{ij} = \frac{(1-\phi_i)\left(1 - e^{-\xi_i J_{ij}}\right)}{1 - \mu_j + \mu_j e^{-\xi_i J_{ij}}} + \frac{\phi_i \left(1 - e^{(1-\xi_i) J_{ij}}\right)}{1 - \mu_j + \mu_j e^{(1-\xi_i) J_{ij}}}. \tag{33}$$

The gradients in eqs. (25) and (26) are found by similar means. Writing $\mathcal{L}$ for the lower bound of eq. (19), for the weights we have

$$\frac{\partial \mathcal{L}}{\partial J_{ij}} = -(\xi_i - \mu_i)\,\mu_j + \frac{(1-\phi_i)\,\xi_i\,\mu_j\, e^{-\xi_i J_{ij}}}{1 - \mu_j + \mu_j e^{-\xi_i J_{ij}}} - \frac{\phi_i\,(1-\xi_i)\,\mu_j\, e^{(1-\xi_i) J_{ij}}}{1 - \mu_j + \mu_j e^{(1-\xi_i) J_{ij}}}. \tag{34}$$

Likewise, for the biases, we have

$$\frac{\partial \mathcal{L}}{\partial h_i} = \mu_i - \phi_i. \tag{35}$$

Finally, we note that one may obtain simpler gradients at the expense of introducing a weaker bound than eq. (19). This can be advantageous when speed of computation is more important than the quality of the bound. All the experiments in this paper used the bound in eq. (19).

Acknowledgements

We are especially grateful to P. Dayan, G. Hinton, B. Frey, R. Neal, and H. Seung for sharing early versions of their manuscripts and for providing many stimulating discussions about this work. The paper was also improved greatly by the comments of several anonymous reviewers. To facilitate comparisons with similar methods, the results reported in this paper used images that were preprocessed at the University of Toronto. The authors acknowledge support from NSF grant CDA..., ONR grant N..., ATR Research Laboratories, and Siemens Corporation.

References

Ackley, D., Hinton, G., & Sejnowski, T. (1985) A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Buntine, W. (1994) Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2.

Cooper, G. (1990) Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393–405.

Cover, T., & Thomas, J. (1991) Elements of Information Theory. New York: John Wiley & Sons.

Dagum, P., & Luby, M. (1993) Approximately probabilistic reasoning in Bayesian belief networks is NP-hard. Artificial Intelligence, 60, 141–
Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995) The Helmholtz machine. Neural Computation, 7, 889–904.

Dempster, A., Laird, N., & Rubin, D. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Frey, B., Hinton, G., & Dayan, P. (1995) Does the wake-sleep algorithm learn good density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo (eds). Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference.

Geman, S., & Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.

Hertz, J., Krogh, A., & Palmer, R. G. (1991) Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley.

Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995) The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

Itzykson, C., & Drouffe, J.-M. (1991) Statistical Field Theory. Cambridge: Cambridge University Press.

Jaakkola, T., Saul, L., & Jordan, M. (1995) Fast learning by bounding likelihoods in sigmoid-type belief networks. In D. Touretzky, M. Mozer, and M. Hasselmo (eds). Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference.

Jaakkola, T., & Jordan, M. (1996a) Mixture model approximations for belief networks. Manuscript in preparation.

Jaakkola, T., & Jordan, M. (1996b) Computing upper and lower bounds on likelihoods in intractable networks. Submitted.

Jensen, C. S., Kong, A., & Kjaerulff, U. (1995) Blocking Gibbs sampling in very large probabilistic expert systems. International Journal of Human Computer Studies. Special Issue on Real-World Applications of Uncertain Reasoning.

Lauritzen, S., & Spiegelhalter, D. (1988) Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

McCullagh, P., & Nelder, J. A. (1983) Generalized Linear Models. London: Chapman and Hall.

Neal, R. (1992) Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.

Neal, R., & Hinton, G. (1993) A new view of the EM algorithm that justifies incremental and other variants. Submitted for publication.

Parisi, G. (1988) Statistical Field Theory. Redwood City, CA: Addison-Wesley.

Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.

Peterson, C., & Anderson, J. R. (1987) A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.

Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986) Numerical Recipes. Cambridge: Cambridge University Press.

Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995) Local learning in probabilistic networks with hidden variables. In Proceedings of IJCAI-95.
Saul, L., & Jordan, M. (1995) Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, and M. Hasselmo (eds). Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference.

Seung, H. (1995) Annealed theories of learning. In J.-H. Oh, C. Kwon, and S. Cho (eds). Neural Networks: The Statistical Mechanics Perspective, Proceedings of the CTPPRSRI Joint Workshop on Theoretical Physics. Singapore: World Scientific.
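
As a numerical companion to Appendices A and B, the following sketch checks two of the identities stated there on a single unit with three parents. It is not part of the original paper: the function names, the use of plain Python, and all parameter values are illustrative assumptions. It compares the product form of the noisy-OR gate with its exponential form, and compares the closed-form factorial averages of Appendix B with a brute-force sum over all parent configurations.

```python
import itertools
import math


def noisy_or_product(p, s):
    """Product form of the noisy-OR gate: 1 - prod_j (1 - p_j)^{s_j}."""
    prob_off = 1.0
    for p_j, s_j in zip(p, s):
        prob_off *= (1.0 - p_j) ** s_j
    return 1.0 - prob_off


def noisy_or_gate(p, s):
    """Exponential form: rho(sum_j theta_j s_j), theta_j = -ln(1 - p_j)."""
    z = sum(-math.log(1.0 - p_j) * s_j for p_j, s_j in zip(p, s))
    return 1.0 - math.exp(-z)


def mean_exp_closed_form(a, J, h, mu):
    """<exp(a*z)> for z = sum_j J_j S_j + h with independent S_j ~ Bernoulli(mu_j).

    Setting a = -xi or a = 1 - xi gives the two averages of Appendix B.
    """
    value = math.exp(a * h)
    for J_j, mu_j in zip(J, mu):
        value *= 1.0 - mu_j + mu_j * math.exp(a * J_j)
    return value


def mean_exp_brute_force(a, J, h, mu):
    """Same average, computed by summing over all 2^n parent states."""
    total = 0.0
    for s in itertools.product((0, 1), repeat=len(J)):
        q = 1.0  # probability of this state under the factorial distribution
        for s_j, mu_j in zip(s, mu):
            q *= mu_j if s_j else 1.0 - mu_j
        z = sum(J_j * s_j for J_j, s_j in zip(J, s)) + h
        total += q * math.exp(a * z)
    return total


def phi(xi, J, h, mu):
    """Ratio <e^{(1-xi)z}> / (<e^{-xi z}> + <e^{(1-xi)z}>)."""
    num = mean_exp_closed_form(1.0 - xi, J, h, mu)
    return num / (mean_exp_closed_form(-xi, J, h, mu) + num)


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    p = [0.2, 0.5, 0.9]
    J, h, mu, xi = [0.7, -1.3, 2.1], -0.4, [0.1, 0.6, 0.85], 0.3
    print(noisy_or_product(p, (1, 0, 1)), noisy_or_gate(p, (1, 0, 1)))
    print(mean_exp_closed_form(-xi, J, h, mu), mean_exp_brute_force(-xi, J, h, mu))
    print(phi(xi, J, h, mu))
```

The brute-force sum grows as 2^n in the number of parents, whereas the closed form is a single product over parents; this is the practical reason the factorial averages are useful.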
More informationErrorPropagation.nb 1. Error Propagation
ErrorPropagaton.nb Error Propagaton Suppose that we make observatons of a quantty x that s subject to random fluctuatons or measurement errors. Our best estmate of the true value for ths quantty s then
More informationA graphical model for multirelational social network analysis
A graphcal model for multrelatonal socal network analyss Mohammad Khoshneshn Bowlng Green State Unversty, Bowlng Green, Oho 43403 USA Nck Street Unversty of Iowa, Iowa Cty, Iowa 52242 USA mkhosh@bgsu.edu
More information