Inference and Representation
1 Inference and Representation
David Sontag, New York University
Lecture 11, Nov. 24, 2015
2 Approximate marginal inference
Given the joint p(x_1, ..., x_n) represented as a graphical model, how do we perform marginal inference, e.g. to compute p(x_1 | e)?
We showed in Lecture 4 that doing this exactly is NP-hard.
Nearly all approximate inference algorithms are either:
1. Monte Carlo methods (e.g., Gibbs sampling, likelihood weighting, MCMC)
2. Variational algorithms (e.g., mean-field, loopy belief propagation)
3 Variational methods
Goal: Approximate a difficult distribution p(x | e) with a new distribution q(x) such that:
1. p(x | e) and q(x) are close
2. Computation on q(x) is easy
How should we measure distance between distributions?
The Kullback-Leibler divergence (KL-divergence) between two distributions p and q is defined as
    D(p || q) = Σ_x p(x) log [ p(x) / q(x) ]
(it measures the expected number of extra bits required to describe samples from p(x) using a code based on q instead of p)
D(p || q) ≥ 0 for all p, q, with equality if and only if p = q
Notice that KL-divergence is asymmetric
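The definition and its two properties (non-negativity and asymmetry) can be checked numerically. A minimal sketch with two hypothetical 3-state distributions (natural log, so the "bits" above become nats):

```python
import math

def kl(p, q):
    # D(p || q) = sum_x p(x) log(p(x)/q(x)); terms with p(x) = 0 contribute 0
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]
d_pq = kl(p, q)   # extra nats to describe samples from p using a code built for q
d_qp = kl(q, p)   # generally different: the divergence is asymmetric
```

Both directions are non-negative and zero only when the arguments coincide, but d_pq and d_qp differ.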
4 KL-divergence (see Murphy)
    D(p || q) = Σ_x p(x) log [ p(x) / q(x) ]
Suppose p is the true distribution we wish to do inference with.
What is the difference between the solution to
    arg min_q D(p || q)    (called the M-projection of p onto Q)
and
    arg min_q D(q || p)    (called the I-projection)?
These two will differ only when q is minimized over a restricted set of probability distributions Q = {q_1, ...}, and in particular when p ∉ Q
5 KL-divergence: M-projection
    q* = arg min_{q ∈ Q} D(p || q) = arg min_{q ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
(Figure: p = green, q* = red)
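For this diagonal-Gaussian family the M-projection can be checked numerically: the minimizer keeps p's mean and per-coordinate variances and simply drops the correlation. A sketch using the standard closed-form KL between multivariate Gaussians; the particular p and the perturbed candidates are assumptions made for illustration:

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    # D(N(mu0, S0) || N(mu1, S1)) for multivariate Gaussians (standard formula)
    k = len(mu0)
    S1inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu_p = np.array([0.0, 0.0])
S_p = np.array([[1.0, 0.8], [0.8, 1.0]])        # a correlated 2D Gaussian p

# moment-matched diagonal candidate: same mean, same marginal variances
d_match = kl_gauss(mu_p, S_p, mu_p, np.diag(np.diag(S_p)))

# a few perturbed diagonal candidates, all worse than the moment-matched one
others = [(mu_p, np.diag([0.5, 1.0])), (mu_p, np.diag([1.0, 2.0])),
          (np.array([0.3, 0.0]), np.diag([1.0, 1.0]))]
d_others = [kl_gauss(mu_p, S_p, mu1, S1) for mu1, S1 in others]
```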
6 KL-divergence: I-projection
    q* = arg min_{q ∈ Q} D(q || p) = arg min_{q ∈ Q} Σ_x q(x) log [ q(x) / p(x) ]
For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
(Figure: p = green, q* = red)
7 KL-divergence (single Gaussian)
In this simple example, both the M-projection and I-projection find an approximate q(x) that has the correct mean (i.e., E_p[z] = E_q[z]).
(Figure: panels (a) and (b) show the two projections)
What if p(x) is multi-modal?
8 KL-divergence: M-projection (mixture of Gaussians)
    q* = arg min_{q ∈ Q} D(p || q) = arg min_{q ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
Now suppose that p(x) is a mixture of two 2D Gaussians and Q is the set of all 2D Gaussian distributions (with arbitrary covariance matrices):
(Figure: p = blue, q* = red)
The M-projection yields a distribution q(x) with the correct mean and covariance.
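Since Q here is all 2D Gaussians, the M-projection just matches the mixture's overall mean and covariance, which are available in closed form: E[x] = Σ_k w_k μ_k and Cov[x] = Σ_k w_k (Σ_k + μ_k μ_kᵀ) − E[x]E[x]ᵀ. A sketch with an assumed symmetric two-component mixture:

```python
import numpy as np

# Mixture of two 2D Gaussians: weights w, means mus, covariances Ss (all assumed)
w = np.array([0.5, 0.5])
mus = np.array([[-2.0, 0.0], [2.0, 0.0]])
Ss = np.array([[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 0.0], [0.0, 1.0]]])

# M-projection onto all Gaussians = moment matching on the mixture:
mean = (w[:, None] * mus).sum(axis=0)
second_moment = sum(wk * (Sk + np.outer(mk, mk)) for wk, mk, Sk in zip(w, mus, Ss))
cov = second_moment - np.outer(mean, mean)
```

For this mixture the matched Gaussian is centered between the two modes with inflated variance along the axis separating them.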
9 KL-divergence: I-projection (mixture of Gaussians)
    q* = arg min_{q ∈ Q} D(q || p) = arg min_{q ∈ Q} Σ_x q(x) log [ q(x) / p(x) ]
(Figure: p = blue, q* = red; two local minima!)
Unlike the M-projection, the I-projection does not always yield the correct moments.
Q: D(p || q) is convex, so why are there local minima?
A: We are using a parametric form for q (i.e., a Gaussian), and the objective is not convex in μ, Σ.
10 M-projection does moment matching
Recall that the M-projection is:
    q* = arg min_{q ∈ Q} D(p || q) = arg min_{q ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
Suppose that Q is an exponential family (p(x) can be arbitrary) and that we perform the M-projection, finding q*.
Theorem: The expected sufficient statistics, with respect to q*(x), are exactly the marginals of p(x):
    E_{q*}[f(x)] = E_p[f(x)]
Thus, solving for the M-projection (exactly) is just as hard as the original inference problem.
11 M-projection does moment matching
Recall that the M-projection is:
    q* = arg min_{q(x;η) ∈ Q} D(p || q) = arg min_{q(x;η) ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
Theorem: E_{q*}[f(x)] = E_p[f(x)].
Proof: Look at the first-order optimality conditions. Only the −Σ_x p(x) log q(x; η) term depends on η, so
    ∇_{η_i} D(p || q) = −∇_{η_i} Σ_x p(x) log q(x; η)
                      = −∇_{η_i} Σ_x p(x) log [ h(x) exp{ η·f(x) − ln Z(η) } ]
                      = −∇_{η_i} Σ_x p(x) [ η·f(x) − ln Z(η) ]
                      = −Σ_x p(x) f_i(x) + ∇_{η_i} ln Z(η)
                      = −E_p[f_i(x)] + E_{q(x;η)}[f_i(x)] = 0
(since ∇_{η_i} ln Z(η) = E_q[f_i(x)]).
Corollary: Even computing the gradients is hard (we can't do gradient descent).
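The gradient identity in the proof can be sanity-checked numerically on the simplest exponential family, a Bernoulli with natural parameter η (q(x; η) = exp(ηx − ln(1 + e^η)), f(x) = x). A hedged sketch comparing a central finite difference of D(p || q) against the closed form −E_p[f] + E_q[f]:

```python
import math

def log_q(x, eta):
    # Bernoulli in exponential-family form: log q(x; eta) = eta*x - ln(1 + e^eta)
    return eta * x - math.log(1.0 + math.exp(eta))

def D(p1, eta):
    # D(p || q) for p = Bernoulli(p1); terms with p(x) = 0 contribute nothing
    d = 0.0
    for x, px in [(0, 1.0 - p1), (1, p1)]:
        if px > 0:
            d += px * (math.log(px) - log_q(x, eta))
    return d

p1, eta, h = 0.3, 0.7, 1e-6
grad_fd = (D(p1, eta + h) - D(p1, eta - h)) / (2 * h)   # numeric gradient
E_q = math.exp(eta) / (1.0 + math.exp(eta))             # E_q[f(x)] = sigmoid(eta)
grad_closed = -p1 + E_q                                 # -E_p[f(x)] + E_q[f(x)]
```

Setting the closed-form gradient to zero recovers moment matching: sigmoid(η*) = p1.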
12 Most variational inference algorithms make use of the I-projection.
13 Variational methods
Suppose that we have an arbitrary graphical model:
    p(x; θ) = (1 / Z(θ)) Π_{c ∈ C} φ_c(x_c) = exp( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) )
All of the approaches begin as follows:
    D(q || p) = Σ_x q(x) ln [ q(x) / p(x) ]
              = −Σ_x q(x) ln p(x) − Σ_x q(x) ln [ 1 / q(x) ]
              = −Σ_x q(x) ( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) ) − H(q(x))
              = −Σ_{c ∈ C} Σ_x q(x) θ_c(x_c) + Σ_x q(x) ln Z(θ) − H(q(x))
              = −Σ_{c ∈ C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)).
14 Mean field algorithms for variational inference
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x))
Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x).
Mean field algorithms assume a factored representation of the joint distribution, e.g.
    q(x) = Π_{i ∈ V} q_i(x_i)
(called naive mean field; figure: fully factorized variational distribution, from Eric Xing, CMU)
15 Naive mean-field
Suppose that Q consists of all fully factored distributions, of the form q(x) = Π_{i ∈ V} q_i(x_i).
We can use this to simplify
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q)
First, note that q(x_c) = Π_{i ∈ c} q_i(x_i).
Next, notice that the joint entropy decomposes as a sum of local entropies:
    H(q) = −Σ_x q(x) ln q(x)
         = −Σ_x q(x) ln Π_{i ∈ V} q_i(x_i)
         = −Σ_x q(x) Σ_{i ∈ V} ln q_i(x_i)
         = −Σ_{i ∈ V} Σ_x q(x) ln q_i(x_i)
         = −Σ_{i ∈ V} Σ_{x_i} q_i(x_i) ln q_i(x_i) Σ_{x_{V∖i}} q(x_{V∖i} | x_i)
         = Σ_{i ∈ V} H(q_i).
16 Naive mean-field
Suppose that Q consists of all fully factored distributions, of the form q(x) = Π_{i ∈ V} q_i(x_i).
We can use this to simplify
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q)
First, note that q(x_c) = Π_{i ∈ c} q_i(x_i).
Next, notice that the joint entropy decomposes as H(q) = Σ_{i ∈ V} H(q_i).
Putting these together, we obtain the following variational objective:
    max_q Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) Π_{i ∈ c} q_i(x_i) + Σ_{i ∈ V} H(q_i)
subject to the constraints
    q_i(x_i) ≥ 0    ∀ i ∈ V, x_i ∈ Val(X_i)
    Σ_{x_i ∈ Val(X_i)} q_i(x_i) = 1    ∀ i ∈ V
17 Naive mean-field for pairwise MRFs
How do we maximize the variational objective?
    max_q Σ_{ij ∈ E} Σ_{x_i, x_j} θ_ij(x_i, x_j) q_i(x_i) q_j(x_j) − Σ_{i ∈ V} Σ_{x_i} q_i(x_i) ln q_i(x_i)    (*)
This is a non-concave optimization problem, with many local maxima!
Nonetheless, we can greedily maximize it using block coordinate ascent:
1. Iterate over each of the variables i ∈ V. For variable i,
2. fully maximize (*) with respect to {q_i(x_i), x_i ∈ Val(X_i)}.
3. Repeat until convergence.
Constructing the Lagrangian, taking the derivative, setting it to zero, and solving yields the update (shown on blackboard):
    q_i(x_i) ← (1 / Z_i) exp{ θ_i(x_i) + Σ_{j ∈ N(i)} Σ_{x_j} q_j(x_j) θ_ij(x_i, x_j) }
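The block coordinate ascent above can be sketched end to end on a small pairwise MRF. Everything concrete here (a 4-node binary cycle, random potentials, 100 sweeps) is an illustrative assumption; exact marginals by brute-force enumeration are computed alongside for comparison:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                # a cycle of binary variables
theta_i = rng.normal(size=(n, 2))                        # unary log-potentials theta_i(x_i)
theta_ij = {e: rng.normal(size=(2, 2)) for e in edges}   # pairwise theta_ij(x_i, x_j)

q = np.full((n, 2), 0.5)                                 # initialize each q_i uniformly
for _ in range(100):                                     # block coordinate ascent sweeps
    for i in range(n):
        s = theta_i[i].copy()
        for (a, b), th in theta_ij.items():
            if a == i:
                s += th @ q[b]                           # sum_{x_j} theta_ij(x_i, x_j) q_j(x_j)
            elif b == i:
                s += th.T @ q[a]
        q[i] = np.exp(s - s.max())                       # the update from the slide,
        q[i] /= q[i].sum()                               # normalized by Z_i

# exact marginal of x_0 by enumeration, for comparison
logp = np.zeros((2,) * n)
for x in itertools.product(range(2), repeat=n):
    logp[x] = sum(theta_i[i][x[i]] for i in range(n)) \
            + sum(theta_ij[(a, b)][x[a], x[b]] for (a, b) in edges)
p = np.exp(logp - logp.max())
p /= p.sum()
exact_m0 = p.sum(axis=(1, 2, 3))
```

Each sweep only ever improves the objective, but which local maximum it lands in depends on the initialization.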
18 How accurate will the approximation be?
Consider a distribution which is an XOR of two binary variables A and B:
    p(a, b) = 0.5 − ε if a ≠ b,    p(a, b) = ε if a = b
(Figure: contour plot of the variational objective as a function of Q(a = 1) and Q(b = 1))
Even for a single edge, mean field can give very wrong answers!
Interestingly, once ε > 0.1, mean field has a single maximum point at the uniform distribution (thus, exact).
19 Structured mean-field approximations
Rather than assuming a fully-factored distribution for q, we can use a structured approximation, such as a spanning tree.
For example, for a factorial HMM, a good approximation may be a product of chain-structured models.
(Fig. 5.4: Structured mean field approximation for a factorial HMM. The original model consists of a set of hidden Markov models, defined on chains, coupled at each time step.)
20 Recall our starting place for variational methods...
Suppose that we have an arbitrary graphical model:
    p(x; θ) = (1 / Z(θ)) Π_{c ∈ C} φ_c(x_c) = exp( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) )
All of the approaches begin as follows:
    D(q || p) = Σ_x q(x) ln [ q(x) / p(x) ]
              = −Σ_x q(x) ln p(x) − Σ_x q(x) ln [ 1 / q(x) ]
              = −Σ_x q(x) ( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) ) − H(q(x))
              = −Σ_{c ∈ C} Σ_x q(x) θ_c(x_c) + Σ_x q(x) ln Z(θ) − H(q(x))
              = −Σ_{c ∈ C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)).
21 The log-partition function
Since D(q || p) ≥ 0, we have
    −Σ_{c ∈ C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)) ≥ 0,
which implies that
    ln Z(θ) ≥ Σ_{c ∈ C} E_q[θ_c(x_c)] + H(q(x)).
Thus, any approximating distribution q(x) gives a lower bound on the log-partition function (for a BN, this is the log probability of the observed variables).
Recall that D(q || p) = 0 if and only if p = q. Thus, if we allow ourselves to optimize over all distributions, we have:
    ln Z(θ) = max_q Σ_{c ∈ C} E_q[θ_c(x_c)] + H(q(x)).
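The bound can be checked by brute force on a tiny model. A sketch with a hypothetical two-variable pairwise MRF: any random q lower-bounds ln Z, and the bound is tight exactly at q = p:

```python
import itertools
import math
import random

random.seed(0)
states = list(itertools.product(range(2), repeat=2))
th1 = [0.5, -0.3]                       # theta_1(x_1), assumed values
th2 = [0.2, 0.9]                        # theta_2(x_2)
th12 = [[1.0, -1.0], [-1.0, 1.0]]       # theta_12(x_1, x_2)

def score(x):
    return th1[x[0]] + th2[x[1]] + th12[x[0]][x[1]]

lnZ = math.log(sum(math.exp(score(x)) for x in states))

def bound(q):
    # sum_c E_q[theta_c(x_c)] + H(q)
    e = sum(qx * score(x) for qx, x in zip(q, states))
    h = -sum(qx * math.log(qx) for qx in q if qx > 0)
    return e + h

raw = [random.random() for _ in states]
q_rand = [r / sum(raw) for r in raw]                    # an arbitrary distribution
p = [math.exp(score(x) - lnZ) for x in states]          # the true distribution
```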
22 Re-writing objective in terms of moments
    ln Z(θ) = max_q Σ_{c ∈ C} E_q[θ_c(x_c)] + H(q(x))
            = max_q Σ_{c ∈ C} Σ_x q(x) θ_c(x_c) + H(q(x))
            = max_q Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x)).
Now assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector.
Define μ_q = E_q[f(x)] to be the marginals of q(x).
We can re-write the objective as
    ln Z(θ) = max_{μ ∈ M} max_{q : E_q[f(x)] = μ} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H(q(x)),
where M, the marginal polytope, consists of all valid marginal vectors.
23 Re-writing objective in terms of moments
Next, push the max over q inside to obtain:
    ln Z(θ) = max_{μ ∈ M} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H(μ),    where H(μ) = max_{q : E_q[f(x)] = μ} H(q)
Does this look familiar?
For discrete random variables, the marginal polytope M is given by
    M = { μ ∈ R^d : μ = Σ_{x ∈ X^m} p(x) f(x) for some p(x) ≥ 0 with Σ_{x ∈ X^m} p(x) = 1 }
      = conv{ f(x), x ∈ X^m }
(conv denotes the convex hull operation).
For a discrete-variable MRF, the sufficient statistic vector f(x) is simply the concatenation of indicator functions for each clique of variables that appear together in a potential function.
For example, if we have a pairwise MRF on binary variables with m = |V| variables and |E| edges, d = 2m + 4|E|.
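The indicator construction and the dimension count d = 2m + 4|E| are easy to verify in code. A small sketch for an assumed 3-node binary MRF with all three edges (each assignment's vector is a vertex of M):

```python
import itertools

def f(x, n, edges):
    # Sufficient statistics for a pairwise binary MRF:
    # 2 node indicators per variable, 4 edge indicators per edge.
    v = []
    for i in range(n):
        v += [int(x[i] == 0), int(x[i] == 1)]
    for (a, b) in edges:
        for s, t in itertools.product(range(2), repeat=2):
            v.append(int(x[a] == s and x[b] == t))
    return v

n, edges = 3, [(0, 1), (1, 2), (0, 2)]
d = len(f((0, 1, 0), n, edges))   # should equal 2*n + 4*|E|
```

Exactly one indicator fires per variable and per edge, so every vertex f(x) sums to |V| + |E|.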
24 Marginal polytope for discrete MRFs (Wainwright & Jordan, '03)
(Figure 2-1: Illustration of the marginal polytope for a Markov random field with three nodes that have states in {0, 1}. The vertices correspond one-to-one with the sufficient statistic vectors of global assignments to (X_1, X_2, X_3); each vector concatenates the node indicators and the edge indicators for X_1X_2, X_1X_3, and X_2X_3. A convex combination such as μ = ½ μ¹ + ½ μ² of two vertices is a valid marginal vector.)
25 Relaxation
    ln Z(θ) = max_{μ ∈ M} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H(μ)
We still haven't achieved anything, because:
1. The marginal polytope M is complex to describe (in general, exponentially many vertices and facets)
2. H(μ) is very difficult to compute or optimize over
We now make two approximations:
1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L
2. We replace H(μ) with a function H̃(μ) which approximates H(μ)
26 Local consistency constraints
Force every cluster of variables to choose a local assignment:
    μ_i(x_i) ≥ 0    ∀ i ∈ V, x_i
    Σ_{x_i} μ_i(x_i) = 1    ∀ i ∈ V
    μ_ij(x_i, x_j) ≥ 0    ∀ ij ∈ E, x_i, x_j
    Σ_{x_i, x_j} μ_ij(x_i, x_j) = 1    ∀ ij ∈ E
Enforce that these local assignments are globally consistent:
    μ_i(x_i) = Σ_{x_j} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_i
    μ_j(x_j) = Σ_{x_i} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_j
The local consistency polytope M_L is defined by these constraints.
Theorem: The local consistency constraints exactly define the marginal polytope for a tree-structured MRF.
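On graphs with cycles, locally consistent pseudomarginals need not come from any joint distribution. The classic example is perfect anti-correlation on a triangle: uniform node marginals with each edge insisting its endpoints disagree. A sketch (the consistency checker is a hypothetical helper, not from the lecture):

```python
import itertools
import numpy as np

n, edges = 3, [(0, 1), (1, 2), (0, 2)]
mu_i = {i: np.array([0.5, 0.5]) for i in range(n)}
mu_ij = {e: np.array([[0.0, 0.5], [0.5, 0.0]]) for e in edges}  # endpoints must disagree

def locally_consistent(mu_i, mu_ij, tol=1e-9):
    # check the nonnegativity, normalization, and marginalization constraints
    ok = all(abs(m.sum() - 1) < tol and (m >= -tol).all() for m in mu_i.values())
    for (a, b), m in mu_ij.items():
        ok = ok and abs(m.sum() - 1) < tol and (m >= -tol).all()
        ok = ok and np.allclose(m.sum(axis=1), mu_i[a]) \
               and np.allclose(m.sum(axis=0), mu_i[b])
    return ok

def realizable(mu_ij):
    # any joint q with these marginals must put all its mass on assignments
    # where every edge marginal is positive; here no such assignment exists,
    # since x0 != x1, x1 != x2, x0 != x2 cannot all hold for binary variables
    return any(all(mu_ij[(a, b)][x[a], x[b]] > 0 for (a, b) in edges)
               for x in itertools.product(range(2), repeat=n))
```

So these pseudomarginals lie in M_L but not in M, which is exactly the gap the relaxation introduces.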
27 Entropy for tree-structured models
Suppose that p is a tree-structured distribution, so that we are optimizing only over marginals μ_ij(x_i, x_j) for ij ∈ T.
The solution to arg max_{q : E_q[f(x)] = μ} H(q) is a tree-structured MRF (c.f. Lecture 10, maximum entropy estimation).
The entropy of q as a function of its marginals can be shown to be
    H(μ) = Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ T} I(μ_ij)
where
    H(μ_i) = −Σ_{x_i} μ_i(x_i) log μ_i(x_i)
    I(μ_ij) = Σ_{x_i, x_j} μ_ij(x_i, x_j) log [ μ_ij(x_i, x_j) / (μ_i(x_i) μ_j(x_j)) ]
Can we use this for non-tree structured models?
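The decomposition H = Σ node entropies − Σ edge mutual informations is exact on trees, and can be verified on a random chain p(x) = p(x_0) p(x_1|x_0) p(x_2|x_1); the construction below is an illustrative assumption:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(2)
p0 = rng.dirichlet([1, 1])              # p(x0)
p10 = rng.dirichlet([1, 1], size=2)     # p(x1 | x0), one row per value of x0
p21 = rng.dirichlet([1, 1], size=2)     # p(x2 | x1)

p = np.zeros((2, 2, 2))
for x in itertools.product(range(2), repeat=3):
    p[x] = p0[x[0]] * p10[x[0], x[1]] * p21[x[1], x[2]]

H_exact = -sum(px * math.log(px) for px in p.ravel() if px > 0)

# node and edge marginals of the chain
mu = [p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))]
mu01, mu12 = p.sum(axis=2), p.sum(axis=0)

def H(m):
    return -sum(v * math.log(v) for v in np.ravel(m) if v > 0)

def I(m2, ma, mb):
    return sum(m2[i, j] * math.log(m2[i, j] / (ma[i] * mb[j]))
               for i in range(2) for j in range(2) if m2[i, j] > 0)

H_tree = sum(H(m) for m in mu) - I(mu01, mu[0], mu[1]) - I(mu12, mu[1], mu[2])
```

On a graph with cycles the same expression is no longer exact, which is precisely the Bethe approximation of the next slide.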
28 Bethe free energy approximation
The Bethe entropy approximation is (for any graph)
    H_bethe(μ) = Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} I(μ_ij)
This gives the following variational approximation:
    max_{μ ∈ M_L} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H_bethe(μ)
For non tree-structured models this is not concave, and is hard to maximize.
Loopy belief propagation, if it converges, finds a saddle point!
29 Concave relaxation
Let H̃(μ) be an upper bound on H(μ), i.e. H̃(μ) ≥ H(μ).
As a result, we obtain the following upper bound on the log-partition function:
    ln Z(θ) ≤ max_{μ ∈ M_L} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H̃(μ)
An example of a concave entropy upper bound is the tree-reweighted approximation (Wainwright, Jaakkola & Willsky, '05), given by specifying a distribution over spanning trees of the graph.
(Figure 1: Illustration of the spanning tree polytope T(G). Probability 1/3 is assigned to each of three spanning trees of the original graph. Edge b is a bridge, so it must appear in every spanning tree, giving ρ_b = 1; edges e and f appear in two and one of the spanning trees respectively, giving edge appearance probabilities ρ_e = 2/3 and ρ_f = 1/3.)
Letting {ρ_ij} denote edge appearance probabilities, we have:
    H_TRW(μ) = Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} ρ_ij I(μ_ij)
30 Comparison of LBP and TRW
We showed two approximation methods, both making use of the local consistency constraints M_L on the marginal polytope:
1. Bethe free energy approximation (for pairwise MRFs):
    max_{μ ∈ M_L} Σ_{ij ∈ E} Σ_{x_i, x_j} μ_ij(x_i, x_j) θ_ij(x_i, x_j) + Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} I(μ_ij)
   Not concave; can use the concave-convex procedure to find local optima.
   Loopy BP, if it converges, finds a saddle point (often a local maximum).
2. Tree-reweighted approximation (for pairwise MRFs):
    max_{μ ∈ M_L} Σ_{ij ∈ E} Σ_{x_i, x_j} μ_ij(x_i, x_j) θ_ij(x_i, x_j) + Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} ρ_ij I(μ_ij)    (**)
   {ρ_ij} are edge appearance probabilities (must be consistent with some set of spanning trees).
   This is concave! Find the global maximum using projected gradient ascent.
   Provides an upper bound on the log-partition function, i.e. ln Z(θ) ≤ (**).
31 Two types of variational algorithms: mean-field and relaxation
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x))
Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x).
Relaxation algorithms work directly with pseudomarginals, which may not be consistent with any joint distribution.
Mean-field algorithms assume a factored representation of the joint distribution, e.g.
    q(x) = Π_{i ∈ V} q_i(x_i)
(called naive mean field; figure: fully factorized variational distribution, from Eric Xing, CMU)
32 Naive mean-field
Using the same notation as in the rest of the lecture, naive mean-field is:
    max_μ Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + Σ_{i ∈ V} H(μ_i)
subject to
    μ_i(x_i) ≥ 0    ∀ i ∈ V, x_i ∈ Val(X_i)
    Σ_{x_i ∈ Val(X_i)} μ_i(x_i) = 1    ∀ i ∈ V
    μ_c(x_c) = Π_{i ∈ c} μ_i(x_i)
Corresponds to optimizing over an inner bound on the marginal polytope.
We obtain a lower bound on the log-partition function, i.e. the optimal value is ≤ ln Z(θ).
(Fig. 5.3: Cartoon illustration of the set M_F(G) of mean parameters that arise from tractable distributions, a nonconvex inner bound on M(G). Illustrated is the case of discrete random variables, where M(G) is a polytope; the circles correspond to mean parameters that arise from delta distributions, and belong to both M(G) and M_F(G).)
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationPerron vector Optimization applied to search engines
Perron vector Optimization applied to search engines Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique May 18, 2011 Web page ranking The core of search engines Semantic rankings (keywords) Hyperlink
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationDefinition 11.1. Given a graph G on n vertices, we define the following quantities:
Lecture 11 The Lovász ϑ Function 11.1 Perfect graphs We begin with some background on perfect graphs. graphs. First, we define some quantities on Definition 11.1. Given a graph G on n vertices, we define
More informationProximal mapping via network optimization
L. Vandenberghe EE236C (Spring 23-4) Proximal mapping via network optimization minimum cut and maximum flow problems parametric minimum cut problem application to proximal mapping Introduction this lecture:
More informationConditional mean field
Conditional mean field Peter Carbonetto Department of Computer Science University of British Columbia Vancouver, BC, Canada V6T 1Z4 pcarbo@cs.ubc.ca Nando de Freitas Department of Computer Science University
More information5 Directed acyclic graphs
5 Directed acyclic graphs (5.1) Introduction In many statistical studies we have prior knowledge about a temporal or causal ordering of the variables. In this chapter we will use directed graphs to incorporate
More informationTail inequalities for order statistics of log-concave vectors and applications
Tail inequalities for order statistics of log-concave vectors and applications Rafał Latała Based in part on a joint work with R.Adamczak, A.E.Litvak, A.Pajor and N.Tomczak-Jaegermann Banff, May 2011 Basic
More informationWhy? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
More informationThe Expectation Maximization Algorithm A short tutorial
The Expectation Maximiation Algorithm A short tutorial Sean Borman Comments and corrections to: em-tut at seanborman dot com July 8 2004 Last updated January 09, 2009 Revision history 2009-0-09 Corrected
More informationLecture 1: Course overview, circuits, and formulas
Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik
More information1. Prove that the empty set is a subset of every set.
1. Prove that the empty set is a subset of every set. Basic Topology Written by Men-Gen Tsai email: b89902089@ntu.edu.tw Proof: For any element x of the empty set, x is also an element of every set since
More informationEquilibrium computation: Part 1
Equilibrium computation: Part 1 Nicola Gatti 1 Troels Bjerre Sorensen 2 1 Politecnico di Milano, Italy 2 Duke University, USA Nicola Gatti and Troels Bjerre Sørensen ( Politecnico di Milano, Italy, Equilibrium
More informationTutorial on Markov Chain Monte Carlo
Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,
More informationProbabilistic user behavior models in online stores for recommender systems
Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user
More informationMonte Carlo-based statistical methods (MASM11/FMS091)
Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February 5, 2013 J. Olsson Monte Carlo-based
More informationLecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
More informationα = u v. In other words, Orthogonal Projection
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
More informationGraphical Models, Exponential Families, and Variational Inference
Foundations and Trends R in Machine Learning Vol. 1, Nos. 1 2 (2008) 1 305 c 2008 M. J. Wainwright and M. I. Jordan DOI: 10.1561/2200000001 Graphical Models, Exponential Families, and Variational Inference
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More information5.1 Bipartite Matching
CS787: Advanced Algorithms Lecture 5: Applications of Network Flow In the last lecture, we looked at the problem of finding the maximum flow in a graph, and how it can be efficiently solved using the Ford-Fulkerson
More informationSeveral Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
More informationWalrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.
Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian
More informationSection 5. Stan for Big Data. Bob Carpenter. Columbia University
Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model
More information1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
More informationMATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.
MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column
More information954 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 3, MARCH 2005
954 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 3, MARCH 2005 Using Linear Programming to Decode Binary Linear Codes Jon Feldman, Martin J. Wainwright, Member, IEEE, and David R. Karger, Associate
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
More information8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
More informationLecture 7: NP-Complete Problems
IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Basic Course on Computational Complexity Lecture 7: NP-Complete Problems David Mix Barrington and Alexis Maciel July 25, 2000 1. Circuit
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationExponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009
Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths
More informationLecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationScheduling Home Health Care with Separating Benders Cuts in Decision Diagrams
Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams André Ciré University of Toronto John Hooker Carnegie Mellon University INFORMS 2014 Home Health Care Home health care delivery
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More information2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
More informationAdaptive Linear Programming Decoding
Adaptive Linear Programming Decoding Mohammad H. Taghavi and Paul H. Siegel ECE Department, University of California, San Diego Email: (mtaghavi, psiegel)@ucsd.edu ISIT 2006, Seattle, USA, July 9 14, 2006
More informationLecture 11: 0-1 Quadratic Program and Lower Bounds
Lecture : - Quadratic Program and Lower Bounds (3 units) Outline Problem formulations Reformulation: Linearization & continuous relaxation Branch & Bound Method framework Simple bounds, LP bound and semidefinite
More informationLoad Balancing and Switch Scheduling
EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load
More informationLABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationBest Monotone Degree Bounds for Various Graph Parameters
Best Monotone Degree Bounds for Various Graph Parameters D. Bauer Department of Mathematical Sciences Stevens Institute of Technology Hoboken, NJ 07030 S. L. Hakimi Department of Electrical and Computer
More information