The Expectation Maximization Algorithm: A short tutorial




Sean Borman
Comments and corrections to: em-tut at seanborman dot com
July 18, 2004
Last updated January 09, 2009

Revision history
2009-01-09 Corrected grammar in the paragraph which precedes Equation (17). Changed datestamp format in the revision history.
2008-07-05 Corrected caption for Figure (2). Added conditioning on $\theta_n$ for $l$ in convergence discussion in Section (3.2). Changed email contact info to reduce spam.
2006-10-14 Added explanation and disambiguating parentheses in the development leading to Equation (14). Minor corrections.
2006-06-28 Added Figure (1). Corrected typo above Equation (5). Minor corrections. Added hyperlinks.
2005-08-26 Minor corrections.
2004-07-18 Initial revision.

1 Introduction

This tutorial discusses the Expectation Maximization (EM) algorithm of Dempster, Laird and Rubin [1]. The approach taken follows that of an unpublished note by Stuart Russell, but fleshes out some of the gory details. In order to ensure that the presentation is reasonably self-contained, some of the results on which the derivation of the algorithm is based are presented prior to the main results. The EM algorithm has become a popular tool in statistical estimation problems involving incomplete data, or in problems which can be posed in a similar form, such as mixture estimation [3, 4]. The EM algorithm has also been used in various motion estimation frameworks [5], and variants of it have been used in multiframe superresolution restoration methods which combine motion estimation along the lines of [2].

Figure 1: $f$ is convex on $[a, b]$ if $f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$ $\forall x_1, x_2 \in [a, b]$, $\lambda \in [0, 1]$.

2 Convex Functions

Definition 1 Let $f$ be a real valued function defined on an interval $I = [a, b]$. $f$ is said to be convex on $I$ if $\forall x_1, x_2 \in I$, $\lambda \in [0, 1]$,

\[ f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2). \]

$f$ is said to be strictly convex if the inequality is strict. Intuitively, this definition states that the function falls below (strictly convex) or is never above (convex) the straight line (the secant) from points $(x_1, f(x_1))$ to $(x_2, f(x_2))$. See Figure (1).

Definition 2 $f$ is concave (strictly concave) if $-f$ is convex (strictly convex).

Theorem 1 If $f(x)$ is twice differentiable on $[a, b]$ and $f''(x) \ge 0$ on $[a, b]$ then $f(x)$ is convex on $[a, b]$.

Proof: For $x \le y \in [a, b]$ and $\lambda \in [0, 1]$ let $z = \lambda y + (1-\lambda) x$. By definition, $f$ is convex iff $f(\lambda y + (1-\lambda) x) \le \lambda f(y) + (1-\lambda) f(x)$. Writing $z = \lambda y + (1-\lambda) x$, and noting that $f(z) = \lambda f(z) + (1-\lambda) f(z)$, we have that

\[ f(z) = \lambda f(z) + (1-\lambda) f(z) \le \lambda f(y) + (1-\lambda) f(x). \]

By rearranging terms, an equivalent definition for convexity can be obtained: $f$ is convex if

\[ \lambda [f(y) - f(z)] \ge (1-\lambda) [f(z) - f(x)]. \tag{1} \]

By the mean value theorem, $\exists s$, $x \le s \le z$ s.t.

\[ f(z) - f(x) = f'(s)(z - x). \tag{2} \]

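Similarly, applying the mean value theorem to $f(y) - f(z)$, $\exists t$, $z \le t \le y$ s.t.

\[ f(y) - f(z) = f'(t)(y - z). \tag{3} \]

Thus we have the situation $x \le s \le z \le t \le y$. By assumption, $f''(x) \ge 0$ on $[a, b]$ so

\[ f'(s) \le f'(t) \quad \text{since } s \le t. \tag{4} \]

Also note that we may rewrite $z = \lambda y + (1-\lambda) x$ in the form

\[ (1-\lambda)(z - x) = \lambda (y - z). \tag{5} \]

Finally, combining the above we have,

\begin{align*}
(1-\lambda)[f(z) - f(x)] &= (1-\lambda) f'(s)(z - x) && \text{by Equation (2)} \\
&\le f'(t)(1-\lambda)(z - x) && \text{by Equation (4)} \\
&= \lambda f'(t)(y - z) && \text{by Equation (5)} \\
&= \lambda [f(y) - f(z)] && \text{by Equation (3).}
\end{align*}

Proposition 1 $-\ln(x)$ is strictly convex on $(0, \infty)$.

Proof: With $f(x) = -\ln(x)$, we have $f''(x) = \frac{1}{x^2} > 0$ for $x \in (0, \infty)$. By Theorem (1), whose proof goes through with strict inequalities because $f'' > 0$ here, $-\ln(x)$ is strictly convex on $(0, \infty)$. Also, by Definition (2), $\ln(x)$ is strictly concave on $(0, \infty)$.

The notion of convexity can be extended to apply to $n$ points. This result is known as Jensen's inequality.

Theorem 2 (Jensen's inequality) Let $f$ be a convex function defined on an interval $I$. If $x_1, x_2, \ldots, x_n \in I$ and $\lambda_1, \lambda_2, \ldots, \lambda_n \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$, then

\[ f\left( \sum_{i=1}^{n} \lambda_i x_i \right) \le \sum_{i=1}^{n} \lambda_i f(x_i). \]

Proof: For $n = 1$ this is trivial. The case $n = 2$ corresponds to the definition of convexity. To show that this is true for all natural numbers, we proceed by induction. Assume the theorem is true for some $n$. Then, taking $\lambda_{n+1} < 1$ (the case $\lambda_{n+1} = 1$ is trivial),

\begin{align*}
f\left( \sum_{i=1}^{n+1} \lambda_i x_i \right)
&= f\left( \lambda_{n+1} x_{n+1} + \sum_{i=1}^{n} \lambda_i x_i \right) \\
&= f\left( \lambda_{n+1} x_{n+1} + (1 - \lambda_{n+1}) \frac{1}{1 - \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i \right) \\
&\le \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) f\left( \frac{1}{1 - \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i \right) \\
&= \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) f\left( \sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_{n+1}} x_i \right) \\
&\le \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) \sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_{n+1}} f(x_i) \\
&= \lambda_{n+1} f(x_{n+1}) + \sum_{i=1}^{n} \lambda_i f(x_i) \\
&= \sum_{i=1}^{n+1} \lambda_i f(x_i).
\end{align*}

Since $\ln(x)$ is concave, we may apply Jensen's inequality to obtain the useful result,

\[ \ln \sum_{i=1}^{n} \lambda_i x_i \ge \sum_{i=1}^{n} \lambda_i \ln(x_i). \tag{6} \]

This allows us to lower-bound a logarithm of a sum, a result that is used in the derivation of the EM algorithm.

Jensen's inequality provides a simple proof that the arithmetic mean is greater than or equal to the geometric mean.

Proposition 2

\[ \frac{1}{n} \sum_{i=1}^{n} x_i \ge \sqrt[n]{x_1 x_2 \cdots x_n}. \]

Proof: If $x_1, x_2, \ldots, x_n > 0$ then, since $\ln(x)$ is concave, we have

\begin{align*}
\ln\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)
&\ge \sum_{i=1}^{n} \frac{1}{n} \ln(x_i) \\
&= \frac{1}{n} \ln(x_1 x_2 \cdots x_n) \\
&= \ln (x_1 x_2 \cdots x_n)^{\frac{1}{n}}.
\end{align*}

Thus, we have

\[ \frac{1}{n} \sum_{i=1}^{n} x_i \ge \sqrt[n]{x_1 x_2 \cdots x_n}. \]

If some $x_i = 0$ the geometric mean is zero and the inequality holds trivially.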
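As a quick numerical sanity check of Equation (6) and Proposition 2, the following Python sketch evaluates both inequalities on randomly drawn positive numbers. The function names, the seed, and the sampling range are illustrative assumptions and are not part of the tutorial.

```python
import math
import random

def weighted_log_gap(xs, lams):
    """Return ln(sum_i lam_i x_i) - sum_i lam_i ln(x_i).

    Equation (6) says this gap is nonnegative when the weights
    are nonnegative and sum to one.
    """
    lhs = math.log(sum(l * x for l, x in zip(lams, xs)))
    rhs = sum(l * math.log(x) for l, x in zip(lams, xs))
    return lhs - rhs

def am_gm_gap(xs):
    """Arithmetic mean minus geometric mean for positive xs."""
    n = len(xs)
    arithmetic = sum(xs) / n
    geometric = math.exp(sum(math.log(x) for x in xs) / n)
    return arithmetic - geometric

random.seed(0)  # arbitrary seed, used only for reproducibility
xs = [random.uniform(0.1, 10.0) for _ in range(5)]
raw = [random.random() for _ in range(5)]
lams = [r / sum(raw) for r in raw]  # normalize so the weights sum to 1

jensen_gap = weighted_log_gap(xs, lams)
mean_gap = am_gm_gap(xs)
print(jensen_gap >= 0, mean_gap >= 0)
```

Both gaps shrink to zero (up to rounding) when all the $x_i$ are equal, which is the equality case of Jensen's inequality for a strictly concave function.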

3 The Expectation-Maximization Algorithm

The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. In ML estimation, we wish to estimate the model parameter(s) for which the observed data are the most likely.

Each iteration of the EM algorithm consists of two processes: the E-step and the M-step. In the expectation, or E-step, the missing data are estimated given the observed data and the current estimate of the model parameters. This is achieved using the conditional expectation, explaining the choice of terminology. In the M-step, the likelihood function is maximized under the assumption that the missing data are known. The estimates of the missing data from the E-step are used in lieu of the actual missing data.

Convergence is assured since the algorithm is guaranteed not to decrease the likelihood at any iteration.

3.1 Derivation of the EM-algorithm

Let $X$ be a random vector which results from a parameterized family. We wish to find $\theta$ such that $P(X \mid \theta)$ is a maximum. This is known as the Maximum Likelihood (ML) estimate for $\theta$. In order to estimate $\theta$, it is typical to introduce the log likelihood function defined as,

\[ L(\theta) = \ln P(X \mid \theta). \tag{7} \]

The likelihood function is considered to be a function of the parameter $\theta$ given the data $X$. Since $\ln(x)$ is a strictly increasing function, the value of $\theta$ which maximizes $P(X \mid \theta)$ also maximizes $L(\theta)$.

The EM algorithm is an iterative procedure for maximizing $L(\theta)$. Assume that after the $n$th iteration the current estimate for $\theta$ is given by $\theta_n$. Since the objective is to maximize $L(\theta)$, we wish to compute an updated estimate $\theta$ such that,

\[ L(\theta) > L(\theta_n). \tag{8} \]

Equivalently we want to maximize the difference,

\[ L(\theta) - L(\theta_n) = \ln P(X \mid \theta) - \ln P(X \mid \theta_n). \tag{9} \]

So far, we have not considered any unobserved or missing variables. In problems where such data exist, the EM algorithm provides a natural framework for their inclusion.
Alternately, hidden variables may be introduced purely as an artifice for making the maximum likelihood estimation of $\theta$ tractable. In this case, it is assumed that knowledge of the hidden variables will make the maximization of the likelihood function easier. Either way, denote the hidden random vector by $Z$ and a given realization by $z$. The total probability $P(X \mid \theta)$ may be written in terms of the hidden variables $z$ as,

\[ P(X \mid \theta) = \sum_z P(X \mid z, \theta) P(z \mid \theta). \tag{10} \]
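To make Equation (10) concrete, here is a minimal Python sketch for a toy model that is not taken from the tutorial: a hidden variable $z$ selects one of two biased coins, and the observation $X$ is the number of heads in a fixed number of flips. The parameterization `theta = (pi_a, p_a, p_b)` and all numeric values are assumptions chosen for illustration.

```python
from math import comb, log

def binom_pmf(k, m, p):
    """P(k heads in m flips) for a coin with heads probability p."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)

def marginal_likelihood(k, m, theta):
    """P(X | theta) = sum_z P(X | z, theta) P(z | theta), as in Equation (10).

    theta = (pi_a, p_a, p_b) is a hypothetical parameterization:
    pi_a is the probability of picking coin A, and p_a, p_b are the
    heads probabilities of coins A and B.
    """
    pi_a, p_a, p_b = theta
    prior = {"A": pi_a, "B": 1.0 - pi_a}   # P(z | theta)
    heads_prob = {"A": p_a, "B": p_b}      # parameters of P(X | z, theta)
    return sum(binom_pmf(k, m, heads_prob[z]) * prior[z] for z in ("A", "B"))

theta = (0.5, 0.8, 0.3)                    # made-up parameter values
likelihood = marginal_likelihood(7, 10, theta)
log_likelihood = log(likelihood)           # L(theta) from Equation (7)
print(likelihood, log_likelihood)
```

The hidden coin identity never appears in the result; it is summed out, which is exactly what produces the logarithm of a sum in the log likelihood.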

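We may then rewrite Equation (9) as,

\[ L(\theta) - L(\theta_n) = \ln \sum_z P(X \mid z, \theta) P(z \mid \theta) - \ln P(X \mid \theta_n). \tag{11} \]

Notice that this expression involves the logarithm of a sum. In Section (2), using Jensen's inequality, it was shown that,

\[ \ln \sum_{i=1}^{n} \lambda_i x_i \ge \sum_{i=1}^{n} \lambda_i \ln(x_i) \]

for constants $\lambda_i \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$. This result may be applied to Equation (11) provided that the constants $\lambda_i$ can be identified. Consider letting the constants be of the form $P(z \mid X, \theta_n)$. Since $P(z \mid X, \theta_n)$ is a probability measure, we have that $P(z \mid X, \theta_n) \ge 0$ and that $\sum_z P(z \mid X, \theta_n) = 1$ as required.

Then starting with Equation (11) the constants $P(z \mid X, \theta_n)$ are introduced as,

\begin{align}
L(\theta) - L(\theta_n)
&= \ln \sum_z P(X \mid z, \theta) P(z \mid \theta) - \ln P(X \mid \theta_n) \notag \\
&= \ln \sum_z P(X \mid z, \theta) P(z \mid \theta) \cdot \frac{P(z \mid X, \theta_n)}{P(z \mid X, \theta_n)} - \ln P(X \mid \theta_n) \notag \\
&= \ln \sum_z P(z \mid X, \theta_n) \left( \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n)} \right) - \ln P(X \mid \theta_n) \notag \\
&\ge \sum_z P(z \mid X, \theta_n) \ln \left( \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n)} \right) - \ln P(X \mid \theta_n) \tag{12} \\
&= \sum_z P(z \mid X, \theta_n) \ln \left( \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n) P(X \mid \theta_n)} \right) \tag{13} \\
&\triangleq \Delta(\theta \mid \theta_n). \tag{14}
\end{align}

In going from Equation (12) to Equation (13) we made use of the fact that $\sum_z P(z \mid X, \theta_n) = 1$ so that $\ln P(X \mid \theta_n) = \sum_z P(z \mid X, \theta_n) \ln P(X \mid \theta_n)$, which allows the term $\ln P(X \mid \theta_n)$ to be brought into the summation.

We continue by writing

\[ L(\theta) \ge L(\theta_n) + \Delta(\theta \mid \theta_n) \tag{15} \]

and for convenience define,

\[ l(\theta \mid \theta_n) \triangleq L(\theta_n) + \Delta(\theta \mid \theta_n) \]

so that the relationship in Equation (15) can be made explicit as,

\[ L(\theta) \ge l(\theta \mid \theta_n). \]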

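Figure 2: Graphical interpretation of a single iteration of the EM algorithm: The function $l(\theta \mid \theta_n)$ is bounded above by the likelihood function $L(\theta)$. The functions are equal at $\theta = \theta_n$. The EM algorithm chooses $\theta_{n+1}$ as the value of $\theta$ for which $l(\theta \mid \theta_n)$ is a maximum. Since $L(\theta) \ge l(\theta \mid \theta_n)$, increasing $l(\theta \mid \theta_n)$ ensures that the value of the likelihood function $L(\theta)$ is increased at each step.

We now have a function, $l(\theta \mid \theta_n)$, which is bounded above by the likelihood function $L(\theta)$. Additionally, observe that,

\begin{align}
l(\theta_n \mid \theta_n)
&= L(\theta_n) + \Delta(\theta_n \mid \theta_n) \notag \\
&= L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta_n) P(z \mid \theta_n)}{P(z \mid X, \theta_n) P(X \mid \theta_n)} \notag \\
&= L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln \frac{P(X, z \mid \theta_n)}{P(X, z \mid \theta_n)} \notag \\
&= L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln 1 \notag \\
&= L(\theta_n), \tag{16}
\end{align}

so for $\theta = \theta_n$ the functions $l(\theta \mid \theta_n)$ and $L(\theta)$ are equal.

Our objective is to choose a value of $\theta$ so that $L(\theta)$ is maximized. We have shown that the function $l(\theta \mid \theta_n)$ is bounded above by the likelihood function $L(\theta)$ and that the values of the functions $l(\theta \mid \theta_n)$ and $L(\theta)$ are equal at the current estimate $\theta = \theta_n$. Therefore, any $\theta$ which increases $l(\theta \mid \theta_n)$ above $l(\theta_n \mid \theta_n)$ will also increase $L(\theta)$ above $L(\theta_n)$. In order to achieve the greatest guaranteed increase in the value of $L(\theta)$, the EM algorithm calls for selecting $\theta$ such that $l(\theta \mid \theta_n)$ is maximized. We denote this updated value as $\theta_{n+1}$. This process is illustrated in Figure (2).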
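The two properties used above, $L(\theta) \ge l(\theta \mid \theta_n)$ everywhere and equality at $\theta = \theta_n$, can be checked numerically. The Python sketch below does so for a toy model that is an assumption of this example, not something specified in the tutorial: i.i.d. observations from an equal-weight mixture of two unit-variance Gaussians, one with mean 0 and one with unknown mean $\theta$. Because the observations are assumed independent, the sum over the full hidden vector $z$ decomposes into a sum over data points of sums over each point's label.

```python
import math

def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def joint(x, z, theta):
    """P(x, z | theta) = P(x | z, theta) P(z | theta) with equal weights."""
    mean = 0.0 if z == 0 else theta
    return 0.5 * normal_pdf(x, mean)

def log_likelihood(data, theta):
    """L(theta) = sum_i ln sum_z P(x_i, z | theta)."""
    return sum(math.log(joint(x, 0, theta) + joint(x, 1, theta)) for x in data)

def lower_bound(data, theta, theta_n):
    """l(theta | theta_n) = L(theta_n) + Delta(theta | theta_n)."""
    delta = 0.0
    for x in data:
        evidence = joint(x, 0, theta_n) + joint(x, 1, theta_n)  # P(x | theta_n)
        for z in (0, 1):
            posterior = joint(x, z, theta_n) / evidence         # P(z | x, theta_n)
            delta += posterior * math.log(
                joint(x, z, theta) / (posterior * evidence)
            )
    return log_likelihood(data, theta_n) + delta

data = [-0.4, 0.3, 1.9, 2.6, 3.1]           # made-up observations
theta_n = 1.0                                # made-up current estimate
grid = [i / 10.0 for i in range(-20, 61)]    # candidate values of theta
gaps = [log_likelihood(data, t) - lower_bound(data, t, theta_n) for t in grid]
print(min(gaps) >= -1e-12)
```

The gap `L - l` at each grid point is the Kullback-Leibler divergence between the label posteriors under `theta_n` and under the candidate `theta`, which is why it is nonnegative and vanishes at `theta = theta_n`.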

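Formally we have,

\begin{align}
\theta_{n+1}
&= \arg\max_{\theta} \left\{ l(\theta \mid \theta_n) \right\} \notag \\
&= \arg\max_{\theta} \left\{ L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(X \mid \theta_n) P(z \mid X, \theta_n)} \right\} \notag \\
&\quad \text{Now drop terms which are constant w.r.t. } \theta \notag \\
&= \arg\max_{\theta} \left\{ \sum_z P(z \mid X, \theta_n) \ln P(X \mid z, \theta) P(z \mid \theta) \right\} \notag \\
&= \arg\max_{\theta} \left\{ \sum_z P(z \mid X, \theta_n) \ln \left( \frac{P(X, z, \theta)}{P(z, \theta)} \cdot \frac{P(z, \theta)}{P(\theta)} \right) \right\} \notag \\
&= \arg\max_{\theta} \left\{ \sum_z P(z \mid X, \theta_n) \ln P(X, z \mid \theta) \right\} \notag \\
&= \arg\max_{\theta} \left\{ E_{Z \mid X, \theta_n} \left\{ \ln P(X, z \mid \theta) \right\} \right\}. \tag{17}
\end{align}

In Equation (17) the expectation and maximization steps are apparent. The EM algorithm thus consists of iterating the:

1. E-step: Determine the conditional expectation $E_{Z \mid X, \theta_n} \{ \ln P(X, z \mid \theta) \}$.
2. M-step: Maximize this expression with respect to $\theta$.

At this point it is fair to ask what has been gained given that we have simply traded the maximization of $L(\theta)$ for the maximization of $l(\theta \mid \theta_n)$. The answer lies in the fact that $l(\theta \mid \theta_n)$ takes into account the unobserved or missing data $Z$. In the case where we wish to estimate these variables the EM algorithm provides a framework for doing so. Also, as alluded to earlier, it may be convenient to introduce such hidden variables so that the maximization of $l(\theta \mid \theta_n)$ is simplified given knowledge of the hidden variables (as compared with a direct maximization of $L(\theta)$).

3.2 Convergence of the EM Algorithm

The convergence properties of the EM algorithm are discussed in detail by McLachlan and Krishnan [3]. In this section we discuss the general convergence of the algorithm. Recall that $\theta_{n+1}$ is the estimate for $\theta$ which maximizes the difference $\Delta(\theta \mid \theta_n)$. Starting with the current estimate for $\theta$, that is, $\theta_n$, we had that $\Delta(\theta_n \mid \theta_n) = 0$. Since $\theta_{n+1}$ is chosen to maximize $\Delta(\theta \mid \theta_n)$, we then have that $\Delta(\theta_{n+1} \mid \theta_n) \ge \Delta(\theta_n \mid \theta_n) = 0$, so for each iteration the likelihood $L(\theta)$ is nondecreasing.

When the algorithm reaches a fixed point for some $\theta_n$, the value $\theta_n$ maximizes $l(\theta \mid \theta_n)$. Since $L \ge l$ with equality at $\theta_n$, if $L$ and $l$ are differentiable at $\theta_n$ their derivatives agree there, and because $\theta_n$ maximizes $l$ that derivative is zero. Hence $\theta_n$ must be a stationary point of $L$. The stationary point need not, however, be a local maximum. In [3] it is shown that it is possible for the algorithm to converge to local minima or saddle points in unusual cases.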
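The E-step and M-step of Equation (17), together with the monotonicity argument of Section 3.2, can be illustrated with a small Python sketch. The model is an assumption made for this example only: a mixture of two unit-variance Gaussians with unknown means `mu0`, `mu1` and unknown mixing weight `w`, for which the M-step has a well-known closed form. The data, seed, and starting values are made up.

```python
import math
import random

def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, w, mu0, mu1):
    """L(theta) for theta = (w, mu0, mu1), with w = P(z = 1)."""
    return sum(
        math.log((1 - w) * normal_pdf(x, mu0) + w * normal_pdf(x, mu1))
        for x in data
    )

def em_step(data, w, mu0, mu1):
    """One EM iteration: returns the updated (w, mu0, mu1)."""
    # E-step: r_i = P(z_i = 1 | x_i, theta_n), the posterior label weights.
    resp = []
    for x in data:
        p0 = (1 - w) * normal_pdf(x, mu0)
        p1 = w * normal_pdf(x, mu1)
        resp.append(p1 / (p0 + p1))
    # M-step: closed-form maximizer of the expected complete-data
    # log likelihood for this particular model.
    n1 = sum(resp)
    n0 = len(data) - n1
    new_w = n1 / len(data)
    new_mu0 = sum((1 - r) * x for r, x in zip(resp, data)) / n0
    new_mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    return new_w, new_mu0, new_mu1

def run_em(data, w, mu0, mu1, iterations=30):
    """Iterate EM and record L(theta_n) after every step."""
    history = [log_likelihood(data, w, mu0, mu1)]
    for _ in range(iterations):
        w, mu0, mu1 = em_step(data, w, mu0, mu1)
        history.append(log_likelihood(data, w, mu0, mu1))
    return (w, mu0, mu1), history

random.seed(1)  # arbitrary seed, used only for reproducibility
data = [random.gauss(0.0, 1.0) for _ in range(150)]
data += [random.gauss(4.0, 1.0) for _ in range(100)]
theta, history = run_em(data, w=0.5, mu0=-1.0, mu1=1.0)
print(theta)
print(all(b >= a - 1e-9 for a, b in zip(history, history[1:])))
```

The recorded `history` should never decrease beyond rounding error, matching the argument that $\Delta(\theta_{n+1} \mid \theta_n) \ge 0$. A fixed iteration count is used here for simplicity; stopping when the increase in `history` falls below a tolerance is a common alternative.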

3.3 The Generalized EM Algorithm

In the formulation of the EM algorithm described above, $\theta_{n+1}$ was chosen as the value of $\theta$ for which $\Delta(\theta \mid \theta_n)$ was maximized. While this ensures the greatest increase in the bound on $L(\theta)$, it is however possible to relax the requirement of maximization to one of simply increasing $\Delta(\theta \mid \theta_n)$ so that $\Delta(\theta_{n+1} \mid \theta_n) \ge \Delta(\theta_n \mid \theta_n)$. This approach, to simply increase and not necessarily maximize $\Delta(\theta_{n+1} \mid \theta_n)$, is known as the Generalized Expectation Maximization (GEM) algorithm and is often useful in cases where the maximization is difficult. The convergence of the GEM algorithm can be argued as above.

References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, November 1977.
[2] R. C. Hardie, K. J. Barnard, and E. E. Armstrong. Joint MAP registration and high-resolution image estimation using a sequence of undersampled images. IEEE Transactions on Image Processing, 6(12):1621–1633, December 1997.
[3] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, New York, 1996.
[4] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, New York, 2000.
[5] Yair Weiss. Bayesian motion estimation and segmentation. PhD thesis, Massachusetts Institute of Technology, May 1998.