Machine Learning. Expectation Maximization. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349, Fall 2012



We've seen the update equations of the Gaussian mixture model (GMM), but: How are they derived? What is the general algorithm of EM?

General Settings for EM. Given data X = (x_1, ..., x_n), where x_i ~ p(x; θ). We want to maximize the log-likelihood log p(X; θ). p(X; θ) is difficult to maximize because it involves some latent variables Z. But maximizing the complete-data log-likelihood log p(X, Z; θ) would be easy (if we observed Z). We think of Z as missing data. In this scenario, EM gives an efficient method to maximize the likelihood p(X; θ) up to some local maximum.
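
To make the notation concrete, here is a minimal sketch (not from the slides) that samples data from a 1-D Gaussian mixture: X = (x_1, ..., x_n) is observed, while the component labels Z = (z_1, ..., z_n) are latent. The variable names (w, mu, sigma) and the specific values are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([0.3, 0.7])       # mixture weights, sum to 1
    mu = np.array([-2.0, 3.0])     # component means
    sigma = np.array([0.5, 1.0])   # component standard deviations

    n = 500
    z = rng.choice(len(w), size=n, p=w)   # latent assignments Z (hidden in practice)
    x = rng.normal(mu[z], sigma[z])       # observed data X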

Basic Idea of EM. Since maximizing the data log-likelihood log p(X; θ) is hard but maximizing the complete-data log-likelihood log p(X, Z; θ) is easy (if we observed Z), our best bet is to maximize the latter and hope that it also increases the former. (E step) Since we did not observe Z, we cannot maximize log p(X, Z; θ) directly. We instead consider its expected value under the posterior distribution of Z, computed using the old parameters. (M step) We then update the parameters θ to maximize this expected complete-data log-likelihood.

EM Algorithm in General
1. Initialize the parameters θ_old.
2. E step: evaluate the posterior distribution of the latent variables, p(Z | X; θ_old), using the old parameters. The expected complete-data log-likelihood under this distribution, a function of θ, is
       Q(θ; θ_old) = Σ_Z p(Z | X; θ_old) log p(X, Z; θ).
3. M step: update the parameters to maximize the expected complete-data log-likelihood:
       θ_new = argmax_θ Q(θ; θ_old).
4. Check the convergence criterion. Return to step 2 if it is not satisfied.
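
As a minimal sketch (not from the slides), steps 1-4 can be written as a generic loop. The callables e_step, m_step, and log_likelihood are assumptions standing in for whatever model-specific computations are plugged in.

    def run_em(theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
        ll_old = log_likelihood(theta)
        for _ in range(max_iter):
            posterior = e_step(theta)        # p(Z | X; theta_old)
            theta = m_step(posterior)        # argmax_theta Q(theta; theta_old)
            ll_new = log_likelihood(theta)   # monitor log p(X; theta)
            if ll_new - ll_old < tol:        # convergence criterion
                break
            ll_old = ll_new
        return theta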

GMM Revisited: E step. Evaluate the posterior distribution of the latent variables, p(Z | X; θ_old), using the old parameters. Since the data are i.i.d., we evaluate it for each i:
       q_i(j) ≜ p(z_i = j | x_i) = p(z_i = j, x_i) / p(x_i) = p(x_i | z_i = j) p(z_i = j) / Σ_{l=1}^k p(x_i | z_i = l) p(z_i = l),
where k is the number of mixture components, p(z_i = j) = w_j, and p(x_i | z_i = j) = (1 / (√(2π) σ_j)) exp(−(x_i − μ_j)² / (2σ_j²)).
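
Continuing the toy example above, a minimal sketch of this E step for a 1-D GMM (array shapes: x is (n,), the result q is (n, k)):

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        # (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    def e_step(x, w, mu, sigma):
        # unnormalized posterior p(x_i | z_i = j) p(z_i = j) for every i, j
        joint = w[None, :] * gauss_pdf(x[:, None], mu[None, :], sigma[None, :])
        # normalize over components j to get q_i(j) = p(z_i = j | x_i)
        return joint / joint.sum(axis=1, keepdims=True)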

GMM Revisited: E step. Then the expected complete-data log-likelihood under this distribution is
       Q(θ; θ_old) = Σ_{i=1}^n Σ_{j=1}^k p(z_i = j | x_i) log p(x_i, z_i = j; θ)
                   = Σ_{i=1}^n Σ_{j=1}^k q_i(j) log p(x_i, z_i = j; θ)
                   = Σ_{i=1}^n Σ_{j=1}^k q_i(j) log [ p(z_i = j) p(x_i | z_i = j) ]
                   = Σ_{i=1}^n Σ_{j=1}^k q_i(j) log [ w_j (1 / (√(2π) σ_j)) exp(−(x_i − μ_j)² / (2σ_j²)) ].
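
A minimal sketch (illustrative, building on the functions above) that evaluates Q(θ; θ_old) given the responsibilities q from the E step:

    import numpy as np

    def expected_complete_ll(x, q, w, mu, sigma):
        # log [ w_j * N(x_i; mu_j, sigma_j^2) ] for every i, j
        log_joint = np.log(w[None, :]) + np.log(gauss_pdf(x[:, None], mu[None, :], sigma[None, :]))
        # weight by q_i(j) and sum over i and j
        return np.sum(q * log_joint)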

GMM Revisited: M step (for μ_j). Update the parameters to maximize the expected complete-data log-likelihood
       Q(θ; θ_old) = Σ_{i=1}^n Σ_{j=1}^k q_i(j) log [ w_j (1 / (√(2π) σ_j)) exp(−(x_i − μ_j)² / (2σ_j²)) ].
For μ_j, set the derivative to 0:
       ∂Q(θ; θ_old) / ∂μ_j = Σ_{i=1}^n q_i(j) (x_i − μ_j) / σ_j² = 0.
We get the update equation for the means:
       μ_j = Σ_{i=1}^n q_i(j) x_i / Σ_{i=1}^n q_i(j).
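
The mean update as a short function (an illustrative continuation of the sketches above; q has shape (n, k)):

    def update_means(x, q):
        # mu_j = sum_i q_i(j) x_i / sum_i q_i(j)
        return (q * x[:, None]).sum(axis=0) / q.sum(axis=0)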

GMM Revisited: M step (for σ_j). The derivation is similar to the derivation for μ_j; try it yourself. We get the update equation for the variances:
       σ_j² = Σ_{i=1}^n q_i(j) (x_i − μ_j)² / Σ_{i=1}^n q_i(j).
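
The variance update follows the same pattern (illustrative; it uses the freshly updated means, as the joint M-step optimum requires):

    def update_variances(x, q, mu):
        # sigma_j^2 = sum_i q_i(j) (x_i - mu_j)^2 / sum_i q_i(j)
        return (q * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / q.sum(axis=0)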

GMM Revisited: M step (for w_j). For w_j, recall there is a constraint Σ_{j=1}^k w_j = 1. To maximize Q(θ; θ_old) w.r.t. w_j, we construct the Lagrangian
       Q̃(θ; θ_old) = Q(θ; θ_old) + β (Σ_{j=1}^k w_j − 1).
Set the derivative to 0:
       ∂Q̃(θ; θ_old) / ∂w_j = Σ_{i=1}^n q_i(j) / w_j + β = 0.

GMM Revisited: M step (for w_j). We get
       w_j = −Σ_{i=1}^n q_i(j) / β.
Summing over j and using Σ_{j=1}^k w_j = 1, we get
       β = −Σ_{j=1}^k Σ_{i=1}^n q_i(j) = −n.
So we get the update equation for the weights:
       w_j = (1/n) Σ_{i=1}^n q_i(j).
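
Putting the three updates together gives one M step (an illustrative sketch continuing the functions above):

    import numpy as np

    def m_step(x, q):
        n = len(x)
        w = q.sum(axis=0) / n                         # w_j = (1/n) sum_i q_i(j)
        mu = update_means(x, q)                       # means
        sigma = np.sqrt(update_variances(x, q, mu))   # standard deviations
        return w, mu, sigma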

More Theoretical Questions. We've seen how the update equations of the GMM are derived from the EM algorithm. But do these equations really work? Will the data likelihood be maximized (at least up to some local maximum)? Will the algorithm converge?

Answers. We will show that the data log-likelihood never decreases from one iteration to the next. We also know that the log-likelihood (the log of a probability) is bounded above by 0. Therefore, the EM algorithm always converges!
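
A minimal end-to-end sketch on the toy data and functions from the earlier sketches, checking numerically that the data log-likelihood never decreases across iterations (the initialization values are arbitrary assumptions, not from the slides):

    import numpy as np

    def data_log_likelihood(x, w, mu, sigma):
        # log p(X; theta) = sum_i log sum_j w_j N(x_i; mu_j, sigma_j^2)
        joint = w[None, :] * gauss_pdf(x[:, None], mu[None, :], sigma[None, :])
        return np.log(joint.sum(axis=1)).sum()

    w_est = np.array([0.5, 0.5])
    mu_est = np.array([-1.0, 1.0])
    sigma_est = np.array([1.0, 1.0])
    ll_prev = -np.inf
    for _ in range(50):
        q = e_step(x, w_est, mu_est, sigma_est)
        w_est, mu_est, sigma_est = m_step(x, q)
        ll = data_log_likelihood(x, w_est, mu_est, sigma_est)
        assert ll >= ll_prev - 1e-9   # monotonically non-decreasing
        ll_prev = ll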

Expected complete-data log-likelihood increases. Recall that in the M step, we maximize the expected complete-data log-likelihood:
       θ_new = argmax_θ Q(θ; θ_old) = argmax_θ Σ_Z p(Z | X; θ_old) log p(X, Z; θ).
Therefore
       Σ_Z p(Z | X; θ_old) log p(X, Z; θ_new) ≥ Σ_Z p(Z | X; θ_old) log p(X, Z; θ_old).

From the previous slide,
       Σ_Z p(Z | X; θ_old) log p(X, Z; θ_new) ≥ Σ_Z p(Z | X; θ_old) log p(X, Z; θ_old).
Dividing the argument of each logarithm by p(Z | X; θ_old) (which subtracts the same term from both sides), we also have
       Σ_Z p(Z | X; θ_old) log [ p(X, Z; θ_new) / p(Z | X; θ_old) ] ≥ Σ_Z p(Z | X; θ_old) log [ p(X, Z; θ_old) / p(Z | X; θ_old) ].
Since p(X, Z; θ_old) / p(Z | X; θ_old) = p(X; θ_old), the right-hand side equals
       Σ_Z p(Z | X; θ_old) log p(X; θ_old) = log p(X; θ_old),
the data log-likelihood using the old parameters. Therefore
       Σ_Z p(Z | X; θ_old) log [ p(X, Z; θ_new) / p(Z | X; θ_old) ] ≥ log p(X; θ_old).   (1)

Jensen's Inequality. Let f be a convex function and X be a random variable; then E[f(X)] ≥ f(E[X]). Since log(·) is a concave function, the inequality is reversed. Taking the expectation under p(Z | X; θ_old) and the random variable to be p(X, Z; θ_new) / p(Z | X; θ_old), we have
       Σ_Z p(Z | X; θ_old) log [ p(X, Z; θ_new) / p(Z | X; θ_old) ] ≤ log Σ_Z p(Z | X; θ_old) [ p(X, Z; θ_new) / p(Z | X; θ_old) ].
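
A quick numerical sanity check (illustrative, not from the slides) of Jensen's inequality for the concave log, E[log Y] ≤ log E[Y], using an arbitrary positive random variable:

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.gamma(shape=2.0, scale=1.0, size=100_000)     # any positive random variable
    print(np.mean(np.log(y)), "<=", np.log(np.mean(y)))   # approx 0.42 <= 0.69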

Continue. The right-hand side of Jensen's inequality simplifies:
       log Σ_Z p(Z | X; θ_old) [ p(X, Z; θ_new) / p(Z | X; θ_old) ] = log Σ_Z p(X, Z; θ_new) = log p(X; θ_new),
the data log-likelihood using the new parameters. Therefore
       Σ_Z p(Z | X; θ_old) log [ p(X, Z; θ_new) / p(Z | X; θ_old) ] ≤ log p(X; θ_new).   (2)

Finally. Putting (1) and (2) together, we get
       log p(X; θ_old) ≤ Σ_Z p(Z | X; θ_old) log [ p(X, Z; θ_new) / p(Z | X; θ_old) ] ≤ log p(X; θ_new).
The data log-likelihood monotonically increases!

You Should Now Know. The EM algorithm is an efficient way to do maximum likelihood estimation when there are latent variables or missing data. The general EM algorithm: E step: calculate the posterior distribution of the latent variables; M step: update the parameters by maximizing the expected complete-data log-likelihood. How to derive the EM update equations of the GMM. Can you derive the EM update equations for parameter estimation of a mixture of categorical distributions? Why does EM always converge (to some local optimum)?