A gentle introduction to Expectation Maximization




A gentle introduction to Expectation Maximization
Mark Johnson, Brown University
November 2009

Outline

- What is Expectation Maximization?
- Mixture models and clustering
- EM for sentence topic modeling

Why Expectation Maximization?

Expectation Maximization (EM) is a general approach to solving problems involving hidden or latent variables Y.

Goal: learn the parameter vector θ of a model P_θ(X, Y) from training data D = (x_1, ..., x_n) consisting of samples from P_θ(X), i.e., Y is hidden.

Maximum likelihood estimate using D:

    θ̂ = argmax_θ L_D(θ) = argmax_θ ∏_{i=1}^{n} Σ_{y∈Y} P_θ(x_i, y)

EM is useful when directly optimizing L_D(θ) is intractable, but computing the MLE from fully observed data D = ((x_1, y_1), ..., (x_n, y_n)) is easy.


Mixture models and clustering

A mixture model is a linear combination of models:

    P(X = x) = Σ_{y∈Y} P(Y = y) P(X = x | Y = y)

where:

- y ∈ Y identifies the mixture component,
- P(y) is the probability of generating mixture component y, and
- P(x | y) is the distribution associated with mixture component y.

In clustering, Y = {1, ..., m} are the cluster labels. After learning P(y) and P(x | y), compute cluster probabilities for a data item x_i as follows:

    P(Y = y | X = x_i) = P(Y = y) P(X = x_i | Y = y) / Σ_{y'∈Y} P(Y = y') P(X = x_i | Y = y')

Mixtures of multinomials (1)

Y = {1, ..., m}, i.e., m different clusters:

- Y is the coin identity in the coin-tossing game
- Y is the sentence topic in the sentence clustering application

X = U^l, i.e., each observation is a sequence x = (u_1, ..., u_l), where each u_k ∈ U:

- U = {H, T}, and x is one sequence of coin tosses from the same (unknown) coin
- U is the vocabulary, and x is a sentence (a sequence of words)

Assume each u_k is generated i.i.d. given y, so the model has parameters:

- P(Y = y) = π_y, i.e., the probability of picking cluster y
- P(U_k = u | Y = y) = φ_{u|y}, i.e., the probability of generating u in cluster y

Mixtures of multinomials (2)

    P(Y = y) = π_y
    P(U_k = u | Y = y) = φ_{u|y}

    P(X = x, Y = y) = π_y ∏_{k=1}^{l} φ_{u_k|y} = π_y ∏_{u∈U} φ_{u|y}^{c_u(x)}

where x = (u_1, ..., u_l) and c_u(x) is the number of times u appears in x.
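The two product forms above are equivalent: multiplying one factor per position gives the same number as grouping repeated symbols into counts. A minimal Python sketch checking this numerically (the helper names are illustrative, not from the slides):

```python
import math
from collections import Counter

def joint_sequence(x, pi_y, phi_y):
    """pi_y * product over positions k of phi_{u_k|y}."""
    p = pi_y
    for u in x:
        p *= phi_y[u]
    return p

def joint_counts(x, pi_y, phi_y):
    """Equivalent counts form: pi_y * product over u of phi_{u|y} ** c_u(x)."""
    c = Counter(x)  # c[u] = number of times u appears in x
    return pi_y * math.prod(phi_y[u] ** c[u] for u in c)
```

Both functions return P(X = x, Y = y) for a single cluster y; the counts form is what the estimation slides below work with.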

Coin-tossing example

    π_1 = π_2 = 0.5
    φ_{H|1} = 0.1;  φ_{T|1} = 0.9
    φ_{H|2} = 0.8;  φ_{T|2} = 0.2

    P(X = HTHH, Y = 1) = π_1 φ_{H|1}^3 φ_{T|1}^1 = 0.00045
    P(X = HTHH, Y = 2) = π_2 φ_{H|2}^3 φ_{T|2}^1 = 0.0512
    P(X = HTHH) = π_1 φ_{H|1}^3 φ_{T|1}^1 + π_2 φ_{H|2}^3 φ_{T|2}^1 = 0.05165

so:

    P(Y = 1 | X = HTHH) = P(X = HTHH, Y = 1) / P(X = HTHH) = 0.008712
    P(Y = 2 | X = HTHH) = 0.9913
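The example can be reproduced in a few lines of Python (variable names are illustrative):

```python
# Parameters from the coin-tossing example
pi = {1: 0.5, 2: 0.5}
phi = {1: {"H": 0.1, "T": 0.9}, 2: {"H": 0.8, "T": 0.2}}

def joint(x, y):
    """P(X = x, Y = y) = pi_y * product over tosses u in x of phi_{u|y}."""
    p = pi[y]
    for u in x:
        p *= phi[y][u]
    return p

x = "HTHH"
marginal = joint(x, 1) + joint(x, 2)   # P(X = HTHH) = 0.05165
posterior1 = joint(x, 1) / marginal    # P(Y = 1 | X = HTHH)
posterior2 = joint(x, 2) / marginal    # P(Y = 2 | X = HTHH)
```

The two posteriors sum to one, as they must.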

Estimation from visible data

Given visible data, how would we estimate π and φ?

Data D = ((x_1, y_1), ..., (x_n, y_n)), where each x_i = (u_{i,1}, ..., u_{i,l_i}).

Sufficient statistics for estimating the multinomial mixture:

- n_y = Σ_{i=1}^{n} I(y = y_i), i.e., the number of times cluster y is seen
- n_{u,y} = Σ_{i=1}^{n} c_u(x_i) I(y = y_i), i.e., the number of times u is seen in cluster y, where c_u(x) is the number of times u appears in x

Maximum likelihood estimates:

    π_y = n_y / n
    φ_{u|y} = n_{u,y} / Σ_{u'∈U} n_{u',y}
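The visible-data MLE is just counting and normalizing. A minimal sketch (function and variable names are illustrative; it assumes every cluster in `clusters` appears at least once in the data):

```python
from collections import Counter

def mle_visible(data, clusters, vocab):
    """MLE of pi and phi from fully observed (x_i, y_i) pairs.

    data: list of (sentence, label) pairs, each sentence a sequence over vocab.
    """
    n = len(data)
    n_y = Counter()    # n_y[y]      = number of items labelled y
    n_uy = Counter()   # n_uy[u, y]  = count of symbol u across items labelled y
    for x, y in data:
        n_y[y] += 1
        for u in x:
            n_uy[u, y] += 1
    pi = {y: n_y[y] / n for y in clusters}
    phi = {y: {u: n_uy[u, y] / sum(n_uy[v, y] for v in vocab) for u in vocab}
           for y in clusters}
    return pi, phi
```

For example, `mle_visible([("HT", 1), ("HH", 2)], {1, 2}, {"H", "T"})` gives π = {1: 0.5, 2: 0.5} and φ_{H|2} = 1.0.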

Estimation from hidden data (1)

Data D = (x_1, ..., x_n), where each x_i = (u_{i,1}, ..., u_{i,l_i}).

Log likelihood of the hidden data:

    log L_D(π, φ) = Σ_{i=1}^{n} log Σ_{y∈Y} π_y ∏_{u∈U} φ_{u|y}^{c_u(x_i)}

Imposing Lagrange multipliers and setting the derivatives to zero, we can show:

    π_y = E[n_y] / n
    φ_{u|y} = E[n_{u,y}] / Σ_{u'∈U} E[n_{u',y}]

where:

    E[n_y] = Σ_{i=1}^{n} P_{π,φ}(Y = y | X = x_i)
    E[n_{u,y}] = Σ_{i=1}^{n} c_u(x_i) P_{π,φ}(Y = y | X = x_i)

Estimation from hidden data (2)

    π_y = E[n_y] / n
    φ_{u|y} = E[n_{u,y}] / Σ_{u'∈U} E[n_{u',y}]

where:

    E[n_y] = Σ_{i=1}^{n} P_{π,φ}(Y = y | X = x_i)
    E[n_{u,y}] = Σ_{i=1}^{n} c_u(x_i) P_{π,φ}(Y = y | X = x_i)

Unlike in the visible-data case, these are not a closed-form solution for π or φ, since E[n_y] and E[n_{u,y}] themselves involve π and φ. But they do suggest a fixed-point calculation procedure.

EM for multinomial mixtures

Guess initial values π^(0) and φ^(0). For iterations t = 1, 2, 3, ... do:

E-step: calculate the expected values of the sufficient statistics:

    E[n_y] = Σ_{i=1}^{n} P_{π^(t-1),φ^(t-1)}(Y = y | X = x_i)
    E[n_{u,y}] = Σ_{i=1}^{n} c_u(x_i) P_{π^(t-1),φ^(t-1)}(Y = y | X = x_i)

M-step: update the model based on the expected sufficient statistics:

    π_y^(t) = E[n_y] / n
    φ_{u|y}^(t) = E[n_{u,y}] / Σ_{u'∈U} E[n_{u',y}]
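The E-step/M-step loop above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: names are made up, there is no smoothing, and the initialization follows the homework hints (uniform π, φ perturbed by small noise to break symmetry):

```python
import math
import random
from collections import Counter

def em_multinomial_mixture(data, m, vocab, iters=50, seed=0):
    """EM for a mixture of multinomials over sequences from vocab.

    data: list of sentences (sequences over vocab); m: number of clusters.
    Returns pi, phi, and the per-iteration log likelihoods.
    """
    rng = random.Random(seed)
    clusters = range(m)
    pi = {y: 1.0 / m for y in clusters}                    # uniform pi^(0)
    phi = {y: {u: 1.0 / len(vocab) + 1e-4 * rng.random()   # small noise
               for u in vocab} for y in clusters}          # breaks symmetry
    for y in clusters:                                     # renormalize phi^(0)
        z = sum(phi[y].values())
        phi[y] = {u: p / z for u, p in phi[y].items()}

    log_likelihoods = []
    for _ in range(iters):
        # E-step: expected sufficient statistics E[n_y] and E[n_{u,y}]
        E_ny, E_nuy, ll = Counter(), Counter(), 0.0
        for x in data:
            counts = Counter(x)                            # c_u(x)
            joint = {y: pi[y] * math.prod(phi[y][u] ** c
                                          for u, c in counts.items())
                     for y in clusters}
            z = sum(joint.values())                        # P(X = x)
            ll += math.log(z)
            for y in clusters:
                post = joint[y] / z                        # P(Y = y | X = x)
                E_ny[y] += post
                for u, c in counts.items():
                    E_nuy[u, y] += c * post
        log_likelihoods.append(ll)
        # M-step: re-estimate pi and phi from the expected counts
        n = len(data)
        pi = {y: E_ny[y] / n for y in clusters}
        phi = {y: {u: E_nuy[u, y] / sum(E_nuy[v, y] for v in vocab)
                   for u in vocab} for y in clusters}
    return pi, phi, log_likelihoods
```

Run on coin-toss sequences, the per-iteration log likelihoods should be non-decreasing, which is the sanity check the homework hints recommend.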

Summary of the model

    P_{π,φ}(X = x, Y = y) = π_y ∏_{u∈U} φ_{u|y}^{c_u(x)}

    P(Y = y | X = x) = P(Y = y, X = x) / Σ_{y'∈Y} P(Y = y', X = x)

where c_u(x) is the number of times u appears in x.


Homework hints

- The fact that different sentences have different lengths doesn't affect the calculation.
- c_u(x_i) is the number of times word u appears in sentence x_i.
- You can initialize π^(0) with a uniform distribution, but you'll need to initialize φ^(0) to break symmetry, e.g., by adding a random number of about 10^-4 to each entry.
- You should compute the log likelihood at each iteration (it's easy to do this as a by-product of the expectation calculations).
- There is a theorem that says the log likelihood never decreases on each EM step. If your log likelihood decreases, then you have a bug!
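The last hint is easy to automate. A small sketch (the helper name is made up) that raises if the log likelihood ever drops between iterations:

```python
def check_monotone(log_likelihoods, tol=1e-9):
    """Sanity check from the hints: EM's log likelihood never decreases.

    tol absorbs tiny floating-point wobble; anything larger is a real bug.
    """
    drops = [(t, a, b)
             for t, (a, b) in enumerate(zip(log_likelihoods,
                                            log_likelihoods[1:]), start=1)
             if b < a - tol]
    if drops:
        raise AssertionError(f"log likelihood decreased at iterations {drops}")
```

Call it on the list of per-iteration log likelihoods after your EM run finishes.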