Mixture Models. Jia Li. Department of Statistics, The Pennsylvania State University.

Mixture Models. Jia Li. Department of Statistics, The Pennsylvania State University. Email: jiali@stat.psu.edu

Clustering by Mixture Models. General background on clustering. Example method: k-means. Mixture model based clustering. Model estimation.

Clustering A basic tool in data mining/pattern recognition: Divide a set of data into groups. Samples in one cluster are close and clusters are far apart.

Motivations: Discover classes of data in an unsupervised way (unsupervised learning). Efficient representation of data: fast retrieval, data complexity reduction. Various engineering purposes: tightly linked with pattern recognition.

Approaches to Clustering Represent samples by feature vectors. Define a distance measure to assess the closeness between data. Closeness can be measured in many ways. Define distance based on various norms. For gene expression levels in a set of micro-array data, closeness between genes may be measured by the Euclidean distance between the gene profile vectors, or by correlation.
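To make the two notions of closeness concrete, here is a minimal Python sketch (an illustration added to these notes, not part of the original slides) comparing a Euclidean distance with a correlation-based distance for two hypothetical gene expression profiles; the vectors gene_a and gene_b are made up for the example.

import numpy as np

# Hypothetical expression profiles for two genes across five arrays.
gene_a = np.array([2.1, 0.5, 1.8, 3.0, 0.2])
gene_b = np.array([1.9, 0.7, 2.0, 2.7, 0.4])

euclidean = np.linalg.norm(gene_a - gene_b)       # closeness via the Euclidean norm
correlation = np.corrcoef(gene_a, gene_b)[0, 1]   # closeness via correlation
corr_distance = 1.0 - correlation                 # a common correlation-based distance

print(euclidean, corr_distance)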

Approaches: Define an objective function to assess the quality of clustering and optimize the objective function (purely computational). Clustering can be performed based merely on pair-wise distances. How each sample is represented does not come into the picture. Statistical model based clustering.

K-means. Assume there are M clusters with centroids Z = {z_1, z_2, ..., z_M}. Each training sample is assigned to one of the clusters. Denote the assignment function by η(·). Then η(i) = j means the ith training sample is assigned to the jth cluster. Goal: minimize the total mean squared error between the training samples and their representative cluster centroids, that is, the trace of the pooled within-cluster covariance matrix:

\arg\min_{Z, η} \sum_{i=1}^{N} \| x_i - z_{η(i)} \|^2

Denote the objective function by

L(Z, η) = \sum_{i=1}^{N} \| x_i - z_{η(i)} \|^2.

Intuition: training samples are tightly clustered around the centroids. Hence, the centroids serve as a compact representation for the training data.

Necessary Conditions. If Z is fixed, the optimal assignment function η(·) should follow the nearest neighbor rule, that is,

η(i) = \arg\min_{j ∈ \{1, 2, ..., M\}} \| x_i - z_j \|.

If η(·) is fixed, the cluster centroid z_j should be the average of all the samples assigned to the jth cluster:

z_j = \frac{\sum_{i: η(i) = j} x_i}{N_j},

where N_j is the number of samples assigned to cluster j.

The Algorithm. Based on the necessary conditions, the k-means algorithm alternates between the two steps: For a fixed set of centroids, optimize η(·) by assigning each sample to its closest centroid using Euclidean distance. Update each centroid by computing the average of all the samples assigned to it. The algorithm converges because the objective function is non-increasing after each iteration. It usually converges fast. Stopping criterion: the ratio between the decrease and the objective function falls below a threshold.
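The following Python sketch (an illustration written for these notes, not the original course code) implements the two alternating steps above with plain NumPy; the random initialization and the default tolerance are choices made here, not prescribed by the slides.

import numpy as np

def kmeans(x, M, max_iter=100, tol=1e-6, seed=0):
    """Alternate the two necessary-condition steps; x is an (N, d) data matrix."""
    rng = np.random.default_rng(seed)
    Z = x[rng.choice(len(x), size=M, replace=False)].copy()  # initial centroids: M random samples
    prev_obj = np.inf
    for _ in range(max_iter):
        # Assignment step: nearest-centroid rule under Euclidean distance.
        d2 = ((x[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (N, M) squared distances
        eta = d2.argmin(axis=1)
        obj = d2[np.arange(len(x)), eta].sum()                    # current value of L(Z, eta)
        # Stopping criterion: relative decrease of the objective below tol.
        if np.isfinite(prev_obj) and (prev_obj - obj) / prev_obj < tol:
            break
        prev_obj = obj
        # Update step: each centroid becomes the mean of its assigned samples.
        for j in range(M):
            members = x[eta == j]
            if len(members) > 0:
                Z[j] = members.mean(axis=0)
    return Z, eta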

Mixture Model-based Clustering Each cluster is mathematically represented by a parametric distribution. Examples: Gaussian (continuous), Poisson (discrete). The entire data set is modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.

Suppose there are K components (clusters). Each component k is a Gaussian distribution parameterized by µ_k, Σ_k. Denote the data by X, X ∈ R^d. The density of component k is

f_k(x) = φ(x | µ_k, Σ_k) = \frac{1}{\sqrt{(2π)^d |Σ_k|}} \exp\left( -\frac{(x - µ_k)^t Σ_k^{-1} (x - µ_k)}{2} \right).

The prior probability (weight) of component k is a_k. The mixture density is:

f(x) = \sum_{k=1}^{K} a_k f_k(x) = \sum_{k=1}^{K} a_k φ(x | µ_k, Σ_k).
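As a small illustration of the mixture density (code added here, not from the slides), the following Python sketch evaluates f(x) for a hypothetical two-component Gaussian mixture in R^2; the weights, means, and covariances are made up.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """Evaluate f(x) = sum_k a_k * phi(x | mu_k, Sigma_k) at the rows of x."""
    return sum(a * multivariate_normal(mean=m, cov=S).pdf(x)
               for a, m, S in zip(weights, means, covs))

weights = [0.6, 0.4]                          # a_1, a_2
means = [np.zeros(2), np.array([3.0, 3.0])]   # mu_1, mu_2
covs = [np.eye(2), 0.5 * np.eye(2)]           # Sigma_1, Sigma_2
print(mixture_density(np.array([[0.0, 0.0], [3.0, 3.0]]), weights, means, covs))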

Advantages. A mixture model with high likelihood tends to have the following traits: Component distributions have high peaks (data in one cluster are tight). The mixture model covers the data well (dominant patterns in the data are captured by component distributions).

Advantages Well-studied statistical inference techniques available. Flexibility in choosing the component distributions. Obtain a density estimation for each cluster. A soft classification is available.

EM Algorithm. The parameters are estimated by the maximum likelihood (ML) criterion using the EM algorithm. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977. The EM algorithm provides an iterative computation of maximum likelihood estimation when the observed data are incomplete.

Incompleteness can be conceptual. We need to estimate the distribution of X, in sample space 𝒳, but we can only observe X indirectly through Y, in sample space 𝒴. In many cases, there is a mapping x → y(x) from 𝒳 to 𝒴, and x is only known to lie in a subset of 𝒳, denoted by 𝒳(y), which is determined by the equation y = y(x). The distribution of X is parameterized by a family of distributions f(x | θ), with parameters θ ∈ Ω, on x. The distribution of y, g(y | θ), is

g(y | θ) = \int_{𝒳(y)} f(x | θ) \, dx.
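To connect this abstract setup with the clustering problem treated below (a worked specialization added here, using the mixture notation introduced on later slides): for a mixture model, the observed sample x_i plays the role of the incomplete observation, the missing part is its component label y_i, and the complete datum is the pair (x_i, y_i). The integral above then reduces to a sum over the unobserved label:

f(x_i, y_i | θ) = a_{y_i} φ(x_i | µ_{y_i}, Σ_{y_i}), \qquad
g(x_i | θ) = \sum_{k=1}^{K} a_k φ(x_i | µ_k, Σ_k).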

The EM algorithm aims at finding a θ that maximizes g(y | θ) given an observed y. Introduce the function

Q(θ' | θ) = E( \log f(x | θ') | y, θ ),

that is, the expected value of log f(x | θ') according to the conditional distribution of x given y and parameter θ. The expectation is assumed to exist for all pairs (θ', θ). In particular, it is assumed that f(x | θ) > 0 for θ ∈ Ω.

EM Iteration:
E-step: Compute Q(θ | θ^{(p)}).
M-step: Choose θ^{(p+1)} to be a value of θ ∈ Ω that maximizes Q(θ | θ^{(p)}).

EM for the Mixture of Normals. Observed data (incomplete): {x_1, x_2, ..., x_n}, where n is the sample size. Denote all the samples collectively by x. Complete data: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i is the cluster (component) identity of sample x_i. The collection of parameters, θ, includes: a_k, µ_k, Σ_k, k = 1, 2, ..., K. The likelihood function is:

L(x | θ) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} a_k φ(x_i | µ_k, Σ_k) \right).

L(x | θ) is the objective function of the EM algorithm (maximize). The numerical difficulty comes from the sum inside the log.
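As a small numerical aside (an illustration added here, not part of the slides), the likelihood above can be evaluated stably with the log-sum-exp trick, precisely because the sum sits inside the log; the sketch assumes the parameters are held as lists/arrays as in the earlier density example.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(x, a, mu, sigma):
    """L(x | theta) = sum_i log( sum_k a_k * phi(x_i | mu_k, Sigma_k) )."""
    K = len(a)
    # log of each term a_k * phi(x_i | mu_k, Sigma_k); shape (n, K)
    log_terms = np.column_stack(
        [np.log(a[k]) + multivariate_normal(mu[k], sigma[k]).logpdf(x) for k in range(K)])
    return logsumexp(log_terms, axis=1).sum()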

The Q function is:

Q(θ' | θ) = E\left[ \log \prod_{i=1}^{n} a'_{y_i} φ(x_i | µ'_{y_i}, Σ'_{y_i}) \mid x, θ \right]
= E\left[ \sum_{i=1}^{n} \left( \log a'_{y_i} + \log φ(x_i | µ'_{y_i}, Σ'_{y_i}) \right) \mid x, θ \right]
= \sum_{i=1}^{n} E\left[ \log a'_{y_i} + \log φ(x_i | µ'_{y_i}, Σ'_{y_i}) \mid x_i, θ \right].

The last equality comes from the fact that the samples are independent.

Note that when x_i is given, only y_i is random in the complete data (x_i, y_i). Also, y_i only takes a finite number of values, i.e., the cluster identities 1 to K. The distribution of Y given X = x_i is the posterior probability of Y given X. Denote the posterior probability of Y = k, k = 1, ..., K, given x_i by p_{i,k}. By the Bayes formula, the posterior probabilities satisfy

p_{i,k} ∝ a_k φ(x_i | µ_k, Σ_k), \qquad \sum_{k=1}^{K} p_{i,k} = 1,

that is, p_{i,k} = a_k φ(x_i | µ_k, Σ_k) / \sum_{l=1}^{K} a_l φ(x_i | µ_l, Σ_l).

Then each summand in Q(θ' | θ) is

E\left[ \log a'_{y_i} + \log φ(x_i | µ'_{y_i}, Σ'_{y_i}) \mid x_i, θ \right]
= \sum_{k=1}^{K} p_{i,k} \log a'_k + \sum_{k=1}^{K} p_{i,k} \log φ(x_i | µ'_k, Σ'_k).

Note that we cannot see the direct effect of θ in the above equation, but the p_{i,k} are computed using θ, i.e., the current parameters, while θ' contains the updated parameters.

We then have:

Q(θ' | θ) = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a'_k + \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log φ(x_i | µ'_k, Σ'_k).

Note that the prior probabilities a'_k and the parameters of the Gaussian components µ'_k, Σ'_k can be optimized separately.

The a'_k are subject to \sum_{k=1}^{K} a'_k = 1. Basic optimization theory shows that they are optimized by

a'_k = \frac{\sum_{i=1}^{n} p_{i,k}}{n}.

The optimization of µ'_k and Σ'_k is simply a maximum likelihood estimation of the parameters using the samples x_i with weights p_{i,k}. Basic optimization techniques lead to

µ'_k = \frac{\sum_{i=1}^{n} p_{i,k} x_i}{\sum_{i=1}^{n} p_{i,k}}, \qquad
Σ'_k = \frac{\sum_{i=1}^{n} p_{i,k} (x_i - µ'_k)(x_i - µ'_k)^t}{\sum_{i=1}^{n} p_{i,k}}.

After every iteration, the likelihood function L is guaranteed not to decrease (the increase may not be strict). We have derived the EM algorithm for a mixture of Gaussians.
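For completeness, here is the short constrained-optimization step behind the update of a'_k (a worked derivation added to these notes, not shown on the slide). Maximize \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a'_k subject to \sum_{k=1}^{K} a'_k = 1. The Lagrangian is

\sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a'_k + λ \left( 1 - \sum_{k=1}^{K} a'_k \right).

Setting the derivative with respect to a'_k to zero gives \sum_{i=1}^{n} p_{i,k} / a'_k = λ, so a'_k ∝ \sum_{i=1}^{n} p_{i,k}; since \sum_{k=1}^{K} p_{i,k} = 1 for every i, the constraint forces λ = n and hence a'_k = \frac{1}{n} \sum_{i=1}^{n} p_{i,k}.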

EM Algorithm for the Mixture of Gaussians. Parameters estimated at the pth iteration are marked by a superscript (p).
1. Initialize the parameters.
2. E-step: Compute the posterior probabilities for all i = 1, ..., n, k = 1, ..., K:

p_{i,k} = \frac{a_k^{(p)} φ(x_i | µ_k^{(p)}, Σ_k^{(p)})}{\sum_{l=1}^{K} a_l^{(p)} φ(x_i | µ_l^{(p)}, Σ_l^{(p)})}.

3. M-step:

a_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k}}{n}, \qquad
µ_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k} x_i}{\sum_{i=1}^{n} p_{i,k}}, \qquad
Σ_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k} (x_i - µ_k^{(p+1)})(x_i - µ_k^{(p+1)})^t}{\sum_{i=1}^{n} p_{i,k}}.

4. Repeat steps 2 and 3 until convergence.
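The sketch below (illustrative code written for these notes, not the course's implementation) runs the E- and M-steps above with NumPy and SciPy; the simple random initialization, the fixed number of iterations, and the small ridge added to each covariance for numerical stability are assumptions made here.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture; x has shape (n, d)."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    a = np.full(K, 1.0 / K)                                # equal initial weights
    mu = x[rng.choice(n, size=K, replace=False)].copy()    # K random samples as initial means
    sigma = np.array([np.cov(x, rowvar=False) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: posterior probabilities p[i, k].
        dens = np.column_stack([a[k] * multivariate_normal(mu[k], sigma[k]).pdf(x)
                                for k in range(K)])
        p = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates of a_k, mu_k, Sigma_k.
        nk = p.sum(axis=0)
        a = nk / n
        mu = (p.T @ x) / nk[:, None]
        for k in range(K):
            xc = x - mu[k]
            sigma[k] = (p[:, k, None] * xc).T @ xc / nk[k] + 1e-6 * np.eye(d)
    return a, mu, sigma, p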

Comment: for mixtures of other distributions, the EM algorithm is very similar. The E-step involves computing the posterior probabilities. Only the particular distribution φ needs to be changed. The M-step always involves parameter optimization. Formulas differ according to distributions.

Computation Issues. If a different Σ_k is allowed for each component, the likelihood function is not bounded. The global optimum is meaningless. (Don't overdo it!) How to initialize? Example: Apply k-means first. Initialize µ_k and Σ_k using all the samples classified to cluster k. Initialize a_k by the proportion of data assigned to cluster k by k-means. In practice, we may want to reduce model complexity by putting constraints on the parameters, for instance, assume equal priors and identical covariance matrices for all the components.
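One possible way to code this initialization, reusing the kmeans and em_gmm sketches shown earlier (again an illustration; it assumes every cluster receives at least two samples so that the within-cluster covariance is defined):

import numpy as np

def init_from_kmeans(x, K, Z, eta):
    """Turn a k-means result (centroids Z, assignments eta) into GMM starting values."""
    d = x.shape[1]
    a0 = np.array([(eta == k).mean() for k in range(K)])       # proportion assigned to cluster k
    mu0 = Z.copy()                                             # cluster centroids as initial means
    sigma0 = np.array([np.cov(x[eta == k], rowvar=False) + 1e-6 * np.eye(d)
                       for k in range(K)])                     # within-cluster covariances
    return a0, mu0, sigma0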

Examples. The heart disease data set is taken from the UCI machine learning database repository. There are 297 cases (samples) in the data set, of which 137 have heart disease. Each sample contains 13 quantitative variables, including cholesterol, max heart rate, etc. We remove the mean of each variable and normalize it to yield unit variance. The data are projected onto the plane spanned by the two most dominant principal component directions. A two-component Gaussian mixture is fit.
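The pipeline described above could be reproduced roughly as follows (a hedged sketch using scikit-learn rather than the hand-derived EM; the file name heart_disease_features.csv is hypothetical, and loading/cleaning the UCI data is not shown):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# x is assumed to be a (297, 13) array of the quantitative variables.
x = np.loadtxt("heart_disease_features.csv", delimiter=",")   # hypothetical file name

x_std = StandardScaler().fit_transform(x)          # zero mean, unit variance per variable
x_2d = PCA(n_components=2).fit_transform(x_std)    # two dominant principal components
gmm = GaussianMixture(n_components=2).fit(x_2d)    # two-component Gaussian mixture
labels = gmm.predict(x_2d)                         # hard cluster assignments
posteriors = gmm.predict_proba(x_2d)               # soft assignments p_{i,k}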

The heart disease data set and the estimated cluster densities. Left: The scatter plot of the data. Right: The contour plot of the pdf estimated using a single-layer mixture of two normals. The thick lines are the boundaries between the two clusters based on the estimated pdfs of the individual clusters.

Classification Likelihood. The likelihood

L(x | θ) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} a_k φ(x_i | µ_k, Σ_k) \right)

maximized by the EM algorithm is sometimes called the mixture likelihood. Maximization can also be applied to the classification likelihood. Denote the collection of cluster identities of all the samples by y = {y_1, ..., y_n}. Then

L(x | θ, y) = \sum_{i=1}^{n} \log\left( a_{y_i} φ(x_i | µ_{y_i}, Σ_{y_i}) \right).

The cluster identities y_i, i = 1, ..., n, are treated as parameters together with θ and are part of the estimation. To maximize L, the EM algorithm can be modified to yield an ascending algorithm. This modified version is called Classification EM (CEM).

Classification EM. A classification step is inserted between the E-step and the M-step.
Initialize the parameters.
E-step: Compute the posterior probabilities for all i = 1, ..., n, k = 1, ..., K:

p_{i,k} = \frac{a_k^{(p)} φ(x_i | µ_k^{(p)}, Σ_k^{(p)})}{\sum_{l=1}^{K} a_l^{(p)} φ(x_i | µ_l^{(p)}, Σ_l^{(p)})}.

Classification:

y_i^{(p+1)} = \arg\max_k p_{i,k}.

Or equivalently, let \hat{p}_{i,k} = 1 if k = \arg\max_l p_{i,l} and 0 otherwise.

M-step:

a_k^{(p+1)} = \frac{\sum_{i=1}^{n} \hat{p}_{i,k}}{n} = \frac{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)}{n},

µ_k^{(p+1)} = \frac{\sum_{i=1}^{n} \hat{p}_{i,k} x_i}{\sum_{i=1}^{n} \hat{p}_{i,k}} = \frac{\sum_{i=1}^{n} I(y_i^{(p+1)} = k) x_i}{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)},

Σ_k^{(p+1)} = \frac{\sum_{i=1}^{n} \hat{p}_{i,k} (x_i - µ_k^{(p+1)})(x_i - µ_k^{(p+1)})^t}{\sum_{i=1}^{n} \hat{p}_{i,k}} = \frac{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)(x_i - µ_k^{(p+1)})(x_i - µ_k^{(p+1)})^t}{\sum_{i=1}^{n} I(y_i^{(p+1)} = k)}.

Repeat the above three steps until convergence.
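As a rough code-level view (an assumption-laden sketch reusing the names from the em_gmm sketch above), CEM only adds a hardening step between the E-step and the M-step:

import numpy as np

def harden(p):
    """Classification step: replace each row of posteriors p[i, :] with a one-hot
    vector at its arg max (the hat-p of the slides)."""
    y = p.argmax(axis=1)                 # y_i^{(p+1)} = arg max_k p_{i,k}
    p_hat = np.zeros_like(p)
    p_hat[np.arange(len(p)), y] = 1.0
    return p_hat, y

# Inside the EM iteration sketched earlier, CEM would insert
#     p_hat, y = harden(p)
# right after the E-step and then apply the same M-step formulas with p_hat in place of p.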

Comment: CEM tends to underestimate the variances. It usually converges much faster than EM. For the purpose of clustering, it is generally believed to perform similarly to EM. If we assume equal priors a_k and the covariance matrices Σ_k are identical and are a scalar matrix, CEM is exactly k-means. (Exercise)