Information theory and machine learning. I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means. II: Information Bottleneck Method.


Lionel Lyons

1 Information theory and machine learning I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means. II: Information Bottleneck Method.

2 Lossy Compression Summarize data by keeping only relevant information and throwing away irrelevant information. Need a measure for information -> Shannon's mutual information. Need a notion of relevance.

3 Relevance (1) Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation. Note: This is related to the similarity measure in unsupervised learning/cluster analysis.

4 Distortion function Degree of freedom: which function to use is up to the experimenter. It is not always obvious what function should be used, especially if the data do not live in a metric space, and so there is no natural measure. Example: Speech.

5 Relevance (2) Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon's mutual information by defining: relevant information = the information about a variable of interest. Then, there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.

6 Learning and lossy data compression When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way. Example: K-means. Map N data points to K clusters, with centroids $x_c$. If K << N, then we get a substantial compression: log(K) << log(N). Recall: for N equally likely data points, the entropy is H[X] ~ log(N).
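The compression claim can be checked numerically. The sketch below is only an illustration: the toy data set, the cluster labels and the helper name `entropy_bits` are made up for this example. It computes the empirical entropy of the raw points and of their hard cluster labels.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# Hypothetical toy data: N = 8 distinct points, hard-assigned to K = 2 clusters.
points = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

bits_raw = entropy_bits(points)           # log2(N) bits to name one data point
bits_compressed = entropy_bits(clusters)  # log2(K) bits to name one cluster
print(bits_raw, bits_compressed)
```

With equally populated clusters the label entropy equals log2(K); unequal cluster sizes would make it smaller still.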

7 How efficient is the representation? Entropy can be used to measure the compactness of the model. Sometimes called statistical complexity (J. P. Crutchfield). This is a good measure of complexity if we are searching for a deterministic map (in clustering called a "hard partition"). In general, we may search over probabilistic assignments (clustering: "soft partition").

8 Trade-off between compression and accuracy Think about a continuous variable. To describe it exactly, you need infinitely many bits, but only finitely many bits to describe it up to some finite accuracy.

9 Rate distortion theory used for clustering Find assignments $p(c|x)$ from data $x \in X$ to clusters $c = 1, \dots, K$, and find class representatives $x_c$ ("cluster centers"), such that the average distortion $D = \langle d(x, x_c) \rangle_{p(x,c)}$ is small. Compress the data by summarizing it efficiently in terms of bit-cost: minimize the coding rate, the information which the clusters retain about the raw data, $I(x, c) = \left\langle \log \frac{p(x, c)}{p(x)\, p(c)} \right\rangle_{p(x,c)}$.
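A minimal sketch of how the two quantities on this slide could be evaluated for a given set of assignments. It assumes 1-D data, a uniform p(x) = 1/N over the data points, and squared-error distortion; the function name `rate_and_distortion` and the toy numbers are illustrative choices, not part of the lecture.

```python
import math

def rate_and_distortion(xs, centers, p_c_given_x):
    """Return (I, D): mutual information I(x;c) in bits and average distortion.

    Assumes uniform p(x) = 1/N and d(x, x_c) = (x - x_c)^2 / 2;
    p_c_given_x[i][c] is p(c | x_i).
    """
    n, k = len(xs), len(centers)
    p_x = 1.0 / n
    # Cluster marginal p(c) = sum_x p(x) p(c|x)
    p_c = [sum(p_x * p_c_given_x[i][c] for i in range(n)) for c in range(k)]
    rate, distortion = 0.0, 0.0
    for i, x in enumerate(xs):
        for c in range(k):
            p_joint = p_x * p_c_given_x[i][c]
            if p_joint > 0.0:  # terms with zero probability contribute nothing
                rate += p_joint * math.log2(p_c_given_x[i][c] / p_c[c])
                distortion += p_joint * 0.5 * (x - centers[c]) ** 2
    return rate, distortion

xs = [0.0, 1.0, 10.0, 11.0]
centers = [0.5, 10.5]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # each point in one cluster
blind = [[0.5, 0.5]] * 4                                  # assignments ignore x

rate_hard, dist_hard = rate_and_distortion(xs, centers, hard)
rate_blind, dist_blind = rate_and_distortion(xs, centers, blind)
print(rate_hard, dist_hard, rate_blind, dist_blind)
```

The two assignment tables illustrate the trade-off: the hard partition costs more bits but has low distortion, while assignments that ignore x cost zero bits and have high distortion.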

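10 Constrained optimization $\min_{p(c|x),\, x_c} \left[ I(x, c) + \beta \langle d(x, x_c) \rangle \right]$. Solution: $p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$, with $x_c$ from $\left\langle \frac{d}{dx_c} d(x, x_c) \right\rangle_{p(x|c)} = 0$ (centroid condition).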
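The solution can be obtained by a short variational calculation. The steps below are a sketch, assuming natural logarithms and Lagrange multipliers $\mu(x)$ (a symbol introduced here, not on the slide) that enforce the normalization $\sum_c p(c|x) = 1$.

```latex
\begin{align*}
F &= \sum_{x,c} p(x)\,p(c|x)\,\log\frac{p(c|x)}{p(c)}
   + \beta \sum_{x,c} p(x)\,p(c|x)\,d(x,x_c)
   + \sum_x \mu(x) \sum_c p(c|x), \\
0 = \frac{\partial F}{\partial p(c|x)}
  &= p(x)\left[\log\frac{p(c|x)}{p(c)} + \beta\, d(x,x_c)\right] + \mu(x)
  \quad\Longrightarrow\quad
  p(c|x) = \frac{p(c)}{Z(x,\beta)}\, e^{-\beta d(x,x_c)}, \\
Z(x,\beta) &= \sum_{c} p(c)\, e^{-\beta d(x,x_c)}, \\
0 = \frac{\partial F}{\partial x_c}
  &= \beta \sum_x p(x)\,p(c|x)\,\frac{d}{dx_c} d(x,x_c)
  \quad\Longrightarrow\quad
  \left\langle \frac{d}{dx_c} d(x,x_c) \right\rangle_{p(x|c)} = 0 .
\end{align*}
```

In the first derivative, the contribution from the dependence of $p(c) = \sum_x p(x)\,p(c|x)$ on $p(c|x)$ cancels against the $+1$ from differentiating $p \log p$, which is why only the bracketed term remains. The last step divides by $\beta\, p(c)$ and uses $p(x|c) = p(x)\,p(c|x)/p(c)$.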

11 Rate distortion curve Family of optimal solutions, one for each value of the Lagrange multiplier $\beta$. This parameter controls the trade-off between compression and fidelity: $p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$. Evaluate the objective function at the optimum for each value of $\beta$ and plot I vs. D. => Rate-distortion curve.

12 Remarks For squared error distortion, $d = (x - x_c)^2 / 2$, the centroid condition reduces to $x_c = \langle x \rangle_{p(x|c)}$. "Soft K-means", because assignments can be probabilistic (fuzzy): $p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$.

13 Remarks In the zero temperature limit, $\beta \to \infty$, we have deterministic (or "hard") assignments. Define $c^* := \arg\min_c d(x, x_c)$ and $D(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0$ for $c \neq c^*$. Then $p(c^*|x) = \frac{\exp(-\beta d(x, x_{c^*}))}{\sum_c \exp(-\beta d(x, x_c))} = \left( 1 + \sum_{c \neq c^*} \exp(-\beta D(x, c)) \right)^{-1} \to 1$. The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).
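The hardening of the assignments can be seen numerically. The following sketch assumes 1-D data and squared-error distortion, and uses the equal-weight form of this slide (no p(c) prefactor); the function name `assignments` and the example values are hypothetical.

```python
import math

def assignments(x, centers, beta):
    """p(c|x) proportional to exp(-beta * d(x, x_c)), with d = (x - x_c)^2 / 2."""
    d = [0.5 * (x - xc) ** 2 for xc in centers]
    d_min = min(d)
    # Weights exp(-beta * D(x, c)); subtracting d_min cancels in the ratio
    # and keeps the nearest cluster's weight at exactly 1.
    w = [math.exp(-beta * (dc - d_min)) for dc in d]
    z = sum(w)
    return [wc / z for wc in w]

centers = [0.0, 1.0, 3.0]
x = 0.4
for beta in (0.01, 1.0, 100.0, 1000.0):
    print(beta, [round(p, 3) for p in assignments(x, centers, beta)])

p_cold = assignments(x, centers, 1000.0)  # large beta: nearly hard assignment
p_hot = assignments(x, centers, 0.0)      # beta = 0: uniform over clusters
```

As beta grows, the probability mass concentrates on the nearest center (here the one at 0.0); at beta = 0 every cluster is equally likely.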

14 Soft K-means algorithm Choose a distortion measure. Fix the temperature, T, to a very large value (corresponds to small $\beta = 1/T$). Solve iteratively, until convergence: Assignments: $p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]$. Centroids: $\left\langle \frac{d}{dx_c} d(x, x_c) \right\rangle_{p(x|c)} = 0$. Lower the temperature: T <- aT, and repeat. a = annealing rate, a positive number smaller than 1.
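The steps above can be sketched as follows. This is one possible minimal version, not a reference implementation: it assumes 1-D data, uniform p(x), and squared-error distortion (so the centroid condition becomes a weighted mean). The parameter names and default values (`t_start`, `t_stop`, `a`, `n_inner`, `seed`) and the small per-stage jitter are illustrative choices, and there is no guard against empty clusters.

```python
import math
import random

def soft_kmeans(xs, k, t_start=100.0, t_stop=0.01, a=0.8, n_inner=50, seed=0):
    """Annealed soft K-means for 1-D data with d(x, x_c) = (x - x_c)^2 / 2."""
    rng = random.Random(seed)
    n = len(xs)
    centers = [sum(xs) / n] * k   # at high T all centroids sit at the data mean
    p_c = [1.0 / k] * k
    t = t_start
    while t > t_stop:
        beta = 1.0 / t
        # Tiny jitter so coinciding centroids can separate once T is low enough.
        centers = [xc + 1e-6 * rng.uniform(-1.0, 1.0) for xc in centers]
        for _ in range(n_inner):
            # Assignments: p(c|x) = p(c) exp(-beta d(x, x_c)) / Z(x, beta)
            resp = []
            for x in xs:
                d = [0.5 * (x - xc) ** 2 for xc in centers]
                d_min = min(d)  # cancels in the ratio, avoids underflow of all weights
                w = [p_c[c] * math.exp(-beta * (d[c] - d_min)) for c in range(k)]
                z = sum(w)
                resp.append([wc / z for wc in w])
            # Centroids: x_c = <x>_{p(x|c)}, and marginals p(c)
            mass = [sum(r[c] for r in resp) for c in range(k)]
            new_centers = [sum(r[c] * x for r, x in zip(resp, xs)) / mass[c]
                           for c in range(k)]
            p_c = [m / n for m in mass]
            shift = max(abs(nc - oc) for nc, oc in zip(new_centers, centers))
            centers = new_centers
            if shift < 1e-9:
                break
        t *= a  # lower the temperature: T <- aT with 0 < a < 1
    return sorted(centers)

xs = [-0.2, 0.0, 0.2, 9.8, 10.0, 10.2]  # hypothetical data: two well-separated groups
centers = soft_kmeans(xs, k=2)
print(centers)
```

Starting all centroids at the data mean reflects the high-temperature solution, where every point is assigned to every cluster with equal probability; the clusters only split once the temperature drops far enough.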

15 Rate-distortion curve [Figure: bit rate versus distortion; the rate-distortion curve separates the feasible region from the infeasible region, with solutions for K = 2, K = 3, etc. marked.]

16 How to choose the distortion function? Example: Speech. Cluster speech such that signals which encode the same word will group together. It may be extremely difficult to define a distortion function which achieves this. Intelligibility criterion? It should measure how well the meaning is preserved.


17 Information Bottleneck Method Tishby, Pereira, Bialek, 1999. Instead of guessing a distortion function, define relevant information as information the data carries about a quantity of interest (example: phonemes or words). Data is compressed such that relevant information is kept maximally. [Diagram: X is compressed into C by minimizing I(x,c), while C is kept informative about Y by maximizing I(c,y).]

18 Constrained optimization $\max_{p(c|x)} \left[ I(y, c) - \lambda I(x, c) \right]$. Optimal assignment rule?

19 Constrained optimization $\max_{p(c|x)} \left[ I(y, c) - \lambda I(x, c) \right]$. Optimal assignment rule: $p(c|x) = \frac{p(c)}{Z(x, \lambda)} \exp\left( -\frac{1}{\lambda} D_{KL}[p(y|x) \,\|\, p(y|c)] \right)$. The Kullback-Leibler divergence emerges as the distortion function: $D_{KL}[p(y|x) \,\|\, p(y|c)] = \sum_y p(y|x) \log_2 \left[ \frac{p(y|x)}{p(y|c)} \right]$.
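A rough sketch of how the assignment rule might be iterated in practice. It alternates the rule above with the consistency relations $p(c) = \sum_x p(x)\, p(c|x)$ and $p(y|c) = \sum_x p(y|x)\, p(x|c)$, which are not written on the slide but follow from the definitions. The function names, the toy distributions, the random initialization and the fixed iteration count are assumptions made for illustration, and there is no guard against a cluster losing all of its mass.

```python
import math
import random

def kl_bits(p, q):
    """D_KL[p || q] in bits; infinite if q has no mass where p does."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def information_bottleneck(p_x, p_y_given_x, k, lam, n_iter=200, seed=0):
    """Iterate the assignment rule; returns p(c|x) as a list of rows, one per x."""
    rng = random.Random(seed)
    n, m = len(p_x), len(p_y_given_x[0])
    # Random soft start; a perfectly uniform start would never leave the
    # trivial solution p(c|x) = p(c).
    p_c_given_x = []
    for _ in range(n):
        w = [rng.random() + 0.1 for _ in range(k)]
        p_c_given_x.append([wi / sum(w) for wi in w])
    for _ in range(n_iter):
        p_c = [sum(p_x[i] * p_c_given_x[i][c] for i in range(n)) for c in range(k)]
        p_y_given_c = [
            [sum(p_x[i] * p_c_given_x[i][c] * p_y_given_x[i][y] for i in range(n)) / p_c[c]
             for y in range(m)]
            for c in range(k)
        ]
        for i in range(n):
            # p(c|x) = p(c) exp(-D_KL[p(y|x) || p(y|c)] / lambda) / Z(x, lambda)
            w = [p_c[c] * math.exp(-kl_bits(p_y_given_x[i], p_y_given_c[c]) / lam)
                 for c in range(k)]
            z = sum(w)
            p_c_given_x[i] = [wc / z for wc in w]
    return p_c_given_x

# Hypothetical toy problem: four x values, two of which predict y = 0
# and two of which predict y = 1.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
q = information_bottleneck(p_x, p_y_given_x, k=2, lam=0.1)
labels = [max(range(2), key=lambda c: row[c]) for row in q]
print(labels)
```

No distance between x values is ever specified here: points are grouped purely because they carry similar information about y, which is the sense in which the similarity measure "arises naturally".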

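20 Information Plane [Figure: information plane, I(c,y) versus I(c,x); the optimal curve bounds the infeasible region, with solutions for K = 2, K = 3, K = 4, etc. marked.]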

21 Homework Implement Soft K-means algorithm. Helpful reading (on course website): K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 80, pp , November 1998.
