Sparse Coding: An Overview


1 Sparse Coding: An Overview
Brian Booth, SFU Machine Learning Reading Group, November 12, 2013

2-4 The aim of sparse coding
Every column of D is a prototype. The decomposition is similar to, but more general than, PCA.

5 Example: Sparse Coding of Images

6 Sparse Coding in V1

7 Example: Image Denoising

8 Example: Image Restoration

9-10 Sparse Coding and Acoustics
The inner ear (cochlea) also performs a sparse coding of frequencies.

11 Sparse Coding and Natural Language Processing

12-16 Outline
1 Introduction: Why Sparse Coding?
2 Sparse Coding: The Basics
3 Adding Prior Knowledge
4 Conclusions

17-19 The aim of sparse coding, revisited
We assume our data $x$ satisfies
$$x \approx \sum_{i=1}^{n} \alpha_i d_i = D\alpha.$$
Learning: given training data $x^j$, $j \in \{1, \dots, m\}$, learn the dictionary $D$ and the sparse codes $\alpha^j$.
Encoding: given test data $x$ and the dictionary $D$, learn the sparse code $\alpha$.
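As a concrete illustration of this model, here is a minimal NumPy sketch (the dimensions, seed, and sparsity level are arbitrary choices for the example): a signal is synthesized as a sparse combination of the columns of an over-complete dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)

k, n = 64, 256                    # signal dimension k, number of atoms n (over-complete: n > k)
D = rng.standard_normal((k, n))
D /= np.linalg.norm(D, axis=0)    # normalize each column (prototype) of D

# A sparse code: only a few atoms are active.
alpha = np.zeros(n)
support = rng.choice(n, size=5, replace=False)
alpha[support] = rng.standard_normal(5)

x = D @ alpha                     # x is a sparse combination of the columns of D
print(np.count_nonzero(alpha), "active atoms out of", n)
```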

20-24 Learning: The Objective Function
Dictionary learning involves optimizing
$$\arg\min_{\{d_i\},\{\alpha^j\}} \sum_{j=1}^{m} \Big\| x^j - \sum_{i=1}^{n} \alpha_i^j d_i \Big\|^2 + \beta \sum_{j=1}^{m} \sum_{i=1}^{n} |\alpha_i^j|$$
subject to $\|d_i\|^2 \le c$, $i = 1, \dots, n$.
In matrix notation:
$$\arg\min_{D,A} \|X - DA\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}| \quad \text{subject to } \sum_i D_{i,j}^2 \le c, \; j = 1, \dots, n.$$
Split the optimization over $D$ and $A$ into two alternating steps.
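This split suggests the alternating scheme sketched below. It is schematic only: `update_dictionary` and `update_codes` are placeholder names for the two sub-problem solvers derived on the following slides.

```python
import numpy as np

def sparse_coding(X, n_atoms, n_iters, update_dictionary, update_codes, seed=0):
    """Alternate between the two sub-problems of the objective above.

    X: k x m data matrix (one column per training example).
    update_dictionary(X, A): solves the constrained least-squares step for D.
    update_codes(X, D): solves the l1-regularized step for A.
    """
    rng = np.random.default_rng(seed)
    k, m = X.shape
    D = rng.standard_normal((k, n_atoms))
    D /= np.linalg.norm(D, axis=0)      # start feasible: ||d_i||^2 <= c with c = 1
    A = np.zeros((n_atoms, m))
    for _ in range(n_iters):
        A = update_codes(X, D)          # Step 2: sparse codes, D fixed
        D = update_dictionary(X, A)     # Step 1: dictionary, A fixed
    return D, A
```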

25-27 Step 1: Learning the Dictionary
Reduced optimization problem:
$$\arg\min_{D} \|X - DA\|_F^2 \quad \text{subject to } \sum_i D_{i,j}^2 \le c, \; j = 1, \dots, n.$$
Introduce Lagrange multipliers:
$$\mathcal{L}(D, \lambda) = \mathrm{tr}\!\left( (X - DA)^\top (X - DA) \right) + \sum_{j=1}^{n} \lambda_j \Big( \sum_i D_{i,j}^2 - c \Big),$$
where each $\lambda_j \ge 0$ is a dual variable.

28-32 Step 1: Moving to the dual
Minimizing the Lagrangian over $D$ yields the Lagrange dual
$$\mathcal{D}(\lambda) = \min_D \mathcal{L}(D, \lambda) = \mathrm{tr}\!\left( X^\top X - X A^\top \left( A A^\top + \Lambda \right)^{-1} (X A^\top)^\top - c\,\Lambda \right),$$
where $\Lambda = \mathrm{diag}(\lambda)$. The dual can be optimized using conjugate gradient, and it has only $n$ variables $\lambda_j$, compared to the $n \times k$ entries of $D$.

33-34 Step 1: Dual to the Dictionary
With the optimal $\Lambda$, our dictionary is
$$D^\top = \left( A A^\top + \Lambda \right)^{-1} (X A^\top)^\top.$$
Key point: moving to the dual reduces the number of optimization variables, speeding up the optimization.
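In code, recovering $D$ from a dual vector is a single linear solve. A minimal NumPy sketch of the two formulas above (the outer maximization of the dual over $\lambda \ge 0$, e.g. by conjugate gradient, is left to a generic solver and not shown):

```python
import numpy as np

def dictionary_from_dual(X, A, lam):
    """D^T = (A A^T + Lambda)^{-1} (X A^T)^T for a given dual vector lam (>= 0)."""
    Lam = np.diag(lam)
    return np.linalg.solve(A @ A.T + Lam, (X @ A.T).T).T

def dual_value(X, A, lam, c):
    """Lagrange dual D(lambda), up to the constant tr(X^T X)."""
    Lam = np.diag(lam)
    M = X @ A.T                         # k x n
    return -np.trace(M @ np.linalg.solve(A @ A.T + Lam, M.T)) - c * lam.sum()
```

Maximizing `dual_value` over the $n$ entries of `lam` (subject to `lam >= 0`) is exactly the variable-count reduction the slides point to.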

35-38 Step 2: Learning the Sparse Code
With $D$ now fixed, optimize for $A$:
$$\arg\min_{A} \|X - DA\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}|.$$
This is an unconstrained convex problem ($\ell_1$-regularized least squares), and many solvers exist (e.g. interior-point methods, the in-crowd algorithm, fixed-point continuation).
Note: this is the same problem as the encoding problem, which raises the question of the optimization's runtime in the encoding stage.
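For concreteness, here is a minimal ISTA (proximal gradient) sketch for this step. ISTA is not among the solvers the slide names; it is simply a compact alternative, with the step size set from the Lipschitz constant and the iteration count an arbitrary choice:

```python
import numpy as np

def soft_threshold(Z, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def update_codes_ista(X, D, beta, n_iters=200):
    """argmin_A ||X - D A||_F^2 + beta * sum_ij |A_ij| via proximal gradient."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2     # Lipschitz constant of the smooth part
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iters):
        grad = 2.0 * D.T @ (D @ A - X)      # gradient of ||X - D A||_F^2
        A = soft_threshold(A - grad / L, beta / L)
    return A
```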

39 Speeding up the testing phase
There has been a fair amount of work on speeding up the encoding stage:
H. Lee et al., Efficient sparse coding algorithms, NIPS 2006.
K. Gregor and Y. LeCun, Learning Fast Approximations of Sparse Coding, ICML 2010.
S. Hawe et al., Separable Dictionary Learning.
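Gregor and LeCun's approach (LISTA) replaces the iterative encoder with a fixed-depth network whose matrices are trained offline to mimic the exact sparse codes. A forward-pass sketch of that architecture; the training loop is omitted, and the parameter names `W`, `S`, `theta` are mine:

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def lista_encode(x, W, S, theta, depth=3):
    """Fixed-depth approximate sparse encoder (forward pass only).

    W: n x k filter matrix, S: n x n mutual-inhibition matrix, theta: threshold.
    All three are learned offline to approximate exact sparse codes, which is
    what makes encoding fast at test time.
    """
    b = W @ x
    z = soft_threshold(b, theta)
    for _ in range(depth):
        z = soft_threshold(b + S @ z, theta)
    return z
```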

40-41 Outline
1 Introduction: Why Sparse Coding?
2 Sparse Coding: The Basics
3 Adding Prior Knowledge
4 Conclusions

42-44 Relationships between Dictionary atoms
Dictionaries are over-complete bases, so we can dictate relationships between atoms. Example: hierarchical dictionaries.

45 Example: Image Patches

46 Example: Document Topics

47-49 Problem Statement
Goal: have sub-groups of the sparse code $\alpha$ be all non-zero (or all zero).
Hierarchical constraints: if a node is non-zero, its parent must be non-zero; if a node's parent is zero, the node must be zero.
Implementation: change the regularization to enforce sparsity differently.

50-51 Grouping Code Entries
A code entry at level $k$ of the hierarchy is included in $k + 1$ groups, and each $\alpha_i$ is added to the objective function once for each group that contains it.
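One standard way to realize this grouping (as in the hierarchical-sparsity literature): each group consists of a node together with all of its descendants, so a node at depth $k$ lies in the $k + 1$ groups rooted at itself and at each of its ancestors. A small sketch, with the tree encoded as a parent array (a convention chosen here for the example):

```python
def hierarchical_groups(parent):
    """parent[i] is the parent of node i (parent[root] = -1).

    Returns one group per node: the node plus all of its descendants.
    A node at depth k then appears in exactly k + 1 groups.
    """
    n = len(parent)
    groups = [{i} for i in range(n)]
    for i in range(n):
        j = parent[i]
        while j != -1:              # walk up to the root, adding i to each ancestor's group
            groups[j].add(i)
            j = parent[j]
    return [sorted(g) for g in groups]

# Tiny example: root 0 with children 1 and 2; node 3 is a child of 1.
print(hierarchical_groups([-1, 0, 0, 1]))
# [[0, 1, 2, 3], [1, 3], [2], [3]]  -> node 3 (depth 2) lies in 3 groups
```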

52-57 Group Regularization
Updated objective function:
$$\arg\min_{D, \{\alpha^j\}} \sum_{j=1}^{m} \Big[ \|x^j - D\alpha^j\|^2 + \beta\, \Omega(\alpha^j) \Big],$$
where
$$\Omega(\alpha) = \sum_{g \in \mathcal{P}} w_g \|\alpha_g\|.$$
Here $\alpha_g$ are the code values for group $g$, and $w_g$ weights the enforcement of the hierarchy. Solve using proximal methods.
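A proximal method needs the proximal operator of $\Omega$, which for a single group is block soft-thresholding. A minimal sketch: it is exact for non-overlapping groups, and for tree-structured overlapping groups, applying the group-wise steps from the leaves up to the root recovers the exact prox (a result due to Jenatton et al.):

```python
import numpy as np

def prox_group(alpha, groups, weights, t):
    """Proximal step for Omega(alpha) = sum_g w_g ||alpha_g||_2.

    groups: list of index lists; weights: the w_g; t: prox step size.
    For a hierarchy, order the groups from the leaves up to the root.
    """
    out = alpha.copy()
    for g, w in zip(groups, weights):
        idx = np.asarray(g)
        norm = np.linalg.norm(out[idx])
        scale = max(0.0, 1.0 - t * w / norm) if norm > 0 else 0.0
        out[idx] *= scale       # shrink the whole group toward zero together
    return out
```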

58 Other Examples
Other examples of structured sparsity:
M. Stojnic et al., On the Reconstruction of Block-Sparse Signals With an Optimal Number of Measurements.
J. Mairal et al., Convex and Network Flow Optimization for Structured Sparsity, JMLR 12, 2011.

59-60 Outline
1 Introduction: Why Sparse Coding?
2 Sparse Coding: The Basics
3 Adding Prior Knowledge
4 Conclusions

61-64 Summary
Two interesting directions: increasing the speed of the testing phase, and optimizing dictionary structure.
