Sparse Coding: An Overview


Brian Booth, SFU Machine Learning Reading Group, November 12, 2013

The aim of sparse coding: every column of D is a prototype. Similar to, but more general than, PCA.

Example: Sparse Coding of Images

Sparse Coding in V1


Example: Image Restoration

Sparse Coding and Acoustics: the inner ear (cochlea) also does sparse coding of frequencies.

Sparse Coding and Natural Language Processing

Outline: 1. Introduction: Why Sparse Coding? 2. Sparse Coding: The Basics 3. Adding Prior Knowledge 4. Conclusions

The aim of sparse coding, revisited. We assume our data $x$ satisfies

$$x \approx \sum_{i=1}^{n} \alpha_i d_i = D\alpha.$$

Learning: given training data $x^j$, $j \in \{1, \dots, m\}$, learn the dictionary $D$ and the sparse codes $\alpha^j$. Encoding: given test data $x$ and the dictionary $D$, learn the sparse code $\alpha$.
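To fix notation, here is a minimal NumPy sketch of the model; the dimensions and names are illustrative assumptions, not from the talk. A signal is approximated as a combination of a few dictionary columns.

```python
import numpy as np

# A minimal illustration of the model x ≈ D alpha (sizes are assumptions).
k, n = 64, 256                                   # signal dimension, number of atoms
rng = np.random.default_rng(0)

D = rng.standard_normal((k, n))                  # each column of D is a prototype (atom) d_i
alpha = np.zeros(n)                              # sparse code: mostly zeros
support = rng.choice(n, size=5, replace=False)
alpha[support] = rng.standard_normal(5)

x = D @ alpha                                    # x ≈ sum_i alpha_i d_i = D alpha
print("active atoms:", np.count_nonzero(alpha), "| signal shape:", x.shape)
```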

Learning: The Objective Function. Dictionary learning involves optimizing:

$$\arg\min_{\{d_i\},\{\alpha^j\}} \sum_{j=1}^{m} \Big\| x^j - \sum_{i=1}^{n} \alpha_i^j d_i \Big\|^2 + \beta \sum_{j=1}^{m} \sum_{i=1}^{n} |\alpha_i^j| \quad \text{subject to } \|d_i\|^2 \le c,\ i = 1, \dots, n.$$

In matrix notation:

$$\arg\min_{D, A} \|X - DA\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}| \quad \text{subject to } \sum_i D_{i,j}^2 \le c,\ j = 1, \dots, n.$$

Split the optimization over D and A in two.
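Before deriving each step, here is a minimal, self-contained NumPy sketch of the alternation itself, just to fix ideas. It is an illustrative baseline rather than the talk's algorithm: the code step uses plain proximal-gradient (soft-thresholding) iterations, and the dictionary step simply solves unconstrained least squares and rescales columns instead of the Lagrange-dual update derived next.

```python
import numpy as np

def soft_threshold(Z, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_coding(X, n_atoms, beta, c=1.0, n_outer=20, n_inner=50, seed=0):
    """Alternate between codes A and dictionary D for X ≈ D A (data in the columns of X)."""
    rng = np.random.default_rng(seed)
    k, m = X.shape
    D = rng.standard_normal((k, n_atoms))
    D *= np.sqrt(c) / np.linalg.norm(D, axis=0, keepdims=True)   # enforce ||d_i||^2 <= c
    A = np.zeros((n_atoms, m))
    for _ in range(n_outer):
        # Code step: proximal-gradient iterations on ||X - D A||_F^2 + beta * sum |A_{i,j}|.
        L = 2.0 * max(np.linalg.norm(D.T @ D, 2), 1e-12)
        for _ in range(n_inner):
            A = soft_threshold(A - 2.0 * D.T @ (D @ A - X) / L, beta / L)
        # Dictionary step (naive stand-in for the dual update): least squares, then rescale columns.
        D = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
        norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
        D *= np.minimum(1.0, np.sqrt(c) / norms)
    return D, A

# Illustrative usage on random data:
D, A = sparse_coding(np.random.default_rng(1).standard_normal((16, 200)), n_atoms=32, beta=0.5)
```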

Step 1: Learning the Dictionary. Reduced optimization problem:

$$\arg\min_{D} \|X - DA\|_F^2 \quad \text{subject to } \sum_i D_{i,j}^2 \le c,\ j = 1, \dots, n.$$

Introduce Lagrange multipliers:

$$\mathcal{L}(D, \lambda) = \mathrm{tr}\big((X - DA)^T (X - DA)\big) + \sum_{j=1}^{n} \lambda_j \Big( \sum_i D_{i,j}^2 - c \Big),$$

where each $\lambda_j \ge 0$ is a dual variable.

Step 1: Moving to the dual. From the Lagrangian

$$\mathcal{L}(D, \lambda) = \mathrm{tr}\big((X - DA)^T (X - DA)\big) + \sum_{j=1}^{n} \lambda_j \Big( \sum_i D_{i,j}^2 - c \Big),$$

minimize over D to obtain the Lagrange dual

$$\mathcal{D}(\lambda) = \min_{D} \mathcal{L}(D, \lambda) = \mathrm{tr}\Big( X^T X - X A^T \big(A A^T + \Lambda\big)^{-1} \big(X A^T\big)^T - c\Lambda \Big),$$

where $\Lambda = \mathrm{diag}(\lambda)$. The dual can be optimized using conjugate gradient. There are only $n$ values of $\lambda$, compared to the $n \times k$ entries of D.

Step 1: Dual to the Dictionary. With the optimal $\Lambda$, our dictionary is

$$D^T = \big(A A^T + \Lambda\big)^{-1} \big(X A^T\big)^T.$$

Key point: moving to the dual reduces the number of optimization variables, speeding up the optimization.
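Here is a sketch, under the X ≈ DA convention used above, of maximizing the dual and then recovering D via the closed form on this slide. It is illustrative only: it uses SciPy's L-BFGS-B with a non-negativity bound rather than the conjugate-gradient solver mentioned in the talk, and gradients are taken numerically.

```python
import numpy as np
from scipy.optimize import minimize

def neg_dual(lam, X, A, c):
    """Negative Lagrange dual -D(lambda); minimized here so that the dual is maximized."""
    Lam = np.diag(lam)
    XAt = X @ A.T                                     # X A^T, shape (k, n)
    inner = np.linalg.solve(A @ A.T + Lam, XAt.T)     # (A A^T + Lambda)^{-1} (X A^T)^T
    return -(np.sum(X * X) - np.trace(XAt @ inner) - c * np.sum(lam))  # np.sum(X*X) = tr(X^T X)

def dictionary_from_codes(X, A, c=1.0):
    """Maximize the dual over lambda >= 0, then recover D^T = (A A^T + Lambda)^{-1} (X A^T)^T."""
    n = A.shape[0]
    res = minimize(neg_dual, x0=np.ones(n), args=(X, A, c),
                   method="L-BFGS-B", bounds=[(1e-8, None)] * n)
    Lam = np.diag(res.x)
    return np.linalg.solve(A @ A.T + Lam, (X @ A.T).T).T

# Illustrative shapes: data in the columns of X (k x m), codes in the columns of A (n x m).
rng = np.random.default_rng(1)
X, A = rng.standard_normal((16, 200)), rng.standard_normal((8, 200))
D = dictionary_from_codes(X, A)
print(D.shape, float((np.linalg.norm(D, axis=0) ** 2).max()))
```

The dual has only n variables (one λ per atom), which is what makes this reformulation cheaper than optimizing all entries of D directly.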

Step 2: Learning the Sparse Code. With D now fixed, optimize for A:

$$\arg\min_{A} \|X - DA\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}|$$

Note: this is an unconstrained, convex problem (an $\ell_1$-regularized least-squares problem), and many solvers exist for it (e.g. interior point methods, the in-crowd algorithm, fixed-point continuation). It is the same problem as the encoding problem, which raises the question: what is the runtime of the optimization in the encoding stage?
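Since the problem separates over the columns of X, each column's code is an ℓ1-regularized least-squares fit. As one concrete, off-the-shelf option (an illustration, not the solver used in the talk), scikit-learn's Lasso can be used; note that sklearn scales the quadratic term by 1/(2·n_samples), so its alpha corresponds to β only up to that factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

def encode(x, D, beta):
    """Sparse code for one signal x given a fixed dictionary D (columns of D are atoms)."""
    k = D.shape[0]
    # sklearn's Lasso minimizes (1/(2k)) * ||x - D a||^2 + alpha * ||a||_1,
    # so alpha = beta / (2k) matches ||x - D a||^2 + beta * ||a||_1.
    model = Lasso(alpha=beta / (2 * k), fit_intercept=False, max_iter=10000)
    model.fit(D, x)
    return model.coef_

# Illustrative usage with a random dictionary and signal.
rng = np.random.default_rng(2)
D = rng.standard_normal((64, 256))
x = rng.standard_normal(64)
alpha = encode(x, D, beta=1.0)
print("non-zeros in the code:", np.count_nonzero(alpha))
```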

Speeding up the testing phase. There is a fair amount of work on speeding up the encoding stage:
H. Lee et al., Efficient sparse coding algorithms, http://ai.stanford.edu/~hllee/nips06-sparsecoding.pdf
K. Gregor and Y. LeCun, Learning Fast Approximations of Sparse Coding, http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf
S. Hawe et al., Separable Dictionary Learning, http://arxiv.org/pdf/1303.5244v1.pdf


Relationships between Dictionary atoms. Dictionaries are over-complete bases, so we can dictate relationships between atoms. Example: hierarchical dictionaries.

Example: Image Patches

Example: Document Topics

Problem Statement. Goal: have sub-groups of the sparse code α be all non-zero (or all zero). Hierarchical case: if a node is non-zero, its parent must be non-zero; if a node's parent is zero, the node must be zero. Implementation: change the regularization to enforce sparsity differently.

Grouping Code Entries. A code entry at level k is included in k + 1 groups, and each $\alpha_i$ enters the objective function once for every group that contains it.

Group Regularization. Updated objective function:

$$\arg\min_{D, \{\alpha^j\}} \sum_{j=1}^{m} \Big[ \|x^j - D\alpha^j\|^2 + \beta\, \Omega(\alpha^j) \Big]$$

where

$$\Omega(\alpha) = \sum_{g \in \mathcal{P}} w_g \|\alpha_g\|,$$

$\alpha_g$ are the code values for group $g$, and $w_g$ weights the enforcement of the hierarchy. Solve using proximal methods.
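For non-overlapping groups with the ℓ2 norm on each group, the proximal operator these methods apply at every iteration is block soft-thresholding. Below is a small NumPy sketch under that assumption; the group structure, weights, and step size t are illustrative, and hierarchical or overlapping groups need the more general machinery in the references that follow.

```python
import numpy as np

def prox_group_l2(alpha, groups, weights, t):
    """Block soft-thresholding: prox of t * sum_g w_g * ||alpha_g||_2 (non-overlapping groups)."""
    out = alpha.copy()
    for g, w in zip(groups, weights):
        norm = np.linalg.norm(alpha[g])
        scale = max(0.0, 1.0 - t * w / norm) if norm > 0 else 0.0
        out[g] = scale * alpha[g]            # the whole group is shrunk toward zero together
    return out

# Illustrative usage: three groups of code entries, each kept or zeroed as a block.
alpha = np.array([0.9, -0.8, 0.05, 0.02, 1.5, 1.2])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(prox_group_l2(alpha, groups, weights=[1.0, 1.0, 1.0], t=0.3))
```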

Other Examples. Other examples of structured sparsity:
M. Stojnic et al., On the Reconstruction of Block-Sparse Signals With an Optimal Number of Measurements, http://dx.doi.org/10.1109/tsp.2009.2020754
J. Mairal et al., Convex and Network Flow Optimization for Structured Sparsity, http://jmlr.org/papers/volume12/mairal11a/mairal11a.pdf


Summary. Two interesting directions: increasing the speed of the testing phase, and optimizing dictionary structure.