Coordinate Descent for L1 Minimization


Coordinate Descent for L1 Minimization
Yingying Li and Stanley Osher
Department of Mathematics, University of California, Los Angeles
Jan 13, 2010

Outline
- Coordinate descent and L1 minimization problems
- Application to compressed sensing
- Application to image denoising

Basic Idea of Coordinate Descent
Consider the optimization problem
$\tilde{u} = \arg\min_{u \in \mathbb{R}^n} E(u), \qquad u = (u_1, u_2, \ldots, u_n).$
To solve it by coordinate descent, at each step we update one coordinate while fixing all the others, and iterate:
$\tilde{u}_j = \arg\min_{x \in \mathbb{R}} E(u_1, \ldots, u_{j-1}, x, u_{j+1}, \ldots, u_n).$
The order in which the updated coordinate j is chosen is called the sweep pattern, and it is essential to convergence.
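As a concrete starting point, here is a minimal NumPy sketch of this loop with a sequential sweep; the scalar solver argmin_coord is a placeholder for the one-dimensional subproblem solver, not something defined on the slides.

```python
import numpy as np

def coordinate_descent(argmin_coord, u0, n_sweeps=100):
    """Generic coordinate descent with a sequential sweep.

    argmin_coord(u, j) is assumed to return the scalar x minimizing
    E(u_1, ..., u_{j-1}, x, u_{j+1}, ..., u_n); u0 is the starting point.
    """
    u = np.array(u0, dtype=float)
    for _ in range(n_sweeps):
        for j in range(u.size):           # sequential sweep over coordinates 1, ..., n
            u[j] = argmin_coord(u, j)     # exact minimization in coordinate j only
    return u
```

With the shrinkage solver derived later for the L1 coordinate subproblem, this skeleton becomes the pathwise algorithm shown below.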

Models Including L1 Functions
We consider energies with an L1 term,
$E(u_1, u_2, \ldots, u_n) = F(u_1, u_2, \ldots, u_n) + \sum_{i=1}^{n} |u_i|, \qquad (1)$
and also energies with a total variation (TV) term,
$E(u_1, u_2, \ldots, u_n) = F(u_1, u_2, \ldots, u_n) + \sum_{i=1}^{n} |\nabla u_i|, \qquad (2)$
where F is convex and differentiable and $\nabla$ is a discrete approximation of the gradient.

Models Including L1 Functions
Compressed sensing:
$u = \arg\min \|u\|_1 \quad \text{s.t.} \quad Au = f. \qquad (3)$
TV-based image processing, such as ROF (Rudin-Osher-Fatemi) denoising and nonlocal ROF denoising (Guy Gilboa and Stanley Osher):
$u = \arg\min \|u\|_{TV} + \lambda \|u - f\|_2^2, \qquad (4)$
$u = \arg\min \|u\|_{NL\text{-}TV} + \lambda \|u - f\|_2^2, \qquad (5)$
where $\lambda$ is a scalar parameter and $f$ is the input noisy image.

Why Coordinate Descent?
Advantages:
- Scalar minimization is easier than multivariable minimization
- Gradient free
- Easy to implement
Disadvantages:
- Hard to take advantage of structured relationships among coordinates (e.g., a problem involving the FFT matrix)
- The convergence rate is not good, and it can get stuck at non-optimal points (for example, the fused lasso $E(u) = \|Au - f\|_2^2 + \lambda_1 \|u\|_1 + \lambda_2 \|u\|_{TV}$)

Coordinate Descent Sweep Patterns
Sweep patterns:
- Sequential: 1, 2, ..., n, 1, 2, ..., n, ...  or  1, 2, ..., n, n-1, n-2, ..., 1, ...
- Random: a permutation of 1, 2, ..., n, then repeat
Essentially cyclic rule: suppose the energy is convex and its nondifferentiable parts are separable. Then coordinate descent converges if the sweep pattern satisfies the essentially cyclic rule, i.e., every coordinate is visited infinitely often. However, there are no theoretical results on the convergence rate.

Compressed Sensing (CS)
$\min_{u \in \mathbb{R}^n} E(u) = \|u\|_0, \quad \text{subject to } Au = f.$
If the matrix A is incoherent, this is equivalent to its convex relaxation:
$\min_{u \in \mathbb{R}^n} \|u\|_1, \quad \text{subject to } Au = f.$
Restricted Isometry Condition (RIC): a measurement matrix A satisfies the RIC with parameters $(s, \delta)$, $\delta \in (0, 1)$, if
$(1 - \delta)\|v\|_2 \le \|Av\|_2 \le (1 + \delta)\|v\|_2$
for all s-sparse vectors v. When $\delta$ is small, the RIC says that every set of s columns of A is approximately an orthonormal system.

Bregman Iterative Method
The constrained problem:
$\min_{u \in \mathbb{R}^n} \|u\|_1 \quad \text{subject to } Au = f.$
The unconstrained problem:
$\min_{u \in \mathbb{R}^n} E(u) = \|u\|_1 + \lambda \|Au - f\|_2^2.$

Bregman Iterative Algorithm
1: Initialize: $k = 0$, $u^0 = 0$, $f^0 = f$.
2: while $\|Au - f\|_2 / \|f\|_2 >$ tolerance do
3:   Solve $u^{k+1} \leftarrow \arg\min_u \|u\|_1 + \lambda \|Au - f^k\|_2^2$ by coordinate descent
4:   $f^{k+1} \leftarrow f + (f^k - Au^{k+1})$
5:   $k \leftarrow k + 1$
6: end while
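A NumPy sketch of this outer loop, assuming a hypothetical inner solver solve_l1_ls(A, g, lam) that returns arg min_u ||u||_1 + lam*||Au - g||_2^2 (for instance, the coordinate descent method developed below); the tolerance and iteration cap are illustrative.

```python
import numpy as np

def bregman_iteration(A, f, lam, solve_l1_ls, tol=1e-6, max_iter=50):
    """Bregman iteration for min ||u||_1 subject to Au = f (sketch)."""
    u = np.zeros(A.shape[1])
    fk = f.copy()
    for _ in range(max_iter):
        u = solve_l1_ls(A, fk, lam)      # inner unconstrained solve
        fk = f + (fk - A @ u)            # add the residual back to the data
        if np.linalg.norm(A @ u - f) <= tol * np.linalg.norm(f):
            break
    return u
```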

Solving the Coordinate Subproblem
Minimize $E(u) = \lambda \|Au - f\|_2^2 + \|u\|_1$ with respect to the single coordinate $u_j$:
$E(u) = \lambda \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} a_{ij} u_j - f_i \Big)^2 + \sum_{i=1}^{n} |u_i|$
$\phantom{E(u)} = \lambda \big( \|a_j\|_2^2 u_j^2 - 2 \beta_j u_j \big) + |u_j| + \sum_{i \ne j} |u_i| + \text{const},$
where $\beta_j = \sum_i a_{ij} \big( f_i - \sum_{k \ne j} a_{ik} u_k \big)$. The optimal value for $u_j$ is
$\tilde{u}_j = \frac{1}{\|a_j\|_2^2} \, \mathrm{shrink}\Big( \beta_j, \frac{1}{2\lambda} \Big).$
This formula corrects the jth component of u, which strictly decreases the energy function E(u).
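A NumPy sketch of this scalar update; the names shrink and update_coordinate are ours, not from the slides.

```python
import numpy as np

def shrink(x, mu):
    """Soft-thresholding: shrink(x, mu) = sign(x) * max(|x| - mu, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def update_coordinate(A, f, u, j, lam):
    """Exact minimizer of lambda*||Au - f||_2^2 + ||u||_1 over u_j alone."""
    a_j = A[:, j]
    r = f - A @ u + a_j * u[j]        # residual without the j-th contribution
    beta_j = a_j @ r
    return shrink(beta_j, 1.0 / (2.0 * lam)) / (a_j @ a_j)
```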

Pathwise Coordinate Descent
Algorithm (Pathwise Coordinate Descent):
While not converged
  For j = 1, 2, ..., n,
    $\beta_j \leftarrow \sum_i a_{ij} \big( f_i - \sum_{k \ne j} a_{ik} u_k \big)$
    $u_j \leftarrow \frac{1}{\|a_j\|_2^2} \, \mathrm{shrink}\Big( \beta_j, \frac{1}{2\lambda} \Big)$
where the shrink operator is
$\mathrm{shrink}(f, \mu) = \begin{cases} f - \mu, & \text{if } f > \mu; \\ 0, & \text{if } -\mu \le f \le \mu; \\ f + \mu, & \text{if } f < -\mu. \end{cases}$
(Figure: plot of shrink(f, µ) as a function of f.)
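Assembled into a full sequential sweep, a standalone sketch looks like this (the sweep count is illustrative):

```python
import numpy as np

def pathwise_coordinate_descent(A, f, lam, n_sweeps=100):
    """Sequential coordinate descent for min lambda*||Au - f||_2^2 + ||u||_1 (sketch)."""
    n = A.shape[1]
    u = np.zeros(n)
    col_norms = np.sum(A * A, axis=0)                 # ||a_j||_2^2 for each column
    for _ in range(n_sweeps):
        for j in range(n):
            a_j = A[:, j]
            beta_j = a_j @ (f - A @ u + a_j * u[j])
            u[j] = np.sign(beta_j) * max(abs(beta_j) - 1.0 / (2.0 * lam), 0.0) / col_norms[j]
    return u
```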

Adaptive Greedy Sweep
Consider the energy decrease from updating only the jth coordinate:
$\Delta E_j = E(u_1, \ldots, u_j, \ldots, u_n) - E(u_1, \ldots, \tilde{u}_j, \ldots, u_n)$
$\phantom{\Delta E_j} = \lambda \|a_j\|_2^2 \Big[ \Big( u_j - \frac{\beta_j}{\|a_j\|_2^2} \Big)^2 - \Big( \tilde{u}_j - \frac{\beta_j}{\|a_j\|_2^2} \Big)^2 \Big] + |u_j| - |\tilde{u}_j|$
$\phantom{\Delta E_j} = \lambda \|a_j\|_2^2 (u_j - \tilde{u}_j) \Big( u_j + \tilde{u}_j - \frac{2\beta_j}{\|a_j\|_2^2} \Big) + |u_j| - |\tilde{u}_j|.$
We select the coordinate to update so as to maximize the decrease in energy, $j = \arg\max_j \Delta E_j$.
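Assuming the vector β and the squared column norms ‖a_j‖² are available for all coordinates (a later slide shows how to cache β cheaply), this rule can be evaluated in a vectorized way; a sketch with our own function name:

```python
import numpy as np

def pick_by_energy_decrease(u, beta, col_norms, lam):
    """Return the coordinate whose exact update gives the largest energy decrease dE_j."""
    u_new = np.sign(beta) * np.maximum(np.abs(beta) - 1.0 / (2.0 * lam), 0.0) / col_norms
    dE = (lam * col_norms * (u - u_new) * (u + u_new - 2.0 * beta / col_norms)
          + np.abs(u) - np.abs(u_new))
    return int(np.argmax(dE))
```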

Greedy is Better
(Figure: three panels, "After one sequential sweep", "zoom in", and "adaptive sweeps". The dots represent the exact solution of the unconstrained problem at λ = 1000 and the circles are the results after 512 iterations. The left panel shows the result after one sequential sweep (512 iterations), the middle one is its zoom-in, and the right one shows the result of an adaptive sweep.)

Gradient-based Adaptive Sweep
For the same unconstrained problem, Tongtong Wu and Kenneth Lange update the coordinate u_j with the most negative directional derivative along the forward or backward coordinate directions. If $e_k$ is the coordinate direction along which $u_k$ varies, the objective function E(u) has the directional derivatives
$d_{e_k} E = \lim_{\sigma \downarrow 0} \frac{E(u + \sigma e_k) - E(u)}{\sigma} = 2\lambda \big( \|a_k\|_2^2 u_k - \beta_k \big) + \begin{cases} 1, & u_k \ge 0; \\ -1, & u_k < 0, \end{cases}$
$d_{-e_k} E = \lim_{\sigma \downarrow 0} \frac{E(u - \sigma e_k) - E(u)}{\sigma} = -2\lambda \big( \|a_k\|_2^2 u_k - \beta_k \big) + \begin{cases} -1, & u_k > 0; \\ 1, & u_k \le 0. \end{cases}$
We choose the updating coordinate $j = \arg\min_j \{ d_{e_j} E, \, d_{-e_j} E \}$.
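Under the same assumptions (cached β and squared column norms ‖a_k‖²), a sketch of this rule; the helper name is ours:

```python
import numpy as np

def pick_by_directional_derivative(u, beta, col_norms, lam):
    """Return the coordinate with the most negative forward/backward directional derivative."""
    g = 2.0 * lam * (col_norms * u - beta)       # derivative of the smooth part
    d_fwd = g + np.where(u >= 0, 1.0, -1.0)      # directional derivative along +e_k
    d_bwd = -g + np.where(u > 0, -1.0, 1.0)      # directional derivative along -e_k
    return int(np.argmin(np.minimum(d_fwd, d_bwd)))
```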

Update Difference Based
We choose the updating coordinate based on the update difference:
$j = \arg\max_j |u_j - \tilde{u}_j|.$

Greedy Selection Choices
(1) Energy function based: $j = \arg\max_i \Delta E_i$
(2) Gradient based: $j = \arg\min_i \{ d_{e_i} E, \, d_{-e_i} E \}$
(3) Update difference based: $j = \arg\max_i |u_i - \tilde{u}_i|$

Computational Tricks
Our algorithm involves a vector β with
$\beta_j = (A^T f)_j - a_j^T A u + \|a_j\|_2^2 u_j.$
Since u changes in only one coordinate per iteration, β can be updated efficiently without recomputing it entirely:
$\beta^{k+1} - \beta^k = (u_p^{k+1} - u_p^k) \big( \|a_p\|_2^2 I - A^T A \big) e_p.$
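A sketch of this rank-one update, assuming the Gram matrix G = AᵀA and the squared column norms are precomputed (names are ours):

```python
import numpy as np

def update_beta(beta, G, col_norms, p, du):
    """Update beta after u_p changes by du = u_p^{k+1} - u_p^k.

    G = A.T @ A and col_norms[p] = ||a_p||_2^2 = G[p, p] are precomputed.
    """
    beta = beta - du * G[:, p]                 # the -A^T A u part of beta
    beta[p] = beta[p] + du * col_norms[p]      # the +||a_p||^2 u_p part; net change at p is zero
    return beta
```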

Algorithm: Coordinate Descent with a Refined Sweep
Precompute: $w_j = \|a_j\|_2^2$.
Normalization: $A(:, j) \leftarrow A(:, j)/\sqrt{w_j}$, so that each column has unit 2-norm.
Initialization: $u = 0$, $\beta = A^T f$.
Iterate until convergence:
  $\tilde{u} = \mathrm{shrink}\big(\beta, \frac{1}{2\lambda}\big)$;
  $j = \arg\max_i |u_i - \tilde{u}_i|$, $u_j^{k+1} = \tilde{u}_j$;
  $\beta^{k+1} = \beta^k + (u_j^k - \tilde{u}_j)(A^T A) e_j$, then $\beta_j^{k+1} = \beta_j^k$.
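The following NumPy sketch combines the cached β and the update-difference rule; it keeps the general update shrink(β_j, 1/(2λ))/‖a_j‖² from the subproblem slide instead of normalizing the columns, so no rescaling of the solution is needed. The tolerance and iteration cap are illustrative.

```python
import numpy as np

def cd_refined_sweep(A, f, lam, tol=1e-8, max_iter=100000):
    """Greedy coordinate descent for min ||u||_1 + lam*||Au - f||_2^2 (sketch)."""
    A = np.asarray(A, dtype=float)
    G = A.T @ A                               # Gram matrix, precomputed once
    w = np.diag(G).copy()                     # w_j = ||a_j||_2^2
    beta = A.T @ f                            # beta for the starting point u = 0
    u = np.zeros(A.shape[1])
    for _ in range(max_iter):
        u_new = np.sign(beta) * np.maximum(np.abs(beta) - 1.0 / (2.0 * lam), 0.0) / w
        j = int(np.argmax(np.abs(u - u_new)))   # update-difference selection rule
        du = u_new[j] - u[j]
        if abs(du) < tol:                       # largest possible single-coordinate move is tiny
            break
        u[j] = u_new[j]
        beta -= du * G[:, j]                    # cheap rank-one beta update ...
        beta[j] += du * w[j]                    # ... entry j stays unchanged overall
    return u
```

As a quick sanity check, for A equal to the identity this reduces to elementwise soft-thresholding of f.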

Convergence of Greedy Coordinate Descent
Consider minimizing functions of the form
$\min_{u \in \mathbb{R}^n} H(u_1, u_2, \ldots, u_n) = F(u_1, u_2, \ldots, u_n) + \sum_{i=1}^{n} g_i(u_i),$
where F is differentiable and convex and each $g_i$ is convex. Tseng (1988) proved that pathwise coordinate descent converges to a minimizer of H.
Theorem. If $\lim_{|u_i| \to \infty} H(u_1, u_2, \ldots, u_n) = \infty$ for every i, and $\big| \frac{\partial^2 F}{\partial u_i \partial u_j} \big| \le M$, then the greedy coordinate descent method based on the selection rule $j = \arg\max_i \Delta H_i$ converges to an optimal solution.

Convergence of Greedy Coordinate Descent
Lemma. Consider the function $H(x) = (ax - b)^2 + |x|$ with $a > 0$. It is easy to check that $\bar{x} = \arg\min_x H(x) = \mathrm{shrink}\big( \frac{b}{a}, \frac{1}{2a^2} \big)$. Moreover, for any x,
$|x - \bar{x}| \le \frac{1}{a} \big( H(x) - H(\bar{x}) \big)^{1/2}.$
Theorem. The greedy coordinate descent method with the selection rule $j = \arg\max_i |u_i - \tilde{u}_i|$ converges to an optimal solution of
$\min_{u \in \mathbb{R}^n} H(u) = \|u\|_1 + \lambda \|Au - f\|_2^2.$

Comparison of Numerical Speed
$\min_{u \in \mathbb{R}^n} \|u\|_1 + \lambda \|Au - f\|_2^2. \qquad (6)$

  λ    | Pathwise |  (1)  |  (2)  |  (3)
  0.1  |     43   | 0.33  | 0.24  | 0.18
  0.5  |     42   | 0.31  | 0.22  | 0.17
  1    |     49   | 0.26  | 0.19  | 0.16
  10   |    242   | 0.56  | 0.42  | 0.33
  100  |   1617   | 4.0   | 2.7   | 2.2

Table: Runtime in seconds of pathwise coordinate descent and the adaptive sweep methods (1), (2), (3) for solving (6) to the same accuracy.

Another Application to Image Denoising
(Figure: Exact, Noisy, and Denoised images.)

Solving ROF
The ROF (Rudin-Osher-Fatemi) model for image denoising: find u satisfying
$u = \arg\min_u \int_{\Omega} |\nabla u| \, dx + \lambda \|u - f\|_2^2,$
where f is the given noisy image and λ > 0 is a parameter.
Methods for solving ROF:
- Gradient descent (slow; time-step restriction)
- Graph cuts (fast, but does not parallelize)
- Coordinate descent

Solving in Parallel by Coordinate Descent
The scalar subproblem at each pixel is
$\min_{u \in \mathbb{R}} E(u) = |u - u_l| + |u - u_r| + |u - u_u| + |u - u_d| + \lambda (u - f)^2, \qquad (7)$
whose minimizer is given by the median formula
$u^{\mathrm{opt}} = \mathrm{median}\Big\{ u_l, u_r, u_u, u_d, \, f + \tfrac{2}{\lambda}, \, f + \tfrac{1}{\lambda}, \, f, \, f - \tfrac{1}{\lambda}, \, f - \tfrac{2}{\lambda} \Big\}, \qquad (8)$
where $u_l, u_r, u_u, u_d$ are the left, right, up, and down neighbors of the pixel.
While not converged: apply (8) to the pixels in the black checkerboard pattern in parallel; then apply (8) to the pixels in the white pattern in parallel.
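A sketch of the red/black sweep for this anisotropic ROF model; boundary pixels are skipped for brevity, the sweep count is illustrative, and a typical call is rof_checkerboard_sweep(f.copy(), f, lam).

```python
import numpy as np

def rof_checkerboard_sweep(u, f, lam, n_sweeps=100):
    """Anisotropic ROF denoising by red/black parallel coordinate descent (sketch).

    u and f are 2-D arrays of the same shape; interior pixels only.
    """
    u = u.copy()
    for _ in range(n_sweeps):
        for color in (0, 1):                        # black pixels, then white pixels
            ul = u[1:-1, :-2]; ur = u[1:-1, 2:]     # left / right neighbors
            uu = u[:-2, 1:-1]; ud = u[2:, 1:-1]     # up / down neighbors
            fc = f[1:-1, 1:-1]
            cand = np.stack([ul, ur, uu, ud,
                             fc + 2/lam, fc + 1/lam, fc, fc - 1/lam, fc - 2/lam])
            med = np.median(cand, axis=0)           # the nine-point median formula (8)
            ii, jj = np.meshgrid(np.arange(1, u.shape[0] - 1),
                                 np.arange(1, u.shape[1] - 1), indexing='ij')
            mask = ((ii + jj) % 2 == color)         # checkerboard pattern
            u[1:-1, 1:-1][mask] = med[mask]
    return u
```

Because the four neighbors of a black pixel are all white (and vice versa), each half of the checkerboard can be updated simultaneously without changing the result of exact coordinate descent.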

Numerical Results
(Figure: two denoising examples. First example: input SNR 8.8, results with SNR 13.2 (1.10 s) and SNR 13.5 (1.15 s). Second example: input SNR 3.1, results with SNR 9.5 (1.30 s) and SNR 10.3 (1.27 s).)

Nonlocal Means (Buades, Coll and Morel)
Every pair of pixels x and y has a weight w(x, y) that measures the similarity of their patches:
$w(x, y) = \exp\Big( -\frac{1}{h^2} \int_{\Omega} G_a(t) \, |f(x + t) - f(y + t)|^2 \, dt \Big), \qquad (9)$
where $G_a$ is a Gaussian with standard deviation a. For computational efficiency, the support of w(x, y) is often restricted to a search window $|x - y| \le R$ and set to zero outside.
Here, we consider the Nonlocal-ROF model,
$\arg\min_u E(u, f) = \iint w(x, y) \, |u(x) - u(y)| \, dx \, dy + \lambda \int (u - f)^2 \, dx. \qquad (10)$
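A discrete sketch of the weight (9) for a single pixel pair; the patch radius, the normalization of G_a, and the function name are our assumptions.

```python
import numpy as np

def nlm_weight(f, x, y, patch_radius, a, h):
    """Nonlocal-means weight between pixels x and y (discrete sketch of (9)).

    f is a 2-D image; x and y are (row, col) tuples at least patch_radius from the border.
    """
    r = patch_radius
    t = np.arange(-r, r + 1)
    ti, tj = np.meshgrid(t, t, indexing='ij')
    Ga = np.exp(-(ti**2 + tj**2) / (2.0 * a**2))     # Gaussian patch kernel
    Ga /= Ga.sum()                                   # normalized here by assumption
    px = f[x[0]-r:x[0]+r+1, x[1]-r:x[1]+r+1]
    py = f[y[0]-r:y[0]+r+1, y[1]-r:y[1]+r+1]
    d2 = np.sum(Ga * (px - py)**2)                   # Gaussian-weighted patch distance
    return np.exp(-d2 / h**2)
```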

Generalization of the Median Formula
$\arg\min_x \sum_{i=1}^{n} w_i |x - u_i| + \lambda |x - f|^\alpha$
$= \mathrm{median}\Big\{ u_1, u_2, \ldots, u_n,$
$\quad f + |w_n + \cdots + w_1|^p \mu,$
$\quad f + \mathrm{sign}(w_n + \cdots + w_2 - w_1) \, |w_n + \cdots + w_2 - w_1|^p \mu,$
$\quad f + \mathrm{sign}(w_n + \cdots + w_3 - w_2 - w_1) \, |w_n + \cdots + w_3 - w_2 - w_1|^p \mu,$
$\quad \ldots,$
$\quad f + \mathrm{sign}(w_n - w_{n-1} - \cdots - w_1) \, |w_n - w_{n-1} - \cdots - w_1|^p \mu,$
$\quad f - |w_n + w_{n-1} + \cdots + w_1|^p \mu \Big\}, \qquad (11)$
where $p = \frac{1}{\alpha - 1}$ and $\mu = \big( \frac{1}{\alpha\lambda} \big)^p$. A computational saving is that these candidate points do not depend on the $u_i$, so with all else fixed they only need to be computed once.
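A sketch of formula (11) as a scalar solver, assuming α > 1 and positive weights; with n = 4, unit weights, and α = 2 it reproduces the nine-point median formula (8).

```python
import numpy as np

def weighted_median_minimizer(u, w, f, lam, alpha):
    """Minimize sum_i w_i*|x - u_i| + lam*|x - f|^alpha via formula (11); requires alpha > 1."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    p = 1.0 / (alpha - 1.0)
    mu = (1.0 / (alpha * lam)) ** p
    S = np.concatenate(([0.0], np.cumsum(w)))      # partial sums w_1 + ... + w_k, k = 0..n
    d = w.sum() - 2.0 * S                          # (w_{k+1}+...+w_n) - (w_1+...+w_k)
    candidates = f + np.sign(d) * np.abs(d) ** p * mu
    return np.median(np.concatenate([u, candidates]))   # median of 2n+1 values
```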

Numerical Results
(Figure: NL-ROF denoising with a 5×5 search window (SNR 10.4), a 7×7 window (SNR 10.8), and an 11×11 window (SNR 10.9).)
The runtimes are 14 s with the 5×5 search window, 20 s with the 7×7 window, and 34 s with the 11×11 window. (Parameters: λ = 5×10⁻³, a = 1.25, h = 10.2.)

Thanks for your attention!