Coordinate Descent for L1 Minimization


Coordinate Descent for L1 Minimization
Yingying Li and Stanley Osher
Department of Mathematics, University of California, Los Angeles
Jan 13, 2010

Outline
- Coordinate descent and L1 minimization problems
- Application to compressed sensing
- Application to image denoising

Basic Idea of Coordinate Descent
Consider the optimization problem
$\tilde{u} = \arg\min_{u \in \mathbb{R}^n} E(u), \qquad u = (u_1, u_2, \ldots, u_n).$
To solve it by coordinate descent, at each step we update one coordinate while fixing all the others, and iterate:
$\tilde{u}_j = \arg\min_{x \in \mathbb{R}} E(u_1, \ldots, u_{j-1}, x, u_{j+1}, \ldots, u_n).$
The order in which the updated coordinate j is chosen is called the sweep pattern, and it is essential to convergence.
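As a concrete starting point, here is a minimal NumPy sketch of this loop with a sequential sweep; the scalar solver argmin_coord is a placeholder for the one-dimensional subproblem solver, not something defined on the slides.

```python
import numpy as np

def coordinate_descent(argmin_coord, u0, n_sweeps=100):
    """Generic coordinate descent with a sequential sweep.

    argmin_coord(u, j) is assumed to return the scalar x minimizing
    E(u_1, ..., u_{j-1}, x, u_{j+1}, ..., u_n); u0 is the starting point.
    """
    u = np.array(u0, dtype=float)
    for _ in range(n_sweeps):
        for j in range(u.size):           # sequential sweep over coordinates 1, ..., n
            u[j] = argmin_coord(u, j)     # exact minimization in coordinate j only
    return u
```

With the shrinkage solver derived later for the L1 coordinate subproblem, this skeleton becomes the pathwise algorithm shown below.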

Models Including L1 Functions
We consider energies with an L1 term,
$E(u_1, u_2, \ldots, u_n) = F(u_1, u_2, \ldots, u_n) + \sum_{i=1}^{n} |u_i|, \qquad (1)$
and also energies with a total variation (TV) term,
$E(u_1, u_2, \ldots, u_n) = F(u_1, u_2, \ldots, u_n) + \sum_{i=1}^{n} |\nabla u_i|, \qquad (2)$
where F is convex and differentiable and $\nabla$ is a discrete approximation of the gradient.

Models Including L1 Functions
Compressed sensing:
$u = \arg\min \|u\|_1 \quad \text{s.t.} \quad Au = f. \qquad (3)$
TV-based image processing, such as ROF (Rudin-Osher-Fatemi) denoising and nonlocal ROF denoising (Guy Gilboa and Stanley Osher):
$u = \arg\min \|u\|_{TV} + \lambda \|u - f\|_2^2, \qquad (4)$
$u = \arg\min \|u\|_{NL\text{-}TV} + \lambda \|u - f\|_2^2, \qquad (5)$
where $\lambda$ is a scalar parameter and $f$ is the input noisy image.

Why Coordinate Descent?
Advantages:
- Scalar minimization is easier than multivariable minimization
- Gradient free
- Easy to implement
Disadvantages:
- Hard to take advantage of structured relationships among coordinates (e.g., a problem involving the FFT matrix)
- The convergence rate is not good, and it can get stuck at non-optimal points (for example, the fused lasso $E(u) = \|Au - f\|_2^2 + \lambda_1 \|u\|_1 + \lambda_2 \|u\|_{TV}$)

Coordinate Descent Sweep Patterns
Sweep patterns:
- Sequential: 1, 2, ..., n, 1, 2, ..., n, ...  or  1, 2, ..., n, n-1, n-2, ..., 1, ...
- Random: a permutation of 1, 2, ..., n, then repeat
Essentially cyclic rule: suppose the energy is convex and its nondifferentiable parts are separable. Then coordinate descent converges if the sweep pattern satisfies the essentially cyclic rule, i.e., every coordinate is visited infinitely often. However, there are no theoretical results on the convergence rate.

Compressed Sensing (CS)
$\min_{u \in \mathbb{R}^n} E(u) = \|u\|_0, \quad \text{subject to } Au = f.$
If the matrix A is incoherent, this is equivalent to its convex relaxation:
$\min_{u \in \mathbb{R}^n} \|u\|_1, \quad \text{subject to } Au = f.$
Restricted Isometry Condition (RIC): a measurement matrix A satisfies the RIC with parameters $(s, \delta)$, $\delta \in (0, 1)$, if
$(1 - \delta)\|v\|_2 \le \|Av\|_2 \le (1 + \delta)\|v\|_2$
for all s-sparse vectors v. When $\delta$ is small, the RIC says that every set of s columns of A is approximately an orthonormal system.

Bregman Iterative Method
The constrained problem:
$\min_{u \in \mathbb{R}^n} \|u\|_1 \quad \text{subject to } Au = f.$
The unconstrained problem:
$\min_{u \in \mathbb{R}^n} E(u) = \|u\|_1 + \lambda \|Au - f\|_2^2.$

Bregman Iterative Algorithm
1: Initialize: $k = 0$, $u^0 = 0$, $f^0 = f$.
2: while $\|Au - f\|_2 / \|f\|_2 >$ tolerance do
3:   Solve $u^{k+1} \leftarrow \arg\min_u \|u\|_1 + \lambda \|Au - f^k\|_2^2$ by coordinate descent
4:   $f^{k+1} \leftarrow f + (f^k - Au^{k+1})$
5:   $k \leftarrow k + 1$
6: end while
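A NumPy sketch of this outer loop, assuming a hypothetical inner solver solve_l1_ls(A, g, lam) that returns arg min_u ||u||_1 + lam*||Au - g||_2^2 (for instance, the coordinate descent method developed below); the tolerance and iteration cap are illustrative.

```python
import numpy as np

def bregman_iteration(A, f, lam, solve_l1_ls, tol=1e-6, max_iter=50):
    """Bregman iteration for min ||u||_1 subject to Au = f (sketch)."""
    u = np.zeros(A.shape[1])
    fk = f.copy()
    for _ in range(max_iter):
        u = solve_l1_ls(A, fk, lam)      # inner unconstrained solve
        fk = f + (fk - A @ u)            # add the residual back to the data
        if np.linalg.norm(A @ u - f) <= tol * np.linalg.norm(f):
            break
    return u
```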

Solving the Coordinate Subproblem
Minimize $E(u) = \lambda \|Au - f\|_2^2 + \|u\|_1$ with respect to the single coordinate $u_j$:
$E(u) = \lambda \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} a_{ij} u_j - f_i \Big)^2 + \sum_{i=1}^{n} |u_i|$
$\phantom{E(u)} = \lambda \big( \|a_j\|_2^2 u_j^2 - 2 \beta_j u_j \big) + |u_j| + \sum_{i \ne j} |u_i| + \text{const},$
where $\beta_j = \sum_i a_{ij} \big( f_i - \sum_{k \ne j} a_{ik} u_k \big)$. The optimal value for $u_j$ is
$\tilde{u}_j = \frac{1}{\|a_j\|_2^2} \, \mathrm{shrink}\Big( \beta_j, \frac{1}{2\lambda} \Big).$
This formula corrects the jth component of u, which strictly decreases the energy function E(u).
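A NumPy sketch of this scalar update; the names shrink and update_coordinate are ours, not from the slides.

```python
import numpy as np

def shrink(x, mu):
    """Soft-thresholding: shrink(x, mu) = sign(x) * max(|x| - mu, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def update_coordinate(A, f, u, j, lam):
    """Exact minimizer of lambda*||Au - f||_2^2 + ||u||_1 over u_j alone."""
    a_j = A[:, j]
    r = f - A @ u + a_j * u[j]        # residual without the j-th contribution
    beta_j = a_j @ r
    return shrink(beta_j, 1.0 / (2.0 * lam)) / (a_j @ a_j)
```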

Pathwise Coordinate Descent
Algorithm (Pathwise Coordinate Descent):
While not converged
  For j = 1, 2, ..., n,
    $\beta_j \leftarrow \sum_i a_{ij} \big( f_i - \sum_{k \ne j} a_{ik} u_k \big)$
    $u_j \leftarrow \frac{1}{\|a_j\|_2^2} \, \mathrm{shrink}\Big( \beta_j, \frac{1}{2\lambda} \Big)$
where the shrink operator is
$\mathrm{shrink}(f, \mu) = \begin{cases} f - \mu, & \text{if } f > \mu; \\ 0, & \text{if } -\mu \le f \le \mu; \\ f + \mu, & \text{if } f < -\mu. \end{cases}$
(Figure: plot of shrink(f, µ) as a function of f.)
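Assembled into a full sequential sweep, a standalone sketch looks like this (the sweep count is illustrative):

```python
import numpy as np

def pathwise_coordinate_descent(A, f, lam, n_sweeps=100):
    """Sequential coordinate descent for min lambda*||Au - f||_2^2 + ||u||_1 (sketch)."""
    n = A.shape[1]
    u = np.zeros(n)
    col_norms = np.sum(A * A, axis=0)                 # ||a_j||_2^2 for each column
    for _ in range(n_sweeps):
        for j in range(n):
            a_j = A[:, j]
            beta_j = a_j @ (f - A @ u + a_j * u[j])
            u[j] = np.sign(beta_j) * max(abs(beta_j) - 1.0 / (2.0 * lam), 0.0) / col_norms[j]
    return u
```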

Adaptive Greedy Sweep
Consider the energy decrease from updating only the jth coordinate:
$\Delta E_j = E(u_1, \ldots, u_j, \ldots, u_n) - E(u_1, \ldots, \tilde{u}_j, \ldots, u_n)$
$\phantom{\Delta E_j} = \lambda \|a_j\|_2^2 \Big[ \Big( u_j - \frac{\beta_j}{\|a_j\|_2^2} \Big)^2 - \Big( \tilde{u}_j - \frac{\beta_j}{\|a_j\|_2^2} \Big)^2 \Big] + |u_j| - |\tilde{u}_j|$
$\phantom{\Delta E_j} = \lambda \|a_j\|_2^2 (u_j - \tilde{u}_j) \Big( u_j + \tilde{u}_j - \frac{2\beta_j}{\|a_j\|_2^2} \Big) + |u_j| - |\tilde{u}_j|.$
We select the coordinate to update so as to maximize the decrease in energy, $j = \arg\max_j \Delta E_j$.
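Assuming the vector β and the squared column norms ‖a_j‖² are available for all coordinates (a later slide shows how to cache β cheaply), this rule can be evaluated in a vectorized way; a sketch with our own function name:

```python
import numpy as np

def pick_by_energy_decrease(u, beta, col_norms, lam):
    """Return the coordinate whose exact update gives the largest energy decrease dE_j."""
    u_new = np.sign(beta) * np.maximum(np.abs(beta) - 1.0 / (2.0 * lam), 0.0) / col_norms
    dE = (lam * col_norms * (u - u_new) * (u + u_new - 2.0 * beta / col_norms)
          + np.abs(u) - np.abs(u_new))
    return int(np.argmax(dE))
```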

Greedy is Better
(Figure: three panels, "After one sequential sweep", "zoom in", and "adaptive sweeps". The dots represent the exact solution of the unconstrained problem at λ = 1000 and the circles are the results after 512 iterations. The left panel shows the result after one sequential sweep (512 iterations), the middle one is its zoom-in, and the right one shows the result of an adaptive sweep.)

Gradient-based Adaptive Sweep
For the same unconstrained problem, Tongtong Wu and Kenneth Lange update the coordinate u_j with the most negative directional derivative along the forward or backward coordinate directions. If $e_k$ is the coordinate direction along which $u_k$ varies, the objective function E(u) has the directional derivatives
$d_{e_k} E = \lim_{\sigma \downarrow 0} \frac{E(u + \sigma e_k) - E(u)}{\sigma} = 2\lambda \big( \|a_k\|_2^2 u_k - \beta_k \big) + \begin{cases} 1, & u_k \ge 0; \\ -1, & u_k < 0, \end{cases}$
$d_{-e_k} E = \lim_{\sigma \downarrow 0} \frac{E(u - \sigma e_k) - E(u)}{\sigma} = -2\lambda \big( \|a_k\|_2^2 u_k - \beta_k \big) + \begin{cases} -1, & u_k > 0; \\ 1, & u_k \le 0. \end{cases}$
We choose the updating coordinate $j = \arg\min_j \{ d_{e_j} E, \, d_{-e_j} E \}$.
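Under the same assumptions (cached β and squared column norms ‖a_k‖²), a sketch of this rule; the helper name is ours:

```python
import numpy as np

def pick_by_directional_derivative(u, beta, col_norms, lam):
    """Return the coordinate with the most negative forward/backward directional derivative."""
    g = 2.0 * lam * (col_norms * u - beta)       # derivative of the smooth part
    d_fwd = g + np.where(u >= 0, 1.0, -1.0)      # directional derivative along +e_k
    d_bwd = -g + np.where(u > 0, -1.0, 1.0)      # directional derivative along -e_k
    return int(np.argmin(np.minimum(d_fwd, d_bwd)))
```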

Update Difference Based
We choose the updating coordinate based on the update difference:
$j = \arg\max_j |u_j - \tilde{u}_j|.$

Greedy Selection Choices
(1) Energy function based: $j = \arg\max_i \Delta E_i$
(2) Gradient based: $j = \arg\min_i \{ d_{e_i} E, \, d_{-e_i} E \}$
(3) Update difference based: $j = \arg\max_i |u_i - \tilde{u}_i|$

Computational Tricks
Our algorithm involves a vector β with
$\beta_j = (A^T f)_j - a_j^T A u + \|a_j\|_2^2 u_j.$
Since u changes in only one coordinate per iteration, β can be updated efficiently without recomputing it entirely:
$\beta^{k+1} - \beta^k = (u_p^{k+1} - u_p^k) \big( \|a_p\|_2^2 I - A^T A \big) e_p.$
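A sketch of this rank-one update, assuming the Gram matrix G = AᵀA and the squared column norms are precomputed (names are ours):

```python
import numpy as np

def update_beta(beta, G, col_norms, p, du):
    """Update beta after u_p changes by du = u_p^{k+1} - u_p^k.

    G = A.T @ A and col_norms[p] = ||a_p||_2^2 = G[p, p] are precomputed.
    """
    beta = beta - du * G[:, p]                 # the -A^T A u part of beta
    beta[p] = beta[p] + du * col_norms[p]      # the +||a_p||^2 u_p part; net change at p is zero
    return beta
```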

Algorithm: Coordinate Descent with a Refined Sweep
Precompute: $w_j = \|a_j\|_2^2$.
Normalization: $A(:, j) \leftarrow A(:, j)/\sqrt{w_j}$, so that each column has unit 2-norm.
Initialization: $u = 0$, $\beta = A^T f$.
Iterate until convergence:
  $\tilde{u} = \mathrm{shrink}\big(\beta, \frac{1}{2\lambda}\big)$;
  $j = \arg\max_i |u_i - \tilde{u}_i|$, $u_j^{k+1} = \tilde{u}_j$;
  $\beta^{k+1} = \beta^k + (u_j^k - \tilde{u}_j)(A^T A) e_j$, then $\beta_j^{k+1} = \beta_j^k$.
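The following NumPy sketch combines the cached β and the update-difference rule; it keeps the general update shrink(β_j, 1/(2λ))/‖a_j‖² from the subproblem slide instead of normalizing the columns, so no rescaling of the solution is needed. The tolerance and iteration cap are illustrative.

```python
import numpy as np

def cd_refined_sweep(A, f, lam, tol=1e-8, max_iter=100000):
    """Greedy coordinate descent for min ||u||_1 + lam*||Au - f||_2^2 (sketch)."""
    A = np.asarray(A, dtype=float)
    G = A.T @ A                               # Gram matrix, precomputed once
    w = np.diag(G).copy()                     # w_j = ||a_j||_2^2
    beta = A.T @ f                            # beta for the starting point u = 0
    u = np.zeros(A.shape[1])
    for _ in range(max_iter):
        u_new = np.sign(beta) * np.maximum(np.abs(beta) - 1.0 / (2.0 * lam), 0.0) / w
        j = int(np.argmax(np.abs(u - u_new)))   # update-difference selection rule
        du = u_new[j] - u[j]
        if abs(du) < tol:                       # largest possible single-coordinate move is tiny
            break
        u[j] = u_new[j]
        beta -= du * G[:, j]                    # cheap rank-one beta update ...
        beta[j] += du * w[j]                    # ... entry j stays unchanged overall
    return u
```

As a quick sanity check, for A equal to the identity this reduces to elementwise soft-thresholding of f.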

Convergence of Greedy Coordinate Descent
Consider minimizing functions of the form
$\min_{u \in \mathbb{R}^n} H(u_1, u_2, \ldots, u_n) = F(u_1, u_2, \ldots, u_n) + \sum_{i=1}^{n} g_i(u_i),$
where F is differentiable and convex and each $g_i$ is convex. Tseng (1988) proved that pathwise coordinate descent converges to a minimizer of H.
Theorem. If $\lim_{|u_i| \to \infty} H(u_1, u_2, \ldots, u_n) = \infty$ for every i, and $\big| \frac{\partial^2 F}{\partial u_i \partial u_j} \big| \le M$, then the greedy coordinate descent method based on the selection rule $j = \arg\max_i \Delta H_i$ converges to an optimal solution.

Convergence of Greedy Coordinate Descent
Lemma. Consider the function $H(x) = (ax - b)^2 + |x|$ with $a > 0$. It is easy to check that $\bar{x} = \arg\min_x H(x) = \mathrm{shrink}\big( \frac{b}{a}, \frac{1}{2a^2} \big)$. Moreover, for any x,
$|x - \bar{x}| \le \frac{1}{a} \big( H(x) - H(\bar{x}) \big)^{1/2}.$
Theorem. The greedy coordinate descent method with the selection rule $j = \arg\max_i |u_i - \tilde{u}_i|$ converges to an optimal solution of
$\min_{u \in \mathbb{R}^n} H(u) = \|u\|_1 + \lambda \|Au - f\|_2^2.$

Comparison of Numerical Speed
$\min_{u \in \mathbb{R}^n} \|u\|_1 + \lambda \|Au - f\|_2^2. \qquad (6)$

  λ    | Pathwise |  (1)  |  (2)  |  (3)
  0.1  |     43   | 0.33  | 0.24  | 0.18
  0.5  |     42   | 0.31  | 0.22  | 0.17
  1    |     49   | 0.26  | 0.19  | 0.16
  10   |    242   | 0.56  | 0.42  | 0.33
  100  |   1617   | 4.0   | 2.7   | 2.2

Table: Runtime in seconds of pathwise coordinate descent and the adaptive sweep methods (1), (2), (3) for solving (6) to the same accuracy.

Another Application to Image Denoising
(Figure: Exact, Noisy, and Denoised images.)

Solving ROF
The ROF (Rudin-Osher-Fatemi) model for image denoising: find u satisfying
$u = \arg\min_u \int_{\Omega} |\nabla u| \, dx + \lambda \|u - f\|_2^2,$
where f is the given noisy image and λ > 0 is a parameter.
Methods for solving ROF:
- Gradient descent (slow; time-step restriction)
- Graph cuts (fast, but does not parallelize)
- Coordinate descent

Solving in Parallel by Coordinate Descent
The scalar subproblem at each pixel is
$\min_{u \in \mathbb{R}} E(u) = |u - u_l| + |u - u_r| + |u - u_u| + |u - u_d| + \lambda (u - f)^2, \qquad (7)$
whose minimizer is given by the median formula
$u^{\mathrm{opt}} = \mathrm{median}\Big\{ u_l, u_r, u_u, u_d, \, f + \tfrac{2}{\lambda}, \, f + \tfrac{1}{\lambda}, \, f, \, f - \tfrac{1}{\lambda}, \, f - \tfrac{2}{\lambda} \Big\}, \qquad (8)$
where $u_l, u_r, u_u, u_d$ are the left, right, up, and down neighbors of the pixel.
While not converged: apply (8) to the pixels in the black checkerboard pattern in parallel; then apply (8) to the pixels in the white pattern in parallel.
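A sketch of the red/black sweep for this anisotropic ROF model; boundary pixels are skipped for brevity, the sweep count is illustrative, and a typical call is rof_checkerboard_sweep(f.copy(), f, lam).

```python
import numpy as np

def rof_checkerboard_sweep(u, f, lam, n_sweeps=100):
    """Anisotropic ROF denoising by red/black parallel coordinate descent (sketch).

    u and f are 2-D arrays of the same shape; interior pixels only.
    """
    u = u.copy()
    for _ in range(n_sweeps):
        for color in (0, 1):                        # black pixels, then white pixels
            ul = u[1:-1, :-2]; ur = u[1:-1, 2:]     # left / right neighbors
            uu = u[:-2, 1:-1]; ud = u[2:, 1:-1]     # up / down neighbors
            fc = f[1:-1, 1:-1]
            cand = np.stack([ul, ur, uu, ud,
                             fc + 2/lam, fc + 1/lam, fc, fc - 1/lam, fc - 2/lam])
            med = np.median(cand, axis=0)           # the nine-point median formula (8)
            ii, jj = np.meshgrid(np.arange(1, u.shape[0] - 1),
                                 np.arange(1, u.shape[1] - 1), indexing='ij')
            mask = ((ii + jj) % 2 == color)         # checkerboard pattern
            u[1:-1, 1:-1][mask] = med[mask]
    return u
```

Because the four neighbors of a black pixel are all white (and vice versa), each half of the checkerboard can be updated simultaneously without changing the result of exact coordinate descent.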

Numerical Results
(Figure: two denoising examples. First example: input SNR 8.8, results with SNR 13.2 (1.10 s) and SNR 13.5 (1.15 s). Second example: input SNR 3.1, results with SNR 9.5 (1.30 s) and SNR 10.3 (1.27 s).)

Nonlocal Means (Buades, Coll and Morel)
Every pair of pixels x and y has a weight w(x, y) that measures the similarity of their patches:
$w(x, y) = \exp\Big( -\frac{1}{h^2} \int_{\Omega} G_a(t) \, |f(x + t) - f(y + t)|^2 \, dt \Big), \qquad (9)$
where $G_a$ is a Gaussian with standard deviation a. For computational efficiency, the support of w(x, y) is often restricted to a search window $|x - y| \le R$ and set to zero outside.
Here, we consider the Nonlocal-ROF model,
$\arg\min_u E(u, f) = \iint w(x, y) \, |u(x) - u(y)| \, dx \, dy + \lambda \int (u - f)^2 \, dx. \qquad (10)$
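A discrete sketch of the weight (9) for a single pixel pair; the patch radius, the normalization of G_a, and the function name are our assumptions.

```python
import numpy as np

def nlm_weight(f, x, y, patch_radius, a, h):
    """Nonlocal-means weight between pixels x and y (discrete sketch of (9)).

    f is a 2-D image; x and y are (row, col) tuples at least patch_radius from the border.
    """
    r = patch_radius
    t = np.arange(-r, r + 1)
    ti, tj = np.meshgrid(t, t, indexing='ij')
    Ga = np.exp(-(ti**2 + tj**2) / (2.0 * a**2))     # Gaussian patch kernel
    Ga /= Ga.sum()                                   # normalized here by assumption
    px = f[x[0]-r:x[0]+r+1, x[1]-r:x[1]+r+1]
    py = f[y[0]-r:y[0]+r+1, y[1]-r:y[1]+r+1]
    d2 = np.sum(Ga * (px - py)**2)                   # Gaussian-weighted patch distance
    return np.exp(-d2 / h**2)
```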

Generalization of the Median Formula
$\arg\min_x \sum_{i=1}^{n} w_i |x - u_i| + \lambda |x - f|^\alpha$
$= \mathrm{median}\Big\{ u_1, u_2, \ldots, u_n,$
$\quad f + |w_n + \cdots + w_1|^p \mu,$
$\quad f + \mathrm{sign}(w_n + \cdots + w_2 - w_1) \, |w_n + \cdots + w_2 - w_1|^p \mu,$
$\quad f + \mathrm{sign}(w_n + \cdots + w_3 - w_2 - w_1) \, |w_n + \cdots + w_3 - w_2 - w_1|^p \mu,$
$\quad \ldots,$
$\quad f + \mathrm{sign}(w_n - w_{n-1} - \cdots - w_1) \, |w_n - w_{n-1} - \cdots - w_1|^p \mu,$
$\quad f - |w_n + w_{n-1} + \cdots + w_1|^p \mu \Big\}, \qquad (11)$
where $p = \frac{1}{\alpha - 1}$ and $\mu = \big( \frac{1}{\alpha\lambda} \big)^p$. A computational saving is that these candidate points do not depend on the $u_i$, so with all else fixed they only need to be computed once.
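A sketch of formula (11) as a scalar solver, assuming α > 1 and positive weights; with n = 4, unit weights, and α = 2 it reproduces the nine-point median formula (8).

```python
import numpy as np

def weighted_median_minimizer(u, w, f, lam, alpha):
    """Minimize sum_i w_i*|x - u_i| + lam*|x - f|^alpha via formula (11); requires alpha > 1."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    p = 1.0 / (alpha - 1.0)
    mu = (1.0 / (alpha * lam)) ** p
    S = np.concatenate(([0.0], np.cumsum(w)))      # partial sums w_1 + ... + w_k, k = 0..n
    d = w.sum() - 2.0 * S                          # (w_{k+1}+...+w_n) - (w_1+...+w_k)
    candidates = f + np.sign(d) * np.abs(d) ** p * mu
    return np.median(np.concatenate([u, candidates]))   # median of 2n+1 values
```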

Numerical Results
(Figure: NL-ROF denoising with a 5×5 search window (SNR 10.4), a 7×7 window (SNR 10.8), and an 11×11 window (SNR 10.9).)
The runtimes are 14 s with the 5×5 search window, 20 s with the 7×7 window, and 34 s with the 11×11 window. (Parameters: λ = 5×10⁻³, a = 1.25, h = 10.2.)

Thanks for your attention!