GI01/M055 Supervised Learning Proximal Methods

1 GI01/M055 Supervised Learning: Proximal Methods
Massimiliano Pontil (based on notes by Luca Baldassarre)

2 Today's Plan
Problem setting
Convex analysis concepts
Proximal operators
O(1/T) algorithm
O(1/T²) algorithm
Empirical comparison

3 Problem setting
We are interested in the following optimization problem

    min_{w ∈ R^d} F(w) := f(w) + r(w).

We assume that r is convex and f is convex and differentiable.
Examples:
SQUARE LOSS: f(w) = (1/2)‖Aw − y‖²
LASSO: r(w) = λ‖w‖_1
TRACE NORM: r(w) = λ‖w‖_* (the sum of the singular values of w viewed as a matrix)
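
To make the setting concrete, here is a minimal numpy sketch of the LASSO instance of this problem. The data matrix A, the targets y and the value of λ are illustrative placeholders, not quantities fixed by the lecture.

```python
import numpy as np

# Illustrative sketch of the composite objective F = f + r for the LASSO example.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))   # placeholder data matrix
y = rng.standard_normal(40)          # placeholder targets
lam = 0.1                            # placeholder regularization parameter

def f(w):                            # smooth part: square loss
    return 0.5 * np.sum((A @ w - y) ** 2)

def grad_f(w):                       # gradient of the smooth part
    return A.T @ (A @ w - y)

def r(w):                            # non-smooth part: l1 penalty (LASSO)
    return lam * np.sum(np.abs(w))

def F(w):                            # composite objective
    return f(w) + r(w)
```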

4 Convex Analysis I
We assume that f has Lipschitz continuous gradient:

    ‖∇f(w) − ∇f(v)‖ ≤ L ‖w − v‖.

Lemma
The above assumption is equivalent to

    f(w) ≤ f(v) + ⟨∇f(v), w − v⟩ + (L/2)‖w − v‖².
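
For the square loss f(w) = (1/2)‖Aw − y‖², the gradient is ∇f(w) = Aᵀ(Aw − y) and the largest eigenvalue of AᵀA is a valid Lipschitz constant. A quick numerical check of both inequalities, reusing the names from the sketch above:

```python
# Continuing the sketch above: verify the Lipschitz bound and the quadratic
# upper bound of the Lemma at a few random pairs of points.
L = np.linalg.norm(A.T @ A, 2)       # spectral norm = largest eigenvalue of A^T A

for _ in range(5):
    w, v = rng.standard_normal(100), rng.standard_normal(100)
    assert np.linalg.norm(grad_f(w) - grad_f(v)) <= L * np.linalg.norm(w - v) + 1e-9
    quad = f(v) + grad_f(v) @ (w - v) + 0.5 * L * np.linalg.norm(w - v) ** 2
    assert f(w) <= quad + 1e-9
```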

5 Convex Analysis II
Define the linear approximation of F at v, w.r.t. f:

    F(w; v) := f(v) + ⟨∇f(v), w − v⟩ + r(w).

Lemma (Sandwich)

    F(w) − (L/2)‖w − v‖² ≤ F(w; v) ≤ F(w).

Proof. The left inequality follows from Lemma 1; the right inequality follows from the convexity of f: f(w) ≥ f(v) + ⟨∇f(v), w − v⟩.

Equivalent version:

    F(w) ≤ F(w; v) + (L/2)‖w − v‖² ≤ F(w) + (L/2)‖w − v‖².

6 Sandwich example

    F(w) = (1/2)(aw − 1)² + |w|
    F(w; v) = (1/2)(av − 1)² + a(av − 1)(w − v) + |w|

[Figure: F together with the quadratic upper bound F(·; v) + (L/2)(· − v)² around the point v.]
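
A quick numerical check of the sandwich inequality for this one-dimensional example; the values a = 2 and v = 0.5 are illustrative choices, and L = a² is the Lipschitz constant of the smooth part.

```python
import numpy as np

# One-dimensional sandwich check: F(w) - (L/2)(w - v)^2 <= F(w; v) <= F(w).
a, v = 2.0, 0.5
L = a ** 2                                        # f(w) = 0.5*(a*w - 1)^2 has f'' = a^2

def F1(w):
    return 0.5 * (a * w - 1) ** 2 + abs(w)

def F1_lin(w, v):
    return 0.5 * (a * v - 1) ** 2 + a * (a * v - 1) * (w - v) + abs(w)

for w in np.linspace(-2, 2, 41):
    assert F1(w) - 0.5 * L * (w - v) ** 2 <= F1_lin(w, v) + 1e-9
    assert F1_lin(w, v) <= F1(w) + 1e-9
```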

7 Subdifferential
If f : R^d → R is convex, its subdifferential at w is defined as

    ∂f(w) = { u : f(v) ≥ f(w) + ⟨u, v − w⟩, ∀ v ∈ R^d }.

∂f is a set-valued function; the elements of ∂f(w) are called the subgradients of f at w.
Intuition: u ∈ ∂f(w) if the affine function v ↦ f(w) + ⟨u, v − w⟩ is a global underestimator of f.

Theorem

    ŵ ∈ argmin_{w ∈ R^d} f(w)  ⟺  0 ∈ ∂f(ŵ).
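
As an added illustration of the optimality condition 0 ∈ ∂f(ŵ): for the one-dimensional function g(w) = (1/2)(w − v)² + λ|w|, the condition reads v − ŵ ∈ λ∂|ŵ|, and the point satisfying it is exactly the soft-thresholding of v that appears on the next slide. A brute-force numerical check, with illustrative values of v and λ:

```python
import numpy as np

# 0 in the subdifferential of g(w) = 0.5*(w - v)**2 + lam*|w| characterizes the
# minimizer; compare a grid search with the closed-form soft-thresholding point.
v, lam = 0.3, 0.5
grid = np.linspace(-2, 2, 40001)
w_grid = grid[np.argmin(0.5 * (grid - v) ** 2 + lam * np.abs(grid))]
w_hat = np.sign(v) * max(abs(v) - lam, 0.0)        # soft-thresholding of v
assert abs(w_grid - w_hat) < 1e-4

# optimality condition: v - w_hat must lie in lam * (subdifferential of |.| at w_hat)
if w_hat == 0.0:
    assert abs(v - w_hat) <= lam + 1e-12           # subdifferential of |.| at 0 is [-1, 1]
else:
    assert np.isclose(v - w_hat, lam * np.sign(w_hat))
```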

8 Proximal operator
The proximal operator of a convex function r : R^d → R is defined as

    prox_r(v) = argmin_{w ∈ R^d} (1/2)‖w − v‖² + r(w).

Convexity of r ensures that the minimizer always exists and is unique.
Examples:
LASSO: r(w) = λ‖w‖_1, prox_r(v) = H_λ(v), the component-wise soft-thresholding operator

    (H_λ(v))_i = sign(v_i)(|v_i| − λ)_+.

ℓ2 NORM: r(w) = λ‖w‖_2,

    prox_r(v) = (v / ‖v‖_2)(‖v‖_2 − λ)_+.

GROUP LASSO: r(w) = λ Σ_{l=1}^K ‖w_{J_l}‖_2,

    (prox_r(v))_{J_l} = (v_{J_l} / ‖v_{J_l}‖_2)(‖v_{J_l}‖_2 − λ)_+.
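
A minimal numpy sketch of these three proximal operators. Representing the groups J_l as a list of index arrays is an assumption about the data layout, not something fixed by the slides.

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding H_lam: prox of lam*||w||_1, applied component-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2(v, lam):
    """Prox of lam*||w||_2: shrink the whole vector towards the origin."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= lam else (v / norm) * (norm - lam)

def prox_group_lasso(v, lam, groups):
    """Prox of lam * sum_l ||w_{J_l}||_2, applied block-wise over `groups`."""
    w = np.empty_like(v)
    for J in groups:                     # each J is an array of indices
        w[J] = prox_l2(v[J], lam)
    return w
```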

9 O(1/T) algorithm
Algorithm
    w_0 ← 0
    for t = 0, 1, ..., T do
        w_{t+1} = argmin_w (L/2)‖w − w_t‖² + F(w; w_t)
    end for

Recall, F(w; w_t) := f(w_t) + ⟨∇f(w_t), w − w_t⟩ + r(w).
The term f(w_t) does not depend on w and can be discarded.
By completing the square, we also obtain

    (L/2)‖w − w_t‖² + ⟨∇f(w_t), w − w_t⟩ = (L/2)‖w − w_t + (1/L)∇f(w_t)‖² − (L/2)‖(1/L)∇f(w_t)‖²,

where the extra term (L/2)‖(1/L)∇f(w_t)‖² does not depend on w.

10 O(1/T) algorithm (cont'd)
Algorithm 1
    w_0 ← 0
    for t = 0, 1, ..., T do
        w_{t+1} = prox_{r/L}( w_t − (1/L)∇f(w_t) )
    end for

Remark. If r = 0, we recover Gradient Descent: w_{t+1} = w_t − (1/L)∇f(w_t).

Theorem (Convergence rate)
Let w* ∈ argmin_w F(w). Then, at iteration T, Algorithm 1 yields a solution w_T that satisfies

    F(w_T) − F(w*) ≤ L‖w* − w_0‖² / (2T).   (1)
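
A direct numpy sketch of Algorithm 1. The caller supplies the gradient of the smooth part, the proximal operator of r/L, the Lipschitz constant L and the starting point; the function and argument names are illustrative.

```python
import numpy as np

def proximal_gradient(grad_f, prox_rL, L, w0, T):
    """Algorithm 1: w_{t+1} = prox_{r/L}(w_t - (1/L) * grad_f(w_t))."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(T):
        w = prox_rL(w - grad_f(w) / L)
    return w

# LASSO instance (names as in the earlier sketches): prox_{r/L} is
# soft-thresholding at level lam/L.
# w_T = proximal_gradient(grad_f, lambda v: prox_l1(v, lam / L), L, np.zeros(100), T=500)
```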

11 O(1/T) algorithm - Examples
LASSO
    f(w) = (1/2)‖Aw − y‖²,  r(w) = λ‖w‖_1

    w_{t+1} = H_{λ/L}( w_t − (1/L)Aᵀ(Aw_t − y) )

GROUP LASSO
    f(w) = (1/2)‖Aw − y‖²,  r(w) = λ Σ_{l=1}^K ‖w_{J_l}‖_2

    v = w_t − (1/L)Aᵀ(Aw_t − y)
    (w_{t+1})_{J_l} = (v_{J_l} / ‖v_{J_l}‖_2)(‖v_{J_l}‖_2 − λ/L)_+
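
In code, one iteration of each update on this slide might read as follows; this reuses A, y, lam, L and the prox helpers from the earlier sketches, with w_t the current iterate and `groups` an illustrative block structure.

```python
import numpy as np

w_t = np.zeros(100)                                           # current iterate
groups = [np.arange(5 * l, 5 * (l + 1)) for l in range(20)]   # placeholder groups J_l

v = w_t - A.T @ (A @ w_t - y) / L                 # gradient step on the square loss
w_lasso = prox_l1(v, lam / L)                     # LASSO: H_{lam/L}(v)
w_group = prox_group_lasso(v, lam / L, groups)    # GROUP LASSO: block-wise shrinkage
```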

12 O(1/T) algorithm - convergence rate proof
Sandwich: F(w) − (L/2)‖w − v‖² ≤ F(w; v) ≤ F(w)

Lemma: 3-point property. If ŵ = argmin_{w ∈ R^d} (1/2)‖w − w_0‖² + φ(w), then, for any w ∈ R^d,

    φ(ŵ) + (1/2)‖ŵ − w_0‖² ≤ φ(w) + (1/2)‖w − w_0‖² − (1/2)‖w − ŵ‖².

Proof of the convergence rate.

    F(w_{t+1}) ≤ F(w_{t+1}; w_t) + (L/2)‖w_{t+1} − w_t‖²                      (Sandwich-left)
              ≤ F(w*; w_t) + (L/2)‖w* − w_t‖² − (L/2)‖w* − w_{t+1}‖²           (3-point with w = w*)
              ≤ F(w*) + (L/2)‖w* − w_t‖² − (L/2)‖w* − w_{t+1}‖²                (Sandwich-right)

13 O(1/T) algorithm - convergence rate proof (cont'd)
Let us now define ε_t := F(w_t) − F(w*), so that

    ε_{t+1} ≤ (L/2)‖w* − w_t‖² − (L/2)‖w* − w_{t+1}‖².

Lemma. The sequence ε_t, for t = 0, ..., T, is monotone non-increasing:

    F(w_{t+1}) ≤ F(w_{t+1}; w_t) + (L/2)‖w_{t+1} − w_t‖²            (Sandwich-left)
              ≤ F(w_t; w_t) + (L/2)‖w_t − w_t‖² = F(w_t)             (definition of w_{t+1})

Since ε_t is monotone non-increasing,

    T ε_T ≤ Σ_{t=0}^{T−1} ε_{t+1} ≤ (L/2)‖w* − w_0‖² − (L/2)‖w* − w_T‖² ≤ (L/2)‖w* − w_0‖²

    ⟹  ε_T = F(w_T) − F(w*) ≤ L‖w* − w_0‖² / (2T).

14 O(1/T²) algorithm
We want to accelerate Algorithm 1 by introducing some factors tending to zero.
We define w_{t+1} by taking the linear approximation at an auxiliary point v_t:

    w_{t+1} := argmin_w F(w; v_t) + (L/2)‖w − v_t‖².

We perform the same analysis as above, letting v* be a reference vector that will be chosen later:

    F(w_{t+1}) ≤ F(w_{t+1}; v_t) + (L/2)‖w_{t+1} − v_t‖²                      (Sandwich-left)
              ≤ F(v*; v_t) + (L/2)‖v* − v_t‖² − (L/2)‖v* − w_{t+1}‖²           (3-point with w = v*)
              ≤ F(v*) + (L/2)‖v* − v_t‖² − (L/2)‖v* − w_{t+1}‖²                (Sandwich-right)

By introducing yet another sequence {u_t}, we would like to obtain

    F(w_{t+1}) ≤ F(v*) + (Lθ_t²/2)‖w* − u_t‖² − (Lθ_t²/2)‖w* − u_{t+1}‖².   (WANT)

15 O(1/T²) algorithm - cont'd

    F(w_{t+1}) ≤ F(v*) + (L/2)‖v* − v_t‖² − (L/2)‖v* − w_{t+1}‖²
    F(w_{t+1}) ≤ F(v*) + (Lθ_t²/2)‖w* − u_t‖² − (Lθ_t²/2)‖w* − u_{t+1}‖².   (WANT)

In order for (WANT) to hold, we need

    v* − v_t = θ_t (w* − u_t),    v* − w_{t+1} = θ_t (w* − u_{t+1}).

To satisfy the second relation we can choose

    v* = α_t + θ_t w*,    u_{t+1} = (w_{t+1} − α_t) / θ_t.

In order to exploit the convexity of F, we can choose α_t = (1 − θ_t) w_t, so that v* becomes a convex combination of w* and the previous point w_t.

16 O(1/T²) algorithm - cont'd
In summary, we have

    v* = (1 − θ_t) w_t + θ_t w*
    u_{t+1} = (w_{t+1} − (1 − θ_t) w_t) / θ_t
    v_t = (1 − θ_t) w_t + θ_t u_t.

Accelerated Algorithm
    w_0, u_0 ← 0
    for t = 0, 1, ..., T do
        v_t ← (1 − θ_t) w_t + θ_t u_t
        w_{t+1} ← argmin_w F(w; v_t) + (L/2)‖w − v_t‖² = prox_{r/L}( v_t − (1/L)∇f(v_t) )
        u_{t+1} ← (w_{t+1} − (1 − θ_t) w_t) / θ_t
    end for
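
A numpy sketch of the accelerated algorithm. The θ_t sequence follows the recursion (Theta-Def) introduced on the next slide, solved at each step for θ_{t+1}; as before, grad_f, the prox of r/L and the Lipschitz constant L are supplied by the caller, and the names are illustrative.

```python
import numpy as np

def theta_next(theta):
    """Solve (1 - x)/x**2 = 1/theta**2 for x in (0, 1]: the (Theta-Def) recursion."""
    return (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2

def accelerated_proximal_gradient(grad_f, prox_rL, L, w0, T):
    """Algorithm 2: proximal gradient step taken at v_t = (1 - theta_t) w_t + theta_t u_t."""
    w = np.asarray(w0, dtype=float).copy()
    u = w.copy()
    theta = 1.0                                   # theta_0 = 1
    for _ in range(T):
        v = (1 - theta) * w + theta * u
        w_new = prox_rL(v - grad_f(v) / L)        # same proximal step as Algorithm 1, at v
        u = (w_new - (1 - theta) * w) / theta
        w = w_new
        theta = theta_next(theta)
    return w
```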

17 O(1/T²) algorithm - convergence rate
Let w* ∈ argmin_w F(w). Then, at iteration T, Algorithm 2 yields w_{T+1} that satisfies

    F(w_{T+1}) − F(w*) ≤ (L/2) θ_T² ‖w*‖².   (2)

Let us consider the sequence {θ_t} defined by

    θ_0 = 1,    (1 − θ_{t+1}) / θ_{t+1}² = 1 / θ_t².   (Theta-Def)

The sequence {θ_t} satisfies

    θ_t ≤ 2 / (t + 2).   (3)

Proof by induction: use the arithmetic-geometric inequality on (Theta-Def).
We then have the following convergence rate:

    F(w_{T+1}) − F(w*) ≤ (L/2) θ_T² ‖w*‖² ≤ (2L/T²) ‖w*‖².
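
A quick numerical sanity check of the bound (3), iterating (Theta-Def) directly (an added illustration):

```python
import numpy as np

theta = 1.0                                       # theta_0 = 1
for t in range(1000):
    assert theta <= 2.0 / (t + 2) + 1e-12         # bound (3): theta_t <= 2/(t+2)
    # (Theta-Def): (1 - theta_{t+1})/theta_{t+1}^2 = 1/theta_t^2, solved for theta_{t+1}
    theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2
```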

18 O(1/T²) algorithm - convergence rate proof

    F(w_{t+1}) ≤ F(v*) + (Lθ_t²/2)‖w* − u_t‖² − (Lθ_t²/2)‖w* − u_{t+1}‖².   (WANT)

Since v* = (1 − θ_t) w_t + θ_t w*, convexity of F gives F(v*) ≤ (1 − θ_t)F(w_t) + θ_t F(w*), hence

    F(w_{t+1}) ≤ (1 − θ_t)F(w_t) + θ_t F(w*) + (Lθ_t²/2)‖w* − u_t‖² − (Lθ_t²/2)‖w* − u_{t+1}‖².

Define ε_t := F(w_t) − F(w*) and Φ_t := (L/2)‖w* − u_t‖². Then

    ε_{t+1} ≤ (1 − θ_t) ε_t + θ_t² (Φ_t − Φ_{t+1})
    (1/θ_t²) ε_{t+1} ≤ ((1 − θ_t)/θ_t²) ε_t + Φ_t − Φ_{t+1}
    ((1 − θ_{t+1})/θ_{t+1}²) ε_{t+1} ≤ ((1 − θ_t)/θ_t²) ε_t + Φ_t − Φ_{t+1}    (using (Theta-Def))

Chaining this inequality from t = 0 to t = T (the Φ terms telescope) gives

    ((1 − θ_{T+1})/θ_{T+1}²) ε_{T+1} ≤ ((1 − θ_0)/θ_0²) ε_0 + Φ_0 − Φ_{T+1} ≤ Φ_0,

since θ_0 = 1. Using (Theta-Def) once more, (1 − θ_{T+1})/θ_{T+1}² = 1/θ_T², so

    (1/θ_T²) ε_{T+1} ≤ Φ_0 = (L/2)‖w* − u_0‖².

19 Simple numerical comparison
Solve LASSO with d = 100 variables and n = 40 examples.
The regression vector w* has 20 nonzero components with random ±1 entries.
x_ij ~ N(0, 1) and y = Xw* + ε, with ε_i ~ N(0, 0.01).

[Figure: objective value versus iteration for the O(1/T) and O(1/T²) algorithms.]
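
The experiment can be reproduced along the following lines (a sketch, not the original lecture script; the value of λ, the number of iterations and the interpretation of 0.01 as the noise variance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, lam, T = 100, 40, 20, 0.1, 300            # lam and T are illustrative choices

w_star = np.zeros(d)
w_star[rng.choice(d, k, replace=False)] = rng.choice([-1.0, 1.0], k)
X = rng.standard_normal((n, d))
y = X @ w_star + rng.normal(0.0, np.sqrt(0.01), n)

L = np.linalg.norm(X.T @ X, 2)                     # Lipschitz constant of the square loss
grad_f = lambda w: X.T @ (X @ w - y)
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
obj = lambda w: 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

# O(1/T) algorithm
w = np.zeros(d); obj_basic = []
for _ in range(T):
    w = prox(w - grad_f(w) / L)
    obj_basic.append(obj(w))

# O(1/T^2) algorithm
w = np.zeros(d); u = w.copy(); theta = 1.0; obj_acc = []
for _ in range(T):
    v = (1 - theta) * w + theta * u
    w_new = prox(v - grad_f(v) / L)
    u = (w_new - (1 - theta) * w) / theta
    w = w_new
    theta = (np.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2
    obj_acc.append(obj(w))

# Plot obj_basic and obj_acc against the iteration counter to reproduce the figure.
```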

20 Summary
Proximal algorithms are fast first-order methods.
The accelerated algorithm has an O(1/T²) convergence rate.
All you need is:
the gradient of the smooth part
the proximal operator of the non-smooth part
