# Introduction to Convex Optimization for Machine Learning


John Duchi, University of California, Berkeley
Practical Machine Learning, Fall 2009
Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009, 53 slides

## Outline

- What is Optimization
- Convex Sets
- Convex Functions
- Convex Optimization Problems
- Lagrange Duality
- Optimization Algorithms
- Take Home Messages

## What is Optimization (and why do we care?)

Optimization means finding the minimizer of a function subject to constraints:

$$
\begin{aligned}
\underset{x}{\text{minimize}} \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \leq 0, \quad i \in \{1, \ldots, k\} \\
& h_j(x) = 0, \quad j \in \{1, \ldots, l\}
\end{aligned}
$$

Example: the stock market. Minimize the variance of the return subject to getting a return of at least \$50.

Why do we care? Optimization is at the heart of many (most practical?) machine learning algorithms.

Linear regression:
$$\underset{w}{\text{minimize}} \;\; \|Xw - y\|^2$$

Classification (logistic regression or SVM):
$$\underset{w}{\text{minimize}} \;\; \sum_{i=1}^n \log\left(1 + \exp(-y_i x_i^T w)\right)$$
or
$$\underset{w, \xi}{\text{minimize}} \;\; \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \;\; \xi_i \geq 1 - y_i x_i^T w, \;\; \xi_i \geq 0.$$

We still care:

Maximum likelihood estimation:
$$\underset{\theta}{\text{maximize}} \;\; \sum_{i=1}^n \log p_\theta(x_i)$$

Collaborative filtering:
$$\underset{w}{\text{minimize}} \;\; \sum_{i \prec j} \log\left(1 + \exp(w^T x_i - w^T x_j)\right)$$

k-means:
$$\underset{\mu_1, \ldots, \mu_k}{\text{minimize}} \;\; J(\mu) = \sum_{j=1}^k \sum_{i \in C_j} \|x_i - \mu_j\|^2$$

And more: graphical models, feature selection, active learning, control.

But generally speaking, we're screwed:

- Local (non-global) minima of $f_0$
- All kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$

So: go for convex problems!

## Convex Sets

Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$$\alpha x + (1 - \alpha) y \in C.$$

Examples:

- All of $\mathbb{R}^n$ (obvious).
- The non-negative orthant $\mathbb{R}^n_+$: let $x \geq 0$, $y \geq 0$; clearly $\alpha x + (1 - \alpha) y \geq 0$.
- Norm balls: let $\|x\| \leq 1$, $\|y\| \leq 1$; then
$$\|\alpha x + (1 - \alpha) y\| \leq \|\alpha x\| + \|(1 - \alpha) y\| = \alpha \|x\| + (1 - \alpha) \|y\| \leq 1.$$

- Affine subspaces: if $Ax = b$ and $Ay = b$, then
$$A(\alpha x + (1 - \alpha) y) = \alpha A x + (1 - \alpha) A y = \alpha b + (1 - \alpha) b = b.$$

- Arbitrary intersections of convex sets: let $C_i$ be convex for $i \in I$ and $C = \bigcap_i C_i$. Then for $x \in C$, $y \in C$,
$$\alpha x + (1 - \alpha) y \in C_i \;\; \text{for all } i \in I,$$
so $\alpha x + (1 - \alpha) y \in C$.

- PSD matrices, a.k.a. the positive semidefinite cone $S^n_+ \subseteq \mathbb{R}^{n \times n}$. $A \in S^n_+$ means $x^T A x \geq 0$ for all $x \in \mathbb{R}^n$. For $A, B \in S^n_+$,
$$x^T \left( \alpha A + (1 - \alpha) B \right) x = \alpha x^T A x + (1 - \alpha) x^T B x \geq 0.$$
For example,
$$S^2_+ = \left\{ \begin{bmatrix} x & z \\ z & y \end{bmatrix} : x \geq 0, \; y \geq 0, \; xy \geq z^2 \right\}.$$

## Convex Functions

Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for $x, y \in \operatorname{dom} f$ and any $\alpha \in [0, 1]$,
$$f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1 - \alpha) f(y).$$
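The defining inequality can be checked numerically on sample points. A minimal sketch (not from the slides), using the hypothetical test functions $f(x) = x^2$ (convex) and $f(x) = x^3$ (not convex on an interval containing negatives):

```python
import random

# Sample the convexity inequality
#   f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)
# at random points; a single violation certifies non-convexity.
def is_convex_on_samples(f, trials=1000, lo=-10.0, hi=10.0, tol=1e-9):
    random.seed(0)
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        a = random.random()
        if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + tol:
            return False
    return True

print(is_convex_on_samples(lambda x: x**2))  # quadratic: no violation found
print(is_convex_on_samples(lambda x: x**3))  # cubic: violated for negative x
```

Passing the sampled test is of course only evidence, not a proof, of convexity; failing it is a proof of non-convexity.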

First-order convexity conditions. Theorem: suppose $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable. Then $f$ is convex if and only if for all $x, y \in \operatorname{dom} f$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x).$$

Actually, this is more general than differentiability. Definition: the subgradient set, or subdifferential set, $\partial f(x)$ of $f$ at $x$ is
$$\partial f(x) = \left\{ g : f(y) \geq f(x) + g^T (y - x) \text{ for all } y \right\}.$$
Theorem: $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if it has a non-empty subdifferential set everywhere.

Second-order convexity conditions. Theorem: suppose $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable. Then $f$ is convex if and only if for all $x \in \operatorname{dom} f$,
$$\nabla^2 f(x) \succeq 0.$$

Convex sets and convex functions. Definition: the epigraph of a function $f$ is the set of points
$$\operatorname{epi} f = \{(x, t) : f(x) \leq t\}.$$
$\operatorname{epi} f$ is convex if and only if $f$ is convex. Sublevel sets $\{x : f(x) \leq a\}$ are convex for convex $f$.

Examples:

- Linear/affine functions: $f(x) = b^T x + c$.
- Quadratic functions: $f(x) = \frac{1}{2} x^T A x + b^T x + c$ for $A \succeq 0$. For regression:
$$\frac{1}{2} \|Xw - y\|^2 = \frac{1}{2} w^T X^T X w - y^T X w + \frac{1}{2} y^T y.$$

More examples:

- Norms (like $\ell_1$ or $\ell_2$ for regularization):
$$\|\alpha x + (1 - \alpha) y\| \leq \|\alpha x\| + \|(1 - \alpha) y\| = \alpha \|x\| + (1 - \alpha) \|y\|.$$
- Composition with an affine function, $f(Ax + b)$:
$$f(A(\alpha x + (1 - \alpha) y) + b) = f(\alpha (Ax + b) + (1 - \alpha)(Ay + b)) \leq \alpha f(Ax + b) + (1 - \alpha) f(Ay + b)$$
- Log-sum-exp (via $\nabla^2 f(x)$ PSD):
$$f(x) = \log\left( \sum_{i=1}^n \exp(x_i) \right)$$

Important examples in machine learning:

- SVM (hinge) loss: $f(w) = \left[ 1 - y_i x_i^T w \right]_+$
- Binary logistic loss: $f(w) = \log\left( 1 + \exp(-y_i x_i^T w) \right)$

(Figure: the scalar losses $[1 - x]_+$ and $\log(1 + e^{-x})$.)

## Convex Optimization Problems

Definition. An optimization problem is convex if its objective $f_0$ is a convex function, the inequality constraint functions $f_i$ are convex, and the equality constraint functions $h_j$ are affine:

$$
\begin{aligned}
\underset{x}{\text{minimize}} \quad & f_0(x) && \text{(convex function)} \\
\text{s.t.} \quad & f_i(x) \leq 0 && \text{(convex sets)} \\
& h_j(x) = 0 && \text{(affine)}
\end{aligned}
$$

It's nice to be convex. Theorem: if $\hat{x}$ is a local minimizer of a convex optimization problem, it is a global minimizer.

Even more reasons to be convex. Theorem: $\nabla f(x) = 0$ if and only if $x$ is a global minimizer of $f$.

Proof sketch: if $\nabla f(x) = 0$, then for any $y$, $f(y) \geq f(x) + \nabla f(x)^T (y - x) = f(x)$. If $\nabla f(x) \neq 0$, there is a direction of descent, so $x$ is not a minimizer.


## Lagrange Duality

Goals of Lagrange duality:

- Get a certificate of optimality for a problem
- Remove constraints
- Reformulate the problem

Constructing the dual. Start with the optimization problem:

$$
\begin{aligned}
\underset{x}{\text{minimize}} \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \leq 0, \quad i \in \{1, \ldots, k\} \\
& h_j(x) = 0, \quad j \in \{1, \ldots, l\}
\end{aligned}
$$

Form the Lagrangian using Lagrange multipliers $\lambda_i \geq 0$, $\nu_j \in \mathbb{R}$:
$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x)$$

Form the dual function:
$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left[ f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x) \right]$$

Remarks. The original problem is equivalent to
$$\underset{x}{\text{minimize}} \;\; \sup_{\lambda \geq 0, \nu} L(x, \lambda, \nu).$$
The dual problem switches the min and the max:
$$\underset{\lambda \geq 0, \nu}{\text{maximize}} \;\; \inf_x L(x, \lambda, \nu).$$

One great property of the dual. Lemma (weak duality): if $\lambda \geq 0$, then $g(\lambda, \nu) \leq f_0(x^\star)$.

Proof: we have
$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) \leq L(x^\star, \lambda, \nu) = f_0(x^\star) + \sum_{i=1}^k \lambda_i f_i(x^\star) + \sum_{j=1}^l \nu_j h_j(x^\star) \leq f_0(x^\star).$$

The greatest property of the dual. Theorem: for reasonable¹ convex problems,
$$\sup_{\lambda \geq 0, \nu} g(\lambda, \nu) = f_0(x^\star).$$

¹ There are conditions, called constraint qualifications, under which this holds.

Geometric look: minimize $\frac{1}{2}(x - 1)^2$ subject to $x^2 \leq c$. (Figures: the true function in blue, the constraint in green, and $L(x, \lambda)$ for different $\lambda$ dotted; the dual function $g(\lambda)$ in black, with the primal optimal value as a dotted blue line.)

Intuition. We can interpret duality as a linear approximation. Define $I_-(a) = \infty$ if $a > 0$ and $0$ otherwise, and $I_0(a) = \infty$ unless $a = 0$. Rewrite the problem as
$$\underset{x}{\text{minimize}} \;\; f_0(x) + \sum_{i=1}^k I_-(f_i(x)) + \sum_{j=1}^l I_0(h_j(x))$$

Replace $I_-(f_i(x))$ with $\lambda_i f_i(x)$: a measure of displeasure when $\lambda_i \geq 0$ and $f_i(x) > 0$. Similarly, $\nu_j h_j(x)$ lower bounds $I_0(h_j(x))$:
$$\underset{x}{\text{minimize}} \;\; f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x)$$

Example: linearly constrained least squares.
$$\underset{x}{\text{minimize}} \;\; \frac{1}{2} \|Ax - b\|^2 \quad \text{s.t.} \;\; Bx = d.$$

Form the Lagrangian:
$$L(x, \nu) = \frac{1}{2} \|Ax - b\|^2 + \nu^T (Bx - d)$$

Take the infimum:
$$\nabla_x L(x, \nu) = A^T A x - A^T b + B^T \nu, \qquad x^\star = (A^T A)^{-1} (A^T b - B^T \nu)$$

$$\inf_x L(x, \nu) = \frac{1}{2} \left\| A (A^T A)^{-1} (A^T b - B^T \nu) - b \right\|^2 + \nu^T B (A^T A)^{-1} (A^T b - B^T \nu) - d^T \nu$$

A simple unconstrained quadratic problem!
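This derivation can be checked numerically: the minimizer of the Lagrangian is $x^\star(\nu) = (A^T A)^{-1}(A^T b - B^T \nu)$, and the optimal $\nu$ is the one making $x^\star(\nu)$ primal feasible. A sketch with hypothetical random data (not from the slides):

```python
import numpy as np

# minimize (1/2)||Ax - b||^2  s.t.  Bx = d, solved via the dual:
# enforce B x*(nu) = d where x*(nu) = (A^T A)^{-1}(A^T b - B^T nu).
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 4))
b = rng.normal(size=10)
B = rng.normal(size=(2, 4))
d = rng.normal(size=2)

AtA_inv = np.linalg.inv(A.T @ A)
# Feasibility B x*(nu) = d is a linear system in nu:
M = B @ AtA_inv @ B.T
nu = np.linalg.solve(M, B @ AtA_inv @ (A.T @ b) - d)
x = AtA_inv @ (A.T @ b - B.T @ nu)

print(np.allclose(B @ x, d))  # primal feasibility holds
```

Feasibility of $x^\star(\nu)$ plus stationarity of the Lagrangian gives the KKT conditions, so this $x$ solves the constrained problem.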

Example: quadratically constrained least squares.
$$\underset{x}{\text{minimize}} \;\; \frac{1}{2} \|Ax - b\|^2 \quad \text{s.t.} \;\; \frac{1}{2} \|x\|^2 \leq c.$$

Form the Lagrangian ($\lambda \geq 0$):
$$L(x, \lambda) = \frac{1}{2} \|Ax - b\|^2 + \frac{\lambda}{2} \left( \|x\|^2 - 2c \right)$$

Take the infimum:
$$\nabla_x L(x, \lambda) = A^T A x - A^T b + \lambda x, \qquad x^\star = (A^T A + \lambda I)^{-1} A^T b$$

$$\inf_x L(x, \lambda) = \frac{1}{2} \left\| A (A^T A + \lambda I)^{-1} A^T b - b \right\|^2 + \frac{\lambda}{2} \left\| (A^T A + \lambda I)^{-1} A^T b \right\|^2 - \lambda c$$

A one-variable dual problem!
$$g(\lambda) = -\frac{1}{2} b^T A (A^T A + \lambda I)^{-1} A^T b - \lambda c + \frac{1}{2} \|b\|^2.$$
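The one-variable dual is easy to evaluate, and weak duality is easy to observe: $g(\lambda)$ lower bounds $f_0$ at any feasible point. A sketch with hypothetical random data (not from the slides):

```python
import numpy as np

# g(lam) = -(1/2) b^T A (A^T A + lam I)^{-1} A^T b - lam*c + (1/2)||b||^2
# for: minimize (1/2)||Ax - b||^2  s.t.  (1/2)||x||^2 <= c.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
c = 0.5

def g(lam):
    M = np.linalg.inv(A.T @ A + lam * np.eye(3))
    return -0.5 * b @ A @ M @ A.T @ b - lam * c + 0.5 * b @ b

def f0(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

# Weak duality check: x = 0 is feasible ((1/2)||0||^2 <= c),
# so g(lam) <= f0(0) for every lam >= 0.
x_feas = np.zeros(3)
print(all(g(lam) <= f0(x_feas) + 1e-9 for lam in [0.0, 0.1, 1.0, 10.0]))
```

Maximizing $g$ over the single scalar $\lambda \geq 0$ (e.g. by bisection on $g'(\lambda)$) then recovers the constrained solution via $x^\star = (A^T A + \lambda^\star I)^{-1} A^T b$.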

Uses of the dual. The main use is as a certificate of optimality (a.k.a. the duality gap). If we have a feasible $x$ and know the dual value $g(\lambda, \nu)$, then
$$g(\lambda, \nu) \leq f_0(x^\star) \leq f_0(x) \quad \Rightarrow \quad f_0(x) - f_0(x^\star) \leq f_0(x) - g(\lambda, \nu).$$
The dual is also used in more advanced primal-dual algorithms (we won't talk about these).

## Optimization Algorithms

### Gradient Methods

Gradient descent: the simplest algorithm in the world (almost). Goal:
$$\underset{x}{\text{minimize}} \;\; f(x)$$
Just iterate
$$x_{t+1} = x_t - \eta_t \nabla f(x_t)$$
where $\eta_t$ is a stepsize.
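The iteration above is a few lines of code. A minimal sketch on the hypothetical quadratic $f(x) = \frac{1}{2}\|x - \text{target}\|^2$, whose gradient is $x - \text{target}$ (this example is not from the slides):

```python
# Plain gradient descent with a constant stepsize eta.
def gradient_descent(grad, x0, eta=0.1, steps=200):
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        # x_{t+1} = x_t - eta * grad f(x_t)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

target = [3.0, -1.0]
x_min = gradient_descent(lambda x: [xi - ti for xi, ti in zip(x, target)],
                         x0=[0.0, 0.0])
print([round(v, 4) for v in x_min])  # → [3.0, -1.0]
```

On this strongly convex quadratic the error contracts by a factor $1 - \eta$ per step, an instance of the linear convergence discussed later.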

(Single-step illustration: at the point $(x_t, f(x_t))$, the step follows the linear approximation $f(x_t) - \eta \nabla f(x_t)^T (x - x_t)$.)

Full gradient descent on
$$f(x) = \log\left( e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1} \right)$$

Stepsize selection: how do I choose a stepsize?

Idea 1: exact line search,
$$\eta_t = \operatorname*{argmin}_{\eta} f\left( x - \eta \nabla f(x) \right).$$
Too expensive to be practical.

Idea 2: backtracking (Armijo) line search. Let $\alpha \in (0, \frac{1}{2})$, $\beta \in (0, 1)$. Multiply $\eta \leftarrow \beta \eta$ until
$$f\left( x - \eta \nabla f(x) \right) \leq f(x) - \alpha \eta \|\nabla f(x)\|^2.$$
Works well in practice.
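The backtracking loop above can be sketched in one dimension, where $\|\nabla f(x)\|^2 = f'(x)^2$. A hypothetical example (not from the slides) on $f(x) = x^4$ with $f'(x) = 4x^3$:

```python
# One backtracking (Armijo) gradient step in 1-D.
def backtracking_step(f, grad, x, eta0=1.0, alpha=0.3, beta=0.5):
    g = grad(x)
    eta = eta0
    # Shrink eta until the sufficient-decrease condition holds:
    # f(x - eta*g) <= f(x) - alpha * eta * g^2
    while f(x - eta * g) > f(x) - alpha * eta * g * g:
        eta *= beta
    return x - eta * g, eta

f = lambda x: x**4
x_new, eta = backtracking_step(f, lambda x: 4 * x**3, x=1.0)
print(x_new, eta)  # → 0.5 0.125
```

Starting from $\eta_0 = 1$, the loop rejects $\eta \in \{1, 0.5, 0.25\}$ (each overshoots the sufficient-decrease line) and accepts $\eta = 0.125$, which is the behavior the illustration on the next slide depicts.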

(Illustration of Armijo/backtracking line search: plotting $f(x - \eta \nabla f(x))$ and the line $f(x) - \alpha \eta \|\nabla f(x)\|^2$ as functions of the stepsize $\eta$, there is clearly a region where the curve lies below the line.)

Newton's method. Idea: use a second-order approximation to the function:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$$
Choose $\Delta x$ to minimize the approximation:
$$\Delta x = -\left[ \nabla^2 f(x) \right]^{-1} \nabla f(x)$$
This is a descent direction:
$$\nabla f(x)^T \Delta x = -\nabla f(x)^T \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x) < 0.$$
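In one dimension the Newton step reduces to $\Delta x = -f'(x)/f''(x)$. A sketch on the hypothetical strongly convex function $f(x) = x^2 + e^x$ (not from the slides):

```python
import math

# 1-D Newton's method for minimization: x <- x - f'(x)/f''(x).
def newton_minimize(grad, hess, x0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x

# f(x) = x^2 + exp(x): f'(x) = 2x + exp(x), f''(x) = 2 + exp(x) > 0.
x_star = newton_minimize(lambda x: 2 * x + math.exp(x),
                         lambda x: 2 + math.exp(x), x0=0.0)
print(abs(2 * x_star + math.exp(x_star)) < 1e-10)  # stationarity reached
```

A handful of iterations drive the gradient to machine precision, reflecting Newton's local quadratic convergence.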

(Newton step picture: $\hat{f}$ is the second-order approximation at $(x, f(x))$, $f$ is the true function; the step moves to $(x + \Delta x, f(x + \Delta x))$, the minimizer of $\hat{f}$.)

Convergence of gradient descent and Newton's method:

- Strongly convex case ($\nabla^2 f(x) \succeq mI$): linear convergence. For some $\gamma \in (0, 1)$,
$$f(x_t) - f(x^\star) \leq \gamma^t, \qquad \text{so} \qquad t \geq \frac{1}{1 - \gamma} \log \frac{1}{\varepsilon} \;\Rightarrow\; f(x_t) - f(x^\star) \leq \varepsilon.$$
- Smooth case ($\|\nabla f(x) - \nabla f(y)\| \leq C \|x - y\|$): sublinear convergence,
$$f(x_t) - f(x^\star) \leq \frac{K}{t}.$$
- Newton's method is often faster, especially when $f$ has long valleys.

What about constraints? Linear constraints $Ax = b$ are easy. For example, in Newton's method (assuming $Ax = b$ already holds):
$$\underset{\Delta x}{\text{minimize}} \;\; \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x \quad \text{s.t.} \;\; A \Delta x = 0.$$
The solution $\Delta x$ satisfies $A(x + \Delta x) = Ax + A \Delta x = b$.

Inequality constraints are a bit tougher:
$$\underset{\Delta x}{\text{minimize}} \;\; \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x \quad \text{s.t.} \;\; f_i(x + \Delta x) \leq 0$$
is just as hard as the original problem.

Logarithmic barrier methods. Goal: convert
$$\underset{x}{\text{minimize}} \;\; f_0(x) \quad \text{s.t.} \;\; f_i(x) \leq 0, \;\; i \in \{1, \ldots, k\}$$
to
$$\underset{x}{\text{minimize}} \;\; f_0(x) + \sum_{i=1}^k I_-(f_i(x))$$
Approximate $I_-(u) \approx -t \log(-u)$ for small $t$:
$$\underset{x}{\text{minimize}} \;\; f_0(x) - t \sum_{i=1}^k \log\left( -f_i(x) \right)$$
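A tiny sketch of this idea on a hypothetical 1-D problem (not from the slides): minimize $f_0(x) = x$ subject to $x \geq 1$, i.e. $f_1(x) = 1 - x \leq 0$. The barrier objective is $\phi_t(x) = x - t \log(x - 1)$, whose exact minimizer is $x = 1 + t$, approaching the true solution $x^\star = 1$ as $t \to 0$:

```python
# Minimize the barrier objective phi_t(x) = x - t*log(x - 1)
# by gradient descent, staying strictly inside the feasible region.
def minimize_barrier(t, x=2.0, eta=0.01, steps=5000):
    for _ in range(steps):
        g = 1.0 - t / (x - 1.0)   # d/dx [x - t*log(x - 1)]
        x = x - eta * g
        x = max(x, 1.0 + 1e-12)   # keep x strictly feasible
    return x

for t in [1.0, 0.1, 0.01]:
    print(round(minimize_barrier(t), 3))  # 2.0, then 1.1, then 1.01
```

In practice one solves a sequence of such problems with decreasing $t$, warm-starting each from the previous solution (the path of minimizers is the central path).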

(The barrier function: $I_-(u)$ is the dotted line; the other curves are $-t \log(-u)$ for various $t$.)

(Illustration: minimizing $c^T x$ subject to $Ax \leq b$, with barrier solutions shown for $t = 1$ and $t = 5$.)

### Subgradient Methods

Subgradient descent: really, the simplest algorithm in the world. Goal:
$$\underset{x}{\text{minimize}} \;\; f(x)$$
Just iterate
$$x_{t+1} = x_t - \eta_t g_t$$
where $\eta_t$ is a stepsize and $g_t \in \partial f(x_t)$.
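The iteration looks just like gradient descent, but with any subgradient in place of the gradient. A sketch on the hypothetical non-differentiable function $f(x) = |x - 3|$ (not from the slides), whose subgradient is $\operatorname{sign}(x - 3)$ away from the kink:

```python
# Subgradient descent on f(x) = |x - 3| with stepsize eta_t = 1/sqrt(t).
# Since subgradient methods are not descent methods, we track the
# best iterate seen so far.
def subgradient_descent(x0, steps=10000):
    x = x0
    best = x0
    for t in range(1, steps + 1):
        g = 0.0 if x == 3.0 else (1.0 if x > 3.0 else -1.0)
        x = x - (1.0 / t**0.5) * g
        if abs(x - 3.0) < abs(best - 3.0):
            best = x
    return best

print(abs(subgradient_descent(0.0) - 3.0) < 0.05)
```

The iterates overshoot and oscillate around the minimizer with amplitude on the order of the current stepsize, which is why tracking the best (or averaged) iterate matters in the analysis below.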

Why subgradient descent? Lots of non-differentiable convex functions are used in machine learning:
$$f(x) = \left[ 1 - a^T x \right]_+, \qquad f(x) = \|x\|_1, \qquad f(x) = \sum_{r=1}^k \sigma_r(X),$$
where $\sigma_r(X)$ is the $r$th singular value of $X$. Subgradient descent is easy to analyze, and we do not even need a true subgradient: it is enough that $\mathbb{E}\, g_t \in \partial f(x_t)$.

Proof of convergence for subgradient descent. Idea: bound $\|x_{t+1} - x^\star\|$ using the subgradient inequality. Assume $\|g_t\| \leq G$. Then
$$\|x_{t+1} - x^\star\|^2 = \|x_t - \eta g_t - x^\star\|^2 = \|x_t - x^\star\|^2 - 2 \eta g_t^T (x_t - x^\star) + \eta^2 \|g_t\|^2$$
Recall that
$$f(x^\star) \geq f(x_t) + g_t^T (x^\star - x_t), \quad \text{i.e.} \quad -g_t^T (x_t - x^\star) \leq f(x^\star) - f(x_t),$$
so
$$\|x_{t+1} - x^\star\|^2 \leq \|x_t - x^\star\|^2 + 2 \eta \left[ f(x^\star) - f(x_t) \right] + \eta^2 G^2.$$
Rearranging,
$$f(x_t) - f(x^\star) \leq \frac{\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2}{2\eta} + \frac{\eta}{2} G^2.$$

Almost done: sum from $t = 1$ to $T$:
$$\sum_{t=1}^T \left[ f(x_t) - f(x^\star) \right] \leq \frac{1}{2\eta} \sum_{t=1}^T \left[ \|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2 \right] + \frac{T\eta}{2} G^2 = \frac{1}{2\eta} \|x_1 - x^\star\|^2 - \frac{1}{2\eta} \|x_{T+1} - x^\star\|^2 + \frac{T\eta}{2} G^2$$
Now let $D = \|x_1 - x^\star\|$, and keep track of the minimum along the run:
$$f(x_{\text{best}}) - f(x^\star) \leq \frac{D^2}{2 \eta T} + \frac{\eta}{2} G^2.$$
Setting $\eta = \frac{D}{G \sqrt{T}}$ gives
$$f(x_{\text{best}}) - f(x^\star) \leq \frac{DG}{\sqrt{T}}.$$

Extension: projected subgradient descent. Now we have a convex constraint set $X$. Goal:
$$\underset{x \in X}{\text{minimize}} \;\; f(x)$$
Idea: do subgradient steps, projecting $x_t$ back onto $X$ at every iteration:
$$x_{t+1} = \Pi_X (x_t - \eta g_t)$$
The proof still goes through because projection onto a convex set moves us no farther from any point of the set: $\|x^\star - \Pi_X(x_t)\| \leq \|x^\star - x_t\|$ for $x^\star \in X$.
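When $X$ is simple, the projection is cheap. A sketch on a hypothetical problem (not from the slides) where $X = [-1, 1]$ and the projection is just clipping: minimize $|x - 3|$ subject to $x \in [-1, 1]$, whose solution is $x^\star = 1$:

```python
# Projected subgradient descent: step, then project back onto X.
def project(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))  # projection onto the interval [lo, hi]

def projected_subgradient(x0, steps=2000):
    x = x0
    for t in range(1, steps + 1):
        g = 1.0 if x > 3.0 else -1.0          # subgradient of |x - 3|
        x = project(x - (1.0 / t**0.5) * g)   # x_{t+1} = Pi_X(x_t - eta*g)
    return x

print(projected_subgradient(-1.0))  # → 1.0
```

The slides' own example projects onto the $\ell_1$ ball $\{\|x\|_1 \leq 1\}$, which requires a slightly more involved (sorting-based) projection, but the algorithm is otherwise identical.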

Projected subgradient example:
$$\underset{x}{\text{minimize}} \;\; \frac{1}{2} \|Ax - b\|^2 \quad \text{s.t.} \;\; \|x\|_1 \leq 1$$
(The slides step through successive iterations of the method on this problem.)

Convergence results for (projected) subgradient methods. Any decreasing, non-summable stepsize sequence ($\eta_t \to 0$, $\sum_{t=1}^\infty \eta_t = \infty$) gives
$$f\left( x_{\text{avg}(t)} \right) - f(x^\star) \to 0.$$
A slightly less brain-dead analysis than the earlier one shows that with $\eta_t \propto 1/\sqrt{t}$,
$$f\left( x_{\text{avg}(t)} \right) - f(x^\star) \leq \frac{C}{\sqrt{t}}.$$
The same convergence holds when $g_t$ is random, i.e. $\mathbb{E}\, g_t \in \partial f(x_t)$. Example:
$$f(w) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \left[ 1 - y_i x_i^T w \right]_+$$
Just pick a random training example at each step.

## Take Home Messages

Recap:

- Defined convex sets and functions
- Saw why we want optimization problems to be convex (solvable)
- Sketched some of Lagrange duality
- First-order methods are easy and (often) work well

- Many useful problems can be formulated as convex optimization problems
- If it is not convex and not an eigenvalue problem, you are out of luck
- If it is convex, you are golden


### 3. Convex functions. basic properties and examples. operations that preserve convexity. the conjugate function. quasiconvex functions

3. Convex functions Convex Optimization Boyd & Vandenberghe basic properties and examples operations that preserve convexity the conjugate function quasiconvex functions log-concave and log-convex functions

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### By W.E. Diewert. July, Linear programming problems are important for a number of reasons:

APPLIED ECONOMICS By W.E. Diewert. July, 3. Chapter : Linear Programming. Introduction The theory of linear programming provides a good introduction to the study of constrained maximization (and minimization)

### Lecture 2: August 29. Linear Programming (part I)

10-725: Convex Optimization Fall 2013 Lecture 2: August 29 Lecturer: Barnabás Póczos Scribes: Samrachana Adhikari, Mattia Ciollaro, Fabrizio Lecci Note: LaTeX template courtesy of UC Berkeley EECS dept.

### ME128 Computer-Aided Mechanical Design Course Notes Introduction to Design Optimization

ME128 Computer-ided Mechanical Design Course Notes Introduction to Design Optimization 2. OPTIMIZTION Design optimization is rooted as a basic problem for design engineers. It is, of course, a rare situation

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Some representability and duality results for convex mixed-integer programs.

Some representability and duality results for convex mixed-integer programs. Santanu S. Dey Joint work with Diego Morán and Juan Pablo Vielma December 17, 2012. Introduction About Motivation Mixed integer

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Derivative Free Optimization

Department of Mathematics Derivative Free Optimization M.J.D. Powell LiTH-MAT-R--2014/02--SE Department of Mathematics Linköping University S-581 83 Linköping, Sweden. Three lectures 1 on Derivative Free

### 1 Polyhedra and Linear Programming

CS 598CSC: Combinatorial Optimization Lecture date: January 21, 2009 Instructor: Chandra Chekuri Scribe: Sungjin Im 1 Polyhedra and Linear Programming In this lecture, we will cover some basic material

### Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

### 1 Sufficient statistics

1 Sufficient statistics A statistic is a function T = rx 1, X 2,, X n of the random sample X 1, X 2,, X n. Examples are X n = 1 n s 2 = = X i, 1 n 1 the sample mean X i X n 2, the sample variance T 1 =

### Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming

### Recovery of primal solutions from dual subgradient methods for mixed binary linear programming; a branch-and-bound approach

MASTER S THESIS Recovery of primal solutions from dual subgradient methods for mixed binary linear programming; a branch-and-bound approach PAULINE ALDENVIK MIRJAM SCHIERSCHER Department of Mathematical

### Diagonalisation. Chapter 3. Introduction. Eigenvalues and eigenvectors. Reading. Definitions

Chapter 3 Diagonalisation Eigenvalues and eigenvectors, diagonalisation of a matrix, orthogonal diagonalisation fo symmetric matrices Reading As in the previous chapter, there is no specific essential

### Definition of a Linear Program

Definition of a Linear Program Definition: A function f(x 1, x,..., x n ) of x 1, x,..., x n is a linear function if and only if for some set of constants c 1, c,..., c n, f(x 1, x,..., x n ) = c 1 x 1

### Dual Methods for Total Variation-Based Image Restoration

Dual Methods for Total Variation-Based Image Restoration Jamylle Carter Institute for Mathematics and its Applications University of Minnesota, Twin Cities Ph.D. (Mathematics), UCLA, 2001 Advisor: Tony

### Notes on Symmetric Matrices

CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

Elementary Gradient-Based Parameter Estimation Julius O. Smith III Center for Computer Research in Music and Acoustics (CCRMA Department of Music, Stanford University, Stanford, California 94305 USA Abstract

The Hadamard Product Elizabeth Million April 12, 2007 1 Introduction and Basic Results As inexperienced mathematicians we may have once thought that the natural definition for matrix multiplication would

### Lecture 7 - Linear Programming

COMPSCI 530: Design and Analysis of Algorithms DATE: 09/17/2013 Lecturer: Debmalya Panigrahi Lecture 7 - Linear Programming Scribe: Hieu Bui 1 Overview In this lecture, we cover basics of linear programming,

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA

SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA This handout presents the second derivative test for a local extrema of a Lagrange multiplier problem. The Section 1 presents a geometric motivation for the

### CONSTRAINED NONLINEAR PROGRAMMING

149 CONSTRAINED NONLINEAR PROGRAMMING We now turn to methods for general constrained nonlinear programming. These may be broadly classified into two categories: 1. TRANSFORMATION METHODS: In this approach

### CHAPTER 9. Integer Programming

CHAPTER 9 Integer Programming An integer linear program (ILP) is, by definition, a linear program with the additional constraint that all variables take integer values: (9.1) max c T x s t Ax b and x integral

### Convex Optimization. Lieven Vandenberghe University of California, Los Angeles

Convex Optimization Lieven Vandenberghe University of California, Los Angeles Tutorial lectures, Machine Learning Summer School University of Cambridge, September 3-4, 2009 Sources: Boyd & Vandenberghe,

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Chapter 4. Duality. 4.1 A Graphical Example

Chapter 4 Duality Given any linear program, there is another related linear program called the dual. In this chapter, we will develop an understanding of the dual linear program. This understanding translates

### 4.6 Linear Programming duality

4.6 Linear Programming duality To any minimization (maximization) LP we can associate a closely related maximization (minimization) LP. Different spaces and objective functions but in general same optimal

### Absolute Value Programming

Computational Optimization and Aplications,, 1 11 (2006) c 2006 Springer Verlag, Boston. Manufactured in The Netherlands. Absolute Value Programming O. L. MANGASARIAN olvi@cs.wisc.edu Computer Sciences

### Metric Spaces. Chapter 1

Chapter 1 Metric Spaces Many of the arguments you have seen in several variable calculus are almost identical to the corresponding arguments in one variable calculus, especially arguments concerning convergence

### Cost Minimization and the Cost Function

Cost Minimization and the Cost Function Juan Manuel Puerta October 5, 2009 So far we focused on profit maximization, we could look at a different problem, that is the cost minimization problem. This is

Algorithms for large-scale convex optimization DTU 2010 3. Proximal gradient method introduction proximal mapping proximal gradient method convergence analysis accelerated proximal gradient method forward-backward

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Lecture 7: Finding Lyapunov Functions 1

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.243j (Fall 2003): DYNAMICS OF NONLINEAR SYSTEMS by A. Megretski Lecture 7: Finding Lyapunov Functions 1

### GenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1

(R) User Manual Environmental Energy Technologies Division Berkeley, CA 94720 http://simulationresearch.lbl.gov Michael Wetter MWetter@lbl.gov February 20, 2009 Notice: This work was supported by the U.S.

### A FIRST COURSE IN OPTIMIZATION THEORY

A FIRST COURSE IN OPTIMIZATION THEORY RANGARAJAN K. SUNDARAM New York University CAMBRIDGE UNIVERSITY PRESS Contents Preface Acknowledgements page xiii xvii 1 Mathematical Preliminaries 1 1.1 Notation

### RANDOM INTERVAL HOMEOMORPHISMS. MICHA L MISIUREWICZ Indiana University Purdue University Indianapolis

RANDOM INTERVAL HOMEOMORPHISMS MICHA L MISIUREWICZ Indiana University Purdue University Indianapolis This is a joint work with Lluís Alsedà Motivation: A talk by Yulij Ilyashenko. Two interval maps, applied

### Høgskolen i Narvik Sivilingeniørutdanningen STE6237 ELEMENTMETODER. Oppgaver

Høgskolen i Narvik Sivilingeniørutdanningen STE637 ELEMENTMETODER Oppgaver Klasse: 4.ID, 4.IT Ekstern Professor: Gregory A. Chechkin e-mail: chechkin@mech.math.msu.su Narvik 6 PART I Task. Consider two-point

### Walrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.

Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

### No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results

### Lecture 5: Conic Optimization: Overview

EE 227A: Conve Optimization and Applications January 31, 2012 Lecture 5: Conic Optimization: Overview Lecturer: Laurent El Ghaoui Reading assignment: Chapter 4 of BV. Sections 3.1-3.6 of WTB. 5.1 Linear

### MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets.

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets. Norm The notion of norm generalizes the notion of length of a vector in R n. Definition. Let V be a vector space. A function α

### 1. Let X and Y be normed spaces and let T B(X, Y ).

Uppsala Universitet Matematiska Institutionen Andreas Strömbergsson Prov i matematik Funktionalanalys Kurs: NVP, Frist. 2005-03-14 Skrivtid: 9 11.30 Tillåtna hjälpmedel: Manuella skrivdon, Kreyszigs bok

### Chapter 1. Metric Spaces. Metric Spaces. Examples. Normed linear spaces

Chapter 1. Metric Spaces Metric Spaces MA222 David Preiss d.preiss@warwick.ac.uk Warwick University, Spring 2008/2009 Definitions. A metric on a set M is a function d : M M R such that for all x, y, z

### Online Convex Optimization

E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting

### Stochastic Inventory Control

Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Constrained Least Squares

Constrained Least Squares Authors: G.H. Golub and C.F. Van Loan Chapter 12 in Matrix Computations, 3rd Edition, 1996, pp.580-587 CICN may05/1 Background The least squares problem: min Ax b 2 x Sometimes,