Duality in General Programs
Ryan Tibshirani
Convex Optimization 10-725/36-725
Last time: duality in linear programs

Given c ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{r×n}, h ∈ R^r:

Primal LP:
  min_{x ∈ R^n}  c^T x
  subject to  Ax = b,  Gx ≤ h

Dual LP:
  max_{u ∈ R^m, v ∈ R^r}  −b^T u − h^T v
  subject to  −A^T u − G^T v = c,  v ≥ 0

Explanation: for any u and any v ≥ 0, and any primal feasible x,
  u^T (Ax − b) + v^T (Gx − h) ≤ 0,  i.e.,  (−A^T u − G^T v)^T x ≥ −b^T u − h^T v
So if c = −A^T u − G^T v, we get a bound on the primal optimal value
Explanation #2: for any u and any v ≥ 0, and any primal feasible x,
  c^T x ≥ c^T x + u^T (Ax − b) + v^T (Gx − h) := L(x, u, v)
So if C denotes the primal feasible set and f* the primal optimal value, then for any u and v ≥ 0,
  f* ≥ min_{x ∈ C} L(x, u, v) ≥ min_x L(x, u, v) := g(u, v)
In other words, g(u, v) is a lower bound on f* for any u and v ≥ 0. Note that
  g(u, v) = −b^T u − h^T v   if c = −A^T u − G^T v
  g(u, v) = −∞               otherwise
This second explanation reproduces the same dual, but is actually completely general and applies to arbitrary optimization problems (even nonconvex ones)
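As a quick sanity check, the bound can be verified numerically. The toy problem data below are hypothetical, chosen only so that a dual feasible pair is easy to write down by hand:

```python
import numpy as np

# Toy LP: min c^T x  s.t.  Ax = b, Gx <= h  (here Gx <= h encodes x >= 0)
c = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])   # single equality: x1 + x2 = 1
b = np.array([1.0])
G = -np.eye(2)               # -x <= 0, i.e. x >= 0
h = np.zeros(2)

# A dual feasible pair: pick u, then set v so that c = -A^T u - G^T v = -A^T u + v
u = np.array([-1.0])
v = c + A.T @ u
assert np.all(v >= 0)        # v must be nonnegative for dual feasibility

lower_bound = -b @ u - h @ v   # g(u, v) = -b^T u - h^T v here

# The bound holds at every primal feasible x (x >= 0, x1 + x2 = 1)
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=1000)
X = np.column_stack([t, 1 - t])          # 1000 feasible points
assert np.all(X @ c >= lower_bound - 1e-12)
```

For this particular choice of u the bound is tight (c^T x = 1 for every feasible x), which previews the strong duality discussion later in the lecture.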
Outline

Today:
• Lagrange dual function
• Lagrange dual problem
• Weak and strong duality
• Examples
• Preview of duality uses
Lagrangian

Consider the general minimization problem
  min_x  f(x)
  subject to  h_i(x) ≤ 0, i = 1, ..., m
              l_j(x) = 0, j = 1, ..., r
Need not be convex, but of course we will pay special attention to the convex case

We define the Lagrangian as
  L(x, u, v) = f(x) + Σ_{i=1}^m u_i h_i(x) + Σ_{j=1}^r v_j l_j(x)
New variables u ∈ R^m, v ∈ R^r, with u ≥ 0 (implicitly, we define L(x, u, v) = −∞ for u < 0)
Important property: for any u ≥ 0 and any v,
  f(x) ≥ L(x, u, v) at each feasible x
Why? For feasible x,
  L(x, u, v) = f(x) + Σ_{i=1}^m u_i h_i(x) [≤ 0] + Σ_{j=1}^r v_j l_j(x) [= 0] ≤ f(x)

[Figure: from B & V page 217, Figure 5.1. Solid line is f; dashed line is h, hence the feasible set is [−0.46, 0.46]. Each dotted line shows L(x, u, v) for a different choice of u ≥ 0 and v]
Lagrange dual function

Let C denote the primal feasible set, f* the primal optimal value. Minimizing L(x, u, v) over all x gives a lower bound:
  f* ≥ min_{x ∈ C} L(x, u, v) ≥ min_x L(x, u, v) := g(u, v)
We call g(u, v) the Lagrange dual function, and it gives a lower bound on f* for any u ≥ 0 and v, called dual feasible u, v

[Figure: from B & V page 217, Figure 5.2, the dual function for the problem in Figure 5.1. Dashed horizontal line is f* (the optimal value p* = 1.54, attained at x* = −0.46); the dual variable λ is (our) u; solid line shows g(λ). Neither f_0 nor f_1 is convex, but the dual function is concave]
Example: quadratic program

Consider the quadratic program
  min_{x ∈ R^n}  (1/2) x^T Q x + c^T x
  subject to  Ax = b,  x ≥ 0
where Q ≻ 0. Lagrangian:
  L(x, u, v) = (1/2) x^T Q x + c^T x − u^T x + v^T (Ax − b)
Lagrange dual function:
  g(u, v) = min_{x ∈ R^n} L(x, u, v) = −(1/2) (c − u + A^T v)^T Q^{−1} (c − u + A^T v) − b^T v
For any u ≥ 0 and any v, this is a lower bound on the primal optimal value f*
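A sketch of this bound in code, on a randomly generated instance (the problem data and the use of scipy's generic SLSQP solver for the primal are illustrative choices, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, m = 4, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)              # Q positive definite
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
x_feas = rng.uniform(0.5, 1.5, size=n)   # build b so the problem is feasible
b = A @ x_feas

f = lambda x: 0.5 * x @ Q @ x + c @ x

# Solve the primal QP numerically: min f(x) s.t. Ax = b, x >= 0
res = minimize(f, x_feas, method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda x: A @ x - b}])
f_star = res.fun

# Dual function g(u, v) = -1/2 (c - u + A^T v)^T Q^{-1} (c - u + A^T v) - b^T v
Qinv = np.linalg.inv(Q)
def g(u, v):
    w = c - u + A.T @ v
    return -0.5 * w @ Qinv @ w - b @ v

# Weak duality: g(u, v) <= f* for every u >= 0 and every v
for _ in range(100):
    u = rng.uniform(0, 2, size=n)
    v = rng.standard_normal(m)
    assert g(u, v) <= f_star + 1e-5
```

The loop probes the bound at random dual feasible points; no choice of u ≥ 0 and v ever exceeds the primal optimal value.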
Same problem, but now Q ⪰ 0 (possibly singular):
  min_{x ∈ R^n}  (1/2) x^T Q x + c^T x
  subject to  Ax = b,  x ≥ 0
Lagrangian:
  L(x, u, v) = (1/2) x^T Q x + c^T x − u^T x + v^T (Ax − b)
Lagrange dual function:
  g(u, v) = −(1/2) (c − u + A^T v)^T Q^+ (c − u + A^T v) − b^T v   if c − u + A^T v ⊥ null(Q)
  g(u, v) = −∞   otherwise
where Q^+ denotes the generalized inverse of Q. For any u ≥ 0, any v, and c − u + A^T v ⊥ null(Q), g(u, v) is a nontrivial lower bound on f*
Example: quadratic program in 2D

We choose f(x) to be quadratic in 2 variables, subject to x ≥ 0. The dual function g(u) is also quadratic in 2 variables, also subject to u ≥ 0

[Figure: primal surface f(x_1, x_2) and dual surface g(u_1, u_2)]

The dual function g(u) provides a bound on f* for every u ≥ 0. The largest bound this gives us turns out to be exactly f* ... coincidence? More on this later, via the KKT conditions
Lagrange dual problem

Given the primal problem
  min_x  f(x)
  subject to  h_i(x) ≤ 0, i = 1, ..., m
              l_j(x) = 0, j = 1, ..., r
our constructed dual function g(u, v) satisfies f* ≥ g(u, v) for all u ≥ 0 and v. Hence the best lower bound is given by maximizing g(u, v) over all dual feasible u, v, yielding the Lagrange dual problem:
  max_{u,v}  g(u, v)
  subject to  u ≥ 0

Key property, called weak duality: if the dual optimal value is g*, then
  f* ≥ g*
Note that this always holds (even if the primal problem is nonconvex)
Another key property: the dual problem is a convex optimization problem (as written, it is a concave maximization problem)

Again, this is always true (even when the primal problem is not convex). By definition:
  g(u, v) = min_x { f(x) + Σ_{i=1}^m u_i h_i(x) + Σ_{j=1}^r v_j l_j(x) }
          = −max_x { −f(x) − Σ_{i=1}^m u_i h_i(x) − Σ_{j=1}^r v_j l_j(x) }
and the max here is a pointwise maximum of convex (in fact affine) functions in (u, v). I.e., g is concave in (u, v), and u ≥ 0 is a convex constraint, hence the dual problem is a concave maximization problem
Example: nonconvex quartic minimization

Define f(x) = x^4 − 50x^2 + 100x (nonconvex), and minimize it subject to the constraint x ≥ −4.5

[Figure: left panel, primal: f(x) over x ∈ [−10, 10]; right panel, dual: g(v) over v ∈ [0, 100], with g ranging over roughly [−1160, −1080]]

The dual function g can be derived explicitly, via the closed-form expression for the roots of a cubic equation
The form of g is quite complicated:
  g(u) = min_{i=1,2,3} { F_i(u)^4 − 50 F_i(u)^2 + 100 F_i(u) }
where for i = 1, 2, 3,
  F_i(u) = (a_i / (12 · 2^{1/3})) · (432(100 − u) + (432^2 (100 − u)^2 − 4 · 1200^3)^{1/2})^{1/3}
           − 100 · 2^{1/3} · 1 / (432(100 − u) + (432^2 (100 − u)^2 − 4 · 1200^3)^{1/2})^{1/3}
and a_1 = 1, a_2 = (−1 + i√3)/2, a_3 = (−1 − i√3)/2

Without the context of duality it would be difficult to tell whether or not g is concave ... but we know it must be!
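Rather than wrestling with the closed form, g can also be approximated by brute force, minimizing L(x, u) over a fine grid of x values. This is only a sketch under that grid approximation, but it lets us check both concavity and weak duality numerically:

```python
import numpy as np

f = lambda x: x**4 - 50 * x**2 + 100 * x

# Constraint x >= -4.5 written as h(x) = -x - 4.5 <= 0, so
# L(x, u) = f(x) + u * (-x - 4.5)
xs = np.linspace(-10, 10, 20001)         # grid for the inner minimization
us = np.linspace(0, 100, 201)            # dual variable range

# g(u) = min_x L(x, u), approximated on the grid (each row is one u)
L = f(xs)[None, :] - us[:, None] * (xs[None, :] + 4.5)
g = L.min(axis=1)

# Concavity: second differences of g along u are (numerically) nonpositive,
# since g is a pointwise min of functions affine in u
second_diff = g[2:] - 2 * g[1:-1] + g[:-2]
assert np.all(second_diff <= 1e-6)

# Weak duality: every g(u) lower-bounds the primal optimal value
f_star = f(xs[xs >= -4.5]).min()
assert np.all(g <= f_star + 1e-8)
```

Note the grid min of L is a pointwise minimum of functions affine in u, so the approximate g is exactly concave, mirroring the general argument two slides back.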
Strong duality

Recall that we always have f* ≥ g* (weak duality). On the other hand, in some problems we have observed that actually
  f* = g*
which is called strong duality

Slater's condition: if the primal is a convex problem (i.e., f and h_1, ..., h_m are convex, l_1, ..., l_r are affine), and there exists at least one strictly feasible x ∈ R^n, meaning
  h_1(x) < 0, ..., h_m(x) < 0 and l_1(x) = 0, ..., l_r(x) = 0
then strong duality holds

This is a pretty weak condition. (Further refinement: we only require strict inequalities over the functions h_i that are not affine)
LPs: back to where we started

For linear programs:
• Easy to check that the dual of the dual LP is the primal LP
• Refined version of Slater's condition: strong duality holds for an LP if it is feasible
• Apply the same logic to its dual LP: strong duality holds if it is feasible

Hence strong duality holds for LPs, except when both the primal and the dual are infeasible. In other words, we nearly always have strong duality for LPs
Example: support vector machine dual

Given y ∈ {−1, 1}^n and X ∈ R^{n×p} with rows x_1, ..., x_n, recall the support vector machine problem:
  min_{β, β_0, ξ}  (1/2) ‖β‖_2^2 + C Σ_{i=1}^n ξ_i
  subject to  ξ_i ≥ 0, i = 1, ..., n
              y_i (x_i^T β + β_0) ≥ 1 − ξ_i, i = 1, ..., n
Introducing dual variables v, w ≥ 0, we form the Lagrangian:
  L(β, β_0, ξ, v, w) = (1/2) ‖β‖_2^2 + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n v_i ξ_i + Σ_{i=1}^n w_i (1 − ξ_i − y_i (x_i^T β + β_0))
Minimizing over β, β_0, ξ gives the Lagrange dual function:
  g(v, w) = −(1/2) w^T X̃ X̃^T w + 1^T w   if w = C1 − v, w^T y = 0
  g(v, w) = −∞   otherwise
where X̃ = diag(y) X. Thus the SVM dual problem, eliminating the slack variable v, becomes
  max_w  −(1/2) w^T X̃ X̃^T w + 1^T w
  subject to  0 ≤ w ≤ C1,  w^T y = 0

Check: Slater's condition is satisfied, so we have strong duality. Further, from the study of SVMs, you might recall that at optimality
  β = X̃^T w
This is not a coincidence, as we'll see later via the KKT conditions
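A minimal numerical sketch of this dual, on a hypothetical toy dataset, using scipy's generic SLSQP routine in place of a dedicated QP solver. Recovering β_0 from a roughly mid-box coordinate of w is a heuristic; weak duality holds regardless of that choice, since ξ is built to make the primal point feasible:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2D dataset: two linearly separable classes (illustrative data)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(y), 1.0

Xt = y[:, None] * X                     # X~ = diag(y) X
K = Xt @ Xt.T

# Dual: max -1/2 w^T X~ X~^T w + 1^T w  s.t.  0 <= w <= C1, w^T y = 0
neg_dual = lambda w: 0.5 * w @ K @ w - w.sum()
res = minimize(neg_dual, np.full(n, 0.5), method="SLSQP",
               bounds=[(0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda w: w @ y}])
w = np.clip(res.x, 0, C)
g_val = w.sum() - 0.5 * w @ K @ w       # dual objective at the solution

beta = Xt.T @ w                         # beta = X~^T w at optimality
# Heuristic beta_0 from the coordinate of w closest to C/2, then build a
# primal feasible (beta, beta_0, xi) and check weak duality
i = int(np.argmin(np.abs(w - C / 2)))
beta0 = y[i] - X[i] @ beta
xi = np.maximum(0, 1 - y * (X @ beta + beta0))
primal_val = 0.5 * beta @ beta + C * xi.sum()
assert primal_val >= g_val - 1e-4       # weak duality
```

With strong duality (Slater holds here) and the right β_0, the gap would close to zero at optimality; any primal feasible point still sits above the dual value.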
Duality gap

Given primal feasible x and dual feasible u, v, the quantity
  f(x) − g(u, v)
is called the duality gap between x and u, v. Note that
  f(x) − f* ≤ f(x) − g(u, v)
so if the duality gap is zero, then x is primal optimal (and similarly, u, v are dual optimal)

From an algorithmic viewpoint, this provides a stopping criterion: if f(x) − g(u, v) ≤ ε, then we are guaranteed that f(x) − f* ≤ ε

Very useful, especially in conjunction with iterative methods ... more dual uses in coming lectures
Dual norms

Let ‖x‖ be a norm, e.g.,
• the ℓ_p norm: ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}, for p ≥ 1
• the trace norm: ‖X‖_tr = Σ_{i=1}^r σ_i(X)

We define its dual norm ‖x‖_* as
  ‖x‖_* = max_{‖z‖ ≤ 1} z^T x
This gives us the inequality |z^T x| ≤ ‖z‖ ‖x‖_*, like Cauchy-Schwarz. Back to our examples:
• ℓ_p norm dual: (‖x‖_p)_* = ‖x‖_q, where 1/p + 1/q = 1
• trace norm dual: (‖X‖_tr)_* = ‖X‖_op = σ_max(X)

Dual norm of the dual norm: it turns out that ‖x‖_** = ‖x‖ ... we'll see connections to duality (including this one) in coming lectures
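The ℓ_p claim can be checked numerically (a sketch: random sampling only probes the inequality direction, while the standard Hölder-equality construction for z* shows the bound is attained):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(5)
p, q = 3.0, 1.5                          # conjugate exponents: 1/p + 1/q = 1

x_q = (np.abs(x) ** q).sum() ** (1 / q)  # claimed dual norm value ||x||_q

# No z with ||z||_p = 1 beats ||x||_q (Hölder's inequality) ...
Z = rng.standard_normal((10000, 5))
Z /= ((np.abs(Z) ** p).sum(axis=1) ** (1 / p))[:, None]   # normalize rows
assert np.all(Z @ x <= x_q + 1e-9)

# ... and the maximizer z*_i = sign(x_i)|x_i|^{q-1} / ||x||_q^{q-1}
# has ||z*||_p = 1 and attains z*^T x = ||x||_q, so
# max_{||z||_p <= 1} z^T x = ||x||_q
z_star = np.sign(x) * np.abs(x) ** (q - 1) / x_q ** (q - 1)
assert np.isclose((np.abs(z_star) ** p).sum() ** (1 / p), 1.0)
assert np.isclose(z_star @ x, x_q)
```

The same two-sided pattern (an inequality from Hölder plus an explicit maximizer) is how the trace-norm/operator-norm pairing is verified as well, with singular vectors in place of sign patterns.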
References

• S. Boyd and L. Vandenberghe (2004), Convex Optimization, Chapter 5
• R. T. Rockafellar (1970), Convex Analysis, Chapters 28-30