
Convex Optimization. Lieven Vandenberghe, University of California, Los Angeles. Tutorial lectures, Machine Learning Summer School, University of Cambridge, September 3-4, 2009. Sources: Boyd & Vandenberghe, Convex Optimization, 2004; courses EE236B, EE236C (UCLA), EE364A, EE364B (Stephen Boyd, Stanford Univ.)

Introduction Convex optimization MLSS 2009 mathematical optimization, modeling, complexity convex optimization recent history 1

Mathematical optimization

minimize f_0(x_1, ..., x_n)
subject to f_1(x_1, ..., x_n) ≤ 0, ..., f_m(x_1, ..., x_n) ≤ 0

x = (x_1, x_2, ..., x_n) are the decision variables; f_0(x_1, x_2, ..., x_n) gives the cost of choosing x; the inequalities give constraints that x must satisfy; a mathematical model of a decision, design, or estimation problem. Introduction 2

Limits of mathematical optimization how realistic is the model, and how certain are we about it? is the optimization problem tractable by existing numerical algorithms? Optimization research modeling generic techniques for formulating tractable optimization problems algorithms expand class of problems that can be efficiently solved Introduction 3

Complexity of nonlinear optimization the general optimization problem is intractable even simple looking optimization problems can be very hard Examples quadratic optimization problem with many constraints minimizing a multivariate polynomial Introduction 4

The famous exception: Linear programming

minimize c^T x = Σ_{i=1}^n c_i x_i
subject to a_i^T x ≤ b_i, i = 1, ..., m

widely used since Dantzig introduced the simplex algorithm in 1948; since the 1950s, many applications in operations research, network optimization, finance, engineering, ...; extensive theory (optimality conditions, sensitivity, ...); there exist very efficient algorithms for solving linear programs. Introduction 5

Solving linear programs: no closed-form expression for the solution; widely available and reliable software; for some algorithms, can prove polynomial time; problems with over 10^5 variables or constraints solved routinely. Introduction 6

Convex optimization problem

minimize f_0(x)
subject to f_1(x) ≤ 0, ..., f_m(x) ≤ 0

objective and constraint functions are convex: for 0 ≤ θ ≤ 1,

f_i(θx + (1−θ)y) ≤ θ f_i(x) + (1−θ) f_i(y)

includes least-squares problems and linear programs as special cases; can be solved exactly, with similar complexity as LPs; surprisingly many problems can be solved via convex optimization. Introduction 7

History. 1940s: linear programming (minimize c^T x subject to a_i^T x ≤ b_i, i = 1, ..., m); 1950s: quadratic programming; 1960s: geometric programming; 1990s: semidefinite programming, second-order cone programming, quadratically constrained quadratic programming, robust optimization, sum-of-squares programming, ... Introduction 8

New applications since 1990 linear matrix inequality techniques in control circuit design via geometric programming support vector machine learning via quadratic programming semidefinite programming relaxations in combinatorial optimization l 1 -norm optimization for sparse signal reconstruction applications in structural optimization, statistics, signal processing, communications, image processing, computer vision, quantum information theory, finance,... Introduction 9

Algorithms. Interior-point methods: 1984 (Karmarkar): first practical polynomial-time algorithm; 1984-1990: efficient implementations for large-scale LPs; around 1990 (Nesterov & Nemirovski): polynomial-time interior-point methods for nonlinear convex programming; since 1990: extensions and high-quality software packages. First-order algorithms: similar to gradient descent, but with better convergence properties; based on Nesterov's optimal methods from the 1980s; extend to certain nondifferentiable or constrained problems. Introduction 10

Outline basic theory convex sets and functions convex optimization problems linear, quadratic, and geometric programming cone linear programming and applications second-order cone programming semidefinite programming some recent developments in algorithms (since 1990) interior-point methods fast gradient methods 10

Convex sets and functions Convex optimization MLSS 2009 definition basic examples and properties operations that preserve convexity 11

Convex set: contains the line segment between any two points in the set:

x_1, x_2 ∈ C, 0 ≤ θ ≤ 1 ⟹ θx_1 + (1−θ)x_2 ∈ C

Examples: one convex, two nonconvex sets [figure]. Convex sets and functions 12

Examples and properties: solution set of linear equations Ax = b (affine set); solution set of linear inequalities Ax ⪯ b (polyhedron); norm balls {x : ‖x‖ ≤ R} and norm cones {(x, t) : ‖x‖ ≤ t}; set of positive semidefinite matrices S^n_+ = {X ∈ S^n : X ⪰ 0}; the image of a convex set under a linear transformation is convex; the inverse image of a convex set under a linear transformation is convex; the intersection of convex sets is convex. Convex sets and functions 13

Example of intersection property

C = {x ∈ R^n : |p(t)| ≤ 1 for |t| ≤ π/3}, where p(t) = x_1 cos t + x_2 cos 2t + ··· + x_n cos nt

[figure: plot of p(t) and of the set C for n = 2]

C is the intersection of infinitely many halfspaces, hence convex. Convex sets and functions 14

Convex function: the domain dom f is a convex set and for all x, y ∈ dom f, 0 ≤ θ ≤ 1,

f(θx + (1−θ)y) ≤ θ f(x) + (1−θ) f(y)

[figure: the chord between (x, f(x)) and (y, f(y)) lies above the graph]

f is concave if −f is convex. Convex sets and functions 15

Epigraph and sublevel set. Epigraph: epi f = {(x, t) : x ∈ dom f, f(x) ≤ t}; a function is convex if and only if its epigraph is a convex set. Sublevel sets: C_α = {x ∈ dom f : f(x) ≤ α}; the sublevel sets of a convex function are convex (the converse is false). Convex sets and functions 16

Examples: exp x, −log x, x log x are convex; x^α is convex for x > 0 and α ≥ 1 or α ≤ 0; |x|^α is convex for α ≥ 1; the quadratic-over-linear function x^T x/t is convex in (x, t) for t > 0; the geometric mean (x_1 x_2 ··· x_n)^{1/n} is concave for x ⪰ 0; log det X is concave on the set of positive definite matrices; log(e^{x_1} + ··· + e^{x_n}) is convex; linear and affine functions are convex and concave; norms are convex. Convex sets and functions 17

Differentiable convex functions. A differentiable f is convex if and only if dom f is convex and

f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom f

[figure: the first-order approximation f(x) + ∇f(x)^T (y − x) is a global underestimator]

A twice differentiable f is convex if and only if dom f is convex and ∇²f(x) ⪰ 0 for all x ∈ dom f. Convex sets and functions 18

Operations that preserve convexity: methods for establishing convexity of a function: 1. verify the definition; 2. for twice differentiable functions, show ∇²f(x) ⪰ 0; 3. show that f is obtained from simple convex functions by operations that preserve convexity: nonnegative weighted sum, composition with affine function, pointwise maximum and supremum, composition, minimization, perspective. Convex sets and functions 19

Positive weighted sum & composition with affine function. Nonnegative multiple: αf is convex if f is convex, α ≥ 0. Sum: f_1 + f_2 is convex if f_1, f_2 are convex (extends to infinite sums, integrals). Composition with affine function: f(Ax + b) is convex if f is convex. Examples: log barrier for linear inequalities f(x) = −Σ_{i=1}^m log(b_i − a_i^T x); (any) norm of an affine function: f(x) = ‖Ax + b‖. Convex sets and functions 20

Pointwise maximum: f(x) = max{f_1(x), ..., f_m(x)} is convex if f_1, ..., f_m are convex. Example: the sum of the r largest components of x ∈ R^n, f(x) = x_[1] + x_[2] + ··· + x_[r], is convex (x_[i] is the ith largest component of x); proof: f(x) = max{x_{i_1} + x_{i_2} + ··· + x_{i_r} : 1 ≤ i_1 < i_2 < ··· < i_r ≤ n}. Convex sets and functions 21

Pointwise supremum: g(x) = sup_{y∈A} f(x, y) is convex if f(x, y) is convex in x for each y ∈ A. Example: maximum eigenvalue of a symmetric matrix, λ_max(X) = sup_{‖y‖₂=1} y^T X y. Convex sets and functions 22

Composition with scalar functions: the composition of g : R^n → R and h : R → R, f(x) = h(g(x)), is convex if: g convex, h convex and nondecreasing; or g concave, h convex and nonincreasing (if we assign h(x) = ∞ for x ∉ dom h). Examples: exp g(x) is convex if g is convex; 1/g(x) is convex if g is concave and positive. Convex sets and functions 23

Vector composition: the composition of g : R^n → R^k and h : R^k → R, f(x) = h(g(x)) = h(g_1(x), g_2(x), ..., g_k(x)), is convex if: g_i convex, h convex and nondecreasing in each argument; or g_i concave, h convex and nonincreasing in each argument (if we assign h(x) = ∞ for x ∉ dom h). Examples: Σ_{i=1}^m log g_i(x) is concave if the g_i are concave and positive; log Σ_{i=1}^m exp g_i(x) is convex if the g_i are convex. Convex sets and functions 24

Minimization: g(x) = inf_{y∈C} f(x, y) is convex if f(x, y) is convex in (x, y) and C is a convex set. Examples: distance to a convex set C: g(x) = inf_{y∈C} ‖x − y‖; optimal value of a linear program as a function of the righthand side, g(x) = inf_{y: Ay⪯x} c^T y, follows by taking f(x, y) = c^T y with dom f = {(x, y) : Ay ⪯ x}. Convex sets and functions 25

Perspective: the perspective of a function f : R^n → R is the function g : R^n × R → R, g(x, t) = t f(x/t); g is convex if f is convex, on dom g = {(x, t) : x/t ∈ dom f, t > 0}. Examples: the perspective of f(x) = x^T x is the quadratic-over-linear function g(x, t) = x^T x / t; the perspective of the negative logarithm f(x) = −log x is the relative entropy g(x, t) = t log t − t log x. Convex sets and functions 26

Convex optimization problems Convex optimization MLSS 2009 standard form linear, quadratic, geometric programming modeling languages 27

Convex optimization problem

minimize f_0(x)
subject to f_i(x) ≤ 0, i = 1, ..., m
Ax = b

f_0, f_1, ..., f_m are convex functions; the feasible set is convex; locally optimal points are globally optimal; tractable, both in theory and practice. Convex optimization problems 28

Linear program (LP)

minimize c^T x + d
subject to Gx ⪯ h
Ax = b

the inequality is a componentwise vector inequality; a convex problem with affine objective and constraint functions; the feasible set is a polyhedron [figure: polyhedron P, cost vector c, optimal point x*]. Convex optimization problems 29

Piecewise-linear minimization

minimize f(x) = max_{i=1,...,m} (a_i^T x + b_i)

[figure: f(x) as the maximum of the affine functions a_i^T x + b_i]

Equivalent linear program (sketched in the code below):

minimize t
subject to a_i^T x + b_i ≤ t, i = 1, ..., m

an LP with variables x ∈ R^n, t ∈ R. Convex optimization problems 30
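As a concrete illustration, here is a minimal CVX sketch of the epigraph reformulation (it assumes CVX is installed; A and b below are random placeholders, with the rows of A playing the role of the a_i^T):

% piecewise-linear minimization via the epigraph LP (sketch)
m = 20; n = 5;
A = randn(m, n); b = randn(m, 1);   % rows of A are the a_i'
cvx_begin
    variables x(n) t
    minimize(t)
    subject to
        A*x + b <= t;               % a_i'*x + b_i <= t for all i
cvx_end
% CVX also accepts the direct form minimize(max(A*x + b))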

Linear discrimination: separate two sets of points {x_1, ..., x_N}, {y_1, ..., y_M} by a hyperplane:

a^T x_i + b > 0, i = 1, ..., N; a^T y_i + b < 0, i = 1, ..., M

homogeneous in a, b, hence equivalent to the linear inequalities (in a, b)

a^T x_i + b ≥ 1, i = 1, ..., N; a^T y_i + b ≤ −1, i = 1, ..., M

Convex optimization problems 31

Approximate linear separation of non-separable sets

minimize Σ_{i=1}^N max{0, 1 − a^T x_i − b} + Σ_{i=1}^M max{0, 1 + a^T y_i + b}

a piecewise-linear minimization problem in a, b; equivalent to an LP; can be interpreted as a heuristic for minimizing the number of misclassified points. Convex optimization problems 32

ℓ1-Norm and ℓ∞-norm minimization. ℓ1-Norm approximation and equivalent LP (‖y‖₁ = Σ_k |y_k|):

minimize ‖Ax − b‖₁   ⟺   minimize Σ_i y_i subject to −y ⪯ Ax − b ⪯ y

ℓ∞-Norm approximation (‖y‖∞ = max_k |y_k|):

minimize ‖Ax − b‖∞   ⟺   minimize y subject to −y·1 ⪯ Ax − b ⪯ y·1

(1 is the vector of ones). Convex optimization problems 33

Quadratic program (QP)

minimize (1/2) x^T P x + q^T x + r
subject to Gx ⪯ h
Ax = b

P ∈ S^n_+, so the objective is convex quadratic; minimize a convex quadratic function over a polyhedron [figure: level curves of f_0 over the polyhedron P, with optimum x*]. Convex optimization problems 34

Linear program with random cost

minimize c^T x
subject to Gx ⪯ h

c is a random vector with mean c̄ and covariance Σ; hence c^T x is a random variable with mean c̄^T x and variance x^T Σ x. Expected cost-variance trade-off:

minimize E c^T x + γ var(c^T x) = c̄^T x + γ x^T Σ x
subject to Gx ⪯ h

γ > 0 is the risk aversion parameter. Convex optimization problems 35

Robust linear discrimination

H_1 = {z : a^T z + b = 1}, H_2 = {z : a^T z + b = −1}; the distance between the hyperplanes is 2/‖a‖₂

to separate two sets of points by maximum margin,

minimize ‖a‖₂² = a^T a
subject to a^T x_i + b ≥ 1, i = 1, ..., N
a^T y_i + b ≤ −1, i = 1, ..., M

a quadratic program in a, b. Convex optimization problems 36

Support vector classifier

minimize γ‖a‖₂² + Σ_{i=1}^N max{0, 1 − a^T x_i − b} + Σ_{i=1}^M max{0, 1 + a^T y_i + b}

equivalent to a QP (a CVX sketch follows below) [figure: classifiers obtained for γ = 0 and γ = 10]. Convex optimization problems 37
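A minimal CVX sketch of this hinge-loss problem, assuming the class-1 points are the rows of X and the class-2 points the rows of Y (random placeholder data):

% support vector classifier as a QP (sketch)
N = 50; M = 50; n = 2; gamma = 1;
X = randn(N, n) + 1;                % class 1 points (rows)
Y = randn(M, n) - 1;                % class 2 points (rows)
cvx_begin
    variables a(n) b
    minimize( gamma*sum_square(a) + sum(max(0, 1 - X*a - b)) ...
              + sum(max(0, 1 + Y*a + b)) )
cvx_end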

Sparse signal reconstruction: signal x̂ of length 1000 with ten nonzero components [figure: exact signal]; reconstruct the signal from m = 100 random noisy measurements b = Ax̂ + v (A_ij ~ N(0, 1) i.i.d. and v ~ N(0, σ²I) with σ = 0.01). Convex optimization problems 38

ℓ2-Norm regularization: a least-squares problem, minimize ‖Ax − b‖₂² + γ‖x‖₂². [figure: left: exact signal x̂; right: 2-norm reconstruction] Convex optimization problems 39

ℓ1-Norm regularization: minimize ‖Ax − b‖₂² + γ‖x‖₁; equivalent to a QP (a CVX sketch follows below). [figure: left: exact signal x̂; right: 1-norm reconstruction] Convex optimization problems 40
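For concreteness, a CVX sketch of this problem (random placeholder data; gamma is the regularization weight):

% l1-regularized least-squares (sketch)
m = 100; n = 1000; gamma = 0.1;
A = randn(m, n); b = randn(m, 1);
cvx_begin
    variable x(n)
    minimize( sum_square(A*x - b) + gamma*norm(x, 1) )
cvx_end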

Geometric programming. Posynomial function:

f(x) = Σ_{k=1}^K c_k x_1^{a_{1k}} x_2^{a_{2k}} ··· x_n^{a_{nk}}, dom f = R^n_{++}, with c_k > 0

Geometric program (GP):

minimize f_0(x)
subject to f_i(x) ≤ 1, i = 1, ..., m

with f_i posynomial. Convex optimization problems 41

Geometric program in convex form: change variables to y_i = log x_i, and take the logarithm of cost and constraints. Geometric program in convex form:

minimize log Σ_{k=1}^K exp(a_{0k}^T y + b_{0k})
subject to log Σ_{k=1}^K exp(a_{ik}^T y + b_{ik}) ≤ 0, i = 1, ..., m

where b_ik = log c_ik. Convex optimization problems 42

Modeling software. Modeling packages for convex optimization: CVX, YALMIP (Matlab); CVXMOD (Python). They assist in formulating convex problems by automating two tasks: verifying convexity from convex calculus rules, and transforming the problem into the input format required by standard solvers. Related packages for general purpose optimization modeling: AMPL, GAMS. Convex optimization problems 43

CVX example

minimize ‖Ax − b‖₁
subject to −0.5 ≤ x_k ≤ 0.3, k = 1, ..., n

Matlab code:

A = randn(5, 3); b = randn(5, 1);
cvx_begin
    variable x(3);
    minimize(norm(A*x - b, 1))
    subject to
        -0.5 <= x;
        x <= 0.3;
cvx_end

between cvx_begin and cvx_end, x is a CVX variable; after execution, x is a Matlab variable holding the optimal solution. Convex optimization problems 44

Cone programming Convex optimization MLSS 2009 generalized inequalities second-order cone programming semidefinite programming 45

Cone linear program

minimize c^T x
subject to Gx ⪯_K h
Ax = b

y ⪯_K z means z − y ∈ K, where K is a proper convex cone; extends linear programming (K = R^m_+) to nonpolyhedral cones; popular as a standard format for nonlinear convex optimization; theory and algorithms very similar to linear programming. Cone programming 46

Second-order cone program (SOCP)

minimize f^T x
subject to ‖A_i x + b_i‖₂ ≤ c_i^T x + d_i, i = 1, ..., m

‖·‖₂ is the Euclidean norm, ‖y‖₂ = √(y_1² + ··· + y_n²); the constraints are nonlinear, nondifferentiable, convex; the constraints are inequalities with respect to the second-order cone {y : √(y_1² + ··· + y_{p−1}²) ≤ y_p} [figure: second-order cone in R³]. Cone programming 47

Examples of SOC-representable constraints. Convex quadratic constraint (A = LL^T positive definite):

x^T A x + 2b^T x + c ≤ 0 ⟺ ‖L^T x + L^{-1} b‖₂ ≤ (b^T A^{-1} b − c)^{1/2}

also extends to positive semidefinite singular A. Hyperbolic constraint:

x^T x ≤ yz, y, z ≥ 0 ⟺ ‖[2x; y − z]‖₂ ≤ y + z, y, z ≥ 0

Cone programming 48

Examples of SOC-representable constraints. Positive powers:

x^{1.5} ≤ t, x ≥ 0 ⟺ ∃z: x² ≤ tz, z² ≤ x, x, z ≥ 0

the two hyperbolic constraints can be converted to SOC constraints; extends to powers x^p for rational p ≥ 1. Negative powers:

x^{−3} ≤ t, x > 0 ⟺ ∃z: 1 ≤ tz, z² ≤ tx, x, z ≥ 0

the two hyperbolic constraints can be converted to SOC constraints; extends to powers x^p for rational p < 0. Cone programming 49

Robust linear program (stochastic)

minimize c^T x
subject to prob(a_i^T x ≤ b_i) ≥ η, i = 1, ..., m

a_i random and normally distributed with mean ā_i, covariance Σ_i; we require that x satisfies each constraint with probability exceeding η. [figure: feasible boundaries for η = 10%, 50%, 90%] Cone programming 50

SOCP formulation: the chance constraint prob(a_i^T x ≤ b_i) ≥ η is equivalent to the constraint

ā_i^T x + Φ^{-1}(η) ‖Σ_i^{1/2} x‖₂ ≤ b_i

where Φ is the (unit) normal cumulative distribution function [figure: Φ(t) and the quantile Φ^{-1}(η)]; the robust LP is a second-order cone program for η ≥ 0.5 (a CVX sketch follows below). Cone programming 51
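A sketch of this SOCP reformulation in CVX; it assumes the Statistics Toolbox function norminv for Φ^{-1}, and uses random placeholder data for ā_i, Σ_i, b_i, c:

% chance-constrained LP as an SOCP (sketch)
n = 3; m = 5; eta = 0.9;
abar = randn(m, n); b = randn(m, 1) + 5; c = randn(n, 1);
Shalf = cell(m, 1);
for i = 1:m
    R = randn(n);
    Shalf{i} = chol(R*R' + eye(n));   % any M with M'*M = Sigma_i gives ||M*x|| = sqrt(x'*Sigma_i*x)
end
cvx_begin
    variable x(n)
    minimize(c'*x)
    subject to
        for i = 1:m
            abar(i,:)*x + norminv(eta)*norm(Shalf{i}*x, 2) <= b(i);
        end
cvx_end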

Robust linear program (deterministic)

minimize c^T x
subject to a_i^T x ≤ b_i for all a_i ∈ E_i, i = 1, ..., m

a_i uncertain but bounded by the ellipsoid E_i = {ā_i + P_i u : ‖u‖₂ ≤ 1}; we require that x satisfies each constraint for all possible a_i. SOCP formulation:

minimize c^T x
subject to ā_i^T x + ‖P_i^T x‖₂ ≤ b_i, i = 1, ..., m

follows from sup_{‖u‖₂≤1} (ā_i + P_i u)^T x = ā_i^T x + ‖P_i^T x‖₂. Cone programming 52

Semidefinite program (SDP)

minimize c^T x
subject to x_1 A_1 + x_2 A_2 + ··· + x_n A_n ⪯ B

A_1, A_2, ..., A_n, B are symmetric matrices; the inequality X ⪯ Y means Y − X is positive semidefinite, i.e.,

z^T (Y − X) z = Σ_{i,j} (Y_ij − X_ij) z_i z_j ≥ 0 for all z

includes many nonlinear constraints as special cases. Cone programming 53

Geometry: the set {(x, y, z) : [x y; y z] ⪰ 0} is a nonpolyhedral convex cone [figure]; the feasible set of a semidefinite program is the intersection of the positive semidefinite cone in high dimension with planes. Cone programming 54

Examples, with A(x) = A_0 + x_1 A_1 + ··· + x_m A_m (A_i ∈ S^n). Eigenvalue minimization (and equivalent SDP):

minimize λ_max(A(x))   ⟺   minimize t subject to A(x) ⪯ tI

Matrix-fractional function:

minimize b^T A(x)^{-1} b subject to A(x) ≻ 0   ⟺   minimize t subject to [A(x) b; b^T t] ⪰ 0

Cone programming 55
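As a check of the first equivalence, a minimal CVX sketch of the eigenvalue-minimization example (random symmetric placeholder matrices; CVX's built-in lambda_max corresponds to the epigraph form above):

% eigenvalue minimization (sketch)
p = 4;
A0 = randn(p); A0 = (A0 + A0')/2;
A1 = randn(p); A1 = (A1 + A1')/2;
A2 = randn(p); A2 = (A2 + A2')/2;
cvx_begin
    variable x(2)
    minimize( lambda_max(A0 + x(1)*A1 + x(2)*A2) )
cvx_end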

Matrix norm minimization, with A(x) = A_0 + x_1 A_1 + x_2 A_2 + ··· + x_n A_n (A_i ∈ R^{p×q}). Matrix norm approximation (‖X‖₂ = max_k σ_k(X)):

minimize ‖A(x)‖₂   ⟺   minimize t subject to [tI A(x); A(x)^T tI] ⪰ 0

Nuclear norm approximation (‖X‖_* = Σ_k σ_k(X)):

minimize ‖A(x)‖_*   ⟺   minimize (tr U + tr V)/2 subject to [U A(x)^T; A(x) V] ⪰ 0

Cone programming 56

Semidefinite relaxations & randomization: semidefinite programming is increasingly used to find good bounds for hard (i.e., nonconvex) problems, via relaxation, and as a heuristic for good suboptimal points, often via randomization. Example: Boolean least-squares

minimize ‖Ax − b‖₂²
subject to x_i² = 1, i = 1, ..., n

a basic problem in digital communications; could check all 2^n possible values of x ∈ {−1, 1}^n ...; an NP-hard problem, and very hard in practice. Cone programming 57

Semidefinite lifting: with P = A^T A, q = −A^T b, r = b^T b,

‖Ax − b‖₂² = Σ_{i,j=1}^n P_ij x_i x_j + 2 Σ_{i=1}^n q_i x_i + r

after introducing new variables X_ij = x_i x_j:

minimize Σ_{i,j=1}^n P_ij X_ij + 2 Σ_{i=1}^n q_i x_i + r
subject to X_ii = 1, i = 1, ..., n
X_ij = x_i x_j, i, j = 1, ..., n

the cost function and first constraints are linear; the last constraint, in matrix form X = xx^T, is nonlinear and nonconvex, ... still a very hard problem. Cone programming 58

Semidefinite relaxation: replace X = xx^T with the weaker constraint X ⪰ xx^T, to obtain the relaxation

minimize Σ_{i,j=1}^n P_ij X_ij + 2 Σ_{i=1}^n q_i x_i + r
subject to X_ii = 1, i = 1, ..., n
X ⪰ xx^T

convex; can be solved as a semidefinite program; the optimal value gives a lower bound for BLS; if X = xx^T at the optimum, we have solved the exact problem; otherwise, we can use randomized rounding: generate z from N(x, X − xx^T) and take x = sign(z) (a sketch follows below). Cone programming 59
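A sketch of the relaxation and the randomized rounding in CVX/Matlab; the Schur-complement constraint [X x; x' 1] ⪰ 0 is used here as the standard way to express X ⪰ xx^T, and the data are random placeholders:

% Boolean least-squares: SDP relaxation + randomized rounding (sketch)
n = 10; A = randn(2*n, n); b = randn(2*n, 1);
P = A'*A; q = -A'*b; r = b'*b;
cvx_begin sdp quiet
    variable X(n, n) symmetric
    variable x(n)
    minimize( trace(P*X) + 2*q'*x + r )
    subject to
        diag(X) == 1;
        [X, x; x', 1] >= 0;          % equivalent to X >= x*x'
cvx_end
lower_bound = cvx_optval;
Sig = (X - x*x' + (X - x*x')')/2;    % symmetrize before factoring
L = chol(Sig + 1e-8*eye(n), 'lower');
best = inf;
for trial = 1:100                    % sample z ~ N(x, X - x*x'), round to signs
    xr = sign(x + L*randn(n, 1));
    xr(xr == 0) = 1;                 % guard against exact zeros
    best = min(best, norm(A*xr - b)^2);
end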

Example: the feasible set has 2^100 ≈ 10^30 points [figure: histogram of ‖Ax − b‖₂/(SDP bound) for 1000 randomized solutions from the SDP relaxation, with the SDP bound and the LS solution marked]. Cone programming 60

Nonnegative polynomial on R: f(t) = x_0 + x_1 t + ··· + x_{2m} t^{2m} ≥ 0 for all t ∈ R is a convex constraint on x; it holds if and only if f is a sum of squares of (two) polynomials:

f(t) = Σ_k (y_{k0} + y_{k1} t + ··· + y_{km} t^m)² = (1, t, ..., t^m) (Σ_k y_k y_k^T) (1, t, ..., t^m)^T = (1, t, ..., t^m) Y (1, t, ..., t^m)^T

where Y = Σ_k y_k y_k^T ⪰ 0. Cone programming 61

SDP formulation: f(t) ≥ 0 if and only if for some Y ⪰ 0,

f(t) = (1, t, ..., t^{2m}) (x_0, x_1, ..., x_{2m})^T = (1, t, ..., t^m) Y (1, t, ..., t^m)^T

this is an SDP constraint: there exists Y ⪰ 0 such that

x_0 = Y_11, x_1 = Y_12 + Y_21, x_2 = Y_13 + Y_22 + Y_31, ..., x_{2m} = Y_{m+1,m+1}

Cone programming 62

General sum-of-squares constraints: f(t) = x^T p(t) is a sum of squares if

x^T p(t) = Σ_{k=1}^s (y_k^T q(t))² = q(t)^T (Σ_{k=1}^s y_k y_k^T) q(t)

p, q: basis functions (of polynomials, trigonometric polynomials, ...); the independent variable t can be one- or multidimensional; a sufficient condition for nonnegativity of x^T p(t), useful in nonconvex polynomial optimization in several variables; in some nontrivial cases (e.g., polynomials on R), necessary and sufficient. Equivalent SDP constraint (on the variables x, X):

x^T p(t) = q(t)^T X q(t), X ⪰ 0

Cone programming 63

Example: Cosine polynomials, f(ω) = x_0 + x_1 cos ω + ··· + x_{2n} cos 2nω ≥ 0. Sum of squares theorem: f(ω) ≥ 0 for α ≤ ω ≤ β if and only if

f(ω) = g_1(ω)² + s(ω) g_2(ω)²

where g_1, g_2 are cosine polynomials of degree n and n − 1, and s(ω) = (cos ω − cos β)(cos α − cos ω) is a given weight function. Equivalent SDP formulation: f(ω) ≥ 0 for α ≤ ω ≤ β if and only if

x^T p(ω) = q_1(ω)^T X_1 q_1(ω) + s(ω) q_2(ω)^T X_2 q_2(ω), X_1 ⪰ 0, X_2 ⪰ 0

p, q_1, q_2: basis vectors (1, cos ω, cos 2ω, ...) up to order 2n, n, n − 1. Cone programming 64

Example: Linear-phase Nyquist filter

minimize sup_{ω_s ≤ ω ≤ π} |h_0 + h_1 cos ω + ··· + h_{2n} cos 2nω|

with h_0 = 1/M and h_{kM} = 0 for positive integer k. [figure: magnitude response H(ω) of a design with n = 25, M = 5, ω_s = 0.69] Cone programming 65

SDP formulation

minimize t
subject to −t ≤ H(ω) ≤ t, ω_s ≤ ω ≤ π

where H(ω) = h_0 + h_1 cos ω + ··· + h_{2n} cos 2nω. Equivalent SDP:

minimize t
subject to t − H(ω) = q_1(ω)^T X_1 q_1(ω) + s(ω) q_2(ω)^T X_2 q_2(ω)
t + H(ω) = q_1(ω)^T X_3 q_1(ω) + s(ω) q_2(ω)^T X_4 q_2(ω)
X_1 ⪰ 0, X_2 ⪰ 0, X_3 ⪰ 0, X_4 ⪰ 0

Variables: t, the h_i (i ≠ kM), and 4 matrices X_i of size roughly n. Cone programming 66

Chebyshev inequalities. Classical (two-sided) Chebyshev inequality:

prob(|X| < 1) ≥ 1 − σ²

holds for all random X with E X = 0, E X² = σ²; there exists a distribution that achieves the bound. Generalized Chebyshev inequalities give a lower bound on prob(X ∈ C), given moments of X. Cone programming 67

Chebyshev inequality for quadratic constraints: C is defined by quadratic inequalities

C = {x ∈ R^n : x^T A_i x + 2b_i^T x + c_i ≤ 0, i = 1, ..., m}

X is a random vector with E X = a, E XX^T = S. SDP formulation (variables P ∈ S^n, q ∈ R^n, r, τ_1, ..., τ_m ∈ R):

maximize 1 − tr(SP) − 2a^T q − r
subject to [P q; q^T r − 1] ⪰ τ_i [A_i b_i; b_i^T c_i], τ_i ≥ 0, i = 1, ..., m
[P q; q^T r] ⪰ 0

the optimal value is a tight lower bound on prob(X ∈ C). Cone programming 68

Example [figure]: a = E X; the dashed line shows {x : (x − a)^T (S − aa^T)^{-1} (x − a) = 1}; the lower bound on prob(X ∈ C) is achieved by the distribution shown in red; the ellipse is defined by x^T P x + 2q^T x + r = 1. Cone programming 69

Detection example: x = s + v, where x ∈ R^n is the received signal; s is the transmitted signal, s ∈ {s_1, s_2, ..., s_N} (one of N possible symbols); v is noise with E v = 0, E vv^T = σ²I. Detection problem: given an observed value of x, estimate s. Cone programming 70

Example (N = 7): the bound on the probability of correct detection of s_1 is 0.205 [figure: constellation s_1, ..., s_7; dots show a distribution with probability of correct detection 0.205]. Cone programming 71

Cone programming duality. Primal and dual cone program:

P: minimize c^T x subject to Ax ⪯_K b
D: maximize −b^T z subject to A^T z + c = 0, z ⪰_{K*} 0

the optimal values are equal (if the primal or dual is strictly feasible); the dual inequality is with respect to the dual cone K* = {z : x^T z ≥ 0 for all x ∈ K}; K* = K for linear, second-order cone, and semidefinite programming. Applications: optimality conditions, sensitivity analysis, algorithms, ... Cone programming 72

Interior-point methods Convex optimization MLSS 2009 Newton s method barrier method primal-dual interior-point methods problem structure 73

Equality-constrained convex optimization

minimize f(x)
subject to Ax = b

f twice continuously differentiable and convex. Optimality (Karush-Kuhn-Tucker or KKT) condition:

∇f(x) + A^T y = 0, Ax = b

Example: f(x) = (1/2) x^T P x + q^T x + r with P ⪰ 0:

[P A^T; A 0] [x; y] = [−q; b]

a symmetric indefinite set of equations, known as a KKT system (a numeric sketch follows below). Interior-point methods 74
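A small numeric sketch of this KKT solve (random placeholder data): one linear solve returns both the minimizer x and the multiplier y.

% equality-constrained QP via the KKT system (sketch)
n = 5; p = 2;
R = randn(n); P = R*R' + eye(n);      % P positive definite
q = randn(n, 1);
A = randn(p, n); b = randn(p, 1);
KKT = [P, A'; A, zeros(p)];
sol = KKT \ [-q; b];
x = sol(1:n); y = sol(n+1:end);
% optimality check: P*x + q + A'*y = 0 and A*x = b
norm(P*x + q + A'*y) + norm(A*x - b)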

Newton step: replace f with its second-order approximation f_q at a feasible x̂:

minimize f_q(x) = f(x̂) + ∇f(x̂)^T (x − x̂) + (1/2)(x − x̂)^T ∇²f(x̂)(x − x̂)
subject to Ax = b

the solution is x = x̂ + Δx_nt, with Δx_nt defined by

[∇²f(x̂) A^T; A 0] [Δx_nt; w] = [−∇f(x̂); 0]

Δx_nt is called the Newton step at x̂. Interior-point methods 75

Interpretation (for the unconstrained problem): x̂ + Δx_nt minimizes the 2nd-order approximation f_q [figure]; it also solves the linearized optimality condition

∇f_q(x) = ∇f(x̂) + ∇²f(x̂)(x − x̂) = 0

[figure: f' and its linearization at x̂]. Interior-point methods 76

Newton's algorithm: given a starting point x^(0) ∈ dom f with Ax^(0) = b and tolerance ε, repeat for k = 0, 1, ...:

1. compute the Newton step Δx_nt at x^(k) by solving [∇²f(x^(k)) A^T; A 0] [Δx_nt; w] = [−∇f(x^(k)); 0]
2. terminate if −∇f(x^(k))^T Δx_nt ≤ ε
3. x^(k+1) = x^(k) + t Δx_nt, with t determined by line search

Comments: ∇f(x^(k))^T Δx_nt is the directional derivative at x^(k) in the Newton direction; the line search is needed to guarantee f(x^(k+1)) < f(x^(k)) and global convergence. Interior-point methods 77
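An illustrative Matlab sketch of the unconstrained case (no equality constraints), applied to the analytic-centering objective f(x) = −Σ log(b_i − a_i^T x) with random placeholder data; the backtracking parameters are arbitrary choices:

% damped Newton method for f(x) = -sum(log(b - A*x)) (sketch)
m = 50; n = 10;
A = randn(m, n); b = rand(m, 1) + 1;          % x = 0 is strictly feasible
x = zeros(n, 1);
for k = 1:50
    d = b - A*x;
    g = A' * (1 ./ d);                         % gradient
    H = A' * diag(1 ./ d.^2) * A;              % Hessian
    dx = -H \ g;                               % Newton step
    if -g'*dx <= 1e-8, break; end              % stopping criterion
    t = 1;                                     % backtracking line search
    while min(b - A*(x + t*dx)) <= 0 || ...
          -sum(log(b - A*(x + t*dx))) > -sum(log(d)) + 0.25*t*(g'*dx)
        t = t/2;
    end
    x = x + t*dx;
end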

Example

f(x) = −Σ_{i=1}^n log(1 − x_i²) − Σ_{i=1}^m log(b_i − a_i^T x) (with n = 10^4, m = 10^5)

[figure: f(x^(k)) − inf f versus iteration k] high accuracy after a small number of iterations; fast asymptotic convergence. Interior-point methods 78

Classical convergence analysis. Assumptions (m, L are positive constants): f strongly convex: ∇²f(x) ⪰ mI; ∇²f Lipschitz continuous: ‖∇²f(x) − ∇²f(y)‖₂ ≤ L‖x − y‖₂. Summary: two regimes; damped phase (‖∇f(x)‖₂ large): for some constant γ > 0, f(x^(k+1)) ≤ f(x^(k)) − γ; quadratic convergence (‖∇f(x)‖₂ small): ‖∇f(x)‖₂ decreases quadratically. Interior-point methods 79

Self-concordant functions. Shortcomings of the classical convergence analysis: it depends on unknown constants (m, L, ...); the bound is not affinely invariant, although Newton's method is. Analysis for self-concordant functions (Nesterov and Nemirovski, 1994): a convex function of one variable is self-concordant if |f'''(x)| ≤ 2 f''(x)^{3/2} for all x ∈ dom f; a function of several variables is s.c. if its restriction to lines is s.c.; the analysis is affine-invariant and does not depend on unknown constants; developed for the complexity theory of interior-point methods. Interior-point methods 80

Interior-point methods

minimize f_0(x)
subject to f_i(x) ≤ 0, i = 1, ..., m
Ax = b

the functions f_i, i = 0, 1, ..., m, are convex. Basic idea: follow the central path through the interior of the feasible set to the solution [figure]. Interior-point methods 81

General properties: the path-following mechanism relies on Newton's method; every iteration requires solving a set of linear equations (KKT system); the number of iterations is small (10-50), fairly independent of problem size; some versions are known to have polynomial worst-case complexity. History: introduced in the 1950s and 1960s; used in polynomial-time methods for linear programming (1980s); polynomial-time algorithms for general convex optimization (ca. 1990). Interior-point methods 82

Reformulation via indicator function

minimize f_0(x)
subject to f_i(x) ≤ 0, i = 1, ..., m
Ax = b

Reformulation, where I_− is the indicator function of R_− (the nonpositive reals):

minimize f_0(x) + Σ_{i=1}^m I_−(f_i(x))
subject to Ax = b

I_−(u) = 0 if u ≤ 0, I_−(u) = ∞ otherwise; the reformulated problem has no inequality constraints; however, the objective function is not differentiable. Interior-point methods 83

Approximation via logarithmic barrier

minimize f_0(x) − (1/t) Σ_{i=1}^m log(−f_i(x))
subject to Ax = b

for t > 0, −(1/t) log(−u) is a smooth approximation of I_−; the approximation improves as t → ∞ [figure: −log(−u)/t for several values of t]. Interior-point methods 84

Logarithmic barrier function

φ(x) = −Σ_{i=1}^m log(−f_i(x)), with dom φ = {x : f_1(x) < 0, ..., f_m(x) < 0}

convex (follows from composition rules and convexity of the f_i); twice continuously differentiable, with derivatives

∇φ(x) = Σ_{i=1}^m (1/(−f_i(x))) ∇f_i(x)
∇²φ(x) = Σ_{i=1}^m (1/f_i(x)²) ∇f_i(x) ∇f_i(x)^T + Σ_{i=1}^m (1/(−f_i(x))) ∇²f_i(x)

Interior-point methods 85

Central path: the central path is {x*(t) : t > 0}, where x*(t) is the solution of

minimize t f_0(x) + φ(x)
subject to Ax = b

Example: central path for an LP, minimize c^T x subject to a_i^T x ≤ b_i, i = 1, ..., 6; the hyperplane c^T x = c^T x*(t) is tangent to the level curve of φ through x*(t) [figure: central path, with x* and x*(10) marked]. Interior-point methods 86

Barrier method: given strictly feasible x, t := t^(0) > 0, µ > 1, tolerance ε > 0, repeat:

1. Centering step: compute x*(t) and set x := x*(t)
2. Stopping criterion: terminate if m/t < ε
3. Increase t: t := µt

the stopping criterion m/t ≤ ε guarantees f_0(x) − optimal value ≤ ε (follows from duality); a typical value of µ is 10-20; several heuristics exist for the choice of t^(0); centering is usually done using Newton's method, starting at the current x (a sketch of the full loop for an LP follows below). Interior-point methods 87
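A compact Matlab sketch of the barrier method for an inequality-form LP (minimize c^T x subject to Ax ⪯ b), with the Newton centering written inline; the data and the parameters t^(0), µ are placeholders:

% barrier method for an LP (sketch)
m = 100; n = 20;
A = randn(m, n); b = rand(m, 1) + 1; c = randn(n, 1);   % x = 0 strictly feasible
x = zeros(n, 1); t = 1; mu = 20; tol = 1e-6;
while m/t >= tol
    for k = 1:50                                % centering: Newton on t*c'*x + phi(x)
        d = b - A*x;
        g = t*c + A' * (1 ./ d);
        H = A' * diag(1 ./ d.^2) * A;
        dx = -H \ g;
        if -g'*dx <= 1e-10, break; end
        s = 1;                                  % backtracking line search
        while min(b - A*(x + s*dx)) <= 0 || ...
              t*c'*(x + s*dx) - sum(log(b - A*(x + s*dx))) > ...
              t*c'*x - sum(log(d)) + 0.25*s*(g'*dx)
            s = s/2;
        end
        x = x + s*dx;
    end
    t = mu*t;                                   % increase t
end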

Example: inequality form LP with m = 100 inequalities, n = 50 variables [figure: duality gap versus Newton iterations for µ = 2, 50, 150, and total Newton iterations versus µ]; starts with x on the central path (t^(0) = 1, duality gap 100); terminates when t = 10^8 (gap m/t = 10^−6); the total number of Newton iterations is not very sensitive for µ ≥ 10. Interior-point methods 88

Family of standard LPs: minimize c^T x subject to Ax = b, x ⪰ 0, with A ∈ R^{m×2m}; for each m, solve 100 randomly generated instances [figure: Newton iterations versus m]; the number of iterations grows very slowly as m ranges over a 100:1 ratio. Interior-point methods 89

Second-order cone programming

minimize f^T x
subject to ‖A_i x + b_i‖₂ ≤ c_i^T x + d_i, i = 1, ..., m

Logarithmic barrier function:

φ(x) = −Σ_{i=1}^m log((c_i^T x + d_i)² − ‖A_i x + b_i‖₂²)

a convex function; −log(v² − u^T u) is the logarithmic barrier for the 2nd-order cone {(u, v) : ‖u‖₂ ≤ v}. Barrier method: follows the central path x*(t) = argmin(t f^T x + φ(x)). Interior-point methods 90

Example: 50 variables, 50 second-order cone constraints in R^6 [figure: duality gap versus Newton iterations for µ = 2, 50, 200, and total Newton iterations versus µ]. Interior-point methods 91

Semidefinite programming

minimize c^T x
subject to x_1 A_1 + ··· + x_n A_n ⪯ B

Logarithmic barrier function:

φ(x) = −log det(B − x_1 A_1 − ··· − x_n A_n)

a convex function; −log det X is the logarithmic barrier for the p.s.d. cone. Barrier method: follows the central path x*(t) = argmin(t c^T x + φ(x)). Interior-point methods 92

Example: 100 variables, one linear matrix inequality in S^100 [figure: duality gap versus Newton iterations for µ = 2, 50, 150, and total Newton iterations versus µ]. Interior-point methods 93

Complexity of the barrier method. Iteration complexity: can be bounded by a polynomial function of the problem dimensions (with the correct formulation and barrier function); examples: O(√m) iteration bound for an LP or SOCP with m inequalities, or an SDP with a constraint of order m; proofs rely on the theory of Newton's method for self-concordant functions; in practice, the number of iterations is roughly constant as a function of problem size. Linear algebra complexity: dominated by the solution of the Newton system. Interior-point methods 94

Primal-dual interior-point methods Similarities with barrier method follow the same central path linear algebra (KKT system) per iteration is similar Differences faster and more robust update primal and dual variables in each step no distinction between inner (centering) and outer iterations include heuristics for adaptive choice of barrier parameter t can start at infeasible points often exhibit superlinear asymptotic convergence Interior-point methods 95

Software implementations General-purpose software for nonlinear convex optimization several high-quality packages (MOSEK, Sedumi, SDPT3,... ) exploit sparsity to achieve scalability Customized implementations can exploit non-sparse types of problem structure often orders of magnitude faster than general-purpose solvers Interior-point methods 96

Example: ℓ1-regularized least-squares, with A an m×n dense matrix (m ≪ n)

minimize ‖Ax − b‖₂² + ‖x‖₁

Quadratic program formulation:

minimize ‖Ax − b‖₂² + 1^T u
subject to −u ⪯ x ⪯ u

the coefficient of the Newton system in an interior-point method is

[A^T A 0; 0 0] + [D_1 + D_2, D_2 − D_1; D_2 − D_1, D_1 + D_2] (D_1, D_2 positive diagonal)

very expensive (O(n^3)) for large n. Interior-point methods 97

Customized implementation: can reduce the Newton equation to the solution of a system

(A D_1 A^T + I) Δu = r

with cost per iteration O(m²n). Comparison (seconds, on a 3.2GHz machine; the general-purpose solver is MOSEK):

m     n      custom   general-purpose
50    100    0.02     0.05
50    200    0.03     0.17
100   1000   0.32     10.6
100   2000   0.71     76.9
500   1000   2.5      11.2
500   2000   5.5      79.8

Interior-point methods 98
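An illustrative timing sketch of why the reduction helps when n ≫ m (hypothetical diagonal D and right-hand side r, not the full interior-point iteration): the reduced m×m system costs O(m²n), versus O(n³) for an n×n solve.

% reduced Newton system vs. a dense n-by-n solve (sketch)
m = 100; n = 2000;
A = randn(m, n); d = rand(n, 1); r = randn(m, 1);
AD = A .* repmat(d', m, 1);                       % A*D1 with D1 = diag(d)
tic; u1 = (AD*A' + eye(m)) \ r; t_small = toc;    % O(m^2 n) + O(m^3)
tic; u2 = (A'*A + eye(n)) \ randn(n, 1); t_big = toc;   % O(n^3) reference
fprintf('reduced system: %.3f s,  n-by-n solve: %.3f s\n', t_small, t_big);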

First-order methods Convex optimization MLSS 2009 gradient method Nesterov s gradient methods extensions 99

Gradient method: to minimize a convex differentiable function f, choose x^(0) and repeat

x^(k) = x^(k−1) − t_k ∇f(x^(k−1)), k = 1, 2, ...

t_k is the step size (fixed, or determined by backtracking line search). Classical convergence result: assume ∇f is Lipschitz continuous (‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂); the error decreases as 1/k, hence O(1/ε) iterations are needed to reach accuracy f(x^(k)) − f* ≤ ε. First-order methods 100

Nesterov's gradient method: choose x^(0); take x^(1) = x^(0) − t_1 ∇f(x^(0)) and for k ≥ 2

y^(k) = x^(k−1) + ((k − 2)/(k + 1)) (x^(k−1) − x^(k−2))
x^(k) = y^(k) − t_k ∇f(y^(k))

a gradient method with extrapolation; if ∇f is Lipschitz continuous, the error decreases as 1/k², hence O(1/√ε) iterations are needed to reach accuracy f(x^(k)) − f* ≤ ε; many variations; the first one was published in 1983. First-order methods 101

Example

minimize log Σ_{i=1}^m exp(a_i^T x + b_i)

randomly generated data with m = 2000, n = 1000, fixed step size [figure: (f(x^(k)) − f*)/f* versus k for the gradient method and Nesterov's method; a code sketch follows below]. First-order methods 102
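A sketch of both methods on this objective (random placeholder data; the step 1/L with L = ‖A‖₂², an upper bound on the Lipschitz constant of the gradient, is a simple safe choice, not necessarily the one used for the figure):

% gradient method vs. Nesterov's method on f(x) = log sum exp(A*x + b) (sketch)
m = 2000; n = 1000;
A = randn(m, n)/sqrt(m); b = randn(m, 1);
softmax = @(u) exp(u - max(u)) / sum(exp(u - max(u)));
fval = @(x) max(A*x + b) + log(sum(exp(A*x + b - max(A*x + b))));
grad = @(x) A' * softmax(A*x + b);
t = 1 / norm(A)^2;                       % fixed step 1/L
x = zeros(n, 1);
for k = 1:200                            % plain gradient method
    x = x - t*grad(x);
end
xg = x;
xprev = zeros(n, 1);                     % Nesterov's method with extrapolation
x = xprev - t*grad(xprev);
for k = 2:200
    y = x + (k - 2)/(k + 1)*(x - xprev);
    xprev = x;
    x = y - t*grad(y);
end
fprintf('gradient: f = %.6f,  Nesterov: f = %.6f\n', fval(xg), fval(x));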

Interpretation of the gradient update

x^(k) = x^(k−1) − t_k ∇f(x^(k−1)) = argmin_z ( ∇f(x^(k−1))^T z + (1/(2t_k)) ‖z − x^(k−1)‖₂² )

Interpretation: x^(k) minimizes

f(x^(k−1)) + ∇f(x^(k−1))^T (z − x^(k−1)) + (1/(2t_k)) ‖z − x^(k−1)‖₂²

a simple quadratic model of f at x^(k−1). First-order methods 103

Projected gradient method

minimize f(x)
subject to x ∈ C

f convex, C a closed convex set

x^(k) = argmin_{z∈C} ( ∇f(x^(k−1))^T z + (1/(2t_k)) ‖z − x^(k−1)‖₂² ) = P_C( x^(k−1) − t_k ∇f(x^(k−1)) )

useful if the projection P_C onto C is inexpensive (e.g., box constraints); similar convergence result as for the basic gradient algorithm; can be used in fast Nesterov-type gradient methods. First-order methods 104

Nonsmooth components

minimize f(x) + g(x)

f, g convex, with f differentiable and g nondifferentiable

x^(k) = argmin_z ( ∇f(x^(k−1))^T z + g(z) + (1/(2t_k)) ‖z − x^(k−1)‖₂² )
     = argmin_z ( (1/(2t_k)) ‖z − x^(k−1) + t_k ∇f(x^(k−1))‖₂² + g(z) )
     = S_{t_k}( x^(k−1) − t_k ∇f(x^(k−1)) )

a gradient step for f followed by a thresholding operation S_t; useful if the thresholding is inexpensive (e.g., because g is separable); similar convergence result as the basic gradient method. First-order methods 105

Example: ℓ1-norm regularization

minimize f(x) + ‖x‖₁

f convex and differentiable. Thresholding operator:

S_t(y) = argmin_z ( (1/(2t)) ‖z − y‖₂² + ‖z‖₁ )

componentwise, S_t(y)_k = y_k − t if y_k ≥ t; 0 if −t ≤ y_k ≤ t; y_k + t if y_k ≤ −t [figure: soft-thresholding function]. First-order methods 106

ℓ1-Norm regularized least-squares

minimize (1/2) ‖Ax − b‖₂² + ‖x‖₁

randomly generated A ∈ R^{2000×1000}, fixed step [figure: (f(x^(k)) − f*)/f* versus k for the gradient (proximal) and Nesterov methods; a code sketch follows below]. First-order methods 107
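A short sketch of the proximal gradient iteration for this problem (random placeholder data; soft implements the thresholding operator S_t of the previous slide; adding the Nesterov extrapolation gives the accelerated variant):

% proximal gradient (ISTA) for (1/2)||Ax - b||_2^2 + ||x||_1 (sketch)
m = 200; n = 1000;
A = randn(m, n); b = randn(m, 1);
soft = @(y, t) sign(y) .* max(abs(y) - t, 0);   % thresholding operator S_t
t = 1 / norm(A)^2;                              % step 1/L with L = ||A||_2^2
x = zeros(n, 1);
for k = 1:500
    x = soft(x - t*(A'*(A*x - b)), t);          % gradient step on f, then S_t
end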

Summary: Advances in convex optimization Theory new problem classes, robust optimization, convex relaxations,... Applications new applications in different fields; surprisingly many discovered recently Algorithms and software high-quality general-purpose implementations of interior-point methods software packages for convex modeling new first-order methods 108