CO 367. Nonlinear Optimization. Dr. Dmitriy Drusvyatskiy Winter 2014 (1141) University of Waterloo


Contents

1 Lecture 1: Introduction to Nonlinear Optimization
  1.1 General formalism
  1.2 Unconstrained optimization
2 Lecture 4: Introduction to iterative methods for unconstrained optimization
  2.1 Line search method
  2.2 Trust region
3 Lecture 7: Trust region methods (a quick look)
  3.1 Trust region methods

Administrative. Instructor: Dmitriy Drusvyatskiy (MC 4012); TA: Ahmad Abdi. Webpage: learn.uwaterloo.ca. Textbook: The Mathematics of Nonlinear Programming by Peressini, Sullivan, and Uhl. Lecture notes are produced through a scribing system.

1 Lecture 1: Introduction to Nonlinear Optimization

1.1 General formalism

Notation: $\mathbb{R}^n$ is the set of ordered $n$-tuples $(x_1, \dots, x_n)$. The dot product will be denoted by
$$\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i,$$
and the norm on $\mathbb{R}^n$ is
$$\|x\| = \sqrt{\sum_{i=1}^n x_i^2}.$$

The general problem of nonlinear optimization: given $C^2$-smooth functions $f, g_1, \dots, g_m \colon \mathbb{R}^n \to \mathbb{R}$, find a minimizer of
$$\min f(x) \quad \text{s.t.} \quad g_i(x) \le 0 \text{ for } i = 1, \dots, m. \tag{$*$}$$
We call $f$ the objective function, the $g_i$ the constraint functions, and
$$D = \{x : g_i(x) \le 0 \text{ for all } i = 1, \dots, m\}$$
the feasible region. If $f, g_1, \dots, g_m$ are linear, then $(*)$ is called a linear program. This is a huge class of problems.

1.1 Example. Define $g_i(x) = x_i^2 - 1$ and $\hat{g}_i(x) = 1 - x_i^2$ for $i = 1, \dots, n$. Now $g_i(x) \le 0$ and $\hat{g}_i(x) \le 0$ together imply $x_i^2 = 1$, hence $x_i = \pm 1$. So smooth constraints can cut out a purely combinatorial feasible region.

There is a vast number of applications, both in applied math and engineering.

1.2 Definition. Consider $f \colon \mathbb{R}^n \to \mathbb{R}$ and a subset $D \subseteq \mathbb{R}^n$. Then a point $\bar{x}$ in $D$ is
- a global minimizer for $f$ on $D$ if $f(\bar{x}) \le f(x)$ for all $x \in D$;
- a strict global minimizer for $f$ on $D$ if $f(\bar{x}) < f(x)$ for all $x \in D$ with $x \ne \bar{x}$;
- a local minimizer for $f$ on $D$ if there exists $\epsilon > 0$ such that $f(\bar{x}) \le f(x)$ for all $x \in D \cap B_\epsilon(\bar{x})$.
The definition of a strict local minimizer is analogous.

Most algorithms in nonlinear programming are designed specifically to find local minimizers. Passing from local to global requires convexity.

1.3 Remark. One has to be careful with the constraints! This is illustrated by Whitney's theorem: for any closed set $D \subseteq \mathbb{R}^n$, there exists a $C^\infty$-smooth function $g \colon \mathbb{R}^n \to \mathbb{R}$ such that $D = \{x : g(x) = 0\}$.
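To make the formalism concrete, here is a minimal Python sketch, not from the course (the helper `is_feasible` and the toy objective are illustrative), that encodes an instance of $(*)$ as an objective plus a list of constraint functions and tests membership in the feasible region $D$, using the constraints of Example 1.1:

```python
# A minimal sketch of the general formalism (*): an instance is an objective f
# together with constraint functions g_i, and the feasible region is
# D = {x : g_i(x) <= 0 for all i}. All names here are illustrative.

def is_feasible(x, constraints, tol=1e-12):
    """Test membership in D = {x : g_i(x) <= 0 for all i}."""
    return all(g(x) <= tol for g in constraints)

n = 3
f = lambda x: sum(x)  # a toy objective to minimize over D

# Example 1.1: g_i(x) = x_i^2 - 1 and ghat_i(x) = 1 - x_i^2 together force x_i = +-1.
constraints = [lambda x, i=i: x[i] ** 2 - 1 for i in range(n)] + \
              [lambda x, i=i: 1 - x[i] ** 2 for i in range(n)]

print(is_feasible([1.0, -1.0, 1.0], constraints))  # True: every coordinate is +-1
print(is_feasible([0.5, -1.0, 1.0], constraints))  # False: 1 - 0.5**2 = 0.75 > 0
```

Every feasible point has all coordinates equal to $\pm 1$, so the smooth-looking program hides a combinatorial search space.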

So without further restrictions on the constraints, we might as well be optimizing a function over an arbitrary closed set. An easier problem is to design a procedure to check whether a point $\bar{x}$ is a local minimizer. Surprisingly, such a procedure can be used to design minimization algorithms! This brings us to the theme of the course: the interplay between optimality conditions and algorithm design. The constraints are really important; they need to be incorporated into everything, and we will also need conditions on how the constraints interact. Finding global minimizers is hopeless; we need to settle for local minimizers, unless convexity is present, in which case the two notions coincide.

1.2 Unconstrained optimization

The simplest situation is $f \colon \mathbb{R} \to \mathbb{R}$. The following is key.

1.4 Theorem (Taylor). Let $f \colon (a, b) \to \mathbb{R}$ be $C^2$-smooth. Then for any $\bar{x}$ and $x$ in $(a, b)$, there exists $z$ strictly between $\bar{x}$ and $x$ such that
$$f(x) = f(\bar{x}) + f'(\bar{x})(x - \bar{x}) + \frac{f''(z)}{2}(x - \bar{x})^2.$$

Proof. Bonus question on HW 1.

1.5 Corollary (Optimality Conditions I). Let $f \colon (a, b) \to \mathbb{R}$ be $C^2$-smooth. Then the following are true:
1. If $\bar{x}$ is a local minimizer of $f$, then $f'(\bar{x}) = 0$ and $f''(\bar{x}) \ge 0$.
2. If $\bar{x}$ satisfies $f'(\bar{x}) = 0$ and $f''(\bar{x}) > 0$, then $\bar{x}$ is a strict local minimizer.

1.6 Remark. If $f'(\bar{x}) = f''(\bar{x}) = 0$, then one cannot deduce anything about optimality. In some sense this situation is rare.

Proof (of Corollary 1.5).
1. Suppose $\bar{x}$ is a local minimizer.
Case 1: Suppose $f'(\bar{x}) > 0$. Take $x_i \uparrow \bar{x}$. The difference quotients $\frac{f(x_i) - f(\bar{x})}{x_i - \bar{x}}$ converge to $f'(\bar{x}) > 0$, so they are positive for large $i$; since $x_i - \bar{x} < 0$, this forces $f(x_i) < f(\bar{x})$ for large $i$, contradicting local minimality.
Case 2: Suppose $f'(\bar{x}) < 0$; argue similarly but take $x_i \downarrow \bar{x}$.
From this we deduce $f'(\bar{x}) = 0$. Now suppose $f''(\bar{x}) < 0$. There exists $\delta > 0$ such that $f''(x) < 0$ for all $x \in (\bar{x} - \delta, \bar{x} + \delta)$. Thus, for $x \in (\bar{x} - \delta, \bar{x} + \delta)$ with $x \ne \bar{x}$, Taylor gives a $z$ between $\bar{x}$ and $x$ such that
$$f(x) = f(\bar{x}) + \frac{f''(z)}{2}(x - \bar{x})^2 < f(\bar{x}),$$
a contradiction. So $f''(\bar{x}) \ge 0$.
2. Exercise. Follows directly from Taylor and the hypotheses.

For a real-valued function of several variables, the gradient of $f$ at $\bar{x}$ is
$$\nabla f(\bar{x}) = \left( \frac{\partial f}{\partial x_1}(\bar{x}), \dots, \frac{\partial f}{\partial x_n}(\bar{x}) \right)^T,$$
and the Hessian of $f$ at $\bar{x}$ is
$$\nabla^2 f(\bar{x}) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j}(\bar{x}) \right]_{i,j = 1, \dots, n}.$$

1.7 Theorem (Taylor II). Consider a $C^2$-smooth $f \colon U \to \mathbb{R}$ where $U$ is an open subset of $\mathbb{R}^n$. If $\bar{x}$ and $x$ are such that the segment $[\bar{x}, x] = \{\bar{x} + t(x - \bar{x}) : t \in [0, 1]\}$ is contained entirely in $U$, then there exists a point $z \in (\bar{x}, x)$ such that
$$f(x) = f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \tfrac{1}{2} \langle \nabla^2 f(z)(x - \bar{x}), x - \bar{x} \rangle.$$
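The tests in Corollary 1.5 are easy to experiment with numerically. The following sketch, not from the notes, approximates $f'$ and $f''$ by central differences (the step sizes and tolerance are loose, conventional choices) and applies the second-order test at a candidate point:

```python
# A numerical illustration of Corollary 1.5 (a sketch; central differences with
# loosely chosen step sizes, not from the course materials).

def d1(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)              # approximates f'(x)

def d2(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2    # approximates f''(x)

def classify(f, x, tol=1e-6):
    """Apply the second-order test at a candidate point x."""
    if abs(d1(f, x)) > tol:
        return "not a critical point"
    if d2(f, x) > tol:
        return "strict local minimizer (f' = 0, f'' > 0)"
    if d2(f, x) < -tol:
        return "not a local minimizer (f'' < 0)"
    return "inconclusive (f' = 0, f'' = 0), as in Remark 1.6"

print(classify(lambda x: (x - 1) ** 2 + 3, 1.0))  # strict local minimizer
print(classify(lambda x: x ** 3, 0.0))            # inconclusive
```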

Proof. Choose $\epsilon > 0$ satisfying $\bar{x} + t(x - \bar{x}) \in U$ for all $t \in (-\epsilon, 1 + \epsilon)$. Define
$$\psi(t) = f(\bar{x} + t(x - \bar{x})).$$
A question on HW 1 will ask you to verify that
$$\psi'(t) = \langle \nabla f(\bar{x} + t(x - \bar{x})), x - \bar{x} \rangle, \qquad \psi''(t) = \langle \nabla^2 f(\bar{x} + t(x - \bar{x}))(x - \bar{x}), x - \bar{x} \rangle.$$
Now apply Taylor I to get $s$ such that the Taylor equation holds for $\psi$. Define $z = \bar{x} + s(x - \bar{x})$ and check that this works.

1.8 Definition (positive definite matrices). An $n \times n$ symmetric matrix $A$ is
- positive semidefinite ($A \succeq 0$) if $\langle Ax, x \rangle \ge 0$ for all $x \in \mathbb{R}^n$;
- positive definite ($A \succ 0$) if $\langle Ax, x \rangle > 0$ for all $0 \ne x \in \mathbb{R}^n$.

1.9 Theorem (Multivariate Optimality Conditions). Let $f \colon U \to \mathbb{R}$ be $C^2$-smooth on an open set $U$ in $\mathbb{R}^n$. Then the following are true:
1. If $\bar{x}$ is a local minimizer of $f$, then $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x}) \succeq 0$.
2. If $\bar{x}$ satisfies $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x}) \succ 0$, then $\bar{x}$ is a strict local minimizer of $f$.

Proof. Analogous to the one-dimensional case.

1.11 Definition (critical points). A point $\bar{x} \in U$ is a critical point of $f \colon U \to \mathbb{R}$ if $\nabla f(\bar{x})$ exists and satisfies $\nabla f(\bar{x}) = 0$.

1.12 Remark. Here is a naive recipe for minimizing $f \colon \mathbb{R}^n \to \mathbb{R}$:
1. Find all critical points $\bar{x}$ of $f$.
2. Check if $\nabla^2 f(\bar{x})$ is positive definite.

For this we need to be able to check whether a matrix is positive definite. The definition merely states that $A \succ 0$ iff $\langle Ax, x \rangle > 0$ for all $x \ne 0$; this is not a practical test. How to check in practice? Principal minors, or eigenvalues. Consider a symmetric matrix
$$A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{pmatrix}$$
and let $I$ be a subset of $\{1, \dots, n\}$ (e.g. $I = \{2, 4\}$). Let $A[I]$ be the restriction of $A$ to the rows and columns indexed by $I$.

1.13 Example. If $I = \{1, 3\}$, then $A[I] = \begin{pmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{pmatrix}$.

1.14 Definition. We have:
1. $\det A[I]$ is called a principal minor of $A$.
2. If $I = \{1, \dots, k\}$, then $\det A[I]$ is called a leading principal minor of $A$.
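Definitions 1.13-1.14 translate directly into code. Here is a sketch assuming NumPy; the example matrix is an arbitrary stand-in, since the matrix in the original Example 1.13 did not survive transcription:

```python
# Sketch of Definitions 1.13-1.14 with NumPy: A[I] is the submatrix of A on the
# rows and columns indexed by I, and det A[I] is the corresponding principal minor.
import numpy as np
from itertools import combinations

def principal_minor(A, I):
    I = list(I)
    return np.linalg.det(A[np.ix_(I, I)])    # det A[I]

def all_principal_minors(A):
    n = A.shape[0]
    return {I: principal_minor(A, I)
            for k in range(1, n + 1)
            for I in combinations(range(n), k)}

def leading_principal_minors(A):
    # Corresponds to I = {1, ..., k} in the notes' 1-based indexing.
    return [principal_minor(A, range(k + 1)) for k in range(A.shape[0])]

A = np.array([[2.0, -1.0, 0.0],       # an arbitrary symmetric example matrix
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
print(principal_minor(A, [0, 2]))     # I = {1, 3} in the notes' notation: det = 4
print(leading_principal_minors(A))    # [2.0, 3.0, 4.0]
```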

1.15 Theorem. We have:
1. $A \succeq 0$ if and only if all of its principal minors are $\ge 0$.
2. $A \succ 0$ if and only if all of its principal minors are $> 0$.
3. $A \succ 0$ if and only if all of its leading principal minors are $> 0$.

1.16 Remark. The analog of 3 for $\succeq 0$ is false! See the remark in the book.

Recall that $0 \ne v \in \mathbb{R}^n$ is an eigenvector of $A$ if there exists $\lambda \in \mathbb{R}$ such that $Av = \lambda v$. The number $\lambda$ is called an eigenvalue.

1.17 Theorem. We have:
1. $A \succeq 0$ iff all of its eigenvalues are $\ge 0$.
2. $A \succ 0$ iff all of its eigenvalues are $> 0$.

1.18 Example. Find the global and local minimizers of
$$f(x, y) = x^3 - 12xy + 8y^3.$$
We compute
$$\nabla f(x, y) = \begin{pmatrix} 3x^2 - 12y \\ -12x + 24y^2 \end{pmatrix}, \qquad \nabla^2 f(x, y) = \begin{pmatrix} 6x & -12 \\ -12 & 48y \end{pmatrix}.$$
Solving $\nabla f(x, y) = (0, 0)$ gives $(x, y) = (0, 0)$ or $(x, y) = (2, 1)$.
Case 1: If $(x, y) = (2, 1)$, then
$$\nabla^2 f(2, 1) = \begin{pmatrix} 12 & -12 \\ -12 & 48 \end{pmatrix}.$$
Observe $12 > 0$; this is the first leading principal minor. Also,
$$\det \begin{pmatrix} 12 & -12 \\ -12 & 48 \end{pmatrix} = 432 > 0 \implies \nabla^2 f(2, 1) \succ 0 \implies (2, 1) \text{ is a strict local minimizer.}$$
Case 2: If $(x, y) = (0, 0)$, then
$$\nabla^2 f(0, 0) = \begin{pmatrix} 0 & -12 \\ -12 & 0 \end{pmatrix}, \qquad \det \nabla^2 f(0, 0) = -144 < 0.$$
Is $(0, 0)$ a local maximizer or a local minimizer? Neither: $f(x, 0) = x^3$ takes values above and below $f(0, 0) = 0$ arbitrarily close to the origin. Moreover, $f(x, 0) = x^3 \to -\infty$, so $f$ has no global minimizer.

Question: When do minimizers of $f \colon \mathbb{R}^n \to \mathbb{R}$ exist?

1.19 Example. $f(x) = e^x$ is bounded below but has no minimizer.

1.20 Theorem (*). If $f \colon \mathbb{R}^n \to \mathbb{R}$ is continuous, then it has a global minimizer on any closed and bounded subset $D \subseteq \mathbb{R}^n$.

Proof. See the bonus question on the HW.

What about on all of $\mathbb{R}^n$?

1.21 Definition. A continuous function $f \colon \mathbb{R}^n \to \mathbb{R}$ is coercive if for any sequence $x_i$ with $\|x_i\| \to \infty$, it must be the case that $f(x_i) \to \infty$.

1.22 Example. We have:
- $f_1(x) = \|x\|^2$ is coercive.
- $f_2(x) = \langle Ax, x \rangle$ for $A \succ 0$ is coercive; see the problem on HW 2.
- $g(x) = x$ is not coercive.
- $h(x) = e^x$ is not coercive.

1.23 Theorem. A coercive function $f \colon \mathbb{R}^n \to \mathbb{R}$ always has a global minimizer.

Proof. Choose $r \in \mathbb{R}$ greater than the infimum of $f$, and consider $L := \{x : f(x) \le r\}$. Then $L$ is nonempty, bounded (because $f$ is coercive), and closed (because $f$ is continuous; note $L = f^{-1}((-\infty, r])$). By Theorem (*), there exists a minimizer of $f$ on $L$, call it $\bar{x}$. For any $x \in L$, we have $f(x) \ge f(\bar{x})$. For any $x \notin L$, we have $f(x) > r \ge f(\bar{x})$. Thus $\bar{x}$ is a global minimizer of $f$.
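Theorems 1.15(3) and 1.17 give two practical positive-definiteness tests. The sketch below, assuming NumPy, applies both to the two Hessians from Example 1.18:

```python
# Checking Example 1.18 numerically (a sketch, assuming NumPy): apply Theorem
# 1.15(3) (leading principal minors) and Theorem 1.17 (eigenvalues) to the two
# Hessians of f(x, y) = x^3 - 12xy + 8y^3.
import numpy as np

def hessian(x, y):
    return np.array([[6.0 * x, -12.0],
                     [-12.0, 48.0 * y]])

def is_pd_minors(A):
    # A > 0 iff every leading principal minor is > 0 (Theorem 1.15(3)).
    return all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, A.shape[0] + 1))

def is_pd_eigen(A):
    # A > 0 iff all eigenvalues are > 0 (Theorem 1.17).
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

H1 = hessian(2.0, 1.0)  # at the critical point (2, 1)
H2 = hessian(0.0, 0.0)  # at the critical point (0, 0)
print(is_pd_minors(H1), is_pd_eigen(H1))  # True True -> strict local minimizer
print(is_pd_minors(H2), is_pd_eigen(H2))  # False False
print(np.linalg.eigvalsh(H2))  # eigenvalues -12, 12: neither local min nor max
```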

2 Lecture 4: Introduction to iterative methods for unconstrained optimization

Recall that we are interested in finding a minimizer of a $C^2$-smooth function $f \colon \mathbb{R}^n \to \mathbb{R}$.

Iterative method: a procedure that produces a sequence $\{x_k\}$ in $\mathbb{R}^n$ that we can expect to converge to a critical point of $f$. There are two fundamental strategies: line search methods and trust region methods.

2.1 Line search method

At each iteration $k$, you choose a direction $0 \ne v_k \in \mathbb{R}^n$ and then choose $\alpha_k \ge 0$ that approximately solves
$$\min_{\alpha > 0} f(x_k + \alpha v_k). \tag{$*$}$$
Then declare $x_{k+1} = x_k + \alpha_k v_k$. Note: finding the exact minimizer of $(*)$ is usually expensive and unnecessary. For these methods, the main points are how to choose a good direction $v_k$ and then a good $\alpha_k$.

2.2 Trust region

In each iteration, we construct (or update) a model of $f$: a simple function $m_k \colon \mathbb{R}^n \to \mathbb{R}$ that approximates $f$ well on a set $\Omega_k$ containing $x_k$. Then we compute the minimizer $\bar{x}$ of
$$\min_x m_k(x) \quad \text{such that} \quad x \in \Omega_k.$$
If $f(\bar{x})$ is close to $m_k(\bar{x})$, then we declare $x_{k+1} := \bar{x}$. If not, we shrink $\Omega_k$ and repeat. Usually $\Omega_k$ is a ball or a box around $x_k$.

The line search and trust region approaches differ in the order in which they choose a direction and a stepsize.

Comparing algorithms:
- Iteration count: the number of iterations needed to get within $\epsilon$ of an optimal solution.
- Cost of each iteration (e.g. the number of matrix-vector multiplications, eigenvalue decompositions, function calls, gradient evaluations).
These two criteria are often opposing. Designing an algorithm requires you to know what information can be gathered about the function. We will assume $f(x_k)$, $\nabla f(x_k)$, and $\nabla^2 f(x_k)$ are available.

Notation: $o(t)$ stands for any function satisfying
$$\lim_{t \to 0} \frac{o(t)}{t} = 0.$$

2.1 Example. $f(t) = t^2$ is $o(t)$; $g(t) = t$ is not $o(t)$.

For $f \colon \mathbb{R} \to \mathbb{R}$ that is $C^2$-smooth:
$$f(\bar{x} + t) = f(\bar{x}) + t f'(\bar{x}) + o(t),$$
$$f(\bar{x} + t) = f(\bar{x}) + t f'(\bar{x}) + \tfrac{1}{2} t^2 f''(\bar{x}) + o(t^2).$$
Multivariate version: for $f \colon \mathbb{R}^n \to \mathbb{R}$ that is $C^2$-smooth:
$$f(\bar{x} + tv) = f(\bar{x}) + t \langle \nabla f(\bar{x}), v \rangle + o(t),$$
$$f(\bar{x} + tv) = f(\bar{x}) + t \langle \nabla f(\bar{x}), v \rangle + \tfrac{1}{2} t^2 \langle \nabla^2 f(\bar{x}) v, v \rangle + o(t^2).$$

Line search methods in detail: there are two things to understand: (1) the direction $v_k$, and (2) the step size $\alpha_k$. We would like to ensure
$$f(x_k) > f(x_{k+1}) > f(x_{k+2}) > \dots$$
That means $v_k$ needs to define a direction of decrease.
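The notes leave the choice of $\alpha_k$ open at this point. As one standard concrete instance (a sketch; the constant $c = 10^{-4}$ and the halving factor are conventional defaults, not from the lecture), here is backtracking along the steepest descent direction $v_k = -\nabla f(x_k)$, anticipating Example 2.3 below: shrink $\alpha$ until a sufficient-decrease condition holds, which guarantees $f(x_{k+1}) < f(x_k)$.

```python
# A minimal line search iteration (a sketch): v_k = -grad f(x_k), with alpha_k
# found by backtracking until the sufficient-decrease (Armijo) condition
# f(x + alpha v) <= f(x) + c * alpha * <grad f(x), v> holds. The constants
# c = 1e-4 and the halving factor 0.5 are conventional, not from the notes.
import numpy as np

def backtracking_step(f, grad, x, c=1e-4, shrink=0.5, alpha0=1.0):
    g = grad(x)
    v = -g                      # steepest descent direction
    alpha = alpha0
    while f(x + alpha * v) > f(x) + c * alpha * np.dot(g, v):
        alpha *= shrink         # approximately solve min_{alpha>0} f(x_k + alpha v_k)
    return x + alpha * v

f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x = np.array([1.0, 1.0])
for k in range(50):
    x = backtracking_step(f, grad, x)   # f(x_k) decreases at every iteration
print(x, f(x))                          # close to the minimizer (0, 0)
```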

2.2 Theorem. For any $v$ satisfying $\langle v, \nabla f(\bar{x}) \rangle < 0$, there exists $\delta > 0$ such that $f(\bar{x} + tv) < f(\bar{x})$ for all $t \in (0, \delta)$.

Proof. Choose $v \in \mathbb{R}^n$ such that $\langle v, \nabla f(\bar{x}) \rangle < 0$. Then $f(\bar{x} + tv) = f(\bar{x}) + t \langle \nabla f(\bar{x}), v \rangle + o(t)$, so
$$\frac{f(\bar{x} + tv) - f(\bar{x})}{t} = \langle \nabla f(\bar{x}), v \rangle + \frac{o(t)}{t} < 0 \quad \text{for all small } t > 0,$$
and hence $f(\bar{x} + tv) < f(\bar{x})$ for all small $t > 0$.

2.3 Example. $v = -\nabla f(\bar{x})$ works. In fact,
$$\frac{-\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|}$$
is the unique minimizer of $\min \langle v, \nabla f(\bar{x}) \rangle$ subject to $\|v\| = 1$.

Proof. For any $v$ with $\|v\| = 1$, we have by the Cauchy-Schwarz inequality
$$\langle v, \nabla f(\bar{x}) \rangle \ge -\|v\| \|\nabla f(\bar{x})\| = -\|\nabla f(\bar{x})\|.$$
But
$$\left\langle \frac{-\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|}, \nabla f(\bar{x}) \right\rangle = -\|\nabla f(\bar{x})\|,$$
and equality holds in Cauchy-Schwarz only for parallel vectors. This gives the result.

3 Lecture 7: Trust region methods (a quick look)

Last time:

3.1 Theorem (Convergence of Newton's Method). Suppose $f \colon \mathbb{R}^n \to \mathbb{R}$ is $C^2$-smooth, $\nabla^2 f(x^*)$ is positive definite, and $\nabla f(x^*) = 0$ (plus a minor technical condition). Consider the iterates $x_{k+1} = x_k + t_k v_N$, where $v_N = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ and the $t_k$ are chosen to satisfy the Wolfe conditions (with $c_1 < 1/2$). Then if the starting point $x_0$ is sufficiently close to $x^*$, we have:
1. $t_k = 1$ satisfies the Wolfe conditions;
2. $x_k$ converges to $x^*$;
3. if we choose $t_k = 1$, then we have quadratic convergence:
$$\|x_{k+1} - x^*\| \le r \|x_k - x^*\|^2, \qquad \|\nabla f(x_{k+1})\| \le r \|\nabla f(x_k)\|^2$$
for some $r \ge 0$.

In practice, to get global convergence of Newton's method, in each iteration $k$ you consider $\nabla^2 f(x_k)$. Remember
$$v_N = -[\nabla^2 f_k]^{-1} \nabla f_k \quad \text{when } \nabla^2 f_k \succ 0.$$
If $\nabla^2 f_k$ is not positive definite, then replace $\nabla^2 f_k$ by a close positive definite matrix in the formula for $v_k$. Two approaches (see the sketch below):
1. Set $v_k = -(\nabla^2 f_k + \lambda I)^{-1} \nabla f_k$ for $\lambda$ large. One choice is $\lambda = \delta - \lambda_1(\nabla^2 f_k)$, where $\lambda_1$ denotes the minimum eigenvalue and $\delta > 0$, so that $\nabla^2 f_k + \lambda I$ has minimum eigenvalue $\delta$.
2. Diagonalize
$$\nabla^2 f_k = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^T,$$
set all negative eigenvalues to $\delta > 0$, and multiply back by $U$.
Then run a line search to get $t_k$. If $\nabla^2 f_k \succ 0$, first check whether $t_k = 1$ is acceptable; otherwise run the line search.
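Here is a small sketch of the two Hessian fixes just described, assuming NumPy (the floor $\delta = 10^{-3}$ is an arbitrary illustrative choice). Both produce a positive definite matrix, hence a descent direction, even where the true Hessian is indefinite:

```python
# Sketch of the two Hessian fixes above (assuming NumPy; the floor delta = 1e-3
# is an arbitrary illustrative choice).
import numpy as np

def newton_direction_shift(H, g, delta=1e-3):
    # Approach 1: v = -(H + lambda I)^{-1} g with lambda = delta - lambda_min(H)
    # whenever H is not sufficiently positive definite.
    lam_min = np.linalg.eigvalsh(H)[0]          # eigvalsh returns ascending order
    lam = max(0.0, delta - lam_min)
    return -np.linalg.solve(H + lam * np.eye(H.shape[0]), g)

def newton_direction_clip(H, g, delta=1e-3):
    # Approach 2: diagonalize H = U diag(lambda_i) U^T and raise every
    # eigenvalue below delta up to delta before inverting.
    lams, U = np.linalg.eigh(H)
    H_pd = U @ np.diag(np.maximum(lams, delta)) @ U.T
    return -np.linalg.solve(H_pd, g)

H = np.array([[0.0, -12.0], [-12.0, 0.0]])  # the indefinite Hessian from Example 1.18
g = np.array([1.0, 1.0])
v1 = newton_direction_shift(H, g)
v2 = newton_direction_clip(H, g)
print(v1, v1 @ g)   # <v, grad f> < 0: a descent direction
print(v2, v2 @ g)
```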

3.1 Trust region methods

Given $f \colon \mathbb{R}^n \to \mathbb{R}$ and an iterate $x_k$, approximate $f$ by a local model function
$$m_k(v) = f(x_k) + \langle \nabla f(x_k), v \rangle + \tfrac{1}{2} \langle B_k v, v \rangle.$$
Then set $v_k$ to be the minimizer of
$$\min_v m_k(v) \quad \text{subject to} \quad \|v\| \le \Delta_k,$$
where $\Delta_k \ge 0$ is the trust region radius. Set $x_{k+1} = x_k + v_k$, then adjust $\Delta_k$ to get $\Delta_{k+1}$.

The key quantity (actual decrease over predicted decrease) is
$$\rho_k := \frac{f(x_k) - f(x_k + v_k)}{m_k(0) - m_k(v_k)}.$$

Algorithm (trust region). Given $\hat{\Delta} > 0$, $\Delta_0 \in (0, \hat{\Delta})$ and $\eta \in [0, 1/4)$:

    for k = 0, 1, 2, ...
        Obtain v_k by APPROXIMATELY solving min m_k(v) s.t. ||v|| <= Delta_k
        Evaluate rho_k
        if rho_k < 1/4
            Delta_{k+1} = (1/4) * Delta_k
        else if rho_k > 3/4
            Delta_{k+1} = min(2 * Delta_k, Delta_hat)
        else
            Delta_{k+1} = Delta_k
        if rho_k > eta
            x_{k+1} = x_k + v_k
        else
            x_{k+1} = x_k
    endfor

The art here is how to approximately solve the trust region subproblem
$$\min_{\|v\| \le \Delta_k} m_k(v).$$

[Here is the basic idea. Just as in the line search method, to define "approximate" we need some kind of baseline; there the baseline was the derivative. Here, too, we need a baseline: a quickly computable point that does reasonably well on the subproblem, so that any method doing at least as well is guaranteed to behave well. How can you find an OK solution, not the true minimizer, but a quick and dirty one? Steepest descent is the simplest thing available: take the steepest descent direction for $f$ and minimize $m_k$ along that direction, subject to the constraint $\|v\| \le \Delta_k$. That is not hard, and it gives a baseline called the Cauchy point. As long as your method achieves a fixed fraction of the improvement of the Cauchy point, it will do well. There is a 1000-page book just on trust region methods.]

Now we are going to go back to the second chapter, on convexity; read that too. However, we will not follow the book; we will follow the lecture notes of Stephen Boyd. This means that for a week or two nobody has to scribe anything. [The professor posts all lecture notes online, so these notes have been discontinued.]
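As a concrete rendering of the algorithm above, here is a runnable sketch assuming NumPy, with $B_k = \nabla^2 f(x_k)$ and the subproblem solved approximately at the Cauchy point from the bracketed aside: minimize $m_k$ along $-\nabla f(x_k)$ subject to $\|v\| \le \Delta_k$. The parameter values follow the pseudocode; the test function is from Example 1.18.

```python
# A runnable sketch of the trust region algorithm above (assuming NumPy), with
# B_k the exact Hessian and the subproblem solved approximately at the Cauchy
# point: the minimizer of m_k along -grad f(x_k) subject to ||v|| <= Delta_k.
import numpy as np

def cauchy_point(g, B, Delta):
    gBg = g @ B @ g
    ng = np.linalg.norm(g)
    # Minimize m_k(-tau * g / ||g||) over tau in [0, Delta].
    tau = Delta if gBg <= 0 else min(Delta, ng ** 3 / gBg)
    return -(tau / ng) * g

def trust_region(f, grad, hess, x0, Delta_hat=2.0, Delta0=0.5, eta=0.1, iters=100):
    x, Delta = np.asarray(x0, dtype=float), Delta0
    for _ in range(iters):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < 1e-10:
            break
        v = cauchy_point(g, B, Delta)
        m_drop = -(g @ v + 0.5 * v @ B @ v)   # m_k(0) - m_k(v_k) > 0
        rho = (f(x) - f(x + v)) / m_drop      # actual over predicted decrease
        if rho < 0.25:
            Delta *= 0.25
        elif rho > 0.75:
            Delta = min(2 * Delta, Delta_hat)
        if rho > eta:                         # accept the step
            x = x + v
    return x

# Example 1.18: f(x, y) = x^3 - 12xy + 8y^3, started near the minimizer (2, 1).
f = lambda x: x[0] ** 3 - 12 * x[0] * x[1] + 8 * x[1] ** 3
grad = lambda x: np.array([3 * x[0] ** 2 - 12 * x[1], -12 * x[0] + 24 * x[1] ** 2])
hess = lambda x: np.array([[6.0 * x[0], -12.0], [-12.0, 48.0 * x[1]]])
print(trust_region(f, grad, hess, [1.5, 1.5]))  # approximately (2, 1)
```

Since every accepted step here achieves the Cauchy decrease, any refinement of the subproblem solver that does at least as well inherits the same behavior, which is exactly the baseline idea from the aside.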
