B553 Lecture 7: Constrained Optimization, Lagrange Multipliers, and KKT Conditions


Kris Hauser
February 2, 2012

Constraints on parameter values are an essential part of many optimization problems, and arise from a variety of mathematical, physical, and resource limitations. Depending on their complexity, they can require significant work to handle; in general, constrained optimization algorithms are much more complex than their unconstrained counterparts. A constrained optimization problem is specified in the form

    min_{x ∈ R^n} f(x)  such that  x ∈ S    (1)

where S ⊆ R^n denotes the subset of valid parameters, known as the feasible set (Figure 1). S must be a closed set to guarantee the existence of a minimum.

Recall that in the univariate case of optimizing a function within some interval [a, b], we had to test the endpoints of the interval as well as the critical points in the interior (a, b) for optimality. In the multivariate constrained setting, the optimizer must likewise consider not only the possibility that the optimum is an interior local minimum, but also that it lies on the boundary of the feasible set (Figure 2). The challenge is that the boundary of S contains infinitely many points. This lecture introduces two analytical techniques for identifying candidate critical points: Lagrange multipliers, for equality constraints, and the Karush-Kuhn-Tucker (KKT) conditions, for inequality constraints. Besides being analytically useful, these conditions are the starting point for most constrained optimization algorithms. Note that, like other critical-point tests, they are only first-order conditions for optimality, and are therefore necessary but not sufficient for finding minima.

1 Common types of constraints

Several forms of constraints arise in practice. Here are some of the most common ones (Figure 3).

Bound constraints. Axis-aligned bound constraints take the form l_i ≤ x_i ≤ u_i for some lower and upper values l_i and u_i, i = 1, ..., n. These are among the easiest constraints to incorporate.

Linear inequalities. Linear inequality constraints take the form Ax ≤ b for some m × n matrix A and a length-m vector b. Note that bound constraints are a special case of linear inequalities with

    A = [I; -I]    (2)

and

    b = [u; -l]    (3)

where u and l are the vectors of upper and lower bounds, respectively, I is the n × n identity matrix, and the semicolon denotes vertical stacking.

Linear equalities. Linear equality constraints take the form Ax = b, where A and b have m rows. Note that this is usually an underdetermined system (otherwise S would consist of either a single point or the empty set). In theory these constraints can easily be eliminated by finding a representation that incorporates the nullspace of A, say x = x_0 + Ny, and converting the optimization over x into a smaller optimization over y. In practice, however, most optimization routines do not operate this way, because of numerical errors in computing N.

Nonlinear constraints: general form. In general, constraints may be nonlinear. In this setting we can (usually) write the constraints in the form

    g_i(x) = 0  for i = 1, ..., m
    h_j(x) ≤ 0  for j = 1, ..., p    (4)

where the g_i and h_j are continuous, differentiable scalar fields. This is the form we will assume for the rest of this class, because all of the prior constraint types are special cases of it.

Convex constraints. A convex set S satisfies the following property: for any two points x and y in S, the point

    (1 - u)x + uy    (5)

for u ∈ [0, 1] lies in S as well. In other words, the line segment between any two points in S must also lie in S. Later we will discuss efficient algorithms for solving problems in which the constraints g_i and h_j produce a convex feasible set and the objective function f is also convex. In particular, we will show that descent methods converge to a global minimum. (Note that to achieve convexity, any equality constraints must be linear.)

Black-box constraints. Another type of constraint is a black box that can be queried to test whether a point x lies inside it. No other mathematical property, such as the magnitude of the feasibility violation, derivatives, or even smoothness, is necessarily provided. These constraints typically arise from complex procedures (e.g., simulations or geometric algorithms) that have no convenient mathematical representation. They are rarely considered in the numerical optimization literature, but often come up in large practical systems.

2 First-Order Conditions of Local Optimality

We say that a feasible point x is a local minimum of the optimization problem (1) if f(x) is lower than the value of f at any other feasible point in some neighborhood of x within S. That is, x is a local minimum if x ∈ S and there exists a neighborhood of radius ε such that f(x) < f(y) for all y ∈ {y ∈ S | 0 < d(x, y) < ε}. Unfortunately, not all local minima are critical points of f, because we must take into account how the constraints shape the neighborhood! We will show that there are alternative criteria that we can use to generate candidates for local minima.
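As a one-dimensional illustration of a boundary minimum that is not a critical point, consider minimizing f(x) = x^2 over S = [1, 2] (a toy example made up here, not from the notes). A short NumPy sketch:

```python
import numpy as np

# Illustrative 1-D problem: minimize f(x) = x^2 over the feasible set S = [1, 2].
# The unconstrained critical point x = 0 lies outside S, so the constrained
# minimum sits on the boundary of S, where the derivative is nonzero.
xs = np.linspace(1.0, 2.0, 1001)   # dense grid over S
fx = xs**2
x_star = xs[np.argmin(fx)]

print(x_star)       # 1.0: the minimizer is the left endpoint of S
print(2 * x_star)   # f'(x*) = 2, not 0: x* is not a critical point of f
```

An interior-point test alone would miss this minimizer entirely, which is why the conditions below treat the boundary separately.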

2.1 Lagrange Multipliers

Let us suppose for the moment that there are no inequality constraints, and that we are addressing the general equality-constrained problem

    min_{x ∈ R^n} f(x)  such that  g_i(x) = 0 for i = 1, ..., m.    (6)

We will assume that f and all of the g_i are differentiable.

With one constraint. First let us consider the m = 1 case. The principle of Lagrange multipliers states that any local minimum or maximum x of (6) must simultaneously satisfy the equations

    ∇f(x) + λ ∇g_1(x) = 0
    g_1(x) = 0    (7)

for some value of λ. The variable λ is known as the Lagrange multiplier. These equations say that at x, the gradient of f is a multiple of the gradient of g_1, which is to say that the two gradients are parallel (Figure 4).

You might visualize this as follows. Imagine yourself standing at a point x that satisfies g_1(x) = 0. Any direction v in which you could move to instantaneously change the value of f must have a nonzero dot product with ∇f(x), by the properties of the directional derivative. The constraint g_1, however, stops you from moving in any direction that fails to maintain g_1(x) = 0; equivalently, any allowed direction v must have zero dot product with ∇g_1(x). If ∇g_1(x) is not a multiple of ∇f(x), then you can slide along the level set g_1(x) = 0 in a direction v that has a nonzero dot product with ∇f(x) (Figure 5); in other words, x is not a minimum. On the other hand, if ∇g_1(x) is a multiple of ∇f(x), then there is no such direction to move in, because any valid sliding direction leaves the value of f instantaneously unchanged. In other words, the constraint g_1 cancels out any change you could make in the value of f.

It is important to note that there may be multiple points x satisfying (7), each with a different Lagrange multiplier λ.

With many constraints. The following condition generalizes Lagrange multipliers to multiple constraints:

    ∇f(x) + λ_1 ∇g_1(x) + ... + λ_m ∇g_m(x) = 0
    g_1(x) = 0
      ...
    g_m(x) = 0    (8)

where λ_1, ..., λ_m are the Lagrange multipliers. This condition says that at x, ∇f(x) ∈ Span({∇g_1(x), ..., ∇g_m(x)}). The reason this makes sense is that each constraint resists motion in the direction of its gradient. If ∇f lies in this span, then motion in any direction that would locally change f is completely nullified by the constraints. All local minima must satisfy (8). Conversely, if the equations of (8) are satisfied, then x must be a local minimum, a local maximum, or a sort of saddle point restricted to S. So this is a necessary, but not sufficient, condition for optimality.

Example. Suppose we want to find the closest points (x_1, y_1) and (x_2, y_2) on two unit circles, one centered at the origin and the other centered at (c_x, c_y). The optimization variable is x = (x_1, y_1, x_2, y_2) and the constrained minimization problem is

    min f(x) = (x_1 - x_2)^2 + (y_1 - y_2)^2
    such that g_1(x) = x_1^2 + y_1^2 - 1 = 0
              g_2(x) = (x_2 - c_x)^2 + (y_2 - c_y)^2 - 1 = 0    (9)

The method of Lagrange multipliers states that we need to find a point x that satisfies the constraints, together with multipliers λ_1 and λ_2 that satisfy

    ∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0.    (10)

We can compute the following gradients:

    ∇f(x) = (2(x_1 - x_2), 2(y_1 - y_2), -2(x_1 - x_2), -2(y_1 - y_2)),    (11)
    ∇g_1(x) = (2x_1, 2y_1, 0, 0),    (12)
    ∇g_2(x) = (0, 0, 2(x_2 - c_x), 2(y_2 - c_y)).    (13)

Putting these together, we have the two simultaneous sets of equations

    x_1 - x_2 + λ_1 x_1 = 0
    y_1 - y_2 + λ_1 y_1 = 0    (14)

and

    x_1 - x_2 - λ_2 (x_2 - c_x) = 0
    y_1 - y_2 - λ_2 (y_2 - c_y) = 0.    (15)

In other words, the vectors (x_1 - x_2, y_1 - y_2), (x_1, y_1), and (x_2 - c_x, y_2 - c_y) must all be parallel. With some rearrangement, this also means that (x_1, y_1) and (x_2, y_2) must be parallel to (c_x, c_y). Verify geometrically that all points on the circles that intersect the line through the origin and (c_x, c_y) are either local minima, local maxima, or saddle points of the squared distance function.

Interpreting Lagrange multipliers. In some applications, such as physics and economics, Lagrange multipliers have a meaningful interpretation. Consider the m = 1 case, and interpret the constraint as stating g_1(x) = c with c = 0. The Lagrange multiplier λ at a (global) minimum x tells us how fast the minimum value of f would change if we were to relax the constraint by raising c at a constant rate: with the sign convention of (7), that rate of change is -λ (Figure 6). For example, in constrained physical simulation, the Lagrange multipliers produce the forces required to maintain each constraint.

Using Lagrange multipliers in numerical optimization. If we define the Lagrangian function on n + m variables

    L(x, λ_1, ..., λ_m) = f(x) + Σ_{i=1}^m λ_i g_i(x),    (16)

then the constrained optimization problem can be cast as one of finding the critical points of L in R^{n+m}. More compactly, letting λ = (λ_1, ..., λ_m), we would like to find a point (x, λ) such that

    ∇L(x, λ) = (∇_x L(x, λ), ∇_λ L(x, λ))
             = (∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x), g_1(x), ..., g_m(x))    (17)

equals zero. The importance of this is that we have converted a constrained optimization into an unconstrained root-finding problem! There exist Newton-like techniques for solving multivariate root-finding problems; if f and the g_i are twice differentiable, we can use the iterative method

    (x_{t+1}, λ_{t+1}) = (x_t, λ_t) - [∇^2 L(x_t, λ_t)]^{-1} ∇L(x_t, λ_t).    (18)

The Hessian of the Lagrangian is the block matrix

    ∇^2 L(x, λ) = [ H    G ]
                  [ G^T  0 ]    (19)

where H = ∇^2 f(x) + Σ_{i=1}^m λ_i ∇^2 g_i(x) and G = [∇g_1(x), ..., ∇g_m(x)] is the n × m matrix whose columns are the constraint gradients.

2.2 Karush-Kuhn-Tucker Conditions

The KKT conditions extend the idea of Lagrange multipliers to handle inequality constraints in addition to equality constraints. They provide a first-order optimality condition for the problem

    min_{x ∈ R^n} f(x)
    such that g_i(x) = 0 for i = 1, ..., m
              h_j(x) ≤ 0 for j = 1, ..., p    (20)

where f and all of the g_i and h_j are differentiable.

With one inequality. Let us start by assuming m = 0 and p = 1. The peculiar thing about inequalities is that they operate in essentially two regimes, depending on whether they affect a critical point or not (Figure 7). If x is a local minimum of f such that h_1(x) < 0, then the constraint is satisfied throughout a neighborhood of x, and x is a local minimum of the constrained problem. On the other hand, there could be local minima on the boundary of the feasible set S, which consists of the points satisfying h_1(x) = 0. To find these critical points, we can treat h_1 like an equality constraint and use the method of Lagrange multipliers. So, we must be aware of the following two cases:

1. ∇f(x) = 0 and h_1(x) < 0.
2. h_1(x) = 0, and there exists a Lagrange multiplier μ such that ∇f(x) + μ ∇h_1(x) = 0.

A compact way of writing these two conditions, which will be very useful in a moment, is the following set of equalities and inequalities:

    ∇f(x) + μ ∇h_1(x) = 0
    h_1(x) ≤ 0
    μ h_1(x) = 0    (21)

in which the term μ h_1(x) = 0 is known as the complementarity condition; it enforces that either μ or h_1(x) be zero. If we are only interested in finding local minima, we can also include the constraint μ ≥ 0.

With many inequalities. To generalize this argument to p > 1, observe that each of the two cases outlined above can hold for each of the inequalities. So, we may potentially need to enumerate all partitions of the inequalities into those that are strictly satisfied and those that are met with equality, and find critical points for each subset. But there are 2^p possible subsets (Figure 8)! To express this condition compactly, we write

    ∇f(x) + μ_1 ∇h_1(x) + ... + μ_p ∇h_p(x) = 0
    h_j(x) ≤ 0 for j = 1, ..., p
    μ_j h_j(x) = 0 for j = 1, ..., p    (22)

where μ_1, ..., μ_p are the KKT multipliers. For those critical points with h_j(x) = 0, we say the inequality is active at x; if h_j(x) < 0, we say it is inactive. Some of the first numerical methods that we present in this class perform a combinatorial search through the possible subsets of active constraints.

General form. Equalities can be incorporated in a straightforward manner into the above equation, giving the full set of KKT conditions:

    ∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x) + Σ_{j=1}^p μ_j ∇h_j(x) = 0
    g_i(x) = 0 for i = 1, ..., m
    h_j(x) ≤ 0 for j = 1, ..., p
    μ_j h_j(x) = 0 for j = 1, ..., p    (23)

where λ_1, ..., λ_m are the Lagrange multipliers and μ_1, ..., μ_p are the KKT multipliers. Note that the complementarity condition applies only to the inequalities.

Use of KKT conditions in analytical optimization. The KKT conditions can be used to prove analytically that a point is an optimum of a constrained problem. One drawback is that there is a combinatorial number of subsets of active inequalities, and in the absence of further information all of these subsets must be considered as candidates for generating the optimal critical point!

Use of KKT conditions in numerical optimization. Unfortunately, we cannot use the KKT conditions to formulate an unconstrained root-finding problem as we did in the case of Lagrange multipliers. The reason is that the inequality constraints h_j(x) ≤ 0 must be preserved, and there is no natural way to handle them in the root-finding methods we have seen so far. Instead, in most optimization software the KKT conditions are usually used as a first stage of verifying that a candidate point found by some algorithm is truly a critical point.

3 Exercises

1. The entropy of a discrete probability distribution (p_1, ..., p_n) over n values is given by E(p_1, ..., p_n) = -Σ_{i=1}^n p_i ln p_i. Of course, the probabilities must sum to 1. Find the probability distribution that maximizes entropy using Lagrange multipliers.

2. Find a simple way to compute the solution to the n-dimensional constrained optimization min ||x - c||^2 such that l ≤ x ≤ u, where l and u are bound constraints.

3. Write the KKT conditions for finding the closest point to the origin in a 2D triangle with vertices a, b, c (boundary inclusive). Assume a, b, c are given in counterclockwise order. What is the significance of the KKT multipliers? What does it mean if none of them are nonzero? One? Two? More?
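To experiment with these conditions numerically, here is a minimal sketch of the Newton iteration (18) on a toy equality-constrained problem, minimizing f(x) = x_1^2 + 2 x_2^2 subject to g(x) = x_1 + x_2 - 1 = 0. The problem data are invented for illustration; because this Lagrangian is quadratic in (x, λ), Newton's method converges in a single step.

```python
import numpy as np

# Toy problem (made up for illustration):
#   min x1^2 + 2*x2^2   subject to   g(x) = x1 + x2 - 1 = 0.
# Lagrangian: L(x, lam) = x1^2 + 2*x2^2 + lam*(x1 + x2 - 1).

def grad_L(z):
    """Gradient of the Lagrangian in the joint variable z = (x1, x2, lam)."""
    x1, x2, lam = z
    return np.array([2*x1 + lam,      # dL/dx1
                     4*x2 + lam,      # dL/dx2
                     x1 + x2 - 1])    # dL/dlam = g(x)

def hess_L(z):
    """Hessian of the Lagrangian: the block matrix of equation (19), constant here."""
    return np.array([[2.0, 0.0, 1.0],
                     [0.0, 4.0, 1.0],
                     [1.0, 1.0, 0.0]])

z = np.zeros(3)   # start at x = (0, 0), lam = 0
for _ in range(5):
    # Newton step of iteration (18); solve a linear system rather than inverting
    z = z - np.linalg.solve(hess_L(z), grad_L(z))

print(np.round(z, 4))   # x* = (2/3, 1/3), lam* = -4/3
```

At the computed point, one can check that the constraint holds exactly and that ∇f(x*) = -λ* ∇g(x*), i.e., condition (7) is satisfied.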