B553 Lecture 7: Constrained Optimization, Lagrange Multipliers, and KKT Conditions


Kris Hauser
February 2, 2012

Constraints on parameter values are an essential part of many optimization problems, and arise from a variety of mathematical, physical, and resource limitations. Depending on their complexity, they can require significant work to handle; in general, constrained optimization algorithms are much more complex than their unconstrained counterparts. A constrained optimization problem is specified in the form

    min_{x ∈ R^n} f(x)  such that  x ∈ S    (1)

where S ⊆ R^n denotes the subset of valid parameters, known as the feasible set (Figure 1). S must be a closed set to guarantee the existence of a minimum.

Recall that in the univariate case of optimizing a function within some interval [a, b], we had to test the endpoints of the interval as well as the critical points in the interior (a, b) for optimality. In the multivariate constrained setting, the optimizer must likewise consider not only the possibility that the optimum is an interior local minimum, but also that it lies on the boundary of the feasible set (Figure 2). The challenge is that the boundary of S contains infinitely many points. This lecture introduces two analytical techniques for identifying candidate critical points: Lagrange multipliers, for equality constraints, and the Karush-Kuhn-Tucker (KKT) conditions, for inequality constraints. Besides being analytically useful, these conditions are the starting point for most constrained optimization algorithms. Note that, like other critical-point tests, they are only first-order conditions for optimality, and are therefore necessary but not sufficient for finding minima.

1 Common types of constraints

Several forms of constraints arise in practice. Here are some of the most common ones (Figure 3).

Bound constraints. Axis-aligned bound constraints take the form l_i ≤ x_i ≤ u_i for some lower and upper values l_i and u_i, i = 1, ..., n. These are among the easiest constraints to incorporate.

Linear inequalities. Linear inequality constraints take the form Ax ≤ b for some m × n matrix A and a length-m vector b. Note that bound constraints are a special case of linear inequalities with

    A = [I; -I]    (2)

and

    b = [u; -l]    (3)

where u and l are the vectors of upper and lower bounds, respectively, I is the n × n identity matrix, and the semicolon denotes vertical stacking.

Linear equalities. Linear equality constraints take the form Ax = b, where A and b have m rows. Note that this is usually an underdetermined system (otherwise S would consist of either a single point or the empty set). In theory these constraints can easily be eliminated by finding a representation that incorporates the nullspace of A, say x = x_0 + Ny, and converting the optimization over x into a smaller optimization over y. In practice, however, most optimization routines do not operate this way, because of numerical errors in computing N.

Nonlinear constraints: general form. In general, constraints may be nonlinear. In this setting we can (usually) write the constraints in the form

    g_i(x) = 0  for i = 1, ..., m
    h_j(x) ≤ 0  for j = 1, ..., p    (4)

where the g_i and h_j are continuous, differentiable scalar fields. This is the form we will assume for the rest of this class, because all of the prior constraint types are special cases of it.

Convex constraints. A convex set S satisfies the following property: for any two points x and y in S, the point

    (1 - u)x + uy    (5)

for u ∈ [0, 1] lies in S as well. In other words, the line segment between any two points in S must also lie in S. Later we will discuss efficient algorithms for solving problems in which the constraints g_i and h_j produce a convex feasible set and the objective function f is also convex. In particular, we will show that descent methods converge to a global minimum. (Note that to achieve convexity, any equality constraints must be linear.)

Black-box constraints. Another type of constraint is a black box that can be queried to test whether a point x lies inside it. No other mathematical property, such as the magnitude of the feasibility violation, derivatives, or even smoothness, is necessarily provided. These constraints typically arise from complex procedures (e.g., simulations or geometric algorithms) that have no convenient mathematical representation. They are rarely considered in the numerical optimization literature, but often come up in large practical systems.

2 First-Order Conditions of Local Optimality

We say that a feasible point x is a local minimum of the optimization problem (1) if f(x) is lower than the value of f at any other feasible point in some neighborhood of x within S. That is, x is a local minimum if x ∈ S and there exists a neighborhood of radius ε such that f(x) < f(y) for all y ∈ {y ∈ S | 0 < d(x, y) < ε}. Unfortunately, not all local minima are critical points of f, because we must take into account how the constraints shape the neighborhood! We will show that there are alternative criteria that we can use to generate candidates for local minima.
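As a one-dimensional illustration of a boundary minimum that is not a critical point, consider minimizing f(x) = x^2 over S = [1, 2] (a toy example made up here, not from the notes). A short NumPy sketch:

```python
import numpy as np

# Illustrative 1-D problem: minimize f(x) = x^2 over the feasible set S = [1, 2].
# The unconstrained critical point x = 0 lies outside S, so the constrained
# minimum sits on the boundary of S, where the derivative is nonzero.
xs = np.linspace(1.0, 2.0, 1001)   # dense grid over S
fx = xs**2
x_star = xs[np.argmin(fx)]

print(x_star)       # 1.0: the minimizer is the left endpoint of S
print(2 * x_star)   # f'(x*) = 2, not 0: x* is not a critical point of f
```

An interior-point test alone would miss this minimizer entirely, which is why the conditions below treat the boundary separately.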

2.1 Lagrange Multipliers

Let us suppose for the moment that there are no inequality constraints, and that we are addressing the general equality-constrained problem

    min_{x ∈ R^n} f(x)  such that  g_i(x) = 0 for i = 1, ..., m.    (6)

We will assume that f and all of the g_i are differentiable.

With one constraint. First let us consider the m = 1 case. The principle of Lagrange multipliers states that any local minimum or maximum x of (6) must simultaneously satisfy the equations

    ∇f(x) + λ ∇g_1(x) = 0
    g_1(x) = 0    (7)

for some value of λ. The variable λ is known as the Lagrange multiplier. These equations say that at x, the gradient of f is a multiple of the gradient of g_1, which is to say that the two gradients are parallel (Figure 4).

You might visualize this as follows. Imagine yourself standing at a point x that satisfies g_1(x) = 0. Any direction v in which you could move to instantaneously change the value of f must have a nonzero dot product with ∇f(x), by the properties of the directional derivative. The constraint g_1, however, stops you from moving in any direction that fails to maintain g_1(x) = 0; equivalently, any allowed direction v must have zero dot product with ∇g_1(x). If ∇g_1(x) is not a multiple of ∇f(x), then you can slide along the level set g_1(x) = 0 in a direction v that has a nonzero dot product with ∇f(x) (Figure 5); in other words, x is not a minimum. On the other hand, if ∇g_1(x) is a multiple of ∇f(x), then there is no such direction to move in, because any valid sliding direction leaves the value of f instantaneously unchanged. In other words, the constraint g_1 cancels out any change you could make in the value of f.

It is important to note that there may be multiple points x satisfying (7), each with a different Lagrange multiplier λ.

With many constraints. The following condition generalizes Lagrange multipliers to multiple constraints:

    ∇f(x) + λ_1 ∇g_1(x) + ... + λ_m ∇g_m(x) = 0
    g_1(x) = 0
      ...
    g_m(x) = 0    (8)

where λ_1, ..., λ_m are the Lagrange multipliers. This condition says that at x, ∇f(x) ∈ Span({∇g_1(x), ..., ∇g_m(x)}). The reason this makes sense is that each constraint resists motion in the direction of its gradient. If ∇f lies in this span, then motion in any direction that would locally change f is completely nullified by the constraints. All local minima must satisfy (8). Conversely, if the equations of (8) are satisfied, then x must be a local minimum, a local maximum, or a sort of saddle point restricted to S. So this is a necessary, but not sufficient, condition for optimality.

Example. Suppose we want to find the closest points (x_1, y_1) and (x_2, y_2) on two unit circles, one centered at the origin and the other centered at (c_x, c_y). The optimization variable is x = (x_1, y_1, x_2, y_2) and the constrained minimization problem is

    min f(x) = (x_1 - x_2)^2 + (y_1 - y_2)^2
    such that g_1(x) = x_1^2 + y_1^2 - 1 = 0
              g_2(x) = (x_2 - c_x)^2 + (y_2 - c_y)^2 - 1 = 0    (9)

The method of Lagrange multipliers states that we need to find a point x that satisfies the constraints, together with multipliers λ_1 and λ_2 that satisfy

    ∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0.    (10)

We can compute the following gradients:

    ∇f(x) = (2(x_1 - x_2), 2(y_1 - y_2), -2(x_1 - x_2), -2(y_1 - y_2)),    (11)
    ∇g_1(x) = (2x_1, 2y_1, 0, 0),    (12)
    ∇g_2(x) = (0, 0, 2(x_2 - c_x), 2(y_2 - c_y)).    (13)

Putting these together, we have the two simultaneous sets of equations

    x_1 - x_2 + λ_1 x_1 = 0
    y_1 - y_2 + λ_1 y_1 = 0    (14)

and

    x_1 - x_2 - λ_2 (x_2 - c_x) = 0
    y_1 - y_2 - λ_2 (y_2 - c_y) = 0.    (15)

In other words, the vectors (x_1 - x_2, y_1 - y_2), (x_1, y_1), and (x_2 - c_x, y_2 - c_y) must all be parallel. With some rearrangement, this also means that (x_1, y_1) and (x_2, y_2) must be parallel to (c_x, c_y). Verify geometrically that all points on the circles that intersect the line through the origin and (c_x, c_y) are either local minima, local maxima, or saddle points of the squared distance function.

Interpreting Lagrange multipliers. In some applications, such as physics and economics, Lagrange multipliers have a meaningful interpretation. Consider the m = 1 case, and interpret the constraint as stating g_1(x) = c with c = 0. The Lagrange multiplier λ at a (global) minimum x tells us how fast the minimum value of f would change if we were to relax the constraint by raising c at a constant rate: with the sign convention of (7), that rate of change is -λ (Figure 6). For example, in constrained physical simulation, the Lagrange multipliers produce the forces required to maintain each constraint.

Using Lagrange multipliers in numerical optimization. If we define the Lagrangian function on n + m variables

    L(x, λ_1, ..., λ_m) = f(x) + Σ_{i=1}^m λ_i g_i(x),    (16)

then the constrained optimization problem can be cast as one of finding the critical points of L in R^{n+m}. More compactly, letting λ = (λ_1, ..., λ_m), we would like to find a point (x, λ) such that

    ∇L(x, λ) = (∇_x L(x, λ), ∇_λ L(x, λ))
             = (∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x), g_1(x), ..., g_m(x))    (17)

equals zero. The importance of this is that we have converted a constrained optimization into an unconstrained root-finding problem! There exist Newton-like techniques for solving multivariate root-finding problems; if f and the g_i are twice differentiable, we can use the iterative method

    (x_{t+1}, λ_{t+1}) = (x_t, λ_t) - [∇^2 L(x_t, λ_t)]^{-1} ∇L(x_t, λ_t).    (18)

The Hessian of the Lagrangian is the block matrix

    ∇^2 L(x, λ) = [ H    G ]
                  [ G^T  0 ]    (19)

where H = ∇^2 f(x) + Σ_{i=1}^m λ_i ∇^2 g_i(x) and G = [∇g_1(x), ..., ∇g_m(x)] is the n × m matrix whose columns are the constraint gradients.

2.2 Karush-Kuhn-Tucker Conditions

The KKT conditions extend the idea of Lagrange multipliers to handle inequality constraints in addition to equality constraints. They provide a first-order optimality condition for the problem

    min_{x ∈ R^n} f(x)
    such that g_i(x) = 0 for i = 1, ..., m
              h_j(x) ≤ 0 for j = 1, ..., p    (20)

where f and all of the g_i and h_j are differentiable.

With one inequality. Let us start by assuming m = 0 and p = 1. The peculiar thing about inequalities is that they operate in essentially two regimes, depending on whether they affect a critical point or not (Figure 7). If x is a local minimum of f such that h_1(x) < 0, then the constraint is satisfied throughout a neighborhood of x, and x is a local minimum of the constrained problem. On the other hand, there could be local minima on the boundary of the feasible set S, which consists of the points satisfying h_1(x) = 0. To find these critical points, we can treat h_1 like an equality constraint and use the method of Lagrange multipliers. So, we must be aware of the following two cases:

1. ∇f(x) = 0 and h_1(x) < 0.
2. h_1(x) = 0, and there exists a Lagrange multiplier μ such that ∇f(x) + μ ∇h_1(x) = 0.

A compact way of writing these two conditions, which will be very useful in a moment, is the following set of equalities and inequalities:

    ∇f(x) + μ ∇h_1(x) = 0
    h_1(x) ≤ 0
    μ h_1(x) = 0    (21)

in which the term μ h_1(x) = 0 is known as the complementarity condition; it enforces that either μ or h_1(x) be zero. If we are only interested in finding local minima, we can also include the constraint μ ≥ 0.

With many inequalities. To generalize this argument to p > 1, observe that each of the two cases outlined above can hold for each of the inequalities. So, we may potentially need to enumerate all partitions of the inequalities into those that are strictly satisfied and those that are met with equality, and find critical points for each subset. But there are 2^p possible subsets (Figure 8)! To express this condition compactly, we write

    ∇f(x) + μ_1 ∇h_1(x) + ... + μ_p ∇h_p(x) = 0
    h_j(x) ≤ 0 for j = 1, ..., p
    μ_j h_j(x) = 0 for j = 1, ..., p    (22)

where μ_1, ..., μ_p are the KKT multipliers. For those critical points with h_j(x) = 0, we say the inequality is active at x; if h_j(x) < 0, we say it is inactive. Some of the first numerical methods that we present in this class perform a combinatorial search through the possible subsets of active constraints.

General form. Equalities can be incorporated in a straightforward manner into the above equation, giving the full set of KKT conditions:

    ∇f(x) + Σ_{i=1}^m λ_i ∇g_i(x) + Σ_{j=1}^p μ_j ∇h_j(x) = 0
    g_i(x) = 0 for i = 1, ..., m
    h_j(x) ≤ 0 for j = 1, ..., p
    μ_j h_j(x) = 0 for j = 1, ..., p    (23)

where λ_1, ..., λ_m are the Lagrange multipliers and μ_1, ..., μ_p are the KKT multipliers. Note that the complementarity condition applies only to the inequalities.

Use of KKT conditions in analytical optimization. The KKT conditions can be used to prove analytically that a point is an optimum of a constrained problem. One drawback is that there is a combinatorial number of subsets of active inequalities, and in the absence of further information all of these subsets must be considered as candidates for generating the optimal critical point!

Use of KKT conditions in numerical optimization. Unfortunately, we cannot use the KKT conditions to formulate an unconstrained root-finding problem as we did in the case of Lagrange multipliers. The reason is that the inequality constraints h_j(x) ≤ 0 must be preserved, and there is no natural way to handle them in the root-finding methods we have seen so far. Instead, in most optimization software the KKT conditions are usually used as a first stage of verifying that a candidate point found by some algorithm is truly a critical point.

3 Exercises

1. The entropy of a discrete probability distribution (p_1, ..., p_n) over n values is given by E(p_1, ..., p_n) = -Σ_{i=1}^n p_i ln p_i. Of course, the probabilities must sum to 1. Find the probability distribution that maximizes entropy using Lagrange multipliers.

2. Find a simple way to compute the solution to the n-dimensional constrained optimization min ||x - c||^2 such that l ≤ x ≤ u, where l and u are bound constraints.

3. Write the KKT conditions for finding the closest point to the origin in a 2D triangle with vertices a, b, c (boundary inclusive). Assume a, b, c are given in counterclockwise order. What is the significance of the KKT multipliers? What does it mean if none of them are nonzero? One? Two? More?
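To experiment with these conditions numerically, here is a minimal sketch of the Newton iteration (18) on a toy equality-constrained problem, minimizing f(x) = x_1^2 + 2 x_2^2 subject to g(x) = x_1 + x_2 - 1 = 0. The problem data are invented for illustration; because this Lagrangian is quadratic in (x, λ), Newton's method converges in a single step.

```python
import numpy as np

# Toy problem (made up for illustration):
#   min x1^2 + 2*x2^2   subject to   g(x) = x1 + x2 - 1 = 0.
# Lagrangian: L(x, lam) = x1^2 + 2*x2^2 + lam*(x1 + x2 - 1).

def grad_L(z):
    """Gradient of the Lagrangian in the joint variable z = (x1, x2, lam)."""
    x1, x2, lam = z
    return np.array([2*x1 + lam,      # dL/dx1
                     4*x2 + lam,      # dL/dx2
                     x1 + x2 - 1])    # dL/dlam = g(x)

def hess_L(z):
    """Hessian of the Lagrangian: the block matrix of equation (19), constant here."""
    return np.array([[2.0, 0.0, 1.0],
                     [0.0, 4.0, 1.0],
                     [1.0, 1.0, 0.0]])

z = np.zeros(3)   # start at x = (0, 0), lam = 0
for _ in range(5):
    # Newton step of iteration (18); solve a linear system rather than inverting
    z = z - np.linalg.solve(hess_L(z), grad_L(z))

print(np.round(z, 4))   # x* = (2/3, 1/3), lam* = -4/3
```

At the computed point, one can check that the constraint holds exactly and that ∇f(x*) = -λ* ∇g(x*), i.e., condition (7) is satisfied.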