Derivations for Linear Algebra and Optimization

John Duchi

Contents

1  Householder Transformations
2  Convex Sets and Local/Global Minima
   2.1  Local Minima as Global Minima
   2.2  Uniqueness of Global Minima
3  Second Order Convexity Conditions
   3.1  Multivariate case
4  Iterative Methods for Optimization
   4.1  Steepest Descent
        4.1.1  Digression on Choosing Directions
   4.2  Conjugate Gradient
5  Perceptron Convergence
6  Duality of SVMs
7  Projection of a vector onto linear constraints
8  Minimization of a linear function subject to elliptical constraints
9  KL Divergence for Gaussian distributions

1 Householder Transformations

Much of this section was copied and paraphrased from Heath's Scientific Computing. Suppose we are looking for an orthogonal transformation that annihilates desired components of a given vector. Recall that a square real matrix $Q$ is said to be orthogonal if its columns are orthonormal, that is, $Q^T Q = I$. Orthogonal transformations are nice because they preserve the Euclidean norm of any vector $v$:

    $\|Qv\|^2 = (Qv)^T Qv = v^T Q^T Q v = v^T v = \|v\|^2.$

This is useful because, when applied to a linear system, an orthogonal transformation leaves Euclidean norms unchanged and so does not make the problem any harder to solve numerically. Now, we are really looking for an orthogonal transformation of a vector that annihilates some of its components while maintaining its two-norm (Euclidean norm). We can accomplish this with a Householder transformation, which is a matrix of the form

    $H = I - \frac{2 v v^T}{v^T v}.$

Obviously, $v$ is a non-zero vector. From our definition, it is easy to see that $H = H^T$, and a little math gives us that

    $HH = \Big(I - \frac{2vv^T}{v^Tv}\Big)\Big(I - \frac{2vv^T}{v^Tv}\Big) = I - \frac{4vv^T}{v^Tv} + \frac{4vv^T}{v^Tv}\,\frac{vv^T}{v^Tv} = I,$

so $H$ is both symmetric and orthogonal. To derive the full Householder transformation, we first consider the problem of taking a vector $a$ and annihilating all of its components but the first, giving us a vector $\alpha e_1$, in such a way that $\|a\| = |\alpha|$. Basically, we would like to have

    $Ha = \begin{bmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \alpha e_1.$

Using the original formula for $H$, we have

    $\alpha e_1 = Ha = a - \frac{2 v^T a}{v^T v}\, v,$

so that

    $v = (a - \alpha e_1)\, C(v, a), \qquad C(v, a) = \frac{v^T v}{2 v^T a}.$

Because the constant $C$ term will be factored out by the division when we form the actual transformation $H$, we can eliminate it, giving us

    $v = a - \alpha e_1.$

In general, to avoid cancellation (subtracting two nearly equal quantities, which loses precision), we choose $\alpha = -\mathrm{sign}(a_1)\,\|a\|$.

In the more general case, when actually working with a system that we would like to transform orthogonally into an upper-triangular system (for easier solving), we would like to take an $m$-vector $a$ and annihilate its last $m - k$ components. That is, we would like to take a vector

    $a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix},$

where $a_1$ is a $(k-1)$-vector and $a_2$ is an $(m - k + 1)$-vector, and transform it into a vector which contains $a_1$ as its first $k - 1$ elements, $\pm\|a_2\|$ as its $k$th element, and zeros as its remaining $m - k$ elements. Here is the derivation of the vector $v$ to be used in the Householder transformation for this case:

    $Ha = \begin{bmatrix} a_1 \\ 0 \end{bmatrix} + \alpha e_k = \Big(I - \frac{2vv^T}{v^Tv}\Big)\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} - \frac{2 v^T a}{v^T v}\, v.$

Subtracting $a$ from both sides and multiplying by negative one, we have

    $\begin{bmatrix} 0 \\ a_2 \end{bmatrix} - \alpha e_k = \frac{2 v^T a}{v^T v}\, v = v\, C(v, a),$

and again we can eliminate the constant term to give us that we should set

    $v = \begin{bmatrix} 0 \\ a_2 \end{bmatrix} - \alpha e_k, \qquad \alpha = \pm\|a_2\|,$

with the sign of $\alpha$ again chosen to avoid cancellation. From the above discussion, it is fairly easy to see that given an $m \times n$ matrix $A$, if we apply $n$ Householder transformations to eliminate the below-diagonal elements of $A$, we will get a matrix $R$ such that

    $H_n H_{n-1} \cdots H_1 A = \begin{bmatrix} R \\ 0 \end{bmatrix}.$

For further explication and use of the Householder transformation, see Heath.
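To make the construction concrete, here is a minimal numpy sketch (my own illustration, not from the notes; the helper name householder_qr is assumed) that applies successive Householder reflections, with the cancellation-avoiding sign choice for $\alpha$, to compute $A = QR$:

```python
import numpy as np

def householder_qr(A):
    """QR factorization of an m x n matrix A (m >= n) via Householder reflections.

    Returns Q (m x m, orthogonal) and R (m x n, upper triangular) with A = Q @ R.
    """
    A = np.array(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(n):
        a = R[k:, k]                                  # column to annihilate below the diagonal
        alpha = -np.sign(a[0]) * np.linalg.norm(a) if a[0] != 0 else -np.linalg.norm(a)
        v = a.copy()
        v[0] -= alpha                                 # v = a - alpha * e_1 (avoids cancellation)
        if np.linalg.norm(v) < 1e-15:                 # nothing to annihilate
            continue
        v = v / np.linalg.norm(v)
        Hk = np.eye(m)
        Hk[k:, k:] -= 2.0 * np.outer(v, v)            # H = I - 2 v v^T / (v^T v), v normalized
        R = Hk @ R
        Q = Q @ Hk                                    # H symmetric orthogonal, so Q = H_1 ... H_n

    return Q, R

if __name__ == "__main__":
    A = np.random.randn(5, 3)
    Q, R = householder_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(5)))
```

The reflectors are applied left to right, so $R = H_n \cdots H_1 A$ and, since each $H_k$ is its own inverse, $Q = H_1 \cdots H_n$ satisfies $A = QR$.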

2 Convex Sets and Local/Global Minima

A convex set is a set $S \subseteq \mathbb{R}^n$ which contains the line segment between any two of its points. More specifically, if $S$ is convex, then for any $x, y \in S$ and $\alpha \in [0, 1]$, $\alpha x + (1-\alpha)y \in S$. A function $f : S \subseteq \mathbb{R}^n \to \mathbb{R}$ is convex on a convex set $S$ if its value along any line segment in $S$ lies on or below the chord connecting the function values at the endpoints of the segment. That is,

    $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y).$

2.1 Local Minima as Global Minima

One really nice property of convex functions is that any local minimum of a convex function $f$ on a convex set $S$ is a global minimum of $f$ on $S$. To see this, consider the definition of a local minimum: $x \in S$ is a local minimum if $f(x) \le f(z)$ for every $z \in S$ with $\|z - x\| < \varepsilon$, that is, $x$ attains the smallest value $f$ takes on some ball of radius $\varepsilon$ intersected with $S$. Now, consider some other point $y \in S$. Assume, for the sake of contradiction, that $f(y) < f(x)$, which means that $x$ is not the global minimum of $f$ on $S$. By the definition of convexity, the chord connecting $x$ and $y$ lies completely in $S$. Thus, there is a point $\alpha x + (1-\alpha)y$ on the chord that lies inside the ball, that is, there is some $\alpha_r \in (0, 1)$ such that

    $\|\alpha_r x + (1-\alpha_r)y - x\| < \varepsilon.$

Let $\alpha_r x + (1-\alpha_r)y = x + \varepsilon_r$, which is a point inside the ball around $x$. Now, by convexity of $f$ and our assumption that $f(y) < f(x)$,

    $f(x + \varepsilon_r) = f(\alpha_r x + (1-\alpha_r)y) \le \alpha_r f(x) + (1-\alpha_r) f(y) < \alpha_r f(x) + (1-\alpha_r) f(x) = f(x),$

where the strict inequality uses $\alpha_r < 1$. This is a contradiction: $x + \varepsilon_r$ lies inside the ball of radius $\varepsilon$ about $x$, yet $f(x + \varepsilon_r) < f(x)$, so $x$ would not be a local minimum. Our assumption that $f(y) < f(x)$ must thus be false, and we have that if $x$ is a local minimum of $f$ on $S$, then $x$ is a global minimum of $f$ on $S$.

2.2 Uniqueness of Global Minima

If we have a function $f$ which is strictly convex on a convex set $S$, that is,

    $f(\alpha x + (1-\alpha)y) < \alpha f(x) + (1-\alpha) f(y)$ for $\alpha \in (0, 1)$ and $x \ne y$,

then a local minimum of $f$ is the unique global minimum of $f$ on $S$. To see this, suppose we have two distinct points, $x$ and $y$, which are local minima of $f$ on $S$. By the above result, if $x$ and $y$ existed, we know that they are both global minima, so that $f(x) = f(y)$. This means, however, that because $f$ is strictly convex, for any $\alpha \in (0, 1)$,

    $f(\alpha x + (1-\alpha)y) < \alpha f(x) + (1-\alpha) f(y) = \alpha f(x) + (1-\alpha) f(x) = f(x).$

This is a contradiction because $x$ is a global minimum of $f$ on $S$ while the point $\alpha x + (1-\alpha)y \in S$ has a strictly smaller function value, so our assumption that a second minimum $y$ existed must be false. Thus, if a function $f$ is strictly convex on a convex set $S$, any local minimum of $f$ is the unique global minimum of $f$ on $S$.
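As a quick numerical illustration of the chord inequality (not part of the notes; $f(x) = \|x\|^2$ is just an assumed convex example), one can spot-check the definition on random points:

```python
import numpy as np

# Spot-check f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) for the convex f(x) = ||x||^2.
f = lambda x: np.dot(x, x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform()
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print("chord inequality held on all sampled points")
```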

3 Second Order Convexity Conditions

This is a problem from Boyd's book on convex optimization. Suppose we have a function $f : \mathbb{R} \to \mathbb{R}$ which is convex, and $a, b \in \mathrm{dom}\, f$ with $a < b$. First, we consider $x \in [a, b]$. If $\theta \in [0, 1]$, we know that $f(\theta a + (1-\theta)b) \le \theta f(a) + (1-\theta) f(b)$. Because $x = \theta a + (1-\theta)b$ for some $\theta$ whenever $x \in [a, b]$, it is easy to see that

    $\theta = \frac{b - x}{b - a}, \qquad 1 - \theta = \frac{x - a}{b - a}.$

Thus,

    $f(x) = f(\theta a + (1-\theta)b) \le \theta f(a) + (1-\theta) f(b) = \frac{b - x}{b - a} f(a) + \frac{x - a}{b - a} f(b).$

Using this, we can prove some more interesting things. First, multiplying through by $b - a$ and rearranging, we have

    $(b - a) f(x) \le (b - x) f(a) + (x - a) f(b)$
    $(b - a) f(x) - (b - a) f(a) \le (x - a) f(b) - (x - a) f(a)$
    $\frac{f(x) - f(a)}{x - a} \le \frac{f(b) - f(a)}{b - a}.$

Similarly,

    $(b - a) f(x) \le (b - x) f(a) + (x - a) f(b)$
    $(b - a) f(x) - (b - a) f(b) \le (b - x) f(a) - (b - x) f(b)$
    $\frac{f(b) - f(a)}{b - a} \le \frac{f(b) - f(x)}{b - x}.$

If we suppose that $f$ is differentiable, taking the limit on each side of the two inequalities in turn (the limits are just the definitions of the one-sided derivatives at $a$ and $b$), we have

    $f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a} \le \frac{f(b) - f(a)}{b - a} \le \lim_{x \to b} \frac{f(b) - f(x)}{b - x} = f'(b).$

While we used the above inequalities to prove this, it also easily follows from the gradient of a convex function $f$ being a global underestimator of $f$ (that is, $f(x) \ge f(y) + f'(y)(x - y)$).

Lastly, we assume that $f$ is twice differentiable. Using the above result that $f'(a) \le f'(b)$ whenever $a < b$, we see that $f'(y) - f'(x) \ge 0$ as long as $y > x$, and we then have

    $f''(x) = \lim_{y \downarrow x} \frac{f'(y) - f'(x)}{y - x} \ge 0.$

Thus we see that if $f$ is convex and twice differentiable, $f''(x) \ge 0$.

As for the opposite direction, we need simply use the fact that $f$ is twice differentiable (so that $f''$ is integrable) and we can prove the converse. So suppose that $f''(u) \ge 0$ for all $u$ in $\mathrm{dom}\, f$. It easily follows that if $y \ge z$, we have $f'(y) \ge f'(z)$, because $f''(u) \ge 0$ implies

    $f'(y) - f'(z) = \int_z^y f''(u)\, du \ge \int_z^y 0\, du = 0.$

Now we simply integrate this result in $z$ from $x$ to $y$ (assuming $y \ge x$, though the signs just flip if the opposite is the case), seeing

    $\int_x^y \big(f'(y) - f'(z)\big)\, dz \ge \int_x^y 0\, dz = 0$
    $f'(y)(y - x) - f(y) + f(x) \ge 0$
    $f(x) \ge f(y) + f'(y)(x - y).$

This shows that a tangent line to $f$ is a global underestimator of $f$, which is the first-order characterization of convexity, so $f$ is convex if $f''(x) \ge 0$ for $x \in \mathrm{dom}\, f$. In the end, we see that a twice differentiable function $f$ is convex if and only if $f''(x) \ge 0$.

3.1 Multivariate case

In the multivariate case, we would like to prove that if the Hessian $\nabla^2 f$ of a function $f : \mathbb{R}^n \to \mathbb{R}$ is positive semi-definite, that is, $\nabla^2 f \succeq 0$, or $x^T \nabla^2 f\, x \ge 0$ for all $x$, then $f$ is convex. First, suppose that $\nabla^2 f \succeq 0$. Now, consider the single variable function $g$ that restricts $f$ to some arbitrary line in $f$'s domain, that is, $\mathrm{dom}\, g = [0, 1]$ and $g(t) = f(tx + (1-t)y)$ for $x, y \in \mathrm{dom}\, f$. We now take the first and second derivatives of $g$ with respect to $t$:

    $g'(t) = \nabla f(tx + (1-t)y)^T (x - y)$
    $g''(t) = (x - y)^T \nabla^2 f(tx + (1-t)y)\,(x - y) \ge 0,$

so $g$ is convex along every line in the domain, and hence $f$ is convex. Now, suppose that $f$ is convex, and again restrict it to a line using $g$. Because $g$ is the restriction of the convex function $f$ to a line segment contained in the convex domain, $g$ is convex and hence $g''(t) \ge 0$. Thus, we have that for all $x, y \in \mathrm{dom}\, f$,

    $g''(t) = (x - y)^T \nabla^2 f(tx + (1-t)y)\,(x - y) \ge 0.$

As near as I can tell, we have proved that a twice differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ with open, convex domain is convex if and only if $\nabla^2 f \succeq 0$.
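As a small numerical illustration of this second-order condition (my own example, not from the notes): the convex log-sum-exp function $f(x) = \log \sum_i e^{x_i}$ has Hessian $\mathrm{diag}(p) - p p^T$, where $p$ is the softmax of $x$, and its eigenvalues are indeed nonnegative:

```python
import numpy as np

def logsumexp_hessian(x):
    """Hessian of f(x) = log(sum(exp(x))), namely diag(p) - p p^T with p = softmax(x)."""
    p = np.exp(x - np.max(x))
    p /= p.sum()                          # softmax probabilities
    return np.diag(p) - np.outer(p, p)

x = np.random.randn(5)
eigvals = np.linalg.eigvalsh(logsumexp_hessian(x))
print("smallest Hessian eigenvalue:", eigvals.min())   # >= 0 up to roundoff
```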

4 Iterative Methods for Optimization

In this section, we will be considering the problem of minimizing some function $f : \mathbb{R}^n \to \mathbb{R}$.

4.1 Steepest Descent

In general, we know that $\nabla f(x)$ is the direction of fastest ascent of the function $f$ at $x$. As such, to minimize $f$, we would like to move our solution in the $-\nabla f$ direction. For steepest descent, then, we define the update for our estimate $x_{k+1}$ at step $k+1$ to be

    $x_{k+1} = x_k - \alpha \nabla f(x_k).$

To make this a smarter method, we would like to take the largest useful step at each iteration of the search, so we would like to travel in the $-\nabla f(x)$ direction until the function stops decreasing along it. To this end, we can define $x_k(t) = x_k - \nabla f(x_k)\, t$, so that we would like to minimize

    $g(t) = f(x_k(t)) = f(x_k - \nabla f(x_k)\, t),$

which is simply a one-dimensional minimization problem we can solve using a line search. It is easy to see that if we minimize $g(t)$, we have traveled as far along the direction $-\nabla f(x_k)$ as we can without making our function grow again.

As a more specific example, which can be enlightening, consider the function $f$ defined as follows (where $A$ is symmetric and positive definite),

    $f(x) = \tfrac{1}{2} x^T A x - b^T x + c,$

whose gradient, it is easy to see, is

    $\nabla f(x) = \tfrac{1}{2} A x + \tfrac{1}{2} A^T x - b = A x - b.$

Thus, to solve this minimization problem, we must solve $Ax = b$. For steepest descent, we know that our search direction is $-\nabla f(x) = b - Ax$, and we also know that the residual for any potential solution $x$ is given by

    $r = b - Ax = A x_{\mathrm{exact}} - A x = A(x_{\mathrm{exact}} - x) = A e,$

where $e = x_{\mathrm{exact}} - x$ is the error. Unfortunately, we very rarely know anything about our error $e$, but we do know that in steepest descent we would like to go in the direction $r$ until we are parallel to the level curves of our function, which is (as stated above) a one-dimensional minimization problem. That is, we would like the next gradient to be orthogonal to the current search direction $s_k$, i.e. $\nabla f(x_{k+1})^T s_k = 0$, which for this quadratic (where $s_k = r_k$) is the same as $r_{k+1}^T r_k = 0$. Recognizing that $r_k = b - A x_k$, we can solve for our update as follows:

    $0 = r_{k+1}^T r_k = (b - A x_{k+1})^T r_k = \big(b - A(x_k + \alpha r_k)\big)^T r_k = (b - A x_k)^T r_k - \alpha (A r_k)^T r_k = r_k^T r_k - \alpha\, r_k^T A r_k.$

In the end, this gives $\alpha = \dfrac{r_k^T r_k}{r_k^T A r_k}$. In short, here is the steepest descent method used to find the solution to the equation $Ax = b$:

    $r_k = b - A x_k$
    $\alpha = \dfrac{r_k^T r_k}{r_k^T A r_k}$
    $x_{k+1} = x_k + \alpha r_k.$

The more general formulation is

    $\alpha = \operatorname*{argmin}_t\, f(x_k - \nabla f(x_k)\, t)$
    $x_{k+1} = x_k - \alpha \nabla f(x_k).$
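As a sanity check on the step-size formula, here is a minimal numpy sketch (my own illustration, not part of the notes) of steepest descent with the exact line search $\alpha = r^T r / (r^T A r)$ for the quadratic case:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive definite A,
    using the exact line-search step alpha = r^T r / (r^T A r) derived above."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                   # residual = negative gradient
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        alpha = rr / (r @ (A @ r))      # exact minimizer along the direction r
        x = x + alpha * r
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.normal(size=(4, 4))
    A = M @ M.T + 4 * np.eye(4)         # symmetric positive definite
    b = rng.normal(size=4)
    x = steepest_descent(A, b, np.zeros(4))
    print(np.allclose(A @ x, b, atol=1e-8))
```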

4.1.1 Digression on Choosing Directions

As a thought experiment, suppose our initial guess had an error $e = x_{\mathrm{exact}} - x$ which is an eigenvector of $A$ with eigenvalue $\lambda$. This would give us

    $r = b - Ax = A x_{\mathrm{exact}} - Ax = A(x_{\mathrm{exact}} - x) = Ae = \lambda e.$

With the above, we see that

    $x_{k+1} = x_k + \frac{r_k^T r_k}{r_k^T A r_k}\, r_k = x_k + \frac{r_k^T r_k}{(\lambda e)^T A (\lambda e)}\, \lambda e = x_k + \frac{r_k^T r_k}{\lambda\, r_k^T r_k}\, \lambda e = x_k + e = x_k + (x_{\mathrm{exact}} - x_k) = x_{\mathrm{exact}}.$

What this tells us is that if we can pick an eigenvector of the matrix as our search direction, we can get our result immediately. That would be pretty neat.

4.2 Conjugate Gradient

Conjugate gradient is a search method that chooses its search directions not simply as the current residual, but by recalling its previous search directions and removing from its newest direction the components along them, which generally leads to a nice speedup. This correction makes the directions it searches conjugate; a more specific definition is that two vectors $x$ and $y$ are conjugate (with respect to a matrix $A$) if they are orthogonal in the inner product $\langle x, y \rangle_A = x^T A y$, that is, $x^T A y = 0$. To motivate this section, we are again going to be considering solving the system $Ax = b$, where $A$ is symmetric and positive definite.

Now, let us move on to trying to solve our minimization problem by picking orthogonal search directions. That is, we would like to pick directions $s_0, s_1, \ldots, s_k$, where $s_i^T s_j = 0$ for $i \ne j$. Recall that $x_{k+1} = x_k + \alpha s_k$, which implies (writing $e_k = x_{\mathrm{exact}} - x_k$ for the error, so that $r_k = A e_k$)

    $x_{\mathrm{exact}} - x_{k+1} = x_{\mathrm{exact}} - x_k - \alpha s_k, \qquad\text{i.e.}\qquad e_{k+1} = e_k - \alpha s_k.$  (3)

If we could go in one direction of search until all the remaining error was orthogonal to the search direction, we could easily take large steps in the search space toward the minimum of our function. The problem is that we do not know our error exactly, so we would like instead to eliminate the residual along the search direction. To this end, we make what are known as A-orthogonal steps rather than strictly orthogonal ones (we will see why in a bit). To eliminate the residual along $s_k$, we require

    $0 = r_{k+1}^T s_k = (A e_{k+1})^T s_k = \big(A(e_k - \alpha s_k)\big)^T s_k.$

From this, we can see that

    $\alpha = \frac{s_k^T A e_k}{s_k^T A s_k} = \frac{s_k^T r_k}{s_k^T A s_k}.$

This is called A-orthogonal because we would like to constrain our search so that $s_j^T A s_k = 0$ for all $j \ne k$.

Let us suppose that we would like to write our initial error as the scaled sum of a series of our search directions. That is, $e_0 = \sum_j a_j s_j$ (this sum takes into account all $n$ steps of the search). From this, we can derive some interesting properties of our error. First,

    $s_k^T A e_0 = s_k^T A \sum_j a_j s_j = \sum_j a_j\, s_k^T A s_j = a_k\, s_k^T A s_k + \sum_{j \ne k} a_j\, s_k^T A s_j = a_k\, s_k^T A s_k,$  (4)

where equation (4) follows because we are using A-orthogonal directions, so the summation over $j \ne k$ goes to zero. This tells us that (again using A-orthogonal directions)

    $a_k = \frac{s_k^T A e_0}{s_k^T A s_k} = \frac{s_k^T A \big(e_k + \sum_{j<k} \alpha_j s_j\big)}{s_k^T A s_k} = \frac{s_k^T A e_k}{s_k^T A s_k}.$

Here we used equation (3), $e_{k+1} = e_k - \alpha_k s_k$, from which we can see that

    $e_k = e_{k-1} - \alpha_{k-1} s_{k-1} = \cdots = e_0 - \sum_{j=0}^{k-1} \alpha_j s_j.$

Because of this, and because $r_k = A e_k$, we have

    $a_k = \frac{s_k^T A e_k}{s_k^T A s_k} = \frac{s_k^T r_k}{s_k^T A s_k} = \alpha_k.$

This allows us to consider the error at the $k$th step of our search:

    $e_k = e_0 - \sum_{j=0}^{k-1} \alpha_j s_j = \sum_j a_j s_j - \sum_{j=0}^{k-1} \alpha_j s_j = \sum_{j \ge k} \alpha_j s_j.$

If we pre-multiply $e_k$ by one of our search directions transposed with $A$, in the case when $i \le k - 1$ we have

    $s_i^T A e_k = s_i^T r_k = s_i^T A e_0 - \sum_{j=0}^{k-1} \alpha_j\, s_i^T A s_j = a_i\, s_i^T A s_i - \alpha_i\, s_i^T A s_i = 0.$  (5)

Similarly, for $i \ge k$, we see that

    $s_i^T r_k = s_i^T A \sum_{j \ge k} \alpha_j s_j = \alpha_i\, s_i^T A s_i.$

Note also from the expression for $e_k$ above that if we find a search direction for every dimension of the space we are trying to optimize over, then our error after $n$ steps will go to zero.

We now have a lot of math that is really doing very little for us unless we can find our A-orthogonal directions to optimize along. We can use simple Gram-Schmidt orthogonalization (in the $A$ inner product) to remove the components along search directions already seen and so choose a new search direction. That is, for some $v$ not in the span of the previous directions, to get a new vector $s_k$ which is A-orthogonal to all of the earlier $s_j$, we use the following:

    $s_k = v - \sum_{j<k} \frac{v^T A s_j}{s_j^T A s_j}\, s_j.$

For $i \le k - 1$, we have

    $s_k^T A s_i = v^T A s_i - \sum_{j<k} \frac{v^T A s_j}{s_j^T A s_j}\, s_j^T A s_i = v^T A s_i - \frac{v^T A s_i}{s_i^T A s_i}\, s_i^T A s_i = 0,$

where the last equality follows because our previous search vectors $s_j$ have all been A-orthogonalized, so only the $j = i$ term of the sum survives. In the other case for $i$, that $i \ge k$, we see that

    $s_k^T r_i = v^T r_i - \sum_{j<k} \frac{v^T A s_j}{s_j^T A s_j}\, s_j^T r_i = v^T r_i,$  (6)

because, as we proved above in equation (5), $s_j^T r_i = 0$ for $j < i$. We now have a relatively tedious way to calculate our next search directions; armed with this, we can calculate the distance we should take steps in our search. Recalling that $\alpha_k = s_k^T r_k / (s_k^T A s_k)$, we see that $s_k^T r_k = v^T r_k$ implies that

    $\alpha_k = \frac{v^T r_k}{s_k^T A s_k}.$

Naturally, there is a simpler way to calculate the new search directions using the old ones than an expensive Gram-Schmidt orthogonalization step. Consider that

    $r_{k+1} = b - A x_{k+1} = b - A(x_k + \alpha_k s_k) = r_k - \alpha_k A s_k,$

so that $\alpha_k A s_k = r_k - r_{k+1}$. This tells us that

    $\alpha_k\, r_i^T A s_k = r_i^T r_k - r_i^T r_{k+1}.$

Because $v_k^T r_i = s_k^T r_i = 0$ when $k < i$ (see equations (6) and (5)), we know that if we simply set $v_k$ to be $r_k$, then the residuals become mutually orthogonal: $r_k^T r_i = 0$ for $k < i$. As such, setting $i = k + 1$ in the display above and using $r_{k+1}^T r_k = 0$, we see that

    $\alpha_k\, r_{k+1}^T A s_k = r_{k+1}^T r_k - r_{k+1}^T r_{k+1} = -r_{k+1}^T r_{k+1}, \qquad\text{so}\qquad r_{k+1}^T A s_k = -\frac{r_{k+1}^T r_{k+1}}{\alpha_k}.$

Naturally, when $i > k + 1$, we have $\alpha_k\, r_i^T A s_k = r_i^T r_k - r_i^T r_{k+1} = 0$. Taking the above two paragraphs and combining them into the update rule for our search directions (recalling that we set $v_k = r_k$), we have the following update rule:

    $s_{k+1} = r_{k+1} - \sum_{j \le k} \frac{r_{k+1}^T A s_j}{s_j^T A s_j}\, s_j = r_{k+1} + \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}\, s_k.$

The sum collapses to the $j = k$ term, since $\alpha_j\, r_{k+1}^T A s_j = 0$ except when $j = k$; for that term,

    $-\frac{r_{k+1}^T A s_k}{s_k^T A s_k} = \frac{r_{k+1}^T r_{k+1}}{\alpha_k\, s_k^T A s_k} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k},$

using $\alpha_k\, s_k^T A s_k = s_k^T r_k = r_k^T r_k$. Combining all of the above work for minimizing the function $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$, we have the following set of update rules for the conjugate gradient method:

    $s_0 = r_0 = b - A x_0$
    $\alpha_k = \dfrac{r_k^T r_k}{s_k^T A s_k}$
    $x_{k+1} = x_k + \alpha_k s_k$
    $r_{k+1} = r_k - \alpha_k A s_k$
    $s_{k+1} = r_{k+1} + \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}\, s_k.$

In the more general case, when we are trying to optimize an arbitrary function $f : \mathbb{R}^n \to \mathbb{R}$, we modify the method above a little. First, recall that for the case of the quadratic function with $A$, $b$, and $c$, we have $\nabla f(x) = Ax - b = -r$. As such, for an arbitrary $f$, we can see that $r_k$ corresponds to $-\nabla f(x_k)$, giving us the more general (Fletcher-Reeves) conjugate gradient method:

    $s_0 = -\nabla f(x_0)$
    $\alpha_k = \operatorname*{argmin}_t\, f(x_k + t\, s_k)$
    $x_{k+1} = x_k + \alpha_k s_k$
    $s_{k+1} = -\nabla f(x_{k+1}) + \dfrac{\nabla f(x_{k+1})^T \nabla f(x_{k+1})}{\nabla f(x_k)^T \nabla f(x_k)}\, s_k.$
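Here is a minimal numpy sketch (again my own illustration, not from the notes) implementing exactly the quadratic-case update rules above:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite),
    i.e. solve Ax = b, using the conjugate gradient update rules above."""
    x = x0.astype(float).copy()
    r = b - A @ x                        # initial residual
    s = r.copy()                         # initial search direction s_0 = r_0
    rr = r @ r
    for _ in range(2 * len(b)):          # n steps suffice in exact arithmetic
        if np.sqrt(rr) < tol:
            break
        As = A @ s
        alpha = rr / (s @ As)            # alpha_k = r^T r / (s^T A s)
        x += alpha * s
        r -= alpha * As                  # r_{k+1} = r_k - alpha_k A s_k
        rr_new = r @ r
        s = r + (rr_new / rr) * s        # s_{k+1} = r_{k+1} + beta_k s_k
        rr = rr_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.normal(size=(6, 6))
    A = M @ M.T + 6 * np.eye(6)
    b = rng.normal(size=6)
    print(np.allclose(A @ conjugate_gradient(A, b, np.zeros(6)), b))
```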

5 Perceptron Convergence

Foregoing, temporarily, any introduction to the perceptron algorithm, let us denote a hyperplane by $f(x) = \beta_1^T x + \beta_0 = 0$, or simply by $\beta^T x_{\mathrm{ext}} = 0$, where $x_{\mathrm{ext}} = [x,\, 1]$. If we have separability, that is, for some $\beta^*$,

    $y^{(i)} \beta^{*T} x_{\mathrm{ext}}^{(i)} > 0 \quad \forall\, i,$

we can scale $\beta^*$ accordingly to give us normalization for the points closest to the hyperplane. I do not have a great justification for this, but it is relatively (though somewhat unjustifiably) easy to see that if the above condition holds, then for some $C > 0$ and $z^{(i)} = x_{\mathrm{ext}}^{(i)} / \|x_{\mathrm{ext}}^{(i)}\|$, we have $y^{(i)} \beta^{*T} z^{(i)} \ge C$. Given this, we can simply divide both sides of the inequality by $C$, giving us a new $\beta^*$ such that $y^{(i)} \beta^{*T} z^{(i)} \ge 1$.

In the perceptron algorithm, we have a current estimate for $\beta$, let us call it $\beta_k$. The perceptron will identify a point $z^{(i)}$ that is misclassified and produce the update

    $\beta_{k+1} = \beta_k + z^{(i)} y^{(i)}.$

With this, we can bound the distance between $\beta_k$ and $\beta^*$, which implies that the algorithm will converge to a separating hyperplane. Consider the following:

    $\|\beta_{k+1} - \beta^*\|^2 = \|\beta_{k+1}\|^2 - 2\beta^{*T}\beta_{k+1} + \|\beta^*\|^2$
    $= \|\beta_k + z^{(i)} y^{(i)}\|^2 - 2\beta^{*T}\big(\beta_k + z^{(i)} y^{(i)}\big) + \|\beta^*\|^2$
    $= 2 y^{(i)} \beta_k^T z^{(i)} + \beta_k^T \beta_k + \|z^{(i)}\|^2 + \|\beta^*\|^2 - 2\beta^{*T}\beta_k - 2 y^{(i)} \beta^{*T} z^{(i)}$
    $\le 2 y^{(i)} \beta_k^T z^{(i)} + \|\beta_k\|^2 + \|\beta^*\|^2 - 2\beta^{*T}\beta_k + 1 - 2$
    $= \|\beta_k - \beta^*\|^2 - 1 + 2 y^{(i)} \beta_k^T z^{(i)}$
    $< \|\beta_k - \beta^*\|^2 - 1,$  (7)

where the fourth line follows because $\|z^{(i)}\|^2 = 1$ and $y^{(i)} \beta^{*T} z^{(i)} \ge 1$, and the last line, equation (7), follows because $\beta_k$ misclassified $z^{(i)}$, making the $y^{(i)} \beta_k^T z^{(i)}$ term negative. Noting this contraction, we see that the perceptron algorithm will converge to a separating hyperplane (in the case of separable data) in no more than $\|\beta_0 - \beta^*\|^2$ steps. Unfortunately, because the scaling needed to obtain the 1-margin separating hyperplane above can be very bad, this could be a huge number of steps. That, my friends, is why we use support vector machines.
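A minimal numpy sketch of the update analyzed above (the data-generating example and helper name are assumptions of mine, not from the notes):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron on normalized, extended inputs, as in the convergence argument:
    z_i = [x_i, 1] / ||[x_i, 1]||, and beta += y_i z_i whenever z_i is misclassified.
    Labels y_i are +/- 1."""
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    beta = np.zeros(Z.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for z_i, y_i in zip(Z, y):
            if y_i * (beta @ z_i) <= 0:      # misclassified (or on the hyperplane)
                beta += y_i * z_i
                mistakes += 1
        if mistakes == 0:                    # separating hyperplane found
            return beta
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, 2.0])
    X = rng.normal(size=(100, 2))
    y = np.where(X @ w_true + 0.5 > 0, 1, -1)
    X += 0.2 * y[:, None] * w_true / np.linalg.norm(w_true)   # enforce a margin
    beta = perceptron(X, y)
    Z = np.hstack([X, np.ones((100, 1))])
    print(np.all(y * (Z @ beta) > 0))        # all points correctly classified
```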

6 Duality of SVMs

In this section, we consider the Support Vector Machine and its dual. We will focus on the case of data that is not separable, so that we are trying to minimize a norm of the vector $w$ plus penalty terms for un-separated data points. That is, we are seeking

    $\operatorname*{argmin}_{w, b, \xi}\ \tfrac{1}{2} w^T w + c^T \xi$
    subject to  $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m$
                $\xi \ge 0.$

The Lagrangian for this is

    $L(w, b, \xi, \lambda, \beta) = \tfrac{1}{2} w^T w + c^T \xi + \sum_{i=1}^m \lambda_i \big(1 - \xi_i - y^{(i)}(w^T x^{(i)} + b)\big) - \beta^T \xi.$

Taking derivatives, we have

    $\nabla_w L = w - \sum_{i=1}^m \lambda_i y^{(i)} x^{(i)}$
    $\frac{\partial L}{\partial b} = -\sum_{i=1}^m \lambda_i y^{(i)}$
    $\nabla_\xi L = c - \lambda - \beta.$

Setting each of the above to zero and solving, we have $\lambda_i = c_i - \beta_i$, which, because $\beta_i \ge 0$ for the optimal solution (by dual feasibility), implies that $0 \le \lambda_i \le c_i$ in the dual. We also have $w = \sum_{i=1}^m \lambda_i y^{(i)} x^{(i)}$ and the constraint that, in the dual, $\sum_{i=1}^m \lambda_i y^{(i)} = 0$. Substituting back into the Lagrangian,

    $L = \tfrac{1}{2}\Big(\sum_{i=1}^m \lambda_i y^{(i)} x^{(i)}\Big)^T \Big(\sum_{j=1}^m \lambda_j y^{(j)} x^{(j)}\Big) + c^T \xi + \sum_{i=1}^m \lambda_i - \lambda^T \xi - \beta^T \xi - \sum_{i=1}^m \lambda_i y^{(i)}\Big(\sum_{j=1}^m \lambda_j y^{(j)} (x^{(j)})^T x^{(i)} + b\Big).$

After some simplification and using the Lagrange condition from the $\xi$ derivative ($c - \lambda - \beta = 0$, so the $\xi$ terms cancel), we have that our dual optimization problem becomes

    $\operatorname*{argmax}_\lambda\ \sum_{i=1}^m \lambda_i - \tfrac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}$
    s.t.  $0 \le \lambda \le c$
          $\sum_{i=1}^m \lambda_i y^{(i)} = 0,$

which can be optimized using pairwise coordinate ascent on the $\lambda_i$'s (as in the SMO algorithm).

7 Projection of a vector onto linear constraints

Let us suppose that we would like to project a vector $z$ onto the set $\{x : Ax = b\}$. Thus, we would like to find the $x$ that minimizes $\|z - x\|^2$ subject to $Ax = b$. First, we get our Lagrangian

    $L(x, \lambda) = z^T z - 2 x^T z + x^T x + \lambda^T (Ax - b),$

whose derivative we take with respect to $x$. This gives us $\nabla_x L(x, \lambda) = -2z + 2x + A^T \lambda$. Setting this to zero, we find that $x = z - A^T \lambda / 2$. Plugging this into the Lagrangian above, we have

    $L(\lambda) = z^T z - 2\big(z - \tfrac{1}{2}A^T\lambda\big)^T z + \big(z - \tfrac{1}{2}A^T\lambda\big)^T\big(z - \tfrac{1}{2}A^T\lambda\big) + \lambda^T\Big(A\big(z - \tfrac{1}{2}A^T\lambda\big) - b\Big)$
    $= z^T z - 2 z^T z + \lambda^T A z + z^T z - \lambda^T A z + \tfrac{1}{4}\lambda^T A A^T \lambda + \lambda^T A z - \tfrac{1}{2}\lambda^T A A^T \lambda - \lambda^T b$
    $= \lambda^T A z - \tfrac{1}{4}\lambda^T A A^T \lambda - \lambda^T b.$

The derivative of the above with respect to $\lambda$ is

    $\nabla_\lambda L(\lambda) = A z - \tfrac{1}{2} A A^T \lambda - b,$

and setting this to zero, we have

    $\lambda = 2 (A A^T)^{-1} (A z - b).$

Finally, plugging this into $x = z - A^T \lambda / 2$, we have

    $x = z - A^T (A A^T)^{-1} A z + A^T (A A^T)^{-1} b.$

Thus, the matrix projecting $z$ onto the null space of $A$ is $I - A^T (A A^T)^{-1} A$, and the projection onto $\{x : Ax = b\}$ is obtained from the null-space projection by simply adding $A^T (A A^T)^{-1} b$, which is a particular solution of $Ax = b$.
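The closed form is easy to implement; here is a small numpy sketch (my own, assuming $A$ has full row rank so that $A A^T$ is invertible):

```python
import numpy as np

def project_onto_affine(z, A, b):
    """Project z onto {x : Ax = b} via the closed form derived above:
    x = z - A^T (A A^T)^{-1} (A z - b).  Assumes A has full row rank."""
    lam = np.linalg.solve(A @ A.T, A @ z - b)   # (A A^T)^{-1} (A z - b)
    return z - A.T @ lam

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.normal(size=(2, 5))
    b = rng.normal(size=2)
    z = rng.normal(size=5)
    x = project_onto_affine(z, A, b)
    w = np.linalg.lstsq(A, b, rcond=None)[0]    # some other feasible point
    print(np.allclose(A @ x, b))                                    # x is feasible
    print(np.linalg.norm(z - x) <= np.linalg.norm(z - w) + 1e-12)   # and at least as close as w
```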

8 Minimization of a linear function subject to elliptical constraints

In this section, we consider the following problem, where $A \in S^n_{++}$ (symmetric positive definite):

    minimize  $c^T x$
    subject to  $x^T A x \le 1.$

First, we form the Lagrangian:

    $L(x, \nu) = c^T x + \nu (x^T A x - 1).$

Taking the gradient with respect to $x$, we have

    $\nabla_x L(x, \nu) = c + 2 \nu A x,$

so that $x = -\frac{1}{2\nu} A^{-1} c$. We can solve assuming the constraint holds with equality because, unless $c = 0$, the optimal value will be on the edge of the ellipsoid $x^T A x \le 1$. Plugging this into the constraint, we have

    $\frac{1}{4\nu^2}\, c^T A^{-1} A A^{-1} c = 1, \qquad\text{so}\qquad \nu = \pm \tfrac{1}{2} \sqrt{c^T A^{-1} c}.$

Plugging all this in, we find that the candidate assignments to $x$ and the corresponding minimum and maximum values are

    $x = \mp \frac{A^{-1} c}{\sqrt{c^T A^{-1} c}}, \qquad c^T x = \mp \frac{c^T A^{-1} c}{\sqrt{c^T A^{-1} c}} = \mp \sqrt{c^T A^{-1} c} = \mp \|A^{-1/2} c\|,$

so the minimum value is $-\sqrt{c^T A^{-1} c}$, attained at $x = -A^{-1} c / \sqrt{c^T A^{-1} c}$.

9 KL Divergence for Gaussian distributions

First, recall that the KL divergence between two distributions $P$ and $Q$ is defined as

    $D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_P\Big[\log \frac{P}{Q}\Big].$

Also, the density function for a multivariate Gaussian (normal) distribution with mean $\mu$ and covariance matrix $\Sigma$ is

    $p(x) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\Big(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big).$

Now, consider two multivariate Gaussians in $\mathbb{R}^n$, $P_1$ and $P_2$. We have

    $D(P_1 \,\|\, P_2) = \mathbb{E}_{P_1}[\log P_1 - \log P_2]$
    $= \tfrac{1}{2} \mathbb{E}_{P_1}\big[-\log \det \Sigma_1 - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \log \det \Sigma_2 + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)\big]$
    $= \tfrac{1}{2} \log \frac{\det \Sigma_2}{\det \Sigma_1} + \tfrac{1}{2} \mathbb{E}_{P_1}\big[-(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)\big]$
    $= \tfrac{1}{2} \log \frac{\det \Sigma_2}{\det \Sigma_1} + \tfrac{1}{2} \mathbb{E}_{P_1}\big[-\mathrm{tr}\big(\Sigma_1^{-1} (x - \mu_1)(x - \mu_1)^T\big) + \mathrm{tr}\big(\Sigma_2^{-1} (x - \mu_2)(x - \mu_2)^T\big)\big]$
    $= \tfrac{1}{2} \log \frac{\det \Sigma_2}{\det \Sigma_1} + \tfrac{1}{2} \mathbb{E}_{P_1}\big[-\mathrm{tr}\big(\Sigma_1^{-1} \Sigma_1\big) + \mathrm{tr}\big(\Sigma_2^{-1} (x x^T - 2 x \mu_2^T + \mu_2 \mu_2^T)\big)\big]$
    $= \tfrac{1}{2} \log \frac{\det \Sigma_2}{\det \Sigma_1} - \tfrac{1}{2} n + \tfrac{1}{2} \mathrm{tr}\Big(\Sigma_2^{-1} \big(\Sigma_1 + \mu_1 \mu_1^T - 2 \mu_2 \mu_1^T + \mu_2 \mu_2^T\big)\Big)$
    $= \tfrac{1}{2} \Big(\log \frac{\det \Sigma_2}{\det \Sigma_1} - n + \mathrm{tr}\big(\Sigma_2^{-1} \Sigma_1\big) + \mu_1^T \Sigma_2^{-1} \mu_1 - 2 \mu_1^T \Sigma_2^{-1} \mu_2 + \mu_2^T \Sigma_2^{-1} \mu_2\Big)$
    $= \tfrac{1}{2} \Big(\log \frac{\det \Sigma_2}{\det \Sigma_1} - n + \mathrm{tr}\big(\Sigma_2^{-1} \Sigma_1\big) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1)\Big),$

where we used $\mathbb{E}_{P_1}[(x - \mu_1)(x - \mu_1)^T] = \Sigma_1$, $\mathbb{E}_{P_1}[x x^T] = \Sigma_1 + \mu_1 \mu_1^T$, and the cyclic property of the trace.
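Here is a small numpy/scipy sketch (mine, not from the notes) of the closed form above, with a Monte Carlo estimate of $\mathbb{E}_{P_1}[\log p_1 - \log p_2]$ as a sanity check:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_kl(mu1, S1, mu2, S2):
    """KL( N(mu1, S1) || N(mu2, S2) ) using the closed form derived above."""
    n = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - n
                  + np.trace(S2_inv @ S1) + d @ S2_inv @ d)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    mu1, mu2 = rng.normal(size=3), rng.normal(size=3)
    B1, B2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    S1, S2 = B1 @ B1.T + np.eye(3), B2 @ B2.T + np.eye(3)

    # Monte Carlo estimate of E_{P1}[log p1(x) - log p2(x)] for comparison.
    x = rng.multivariate_normal(mu1, S1, size=200_000)
    mc = np.mean(multivariate_normal(mu1, S1).logpdf(x)
                 - multivariate_normal(mu2, S2).logpdf(x))
    print(gaussian_kl(mu1, S1, mu2, S2), mc)
```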
