# Derivative Free Optimization

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Department of Mathematics Derivative Free Optimization M.J.D. Powell LiTH-MAT-R--2014/02--SE

3 Three lectures 1 on Derivative Free Optimization M.J.D. Powell (University of Cambridge) Summary During the last ten years, the speaker has developed Fortran software for minimizing general objective functions when first derivatives are not available. He has released two packages, namely NEWUOA and UOBYQA, for the unconstrained case and for simple upper and lower bounds on the variables, respectively. His current research addresses general linear constraints on the variables, which is intended to provide the new LINCOA package. These algorithms require only of magnitude n squared operations on a typical iteration, where n is the number of variables, which has allowed them to be applied when n is in the hundreds. No attention is given to sparsity. Some excellent numerical results have been achieved, which certainly depend on the use of the symmetric Broyden updating formula. That formula when derivatives are not available is the subject of Lecture 1. It supplies the solution to the following variational problem. Some values of the objective function and a quadratic approximation to the objective function are given, where usually the approximation does not interpolate all the given function values. A new quadratic approximation is required that does interpolate the given function values. The freedom that remains in the new approximation is taken up by making as small as possible the Frobenius norm of the change to the second derivative matrix of the approximation. A trust region in the space of the variables is the set of all points that are within a prescribed distance of the best point so far. A typical change to the vector of variables is a move from the best point so far to a new point within the trust region, the new point being chosen to provide a relatively small value of the current quadratic approximation to the objective function. The calculation of the new point is the subject of Lecture 2. A version of truncated conjugate gradients is employed, in order that the amount of work is only of magnitude n 2. Several decisions had to be taken to specify an algorithm. They are particularly interesting for the LINCOA software, due to the linear constraints on the variables. Lectures 1 and 2 describe vital ingredients of the software that has been mentioned. A few more details are addressed in Lecture 3, to complete an outline of how the algorithms work. The numerical results from several test problems are considered too. The accuracy and efficiency are much better than expected if one takes the view that the second derivative matrix of the quadratic approximation should become close to the second derivative matrix of the objective function. We find clearly that this view is wrong. Also the numerical results show that good accuracy is achieved eventually, although the number of iterations is sometimes very sensitive to effects from computer rounding errors. 1 Presented at the University of Linköping, Sweden (January 2013).

4 1. The symmetric Broyden updating formula 1.1 Background and some notation The quadratic function Q(x) = c + g T x xt Hx, x R n, (1.1) is an approximation to the objective function F(x), x R n, of a general optimization calculation. Q is revised as the calculation proceeds. Steps in the space of the variables that reduce Q(x) are expected to reduce F(x) when the steps are sufficiently small. We address the situation when no derivatives of F are available. Therefore Q( ) has to be constructed from values of F( ). We assume that the general optimization calculation is running iteratively, all starting procedures being complete. The symmetric Broyden updating formula is a way of changing Q( ) that gives attention to each new value of F( ) that is computed. Let Q ( ) and Q + ( ) be the quadratics Q( ) before and after a change is made, and let the purpose of the change be to achieve the conditions Q + (y j ) = F(y j ), j =1, 2,...,m, (1.2) after the points y j, j = 1, 2,...,m, have been chosen automatically by the optimization algorithm. The positions of these points must be such that the conditions (1.2) can be satisfied by a quadratic Q + ( ) for any values of the right hand sides. Let the quadratic Q + (x) = c + + g T + x xt H + x, x R n, (1.3) be suitable. If m is less than 1 (n+1)(n+2), then the conditions (1.2) leave 2 some freedom in c +, g + and H +. The points must also have the property that, if H + = 2 Q + is fixed for any suitable Q +, then the conditions (1.2) do not allow any changes to c + and g +. This last requirement is equivalent to the zero function being the only polynomial of degree at most one that vanishes at all the points y j, j =1, 2,...,m, which implies m n+1. A typical value of m is 2n+1. We reserve the notation Q ( ) for the change in Q( ), which is the quadratic polynomial Q (x) = Q + (x) Q (x) = c + g T x xt H x, x R n, (1.4) say. The equations (1.2) imply that it satisfies the conditions Q (y j ) = F(y j ) Q (y j ), j =1, 2,...,m. (1.5) All but one of the right hand sides (1.5) are zero in the usual setting when Q ( ) interpolates m 1 of the values F(y j ), j =1, 2,...,m. 2

5 The symmetric Broyden updating method is highly successful at fixing the 1 (n+1)(n+2) m degrees of freedom that remain in Q 2 ( ) after satisfying the constraints (1.5). This freedom is taken up by minimizing H F, where the subscript F denotes the Frobenius matrix norm. In other words, the sum of squares of the elements of the second derivative matrix H = 2 Q + 2 Q is made as small as possible. It follows from the strict convexity of the Frobenius norm that the symmetric Broyden method defines H uniquely, so we can regard H as fixed. There is no remaining freedom in Q ( ), due to the second assumption above on the positions of the points y j, j =1, 2,...,m. Strong advantages of the symmetric Broyden formula are exposed in the theory of Section 1.2 and in the numerical results of Section 1.3 below. The Lagrange functions Λ l ( ), l = 1, 2,...,m, of the interpolation equations (1.5) are going to be useful in Section 1.4 and in Lecture 3. They are defined to be quadratic polynomials such that 2 Λ l F is least subject to the conditions Λ l (y j ) = δ jl, 1 j m, 1 l m, (1.6) where δ jl is the Kronecker delta. Thus Λ l ( ) is the Q ( ) that is given by the symmetric Broyden method when expression (1.5) has the right hand sides F(y j ) Q (y j )=δ jl, j =1, 2,...,m. 1.2 The projection property Let Q be the set of quadratic polynomials with n variables that satisfy the conditions (1.2) on Q + ( ), and, for any two polynomials S and T, let the Frobenius norm 2 S 2 T F be regarded as the distance between S and T. Then the choice of Q + ( ) by the symmetric Broyden method can be regarded as a projection of Q ( ) into Q. Least squares projections and inner product spaces are relevant, because the Frobenius norm of an n n matrix is exactly the Euclidean norm of the vector in R n2 whose components are the elements of the matrix. We are going to find that some elementary theory of projections establishes a powerful relation between the second derivative matrices 2 Q and 2 Q +. We introduce the notation A B = n i=1 nj=1 A ij B ij (1.7) for the scalar product between any two n n matrices A and B, which gives the equation 2 Q + 2 Q 2 F = 2 Q + 2 Q 2 Q + 2 Q. (1.8) Moreover, if Q is a general element of Q, then Q + +α(q Q + ) is in the set Q for every α in R. It follows that the symmetric Broyden formula has the property that the least value of the expression 2 Q + + α( 2 Q 2 Q + ) 2 Q 2 Q + + α( 2 Q 2 Q + ) 2 Q (1.9) 3

6 occurs at α=0, which is the condition 2 Q 2 Q + 2 Q + 2 Q = 0. (1.10) Thus we deduce the identity 2 Q 2 Q 2 F = 2 Q 2 Q 2 Q 2 Q = ( 2 Q 2 Q + ) + ( 2 Q + 2 Q ) ( 2 Q 2 Q + ) + ( 2 Q + 2 Q ) = 2 Q 2 Q + 2 Q 2 Q Q + 2 Q 2 Q + 2 Q = 2 Q 2 Q + 2 F + 2 Q + 2 Q 2 F. (1.11) It follows that the updating formula is highly suitable if the objective function F is quadratic. Then it is trivial that F satisfies the conditions (1.2) on Q + ( ), so F is an element of the set Q. Thus the identity (1.11) implies the relation 2 F 2 Q + 2 F = 2 F 2 Q 2 F 2 Q + 2 Q 2 F. (1.12) Therefore, when a sequence of quadratic models is generated iteratively, the symmetric Broyden formula being applied to update Q on each iteration, then the sequence of second derivative errors 2 F 2 Q F decreases monotonically. One should not become enthusiastic, however, about the last conclusion. If 2 F 2 Q F remains large, then monotonic decreases may not provide any gains over errors of the same magnitude that both increase and decrease occasionally. On the other hand, if these errors have to become small to achieve fast convergence, then one is faced with the task of obtaining good accuracy in 1 2 n(n+1) independent second derivatives, which is likely to require much effort when there are many variables with no derivatives available explicitly. Equation (1.12) shows also that, if F is quadratic, then 2 Q + 2 Q F converges to zero as the iterations proceed. This seems to be the main reason for the excellent success of the symmetric Broyden updating formula in practice. One expects the size of 2 Q + 2 Q F = 2 Q F to be related closely to the size of the right hand sides of the conditions (1.5). Thus the limit Q + 2 Q F 0 suggests that, as the iterations proceed, the quadratic models provide better and better predictions of the new values F(y j ) that are actually calculated. It is also likely that 2 Q + 2 Q F becomes small when F is a smooth function that is not quadratic. 1.3 An example of efficiency The objective function that has been employed most by the author in the development of his software is the sum of squares 2 2n n F(x) = b i [S ij sin(x j /σ j ) + C ij cos(x j /σ j ) ], x R n. (1.13) i=1 j=1 4

7 n Numbers of calculations of F (#F) Case 1 Case 2 Case 3 Case 4 Case Table 1: NEWUOA applied to the test problem (1.13) All the parameters are generated randomly, subject to the least value of F( ) being zero at a known point x. The software requires an initial vector of variables, x beg say, and values of the initial and final steplengths that are going to be employed in the optimization calculation. In the tests of Table 1, x beg is a random point from the hypercube {x : x j x j 0.1πσ j, j =1, 2,...,n}, (1.14) where the scaling factors σ j are selected randomly from [1, 10]. The prescribed final steplength is 10 6, and the number of conditions in expression (1.5) is set to m=2n+1. There are no constraints on the variables. Different random numbers provide five cases for each n. The NEWUOA software of the author, that updates quadratic models by the symmetric Broyden method, was applied to these test problems for n = 10, 20, 40, 80, 160 and 320. The total number of calculations of the objective function, namely #F, is shown for each case in Table 1. Good accuracy is achieved, the greatest value of x fin x throughout the experiments being less than , where x fin is the final calculated vector of variables. The table suggests that the growth of #F as n increases is about linear. On the other hand, a quadratic polynomial with n variables has 1 (n+1)(n+2) degrees 2 of freedom, this number being when there are 320 variables. The numbers in the last line of Table 1 are much smaller, however, which suggests that the use of the symmetric Broyden method is very efficient for minimizing the function (1.13). 1.4 The amount of work The NEWUOA and BOBYQA software packages, developed by the author, have the useful property that the amount of work of a typical iteration is only of magnitude n 2. Therefore they have been applied to optimization problems with hundreds of variables. The property is remarkable, because usually the symmetric Broyden method change all the elements of 2 Q, and the number of elements is 5

8 n 2. It requires the number m of the conditions (1.5) to be of magnitude n, frequent choices by the author being m=2n+1 and m=n+6, for example. Some details of the implementation of the symmetric Broyden method are given below. Occasional shifts of the origin of R n are necessary in practice to prevent much damage from computing rounding errors. Therefore we write the quadratic (1.4), which is the change to Q( ), in the form Q (x) = ĉ + ĝ T (x x b ) (x x b) T H (x x b ), x R n, (1.15) where the current origin x b has been chosen already, and where the number ĉ, the gradient ĝ R n and the n n matrix H have to be calculated. We have to minimize H 2 F subject to the constraints ĉ + ĝ T (y j x b ) (y j x b) T H (y j x b ) = r j, j 1, 2,...,m, (1.16) the numbers r j =F(y j ) Q (y j ), j =1, 2,...,m, being the right hand sides of the equations (1.5). First order conditions for the solution of this problem state the existence of multipliers µ j, j =1, 2,...,m, such that the function 1 H 4 2 F m j=1 µ j { ĉ + ĝ T (y j x b ) (y j x b) T H (y j x b ) r j } (1.17) is stationary when ĉ, ĝ and H are optimal. By differentiating this function with respect to ĉ, ĝ and the elements of H, we find the equations mj=1 µ j = 0, mj=1 µ j (y j x b ) = 0, (1.18) and (H ) pq m j=1 µ j (y j x b ) p (y j x b ) q = 0, 1 p n, 1 q n. (1.19) Multipliers are not required to force the symmetry of H, because expression (1.19) gives symmetry automatically. Specifically, this expression provides the matrix H = m i=1 µ i (y i x b )(y i x b ) T. (1.20) The form (1.20) of H is crucial, because it reduces the calculation of 1 2 n(n+1) elements of H to the calculation of only m parameters µ j, j =1, 2,...,m. After substituting the form (1.20) into the constraints (1.16), we find that the conditions (1.16) and (1.18) supply m+n+1 linear equations that have to be satisfied by ĉ R, ĝ R n and µ R m. These equations are the system A e Y T µ r e T 0 0 ĉ = 0, (1.21) Y 0 0 ĝ 0 6

9 where A is the m m matrix that has the elements A ij = 1 2 {(y i x b) T (y j x b ) } 2, 1 i m, 1 j m, (1.22) where all the components of e R m are one, and where the n m matrix Y has the columns y j x b, j =1, 2,...,m. The need for the origin shift x b can be deduced from the elements (1.22). Typical behaviour towards the end of an optimization calculation is that the distances y i y j, 1 i < j m, are of magnitude 10 6 but the lengths y j, j =1, 2,...,m, are of magnitude one. Then x b is chosen to be among the points y j, j = 1, 2,...,m, in order that floating point arithmetic can provide detail of good quality in the elements (1.22), which have magnitude This fine detail would be lost without the origin shift, because then the elements of A would be of magnitude one. Let W be the (m+n+1) (m+n+1) matrix of the system (1.21) and let Ω be the inverse of W. The inverse matrix Ω is employed by the NEWUOA and UOBYQA software. The availability of Ω allows the solution of the equations (1.21) to be generated in of magnitude n 2 operations as required, m being of magnitude n. On each iteration of NEWUOA and UOBYQA, only one of the points y j, j =1, 2,...,m, is changed. Let y t be replaced by y +, without altering the shift of origin x b. Then the form of the change to the matrix W of the system (1.21) is that W has a new t-th column, w + t say, and a new t-th row w +T t, but the other elements of W all remain the same. It follows that the inverse matrix of the new W can be calculated by updating the old inverse matrix Ω in only of order n 2 operations. We employ the vector w R m+n+1 that has the components w i = 1{(y x 2 i b) T (y + x b )} 2 }, i=1, 2,...,m, (1.23) w m+1 = 1, and w m+i+1 = (y + x b ) i, i=1, 2,...,n, which are independent of t. The new column w + t is the same as w, except that the new diagonal element W tt + takes the value 1 2 y+ x b 4. It is shown in the paper Least Frobenius norm updating of quadratic models that satisfy interpolation conditions (Math. Program. B, Vol. 100, pp , 2004), written by MJDP, that the new matrix Ω is the expression Ω + = H + σ 1 t [ αt (e t Ωw)(e t Ωw) T β t Ωe t e T t Ω + τ t { Ωet (e t Ωw) T + (e t Ωw)e T t Ω }], (1.24) where e t is the t-th coordinate vector in R m+n+1, and where the parameters of the formula are the numbers α t = e T t Ωe t, β t = 1 2 y+ x b 4 w T } Ωw, (1.25) τ t = w T Ωe t, and σ t = α t β t + τt 2. We wish to prevent the matrix of the system (1.21) from becoming nearly singular. The change to the matrix causes singularity if and only if some elements 7

10 of Ω + are unbounded, which is equivalent to σ t =0 in formula (1.24). Therefore, if y + has been chosen already, and if the software has to pick the point y t that is dropped from {y i : i=1, 2,...,m} to make room for y +, then σ t is calculated, using the equations (1.25), for all integers t in [1,m], which requires only of magnitude n 2 operations, partly because β t is independent of t. Then t is chosen to provide a relatively large value of σ t. Alternatively, the point y t that is going to be dropped may be selected before y + is chosen. Again we require σ t = α t β t +τ 2 t to be relatively large. We assume we can restrict attention to τ t, because both α t and β t are nonnegative in theory. Therefore we consider the dependence of τ t = w T Ωe t on y +, where w R m+n+1 has the components (1.23). We make use of the Lagrange function Λ t ( ), which is introduced at the end of Section 1.1. We recall that Λ t ( ) is the function (1.15) if the vector r of the system (1.21) has the components r j = δ jt, j = 1, 2,...,m. By multiplying this system by the inverse matrix Ω, we find that the parameters µ, ĉ and ĝ of Λ t ( ) are the elements of Ωe t, which makes Λ t ( ) available to the software. Equations (1.15) and (1.20) with these values of the parameters give the formula Λ t (y + ) = ĉ + ĝ T (y + x b ) mi=1 µ i {(y i x b ) T (y + x b )} 2. (1.26) We replace parts of this formula by the components of w in expression (1.23). The term ĉ is the product ĉw m+1, the term ĝ T (y + x b ) is the scalar product of ĝ with the n component vector at the end of w, and the remainder of formula (1.26) is the scalar product of µ with the m component vector at the beginning of w. Thus the right hand side of equation (1.26) is the scalar product w T Ωe t, which establishes that the dependence of τ t =w T Ωe t on y + is exactly the dependence of Λ t (y + ) on y +. Therefore, when t has been chosen and a vector y + is required, the software avoids tendencies towards singularity in the system (1.21) by seeking a relatively large value of Λ t (y + ). Another feature of the inverse matrix Ω is that in theory its leading m m submatrix is positive semi-definite and has rank m n 1. Therefore this submatrix can be expressed as the product ZZ T, where the matrix Z has m rows and only m n 1 columns. The software stores and updates Z instead of storing and updating the leading m m submatrix of Ω, because this technique avoids some substantial damage from computing rounding errors in the sequence of calculated matrices Ω. 8

11 2. Trust region calculations 2.1 Notation x R n is the vector of variables. is the trust region radius. Assuming without loss of generality that the centre of the trust region is at the origin, the trust region is the set {x : x }, where the vector norm is Euclidean. We try to minimize approximately the quadratic Q(x) = g T x xt Hx, x R n, (2.1) within the trust region, given g R n and the n n symmetric matrix H. If H is not known explicitly, then the product Hv can be calculated for any v R n in of magnitude n 2 operations. The variables x R n may have to satisfy given linear constraints a T j x b j, j =1, 2,...,m, (2.2) and then they hold at the trust region centre (i.e. all the components of b R m are nonnegative). We want the amount of work to be of a magnitude that is much less than n 3 when n is large. Therefore we may restrict x to a p-dimensional linear subspace of R n where p is a small positive integer, beginning with p=1, and p is increased in a way that gives an algorithm analogous to the truncated conjugate gradient procedure. Let the vectors v i, i = 1, 2,...,p, be an orthonormal basis of the linear subspace, which means that the conditions v T i v j = δ ij, 1 i p, 1 j p, (2.3) are satisfied, δ ij being the Kronecker delta. The variables of the reduced trust region subproblem are α 1,α 2,...,α p, and a relatively small value of the quadratic Q(V α) = (V T g) T α αt (V T HV )α, α R p, (2.4) is required subject to α and the constraints (2.2) on x = V α, where V is the n p matrix with the columns v i, i=1, 2,...,p. The notation ε is reserved for a tolerance parameter of the trust region calculation. We seek a feasible vector x within the trust region such that Q(x) is within about ε of the least value of Q that can be achieved, and sometimes x is restricted to the column space of V. 2.2 The problem without linear constraints The following conditions are necessary and sufficient for x to solve the given trust region problem exactly. There exists a nonnegative parameter λ such that x satisfies the equation (H + λi)x = g, (2.5) 9

12 and such that the matrix H+λI has no negative eigenvalues, where I is the n n unit matrix. Furthermore, the bound x has to hold, with equality if λ is positive. The proof of this assertion is straightforward when H is diagonal, which does not lose generality. One finds too that x(λ), λ > λ, decreases monotonically, where λ is the least nonnegative number such that H+λ I is positive definite or positive semi-definite. If λ=0 does not supply the solution, it is usual to adjust λ by trying to satisfy the equation x(λ) 2 = 2 or x(λ) 1 = 1. (2.6) Let Ω be any n n orthogonal matrix. The system (2.5) can be written in the form (ΩHΩ T + λi) (Ωx) = Ωg, (2.7) which allows Ωx to be calculated for any λ>λ, and then we employ the property x = Ωx. Before adjusting λ, one may choose Ω so that ΩHΩ T is a tridiagonal matrix, because then the solution of the system (2.7) requires only of magnitude n operations for each λ. Let x(λ) be calculated with λ>λ, and let η be any vector that provides the length x(λ)+η =. It follows from the identity (H + λi) {x(λ) + η} = g + (H + λi)η (2.8) that x(λ)+η is the exact solution of a perturbed trust region problem, the perturbation being the replacement of g by g (H+λI)η. This perturbation alters the optimal value of Q( ) by at most (H+λI)η. If H+λI is nearly singular, one can find η such that (H+λI)η is much less than η, which makes it less difficult to satisfy the inequality (H + λi)η ε, (2.9) where ε is the tolerance at the end of Section 2.1. Condition (2.9) with the length x(λ)+η = imply that x(λ)+η is a sufficiently accurate estimate of the solution of the trust region problem. 2.3 The p-dimensional problems The minimization of the quadratic (2.4) subject to α is a version of the calculation of the previous section, but p is going to be much less than n when n is large. In this section there are no linear constraints, and p runs through the positive integers until it seems that sufficient accuracy has been achieved. In the extreme case when g = Q(0) is zero, the vector x = 0 is returned immediately as the solution of the trust region problem, because our way of starting does not work. It is to set p=1 and v 1 = g/ g. Thus our first trust region problem is a 10

13 search along the steepest descent direction from the trust region centre as in the conjugate gradient method, which is solved exactly. Increases in p provide an iterative method. We let x p be the calculated solution of the trust region problem for each p, and we define x 0 = 0. The iterations are terminated if x p is not substantially better than x p 1 at providing a small value of Q( ), the actual test being the condition Q(x p 1 ) Q(x p ) 0.01 {Q(x 0 ) Q(x p )}. (2.10) Otherwise, we consider the gradient Q(x p )=g p, say. It provides the linear Taylor series approximation Q(x) Q(x p ) + (x x p ) T g p, x R n, (2.11) to Q( ), which has second order accuracy in a neighbourhood of x p. The iterations are terminated too if the inequality g p + g T p x p 0.01 {Q(x 0 ) Q(x p )} (2.12) holds, because then every move from x p within the trust region reduces the value of the linear Taylor series approximation by at most 0.01{Q(x 0 ) Q(x p )}. Otherwise p is increased by one for the next trust region problem. Its tolerance parameter ε is set to 0.01 times the reduction in Q that has been achieved so far. The orthonormal vectors v i, i=1, 2,...,p 1, are available already. We satisfy the new conditions (2.3) by defining v p by the formula v p = { g p 1 + p 1 i=1 (g T p 1 v i)v i }/ g p 1 + p 1 i=1 (g T p 1 v i)v i, (2.13) where g p 1 is the same as g p in the previous paragraph, due to the increase in p. Thus the vectors v i, i = 1, 2,...,p, span the same space as the gradients g i 1 = Q(x i 1 ), i = 1, 2,...,p. It follows that our iterations are the same as those of the conjugate gradient method, except that our steps can move round the trust region boundary to seek further reductions in Q( ), which may be very helpful when H = 2 Q has some negative eigenvalues. A huge advantage of formula (2.13) comes from the remark that v i, i = 1, 2,...,p, is an orthonormal basis of the Krylov subspace that is spanned by the vectors H i 1 g, i=1, 2,...,p. Indeed, it happens automatically that the second derivative matrix V T HV of the quadratic (2.4) is tridiagonal. Thus, in every p-dimensional trust region problem, the equations (2.5) can be solved easily for each choice of λ. Often the situation is less favourable, however, when there are linear inequality constraints on the variables. 11

14 2.4 Linear equality constraints It is not unusual, during the solution of an optimization problem with linear constraints, that some of the inequality constraints are satisfied as equations, with the boundaries of the other inequalities so far away that they can be ignored for a while. Therefore it is helpful to consider the case when the constraints become the linear equations a T j x = 0, j =1, 2,...,m, (2.14) the right hand sides being zero due to the assumptions that the trust region centre is the origin and that the constraints are satisfied there. We expect the value of m in expression (2.14) to be less than the value in Section 2.1. The constraint gradients a j, j = 1, 2,...,m, have to be linearly independent, which can be achieved if necessary by removing any redundancies from the conditions (2.14). In the special case when the gradients a j, j = 1, 2,...,m, are the first m coordinate vectors, the constraints (2.14) state that the first m components of x R n are fixed, and that the adjustment of the other n m components is unconstrained except for the trust region bounds. The procedure of the previous section can be employed to adjust the free variables. Instead of working with vectors of length n m, one can work with vectors of length n whose first m components are zero. They are generated automatically by the formulae of the previous section if the gradients g j 1 = Q(x j 1 ), j =1, 2,...,p, are replaced by Proj (g j 1 ), j =1, 2,...,p, where the operator Proj ( ) applied to any vector in R n replaces the first m components of the vector by zeros, leaving the other n m components unchanged. This method can also be used with a suitable Proj ( ) when the gradients a j, j = 1, 2,...,m, of the equality constraints (2.14) are general. Specifically, we let A be the n m matrix whose columns are a j, j = 1, 2,...,m, we construct the factorization A = QR, (2.15) where Q is an n m matrix with orthonormal columns and where R is m m and upper triangular, and we also construct an n (n m) matrix ˇQ such that the n n matrix ( Q ˇQ) is orthogonal. Then we let Proj ( ) be the operator Proj (v) = ˇQ ˇQ T v, v R n. (2.16) Because the columns of ˇQ are orthogonal to the columns of A, the vector (2.16) satisfies the constraints (2.14) for every v R n, and Proj ( ) is the orthogonal projection into the linear subspace of moves from the trust region centre that are allowed by the constraints. We now apply the procedure of the previous section with the replacement of the gradients g j 1 = Q(x j 1 ) by their projections Proj (g j 1 ) = ˇQ ˇQ T g j 1 = n i=m+1 (q T g )q, j =1, 2,...,p, (2.17) i j 1 i 12

15 where q i, i = m+1,m+2,...,n, are the columns of ˇQ. Thus, instead of the steepest descent direction at the trust region centre, the p = 1 move becomes a search along v 1 = ˇQ ˇQ T g / ˇQ ˇQ T g, and the later orthonormal vectors v p are given by formula (2.13) with g p 1 replaced by ˇQ ˇQ T g p 1. The matrix V T HV retains the tridiagonal property stated in the last paragraph of Section 2.3. It is explained later that active set methods pick subsets of the inequality constraints that are treated as equalities. The procedure of this section is useful for each choice of active set. 2.5 On choosing active sets The approximate minimization of the quadratic model (2.1), subject to the linear inequality constraints (2.2) and the bound x, begins by choosing an active set at the centre of the trust region. We take the view that the move x from the origin is substantial if its length is greater than 0.1, and we are content if the move along ˇQ ˇQ T g/ ˇQ ˇQ T g in the procedure of the previous section has this property. Therefore, when choosing the initial active set, we give attention to the j-th constraint if and only if its boundary can be reached by a move of length 0.1, which is the condition b j a T j x a j, (2.18) where x 0 is the trust region centre. We define the subset J of {1, 2,...,m} by including j in J if and only if inequality (2.18) holds. We require an active set with the property that the initial step along the projected steepest descent direction does not move closer to the boundaries of the constraints whose indices are in J, which is the property a T j ( ˇQ ˇQ T g) 0, j J. (2.19) A convenient calculation supplies this property automatically. It is the calculation of the vector s R n that minimizes the Euclidean norm s+g subject to a T j s 0, j J, which has a unique solution s, say. There is no acceptable trust region step if s is zero. Moreover, the j-th constraint influences s only if a T j s is zero, and s is the orthogonal projection of g into the linear space of vectors that are orthogonal to a j, j J, where J is the set J = J {j : a T j s = 0 }. (2.20) Indices can be removed from J if necessary until the vectors a j, j J, become linearly independent. We let the resultant set J be the initial active set, which means that the inequalities (2.2) are replaced by the equations a T j x = a T j x 0, j J, (2.21) 13

16 for a while, where x 0 is still the centre of the trust region. Let A be the matrix whose columns are a j, j J, let expression (2.15) be the QR factorization of A, and let ˇQ be a matrix such that ( Q ˇQ) is orthogonal as before. Then, because of the projection property that has been mentioned, we have s = ˇQ ˇQ T g, so the constraints a T j s 0, j J, on s = s are the conditions (2.19). It follows that a move from the trust region centre along the initial search direction v 1 = ˇQ ˇQ T g/ ˇQ ˇQ T g is not restricted by the boundaries of the inequality constraints until the step length along the search direction is greater than 0.1. There is a reason for writing x 0 explicitly as the trust region centre in expressions (2.18) and (2.21) although it is assumed to be at the origin. It is that a constraint violation may occur during the sequence of trust region calculations in Section 2.3 as the values of the integer p increase from p = 1, and the increases in p may not be complete. Then the active set may be revised by applying the procedure that has been described for a suitable x 0, but this x 0 cannot be the centre of the trust region at the origin. Instead it is the point x that provides the least known value of Q(x) so far subject to the constraints (2.2) and the bound x. This x is returned as the solution of the original trust region problem, if the method for choosing the next active set gives s =0. Otherwise, the method supplies a new active set J. The new linear equality constraints on x R n are the equations (2.21) with the new x 0 and the new J. Let S be the set of all vectors in R n that satisfy these constraints. It can happen that the original trust region centre at the origin is not in S. Then it is helpful to express a general vector x in S as the sum (x w)+w, where w is the point of S that is closest to the origin. It follows from the identity x 2 = x w 2 + w 2 that the trust region bound x can be written in the form x w 2 2 w 2, x S. (2.22) Therefore, after the changes in the active set, it is suitable to take the view that the trust region centre and the trust region radius are now w and ( 2 w 2 ) 1/2, respectively. Another consideration is that a search along the projected steepest descent direction ˇQ ˇQ T Q(w)/ ˇQ ˇQ T Q(w) may not provide a new value of Q( ) that is less than Q(x 0 ), where x 0 is still the point where the new active set has been chosen. Instead it is better to begin with p=2 instead of p=1 in the procedure of Section 2.3, letting v 1 and v 2 be an orthonormal basis of the two dimensional space spanned by w x 0 and ˇQ ˇQ T Q(x 0 ). Moreover, if the calculations before the change to the active set find that Q( ) has some directions of negative curvature, then linear combinations of these directions that are orthogonal to the new active constraint gradients a j, j J, may be very helpful for making further reductions in Q( ). An advantage of this approach is that some elements of the new matrix V T HV in expression (2.4) can be constructed from elements of the old V T HV, in order to avoid some multiplications of vectors in R n by H, these multiplications being expensive when n is huge. 14

17 2.6 On changing the active set New vectors of variables, given by trust region calculations, are never acceptable if they violate the constraints (2.2). Let x j, j =0, 1,...,p, be the sequence of vectors of variables that occurs in the active set version of the procedure of Section 2.3, where p 1, let the constraints be satisfied at x j, j < p, but let the move from x p 1 to x p introduce a constraint violation. The trust region calculation provides the strict reduction Q(x p ) < Q(x p 1 ). Usually x p would be shifted back on the line segment from x p 1 to x p to the point where the violation occurs. Then a new active set would be generated by the procedure in the previous section, its vector x 0 being the new x p. We expect a change to the active set, because x 0 is on the boundary of an inequality constraint that is not in the old active set. The following example in only two dimensions shows, however, that this technique may fail. We consider the problem Minimize Q(x) = 25x 1 4x 1 x 2 22x 2 2 subject to x }, (2.23) with the trust region centre at the origin as usual, and with = 1. The first active set is empty, because the distance from the origin to the boundary of the constraint is 0.2. Let the procedure in Section 2.3 begin with x 0 at the origin. Then the steepest descent direction v 1 = Q(x 0 )/ Q(x 0 ) is employed, this direction being the first coordinate direction. The move along this direction goes to the trust region boundary, where we find the values ( ) ( ) 1 25 x 1 =, Q(x 0 1 ) = and Q(x 4 1 ) = 25. (2.24) Next there is a search in a two dimensional space for the least value of Q(x) subject to x 1. This space has to be the full space R n because there are only two variables. The coefficients of Q are such that the search gives the values x 2 = ( ), Q(x 2 ) = ( ) and Q(x 1 ) = 31, (2.25) the fact that Q(x 2 ) is a negative multiple of x 2 being a sufficient condition for the solution of the trust region problem without any linear constraints. The constraint x is violated at x 2, however, so we consider the possibility of introducing the active set that supplies the equality constraint x 2 = 0.2. It is suggested at the beginning of this section that we pick the point x 0 for the automatic construction of the new active set by moving from x 1 in a straight line towards x 2 until we reach the constraint boundary, so we consider points of the form x 1 +α(x 2 x 1 ), 0 α 1. Then Q takes the values φ(α) = Q(x 1 + α(x 2 x 1 )) = α 12.8α 2, 0 α 1, (2.26) which include φ(0)= 25 and φ(1)= 31. The constraint boundary is at α=0.25, and the example has been constructed so that the corresponding value of Q, 15

18 namely φ(0.25)= 24.1, is greater than Q(x 1 )= 25. Therefore the introduction of an active set is questionable. Indeed, it would fail to supply the required vector of variables, because the least value of Q(x) subject to x 2 = 0.2 and x 1 is Q(x) = , the value of x 1 being It is better to pick x 1 = 0.6 and x 2 = 0.8, for example, because they satisfy the constraints, and they provide Q(x)= 27.16, which is smaller than before. A possible way out from this difficulty is to modify Q without changing the active set. We consider the situation of the example above where the trust region calculation gives Q(x 2 ) < Q(x 1 ), where x 1 satisfies the constraints, where x 2 is infeasible, and where Q(x 1 +α(x 2 x 1 )) is greater than Q(x 1 ) for the greatest value of α that retains feasibility. Then we may solve a new trust region problem using the old active set. Specifically, we add to Q a positive multiple of the quadratic function {(x x 1 ) T (x 2 x 1 )} 2, x R n, choosing the multiplier so that Q(x 2 ) becomes greater than Q(x 1 ). A new x 2 is generated by solving the new trust region problem. Let Q + ( ) and x + 2 be the new Q( ) and x 2. The trust region calculation and the change to Q supply Q + (x + 2 )<Q + (x 1 ) and Q + (x 2 )>Q + (x 1 ), respectively, so x + 2 is different from x 2. Furthermore, Q + (x + 2 ) < Q + (x 1 ) with the properties Q + (x 1 ) = Q(x 1 ) and Q + (x) Q(x), x R n, imply the reduction Q(x + 2 )<Q(x 1 ) in the original quadratic function Q. This technique can be applied recursively if necessary, but so far has not been tried in practice. The modification of Q is removed after the completion of the trust region calculation. 16

19 3. Software for minimization without derivatives 3.1 The problem and the required data The software seeks the least value of F(x), x R n, subject to x X. The objective function is specified by a subroutine that calculates F(x) for every x chosen by the software. The set X may be all of R n (NEWUOA), or there may be simple bounds l i x i u i, i=1, 2,...,n, on the variables (BOBYQA), or the user may supply general linear constraints a T j x b j, j =1, 2,...,m, (3.1) for the LINCOA software, which is still under development. The value of F(x) is required by BOBYQA only when x satisfies its bounds, but LINCOA needs F(x) sometimes when x is outside the feasible region. The user has to pick the number m of interpolation conditions Q(y i ) = F(y i ), i=1, 2,...,m, (3.2) that are satisfied by the quadratic models, subject to n+2 m 1 (n+1)(n+2). The 2 points y i are chosen automatically, except that the user has to supply an initial vector of variables x beg X. Also the user has to supply the initial and final values of the trust region radius, ρ beg and ρ end say, with ρ end ρ beg. Changes to the variables of length ρ beg should be suitable at the beginning of the calculation. Every trust region radius is bounded below by ρ end. The iterations are terminated when this bound seems to be preventing further progress. 3.2 Getting started The initial points y i, i = 1, 2,...,m, of NEWUOA include x beg and x beg +ρ beg e i, i=1, 2,...,n, where e i is the i-th coordinate direction in R n. The point x beg ρ beg e i or x beg +2ρ beg e i is included too for i=1, 2,...,min[m n 1,n]. When m>2n+1, the remaining points have the form x beg ± ρ beg e i ± ρ beg e j. The same points are picked by BOBYQA, except for slight adjustments if necessary to satisfy the bounds l i x i u i, i = 1, 2,...,n. The points of NEWUOA are also used by LINCOA, after adjustment if necessary, but now the purpose of the adjustment is not always to satisfy the constraints, as explained next. Throughout the calculation, we try to prevent every distance y i y j, i j, from becoming much less than the trust region radius, in order to restrict the contributions from errors in F(y i ) and F(y j ) to the derivatives of the quadratic models that interpolate these function values. The trust region centre, x k say, is one of the points {y i : i = 1, 2,...,m}, it satisfies the constraints, and it is the feasible point that has provided the least calculated value of F(x) so far. Every trust region step from x k to x k +d k, say, where d k is calculated by the method 17

20 of Lecture 2, is designed to be feasible and to provide a substantial reduction Q(x k +d k )<Q(x k ) in the current quadratic Q F. Let y i be feasible and different from x k. The properties Q(x k ) = F(x k ) F(y i ) = Q(y i ) with Q(x k +d k ) < Q(x k ) imply x k +d k y i, and the substantial reduction Q(x k +d k ) < Q(x k ) should keep x k +d k well away from y i. If y i is infeasible, however, then it is not a candidate for the trust region centre, so Q(y i )<Q(x k ) is possible, and the substantial reduction becomes irrelevant. We require another way of keeping x k +d k well away from y i. Our method is to ensure that, if y i is infeasible, then its distance to the boundary of the feasible region is at least one tenth of the trust region radius. Therefore, when LINCOA gives attention to each point y i in the previous paragraph, it shifts y i if necessary to a position that satisfies either all the constraints or at least one of the conditions a T j y i b j + 0.1ρ beg a j, j =1, 2,...,m. (3.3) LINCOA, unlike BOBYQA, does not require all the points y i, i=1, 2,...,m, to be feasible, because, if the feasible region is very thin and pointed where x is optimal, and if every y i were in such a region, then the equations (3.2) would not provide good information about changes to F( ) when moving away from constraint boundaries, which is important when considering possible changes to the active set. On the other hand, the feasible region of BOBYQA does not have sharply pointed corners, because it is a rectangular box. Also the box is not very thin, as the lower and upper bounds on the variables of BOBYQA are required to satisfy the conditions u i l i + 2ρ beg, i=1, 2,...,n, (3.4) in order that the initial points y i, i=1, 2,...,m, can be feasible. The right hand sides of the equations (3.2) are calculated at the chosen initial points y i, i=1, 2,...,m, by every version of the software. The initial trust region centre x 1 is the initial point y i where F(y i ) is least subject to feasibility. The initial trust region radius is =ρ beg. The first quadratic model Q(x), x R n, is defined by minimizing 2 Q F subject to the conditions (3.2). Also the inverse of the (m+n+1) (m+n+1) matrix of the system (1.21) is required for the initial points y i, i=1, 2,...,m. Both the initial Q and the initial inverse matrix Ω are given cheaply by analytic formulae, except that an extension is needed by LINCOA if it shifts any of the initial points y i from their usual positions. In the extension, we ignore the analytic Q, and we begin with the analytic form of Ω when the initial points y i are in their original positions. Then formula (1.24) of Lecture 1 is applied to update Ω if y t moves from its original position to y +, say, all moves being included in this way sequentially. Finally, the parameters of the initial Q are defined by the system (1.21) of Lecture 1, when r has the components r i = F(y i ), i = 1, 2,...,m, the parameters being determined by multiplying the right hand side by the inverse matrix Ω. 18

21 3.3 Adjustment of the trust region radius The number ρ is a lower bound on the trust region radius that is set to ρ beg initially, and that is reduced occasionally by about a factor of ten, the final value being ρ end. The idea behind this strategy is that we want to keep the points y i, i = 1, 2,...,m, well apart for as long as possible, and the least distance between them is hardly ever less than 0.1ρ for the current ρ. Therefore ρ remains constant until a reduction seems to be necessary for further progress. Termination occurs instead of a further reduction in ρ if ρ has reached its final value ρ end. Changes to the trust region radius ρ may be made after a new point y + R n has been generated by the method of Lecture 2, that minimizes Q(x) approximately subject to x x k, and to any linear constraints, where x k is now the centre of the trust region of the current iteration. The new function value F(y + ) is calculated, and then the ratio {F(x k ) F(y + )}/{Q(x k ) Q(y + )} (3.5) indicates whether or not the reduction in Q( ) has been successful in providing a satisfactory reduction in F( ). The rules for revising are typical of trust region methods, except for the bound ρ. Indeed, may be increased if the ratio exceeds 0.7 with y + x k > 1, or the new value of may be set to 2 max [ρ, 1 2 y+ x k ] if the ratio is less than 0.2. Further, because we reduce ρ only if =ρ holds, any new value of in the interval (ρ, 1.5ρ) is replaced by ρ. An unfavourable value of the ratio (3.5), including the possibility F(y + ) > F(x k ), indicates that the current approximation Q( ) F( ) is inadequate within the trust region {x : x x k }, but this can happen when the value of is suitable. Bad approximations may be due to the interpolation conditions Q(y i ) = F(y i ), i=1, 2,...,m, (3.6) if some of the distances y i x k, i = 1, 2,...,m, are much larger than. On some iterations, therefore, the point y t that has the property y t x k = max { y i x k : i=1, 2,...,m} (3.7) is replaced by a new point y +, which is the situation mentioned at the end of Lecture 1, where t is chosen before y +. Then y + is calculated to provide a relatively large value of the modulus of the Lagrange function Λ t ( ) subject to y + x k, the Lagrange functions being introduced in the first section of the handout of Lecture 1. In this case we call y + x k a model step. Alternatively, we call y + x k a trust region step if it is generated by the method of Lecture 2. The rules that determine whether y + x k is a trust region step or a model step are addressed briefly in the next section. Attention is given too to deciding whether the calculations with the current value of ρ are complete. 19

22 3.4 Summary of an iteration At the beginning of the k-th iteration, the centre of the trust region x k is the point that has supplied the least calculated value of F( ) so far subject to any linear constraints. A quadratic model Q( ) and the points y i, i=1, 2,...,m, are given, the interpolation equations (3.2) being satisfied. The current inverse matrix Ω, the trust region radius and the lower bound ρ are available too. Also the decision has been taken already whether the iteration is going to begin by seeking a trust region step or a model step. The trust region step alternative is preferred on the first iteration and immediately after a decrease in ρ. One new value of F( ) is computed on each iteration, except that a new value may not be required on the final iteration. The method of Section 2 is applied when a trust region step has priority, which provides the new vector y +. Then, if the condition y + x k 1 2 (3.8) is satisfied, the new function value F(y + ) is calculated, followed by the updating of as described in the previous section. Further, we select the point y t to remove from the set {y i : i=1, 2,...,m}\{x k } in order to make room for y +, by seeking a relatively large value of the denominator σ t in formula (1.24) of Lecture 1. Thus the new inverse matrix Ω + is obtained, and then the model Q( ) is updated by the symmetric Broyden method. The next trust region centre is x k+1 = x k or x k+1 =y + in the case F(y + ) F(x k ) or F(y + )<F(x k ), respectively. Finally, the next iteration is told to try a model step instead of another trust region step if and only if the ratio (3.5) is less than a prescribed constant, for example 0.5. Condition (3.8) helps to keep the interpolation points apart. If it fails with >ρ, then is reduced to max[ρ, 1 ], with a further reduction to ρ if the new 2 value is in the interval (ρ, 1.5ρ). Then the iteration is restarted with the instruction to try to use a model step instead of a trust region step. A similar restart is arranged if condition (3.8) fails with =ρ, with one exception. The exceptional case is when at least three trust region steps have been calculated since work began with the current ρ, and when the lengths of the last three of those steps are all less than 1 ρ. Those short steps are due to a combination of substantial 2 positive curvature in Q( ) with quite small values of Q(x k ). Therefore the decision is taken that the iterations with the current value of ρ are complete. When the current iteration gives priority to a model step, we let t be an integer in [1,m] that has the property (3.7), as mentioned already, but we assume there is no need for a model step if the inequality y t x k 10 (3.9) holds. Otherwise, we recall that we pick y + by seeking a large value of Λ t (y + ) within the trust region. Further, the LINCOA software shifts y + if necessary so that, if it is infeasible, then it satisfies at least one of the conditions a T j y + b j a j, j =1, 2,...,m, (3.10) 20

23 this device being similar to our use of expression (3.3). The shift is so small that it is allowed to take y + outside the trust region. Then the new function value F(y + ) is calculated by all versions of the software, followed by the updating of both the inverse matrix Ω and the quadratic model Q( ). As before, the next trust region centre is x k+1 = x k or x k+1 = y + in the case F(y + ) F(x k ) or F(y + ) < F(x k ), respectively, except that LINCOA has to set x k+1 = x k if y + is infeasible. After taking a model step, the next iteration is told always to try a trust region step. When the opportunity to take a model step is rejected because inequality (3.9) holds, we assume that the model is adequate for the current trust region radius. In the case > ρ, it is suitable to restart the current iteration with priority being given to a trust region step, because then is reduced if the attempt at a trust region step is unsuccessful. Alternatively, in the case =ρ, we ask whether the most recent earlier attempt at a trust region step achieved a reduction in F( ), which would have shifted the trust region centre. If the answer is affirmative, then again we restart the current iteration with priority being given to a trust region step, because this has not happened yet with the new position of the trust region centre. Otherwise, because the model seems to be good, and because trust region steps are not making progress, we have another situation where we finish the iterations with the current value of ρ. After a reduction in ρ, which always occurs during an iteration that has not calculated a new value of F( ) yet, it is adequate to change to half the old value of ρ, which is typically about five times the new value of ρ. Then the iteration is restarted with priority being given to a trust region step. This device causes every iteration to obtain one new value of F( ), except on the final iteration, which finds that the work with ρ = ρ end is complete. Then y + may have been generated by the trust region procedure of Lecture 2, but the new function value F(y + ) will not have been computed if the length y + x k is less than 1 2 ρ end, in order to keep the interpolation points apart. The spacing between the points no longer matters, however, and y + is the best prediction of the optimal vector of variables. Therefore the calculation of F(y + ) in this setting is recommended. The software returns the vector of variables that provides the least calculated value of F( ) subject to any given linear constraints on the variables. 3.5 Some numerical results Two remarkable features, shown below, occur when NEWUOA is applied to the quartic polynomial F(x) = { n 1 j=1 (x 2 j +x 2 n) 2 4x j + 3 }, x R n. (3.11) It is called the Arrowhead function because 2 F(x) has nonzero elements only on its diagonal and in its last row and column. We pick the standard starting point x beg = e (the vector of ones), and we note that x is the optimal vector of 21

### The NEWUOA software for unconstrained optimization without derivatives 1. M.J.D. Powell

DAMTP 2004/NA05 The NEWUOA software for unconstrained optimization without derivatives 1 M.J.D. Powell Abstract: The NEWUOA software seeks the least value of a function F(x), x R n, when F(x) can be calculated

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

### Vector and Matrix Norms

Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

### Summary of week 8 (Lectures 22, 23 and 24)

WEEK 8 Summary of week 8 (Lectures 22, 23 and 24) This week we completed our discussion of Chapter 5 of [VST] Recall that if V and W are inner product spaces then a linear map T : V W is called an isometry

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### Diagonalisation. Chapter 3. Introduction. Eigenvalues and eigenvectors. Reading. Definitions

Chapter 3 Diagonalisation Eigenvalues and eigenvectors, diagonalisation of a matrix, orthogonal diagonalisation fo symmetric matrices Reading As in the previous chapter, there is no specific essential

### The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

### Introduction to Matrix Algebra

Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

### We call this set an n-dimensional parallelogram (with one vertex 0). We also refer to the vectors x 1,..., x n as the edges of P.

Volumes of parallelograms 1 Chapter 8 Volumes of parallelograms In the present short chapter we are going to discuss the elementary geometrical objects which we call parallelograms. These are going to

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### Some Notes on Taylor Polynomials and Taylor Series

Some Notes on Taylor Polynomials and Taylor Series Mark MacLean October 3, 27 UBC s courses MATH /8 and MATH introduce students to the ideas of Taylor polynomials and Taylor series in a fairly limited

### since by using a computer we are limited to the use of elementary arithmetic operations

> 4. Interpolation and Approximation Most functions cannot be evaluated exactly: x, e x, ln x, trigonometric functions since by using a computer we are limited to the use of elementary arithmetic operations

### Bindel, Spring 2012 Intro to Scientific Computing (CS 3220) Week 3: Wednesday, Feb 8

Spaces and bases Week 3: Wednesday, Feb 8 I have two favorite vector spaces 1 : R n and the space P d of polynomials of degree at most d. For R n, we have a canonical basis: R n = span{e 1, e 2,..., e

### UNIT 2 MATRICES - I 2.0 INTRODUCTION. Structure

UNIT 2 MATRICES - I Matrices - I Structure 2.0 Introduction 2.1 Objectives 2.2 Matrices 2.3 Operation on Matrices 2.4 Invertible Matrices 2.5 Systems of Linear Equations 2.6 Answers to Check Your Progress

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### NOTES on LINEAR ALGEBRA 1

School of Economics, Management and Statistics University of Bologna Academic Year 205/6 NOTES on LINEAR ALGEBRA for the students of Stats and Maths This is a modified version of the notes by Prof Laura

### ME128 Computer-Aided Mechanical Design Course Notes Introduction to Design Optimization

ME128 Computer-ided Mechanical Design Course Notes Introduction to Design Optimization 2. OPTIMIZTION Design optimization is rooted as a basic problem for design engineers. It is, of course, a rare situation

### 1 Solving LPs: The Simplex Algorithm of George Dantzig

Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

### Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Chapter 3 Linear Least Squares Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

### 7 Gaussian Elimination and LU Factorization

7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method

### Orthogonal Diagonalization of Symmetric Matrices

MATH10212 Linear Algebra Brief lecture notes 57 Gram Schmidt Process enables us to find an orthogonal basis of a subspace. Let u 1,..., u k be a basis of a subspace V of R n. We begin the process of finding

### Solving Sets of Equations. 150 B.C.E., 九章算術 Carl Friedrich Gauss,

Solving Sets of Equations 5 B.C.E., 九章算術 Carl Friedrich Gauss, 777-855 Gaussian-Jordan Elimination In Gauss-Jordan elimination, matrix is reduced to diagonal rather than triangular form Row combinations

### Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

### Section 4.4 Inner Product Spaces

Section 4.4 Inner Product Spaces In our discussion of vector spaces the specific nature of F as a field, other than the fact that it is a field, has played virtually no role. In this section we no longer

### Lecture 5 - Triangular Factorizations & Operation Counts

LU Factorization Lecture 5 - Triangular Factorizations & Operation Counts We have seen that the process of GE essentially factors a matrix A into LU Now we want to see how this factorization allows us

### Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

### Section 6.1 - Inner Products and Norms

Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,

### Constrained Least Squares

Constrained Least Squares Authors: G.H. Golub and C.F. Van Loan Chapter 12 in Matrix Computations, 3rd Edition, 1996, pp.580-587 CICN may05/1 Background The least squares problem: min Ax b 2 x Sometimes,

### Eigenvalues and eigenvectors of a matrix

Eigenvalues and eigenvectors of a matrix Definition: If A is an n n matrix and there exists a real number λ and a non-zero column vector V such that AV = λv then λ is called an eigenvalue of A and V is

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Linear Algebra Review. Vectors

Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

### MATH 240 Fall, Chapter 1: Linear Equations and Matrices

MATH 240 Fall, 2007 Chapter Summaries for Kolman / Hill, Elementary Linear Algebra, 9th Ed. written by Prof. J. Beachy Sections 1.1 1.5, 2.1 2.3, 4.2 4.9, 3.1 3.5, 5.3 5.5, 6.1 6.3, 6.5, 7.1 7.3 DEFINITIONS

### DETERMINANTS. b 2. x 2

DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### 17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function

17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function, : V V R, which is symmetric, that is u, v = v, u. bilinear, that is linear (in both factors):

### What is Linear Programming?

Chapter 1 What is Linear Programming? An optimization problem usually has three essential ingredients: a variable vector x consisting of a set of unknowns to be determined, an objective function of x to

### Notes on Determinant

ENGG2012B Advanced Engineering Mathematics Notes on Determinant Lecturer: Kenneth Shum Lecture 9-18/02/2013 The determinant of a system of linear equations determines whether the solution is unique, without

### Nonlinear Programming Methods.S2 Quadratic Programming

Nonlinear Programming Methods.S2 Quadratic Programming Operations Research Models and Methods Paul A. Jensen and Jonathan F. Bard A linearly constrained optimization problem with a quadratic objective

### Lecture 5 Principal Minors and the Hessian

Lecture 5 Principal Minors and the Hessian Eivind Eriksen BI Norwegian School of Management Department of Economics October 01, 2010 Eivind Eriksen (BI Dept of Economics) Lecture 5 Principal Minors and

### BANACH AND HILBERT SPACE REVIEW

BANACH AND HILBET SPACE EVIEW CHISTOPHE HEIL These notes will briefly review some basic concepts related to the theory of Banach and Hilbert spaces. We are not trying to give a complete development, but

### Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

(für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained

### 4. MATRICES Matrices

4. MATRICES 170 4. Matrices 4.1. Definitions. Definition 4.1.1. A matrix is a rectangular array of numbers. A matrix with m rows and n columns is said to have dimension m n and may be represented as follows:

### ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB. Sohail A. Dianat. Rochester Institute of Technology, New York, U.S.A. Eli S.

ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB Sohail A. Dianat Rochester Institute of Technology, New York, U.S.A. Eli S. Saber Rochester Institute of Technology, New York, U.S.A. (g) CRC Press Taylor

### a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

### Section Notes 3. The Simplex Algorithm. Applied Math 121. Week of February 14, 2011

Section Notes 3 The Simplex Algorithm Applied Math 121 Week of February 14, 2011 Goals for the week understand how to get from an LP to a simplex tableau. be familiar with reduced costs, optimal solutions,

### Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### 1 if 1 x 0 1 if 0 x 1

Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

### Linear Algebra I. Ronald van Luijk, 2012

Linear Algebra I Ronald van Luijk, 2012 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents 1. Vector spaces 3 1.1. Examples 3 1.2. Fields 4 1.3. The field of complex numbers. 6 1.4.

The Hadamard Product Elizabeth Million April 12, 2007 1 Introduction and Basic Results As inexperienced mathematicians we may have once thought that the natural definition for matrix multiplication would

### Mathematics of Cryptography

CHAPTER 2 Mathematics of Cryptography Part I: Modular Arithmetic, Congruence, and Matrices Objectives This chapter is intended to prepare the reader for the next few chapters in cryptography. The chapter

### Lecture Topic: Low-Rank Approximations

Lecture Topic: Low-Rank Approximations Low-Rank Approximations We have seen principal component analysis. The extraction of the first principle eigenvalue could be seen as an approximation of the original

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### Sec 4.1 Vector Spaces and Subspaces

Sec 4. Vector Spaces and Subspaces Motivation Let S be the set of all solutions to the differential equation y + y =. Let T be the set of all 2 3 matrices with real entries. These two sets share many common

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix

Linear Algebra & Properties of the Covariance Matrix October 3rd, 2012 Estimation of r and C Let rn 1, rn, t..., rn T be the historical return rates on the n th asset. rn 1 rṇ 2 r n =. r T n n = 1, 2,...,

### LINEAR ALGEBRA W W L CHEN

LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008 This chapter is available free to all individuals, on understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,

### 13 MATH FACTS 101. 2 a = 1. 7. The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

3 MATH FACTS 0 3 MATH FACTS 3. Vectors 3.. Definition We use the overhead arrow to denote a column vector, i.e., a linear segment with a direction. For example, in three-space, we write a vector in terms

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### CS3220 Lecture Notes: QR factorization and orthogonal transformations

CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss

### Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.

Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

### Metric Spaces. Chapter 7. 7.1. Metrics

Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

Quadratic Functions, Optimization, and Quadratic Forms Robert M. Freund February, 2004 2004 Massachusetts Institute of echnology. 1 2 1 Quadratic Optimization A quadratic optimization problem is an optimization

### By W.E. Diewert. July, Linear programming problems are important for a number of reasons:

APPLIED ECONOMICS By W.E. Diewert. July, 3. Chapter : Linear Programming. Introduction The theory of linear programming provides a good introduction to the study of constrained maximization (and minimization)

### Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Peter J. Olver 6. Eigenvalues and Singular Values In this section, we collect together the basic facts about eigenvalues and eigenvectors. From a geometrical viewpoint,

### Numerical Methods I Eigenvalue Problems

Numerical Methods I Eigenvalue Problems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001, Fall 2010 September 30th, 2010 A. Donev (Courant Institute)

### 4.3 Lagrange Approximation

206 CHAP. 4 INTERPOLATION AND POLYNOMIAL APPROXIMATION Lagrange Polynomial Approximation 4.3 Lagrange Approximation Interpolation means to estimate a missing function value by taking a weighted average

### Duality of linear conic problems

Duality of linear conic problems Alexander Shapiro and Arkadi Nemirovski Abstract It is well known that the optimal values of a linear programming problem and its dual are equal to each other if at least

### (a) The transpose of a lower triangular matrix is upper triangular, and the transpose of an upper triangular matrix is lower triangular.

Theorem.7.: (Properties of Triangular Matrices) (a) The transpose of a lower triangular matrix is upper triangular, and the transpose of an upper triangular matrix is lower triangular. (b) The product

### Iterative Methods for Computing Eigenvalues and Eigenvectors

The Waterloo Mathematics Review 9 Iterative Methods for Computing Eigenvalues and Eigenvectors Maysum Panju University of Waterloo mhpanju@math.uwaterloo.ca Abstract: We examine some numerical iterative

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### 1 VECTOR SPACES AND SUBSPACES

1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such

### Numerical Methods for Solving Systems of Nonlinear Equations

Numerical Methods for Solving Systems of Nonlinear Equations by Courtney Remani A project submitted to the Department of Mathematical Sciences in conformity with the requirements for Math 4301 Honour s

### University of Lille I PC first year list of exercises n 7. Review

University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients

### 1 Introduction. 2 Matrices: Definition. Matrix Algebra. Hervé Abdi Lynne J. Williams

In Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 00 Matrix Algebra Hervé Abdi Lynne J. Williams Introduction Sylvester developed the modern concept of matrices in the 9th

### Diagonal, Symmetric and Triangular Matrices

Contents 1 Diagonal, Symmetric Triangular Matrices 2 Diagonal Matrices 2.1 Products, Powers Inverses of Diagonal Matrices 2.1.1 Theorem (Powers of Matrices) 2.2 Multiplying Matrices on the Left Right by

CALIFORNIA INSTITUTE OF TECHNOLOGY Division of the Humanities and Social Sciences More than you wanted to know about quadratic forms KC Border Contents 1 Quadratic forms 1 1.1 Quadratic forms on the unit

### Inner product. Definition of inner product

Math 20F Linear Algebra Lecture 25 1 Inner product Review: Definition of inner product. Slide 1 Norm and distance. Orthogonal vectors. Orthogonal complement. Orthogonal basis. Definition of inner product

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform

MATH 433/533, Fourier Analysis Section 11, The Discrete Fourier Transform Now, instead of considering functions defined on a continuous domain, like the interval [, 1) or the whole real line R, we wish

### Tensors on a vector space

APPENDIX B Tensors on a vector space In this Appendix, we gather mathematical definitions and results pertaining to tensors. The purpose is mostly to introduce the modern, geometrical view on tensors,

### NUMERICAL METHODS C. Carl Gustav Jacob Jacobi 10.1 GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING

0. Gaussian Elimination with Partial Pivoting 0.2 Iterative Methods for Solving Linear Systems 0.3 Power Method for Approximating Eigenvalues 0.4 Applications of Numerical Methods Carl Gustav Jacob Jacobi

### 1 Gaussian Elimination

Contents 1 Gaussian Elimination 1.1 Elementary Row Operations 1.2 Some matrices whose associated system of equations are easy to solve 1.3 Gaussian Elimination 1.4 Gauss-Jordan reduction and the Reduced

### Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

### Matrix Algebra and Applications

Matrix Algebra and Applications Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Matrix Algebra and Applications 1 / 49 EC2040 Topic 2 - Matrices and Matrix Algebra Reading 1 Chapters

### Linear Programming in Matrix Form

Linear Programming in Matrix Form Appendix B We first introduce matrix concepts in linear programming by developing a variation of the simplex method called the revised simplex method. This algorithm,

### MATH 551 - APPLIED MATRIX THEORY

MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

### α = u v. In other words, Orthogonal Projection

Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

### 3 Orthogonal Vectors and Matrices

3 Orthogonal Vectors and Matrices The linear algebra portion of this course focuses on three matrix factorizations: QR factorization, singular valued decomposition (SVD), and LU factorization The first

### MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets.

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets. Norm The notion of norm generalizes the notion of length of a vector in R n. Definition. Let V be a vector space. A function α

### We shall turn our attention to solving linear systems of equations. Ax = b

59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

### 1 The Brownian bridge construction

The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

### Linear Programming Problems

Linear Programming Problems Linear programming problems come up in many applications. In a linear programming problem, we have a function, called the objective function, which depends linearly on a number

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization