A Perfect Example for The BFGS Method


 Catherine Webster
 1 years ago
 Views:
Transcription
1 A Perfect Example for The BFGS Method YuHong Dai Abstract Consider the BFGS quasinewton method applied to a general nonconvex function that has continuous second derivatives. This paper aims to construct a fourdimensional example such that the BFGS method need not converge. The example is perfect in the following sense: (a All the stepsizes are exactly equal to one; the unit stepsize can also be accepted by various line searches including the Wolfe line search and the Arjimo line search; (b The objective function is strongly convex along each search direction although it is not in itself. The unit stepsize is the unique minimizer of each line search function. Hence the example also applies to the global line search and the line search that always picks the first local minimizer; (c The objective function is polynomial and hence is infinitely continuously differentiable. If relaxing the convexity requirement of the line search function; namely, (b, we are able to construct a relatively simple polynomial example. Keywords. unconstrained optimization, quasinewton method, nonconvex function, global convergence. 1 Introduction Consider the unconstrained optimization problem min f(x, x R n, (1.1 where f is a general nonconvex function that has continuous second derivatives. The quasinewton method is a class of wellknown and efficient methods for solving (1.1. It is of the iterative scheme x k+1 = x k α k H k g k, (1. This work was partially supported by the Chinese NSF grants and and the CAS grant kjcxyws703. State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, P.O. Box 719, Beijing , P.R. China. 1
2 where x 1 is a starting point, H k is some approximation to the inverse Hessian [ f(x k ] 1, g k = f(x k, and α k is a stepsize obtained in some way. Defining the vectors δ k = x k+1 x k, γ k = g k+1 g k, the quasinewton method asks the next approximation matrix H k+1 to satisfy the secant equation H k+1 γ k = δ k. (1.3 This similarity to the identical equation [ f(x k+1 ] 1 γ k = δ k in the quadratic case enables the quasinewton method to be superlinearly convergent (see Dennis and Moré [5], for example and makes it very attractive in practical optimization. The first quasinewton method was dated back to Davidon [4] and Fletcher and Powell [9]. The DFP method updates the approximation matrix H k to H k+1 by the formula H k+1 = H k H kγ k γ T k H k γ T k H kγ k + δ kδ T k δ T k γ k. (1.4 Nowadays, the most efficient quasinewton method is perhaps the BFGS method, which was proposed by Broyden [], Fletcher [7], Goldfarb [10], and Shanno [4], independently. The matrix H k+1 in the BFGS method can be updated by the way H k+1 = H k δ kγ T k H k + H k γ k δ T ( k δ T γt k H kγ k δk δ T k k γ k δ T k γ k δ T. (1.5 k γ k To pass the positive definiteness of the matrix H k to H k+1, practical quasi Newton algorithms make use of the Wolfe line search to calculate the stepsize α k, which ensures a positive curvature to be found at each iteration; namely, δ T k γ k > 0. More exactly, defining the onedimensional line search function Ψ k (α = f(x k αh k g k, α 0, (1.6 the Wolfe line search consists in finding a stepsize α k such that and Ψ k (α k Ψ k (0 + µ Ψ k(0 α k (1.7 Ψ k(α k η Ψ k(0, (1.8 where µ and η are constants with 0 < µ η < 1. There have been a large quantity of research works devoting to the global convergence of the quasinewton method (see Yuan [8] and Sun and Yuan [6], for example. Specifically, Powell [18] showed that the DFP method with exact line searches is globally convergent for uniformly convex functions. In another paper [0], Powell established the global convergence of the BFGS method with the Wolfe line search for uniformly convex functions. This result was extended
3 by Byrd, Nocedal and Yuan [1] to the whole Broyden s convex family of methods except the DFP method. Therefore it is natural to ask the following question: Does the DFP method with the Wolfe line search converge globally for uniformly convex functions? On the other hand, since the BFGS method with the Wolfe line search works well for both convex and nonconvex functions, we might ask another question: Does the BFGS method with the Wolfe line search converge globally for general functions? The difficulty and importance of the two convergence problems has been addressed in many situations, including Nocedal [15], Fletcher [8], and Yuan [8]. Recent studies provide a negative answer to the convergence problem of the BFGS method for nonconvex functions. As a matter of fact, in an early paper [1], which analyzes the convergence properties of the conjugate gradient method, Powell mentioned that the BFGS method need not converge if the line search can pick any local minimizer of the line search function Ψ k (α. After further studies on the twodimensional example in [1], Dai [3] presented an example with six cycling points and showed by the example that the BFGS method with the Wolfe line search may fail for nonconvex functions. Later, Mascarenhas [14] constructed a threedimensional counterexample such that the BFGS method does not converge if the line search picks the global minimizer of the function Ψ k (α. It should be noted that the stepsize in the counterexample of [14] also satisfies the Wolfe line search conditions. However, neither examples are such that the stepsize is the first local minimizer of the line search function Ψ k (α. Surprisingly enough, if there are only two variables, and if the stepsize is chosen to be the first local minimizer of Ψ k (α; namely, α k = arg min {α > 0 : α is a local minimizer of Ψ k (α}, (1.9 Powell [] established the global convergence of the BFGS method for general twice continuously differentiable function. Powell s proof makes use of the principle of contradiction and is quite sophisticated. Assuming the nonconvergence of the method and the relation lim inf g k 0, Powell showed that the limit points of the BFGS path k P = {x : x lies on the line segment connecting x k and x k+1 for some k 1} (that is exactly the same as the PolakRibière conjugate gradient path if n = and gk+1 T δ k = 0 for all k forms some line segment L. The use of the specific line search (1.9 is such that the objective function is monotonically decreasing on the line segment L k that connects x k and x k+1. It follows that f(x lim f(x k for all x L. Consequently, the line segment L k cannot cross L k and all the points x k with large indices lies in the same side of L. Furthermore, Powell showed that starting around from one end of the line segment L, the BFGS path can not get any close to the other end of L and finally obtained a contradiction. Intuitively, it is difficult to extend Powell s result to the case when n 3. This is because, even if it has been shown that the BFGS path 3
4 tends to some limit line segment L, each line segment L k could turn around L providing that n 3, making the similar analysis of the BFGS path complicated and nearly impossible. The purpose of this paper is to construct a fourdimensional counterexample for the BFGS method with the following features: (a All the stepsizes in the example are exactly equal to one; namely, α k 1; (1.10 For the search direction defined by the BFGS update in the example, a unit stepsize satisfies various line search conditions, including the Wolfe conditions and the Armijo conditions; (b The objective function is strongly convex along each search direction; namely, the line search function Ψ k (α is strongly convex for all k, although the objective function is not in itself. The unit stepsize is the unique minimizer of Ψ k (α. Hence the example also applies to the global line search and the specific line search (1.9; (c The objective function is polynomial and hence is infinitely continuously differentiable. On the other hand, the objective function in our example is linear in the third and fourth variables and thus has no local minimizer. However, the iterations generated by the BFGS method tend to a nonstationary point. As seen from 3.4, the construction of the example is quite complicated. To be such that the line search function Ψ k (α is strongly convex, a polynomial of degree 38 is introduced. Nevertheless, if we relax the convexity requirement of Ψ k (α; namely, (b in the above, it is possible to construct a relatively simple polynomial example of low degree (see 3.3. The construction of our examples can be divided into four procedures: (1 Prefix the special forms of the steps {δ k } and gradients {g k }, leaving several parameters to be determined later. In this case, once x 1 is given, the whole sequence {x k } is fixed. As seen in.1, our steps {δ k } and gradients {g k } are asked to possess some symmetrical and cyclic properties and to push the iterations {x k } tend to the eight vertices of a regular octagon. This is very helpful in simplifying the construction of the examples and we can focus our attention on the choice of the parameter t, that answers for the decay of the last two components of δ k and the first two components of g k. ( To enable the BFGS method to generate those prefixed steps, investigate the consistency conditions on the steps {δ k } and gradients {g k }. With the prefixed forms of {δ k } and {g k }, we show in. that the unit stepsize is indispensable, whereas this stepsize is usually used as the first trial stepsize in the implementation of quasinewton methods. Four more consistency conditions on {δ k } and {g k } are obtained in.3 by expressing the vectors H k+1 γ k, H k+1 g k+1, H k+1 g k+ and H k+1 g k+3 by some δ k s and g k s instead of considering the quasinewton 4
5 matrix H k+1 itself. (3 Choose the parameters in some way to satisfy the consistency conditions and other necessary conditions on the objective function. As seen in.4, the first three consistency conditions are actually corresponding to some underdetermined linear system. Substituting its general solution by the Cramer s rule into the fourth consistency condition, we are led to a nonlinear equation from which the exact value can be obtained for the decay parameter t. By suitably choosing the other parameters, we can express all the quasinewton updating matrices including the initial choice for H 1 in.5. (4 Construct a suitable objective function f whose gradients are the preassigned values; namely, f(x k = g k for all k 1 and the line search has the desired properties. This will be done in the whole third section. To make full use of symmetrical and cyclic properties of the steps {δ k } and {g k }, we carefully choose a special form of the objective function. In addition, the introduction of element function φ in 3.4 helps us greatly to convexify the line search function and finally complete the perfect example that meets all the requirements (a, (b and (c. Some concluding remarks are given in the last section. Looking for Consistent Steps and Gradients.1 The Forms of Steps and Gradients Consider the case of four dimension. Inspired by [1], [3] and [14], we assume that the steps {δ k } have the following form: δ 1 = (η 1, ξ 1, γ 1, τ 1 T ; δ k+1 = M δ k (k 1, (.1 where M is the 4 4 matrix defined by ( cos θ1 sin θ 1 M = sin θ 1 cos θ 1 0 t 0 ( cos θ sin θ sin θ cos θ. (. In the above, t is a parameter satisfying 0 < t < 1. This parameter answers for the decay of the last two components of the steps {δ k }. The angles θ 1 and θ are chosen so that θ 1 = π m 1 and θ = π m for some positive integers m 1 and m. Specifically, we choose in this paper θ 1 = 1 4 π, θ = 3 π. (.3 4 Consequently, the first two components of the steps {δ k } turn to the same after every eight iterations, and the last two components will shrink at a factor of t 8 after every eight iterations. Further, we can see that the iterations {x k } will tend to turn around the eight vertices of some regular octagon (see 3.1 for more details. Accordingly, the gradients {g k } are assumed to be of the form g 1 = (l 1, h 1, c 1, d 1 T ; g k+1 = P g k (k 1, (.4 5
6 where P = ( cos θ1 sin θ t 1 sin θ 1 cos θ 1 0 ( cos θ 0 sin θ. (.5 sin θ cos θ Since 0 < t < 1, we see that the first two components of {g k } vanish, whereas the last two components of the gradient {g k }, turn to the same after eight iterations. It follows that none of the cluster points generated by the BFGS method are stationary points.. Unit Stepsizes It is well known that the unit stepsize is usually used as the first trial stepsize in practical implementations of the BFGS method. Under suitable assumptions on f, the unit stepsize will be accepted by the line search as the iterates tend to the solution and will enable a superlinear convergence step (see [5], for example. In the following, we are going to show that if the line search satisfies g T k+1δ k = 0, for all k 1, (.6 and if the steps generated by the BFGS method have the form (.1(. and the gradients have the form (.4(.5, then the use of unit stepsizes is also indispensable. To begin with, we see that the line search condition (.6, the secant equation (1.3 and the definition of the search direction indicate that H k+1 g k+1 = α 1 k+1 δ k+1 (.7 δ T k+1γ k = α k+1 g T k+1h k+1 γ k = α k+1 g T k+1δ k = 0. (.8 The above relation (.8 is sometimes called as the conjugacy condition in the context of nonlinear conjugate gradient methods. By multiplying the BFGS updating formula (1.5 with g k+1 and using (.6, which with (.7 gives H k+1 g k+1 = H k g k+1 γt k H kg k+1 δ T k γ k δ k, H k g k+1 = α 1 k+1 δ k+1 + γt k H kg k+1 δ T k γ k δ k. (.9 Multiplying (.9 by g T k and noticing gt k H kg k+1 = 0, we get that 0 = α 1 k+1 gt k δ k+1 + γt k H kg k+1 δ T gk T δ k = α 1 k+1 k γ gt k δ k+1 γ T k H k g k+1. k 6
7 Thus γ T k H kg k+1 = α 1 k+1 gt k δ k+1 = α 1 k+1 gt k+1 δ k+1. Hence, by (.9, H k g k+1 = α 1 k+1 δ k+1 α 1 gk+1 T δ k+1 k+1 δ T δ k. (.10 k γ k It follows from (.10 and (.7 with k replaced by k 1 that [ H k γ k = α 1 k+1 δ k+1 + α 1 k α 1 gk+1 T δ ] k+1 k+1 δ T δ k. (.11 k γ k Further, substituting this into the BFGS updating formula (1.5 yields [ H k+1 = H k + α 1 δ k δ T k+1 + δ k+1 δ T k k+1 δ T + 1 α 1 k + α 1 gk+1 T δ ] k+1 k+1 k γ k δ T k γ k δ k δ T k δ T k γ k. (.1 The above new updating formula requires the quantity δ k+1, that depends on H k+1 itself, and hence has theoretical meanings only. Lemma.1. Assume that (.6 holds. Then for all k 1 and i 0, the vector H k g k+i + α 1 k+i δ k+i belongs to the subspace spanned by δ k, δ k+1,..., δ k+i 1 ; namely, H k g k+i + α 1 k+i δ k+i Span{δ k, δ k+1,..., δ k+i 1 }. (.13 Proof. For convenience, we write (.1 as H k+1 = H k + V (s k, s k+1, (.14 where V (s k, s k+1 means the ranktwo matrix in the right hand of (.1. Therefore we have for all i 1, H k+i = H k + i V (s k+j 1, s k+j. (.15 j=1 The statement follows by multiplying the above relation with g k+i and using (.7 with k replaced by k + i 1 and g T k+i δ k+i 1 = 0. Q.E.D. Lemma.. To construct a desired example with Det (S 1 0, we must have that α k = 1, for all k 4. Proof. Define the following matrices G k = [ γ k 1 g k g k+1 g k+ ], S k = [ δ k 1 δ k δ k+1 δ k+ ]. 7
8 By H k γ k 1 = δ k 1 and Lemma.1, we can get that Det(H k Det(G k = Det [ ] H k γ k 1 H k g k H k g k+1 H k g k+ = Det [ δ k 1 α 1 k δ k α 1 k+1 δ k+1 α k+ δ ] k+ = α 1 k α 1 k+1 α 1 k+ Det(S k. (.16 Replacing k with k + 1 in the above yields Det(H k+1 Det(G k+1 = α 1 k+1 α 1 k+ α 1 k+3 Det(S k+1. (.17 On the other hand, due to the special forms of {g k } and {δ k }, we know that G k+1 = P G k and S k+1 = MS k. Hence Det(G k+1 = Det(P Det(G k = t Det(G k, Det(S k+1 = Det(M Det(S k = t Det(S k. (.18 Due to the basic determinant relation of the BFGS update, (.6 and (.7, it is not difficult to see that Det(H k+1 = δt k H 1 k δ k δ T Det(H k = α k Det(H k. (.19 k γ k If Det(S 1 0, (.16 with k = 1 implies that Det(G 1 0. Then by (.18, Det(G k 0, Det(S k 0, for all k 1. (.0 Dividing (.17 by (.16 and using the above relations, we then obtain α k+3 = 1. So the statement holds due to the arbitrariness of k 1. Q.E.D. The deletion of the first finite iterations does not influence the whole example. Thus we will ask our counterexample to satisfy and Det(S 1 0 (.1 α k 1. (. In this case, the updating formula (.1 of H k can be simplified as H k+1 = H k + δ kδ T k+1 + δ k+1 δ T k δ T gt k+1 δ k+1 k γ k gk T δ k δ k δ T k δ T k γ k. (.3 8
9 .3 Consistency Conditions Now we ask what else conditions, besides the requirement of unit stepsizes, have to be satisfied by the parameters in the definitions of {g k } and {δ k } so that the steps {s k ; k 1} can be generated by the BFGS update. Our early idea is based on the observation on the updating formula (1.5 that H k+1 is linear with H k and the linear system H 8j+9 = diag(t 4 E, t 4 E H 8j+1 diag(t 4 E, t 4 E, (.4 where E is the twodimensional identity matrix. However, this way seems to be quite complicated. Notice that the dimension is n = 4 and by Lemma., the assumption (.1 implies (.0. Then the matrix H k+1 can be uniquely defined by the equations given by H k+1 γ k, H k+1 g k+1, H k+1 g k+ and H k+1 g k+3. As a matter of fact, we have that H k+1 γ k = δ k (the secant equation, (.5 H k+1 g k+1 = δ k+1 ( by (.7 and α k 1, (.6 H k+1 g k+ = δ k+ + gt k+ δ k+ gk+1 T δ δ k+1 (by (.10 and α k 1, (.7 k+1 ( g T H k+1 g k+3 = δ k+3 + k+3 δ k+3 gk+ T δ + gt k+3 δ k+1 k+ gk+1 T δ δ k+ k+1 ( ( g T k+ δ k+ g T k+3 δ k+1 gk+1 T δ k+1 gk+1 T δ δ k+1. (.8 k+1 The last equality is obtained by multiplying (.3 by g k+, using (.7 and finally replacing k with k + 1. Relations (.5(.8 provide a system of 16 equations, while the symmetric matrix H k+1 only has 10 independent entries. How to ensure that this linear system has a symmetric solution H k+1? We have the following general lemma. Lemma.3. Assume that {u 1, u,..., u n } and {v 1, v,..., v n } are two sets of ndimensional linearly independent vectors. Then there exists a symmetric matrix H R n n satisfying Hu i = v i, i = 1,,..., n (.9 if and only if u T i v j = u T j v i, i, j = 1,,..., n. (.30 Proof. The only if part. If H = H T satisfies (.9, we have for all i, j = 1,,..., n, u T i v j = u T i Hu j = u T i HT u j = (Hu i T u j = v T i u j. The if part. Assume that (.30 holds. Defining the matrices U = (u 1 u... u n, V = (v 1 v... v n, (.31 9
10 direct calculations show that U T HU = U T V = u T 1 v 1 u T 1 v u T 1 v n u T v 1 u T v u T v n u T n v 1 u T n v u T n v n := A. (.3 By (.30, A is symmetric. So H = U T AU 1 satisfies H = H T and (.9. This completes our proof. Q.E.D. By the above lemma, the following six conditions are sufficient for the linear system (.5(.8 to allow a symmetric solution matrix H k+1. g T k+1 (H k+1y k = (H k+1 g k+1 T y k, g T k+ (H k+1y k = (H k+1 g k+ T y k, g T k+3 (H k+1y k = (H k+1 g k+3 T y k, g T k+ (H k+1g k+1 = (H k+1 g k+ T g k+1, g T k+3 (H k+1g k+1 = (H k+1 g k+3 T g k+1, g T k+3 (H k+1g k+ = (H k+1 g k+3 T g k+. Further, considering the whole sequence of {H k+1 ; k 1} and combining the line search condition g T k+1 δ k = 0, we can deduce the following four consistency conditions, which should be satisfied for all k 1, g T k+1δ k = 0, (.33 δ T k+1y k = 0, (.34 gk+δ T k = δ T k+y k, (.35 ( g gk+3δ T k = δ T T k+3y k + k+3 δ k+3 gk+ T δ + gt k+3 δ k+1 k+ gk+1 T δ δ T k+y k. (.36 k+1 The positive definiteness of the matrix H k+1 will be further considered in.5..4 Choosing The Parameters In this subsection we focus on how to choose the decay parameter t and suitable vectors δ 1 and g 1 such that the consistency conditions (.33, (.34, (.35 and (.36 hold with k = 1. Due to the special structure of this example, we then know that the consistency conditions hold for all k. Define v = ( T, where 1 = l 1 η 1 + h 1 ξ 1, = h 1 η 1 l 1 ξ 1, 3 = c 1 γ 1 + d 1 τ 1, 4 = d 1 γ 1 c 1 τ 1. (.37 As will be seen, the first three consistent conditions provide an underdetermined linear system with v, from which we can get a general solution by the Cramer s rule. The substitution of v into the fourth condition (.36, which is nonlinear, yields a desired value for the decay parameter t. 10
11 At first, the condition δ T 1 g = δ T 1 P g 1 = 0, that is (.33 with k = 1, asks [t cos θ 1 t sin θ 1 cos θ sin θ ] v = 0. (.38 The condition δ T γ 1 = δ T 1 M T (P Ig 1 = δ T 1 (ti M T g 1 = 0, that is (.34 with k = 1, requires the vector v to satisfy [t cos θ 1 sin θ 1 t(1 cos θ t sin θ ] v = 0. (.39 The requirement (.35 with k = 1, that is, 0 = δ T 1 g 3 + δ T 3 γ 1 = δ T 1 P g 1 + δ T 1 (M T (P Ig 1 = δ T 1 [P + tm T (M T ]g 1, yields the equation [ t cos θ1 + (t 1 cos θ 1 t sin θ 1 (1 + t sin θ 1 t (cos θ cos θ + cos θ t (sin θ sin θ sin θ ] v = 0. (.40 By (.38, (.39, (.40 and the choices of θ 1 and θ, we see that { i } must satisfy t t t (1 + t t t t (1 + t t ( + 1t By the Cramer s rule, to meet (.41, we may choose { i } as follows: 1 = ± [ ( + t 3 + (1 + t + 1 ], = ± [ ( + t 3 + (1 + t 1 ], 3 = ± [ t 3 + t + (1 t + 1 ], 4 = ± [ t 3 t + (1 + t 1 ]. = 0. (.41 (.4 Since g1 T δ 1 = = ±(t + 1(t + 1 and 0 < t < 1, we choose all the signs in (.4 as so that g1 T δ 1 < 0. We now want to substitute the general solution to the fourth consistent condition (.36. Noting that δ T 3 γ 1 = g3 T δ 1 and g4 T δ 4 = t g3 T δ 3, we know that (.36 with k = 1 is equivalent to [ ] g4 T δ 1 + g T δ 4 g1 T δ 4 + g3 T δ 1 t + gt 4 δ g T δ = 0. (.43 Further, using g T δ = t g T 1 δ 1, g T 4 δ = t g T 3 δ 1 and g T δ 4 = t g T 1 δ 3, (.43 can be simplified as g T 1 δ 1 [ g T 4 δ 1 g T 1 δ 4 + t(g T 3 δ 1 + g T 1 δ 3 ] + (g T 3 δ 1 = 0. (.44 11
12 Consequently, we obtain the following equation for the parameter t: where (t + 1 (t + 1 [ p (t + tp(t q(t ] = 0, (.45 p(t = ( + t + ( + t 1, q(t = ( + 3 t 3 (4 + 3 t + (3 + t. Further calculations provide ( [ p (t + tp(t q(t ] = [ t + (1 t + (1 ] [t + (1 3 t + ( ]. Considering the requirement that t ( 1, 1, one can deduce the following quadratic equation from (.45, ( t ( t = 0. (.46 The above equation has a unique root of t in the interval ( 1, 1, t = (.47 The numerical value of t is equal to approximately. Therefore if we choose the above t and the vectors δ 1 and g 1 such that (.4 holds with minus sign, then the prefixed steps and gradients satisfy the consistency conditions (.33, (.34, (.35 and (.36 for all k 1. There are many ways to choose δ 1 and g 1 satisfying (.4. Specifically, we choose ( η1 = ξ 1 ( 0, ( c1 = d 1 ( 0. (.48 In this case, by (.4 with minus signs and the definitions of i s, we can get that ( ( γ1 (17 8 t + ( = τ 1 ( t + (17 9, ( ( (.49 l1 ( 4 13 t + ( = h 1 ( 4 13 t + ( The initial step δ 1 and the initial gradient g 1 are then determined..5 QuasiNewton Updating Matrices In this subsection we discuss the form of the BFGS quasinewton updating matrices implied by (.5(.8 under the consistent conditions (.33(.36. A byproduct is that, we will know how to choose the initial matrix H 1 for the example. For this aim, we provide a followup lemma of Lemma.3. 1
13 Lemma.4. Assume that (.30 holds and the matrix H satisfies (.9. If further, the matrix A in (.3 is positive definite, there must exist nonsingular triangular matrices T 1 and T such that A = V T U = T T 1 T. (.50 Further, denoting ˆV = V T 1 = (ˆv 1, ˆv,..., ˆv n and the diagonal matrix T 1 1 T 1 = diag(t 1, t,..., t n, we have that H = n t i ˆv iˆv i T. (.51 i=1 Proof. The truth of (.50 is obvious by the Cholesky factorization of positive definite matrices. Further, by (.50 and the definition of ˆV, we get that Since HU = V, we obtain H = V U 1 = ( ( ˆV T 1 1 ˆV T U = T. T 1 ˆV T = ˆV ( T1 1 T 1 ˆV T, which with the definitions of ˆv i s and t i s implies the truth of (.51. Q.E.D. By the above lemma, we can express the form of H k+1 determined by (.5 (.6 under the consistent conditions (.33(.36. At first, we make use of the steps {δ k+i ; i = 0, 1,, 3} to introduce the following two vectors that are orthogonal to γ k and g k+1 : z k = δt k+γ k δ T k γ k δ k δt k+g k+1 δ T k+1g k+1 δ k+1 + δ k+, w k = δt k+3γ k δ T k γ k δ k δt k+3g k+1 δ T k+1g k+1 δ k+1 + δ k+3. (.5 The vectors z k and w k are well defined because δ T k γ k = δ T k g k > 0 is positive due to the descent property of δ k. Direct calculations show that z T k γ k = z T k g k+1 = 0, w T k γ k = w T k g k+1 = 0. (.53 Further, we use z k and w k to define the vector that is orthogonal to g k+ : By the choice of v k, it is easy to see that v k = wt k g k+ z T k g k+ z k + w k. (.54 v T k γ k = v T k g k+1 = v T k g k+ = 0. (.55 13
14 Now we could express the matrix H k+1 by H k+1 = δ kδ T k δ T k g k δ k+1δ T k+1 δ T k+1g k+1 z kz T k z T k g k+ v kv T k v T k g k+3. (.56 Since δ T k g k < 0 for all k 1, we know that the matrix H k+1 is positive definite if and only if z T k g k+ < 0 and v T k g k+3 < 0. Direct calculations show that z T 1 g 3 = ( t ( < 0, v T 1 g 4 = ( t + ( < 0. Therefore we know that the matrix H is positive definite. In addition, it is difficult to build the relations δ T k+1g k+1 = t δ T k g k, z T k+1 g k+3 = t z T k g k+, v T k+1 g k+4 = t v T k g k+3, z k+1 = Mz k and v k+1 = M v k for all k 1. Thus we have by (.56 and δ k+1 = M δ k that H k+1 = 1 t MH km T, (.57 holds for all k and hence {H k : k } are all positive definite. With the above procedure, we can calculate the values of z 1, v 1 and H. Further, by (.3, we can obtain the initial matrix H 1 required by the example of this paper. H 1 = H δ 1δ T + δ δ T 1 δ T 1 γ 1 + gt δ g T 1 δ 1 δ 1 δ T 1 δ T 1 γ 1 := H , (.58 where H 1 is a symmetric matrix with entries H 1 (1, 1 = ( t + ( , H 1 (1, = ( t ( , H 1 (1, 3 = ( t ( , H 1 (1, 4 = ( t + ( H 1 (, = ( t + ( , H 1 (, 3 = 1068 t ( , H 1 (, 4 = ( t ( , H 1 (3, 3 = ( t + ( , H 1 (3, 4 = ( t + ( , H 1 (4, 4 = ( t + ( (.59 Direct calculations show that (.57 also holds with k = 1 and hence H 1 is a positive definite matrix. 14
15 3 Constructing A Suitable Objective Function 3.1 Recovering The Iterations and The Function Values Denote the rotation matrices ( cos θ1 sin θ R 1 = 1, R sin θ 1 cos θ = 1 ( cos θ sin θ sin θ cos θ. (3.1 If we write the steps {δ k } and gradients {g k } into the forms η i t 8j l i δ 8j+i = ξ i t 8j γ i, g 8j+i = t 8j h i c i ; i = 1,..., 8, (3. t 8j τ i d i we have from (.1, (., (.4 and (.5 that ( ( ηi+1 ηi = R ξ 1, i+1 ξ i ( ( li+1 li = t R h 1, i+1 h i ( γi+1 τ i+1 ( γi = t R, τ i ( ( ci+1 ci = R d. i+1 d i (3.3 For simplicity, we want the iterations {x k } asymptotically to turn around the eight vertices of some regular octagon Ω of the subspace spanned by the first and second coordinates. More exactly, such a regular octagon Ω has the origin as its center and its eight vertices V i are given by V i = (a i, b i, 0, 0 T with ( = b 1 1, ( a1 ( ai+1 b i+1 ( ai = R 1. (3.4 b i Here we should note that if there is no confusion, we also regard that Ω and V i s are defined in the subspace spanned by the first and second coordinates. To this aim, we ask the iterations {x k } to be of the form a i x 8j+i = b i t 8j p i, (3.5 t 8j q i where ( pi+1 q i+1 To decide the values of p 1 and q 1, noting that x 1 V 1 = x 1 lim j x 8j+1 = ( pi = t R. (3.6 q i 15 δ i, i=0
16 we have that ( p1 q 1 = i=0 ( γi τ i ( = (E tr 1 γ1, (3.7 τ 1 where again E is the twodimensional identity matrix. Direct calculations show that ( p1 = 1 ( ( t + ( q ( t + ( (3.8 Now, let us assume that the limit of f(x k is f. Since for any j, the first two coordinates of {x 8j+i } always keep the same as those of its limit V i, we can get that ( T ( f(x 8j+i f = t 8j ci pi d i q i where = t 8j [ R i 1 ( T ( = t 8j+i 1 c1 p1 d 1 q 1 = t 8j+i 1 (f(x 1 f, ( c1 d 1 ] T [ (tr i 1 ( p1 q 1 ] f(x 1 f = c 1 p 1 + d 1 q 1 = ( t + ( Now we see the value of g T 8j+i δ 8j+i. Direct calculations show that t 8j T T l i η i l i η i g8j+i T δ 8j+i = t 8j h i ξ i c i t 8j γ i = t8j h i ξ i c i γ i d i t 8j τ i d i τ i = t 8j+i 1 (l 1 η 1 + h 1 ξ 1 + c 1 γ 1 + d 1 τ 1 = t 8j+i 1 g1 T δ 1, where g T 1 δ 1 = l 1 η 1 + h 1 ξ 1 + c 1 γ 1 + d 1 τ 1 = ( t + ( Therefore f(x 8j+i+1 f(x 8j+i α 8j+i g T 8j+iδ 8j+i = (f(x 8j+i+1 f (f(x 8j+i f g T 8j+iδ 8j+i = (t8j+i t 8j+i 1 (f(x 1 f t 8j+i 1 g T 1 δ 1 = (t 1(f(x 1 f g T 1 δ E 0. (3.9 16
17 The above relation, together with g T 8j+i+1 δ 8j+i = 0, implies that the stepsize α 8j+i = 1 can be accepted by the Wolfe line search, the Armijo line search and the Goldstein line search with suitable line search parameters. 3. Seeking A Suitable Form for The Objective Function We now consider how to construct a smooth function f such that its gradient at any point x 8j+i given in (3.5 are the one given in (3.; namely, f(x 8j+i = g 8j+i, for all j 0 and i = 1,..., 8. (3.10 To this aim, we assume that f is of the form f(x 1, x, x 3, x 4 = λ(x 1, x x 3 + µ(x 1, x x 4, (3.11 where λ and µ are twodimensional functions to be determined. Since the function (3.11 is linear with x 3 and x 4, we know that f has no lower bound in R n. Consequently, for the sequence {x k } generated by some optimization method for the minimization of this function, it is expected that f(x k tends to, but this will not happen in our examples. With the prefixed form (3.11, we have that f(x 1, x, x 3, x 4 = λ x 1 x 3 + µ x 1 x 4 λ x x 3 + µ x x 4 λ(x 1, x µ(x 1, x. (3.1 Comparing the last two components of the right hand vector in (3.1 with the gradients in (3., we must have that λ(v i = c i, µ(v i = d i. (3.13 Consequently, by (.48 and (3.3, we can obtain the concrete values of λ and µ at the eight vertices of Ω, which are listed in the second and third rows in Table 3.1. Further, for each i, let J i be the Jacobian of (λ, µ at vertex V i and denote J i = ( λ x 1 λ x 1 µ µ x x Vi, J 1 = ( ω1 ω. (3.14 ω 3 ω 4 The comparison of the first two components of the right hand vector in (3.1 with the gradients in (3. leads to the relation ( ( pi li J i =. (3.15 q i To be such that (3.15 always holds, we ask J i+1 and J i to meet the condition h i J i+1 = R 1 J i R T (
18 for all i 1. In this case, if (3.15 holds for some i, we have by this, (3.6 and (3.3 that ( ( ( ( ( pi+1 J i+1 = (R q 1 J i R T pi pi li li+1 (tr = tr i+1 q 1 J i = tr i q 1 =, i h i h i+1 which means that (3.15 holds with i + 1. Therefore by the induction principle, to be such that (3.15 holds for all i 1, it remains to choose J 1 such that (3.15 holds with i = 1. Noticing that there are still two degrees of freedom, we ask the special relations ω = ω 3, ω 4 = ω 1. (3.17 Then the values of ω 1 and ω 3 can be solved from (3.15 with i = 1 and (3.17, ω 1 = ( t ( , ω 3 = ( t + ( (3.18 Therefore if we choose J 1 to be the one in (3.14 with the values in (3.17 and (3.18 and ask the relation (3.16 for all i 1, then we will have (3.15 for all i 1. Now, using (3.13, (3.16 and (3.17, we can list the function values and gradients of λ and µ at the vertices V i s of the octagon Ω into Table 3.1. V 1 V V 3 V 4 V 5 V 6 V 7 V 8 λ µ λ x 1 ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 λ x ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 µ x 1 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 µ x ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 Table 3.1. Function values and gradients of λ and µ at V i s By using the special choice of Ω and observing the values in Table 3.1, we can ask the functions λ and µ to have the properties and µ(x 1, x = λ( x, x 1 (3.19 λ(x 1, x = λ( x 1, x (3.0 for all (x 1, x R. Further, by (3.0, we could think of constructing the function λ by a polynomial function with only odd orders. In the next subsection, we will construct a simple function that only meets the requirements (a and (c that are described in the first section. In 3.4, we construct a complicated function that can meet all the requirements (a, (b and (c. 18
19 3.3 A Simple Function that Meets (a and (c If we ignore the convexity requirement (b of the line search function, we can construct a relatively simple function λ(x 1, x to meet the interpolation conditions listed in Table 3.1 and the function µ(x 1, x is then given by (3.19. In fact, we can check that the following function λ f (x 1, x =(15 1 x (6 9 (x 4 1 x + x 1x 3 + ( x 3 1 +( 1 + (6x 1x + x 3 + ( x1 9 8 x. has values of 0, 1,, 1 at vertices V 1, V, V 3, V 4, respectively, and zero derivatives at all the vertices V i s. Further, we can check the following function λ g (x 1, x = 3 (x 1 1[x 1 (3 + ]x 4 has zero function values at all the vertices V i s, but could be used to interpolate the derivatives of λ(x 1, x at V i s. Consequently, we know that λ(x 1, x = λ f (x 1, x + ω 1 λ g (x 1, x + ω 3 λ g ( x, x 1 (3.1 meets all the interpolation conditions required by λ(x 1, x in Table 3.1. Consequently, we could say that for the function defined by (3.11, (3.1 and (3.19, the BFGS method with the Wolfe line search using the unit initial step does not converge. Observing that Ψ i (α = tψ i (α, Ψ i(α = tψ i(α at α = 0, 1, we ideally wish there is the following relation between Ψ i (α and Ψ i+1 (α: Ψ i+1 (α = t Ψ i (α, for all α 0 (3. so that the neighboring line search functions only differ up to a constant multiplier of t. However, the choice (3.1 of λ(x 1, x does not lead to (3.. Nevertheless, we can propose the following compensation function and define λ c (x 1, x = x 1 [ x 1 + x ( + ] ˆλ(x 1, x = λ(x 1, x + c 1 λ c (x 1, x + c λ c ( x, x 1, (3.3 where c 1 = ( t ( , c = ( t + (
20 In this case, the relation (3. will always hold and the corresponding line search function at the first iteration is 6 ˆΨ 1 (α = ρ i α i, (3.4 where i=0 ρ 6 = ( t + ( , ρ 5 = ( t ( , ρ 4 = ( t + ( , ρ 3 = ( t ( , ρ = ( t + ( , ρ 1 = g T 1 δ 1 = ( t + (40 18, ρ 0 = f(x 1 = ( t + ( To sum up at this stage, for the function (3.11 with λ(x 1, x given in (3.1 or (3.3 and µ(x 1, x = λ( x, x 1, if the initial point is x 1 = (a 1, b 1, p 1, q 1 T (see (3.4, (3.8, (.47 for their values and if the initial matrix is H 1 given by (.58 and (.59, then the BFGS method (1. and (1.5 with α k 1 will generate the iterations in (3.5 whose gradients are given by (.4(.5. Therefore the method will asymptotically cycle around the eight vertices of a regular octagon without approaching a stationary point or pushing f(x k. 3.4 A Complicated Function that Meets (a, (b and (c To meet the requirement (b, this subsection gives a further compensation to the function ˆλ(x 1, x in (3.3 such that each line search function Ψ k (α is strongly convex. To this aim, we consider the straight line connecting x 8j+i and x 8j+i+1 : a i + α η i l i (α = x 8j+i + α δ 8j+i = b i + α ξ i t 8j (p i + α γ i. (3.5 t 8j (q i + α τ i By (3.1, the line search function at the (8j + ith iteration is Ψ 8j+i (α = t 8j[ λ(a i +α η i, b i +α ξ i ( p i +α γ i +µ(ai +α η i, b i +α ξ i ( q i +α τ i ]. (3.6 0
Orthogonal Bases and the QR Algorithm
Orthogonal Bases and the QR Algorithm Orthogonal Bases by Peter J Olver University of Minnesota Throughout, we work in the Euclidean vector space V = R n, the space of column vectors with n real entries
More informationOptimization by Direct Search: New Perspectives on Some Classical and Modern Methods
SIAM REVIEW Vol. 45,No. 3,pp. 385 482 c 2003 Society for Industrial and Applied Mathematics Optimization by Direct Search: New Perspectives on Some Classical and Modern Methods Tamara G. Kolda Robert Michael
More informationSubspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity
Subspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity Wei Dai and Olgica Milenkovic Department of Electrical and Computer Engineering University of Illinois at UrbanaChampaign
More informationA Modern Course on Curves and Surfaces. Richard S. Palais
A Modern Course on Curves and Surfaces Richard S. Palais Contents Lecture 1. Introduction 1 Lecture 2. What is Geometry 4 Lecture 3. Geometry of InnerProduct Spaces 7 Lecture 4. Linear Maps and the Euclidean
More informationRECENTLY, there has been a great deal of interest in
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 1, JANUARY 1999 187 An Affine Scaling Methodology for Best Basis Selection Bhaskar D. Rao, Senior Member, IEEE, Kenneth KreutzDelgado, Senior Member,
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationSemiSimple Lie Algebras and Their Representations
i SemiSimple Lie Algebras and Their Representations Robert N. Cahn Lawrence Berkeley Laboratory University of California Berkeley, California 1984 THE BENJAMIN/CUMMINGS PUBLISHING COMPANY Advanced Book
More informationWHICH SCORING RULE MAXIMIZES CONDORCET EFFICIENCY? 1. Introduction
WHICH SCORING RULE MAXIMIZES CONDORCET EFFICIENCY? DAVIDE P. CERVONE, WILLIAM V. GEHRLEIN, AND WILLIAM S. ZWICKER Abstract. Consider an election in which each of the n voters casts a vote consisting of
More informationA Tutorial on Spectral Clustering
A Tutorial on Spectral Clustering Ulrike von Luxburg Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7276 Tübingen, Germany ulrike.luxburg@tuebingen.mpg.de This article appears in Statistics
More informationON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION. A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT
ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT A numerical study of the distribution of spacings between zeros
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationControllability and Observability of Partial Differential Equations: Some results and open problems
Controllability and Observability of Partial Differential Equations: Some results and open problems Enrique ZUAZUA Departamento de Matemáticas Universidad Autónoma 2849 Madrid. Spain. enrique.zuazua@uam.es
More informationDecoding by Linear Programming
Decoding by Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles, CA 90095
More informationFrom Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images
SIAM REVIEW Vol. 51,No. 1,pp. 34 81 c 2009 Society for Industrial and Applied Mathematics From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images Alfred M. Bruckstein David
More informationHow many numbers there are?
How many numbers there are? RADEK HONZIK Radek Honzik: Charles University, Department of Logic, Celetná 20, Praha 1, 116 42, Czech Republic radek.honzik@ff.cuni.cz Contents 1 What are numbers 2 1.1 Natural
More informationThe Backpropagation Algorithm
7 The Backpropagation Algorithm 7. Learning as gradient descent We saw in the last chapter that multilayered networks are capable of computing a wider range of Boolean functions than networks with a single
More informationHighRate Codes That Are Linear in Space and Time
1804 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 7, JULY 2002 HighRate Codes That Are Linear in Space and Time Babak Hassibi and Bertrand M Hochwald Abstract Multipleantenna systems that operate
More informationAn Elementary Introduction to Modern Convex Geometry
Flavors of Geometry MSRI Publications Volume 3, 997 An Elementary Introduction to Modern Convex Geometry KEITH BALL Contents Preface Lecture. Basic Notions 2 Lecture 2. Spherical Sections of the Cube 8
More informationNew Results on Hermitian Matrix RankOne Decomposition
New Results on Hermitian Matrix RankOne Decomposition Wenbao Ai Yongwei Huang Shuzhong Zhang June 22, 2009 Abstract In this paper, we present several new rankone decomposition theorems for Hermitian
More informationCOSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES
COSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES D NEEDELL AND J A TROPP Abstract Compressive sampling offers a new paradigm for acquiring signals that are compressible with respect
More informationRegular Languages are Testable with a Constant Number of Queries
Regular Languages are Testable with a Constant Number of Queries Noga Alon Michael Krivelevich Ilan Newman Mario Szegedy Abstract We continue the study of combinatorial property testing, initiated by Goldreich,
More informationFoundations of Data Science 1
Foundations of Data Science John Hopcroft Ravindran Kannan Version /4/204 These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289. Compressed Sensing. David L. Donoho, Member, IEEE
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289 Compressed Sensing David L. Donoho, Member, IEEE Abstract Suppose is an unknown vector in (a digital image or signal); we plan to
More informationHow Bad is Forming Your Own Opinion?
How Bad is Forming Your Own Opinion? David Bindel Jon Kleinberg Sigal Oren August, 0 Abstract A longstanding line of work in economic theory has studied models by which a group of people in a social network,
More informationSome Applications of Laplace Eigenvalues of Graphs
Some Applications of Laplace Eigenvalues of Graphs Bojan MOHAR Department of Mathematics University of Ljubljana Jadranska 19 1111 Ljubljana, Slovenia Notes taken by Martin Juvan Abstract In the last decade
More informationA Googlelike Model of Road Network Dynamics and its Application to Regulation and Control
A Googlelike Model of Road Network Dynamics and its Application to Regulation and Control Emanuele Crisostomi, Steve Kirkland, Robert Shorten August, 2010 Abstract Inspired by the ability of Markov chains
More informationFast Solution of l 1 norm Minimization Problems When the Solution May be Sparse
Fast Solution of l 1 norm Minimization Problems When the Solution May be Sparse David L. Donoho and Yaakov Tsaig October 6 Abstract The minimum l 1 norm solution to an underdetermined system of linear
More informationSUMMING CURIOUS, SLOWLY CONVERGENT, HARMONIC SUBSERIES
SUMMING CURIOUS, SLOWLY CONVERGENT, HARMONIC SUBSERIES THOMAS SCHMELZER AND ROBERT BAILLIE 1 Abstract The harmonic series diverges But if we delete from it all terms whose denominators contain any string
More informationErgodicity and Energy Distributions for Some Boundary Driven Integrable Hamiltonian Chains
Ergodicity and Energy Distributions for Some Boundary Driven Integrable Hamiltonian Chains Peter Balint 1, Kevin K. Lin 2, and LaiSang Young 3 Abstract. We consider systems of moving particles in 1dimensional
More informationSteering User Behavior with Badges
Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,
More information