A Perfect Example for The BFGS Method




A Perfect Example for The BFGS Method

Yu-Hong Dai

Abstract. Consider the BFGS quasi-Newton method applied to a general nonconvex function that has continuous second derivatives. This paper aims to construct a four-dimensional example such that the BFGS method need not converge. The example is perfect in the following sense: (a) all the stepsizes are exactly equal to one; the unit stepsize can also be accepted by various line searches, including the Wolfe line search and the Armijo line search; (b) the objective function is strongly convex along each search direction, although it is not convex itself, and the unit stepsize is the unique minimizer of each line search function, so the example also applies to the global line search and the line search that always picks the first local minimizer; (c) the objective function is polynomial and hence is infinitely continuously differentiable. If the convexity requirement (b) on the line search function is relaxed, we are able to construct a relatively simple polynomial example.

Keywords. unconstrained optimization, quasi-Newton method, nonconvex function, global convergence.

1 Introduction

Consider the unconstrained optimization problem

    min f(x),  x ∈ R^n,    (1.1)

where f is a general nonconvex function that has continuous second derivatives. The quasi-Newton method is a class of well-known and efficient methods for solving (1.1). It is of the iterative scheme

    x_{k+1} = x_k − α_k H_k g_k,    (1.2)

This work was partially supported by the Chinese NSF grants 10571171 and 10831106 and the CAS grant kjcx-yw-s7-03. State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, P.O. Box 2719, Beijing 100190, P.R. China. E-mail: dyh@lsec.cc.ac.cn

where x_1 is a starting point, H_k is some approximation to the inverse Hessian [∇²f(x_k)]^{−1}, g_k = ∇f(x_k), and α_k is a stepsize obtained in some way. Defining the vectors

    δ_k = x_{k+1} − x_k,  γ_k = g_{k+1} − g_k,

the quasi-Newton method asks the next approximation matrix H_{k+1} to satisfy the secant equation

    H_{k+1} γ_k = δ_k.    (1.3)

The similarity of (1.3) to the identity [∇²f(x_{k+1})]^{−1} γ_k = δ_k, which holds exactly in the quadratic case, enables the quasi-Newton method to be superlinearly convergent (see Dennis and Moré [5], for example) and makes it very attractive in practical optimization. The first quasi-Newton method dates back to Davidon [4] and Fletcher and Powell [9]. The DFP method updates the approximation matrix H_k to H_{k+1} by the formula

    H_{k+1} = H_k − (H_k γ_k γ_k^T H_k)/(γ_k^T H_k γ_k) + (δ_k δ_k^T)/(δ_k^T γ_k).    (1.4)

Nowadays, the most efficient quasi-Newton method is perhaps the BFGS method, which was proposed independently by Broyden [2], Fletcher [7], Goldfarb [10], and Shanno [24]. The matrix H_{k+1} in the BFGS method is updated by

    H_{k+1} = H_k − (δ_k γ_k^T H_k + H_k γ_k δ_k^T)/(δ_k^T γ_k) + (1 + (γ_k^T H_k γ_k)/(δ_k^T γ_k)) (δ_k δ_k^T)/(δ_k^T γ_k).    (1.5)

To pass the positive definiteness of the matrix H_k on to H_{k+1}, practical quasi-Newton algorithms make use of the Wolfe line search to calculate the stepsize α_k, which ensures that a positive curvature is found at each iteration; namely, δ_k^T γ_k > 0. More exactly, defining the one-dimensional line search function

    Ψ_k(α) = f(x_k − α H_k g_k),  α ≥ 0,    (1.6)

the Wolfe line search consists in finding a stepsize α_k such that

    Ψ_k(α_k) ≤ Ψ_k(0) + µ Ψ_k'(0) α_k    (1.7)

and

    Ψ_k'(α_k) ≥ η Ψ_k'(0),    (1.8)

where µ and η are constants with 0 < µ < η < 1. There has been a large quantity of research devoted to the global convergence of the quasi-Newton method (see Yuan [28] and Sun and Yuan [26], for example). Specifically, Powell [18] showed that the DFP method with exact line searches is globally convergent for uniformly convex functions. In another paper [20], Powell established the global convergence of the BFGS method with the Wolfe line search for uniformly convex functions.
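For readers who want to experiment with the construction, the following is a minimal numpy sketch (not part of the original paper) of the inverse-Hessian BFGS update (1.5) together with a check of the Wolfe conditions (1.7)-(1.8); the function names and the default parameters µ = 1e-4, η = 0.9 are illustrative choices.

```python
import numpy as np

def bfgs_update(H, delta, gamma):
    """Inverse-Hessian BFGS update (1.5): returns H_{k+1} from H_k, delta_k and gamma_k."""
    dg = float(delta @ gamma)            # curvature delta_k^T gamma_k, positive under (1.8)
    Hg = H @ gamma                       # H_k gamma_k
    return (H
            - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg
            + (1.0 + (gamma @ Hg) / dg) * np.outer(delta, delta) / dg)

def wolfe_conditions(psi0, dpsi0, psi_a, dpsi_a, alpha, mu=1e-4, eta=0.9):
    """Check (1.7) and (1.8) for the line search function Psi_k at the stepsize alpha."""
    return (psi_a <= psi0 + mu * alpha * dpsi0) and (dpsi_a >= eta * dpsi0)
```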

This result was extended by Byrd, Nocedal and Yuan [1] to the whole Broyden convex family of methods except the DFP method. Therefore it is natural to ask the following question: does the DFP method with the Wolfe line search converge globally for uniformly convex functions? On the other hand, since the BFGS method with the Wolfe line search works well for both convex and nonconvex functions, we might ask another question: does the BFGS method with the Wolfe line search converge globally for general functions? The difficulty and importance of these two convergence problems have been addressed in many places, including Nocedal [15], Fletcher [8], and Yuan [28].

Recent studies provide a negative answer to the convergence problem of the BFGS method for nonconvex functions. As a matter of fact, in an early paper [21], which analyzes the convergence properties of the conjugate gradient method, Powell mentioned that the BFGS method need not converge if the line search can pick any local minimizer of the line search function Ψ_k(α). After further studies on the two-dimensional example in [21], Dai [3] presented an example with six cycling points and showed by the example that the BFGS method with the Wolfe line search may fail for nonconvex functions. Later, Mascarenhas [14] constructed a three-dimensional counter-example such that the BFGS method does not converge if the line search picks the global minimizer of the function Ψ_k(α). It should be noted that the stepsize in the counter-example of [14] also satisfies the Wolfe line search conditions. However, neither example is such that the stepsize is the first local minimizer of the line search function Ψ_k(α).

Surprisingly enough, if there are only two variables, and if the stepsize is chosen to be the first local minimizer of Ψ_k(α); namely,

    α_k = arg min {α > 0 : α is a local minimizer of Ψ_k(α)},    (1.9)

Powell [22] established the global convergence of the BFGS method for general twice continuously differentiable functions. Powell's proof makes use of the principle of contradiction and is quite sophisticated. Assuming the nonconvergence of the method and the relation lim inf_{k→∞} ‖g_k‖ > 0, Powell showed that the limit points of the BFGS path

    P = {x : x lies on the line segment connecting x_k and x_{k+1} for some k ≥ 1}

(which is exactly the same as the Polak-Ribière conjugate gradient path if n = 2 and g_{k+1}^T δ_k = 0 for all k) form some line segment L. The use of the specific line search (1.9) is such that the objective function is monotonically decreasing on the line segment L_k that connects x_k and x_{k+1}. It follows that f(x) ≥ lim_{k→∞} f(x_k) for all x ∈ L. Consequently, the line segment L_k cannot cross L, and all the points x_k with large indices lie on the same side of L. Furthermore, Powell showed that, starting from around one end of the line segment L, the BFGS path cannot get close to the other end of L, and finally obtained a contradiction. Intuitively, it is difficult to extend Powell's result to the case when n ≥ 3. This is because, even if it has been shown that the BFGS path

tends to some limit line segment L, each line segment L_k could turn around L when n ≥ 3, making a similar analysis of the BFGS path complicated and nearly impossible.

The purpose of this paper is to construct a four-dimensional counter-example for the BFGS method with the following features:

(a) All the stepsizes in the example are exactly equal to one; namely,

    α_k ≡ 1;    (1.10)

for the search directions defined by the BFGS update in the example, the unit stepsize satisfies various line search conditions, including the Wolfe conditions and the Armijo conditions;

(b) The objective function is strongly convex along each search direction; namely, the line search function Ψ_k(α) is strongly convex for all k, although the objective function is not convex itself. The unit stepsize is the unique minimizer of Ψ_k(α). Hence the example also applies to the global line search and the specific line search (1.9);

(c) The objective function is polynomial and hence is infinitely continuously differentiable.

On the other hand, the objective function in our example is linear in the third and fourth variables and thus has no local minimizer. Nevertheless, the iterations generated by the BFGS method tend to non-stationary points. As seen from 3.4, the construction of the example is quite complicated. To make the line search function Ψ_k(α) strongly convex, a polynomial of degree 38 is introduced. If we relax the convexity requirement on Ψ_k(α); namely, (b) above, it is possible to construct a relatively simple polynomial example of low degree (see 3.3).

The construction of our examples can be divided into four procedures:

(1) Prefix the special forms of the steps {δ_k} and gradients {g_k}, leaving several parameters to be determined later. In this case, once x_1 is given, the whole sequence {x_k} is fixed. As seen in 2.1, the steps {δ_k} and gradients {g_k} are asked to possess some symmetrical and cyclic properties and to push the iterations {x_k} toward the eight vertices of a regular octagon. This is very helpful in simplifying the construction of the examples, and we can focus our attention on the choice of the parameter t, which accounts for the decay of the last two components of δ_k and the first two components of g_k.

(2) To enable the BFGS method to generate those prefixed steps, investigate the consistency conditions on the steps {δ_k} and gradients {g_k}. With the prefixed forms of {δ_k} and {g_k}, we show in 2.2 that the unit stepsize is indispensable; this stepsize is also the one usually used as the first trial stepsize in implementations of quasi-Newton methods. Four more consistency conditions on {δ_k} and {g_k} are obtained in 2.3 by expressing the vectors H_{k+1}γ_k, H_{k+1}g_{k+1}, H_{k+1}g_{k+2} and H_{k+1}g_{k+3} in terms of some δ_k's and g_k's instead of considering the quasi-Newton

matrix H_{k+1} itself.

(3) Choose the parameters in some way to satisfy the consistency conditions and other necessary conditions on the objective function. As seen in 2.4, the first three consistency conditions correspond to an under-determined linear system. Substituting its general solution, obtained by Cramer's rule, into the fourth consistency condition, we are led to a nonlinear equation from which the exact value of the decay parameter t can be obtained. By suitably choosing the other parameters, we can express all the quasi-Newton updating matrices, including the initial choice for H_1, in 2.5.

(4) Construct a suitable objective function f whose gradients are the preassigned values; namely, ∇f(x_k) = g_k for all k ≥ 1, and such that the line search has the desired properties. This will be done in the whole third section. To make full use of the symmetrical and cyclic properties of the steps {δ_k} and {g_k}, we carefully choose a special form of the objective function. In addition, the introduction of the element function φ in 3.4 helps us greatly to convexify the line search function and finally complete the perfect example that meets all the requirements (a), (b) and (c).

Some concluding remarks are given in the last section.

2 Looking for Consistent Steps and Gradients

2.1 The Forms of Steps and Gradients

Consider the case of dimension four. Inspired by [21], [3] and [14], we assume that the steps {δ_k} have the following form:

    δ_1 = (η_1, ξ_1, γ_1, τ_1)^T;   δ_{k+1} = M δ_k  (k ≥ 1),    (2.1)

where M is the 4×4 matrix defined by

    M = [ cos θ_1   −sin θ_1       0           0
          sin θ_1    cos θ_1       0           0
             0          0      t cos θ_2   −t sin θ_2
             0          0      t sin θ_2    t cos θ_2 ].    (2.2)

In the above, t is a parameter satisfying 0 < t < 1. This parameter accounts for the decay of the last two components of the steps {δ_k}. The angles θ_1 and θ_2 are chosen to be rational multiples of 2π, so that the associated plane rotations are periodic. Specifically, we choose in this paper

    θ_1 = (1/4)π,  θ_2 = (3/4)π.    (2.3)

Consequently, the first two components of the steps {δ_k} return to the same values after every eight iterations, and the last two components shrink by a factor of t^8 after every eight iterations. Further, we can see that the iterations {x_k} will tend to turn around the eight vertices of some regular octagon (see 3.1 for more details).
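As a quick illustration (not from the paper) of the cyclic and decaying behaviour just described, one can form the matrix M of (2.2) numerically with θ_1 = π/4, θ_2 = 3π/4 and the approximate value t ≈ 0.7973 from (2.47), and check that M^8 leaves the first two components unchanged while shrinking the last two by t^8; the vector delta1 below is a placeholder, not the paper's initial step.

```python
import numpy as np

t = 0.7973                                  # approximate decay parameter, cf. (2.47)
th1, th2 = np.pi / 4, 3 * np.pi / 4         # the angles chosen in (2.3)

def rot(th):
    """2x2 plane rotation by the angle th."""
    return np.array([[np.cos(th), -np.sin(th)],
                     [np.sin(th),  np.cos(th)]])

Z = np.zeros((2, 2))
M = np.block([[rot(th1), Z],                # step transition matrix of (2.2):
              [Z, t * rot(th2)]])           # rotate the first two coords, rotate and shrink the last two

delta1 = np.array([1.0, 0.5, 0.3, 0.2])     # placeholder delta_1 (illustrative values only)
M8 = np.linalg.matrix_power(M, 8)
print(np.round(M8, 6))                      # approximately diag(I, t^8 I)
print(np.round(M8 @ delta1, 6), t ** 8)     # first two components repeat, last two shrink by t^8
```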

Accordingly, the gradients {g_k} are assumed to be of the form

    g_1 = (l_1, h_1, c_1, d_1)^T;   g_{k+1} = P g_k  (k ≥ 1),    (2.4)

where

    P = [ t cos θ_1   −t sin θ_1       0           0
          t sin θ_1    t cos θ_1       0           0
              0            0        cos θ_2    −sin θ_2
              0            0        sin θ_2     cos θ_2 ].    (2.5)

Since 0 < t < 1, we see that the first two components of {g_k} tend to zero, whereas the last two components of the gradients {g_k} return to the same values after every eight iterations. It follows that none of the cluster points generated by the BFGS method is a stationary point.

2.2 Unit Stepsizes

It is well known that the unit stepsize is usually used as the first trial stepsize in practical implementations of the BFGS method. Under suitable assumptions on f, the unit stepsize will be accepted by the line search as the iterates tend to the solution and will enable superlinear convergence (see [5], for example). In the following, we are going to show that if the line search satisfies

    g_{k+1}^T δ_k = 0,  for all k ≥ 1,    (2.6)

and if the steps generated by the BFGS method have the form (2.1)-(2.2) and the gradients have the form (2.4)-(2.5), then the use of unit stepsizes is also indispensable.

To begin with, we see that the line search condition (2.6), the secant equation (1.3) and the definition of the search direction indicate that

    H_{k+1} g_{k+1} = −α_{k+1}^{−1} δ_{k+1}    (2.7)

and

    δ_{k+1}^T γ_k = −α_{k+1} g_{k+1}^T H_{k+1} γ_k = −α_{k+1} g_{k+1}^T δ_k = 0.    (2.8)

The relation (2.8) is sometimes called the conjugacy condition in the context of nonlinear conjugate gradient methods. Multiplying the BFGS updating formula (1.5) by g_{k+1} and using (2.6), we obtain

    H_{k+1} g_{k+1} = H_k g_{k+1} − (γ_k^T H_k g_{k+1})/(δ_k^T γ_k) δ_k,

which with (2.7) gives

    H_k g_{k+1} = −α_{k+1}^{−1} δ_{k+1} + (γ_k^T H_k g_{k+1})/(δ_k^T γ_k) δ_k.    (2.9)

Multiplying (2.9) by g_k^T and noticing that g_k^T H_k g_{k+1} = 0, we get that

    0 = −α_{k+1}^{−1} g_k^T δ_{k+1} + (γ_k^T H_k g_{k+1}) (g_k^T δ_k)/(δ_k^T γ_k) = −α_{k+1}^{−1} g_k^T δ_{k+1} − γ_k^T H_k g_{k+1}.

Thus γ T k H kg k+1 = α 1 k+1 gt k δ k+1 = α 1 k+1 gt k+1 δ k+1. Hence, by (.9, H k g k+1 = α 1 k+1 δ k+1 α 1 gk+1 T δ k+1 k+1 δ T δ k. (.10 k γ k It follows from (.10 and (.7 with k replaced by k 1 that [ H k γ k = α 1 k+1 δ k+1 + α 1 k α 1 gk+1 T δ ] k+1 k+1 δ T δ k. (.11 k γ k Further, substituting this into the BFGS updating formula (1.5 yields [ H k+1 = H k + α 1 δ k δ T k+1 + δ k+1 δ T k k+1 δ T + 1 α 1 k + α 1 gk+1 T δ ] k+1 k+1 k γ k δ T k γ k δ k δ T k δ T k γ k. (.1 The above new updating formula requires the quantity δ k+1, that depends on H k+1 itself, and hence has theoretical meanings only. Lemma.1. Assume that (.6 holds. Then for all k 1 and i 0, the vector H k g k+i + α 1 k+i δ k+i belongs to the subspace spanned by δ k, δ k+1,..., δ k+i 1 ; namely, H k g k+i + α 1 k+i δ k+i Span{δ k, δ k+1,..., δ k+i 1 }. (.13 Proof. For convenience, we write (.1 as H k+1 = H k + V (s k, s k+1, (.14 where V (s k, s k+1 means the rank-two matrix in the right hand of (.1. Therefore we have for all i 1, H k+i = H k + i V (s k+j 1, s k+j. (.15 j=1 The statement follows by multiplying the above relation with g k+i and using (.7 with k replaced by k + i 1 and g T k+i δ k+i 1 = 0. Q.E.D. Lemma.. To construct a desired example with Det (S 1 0, we must have that α k = 1, for all k 4. Proof. Define the following matrices G k = [ γ k 1 g k g k+1 g k+ ], S k = [ δ k 1 δ k δ k+1 δ k+ ]. 7

By H k γ k 1 = δ k 1 and Lemma.1, we can get that Det(H k Det(G k = Det [ ] H k γ k 1 H k g k H k g k+1 H k g k+ = Det [ δ k 1 α 1 k δ k α 1 k+1 δ k+1 α k+ δ ] k+ = α 1 k α 1 k+1 α 1 k+ Det(S k. (.16 Replacing k with k + 1 in the above yields Det(H k+1 Det(G k+1 = α 1 k+1 α 1 k+ α 1 k+3 Det(S k+1. (.17 On the other hand, due to the special forms of {g k } and {δ k }, we know that G k+1 = P G k and S k+1 = MS k. Hence Det(G k+1 = Det(P Det(G k = t Det(G k, Det(S k+1 = Det(M Det(S k = t Det(S k. (.18 Due to the basic determinant relation of the BFGS update, (.6 and (.7, it is not difficult to see that Det(H k+1 = δt k H 1 k δ k δ T Det(H k = α k Det(H k. (.19 k γ k If Det(S 1 0, (.16 with k = 1 implies that Det(G 1 0. Then by (.18, Det(G k 0, Det(S k 0, for all k 1. (.0 Dividing (.17 by (.16 and using the above relations, we then obtain α k+3 = 1. So the statement holds due to the arbitrariness of k 1. Q.E.D. The deletion of the first finite iterations does not influence the whole example. Thus we will ask our counter-example to satisfy and Det(S 1 0 (.1 α k 1. (. In this case, the updating formula (.1 of H k can be simplified as H k+1 = H k + δ kδ T k+1 + δ k+1 δ T k δ T gt k+1 δ k+1 k γ k gk T δ k δ k δ T k δ T k γ k. (.3 8
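The key step (2.19) rests on the standard determinant identity for the BFGS update of the inverse Hessian, det H_{k+1} = det H_k · (δ_k^T H_k^{−1} δ_k)/(δ_k^T γ_k). The sketch below (random data, not the paper's) only verifies this identity numerically; it is an illustration, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                  # a random symmetric positive definite H_k
delta = rng.standard_normal(n)
gamma = rng.standard_normal(n)
if delta @ gamma < 0:
    gamma = -gamma                           # enforce positive curvature delta_k^T gamma_k > 0

dg = delta @ gamma
Hg = H @ gamma
H_next = (H - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg
            + (1.0 + (gamma @ Hg) / dg) * np.outer(delta, delta) / dg)   # BFGS update (1.5)

lhs = np.linalg.det(H_next)
rhs = np.linalg.det(H) * (delta @ np.linalg.solve(H, delta)) / dg        # identity behind (2.19)
print(np.isclose(lhs, rhs))
```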

.3 Consistency Conditions Now we ask what else conditions, besides the requirement of unit stepsizes, have to be satisfied by the parameters in the definitions of {g k } and {δ k } so that the steps {s k ; k 1} can be generated by the BFGS update. Our early idea is based on the observation on the updating formula (1.5 that H k+1 is linear with H k and the linear system H 8j+9 = diag(t 4 E, t 4 E H 8j+1 diag(t 4 E, t 4 E, (.4 where E is the two-dimensional identity matrix. However, this way seems to be quite complicated. Notice that the dimension is n = 4 and by Lemma., the assumption (.1 implies (.0. Then the matrix H k+1 can be uniquely defined by the equations given by H k+1 γ k, H k+1 g k+1, H k+1 g k+ and H k+1 g k+3. As a matter of fact, we have that H k+1 γ k = δ k (the secant equation, (.5 H k+1 g k+1 = δ k+1 ( by (.7 and α k 1, (.6 H k+1 g k+ = δ k+ + gt k+ δ k+ gk+1 T δ δ k+1 (by (.10 and α k 1, (.7 k+1 ( g T H k+1 g k+3 = δ k+3 + k+3 δ k+3 gk+ T δ + gt k+3 δ k+1 k+ gk+1 T δ δ k+ k+1 ( ( g T k+ δ k+ g T k+3 δ k+1 gk+1 T δ k+1 gk+1 T δ δ k+1. (.8 k+1 The last equality is obtained by multiplying (.3 by g k+, using (.7 and finally replacing k with k + 1. Relations (.5-(.8 provide a system of 16 equations, while the symmetric matrix H k+1 only has 10 independent entries. How to ensure that this linear system has a symmetric solution H k+1? We have the following general lemma. Lemma.3. Assume that {u 1, u,..., u n } and {v 1, v,..., v n } are two sets of n-dimensional linearly independent vectors. Then there exists a symmetric matrix H R n n satisfying Hu i = v i, i = 1,,..., n (.9 if and only if u T i v j = u T j v i, i, j = 1,,..., n. (.30 Proof. The only if part. If H = H T satisfies (.9, we have for all i, j = 1,,..., n, u T i v j = u T i Hu j = u T i HT u j = (Hu i T u j = v T i u j. The if part. Assume that (.30 holds. Defining the matrices U = (u 1 u... u n, V = (v 1 v... v n, (.31 9

direct calculations show that U T HU = U T V = u T 1 v 1 u T 1 v u T 1 v n u T v 1 u T v u T v n u T n v 1 u T n v u T n v n := A. (.3 By (.30, A is symmetric. So H = U T AU 1 satisfies H = H T and (.9. This completes our proof. Q.E.D. By the above lemma, the following six conditions are sufficient for the linear system (.5-(.8 to allow a symmetric solution matrix H k+1. g T k+1 (H k+1y k = (H k+1 g k+1 T y k, g T k+ (H k+1y k = (H k+1 g k+ T y k, g T k+3 (H k+1y k = (H k+1 g k+3 T y k, g T k+ (H k+1g k+1 = (H k+1 g k+ T g k+1, g T k+3 (H k+1g k+1 = (H k+1 g k+3 T g k+1, g T k+3 (H k+1g k+ = (H k+1 g k+3 T g k+. Further, considering the whole sequence of {H k+1 ; k 1} and combining the line search condition g T k+1 δ k = 0, we can deduce the following four consistency conditions, which should be satisfied for all k 1, g T k+1δ k = 0, (.33 δ T k+1y k = 0, (.34 gk+δ T k = δ T k+y k, (.35 ( g gk+3δ T k = δ T T k+3y k + k+3 δ k+3 gk+ T δ + gt k+3 δ k+1 k+ gk+1 T δ δ T k+y k. (.36 k+1 The positive definiteness of the matrix H k+1 will be further considered in.5..4 Choosing The Parameters In this subsection we focus on how to choose the decay parameter t and suitable vectors δ 1 and g 1 such that the consistency conditions (.33, (.34, (.35 and (.36 hold with k = 1. Due to the special structure of this example, we then know that the consistency conditions hold for all k. Define v = ( 1 3 4 T, where 1 = l 1 η 1 + h 1 ξ 1, = h 1 η 1 l 1 ξ 1, 3 = c 1 γ 1 + d 1 τ 1, 4 = d 1 γ 1 c 1 τ 1. (.37 As will be seen, the first three consistent conditions provide an under-determined linear system with v, from which we can get a general solution by the Cramer s rule. The substitution of v into the fourth condition (.36, which is nonlinear, yields a desired value for the decay parameter t. 10
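Lemma 2.3 and its proof can be checked numerically: when the compatibility condition u_i^T v_j = u_j^T v_i holds, the symmetric solution is H = U^{−T} A U^{−1} with A = U^T V. The sketch below uses random data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
U = rng.standard_normal((n, n))        # columns u_1, ..., u_n (linearly independent w.h.p.)
S = rng.standard_normal((n, n))
H_sym = S + S.T                        # an arbitrary symmetric matrix, used to generate consistent v_i
V = H_sym @ U                          # columns v_i = H u_i

A = U.T @ V                            # A_ij = u_i^T v_j; its symmetry is exactly condition (2.30)
print(np.allclose(A, A.T))

Uinv = np.linalg.inv(U)
H = Uinv.T @ A @ Uinv                  # the symmetric solution of H u_i = v_i from the proof
print(np.allclose(H, H.T), np.allclose(H @ U, V))
```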

At first, the condition δ T 1 g = δ T 1 P g 1 = 0, that is (.33 with k = 1, asks [t cos θ 1 t sin θ 1 cos θ sin θ ] v = 0. (.38 The condition δ T γ 1 = δ T 1 M T (P Ig 1 = δ T 1 (ti M T g 1 = 0, that is (.34 with k = 1, requires the vector v to satisfy [t cos θ 1 sin θ 1 t(1 cos θ t sin θ ] v = 0. (.39 The requirement (.35 with k = 1, that is, 0 = δ T 1 g 3 + δ T 3 γ 1 = δ T 1 P g 1 + δ T 1 (M T (P Ig 1 = δ T 1 [P + tm T (M T ]g 1, yields the equation [ t cos θ1 + (t 1 cos θ 1 t sin θ 1 (1 + t sin θ 1 t (cos θ cos θ + cos θ t (sin θ sin θ sin θ ] v = 0. (.40 By (.38, (.39, (.40 and the choices of θ 1 and θ, we see that { i } must satisfy t t t (1 + t t t t (1 + t t ( + 1t + 1 1 3 4 By the Cramer s rule, to meet (.41, we may choose { i } as follows: 1 = ± [ ( + t 3 + (1 + t + 1 ], = ± [ ( + t 3 + (1 + t 1 ], 3 = ± [ t 3 + t + (1 t + 1 ], 4 = ± [ t 3 t + (1 + t 1 ]. = 0. (.41 (.4 Since g1 T δ 1 = 1 + 3 = ±(t + 1(t + 1 and 0 < t < 1, we choose all the signs in (.4 as so that g1 T δ 1 < 0. We now want to substitute the general solution to the fourth consistent condition (.36. Noting that δ T 3 γ 1 = g3 T δ 1 and g4 T δ 4 = t g3 T δ 3, we know that (.36 with k = 1 is equivalent to [ ] g4 T δ 1 + g T δ 4 g1 T δ 4 + g3 T δ 1 t + gt 4 δ g T δ = 0. (.43 Further, using g T δ = t g T 1 δ 1, g T 4 δ = t g T 3 δ 1 and g T δ 4 = t g T 1 δ 3, (.43 can be simplified as g T 1 δ 1 [ g T 4 δ 1 g T 1 δ 4 + t(g T 3 δ 1 + g T 1 δ 3 ] + (g T 3 δ 1 = 0. (.44 11

Consequently, we obtain the following equation for the parameter t: where (t + 1 (t + 1 [ p (t + tp(t q(t ] = 0, (.45 p(t = ( + t + ( + t 1, q(t = ( + 3 t 3 (4 + 3 t + (3 + t. Further calculations provide ( 6 + 4 1 [ p (t + tp(t q(t ] = [ t + (1 t + (1 ] [t + (1 3 t + ( 3 + 7 ]. Considering the requirement that t ( 1, 1, one can deduce the following quadratic equation from (.45, ( t + 1 3 ( t + 3 + 7 = 0. (.46 The above equation has a unique root of t in the interval ( 1, 1, t = 3 1 31 0. (.47 The numerical value of t is equal to 0.7973 approximately. Therefore if we choose the above t and the vectors δ 1 and g 1 such that (.4 holds with minus sign, then the prefixed steps and gradients satisfy the consistency conditions (.33, (.34, (.35 and (.36 for all k 1. There are many ways to choose δ 1 and g 1 satisfying (.4. Specifically, we choose ( η1 = ξ 1 ( 0, ( c1 = d 1 ( 0. (.48 In this case, by (.4 with minus signs and the definitions of i s, we can get that ( ( γ1 (17 8 t + ( 17 + 9 = τ 1 ( 17 + 9 t + (17 9, ( ( (.49 l1 ( 4 13 t + ( 1 + 11 = h 1 ( 4 13 t + ( 1 + 1. The initial step δ 1 and the initial gradient g 1 are then determined..5 Quasi-Newton Updating Matrices In this subsection we discuss the form of the BFGS quasi-newton updating matrices implied by (.5-(.8 under the consistent conditions (.33-(.36. A byproduct is that, we will know how to choose the initial matrix H 1 for the example. For this aim, we provide a follow-up lemma of Lemma.3. 1

Lemma.4. Assume that (.30 holds and the matrix H satisfies (.9. If further, the matrix A in (.3 is positive definite, there must exist nonsingular triangular matrices T 1 and T such that A = V T U = T T 1 T. (.50 Further, denoting ˆV = V T 1 = (ˆv 1, ˆv,..., ˆv n and the diagonal matrix T 1 1 T 1 = diag(t 1, t,..., t n, we have that H = n t i ˆv iˆv i T. (.51 i=1 Proof. The truth of (.50 is obvious by the Cholesky factorization of positive definite matrices. Further, by (.50 and the definition of ˆV, we get that Since HU = V, we obtain H = V U 1 = ( ( ˆV T 1 1 ˆV T U = T. T 1 ˆV T = ˆV ( T1 1 T 1 ˆV T, which with the definitions of ˆv i s and t i s implies the truth of (.51. Q.E.D. By the above lemma, we can express the form of H k+1 determined by (.5- (.6 under the consistent conditions (.33-(.36. At first, we make use of the steps {δ k+i ; i = 0, 1,, 3} to introduce the following two vectors that are orthogonal to γ k and g k+1 : z k = δt k+γ k δ T k γ k δ k δt k+g k+1 δ T k+1g k+1 δ k+1 + δ k+, w k = δt k+3γ k δ T k γ k δ k δt k+3g k+1 δ T k+1g k+1 δ k+1 + δ k+3. (.5 The vectors z k and w k are well defined because δ T k γ k = δ T k g k > 0 is positive due to the descent property of δ k. Direct calculations show that z T k γ k = z T k g k+1 = 0, w T k γ k = w T k g k+1 = 0. (.53 Further, we use z k and w k to define the vector that is orthogonal to g k+ : By the choice of v k, it is easy to see that v k = wt k g k+ z T k g k+ z k + w k. (.54 v T k γ k = v T k g k+1 = v T k g k+ = 0. (.55 13

Now we could express the matrix H k+1 by H k+1 = δ kδ T k δ T k g k δ k+1δ T k+1 δ T k+1g k+1 z kz T k z T k g k+ v kv T k v T k g k+3. (.56 Since δ T k g k < 0 for all k 1, we know that the matrix H k+1 is positive definite if and only if z T k g k+ < 0 and v T k g k+3 < 0. Direct calculations show that z T 1 g 3 = ( 5808 3348 t ( 691 480 < 0, v T 1 g 4 = ( 88803 + 63514 t + ( 109307 77868 < 0. Therefore we know that the matrix H is positive definite. In addition, it is difficult to build the relations δ T k+1g k+1 = t δ T k g k, z T k+1 g k+3 = t z T k g k+, v T k+1 g k+4 = t v T k g k+3, z k+1 = Mz k and v k+1 = M v k for all k 1. Thus we have by (.56 and δ k+1 = M δ k that H k+1 = 1 t MH km T, (.57 holds for all k and hence {H k : k } are all positive definite. With the above procedure, we can calculate the values of z 1, v 1 and H. Further, by (.3, we can obtain the initial matrix H 1 required by the example of this paper. H 1 = H δ 1δ T + δ δ T 1 δ T 1 γ 1 + gt δ g T 1 δ 1 δ 1 δ T 1 δ T 1 γ 1 := H 1 8778, (.58 where H 1 is a symmetric matrix with entries H 1 (1, 1 = ( 3690 1380 t + ( 7998 13694, H 1 (1, = ( 4474 + 1590 t ( 1990 + 11308, H 1 (1, 3 = ( 148 + 18496 t ( 33966 11118, H 1 (1, 4 = ( 356 + 95 t + ( 509 108 H 1 (, = ( 1954 1098 t + ( 6566 15580, H 1 (, 3 = 1068 t ( 5134 1068, H 1 (, 4 = ( 13600 + 5508 t ( 151 + 9996, H 1 (3, 3 = ( 7834 415769 t + ( 35654 + 163183, H 1 (3, 4 = ( 87543 + 576963 t + ( 943194 66093, H 1 (4, 4 = ( 83606 164543 t + ( 10410 + 4951. (.59 Direct calculations show that (.57 also holds with k = 1 and hence H 1 is a positive definite matrix. 14

3 Constructing A Suitable Objective Function 3.1 Recovering The Iterations and The Function Values Denote the rotation matrices ( cos θ1 sin θ R 1 = 1, R sin θ 1 cos θ = 1 ( cos θ sin θ sin θ cos θ. (3.1 If we write the steps {δ k } and gradients {g k } into the forms η i t 8j l i δ 8j+i = ξ i t 8j γ i, g 8j+i = t 8j h i c i ; i = 1,..., 8, (3. t 8j τ i d i we have from (.1, (., (.4 and (.5 that ( ( ηi+1 ηi = R ξ 1, i+1 ξ i ( ( li+1 li = t R h 1, i+1 h i ( γi+1 τ i+1 ( γi = t R, τ i ( ( ci+1 ci = R d. i+1 d i (3.3 For simplicity, we want the iterations {x k } asymptotically to turn around the eight vertices of some regular octagon Ω of the subspace spanned by the first and second coordinates. More exactly, such a regular octagon Ω has the origin as its center and its eight vertices V i are given by V i = (a i, b i, 0, 0 T with ( = b 1 1, ( a1 ( ai+1 b i+1 ( ai = R 1. (3.4 b i Here we should note that if there is no confusion, we also regard that Ω and V i s are defined in the subspace spanned by the first and second coordinates. To this aim, we ask the iterations {x k } to be of the form a i x 8j+i = b i t 8j p i, (3.5 t 8j q i where ( pi+1 q i+1 To decide the values of p 1 and q 1, noting that x 1 V 1 = x 1 lim j x 8j+1 = ( pi = t R. (3.6 q i 15 δ i, i=0

we have that ( p1 q 1 = i=0 ( γi τ i ( = (E tr 1 γ1, (3.7 τ 1 where again E is the two-dimensional identity matrix. Direct calculations show that ( p1 = 1 ( ( 1570 + 4019 t + (3931 17534 q 1 4633 (4376 1977 t + (9563 9433. (3.8 Now, let us assume that the limit of f(x k is f. Since for any j, the first two coordinates of {x 8j+i } always keep the same as those of its limit V i, we can get that ( T ( f(x 8j+i f = t 8j ci pi d i q i where = t 8j [ R i 1 ( T ( = t 8j+i 1 c1 p1 d 1 q 1 = t 8j+i 1 (f(x 1 f, ( c1 d 1 ] T [ (tr i 1 ( p1 q 1 ] f(x 1 f = c 1 p 1 + d 1 q 1 = ( 3954 4376 t + ( 18866 9563. 4633 Now we see the value of g T 8j+i δ 8j+i. Direct calculations show that t 8j T T l i η i l i η i g8j+i T δ 8j+i = t 8j h i ξ i c i t 8j γ i = t8j h i ξ i c i γ i d i t 8j τ i d i τ i = t 8j+i 1 (l 1 η 1 + h 1 ξ 1 + c 1 γ 1 + d 1 τ 1 = t 8j+i 1 g1 T δ 1, where g T 1 δ 1 = l 1 η 1 + h 1 ξ 1 + c 1 γ 1 + d 1 τ 1 = ( 44 + 13 t + (40 18. Therefore f(x 8j+i+1 f(x 8j+i α 8j+i g T 8j+iδ 8j+i = (f(x 8j+i+1 f (f(x 8j+i f g T 8j+iδ 8j+i = (t8j+i t 8j+i 1 (f(x 1 f t 8j+i 1 g T 1 δ 1 = (t 1(f(x 1 f g T 1 δ 1.6483E 0. (3.9 16

The above relation, together with g T 8j+i+1 δ 8j+i = 0, implies that the stepsize α 8j+i = 1 can be accepted by the Wolfe line search, the Armijo line search and the Goldstein line search with suitable line search parameters. 3. Seeking A Suitable Form for The Objective Function We now consider how to construct a smooth function f such that its gradient at any point x 8j+i given in (3.5 are the one given in (3.; namely, f(x 8j+i = g 8j+i, for all j 0 and i = 1,..., 8. (3.10 To this aim, we assume that f is of the form f(x 1, x, x 3, x 4 = λ(x 1, x x 3 + µ(x 1, x x 4, (3.11 where λ and µ are two-dimensional functions to be determined. Since the function (3.11 is linear with x 3 and x 4, we know that f has no lower bound in R n. Consequently, for the sequence {x k } generated by some optimization method for the minimization of this function, it is expected that f(x k tends to, but this will not happen in our examples. With the prefixed form (3.11, we have that f(x 1, x, x 3, x 4 = λ x 1 x 3 + µ x 1 x 4 λ x x 3 + µ x x 4 λ(x 1, x µ(x 1, x. (3.1 Comparing the last two components of the right hand vector in (3.1 with the gradients in (3., we must have that λ(v i = c i, µ(v i = d i. (3.13 Consequently, by (.48 and (3.3, we can obtain the concrete values of λ and µ at the eight vertices of Ω, which are listed in the second and third rows in Table 3.1. Further, for each i, let J i be the Jacobian of (λ, µ at vertex V i and denote J i = ( λ x 1 λ x 1 µ µ x x Vi, J 1 = ( ω1 ω. (3.14 ω 3 ω 4 The comparison of the first two components of the right hand vector in (3.1 with the gradients in (3. leads to the relation ( ( pi li J i =. (3.15 q i To be such that (3.15 always holds, we ask J i+1 and J i to meet the condition h i J i+1 = R 1 J i R T (3.16 17

for all i 1. In this case, if (3.15 holds for some i, we have by this, (3.6 and (3.3 that ( ( ( ( ( pi+1 J i+1 = (R q 1 J i R T pi pi li li+1 (tr = tr i+1 q 1 J i = tr i q 1 =, i h i h i+1 which means that (3.15 holds with i + 1. Therefore by the induction principle, to be such that (3.15 holds for all i 1, it remains to choose J 1 such that (3.15 holds with i = 1. Noticing that there are still two degrees of freedom, we ask the special relations ω = ω 3, ω 4 = ω 1. (3.17 Then the values of ω 1 and ω 3 can be solved from (3.15 with i = 1 and (3.17, ω 1 = (163 + 106 t (195 + 19, ω 3 = (57 + 33 t + (53 + 45. 34 34 (3.18 Therefore if we choose J 1 to be the one in (3.14 with the values in (3.17 and (3.18 and ask the relation (3.16 for all i 1, then we will have (3.15 for all i 1. Now, using (3.13, (3.16 and (3.17, we can list the function values and gradients of λ and µ at the vertices V i s of the octagon Ω into Table 3.1. V 1 V V 3 V 4 V 5 V 6 V 7 V 8 λ 0 1 1 0 1 1 µ 1 0 1 1 0 1 λ x 1 ω 1 ω 1 ω 1 -ω 1 ω 1 ω 1 ω 1 ω 1 λ x ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 µ x 1 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 ω 3 µ x ω 1 ω 1 -ω 1 ω 1 ω 1 ω 1 ω 1 ω 1 Table 3.1. Function values and gradients of λ and µ at V i s By using the special choice of Ω and observing the values in Table 3.1, we can ask the functions λ and µ to have the properties and µ(x 1, x = λ( x, x 1 (3.19 λ(x 1, x = λ( x 1, x (3.0 for all (x 1, x R. Further, by (3.0, we could think of constructing the function λ by a polynomial function with only odd orders. In the next subsection, we will construct a simple function that only meets the requirements (a and (c that are described in the first section. In 3.4, we construct a complicated function that can meet all the requirements (a, (b and (c. 18
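The gradient structure (3.12) of the ansatz (3.11) is easy to verify numerically. In the sketch below the toy choice of λ is the element function φ_1 of (3.31) and µ is obtained from the symmetry (3.19); these are illustrative stand-ins, not the interpolating polynomials (3.21) or (3.37) constructed in the next subsections.

```python
import numpy as np

# Toy lambda: the element function phi_1 of (3.31), used only to illustrate (3.11)-(3.12).
lam = lambda x1, x2: x1**3 - 3.0 * x1 * x2**2
mu  = lambda x1, x2: lam(-x2, x1)        # the symmetry mu(x1, x2) = lambda(-x2, x1), cf. (3.19)

def f(x):
    # The general form (3.11): linear in x3 and x4.
    return lam(x[0], x[1]) * x[2] + mu(x[0], x[1]) * x[3]

def grad_fd(x, h=1e-6):
    """Central finite differences, used only to check the gradient structure (3.12)."""
    g = np.zeros(4)
    for i in range(4):
        e = np.zeros(4); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

x = np.array([0.3, -0.7, 1.1, 0.4])
dlam = np.array([3*x[0]**2 - 3*x[1]**2, -6*x[0]*x[1]])        # partial derivatives of lambda
dmu  = np.array([6*x[0]*x[1], 3*x[0]**2 - 3*x[1]**2])         # partial derivatives of mu
g_exact = np.array([dlam[0]*x[2] + dmu[0]*x[3],
                    dlam[1]*x[2] + dmu[1]*x[3],
                    lam(x[0], x[1]),
                    mu(x[0], x[1])])                          # right-hand side of (3.12)
print(np.allclose(grad_fd(x), g_exact, atol=1e-5))
```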

3.3 A Simple Function that Meets (a and (c If we ignore the convexity requirement (b of the line search function, we can construct a relatively simple function λ(x 1, x to meet the interpolation conditions listed in Table 3.1 and the function µ(x 1, x is then given by (3.19. In fact, we can check that the following function λ f (x 1, x =(15 1 x 5 1 + (6 9 (x 4 1 x + x 1x 3 + ( 15 + 10 x 3 1 +( 1 + (6x 1x + x 3 + ( 15 4 15 8 x1 9 8 x. has values of 0, 1,, 1 at vertices V 1, V, V 3, V 4, respectively, and zero derivatives at all the vertices V i s. Further, we can check the following function λ g (x 1, x = 3 (x 1 1[x 1 (3 + ]x 4 has zero function values at all the vertices V i s, but could be used to interpolate the derivatives of λ(x 1, x at V i s. Consequently, we know that λ(x 1, x = λ f (x 1, x + ω 1 λ g (x 1, x + ω 3 λ g ( x, x 1 (3.1 meets all the interpolation conditions required by λ(x 1, x in Table 3.1. Consequently, we could say that for the function defined by (3.11, (3.1 and (3.19, the BFGS method with the Wolfe line search using the unit initial step does not converge. Observing that Ψ i (α = tψ i (α, Ψ i(α = tψ i(α at α = 0, 1, we ideally wish there is the following relation between Ψ i (α and Ψ i+1 (α: Ψ i+1 (α = t Ψ i (α, for all α 0 (3. so that the neighboring line search functions only differ up to a constant multiplier of t. However, the choice (3.1 of λ(x 1, x does not lead to (3.. Nevertheless, we can propose the following compensation function and define λ c (x 1, x = x 1 [ x 1 + x ( + ] ˆλ(x 1, x = λ(x 1, x + c 1 λ c (x 1, x + c λ c ( x, x 1, (3.3 where c 1 = ( 39 + 15 t (59 1756 7, c = ( 65 + 8 t + (681 46 7. 19

In this case, the relation (3. will always hold and the corresponding line search function at the first iteration is 6 ˆΨ 1 (α = ρ i α i, (3.4 where i=0 ρ 6 = ( 3186 + 33 t + (3414 398, ρ 5 = (4034740 805171 t (44367385 31014036 966, ρ 4 = ( 18674096 + 1804379 t + (0961677 1455090 4633, ρ 3 = (1350618 8366705 t (13736673 9508816 966, ρ = ( 1198 + 36673 t + (4058 176758 966, ρ 1 = g T 1 δ 1 = ( 44 + 13 t + (40 18, ρ 0 = f(x 1 = (3954 4376 t + (18866 9563 4633. To sum up at this stage, for the function (3.11 with λ(x 1, x given in (3.1 or (3.3 and µ(x 1, x = λ( x, x 1, if the initial point is x 1 = (a 1, b 1, p 1, q 1 T (see (3.4, (3.8, (.47 for their values and if the initial matrix is H 1 given by (.58 and (.59, then the BFGS method (1. and (1.5 with α k 1 will generate the iterations in (3.5 whose gradients are given by (.4-(.5. Therefore the method will asymptotically cycle around the eight vertices of a regular octagon without approaching a stationary point or pushing f(x k. 3.4 A Complicated Function that Meets (a, (b and (c To meet the requirement (b, this subsection gives a further compensation to the function ˆλ(x 1, x in (3.3 such that each line search function Ψ k (α is strongly convex. To this aim, we consider the straight line connecting x 8j+i and x 8j+i+1 : a i + α η i l i (α = x 8j+i + α δ 8j+i = b i + α ξ i t 8j (p i + α γ i. (3.5 t 8j (q i + α τ i By (3.1, the line search function at the (8j + i-th iteration is Ψ 8j+i (α = t 8j[ λ(a i +α η i, b i +α ξ i ( p i +α γ i +µ(ai +α η i, b i +α ξ i ( q i +α τ i ]. (3.6 0

To be such that each line search function is a strongly convex function that has the unique minimizer α = 1, we firstly construct Ψ 1 (α as follows where Ψ 1 (α = ζ 1 (α 1 38 + ζ (α 1 + ζ 3, (3.7 ζ 1 = (17356 3817 t + ( 138064 + 48586 166788 ζ = (37747 359709 t + ( 71544 + 577958 166788 ζ 3 = ( 11344 + 6675 t + (4494 6967 4633 1.5468E 1, 1.0405E 3, 6.170E 1. Due to our special construction, we know from (3.3 and t < 1 that the third and fourth components of {x k } tend to zero. This with the general function form (3.11 implies that the limit of f(x k is f = 0. So the above Ψ 1 is constructed such that Ψ 1 (0 = f(x 1, Ψ 1 (1 = f(x, Ψ 1(0 = g T 1 δ 1, Ψ 1(1 = g T δ 1 = 0, (3.8 where the values of f(x 1, f(x and g T 1 δ 1 are given in 3.1 and by f = 0. In addition, the positiveness of ζ i s implies that Ψ 1 is strongly convex. Thus we see that Ψ 1 is a desired strongly convex function that takes α = 1 as the unique minimizer. A reason why we choose a polynomial (3.7 of degree 38 will be explained in the last section. To utilize the function ˆλ(x 1, x in (3.3, we firstly develop the following relation between (3.7 and (3.4, Ψ 1 (α = ˆΨ 1 (α + where σ i = 0 (i = 39,..., 35 and 4 [ ] 4i+ 7 α (α 1 σ 8i+j α j, (3.9 i=0 σ 34 = ζ 1, σ 33 = 0ζ 1, σ 3 = 190ζ 1, σ 31 = 1140ζ 1, σ 30 = 9405ζ 1, σ 9 = 4174ζ 1, σ 8 = 134406ζ 1, σ 7 = 346104ζ 1, σ 6 = 735471ζ 1, σ 5 = 1307504ζ 1, σ 4 = 196156ζ 1, σ 3 = 496144ζ 1, σ = 168873ζ 1, σ 1 = 88963ζ 1, σ 0 = 38155344ζ 1, σ 19 = 3744160ζ 1, σ 18 = 3041755ζ 1, σ 17 = 1474180ζ 1, σ 16 = 1313110ζ 1 ; σ 15 = 6906900ζ 1, σ 14 = 30735705ζ 1, σ 13 = 55057860ζ 1, σ 1 = 51389130ζ 1, σ 11 = 8048800ζ 1, σ 10 = 10518300ζ 1, σ 9 = 3365856ζ 1, σ 8 = 90619ζ 1, σ 7 = 01376ζ 1, σ 6 = 841464ζ 1, σ 5 = 1357056ζ 1, σ 4 = 1041600ζ 1, σ 3 = 37699ζ 1, σ = 58905ζ 1 ρ 6, σ 1 = 7140ζ 1 ρ 5 + 4ρ 6, σ 0 = 630ζ 1 ρ 4 + 3ρ 5 6ρ 6. j=0 1
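To verify the claimed properties of (3.27) (a short check, assuming ζ_1, ζ_2 > 0 as stated above), differentiate:

    Ψ_1'(α) = 38 ζ_1 (α−1)^{37} + 2 ζ_2 (α−1) = (α−1)[38 ζ_1 (α−1)^{36} + 2 ζ_2],
    Ψ_1''(α) = 38·37 ζ_1 (α−1)^{36} + 2 ζ_2 ≥ 2 ζ_2 > 0.

Hence Ψ_1 is strongly convex on the whole real line, α = 1 is its unique stationary point, and therefore α = 1 is its unique minimizer.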

Secondly, noting that the following relations hold for all i 1, ( ( ( ( ai+1 + α η i+1 ai + α η = R i pi+1 + α γ b i+1 + α ξ 1, i+1 pi + α γ = t R i i+1 b i + α ξ i q i+1 + α τ, i+1 q i + α τ i we look for some element functions φ(x 1, x that satisfy for all (x 1, x R, ( ( ( φ( x1, x φ(x1, x = R ( x1 x1 φ( x, x 1, where = R φ( x, x 1 x 1. (3.30 x There are a lot of possibilities for the choice of φ(x 1, x. The following are some of them required in this paper. φ 1 (x 1, x = x 3 1 3x 1 x, φ 3 (x 1, x = x 3 1x x 1 x 4, φ 5 (x 1, x = x 5 1 5x 3 1x, φ 7 (x 1, x = x 5 1x x 1 x 6. (3.31 We take φ 1 as an illustrative example. In fact, for any (x 1, x R, ( x 1, x T = R 1 (x 1, x T means that x 1 = (x 1 x, x = (x 1 + x. Therefore we have that ( φ( x1, x ( x1 ( x = 1 3 x φ( x, x 1 ( x (3 x 1 x = 4 (x 1 x ((x 1 x 3(x 1 + x ( 4 (x 1 + x (3(x 1 x (x 1 + x = (x 1 x (x 1 + 4x 1 x + x ( (x 1 + x (x 1 4x 1 x + x = [(x3 1 3x 1 x + (3x 1x x 3 ] ( [(x3 1 3x 1 x (3x 1x x 3 ] = [φ(x 1, x + φ( x, x 1 ] ( [φ(x 1, x φ( x, x 1 ] φ(x1, x = R. φ( x, x 1 So φ 1 (x 1, x is a desired element function. Similarly, we can check that the other φ i s in (3.31 have the same property. In addition, it is easy to see that if φ(x 1, x has the property (3.30, then φ( x, x 1 keeps the same property. So we also consider the following element function φ i (x 1, x = φ i 1 ( x, x 1, for i = 1,, 3. Thirdly, we consider the linear combination of φ i (i = 1,..., 8 and define λ φ (x 1, x = 8 ν i φ i (x 1, x. (3.3 i=1

Meanwhile, we define µ φ (x 1, x = λ φ ( x, x 1. For any vector κ = (κ 0,... κ 7 T in R 8, we are going to claim that there exists a unique solution of ν := (ν 1,..., ν 8 T such that the related functions λ φ and µ φ satisfy λ φ (a 1 +α η 1, b 1 +α ξ 1 ( p 1 +α γ 1 +µφ (a 1 +α η 1, b 1 +α ξ 1 ( q 1 +α τ 1 = 7 κ i α i. (3.33 In fact, it is easy to see that the above relation leads to a linear system of ν: Direct calculations show that i=0 W ν = κ. (3.34 W = ( W 1 + W t + (W3 + W 4, (3.35 where W i (i = 1,..., 4 are given in the Appendix. By further calculations, we know that where Det(W = ( c 3 + c 4 t + ( c5 + c 6 0, c 3 = 7578451158319133441879149391071037800448, c 4 = 193544036546015387013404881491369775784755, c 5 = 17095099669563783757699156790438736813955, c 6 = 154494660836167330116146186519704016816. Hence W is nonsingular and (3.34 is a nonsingular linear system. Therefore for any vector κ R 8, there exists a unique ν or λ φ such that the relation (3.33 holds. For convenience, we denote such λ φ by λ φ,κ. Denoting the following vectors related to the coefficients {σ i ; i = 0,..., 39} in (3.9, κ i = (σ 8i,..., σ 8i+7 T, i = 0,..., 4, and noticing that for any point x = (x 1, x, x 3, x 4 T in the line l i (α (i any, x 1 + x ( + = α (α 1, (3.36 we can finally present the desired function of λ: λ(x 1, x = ˆλ(x 1, x + [ 4 x 1 + x ( + ] 4i+ λ φ,κ i (x 1, x. (3.37 i=0 Thus by the choice of ˆλ(x 1, x in (3.3, the relation (3.36, the definitions of κ i and λ φ,κ i and the relation (3.9, we know that the function (3.11 with 3

λ(x_1, x_2) given in (3.37) and µ(x_1, x_2) = λ(−x_2, x_1) not only satisfies those necessary interpolation conditions but also gives the line search functions

    Ψ_{i+1}(α) = t^i Ψ_1(α),  for any i ≥ 0,    (3.38)

where Ψ_1(α) is given by (3.27). Since the polynomial function λ(x_1, x_2) in (3.37) is of degree 43, we see that the final objective function in (3.11) is a polynomial of degree 44 and hence is infinitely continuously differentiable. The line search functions, which are given by (3.38) and (3.27), are strongly convex and have the unit stepsize as their unique minimizer. The requirement (b) is then satisfied.

In summary, for the function (3.11) with λ(x_1, x_2) given in (3.37) and µ(x_1, x_2) = λ(−x_2, x_1), if the initial point is x_1 = (a_1, b_1, p_1, q_1)^T (see (3.4), (3.8), (2.47) for their values) and if the initial matrix is H_1 given by (2.58) and (2.59), then the BFGS method (1.2) and (1.5) with α_k ≡ 1 will generate the iterations in (3.5), whose gradients are given by (2.4)-(2.5). Therefore the method will asymptotically cycle around the eight vertices of a regular octagon without approaching a stationary point or pushing f(x_k) to −∞. This counter-example is perfect in the sense that all the requirements (a), (b) and (c) are satisfied.

4 Concluding Remarks

For both the simple example in 3.3 and the complicated example in 3.4, the BFGS method with unit stepsizes produces the same iterations {x_k}, and simultaneously their objective functions provide the same gradients {g_k}. As analyzed in 3.1, the unit stepsize is acceptable by the Wolfe line search, the Armijo line search or the Goldstein line search. If we do not impose convexity on the line search functions Ψ_k(α), it is possible to construct the relatively simple polynomial example shown in 3.3.

The examples have dimension 4, but they can be used to show that the BFGS method may also fail for general functions when n ≥ 5, since four-dimensional functions can be regarded as special cases of higher-dimensional functions. In the case that n ≥ 5, we can just take the same objective function and ask the starting point x_1 to have zero components except for the first four. Then for all k, the first four components of x_k remain the same as in the example and its components from the fifth onward are always zero.

The counter-example in 3.4 is perfect in its stepsize choices and function properties. However, it is still not perfect in the sense that the objective function is very complicated, involves large numbers, and some coefficients are expressed through nonsingular linear systems. Here we provide the reason why a polynomial (3.27) of degree 38 is used to guarantee the strong convexity of Ψ_1(α). Equivalently, the problem of finding a strictly convex polynomial that satisfies (3.28) (in this case, Ψ_1''(α) is nonnegative for all α ∈ R) can be transformed into the feasibility problem of some semi-definite program (SDP). This is because, by Shor [25], if n = 1, a polynomial is nonnegative if and only if it can be written as a sum of squares (s.o.s.); further, by Lasserre [12], for a polynomial in any dimension,

it has an s.o.s. representation if and only if a certain coefficient matrix, formed when lifting the outer product of the vector of monomials with its transpose into a matrix variable, is positive semi-definite. The other interpolation conditions can be treated as linear constraints. However, even when we chose relatively high values for the degree of the desired polynomial, our numerical calculations showed that the corresponding SDP feasibility problems have no solution. This is unexpected, since there are many degrees of freedom in these polynomials while there are only four interpolation conditions in (3.28). Instead, we considered the following simple form for Ψ_1(α) with variable but even p:

    Ψ_1(α) = c_1 (α−1)^p + c_2 (α−1)^2 + ζ_3,    (4.1)

where c_1 and c_2 are parameters and ζ_3 is the same constant as in (3.27). With this form, it can be deduced from (3.28), f* = 0 and the calculations in 3.1 that

    1/p ≤ (Ψ_1(1) − Ψ_1(0))/Ψ_1'(0) = (f(x_2) − f(x_1))/(g_1^T δ_1) ≈ 2.6483E−2.    (4.2)

Since the reciprocal of 2.6483E−2 is about 37.76, we choose p to be the least even integer meeting the condition (4.2), that is, 38.

In spite of the presence of large numbers, we have observed the cycle exactly as predicted by the simple example in 3.3 (with the function ˆλ(x_1, x_2) given by (3.23)), thanks to the powerful symbolic computation software MAPLE. With such software, it is also possible to observe how numerical errors affect the example. When the machine precision is set to 10^{−64} (this is possible in MAPLE), we found that the cycle can go on for ten rounds, and during these rounds the iterations really do tend to the eight vertices of the octagon in the predicted way. Influenced by the numerical errors, however, the iterations jump out of the cycle during the eleventh round.

For quasi-Newton methods, there is global convergence if the norms of the matrix H_k and its inverse are uniformly bounded for all k, in which case the angle between the quasi-Newton direction and the negative gradient must be uniformly less than some angle strictly less than π/2. For our examples, with the help of (2.22) and (2.57), we can see that the matrices {H_{8j+1}; j = 1, 2, ...} satisfy the relation (2.24). Consequently, we have for all j ≥ 1,

    H_{8j+1} = diag(t^{−4j}E, t^{4j}E) H_1 diag(t^{−4j}E, t^{4j}E).    (4.3)

By comparing H_{8j+1} with the right-hand side of (4.3), but with the middle matrix H_1 replaced by λ_min(H_1) E_4 and λ_max(H_1) E_4, respectively, we can show that

    λ_max(H_{8j+1}) ≥ t^{−8j} λ_min(H_1),   λ_min(H_{8j+1}) ≤ t^{8j} λ_max(H_1),    (4.4)

where λ_max(·) and λ_min(·) denote the largest and smallest eigenvalues of a matrix. Since the decay parameter t lies in the interval (0, 1), the relation (4.4) implies that both ‖H_{8j+1}‖ and ‖H_{8j+1}^{−1}‖ tend to infinity at an exponential rate. This analysis is also valid for {H_{8j+i}; j = 1, 2, ...} with the other i's.

The number of cyclic points in the above example(s) is eight due to the special choices of θ_1 and θ_2 in (2.3). This number could be decreased to seven since, if θ_1 = 2π/7 and θ_2 = 4π/7, the consistency conditions (2.33)-(2.36) also allow a nonzero solution of t, whose numerical value is approximately 0.864; however, it is difficult to obtain its analytic expression. In addition, we found that the system (2.33)-(2.36) has no solution of t in (−1, 1) for θ_1 = 2iπ/6, θ_2 = 2jπ/6, where i, j are any integers in [1, 6], which implies that the number of cyclic points cannot be decreased to six.

As mentioned in 1, if the stepsize is chosen to be the first local minimizer along the line; namely, by (1.9), Powell [22] established the global convergence of the BFGS method for general differentiable functions when n = 2. We argued there that it is difficult to extend Powell's result to the case that n = 3. Considerable attention has also been paid to the construction of a three-dimensional example (see also Section 5 of Powell [22]). This construction is almost successful, but one condition always remains unsatisfied. It is not yet known whether the BFGS method with the specific line search (1.9) converges for three-dimensional general differentiable functions. Another direction of research is how to present a suitable modification of the BFGS algorithm with which global convergence can be established for general nonconvex functions. A typical work of this kind is Li and Fukushima [13].

Acknowledgements. This work was initiated while the author was visiting Cambridge from July to October in 1997. He is very grateful to Professor Michael Powell and Professor Ya-xiang Yuan for their financial support of his visit and their consistent encouragement over many years. He also thanks the two anonymous referees very much for their useful suggestions and comments on this paper.

References

[1] R. H. Byrd, J. Nocedal, and Y. X. Yuan, Global convergence of a class of quasi-Newton methods on convex problems, SIAM J. Numer. Anal., 24 (1987), pp. 1171–1190.

[2] C. G. Broyden, The convergence of a class of double rank minimization algorithms: 2. The new algorithm, J. Inst. Math. Appl., 6 (1970), pp. 222–231.

[3] Y. H. Dai, Convergence properties of the BFGS algorithm, SIAM Journal on Optimization, 13:3 (2002), pp. 693–701.

[4] W. C. Davidon, Variable metric method for minimization, SIAM J. Optim., 1 (1991), pp. 1–17.

[5] J. E. Dennis and J. J. Moré, Quasi-Newton methods, motivation and theory, SIAM Review, 19 (1977), pp. 46–89.

[6] L. C. W. Dixon, Variable metric algorithms: Necessary and sufficient conditions for identical behavior of nonquadratic functions, J. Optim. Theory Appl., 10 (1972), pp. 34–40.

[7] R. Fletcher, A new approach to variable metric algorithms, Computer J., 13 (1970), pp. 317–322.

[8] R. Fletcher, An Overview of Unconstrained Optimization, in Algorithms for Continuous Optimization: The State of the Art, E. Spedicato, ed., Kluwer Academic Publishers, Dordrecht/Boston/London, 1994, pp. 109–143.

[9] R. Fletcher and M. J. D. Powell, A rapidly convergent descent method for minimization, Comput. J., 6 (1963), pp. 163–168.

[10] D. Goldfarb, A family of variable metric methods derived by variational means, Math. Comp., 24 (1970), pp. 23–26.

[11] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Standards, 49 (1952), pp. 409–436.

[12] J. B. Lasserre, Global optimization with polynomials and the problem of moments, SIAM Journal on Optimization, 11:3 (2001), pp. 796–817.

[13] D. H. Li and M. Fukushima, On the global convergence of the BFGS method for nonconvex unconstrained optimization problems, SIAM Journal on Optimization, 11 (2001), pp. 1054–1064.

[14] W. F. Mascarenhas, The BFGS algorithm with exact line searches fails for nonconvex functions, Mathematical Programming, 99:1 (2004), pp. 49–61.

[15] J. Nocedal, Theory of algorithms for unconstrained optimization, Acta Numer., 1 (1992), pp. 199–242.

[16] E. Polak and G. Ribière, Note sur la convergence de méthodes de directions conjuguées, Rev. Française Informat. Recherche Opérationnelle, 16 (1969), pp. 35–43.

[17] B. T. Polyak, The conjugate gradient method in extremal problems, USSR Comp. Math. and Math. Phys., 9 (1969), pp. 94–112.

[18] M. J. D. Powell, On the convergence of the variable metric algorithm, J. Inst. Math. Appl., 7 (1971), pp. 21–36.

[19] M. J. D. Powell, Quadratic termination properties of minimization algorithms, Part I and Part II, J. Inst. Math. Appl., 10 (1972), pp. 333–357.

[20] M. J. D. Powell, Some global convergence properties of a variable metric algorithm for minimization without exact line searches, in Nonlinear Programming, SIAM-AMS Proceedings Vol. IX, R. W. Cottle and C. E. Lemke, eds., SIAM, Philadelphia, PA, 1976, pp. 53–72.

[21] M. J. D. Powell, Nonconvex minimization calculations and the conjugate gradient method, in Numerical Analysis, D. F. Griffiths, ed., Lecture Notes in Math. 1066, Springer-Verlag, Berlin, 1984, pp. 122–141.

[22] M. J. D. Powell, On the convergence of the DFP algorithm for unconstrained optimization when there are only two variables, Math. Program. Ser. B, 87 (2000), pp. 281–301.

[23] D. Pu and W. Yu, On the convergence property of the DFP algorithm, Ann. Oper. Res., 24 (1990), pp. 175–184.

[24] D. F. Shanno, Conditioning of quasi-Newton methods for function minimization, Math. Comp., 24 (1970), pp. 647–656.

[25] N. Z. Shor, Quadratic optimization problems, Soviet J. Comput. Systems Sci., 25 (1987), pp. 1–11.

[26] W. Y. Sun and Y. X. Yuan, Optimization Theory and Methods: Nonlinear Programming, Springer, New York, 2006.

[27] P. Wolfe, Convergence conditions for ascent methods, SIAM Rev., 11 (1969), pp. 226–235.

[28] Y. X. Yuan, Numerical Methods for Nonlinear Programming, Shanghai Scientific and Technical Publishers, Shanghai, 1993 (in Chinese).

Appendix. The matrices W_i (i = 1, ..., 4) in the relation (3.35) are given below, in order.

W_1:
6168 15364 50467 19383 50467 19383 170784 69850
3573 7844 164094 0386 41766 356414 617846 774894
993360 196008 4160 01438 561400 115390 14901 1366496
91688 0 3698860 61345 114596 463300 86949 653360
9651 333576 407704 890484 56990 316500 168300 3018560
0 0 9651 333576 1546864 1667880 114176 371088
0 0 0 0 59304 66715 140843 44810
0 0 0 0 0 0 59304 66715

W_2:
38766 340 3495 1554 3495 1554 10317 50467
6464 10164 9436 165678 4946 38058 4668 56486
9430 101688 14415 60041 1444145 98375 5696 1083960
56344 500364 488404 11705 10100 64860 1191110 388
315044 315044 13118 0678 1009350 88070 183770 1018070
0 0 1853 1853 1700980 15750 1143596 1018308
0 0 0 0 630088 630088 463300 45068
0 0 0 0 0 0 37064 37064

W_3:
66580 806 48866.5 955.5 48866.5 955.5 16176 419
305 8838 149974 138071 193910 377473 543381 565187
107146 156966 300535 8467 135353 905635 75885 1149501
1085404 7418 43774 654056 379690 74180 70070 78668
333576 333576 444768 84594 3017860 409160 49750 916010
0 0 9651 9651 194844 1704944 1357948 355590
0 0 0 0 66715 66715 148560 8487
0 0 0 0 0 0 59304 59304

W_4:
31153 16895 31.5 746 31.5 746 113309.5 536.5
49886 14014 7850 147478 4906 43454 360117 438519
80980 18 169365 3048 980815 76350 1010 1044365
63088 444768 554550 465018 1070 509630 1166840 8060
315044 315044 194586 93378 1538890 88070 113170 44550
0 0 1853 1853 1838668 15750 1059760 467608
0 0 0 0 630088 630088 4636 14969
0 0 0 0 0 0 37064 37064