Newton's Method for Unconstrained Optimization


Robert M. Freund
February, 2004
© 2004 Massachusetts Institute of Technology

1 Newton's Method

Suppose we want to solve:

(P:)   min f(x),   x ∈ Rⁿ.

At x = x̄, f(x) can be approximated by:

f(x) ≈ h(x) := f(x̄) + ∇f(x̄)ᵀ(x − x̄) + (1/2)(x − x̄)ᵀ H(x̄)(x − x̄),

which is the quadratic Taylor expansion of f(x) at x = x̄. Here ∇f(x) is the gradient of f(x) and H(x) is the Hessian of f(x).

Notice that h(x) is a quadratic function, which is minimized by solving ∇h(x) = 0. Since the gradient of h(x) is:

∇h(x) = ∇f(x̄) + H(x̄)(x − x̄),

we therefore are motivated to solve:

∇f(x̄) + H(x̄)(x − x̄) = 0,

which yields

x − x̄ = −H(x̄)⁻¹∇f(x̄).

The direction −H(x̄)⁻¹∇f(x̄) is called the Newton direction, or the Newton step, at x = x̄. This leads to the following algorithm for solving (P):

Newton's Method:

Step 0: Given x^0, set k ← 0.

Step 1: d^k = −H(x^k)⁻¹∇f(x^k). If d^k = 0, then stop.

Step 2: Choose step-size α^k = 1.

Step 3: Set x^{k+1} ← x^k + α^k d^k, k ← k + 1. Go to Step 1.

Note the following:

- The method assumes H(x^k) is nonsingular at each iteration.

- There is no guarantee that f(x^{k+1}) ≤ f(x^k).

- Step 2 could be augmented by a line-search of f(x^k + α d^k) to find an optimal value of the step-size parameter α.

Recall that we call a matrix SPD if it is symmetric and positive definite.

Proposition 1.1 If H(x) is SPD and d := −H(x)⁻¹∇f(x) ≠ 0, then d is a descent direction, i.e., f(x + αd) < f(x) for all sufficiently small values of α.

Proof: It is sufficient to show that ∇f(x)ᵀd = −∇f(x)ᵀH(x)⁻¹∇f(x) < 0. This will clearly be the case if H(x)⁻¹ is SPD. Since H(x) is SPD, if v ≠ 0, then

0 < (H(x)⁻¹v)ᵀ H(x) (H(x)⁻¹v) = vᵀ H(x)⁻¹ v,

thereby showing that H(x)⁻¹ is SPD.
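The iteration in Steps 0 through 3 is short enough to sketch in code. The following Python fragment is illustrative only and is not part of the original notes; the names newtons_method, grad, hess, x0, tol, and max_iter are made up, and the sketch assumes, as the method itself does, that H(x^k) is nonsingular at each iterate and that no line-search is performed.

```python
# A minimal sketch of Newton's method (Steps 0-3): full step alpha^k = 1, no line-search.
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-12, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -np.linalg.solve(hess(x), grad(x))   # Newton direction d^k = -H(x^k)^{-1} grad f(x^k)
        if np.linalg.norm(d) <= tol:             # Step 1: stop when the Newton step (numerically) vanishes
            return x
        x = x + d                                # Steps 2-3: x^{k+1} = x^k + alpha^k d^k with alpha^k = 1
    return x
```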

1.1 Rates of convergence

A sequence of numbers {s_i} exhibits linear convergence if lim_{i→∞} s_i = s̄ and

lim_{i→∞} |s_{i+1} − s̄| / |s_i − s̄| = δ < 1.

If δ = 0 in the above expression, the sequence exhibits superlinear convergence. A sequence of numbers {s_i} exhibits quadratic convergence if lim_{i→∞} s_i = s̄ and

lim_{i→∞} |s_{i+1} − s̄| / |s_i − s̄|² = δ < ∞.

1.1.1 Examples of Rates of Convergence

Linear convergence: s_i = (1/10)^i : 0.1, 0.01, 0.001, etc. Here s̄ = 0 and

|s_{i+1} − s̄| / |s_i − s̄| = 0.1.

Superlinear convergence: s_i = 1/i! : 1, 1/2, 1/6, 1/24, 1/120, etc. Here s̄ = 0 and

|s_{i+1} − s̄| / |s_i − s̄| = i!/(i + 1)! = 1/(i + 1) → 0 as i → ∞.

Quadratic convergence: s_i = (1/10)^(2^i) : 0.1, 0.01, 0.0001, 0.00000001, etc. Here s̄ = 0 and

|s_{i+1} − s̄| / |s_i − s̄|² = (10^(2^i))² / 10^(2^(i+1)) = 1.

1.2 Quadratic Convergence of Newton's Method

We have the following quadratic convergence theorem. In the theorem, we use the operator norm of a matrix M:

‖M‖ := max { ‖Mx‖ : ‖x‖ = 1 }.

Theorem 1.1 (Quadratic Convergence Theorem) Suppose f(x) is twice continuously differentiable and x* is a point for which ∇f(x*) = 0. Suppose that H(x) satisfies the following conditions:

- there exists a scalar h > 0 for which ‖[H(x*)]⁻¹‖ ≤ 1/h

- there exist scalars β > 0 and L > 0 for which ‖H(x) − H(x*)‖ ≤ L‖x − x*‖ for all x satisfying ‖x − x*‖ ≤ β

Let x satisfy ‖x − x*‖ < γ := min{ β, 2h/(3L) }, and let x_N := x − H(x)⁻¹∇f(x). Then:

(i) ‖x_N − x*‖ ≤ ‖x − x*‖² · L / (2(h − L‖x − x*‖))

(ii) ‖x_N − x*‖ < ‖x − x*‖ < γ

(iii) ‖x_N − x*‖ ≤ (3L/(2h)) ‖x − x*‖²

Example 1: Let f(x) = 7x − ln(x). Then ∇f(x) = f′(x) = 7 − 1/x and H(x) = f″(x) = 1/x². It is not hard to check that x* = 1/7 = 0.142857143 is the unique global minimum. The Newton direction at x is

d = −H(x)⁻¹∇f(x) = −f′(x)/f″(x) = −x²(7 − 1/x) = x − 7x².

Newton's method will generate the sequence of iterates {x^k} satisfying:

x^{k+1} = x^k + (x^k − 7(x^k)²) = 2x^k − 7(x^k)².

Below are some examples of the sequences generated by this method for different starting points.

k     x^k             x^k             x^k
0     1.0             0.1             0.01
1     −5.0            0.13            0.0193
2     −185.0          0.1417          0.03599257
3     −239,945.0      0.14284777      0.062916884
4     −4.0302E+11     0.142857142     0.098124028
5     −1.1370E+24     0.142857143     0.128849782
6     −9.0486E+48     0.142857143     0.1414837
7     −5.7314E+98     0.142857143     0.142843938
8                     0.142857143     0.142857142
9                     0.142857143     0.142857143
10                    0.142857143     0.142857143

By the way, the range of quadratic convergence for Newton's method for this function happens to be x ∈ (0.0, 0.2857143).
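As a quick check, the iteration above is easy to run. The following short sketch is not part of the original notes (the helper name and print format are made up); it applies the update x^{k+1} = 2x^k − 7(x^k)² derived above and shows the same behavior as the table: the run started at 1.0 diverges (overflowing to −inf), while the runs started inside (0, 0.2857143) converge to 1/7.

```python
# A quick check of Example 1: for f(x) = 7x - ln(x) the Newton update reduces to
# x_{k+1} = 2*x_k - 7*x_k**2. The helper name and print format are illustrative.
def newton_example1(x0, iters=10):
    x = x0
    for _ in range(iters):
        x = 2.0 * x - 7.0 * x * x
    return x

for x0 in (1.0, 0.1, 0.01):
    # x0 = 1.0 diverges (it overflows to -inf); the other two converge to 1/7.
    print(f"x0 = {x0:<4} -> x10 = {newton_example1(x0)}")
```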

Example 2: f(x) = −ln(1 − x₁ − x₂) − ln x₁ − ln x₂.

∇f(x) = ( 1/(1 − x₁ − x₂) − 1/x₁ ,  1/(1 − x₁ − x₂) − 1/x₂ )ᵀ,

H(x) =
[ 1/(1 − x₁ − x₂)² + 1/x₁²      1/(1 − x₁ − x₂)²           ]
[ 1/(1 − x₁ − x₂)²              1/(1 − x₁ − x₂)² + 1/x₂²   ]

x* = (1/3, 1/3)ᵀ,  f(x*) = 3.295836866.

k    x₁^k        x₂^k        ‖x^k − x*‖
0    0.85        0.05        0.589256
1    0.717007    0.096599    0.450831
2    0.512975    0.176480    0.238483
3    0.352479    0.273249    0.063061
4    0.338449    0.326238    0.008747
5    0.333338    0.333259    7.4e−05
6    0.333333    0.333333    1.2e−08
7    0.333333    0.333333    1.6e−16
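A short illustrative script (not from the original notes; the helper names grad and hess are made up) reproduces the behavior in this table using the gradient and Hessian written out above. Each full Newton step roughly squares the distance to x*.

```python
# Sketch of Newton's method on Example 2, f(x) = -ln(1 - x1 - x2) - ln(x1) - ln(x2),
# started at (0.85, 0.05), using the gradient and Hessian written out above.
import numpy as np

def grad(x):
    s = 1.0 - x[0] - x[1]
    return np.array([1.0 / s - 1.0 / x[0], 1.0 / s - 1.0 / x[1]])

def hess(x):
    s2 = (1.0 - x[0] - x[1]) ** -2
    return np.array([[s2 + x[0] ** -2, s2],
                     [s2, s2 + x[1] ** -2]])

x_star = np.array([1.0, 1.0]) / 3.0
x = np.array([0.85, 0.05])
for k in range(8):
    print(k, x, np.linalg.norm(x - x_star))    # distance to x* shrinks quadratically
    x = x - np.linalg.solve(hess(x), grad(x))  # full Newton step, alpha = 1
```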

Comments:

- The convergence rate is quadratic: ‖x_N − x*‖ ≤ (3L/(2h)) ‖x − x*‖².

- We typically never know β, h, or L. However, there are some amazing exceptions, for example f(x) = −Σ_{j=1}^n ln(x_j), as we will soon see.

- The constants β, h, and L depend on the choice of norm used, but the method does not. This is a drawback in the concept. But we do not know β, h, or L anyway.

- In the limit we obtain ‖x_N − x*‖ ≤ (L/(2h)) ‖x − x*‖².

- We did not assume convexity, only that H(x*) is nonsingular and not badly behaved near x*.

- One can view Newton's method as trying successively to solve ∇f(x) = 0 by successive linear approximations.

- Note from the statement of the convergence theorem that the iterates of Newton's method are equally attracted to local minima and local maxima. Indeed, the method is just trying to solve ∇f(x) = 0.

- What if H(x^k) becomes increasingly singular (or not positive definite)? In this case, one way to fix this is to use

  H(x^k) + εI.    (1)

- Comparison with steepest descent. One can think of steepest descent as ε → +∞ in (1) above (see the sketch following these comments).

- The work per iteration of Newton's method is O(n³).

- So-called quasi-Newton methods use approximations of H(x^k) at each iteration in an attempt to do less work per iteration.
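Here is a small illustrative sketch of the modification in (1). It is not from the original notes; the sample gradient, Hessian, and ε values are made up. With ε = 0 and an indefinite Hessian the Newton step need not be a descent direction, while for large ε the direction approaches a scaled steepest-descent step, roughly −∇f(x^k)/ε.

```python
# Sketch of the regularized Newton direction d = -(H + eps*I)^{-1} grad f.
import numpy as np

def modified_newton_direction(g, H, eps):
    return -np.linalg.solve(H + eps * np.eye(len(g)), g)

g = np.array([1.0, 2.0])                 # stand-in for grad f(x^k)
H = np.array([[2.0, 0.0], [0.0, -1.0]])  # indefinite stand-in for H(x^k)
for eps in (0.0, 2.0, 1e3, 1e6):
    d = modified_newton_direction(g, H, eps)
    print(eps, d, float(g @ d))          # g.d < 0 means d is a descent direction
```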

2 Proof of Theorem 1.1

The proof relies on the following two elementary facts. For the first fact, let ‖v‖ denote the usual Euclidean norm of a vector, namely ‖v‖ := √(vᵀv). The operator norm of a matrix M is defined as follows:

‖M‖ := max { ‖Mx‖ : ‖x‖ = 1 }.

Proposition 2.1 Suppose that M is a symmetric matrix. Then the following are equivalent:

1. h > 0 satisfies ‖M⁻¹‖ ≤ 1/h

2. h > 0 satisfies ‖Mv‖ ≥ h‖v‖ for any vector v

You are asked to prove this as part of your homework for the class.

Proposition 2.2 Suppose that f(x) is twice differentiable. Then

∇f(z) − ∇f(x) = ∫₀¹ [H(x + t(z − x))] (z − x) dt.

Proof: Let φ(t) := ∇f(x + t(z − x)). Then φ(0) = ∇f(x) and φ(1) = ∇f(z), and φ′(t) = [H(x + t(z − x))] (z − x). From the fundamental theorem of calculus, we have:

∇f(z) − ∇f(x) = φ(1) − φ(0) = ∫₀¹ φ′(t) dt = ∫₀¹ [H(x + t(z − x))] (z − x) dt.

Proof of Theorem 1.1: We have:

x_N − x* = x − H(x)⁻¹∇f(x) − x*
         = x − x* + H(x)⁻¹ (∇f(x*) − ∇f(x))
         = x − x* + H(x)⁻¹ ∫₀¹ [H(x + t(x* − x))] (x* − x) dt   (from Proposition 2.2)
         = H(x)⁻¹ ∫₀¹ [H(x + t(x* − x)) − H(x)] (x* − x) dt.

Therefore

‖x_N − x*‖ ≤ ‖H(x)⁻¹‖ ∫₀¹ ‖[H(x + t(x* − x)) − H(x)] (x* − x)‖ dt
           ≤ ‖x − x*‖ ‖H(x)⁻¹‖ ∫₀¹ L ‖t(x* − x)‖ dt
           = ‖x − x*‖² ‖H(x)⁻¹‖ L ∫₀¹ t dt
           = ‖x − x*‖² ‖H(x)⁻¹‖ (L/2).

We now bound ‖H(x)⁻¹‖. Let v be any vector. Then

‖H(x)v‖ = ‖H(x*)v + (H(x) − H(x*))v‖
        ≥ ‖H(x*)v‖ − ‖(H(x) − H(x*))v‖
        ≥ h‖v‖ − ‖H(x) − H(x*)‖ ‖v‖   (from Proposition 2.1)
        ≥ h‖v‖ − L‖x − x*‖ ‖v‖
        = (h − L‖x − x*‖) ‖v‖.

Invoking Proposition 2.1 again, we see that this implies that

‖H(x)⁻¹‖ ≤ 1 / (h − L‖x − x*‖).

Combining this with the above yields

‖x_N − x*‖ ≤ ‖x − x*‖² L / (2(h − L‖x − x*‖)),

which is (i) of the theorem. Because ‖x − x*‖ < γ ≤ 2h/(3L), we have L‖x − x*‖ < 2h/3, and so:

‖x_N − x*‖ ≤ ‖x − x*‖ · L‖x − x*‖ / (2(h − L‖x − x*‖)) < ‖x − x*‖ · (2h/3) / (2(h − 2h/3)) = ‖x − x*‖,

which establishes (ii) of the theorem. Finally, we have

‖x_N − x*‖ ≤ ‖x − x*‖² L / (2(h − L‖x − x*‖)) ≤ ‖x − x*‖² L / (2(h − 2h/3)) = (3L/(2h)) ‖x − x*‖²,

which establishes (iii) of the theorem.

3 Newton's Method Exercises

1. (Newton's Method) Suppose we want to minimize the following function:

f(x) = 9x − 4 ln(x − 7)

over the domain X = {x | x > 7} using Newton's method.

(a) Give an exact formula for the Newton iterate for a given value of x.

(b) Using a calculator (or a computer, if you wish), compute five iterations of Newton's method starting at each of the following points, and record your answers:

x = 7.4, x = 7.2, x = 7.1, x = 7.8, x = 7.88

(c) Verify empirically that Newton's method will converge to the optimal solution for all starting values of x in the range (7, 7.8888). What behavior does Newton's method exhibit outside of this range?

2. (Newton's Method) Suppose we want to minimize the following function:

f(x) = 6x − 4 ln(x − 2) − 3 ln(25 − x)

over the domain X = {x | 2 < x < 25} using Newton's method.

(a) Using a calculator (or a computer, if you wish), compute five iterations of Newton's method starting at each of the following points, and record your answers:

x = 2.6, x = 2.7, x = 2.4, x = 2.8, x = 3.0

(b) Verify empirically that Newton's method will converge to the optimal solution for all starting values of x in the range (2, 3.05). What behavior does Newton's method exhibit outside of this range?

3. (Newton's Method) Suppose that we seek to minimize the following function:

f(x₁, x₂) = −9x₁ − 10x₂ + θ(−ln(100 − x₁ − x₂) − ln(x₁) − ln(x₂) − ln(50 − x₁ + x₂)),

where θ is a given parameter, on the domain X = {(x₁, x₂) | x₁ > 0, x₂ > 0, x₁ + x₂ < 100, x₁ − x₂ < 50}. This exercise asks you to implement Newton's method on this problem, first without a line-search, and then with a line-search. Run your algorithm for θ = 1 and for θ = 10, using the following starting points.

x^0 = (8, 9)ᵀ, x^0 = (1, 4)ᵀ, x^0 = (5, 68.69)ᵀ, x^0 = (1, 2)ᵀ

(a) When you run Newton's method without a line-search for this problem and with these starting points, what behavior do you observe?

(b) When you run Newton's method with a line-search for this problem, what behavior do you observe?

4. (Projected Newton's Method) Prove Proposition 6.1 of the notes on Projected Steepest Descent.

5. (Newton's Method) In class we described Newton's method as a method for finding a point x* for which ∇f(x*) = 0. Now consider the following setting, where we have n nonlinear equations in n unknowns x = (x₁, . . . , xₙ):

g₁(x) = 0
...
gₙ(x) = 0,

which we conveniently write as g(x) = 0.

Let J(x) denote the Jacobian matrix (of partial derivatives) of g(x). Then at x = x̄ we have

g(x̄ + d) ≈ g(x̄) + J(x̄)d,

this being the linear approximation of g(x) at x = x̄. Construct a version of Newton's method for solving the equation system g(x) = 0.

6. (Newton's Method) Suppose that f(x) is a strictly convex twice-continuously differentiable function, and consider Newton's method with a line-search. Given x̄, we compute the Newton direction d = −[H(x̄)]⁻¹∇f(x̄), and the next iterate is x̃ := x̄ + ᾱd, where the step-size ᾱ is chosen to satisfy:

ᾱ := arg min_α f(x̄ + αd).

Prove that the iterates of this method converge to the unique global minimum of f(x).

7. Prove Proposition 2.1 of the notes on Newton's method.

8. Bertsekas, Exercise 1.4.1, page 99.