(Quasi-)Newton methods



1 Introduction

1.1 Newton method

Newton's method is a method to find the zeros of a differentiable non-linear function g, i.e. a point x such that g(x) = 0, where g : \mathbb{R}^n \to \mathbb{R}^n. Given a starting point x_0, Newton's method consists in iterating

x_{k+1} = x_k - g'(x_k)^{-1} g(x_k),

where g'(x) is the derivative (Jacobian) of g at the point x. Applying this method to the optimization problem

\min_{x \in \mathbb{R}^n} f(x)

consists in setting g(x) = \nabla f(x), i.e. looking for stationary points. The iterations read:

x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k).

Newton's method is particularly interesting as its convergence is locally quadratic around x^*, i.e.

\|x_{k+1} - x^*\| \le \gamma \|x_k - x^*\|^2, \quad \gamma > 0.

However, convergence is guaranteed only if x_0 is sufficiently close to x^*. The method may diverge if the initial point is too far from x^* or if the Hessian is not positive definite. In order to address this issue of local convergence, Newton's method can be combined with a line search in the direction d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k).

Proof. We now prove the quadratic local convergence of Newton's method. TODO

1.2 Variable metric methods

The idea behind variable metric methods is to use iterations of the form

d_k = -B_k g_k,
x_{k+1} = x_k + \rho_k d_k,

where g_k = \nabla f(x_k) and B_k is a positive definite matrix. If B_k = I_n, this corresponds to gradient descent. Fixing B_k = B leads to the following remark.

Remark. When minimizing

\min_{x \in \mathbb{R}^n} f(x),

one can set x = Cy with C invertible (change of variable). Let us denote \tilde{f}(y) = f(Cy). This leads to

\nabla \tilde{f}(y) = C^T \nabla f(Cy).

Gradient descent applied to \tilde{f} reads

y_{k+1} = y_k - \rho_k C^T \nabla f(Cy_k),

which is equivalent to

x_{k+1} = x_k - \rho_k C C^T \nabla f(x_k).

This amounts to using B = C C^T. In the case where f is a quadratic form, it is straightforward to observe that this will improve convergence, as the sketch below illustrates.
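To make the last remark concrete, here is a minimal numpy sketch comparing gradient descent with exact line search against two preconditioned variants on an ill-conditioned quadratic; the matrix A, the starting point and the candidate metrics B are illustrative choices, not taken from these notes.

import numpy as np

# Illustrative ill-conditioned quadratic f(x) = 1/2 x^T A x, minimized at x* = 0.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x

def descent(B, x0, iters=20):
    """Gradient descent in the metric B (d = -B g) with exact line search."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:      # already at the minimizer
            break
        d = -B @ g
        rho = -(g @ d) / (d @ A @ d)       # optimal step size for a quadratic
        x = x + rho * d
    return np.linalg.norm(x)               # distance to x* = 0

x0 = np.array([1.0, 1.0])
print(descent(np.eye(2), x0))              # B = I_n: slow, chi(A) = 100
print(descent(np.linalg.inv(A), x0))       # B = A^{-1}: exact in one step
print(descent(np.diag([1.0, 0.02]), x0))   # rough approximation of A^{-1}: chi(BA) = 2

With B = I_n the iterates still carry a visible error after 20 steps, while even a rough diagonal approximation of A^{-1} reduces it by many orders of magnitude, which is exactly the behaviour quantified by Theorem 1 below.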

Theorem 1. Let f(x) be a positive definite quadratic form with Hessian A and let B be a positive definite matrix. The preconditioned gradient algorithm

x_0 fixed,
x_{k+1} = x_k - \rho_k B g_k, \quad \rho_k optimal,

has a linear convergence:

\|x_{k+1} - x^*\| \le \gamma \|x_k - x^*\|,

where

\gamma = \frac{\chi(BA) - 1}{\chi(BA) + 1} < 1.

Remark. The lower the condition number \chi(BA), the faster the algorithm. One cannot set B = A^{-1}, as it would imply having already solved the problem, but this suggests choosing B so that it approximates A^{-1}. This is the idea behind quasi-Newton methods.

2 Quasi-Newton methods

2.1 Quasi-Newton relation

A quasi-Newton method reads

d_k = -B_k g_k,
x_{k+1} = x_k + \rho_k d_k,

or

d_k = -H_k^{-1} g_k,
x_{k+1} = x_k + \rho_k d_k,

where B_k (resp. H_k) is a matrix which aims to approximate the inverse of the Hessian (resp. the Hessian) of f at x_k. The question is how to achieve this. One can start with B_0 = I_n, but how should B_k then be updated at every iteration? The idea is the following: by applying a Taylor expansion to the gradient, we know that at the point x_k the gradient and the Hessian are such that

g_{k+1} = g_k + \nabla^2 f(x_k)(x_{k+1} - x_k) + \epsilon(x_{k+1} - x_k).

If one assumes that this approximation is good enough, one has

g_{k+1} - g_k \approx \nabla^2 f(x_k)(x_{k+1} - x_k),

which leads to the quasi-Newton relation.

Definition 1. Two matrices H_{k+1} and B_{k+1} verify the quasi-Newton relation if, respectively,

H_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)

or

x_{k+1} - x_k = B_{k+1}(\nabla f(x_{k+1}) - \nabla f(x_k)).

The follow-up question is how to update B_k while making sure that it stays positive definite.

2.2 Update formula of the Hessian

The update strategy, for the iteration

d_k = -B_k g_k,
x_{k+1} = x_k + \rho_k d_k,

is to correct B_k with a symmetric matrix \Delta_k,

B_{k+1} = B_k + \Delta_k,

such that the quasi-Newton relation holds,

x_{k+1} - x_k = B_{k+1}(g_{k+1} - g_k),

with B_{k+1} positive definite, assuming B_k is positive definite. For simplicity we will write B_{k+1} > 0. We will now see different formulas to define \Delta_k, under the assumption that its rank is 1 or 2. We will speak of rank 1 or rank 2 corrections.
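As a quick numerical sanity check of Definition 1, the sketch below uses an illustrative random quadratic f(x) = 1/2 x^T A x - b^T x (an arbitrary choice, not from the notes) and verifies that the relation is satisfied exactly by H_{k+1} = A and B_{k+1} = A^{-1}, whatever the pair of iterates; the rank 1 and rank 2 corrections of the next sections aim to approach this ideal pair without ever forming A^{-1}.

import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # positive definite Hessian of the quadratic
b = rng.standard_normal(n)

grad = lambda x: A @ x - b           # gradient of f(x) = 1/2 x^T A x - b^T x

x_k  = rng.standard_normal(n)
x_k1 = rng.standard_normal(n)        # any pair of iterates works for a quadratic
s_k  = x_k1 - x_k                    # s_k = x_{k+1} - x_k
y_k  = grad(x_k1) - grad(x_k)        # y_k = g_{k+1} - g_k

# Quasi-Newton relation: H_{k+1} s_k = y_k and s_k = B_{k+1} y_k.
print(np.allclose(A @ s_k, y_k))                   # True: H_{k+1} = A works
print(np.allclose(np.linalg.solve(A, y_k), s_k))   # True: B_{k+1} = A^{-1} works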

2.3 Broyden formula

The Broyden formula is a rank 1 correction. Let us write B_{k+1} = B_k + v v^T. The vector v \in \mathbb{R}^n should verify the quasi-Newton relation

B_{k+1} y_k = s_k,

where y_k = g_{k+1} - g_k and s_k = x_{k+1} - x_k. It follows that

B_k y_k + v v^T y_k = s_k,

which leads, by taking the dot product with y_k, to

(y_k^T v)^2 = (s_k - B_k y_k)^T y_k.

Using the equality

v v^T = \frac{(v v^T y_k)(v v^T y_k)^T}{(v^T y_k)^2},

one can write, after replacing v v^T y_k by s_k - B_k y_k and (v^T y_k)^2 by y_k^T (s_k - B_k y_k), the correction formula also known as the Broyden formula:

B_{k+1} = B_k + \frac{(s_k - B_k y_k)(s_k - B_k y_k)^T}{(s_k - B_k y_k)^T y_k}.

Theorem 2. Let f be a positive definite quadratic form. Let us consider the method that, starting from x_0, iterates

x_{k+1} = x_k + s_k,

where the vectors s_k are linearly independent. Then the sequence of matrices starting from B_0 and defined by

B_{k+1} = B_k + \frac{(s_k - B_k y_k)(s_k - B_k y_k)^T}{(s_k - B_k y_k)^T y_k},

where y_k = \nabla f(x_{k+1}) - \nabla f(x_k), converges in at most n iterations towards A^{-1}, the inverse of the Hessian of f.

Proof. Since we consider here a quadratic function, the Hessian is constant and equal to A. This leads to

y_i = \nabla f(x_{i+1}) - \nabla f(x_i) = A s_i, \quad \forall i.

We have seen that B_{k+1} is constructed such that

B_{k+1} y_k = s_k.

Let us show that, in addition,

B_{k+1} y_i = s_i, \quad i = 0, ..., k-1.

By recurrence, let us assume that it is true for B_k,

B_k y_i = s_i, \quad i = 0, ..., k-2,

which, together with the construction relation B_k y_{k-1} = s_{k-1}, gives B_k y_i = s_i for i = 0, ..., k-1. Let i \le k-1. One has

B_{k+1} y_i = B_k y_i + \frac{(s_k - B_k y_k)(s_k^T y_i - y_k^T B_k y_i)}{(s_k - B_k y_k)^T y_k}. \quad (1)

By hypothesis one has B_k y_i = s_i, which implies that y_k^T B_k y_i = y_k^T s_i; but since A s_j = y_j for all j, it leads to

y_k^T s_i = s_k^T A s_i = s_k^T y_i.

The numerator in (1) is therefore 0, which leads to

B_{k+1} y_i = B_k y_i = s_i.

One therefore has

B_{k+1} y_i = s_i, \quad i = 0, ..., k.

After n iterations, one has

B_n y_i = s_i, \quad i = 0, ..., n-1,

but since y_i = A s_i, this last equation is equivalent to

B_n A s_i = s_i, \quad i = 0, ..., n-1.

As the s_i form a basis of \mathbb{R}^n, this implies that B_n A = I_n, i.e. B_n = A^{-1}.

The issue with Broyden's formula is that there is no guarantee that the matrices B_k are positive definite, even if the function f is quadratic and B_0 = I_n. It is nevertheless interesting to have B_n = A^{-1}.

2.4 Davidon, Fletcher and Powell formula

The formula of Davidon, Fletcher and Powell (DFP) is a rank 2 correction. It reads:

B_{k+1} = B_k + \frac{s_k s_k^T}{s_k^T y_k} - \frac{B_k y_k y_k^T B_k}{y_k^T B_k y_k}. \quad (2)

The following theorem states that, under certain conditions, this formula guarantees that B_k stays positive definite.

Theorem 3. Let us consider the update

d_k = -B_k g_k,
x_{k+1} = x_k + \rho_k d_k, \quad \rho_k optimal,

where B_0 > 0 and x_0 are given. Then the matrices B_k defined as in (2) are positive definite for all k > 0.

Proof. Let x \in \mathbb{R}^n. One has

x^T B_{k+1} x = x^T B_k x + \frac{(s_k^T x)^2}{s_k^T y_k} - \frac{(y_k^T B_k x)^2}{y_k^T B_k y_k}
             = \frac{y_k^T B_k y_k \; x^T B_k x - (y_k^T B_k x)^2}{y_k^T B_k y_k} + \frac{(s_k^T x)^2}{s_k^T y_k}. \quad (3)

If one defines the dot product \langle x, y \rangle as x^T B_k y, the equation above reads

x^T B_{k+1} x = \frac{\langle y_k, y_k \rangle \langle x, x \rangle - \langle y_k, x \rangle^2}{\langle y_k, y_k \rangle} + \frac{(s_k^T x)^2}{s_k^T y_k}.

The first term is nonnegative by the Cauchy-Schwarz inequality. Regarding the second term, as the step size is optimal one has g_{k+1}^T d_k = 0, which implies

s_k^T y_k = \rho_k (g_{k+1} - g_k)^T d_k = \rho_k g_k^T B_k g_k > 0,

and so x^T B_{k+1} x \ge 0. Both terms being nonnegative, the sum is zero only if both are 0. The first term vanishes only if x = \lambda y_k with \lambda \ne 0 (for x \ne 0), and in this case the second term cannot be zero, as s_k^T x = \lambda s_k^T y_k \ne 0. This implies that B_{k+1} > 0.

Remark. In the proof we used the fact that an optimal step size is used. The result still holds if one uses an approximate line search strategy, for example the rule of Wolfe and Powell. In this case the point x_{k+1} is such that

\phi'(\rho_k) = \nabla f(x_{k+1})^T d_k \ge m_2 \nabla f(x_k)^T d_k, \quad 0 < m_2 < 1,

which guarantees

g_{k+1}^T \frac{x_{k+1} - x_k}{\rho_k} > g_k^T \frac{x_{k+1} - x_k}{\rho_k},

and therefore (g_{k+1} - g_k)^T (x_{k+1} - x_k) > 0.
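Before stating the full algorithm, here is a minimal numpy sketch of the DFP update (2), run with exact steps on an illustrative random quadratic (both the test problem and the closed-form optimal step are arbitrary choices used only for the check). It verifies numerically the quasi-Newton relation, the positive definiteness guaranteed by Theorem 3, and the recovery of A^{-1} after n steps discussed in the next section.

import numpy as np

def dfp_update(B, s, y):
    """Rank 2 DFP correction (2): B + s s^T / (s^T y) - B y y^T B / (y^T B y)."""
    By = B @ y
    return B + np.outer(s, s) / (s @ y) - np.outer(By, By) / (y @ By)

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # Hessian of the illustrative quadratic
grad = lambda x: A @ x               # gradient of f(x) = 1/2 x^T A x

B = np.eye(n)                        # B_0 = I_n
x = rng.standard_normal(n)
for k in range(n):
    g = grad(x)
    d = -B @ g
    rho = -(g @ d) / (d @ A @ d)     # exact (optimal) step size for a quadratic
    s = rho * d                      # s_k = x_{k+1} - x_k
    y = grad(x + s) - g              # y_k = g_{k+1} - g_k (= A s_k here)
    B = dfp_update(B, s, y)
    x = x + s

print(np.allclose(B @ y, s))                 # quasi-Newton relation holds
print(np.all(np.linalg.eigvalsh(B) > 0))     # B stays positive definite (Theorem 3)
print(np.allclose(B, np.linalg.inv(A)))      # B_n = A^{-1} in the quadratic case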

Algorithm 1 Davidon-Fletcher-Powell (DFP) algorithm
Require: \varepsilon > 0 (tolerance), K (maximum number of iterations)
1: x_0 \in \mathbb{R}^n, B_0 > 0 (for example I_n)
2: for k = 0 to K do
3:   if \|g_k\| < \varepsilon then
4:     break
5:   end if
6:   d_k = -B_k \nabla f(x_k)
7:   Compute the optimal step size \rho_k
8:   x_{k+1} = x_k + \rho_k d_k
9:   s_k = \rho_k d_k
10:  y_k = g_{k+1} - g_k
11:  B_{k+1} = B_k + s_k s_k^T / (s_k^T y_k) - B_k y_k y_k^T B_k / (y_k^T B_k y_k)
12: end for
13: return x_{k+1}

2.5 Davidon-Fletcher-Powell algorithm

One can now use the latter formula in Algorithm 1. This algorithm has a remarkable property when the function f is quadratic.

Theorem 4. When f is a quadratic form, Algorithm 1 generates a sequence of directions s_0, ..., s_{k+1} which verify

s_i^T A s_j = 0, \quad 0 \le i < j \le k+1, \quad (4)
B_{k+1} A s_i = s_i, \quad 0 \le i \le k.

Proof. TODO

This theorem says that in the quadratic case Algorithm 1 behaves like a conjugate gradient method, which therefore converges in at most n iterations. One can also notice that for k = n-1 the equalities

B_n A s_i = s_i, \quad i = 0, ..., n-1,

together with the fact that the s_i are linearly independent, imply that B_n = A^{-1}.

Remark. One can show that in the general (non-quadratic) case, if the direction d_k is periodically reinitialized to -g_k, this algorithm converges to a local minimum \hat{x} of f and that

\lim_{k \to \infty} B_k = \nabla^2 f(\hat{x})^{-1}.

This implies that close to the optimum, if the line search is exact, the method behaves like a Newton method. This justifies the use of \rho_k = 1 when using an approximate line search.

2.6 Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm

The BFGS formula is a rank 2 correction which is derived from the DFP formula by swapping the roles of s_k and y_k. BFGS is named after the four people who independently discovered it in 1970: Broyden, Fletcher, Goldfarb and Shanno. The formula obtained maintains an approximation H_k of the Hessian which satisfies the same properties: H_{k+1} > 0 if H_k > 0, together with the quasi-Newton relation y_k = H_{k+1} s_k. The formula reads:

H_{k+1} = H_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{H_k s_k s_k^T H_k}{s_k^T H_k s_k}.

The algorithm is detailed in Algorithm 2. Note that the direction d_k is obtained by solving a linear system. In practice, however, the update of H_k is performed on a Cholesky factorization H_k = C_k C_k^T, which implies that the complexity of BFGS is the same as that of DFP. The Cholesky factorization is also useful to check that H_k stays numerically positive definite, as this property can be lost due to rounding errors.
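The update of H_k above and the role of the Cholesky factorization can be illustrated with a small numpy sketch; the skip-on-nonpositive-curvature safeguard and the synthetic (s_k, y_k) pair are assumptions made for the example, not something prescribed by these notes.

import numpy as np

def bfgs_update(H, s, y):
    """BFGS correction: H + y y^T / (y^T s) - H s s^T H / (s^T H s)."""
    if s @ y <= 0:
        # Curvature condition violated (possible with a poor line search):
        # skipping the update is one simple, common safeguard (assumed here).
        return H
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def is_positive_definite(H):
    """Cholesky succeeds iff H is numerically positive definite."""
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

# Illustrative pair (s_k, y_k) with positive curvature s_k^T y_k > 0.
rng = np.random.default_rng(2)
n = 3
H = np.eye(n)                              # H_0 = I_n
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)       # keeps s^T y > 0 in this toy setting
H = bfgs_update(H, s, y)
print(np.allclose(H @ s, y))               # quasi-Newton relation H_{k+1} s_k = y_k
print(is_positive_definite(H))             # True: positive definiteness is preserved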

Algorithm 2 Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm
Require: \varepsilon > 0 (tolerance), K (maximum number of iterations)
1: x_0 \in \mathbb{R}^n, H_0 > 0 (for example I_n)
2: for k = 0 to K do
3:   if \|g_k\| < \varepsilon then
4:     break
5:   end if
6:   d_k = -H_k^{-1} \nabla f(x_k)
7:   Compute the optimal step size \rho_k
8:   x_{k+1} = x_k + \rho_k d_k
9:   s_k = \rho_k d_k
10:  y_k = g_{k+1} - g_k
11:  H_{k+1} = H_k + y_k y_k^T / (y_k^T s_k) - H_k s_k s_k^T H_k / (s_k^T H_k s_k)
12: end for
13: return x_{k+1}

Remark. The BFGS algorithm has the same properties as the DFP method: in the quadratic case it produces conjugate directions, converges in at most n iterations, and H_n = A. Compared to DFP, the convergence speed of BFGS is much less sensitive to the use of an approximate step size. It is therefore a particularly good candidate when combined with the rule of Wolfe and Powell or of Goldstein.

Remark. The BFGS method is available in scipy as scipy.optimize.fmin_bfgs.

2.7 Limited-memory BFGS (L-BFGS) algorithm

The limited-memory BFGS (L-BFGS) algorithm is a variant of BFGS that limits memory usage. While BFGS requires storing an n × n matrix of the size of the Hessian, which can be prohibitive in applications such as computer vision or machine learning, L-BFGS only stores a few vectors that are used to approximate the matrix H_k^{-1}. As a consequence the memory usage is linear in the dimension of the problem.

L-BFGS is an algorithm of the quasi-Newton family with d_k = -B_k \nabla f(x_k). The difference lies in the computation of the product between B_k and \nabla f(x_k). The idea is to keep in memory the last few low rank corrections, more specifically the last m updates s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k. In practice one often takes m < 10, yet it might be necessary to adjust this parameter on specific problems to speed up convergence. We denote \mu_k = 1 / (y_k^T s_k). The computation of the descent direction is given in Algorithm 3.

Algorithm 3 Direction finding in the L-BFGS algorithm
Require: m (memory size)
1: q = g_k
2: for i = k-1 down to k-m do
3:   \alpha_i = \mu_i s_i^T q
4:   q = q - \alpha_i y_i
5: end for
6: z = B_k^0 q
7: for i = k-m to k-1 do
8:   \beta = \mu_i y_i^T z
9:   z = z + s_i (\alpha_i - \beta)
10: end for
11: d_k = -z

Commonly, the initial inverse-Hessian approximation B_k^0 is represented as a diagonal matrix, so that computing z = B_k^0 q only requires an element-by-element multiplication. B_k^0 can change at each iteration but has to be positive definite. Like BFGS, L-BFGS does not need an exact line search to converge. In machine learning, it is almost always the best approach for solving l2-regularized logistic regression and conditional random fields (CRFs).
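Algorithm 3 translates almost line for line into numpy. In the sketch below, the history lists ordered from oldest to newest and the scalar gamma defining the diagonal B_k^0 = gamma I_n are assumptions about how a surrounding L-BFGS loop would store its data; the tiny usage example at the end is purely illustrative.

import numpy as np

def lbfgs_direction(g_k, s_hist, y_hist, gamma=1.0):
    """Two-loop recursion of Algorithm 3.

    g_k            -- current gradient
    s_hist, y_hist -- last m pairs s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i,
                      ordered from oldest to newest (assumed kept by the caller)
    gamma          -- scalar such that B_k^0 = gamma * I_n (diagonal initial metric)
    """
    q = g_k.copy()
    mu = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alpha = [0.0] * len(s_hist)
    for i in reversed(range(len(s_hist))):   # i = k-1 down to k-m
        alpha[i] = mu[i] * (s_hist[i] @ q)
        q = q - alpha[i] * y_hist[i]
    z = gamma * q                            # z = B_k^0 q (element-wise for diagonal B_k^0)
    for i in range(len(s_hist)):             # i = k-m to k-1
        beta = mu[i] * (y_hist[i] @ z)
        z = z + s_hist[i] * (alpha[i] - beta)
    return -z                                # descent direction d_k = -z

# Tiny usage example with a single stored pair, on an illustrative quadratic gradient.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x0, x1 = np.array([1.0, 1.0]), np.array([0.9, 0.5])
s_hist = [x1 - x0]
y_hist = [grad(x1) - grad(x0)]
print(lbfgs_direction(grad(x1), s_hist, y_hist))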

Remark. L-BFGS is designed for smooth unconstrained problems. Yet, it can be extended to handle simple box constraints (a.k.a. bound constraints) on the variables, that is, constraints of the form l_i \le x_i \le u_i, where l_i and u_i are per-variable constant lower and upper bounds, respectively (for each x_i, either or both bounds may be omitted). The resulting algorithm is called L-BFGS-B.

Remark. The L-BFGS-B algorithm is available in scipy as scipy.optimize.fmin_l_bfgs_b.

3 Methods specific to least squares

3.1 Gauss-Newton method

When considering least squares problems, the function to minimize reads:

f(x) = \frac{1}{2} \sum_{i=1}^m f_i(x)^2.

Newton's method can be applied to the minimization of f. The gradient and the Hessian matrix read in this particular case

\nabla f(x) = \sum_{i=1}^m f_i(x) \nabla f_i(x),

and

\nabla^2 f(x) = \sum_{i=1}^m \nabla f_i(x) \nabla f_i(x)^T + \sum_{i=1}^m f_i(x) \nabla^2 f_i(x).

Close to the optimum, where the f_i(x) are assumed to be small, the second term can be ignored. The matrix obtained,

H(x) = \sum_{i=1}^m \nabla f_i(x) \nabla f_i(x)^T,

is always positive semi-definite. Furthermore, when m is much larger than n, this matrix is often positive definite. The Gauss-Newton method consists in using H(x) in place of the Hessian. The method reads:

x_0 fixed,
H_k = \sum_{i=1}^m \nabla f_i(x_k) \nabla f_i(x_k)^T,
x_{k+1} = x_k - H_k^{-1} \nabla f(x_k).

3.2 Levenberg-Marquardt method

In order to guarantee the convergence of the Gauss-Newton method, it can be combined with a line search procedure:

x_{k+1} = x_k - \rho_k H_k^{-1} \nabla f(x_k).

However, in order to guarantee that the H_k stay positive definite, a modified method, known as Levenberg-Marquardt, is often used. The idea is simply to replace H_k by H_k + \lambda I_n. One can notice that if \lambda is large, the method behaves like a simple gradient method (with a small step). The Levenberg-Marquardt method reads:

x_0 fixed,
H_k = \sum_{i=1}^m \nabla f_i(x_k) \nabla f_i(x_k)^T,
d_k = -(H_k + \lambda I_n)^{-1} \nabla f(x_k),
x_{k+1} = x_k + \rho_k d_k.

Remark. The Levenberg-Marquardt method is available in scipy as scipy.optimize.leastsq.
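To illustrate the last remark, here is a short curve-fitting example with scipy.optimize.leastsq, which minimizes the sum of squared residuals returned by the callback using MINPACK's Levenberg-Marquardt routines; the exponential model, the synthetic data and the starting point are illustrative choices.

import numpy as np
from scipy.optimize import leastsq

# Illustrative data: noisy samples of a * exp(-b * t) with a = 2.5, b = 1.3.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 4.0, 50)
y = 2.5 * np.exp(-1.3 * t) + 0.05 * rng.standard_normal(t.size)

def residuals(p, t, y):
    """The f_i(x) of the least squares objective: one residual per data point."""
    a, b = p
    return y - a * np.exp(-b * t)

p0 = np.array([1.0, 1.0])                 # starting point x_0
p_opt, ier = leastsq(residuals, p0, args=(t, y))
print(p_opt)                              # close to (2.5, 1.3)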