Heavy ball method on the convex quadratic problem $\min_x \tfrac12 x^\top A x - b^\top x$


1 Heavy ball method on the convex quadratic problem $\min_x \tfrac12 x^\top A x - b^\top x$: a case study

Andersen Ang
Mathématique et recherche opérationnelle, UMONS, Belgium
manshun.ang@umons.ac.be
First draft: June 26, 2018. Last update: July 28, 2019.
Homepage: angms.science

2 Overview

1. Convex Quadratic Problem $\min_x \tfrac12 x^\top A x - b^\top x$
2. Gradient Descent and convergence rate
3. Polyak's Heavy Ball Method
4. Convergence of Heavy Ball Method
5. Summary

3 An inverse problem / unconstrained optimization problem

Given $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^{n \times 1}$, find $x \in \mathbb{R}^{n \times 1}$ by solving

$$(P_0): \ \min_x \tfrac12 \|Ax - b\|_2^2.$$

$(P_0)$ is equivalent to the quadratic problem $\min_x f(x) = \tfrac12 x^\top \tilde{A} x - \tilde{b}^\top x$:

$$
\begin{aligned}
\tfrac12 \|Ax - b\|_2^2
&= \tfrac12 (Ax - b)^\top (Ax - b) && \text{(expand)} \\
&= \tfrac12 \big( x^\top A^\top A x - x^\top A^\top b - b^\top A x + b^\top b \big) && (a^\top b = b^\top a) \\
&= \tfrac12 \big( x^\top A^\top A x - 2 b^\top A x + b^\top b \big) && (\tilde{A} := A^\top A, \ \tilde{b} := A^\top b) \\
&= \tfrac12 x^\top \tilde{A} x - \tilde{b}^\top x + \tfrac12 \|b\|_2^2.
\end{aligned}
$$

Ignoring the constant $\tfrac12 \|b\|_2^2$, and renaming $\tilde{A}$ as $A$ and $\tilde{b}$ as $b$, we focus on the equivalent problem

$$(P): \ \min_x f(x) = \tfrac12 x^\top A x - b^\top x.$$

4 The convex quadratic problem

$$(P): \ \min_x f(x) = \tfrac12 x^\top A x - b^\top x.$$

Properties of $f$:
- $f$ is convex with respect to (w.r.t.) $x$
- $f$ is differentiable w.r.t. $x$
- First-order derivative (gradient): $\nabla_x f(x) = Ax - b$
- Second-order derivative (Hessian): $\nabla_x^2 f(x) = A$

Assumption 1: $A$ is symmetric and positive definite.
Consequences of the assumption:
- all eigenvalues of $A$ are positive
- $A$ is nonsingular, so the optimal solution $x^*$ exists, and $x^* = A^{-1} b$

We can further assume $lI \preceq A \preceq LI$, i.e., the eigenvalues of $A$ lie in $[l, L]$ with $0 < l \le L$. A small numerical setup is sketched below.
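To make the setting concrete, here is a minimal NumPy sketch (my own addition, not from the original slides) that builds a random symmetric positive definite $A$, the gradient, and the closed-form solution. All variable names are hypothetical and are reused by the later snippets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)            # symmetric positive definite by construction
b = rng.standard_normal(n)

def grad_f(x):
    """Gradient of f(x) = 0.5 * x'Ax - b'x."""
    return A @ x - b

x_star = np.linalg.solve(A, b)     # optimal solution x* = A^{-1} b

eigvals = np.linalg.eigvalsh(A)    # eigenvalues in ascending order
l, L = eigvals[0], eigvals[-1]     # lI <= A <= LI
kappa = L / l                      # condition number of A
```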

5 Gradient descent

GD solves $(P): \min_x f(x) = \tfrac12 x^\top A x - b^\top x$ by generating the sequence $\{x_k\}_{k \in \mathbb{N}}$:

$$x_{k+1} = x_k - t_k \nabla_x f(x_k),$$

where $k = 1, 2, \dots$ is the iteration counter and $t_k$ is the step size.

The sequence $x_k$ converges to $x^*$ at a linear rate (in the optimization sense). The convergence is shown by proving that the distance $\|x_k - x^*\|$ decreases monotonically as $k$ increases, under a suitable step size $t_k$.

Theorem (GD converges at a linear rate). Consider the problem $\min_x \tfrac12 x^\top A x - b^\top x$ with $A$ positive definite and $lI \preceq A \preceq LI$. Then

$$\|x_k - x^*\|_2 \le \Big( \frac{\kappa - 1}{\kappa + 1} \Big)^k \|x_0 - x^*\|_2,$$

where $\kappa = L/l$ is the condition number of $A$.
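A direct sketch of this iteration, reusing the hypothetical names from the setup snippet above (grad_f, x_star, n, L, l, kappa); the step size $t = 2/(L+l)$ anticipates the choice made in the proof two slides below.

```python
def gradient_descent(x0, t, iters):
    """Run x_{k+1} = x_k - t * grad f(x_k); record distances to x*."""
    x, dists = x0.copy(), []
    for _ in range(iters):
        x = x - t * grad_f(x)
        dists.append(np.linalg.norm(x - x_star))
    return x, dists

x0 = np.zeros(n)
_, gd_dists = gradient_descent(x0, t=2.0 / (L + l), iters=100)
gd_rate = (kappa - 1) / (kappa + 1)   # theoretical contraction factor per step
```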

6 Useful material before the proof

As $x^* = A^{-1} b$, we have
$$b = A x^*. \tag{1}$$
As $\nabla_x f(x) = A x - b$, we have
$$x_k - t_k \nabla_x f(x_k) = x_k - t_k (A x_k - b). \tag{2}$$
Putting (1) into (2), we have
$$x_k - t_k \nabla_x f(x_k) = x_k - t_k (A x_k - A x^*) = (I - t_k A) x_k + t_k A x^*. \tag{3}$$
With these we can now prove the convergence, starting with the distance $\|x_{k+1} - x^*\|_2$.

7 Convergence rate of Gradient Descent in 1 slide

$$
\begin{aligned}
\|x_{k+1} - x^*\|_2
&= \|x_k - t_k \nabla_x f(x_k) - x^*\|_2 \\
&= \|(I - t_k A) x_k + t_k A x^* - x^*\|_2 && \text{by (3)} \\
&= \|(I - t_k A)(x_k - x^*)\|_2 \\
&\le \|I - t_k A\|_2 \, \|x_k - x^*\|_2 \\
&\le (1 - t_k l) \, \|x_k - x^*\|_2 \\
&\le (1 - t l)^k \, \|x_0 - x^*\|_2 \\
&= \Big( \frac{L - l}{L + l} \Big)^k \|x_0 - x^*\|_2 \\
&= \Big( \frac{\kappa - 1}{\kappa + 1} \Big)^k \|x_0 - x^*\|_2
\end{aligned}
$$

- 1st line: by the definition of GD, $x_{k+1} = x_k - t_k \nabla_x f(x_k)$
- 4th line: by the operator norm inequality $\|Ax\|_2 \le \|A\|_2 \|x\|_2$
- 5th line: by $lI \preceq A \preceq LI$, so $(1 - t_k L) I \preceq I - t_k A \preceq (1 - t_k l) I$; with $t_k \le 2/(L+l)$ the factor $1 - t_k l$ dominates $|1 - t_k L|$
- 6th line: a constant step size $t_k = t$ is used
- 7th line: pick $t = \dfrac{2}{L + l}$, so $1 - t l = \dfrac{L - l}{L + l}$
- 8th line: $\kappa = L/l \ge 1$ is the condition number of $A$
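A quick numerical sanity check (my own addition) that with $t = 2/(L+l)$ the contraction factor $\|I - tA\|_2$ equals $(L-l)/(L+l) = (\kappa-1)/(\kappa+1)$, reusing the hypothetical setup names:

```python
t = 2.0 / (L + l)
contraction = np.linalg.norm(np.eye(n) - t * A, 2)   # operator 2-norm
assert np.isclose(contraction, (L - l) / (L + l))    # = (kappa - 1)/(kappa + 1)
```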

8 Polyak's Heavy Ball Method (HBM)

HBM adds a momentum term to GD:

$$x_{k+1} = x_k - \alpha_k \nabla_x f(x_k) + \underbrace{\beta_k (x_k - x_{k-1})}_{\text{HBM momentum}},$$

i.e., gradient descent with momentum $\beta_k (x_k - x_{k-1})$.

- $\beta_k \ge 0$ is the momentum parameter / extrapolation parameter
- $\alpha_k$ acts as the step size $t_k$ in GD
- When $\beta_k = 0$, HBM reduces to GD
- As the update direction is perturbed by the momentum, HBM is not monotone: the objective function value may increase
- Nevertheless, overall HBM converges faster than GD (we will prove this soon); see the sketch below
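A minimal sketch of the HBM iteration, again assuming the hypothetical setup names (grad_f, x_star); the choices of $\alpha$ and $\beta$ are discussed later.

```python
def heavy_ball(x0, alpha, beta, iters):
    """Run x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x, dists = x0.copy(), x0.copy(), []
    for _ in range(iters):
        # RHS uses the old x before the simultaneous assignment
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
        dists.append(np.linalg.norm(x - x_star))
    return x, dists
```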

9 Comparing GD, HBM and Nesterov's acceleration

Compared to HBM, Nesterov's accelerated gradient computes the gradient after applying the momentum:

$$x_{k+1} = x_k - \alpha_k \nabla_x f\big( x_k + \beta_k (x_k - x_{k-1}) \big) + \beta_k (x_k - x_{k-1}).$$

Consider the following notations:
$$x^+ = x - t \nabla f(x), \qquad a_k = \beta_k (x_k - x_{k-1}).$$
Then:

- Cauchy Gradient Descent: $x_{k+1} = x_k^+$
- Polyak HBM: $x_{k+1} = x_k^+ + a_k$
- Nesterov acceleration: $x_{k+1} = (x_k + a_k)^+$

Open question: maybe $x_{k+1} = (x_k + a_k)^+ + b_k$, or $x_{k+1} = \big( (x_k + a_k)^+ + b_k \big)^+$, ...?
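For contrast, a sketch of the Nesterov update $x_{k+1} = (x_k + a_k)^+$ in the same hypothetical setup; the only difference from HBM is that the gradient is evaluated at the extrapolated point.

```python
def nesterov(x0, alpha, beta, iters):
    """Run x_{k+1} = y - alpha * grad f(y) with y = x_k + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)            # apply the momentum first ...
        x, x_prev = y - alpha * grad_f(y), x   # ... then take a gradient step at y
    return x
```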

10 Convergence of Heavy Ball Method

Consider $x_{k+1} - x^*$. By the definition of the HBM update:
$$x_{k+1} - x^* = x_k - \alpha_k \nabla_x f(x_k) + \beta_k (x_k - x_{k-1}) - x^*.$$
As $\nabla_x f(x_k) = A x_k - b$ and $b = A x^*$, we have $\nabla_x f(x_k) = A x_k - A x^*$, so
$$
\begin{aligned}
x_{k+1} - x^* &= x_k - \alpha_k (A x_k - A x^*) + \beta_k (x_k - x_{k-1}) - x^* \\
&= x_k - x^* - \alpha_k A (x_k - x^*) + \beta_k (x_k - x_{k-1}) \\
&= (I - \alpha_k A)(x_k - x^*) + \beta_k (x_k - x_{k-1} - x^* + x^*) \\
&= (I - \alpha_k A)(x_k - x^*) - \beta_k (x_{k-1} - x^*) + \beta_k (x_k - x^*) \\
&= \big( (1 + \beta_k) I - \alpha_k A \big)(x_k - x^*) - \beta_k (x_{k-1} - x^*).
\end{aligned}
$$
In this sense, we have to consider $x_k - x^*$ and $x_{k-1} - x^*$ at the same time:
$$
\begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix}
=
\underbrace{\begin{bmatrix} (1 + \beta_k) I - \alpha_k A & -\beta_k I \\ I & 0 \end{bmatrix}}_{T_k(\alpha, \beta)}
\begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix},
$$
where $T_k(\alpha, \beta)$ is the transition matrix.

11 Convergence of Heavy Ball Method - Transition matrix T

Compact expression:
$$\begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} = T_k(\alpha, \beta) \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix}.$$
Take constant $\alpha_k = \alpha$ and $\beta_k = \beta$ in $T_k$, so $T_k = T$ and
$$\begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} = T^k \begin{bmatrix} x_1 - x^* \\ x_0 - x^* \end{bmatrix}.$$
Take norms:
$$\left\| \begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} \right\|_2 = \left\| T^k \begin{bmatrix} x_1 - x^* \\ x_0 - x^* \end{bmatrix} \right\|_2 \le \|T^k\|_2 \left\| \begin{bmatrix} x_1 - x^* \\ x_0 - x^* \end{bmatrix} \right\|_2.$$
So if $\|T^k\|_2$ tends to zero, the sequence $x_k$ produced by HBM converges to $x^*$. A numerical sketch of $T$ follows below.
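To make $T$ concrete, a small sketch (my own, reusing the hypothetical setup names) that assembles $T(\alpha, \beta)$ and computes its spectral radius:

```python
def transition_matrix(alpha, beta):
    """T(alpha, beta) = [[(1+beta)I - alpha*A, -beta*I], [I, 0]]."""
    I = np.eye(n)
    top = np.hstack([(1 + beta) * I - alpha * A, -beta * I])
    bottom = np.hstack([I, np.zeros((n, n))])
    return np.vstack([top, bottom])

T = transition_matrix(alpha=1.0 / L, beta=0.5)   # placeholder parameter values
rho = np.max(np.abs(np.linalg.eigvals(T)))       # spectral radius of T
```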

12 Tools for bounding $\|T^k\|_2$

Recall:
- The spectrum (all eigenvalues) of a block diagonal matrix is the union of the eigenvalues of the diagonal blocks.
- The spectrum of a matrix consists of the roots of its characteristic equation.
- For a 2-by-2 matrix, the characteristic equation has the form $au^2 + bu + c = 0$, with roots $u = \frac{1}{2a}\big( -b \pm \sqrt{b^2 - 4ac} \big)$. The roots are complex conjugates iff $\Delta = b^2 - 4ac \le 0$.
- Complex-conjugate roots $a \pm ib$ share the same magnitude $\sqrt{a^2 + b^2}$.

We need two lemmas.

Lemma 1. For an $n \times n$ matrix $T$, there exists a sequence $\varepsilon_k \ge 0$ with $\varepsilon_k \to 0$ such that $\|T^k\| \le (\rho(T) + \varepsilon_k)^k$.

Lemma 2. For $\beta \ge (1 - \sqrt{\alpha l})^2$, $\rho(T) \le \sqrt{\beta}$.

Here $\rho(T) = \max\{|\lambda_1|, |\lambda_2|, \dots, |\lambda_n|\}$ is the spectral radius of the matrix $T$, and the $\lambda_i$ are the eigenvalues of $T$.
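Lemma 1 is essentially Gelfand's formula $\|T^k\|^{1/k} \to \rho(T)$; a quick numerical illustration (my own addition), using T and rho from the previous snippet:

```python
# Gelfand's formula: ||T^k||^(1/k) -> rho(T) as k grows, which is what Lemma 1 encodes.
for k in (1, 10, 100, 1000):
    Tk = np.linalg.matrix_power(T, k)
    print(k, np.linalg.norm(Tk, 2) ** (1.0 / k), rho)
```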

13 The logic flow of bounding $\|T^k\|_2$

- Ultimate goal: show that the $x_k$ produced by HBM converge to $x^*$.
- The $x_k$ converge to $x^*$ if $\|T^k\|_2$ is suitably bounded.
- We use Lemma 1 to bound $\|T^k\|$. To use Lemma 1, we need $\rho(T)$, for which we use Lemma 2.
- We will not prove Lemma 1, only Lemma 2.

Lemma 1. For an $n \times n$ matrix $T$, there exists a sequence $\varepsilon_k \ge 0$ with $\lim_{k \to \infty} \varepsilon_k = 0$ such that
$$\|T^k\| \le (\rho(T) + \varepsilon_k)^k.$$

Proof. Skipped (too long).

14 The logic of bounding $\|T^k\|_2$

Lemma 2. For $\beta \ge (1 - \sqrt{\alpha l})^2$, $\rho(T) \le \sqrt{\beta}$.

Flow of proving Lemma 2:
- First show that $T$ can be decomposed into 2-by-2 blocks $T_i$.
- Then the spectrum of $T$ is the union of the eigenvalues of the $T_i$.
- As $\rho(T)$ concerns the magnitudes of eigenvalues, we consider the magnitudes of the eigenvalues of $T_i$.
- $T_i$ is a 2-by-2 matrix, so its eigenvalues are the roots of a characteristic equation of the form $u^2 + bu + c = 0$.
- The roots of $u^2 + bu + c = 0$ are complex conjugates, and hence share the same magnitude, iff $b^2 - 4c \le 0$.

15 Proving Lemma 2 - eigendecomposition

Lemma 2. For $\beta \ge (1 - \sqrt{\alpha l})^2$, $\rho(T) \le \sqrt{\beta}$.

Proof. First assume $\beta \ge (1 - \sqrt{\alpha l})^2$. As $A$ is positive definite, $A$ has the eigendecomposition $A = V \Lambda V^\top$ with $V$ orthogonal and $\Lambda$ diagonal. Then
$$T = \begin{bmatrix} (1 + \beta) I - \alpha A & -\beta I \\ I & 0 \end{bmatrix} = \begin{bmatrix} (1 + \beta) I - \alpha V \Lambda V^\top & -\beta I \\ I & 0 \end{bmatrix}.$$
As $\Lambda$ is diagonal and $V$ is orthogonal, conjugating by $\mathrm{blkdiag}(V, V)$ changes neither the spectrum nor the 2-norm, so without loss of generality we may take
$$T = \begin{bmatrix} (1 + \beta) I - \alpha \Lambda & -\beta I \\ I & 0 \end{bmatrix}.$$

16 Proving Lemma 2 - block decomposition of T

Note that $T$ is now built from diagonal blocks, so it can be decomposed. E.g., when $n = 2$, writing $x_i = 1 + \beta - \alpha \lambda_i$,
$$T = \begin{bmatrix} x_1 & 0 & -\beta & 0 \\ 0 & x_2 & 0 & -\beta \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$
As the spectrum and the norm are invariant under a simultaneous row/column permutation, swap row 3 with row 2 and column 3 with column 2:
$$\begin{bmatrix} x_1 & -\beta & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & x_2 & -\beta \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$
We can see that $T$ can be decomposed into a block diagonal matrix consisting of the $2 \times 2$ components $T_i$:
$$T = T_1 \oplus T_2 \oplus \cdots \oplus T_n, \qquad T_i = \begin{bmatrix} 1 + \beta - \alpha \lambda_i & -\beta \\ 1 & 0 \end{bmatrix}.$$

17 Proving Lemma 2 - block decomposition of T

(Figures omitted: the structure of $T$ in the general case, before and after the permutation.)

Hence, the spectrum of $T$ is the set of all the eigenvalues of all the blocks $T_i$.

18 Proving Lemma 2 - spectrum of T = eigenvalues of the $T_i$

$T_i$ is 2-by-2, so we find its eigenvalues via the roots of the characteristic equation, i.e., we solve $\det(T_i - uI) = 0$:
$$\det \begin{bmatrix} 1 + \beta - \alpha \lambda_i - u & -\beta \\ 1 & -u \end{bmatrix} = 0 \;\Longleftrightarrow\; u^2 - (1 + \beta - \alpha \lambda_i) u + \beta = 0,$$
$$u = \tfrac12 \Big( 1 + \beta - \alpha \lambda_i \pm \sqrt{(1 + \beta - \alpha \lambda_i)^2 - 4\beta} \Big).$$
The magnitudes of the two roots are the same iff the roots are complex. Let $\Delta = (1 + \beta - \alpha \lambda_i)^2 - 4\beta$; then $\sqrt{\Delta}$ is imaginary iff $\Delta \le 0$:
$$
\begin{aligned}
(1 + \beta - \alpha \lambda_i)^2 &\le 4\beta \\
1 + \beta - \alpha \lambda_i &\le 2\sqrt{\beta} \\
1 - 2\sqrt{\beta} + \beta &\le \alpha \lambda_i \\
(1 - \sqrt{\beta})^2 &\le \alpha \lambda_i \\
1 - \sqrt{\beta} &\le \sqrt{\alpha \lambda_i} \\
1 - \sqrt{\alpha \lambda_i} &\le \sqrt{\beta} \\
\beta &\ge (1 - \sqrt{\alpha \lambda_i})^2.
\end{aligned}
$$
(Here we use $0 \le \beta \le 1$; the other branch of $\Delta \le 0$, namely $-2\sqrt{\beta} \le 1 + \beta - \alpha \lambda_i$, i.e. $\sqrt{\alpha \lambda_i} \le 1 + \sqrt{\beta}$, holds under the parameter choice made on the convergence slide.)
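A tiny numerical check (my own addition) that when $\beta \ge (1 - \sqrt{\alpha \lambda_i})^2$ the two roots of $u^2 - (1 + \beta - \alpha \lambda_i)u + \beta = 0$ are complex conjugates of magnitude $\sqrt{\beta}$; the values of alpha and lam here are arbitrary placeholders.

```python
alpha, lam = 0.1, 4.0                                  # arbitrary placeholder values
beta = (1 - np.sqrt(alpha * lam)) ** 2 + 0.05          # satisfies beta >= (1 - sqrt(alpha*lam))^2
roots = np.roots([1.0, -(1 + beta - alpha * lam), beta])
assert np.allclose(np.abs(roots), np.sqrt(beta))       # both roots have magnitude sqrt(beta)
```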

19 Proving Lemma 2 - complex roots

Note: $\beta \ge (1 - \sqrt{\alpha \lambda_i})^2$ is automatically satisfied for every $i$, due to the assumptions $\beta \ge (1 - \sqrt{\alpha l})^2$ and $A \preceq LI$: for $\sqrt{\alpha \lambda_i} \le 1$ it follows from $\lambda_i \ge l$, and for $\sqrt{\alpha \lambda_i} > 1$ it follows from $\sqrt{\alpha \lambda_i} \le \sqrt{\alpha L} \le 1 + \sqrt{\beta}$, which the parameter choice on the convergence slide guarantees.

Hence $\Delta = (1 + \beta - \alpha \lambda_i)^2 - 4\beta \le 0$, so $\sqrt{\Delta}$ is imaginary and the roots
$$u = \tfrac12 \Big( 1 + \beta - \alpha \lambda_i \pm \sqrt{(1 + \beta - \alpha \lambda_i)^2 - 4\beta} \Big)$$
form a complex-conjugate pair $a \pm ib$, where
$$a = \tfrac12 (1 + \beta - \alpha \lambda_i), \qquad b = \tfrac12 \sqrt{4\beta - (1 + \beta - \alpha \lambda_i)^2}.$$

20 Proving Lemma 2 - magnitude of u is $\sqrt{\beta}$

The magnitude of $u = a \pm ib$ is $\sqrt{a^2 + b^2}$:
$$|u| = \sqrt{ \tfrac14 (1 + \beta - \alpha \lambda_i)^2 + \tfrac14 \big( 4\beta - (1 + \beta - \alpha \lambda_i)^2 \big) } = \sqrt{\beta}.$$
So the eigenvalues of $T_i$ (for every $i$) have magnitude $\sqrt{\beta}$. Therefore the largest eigenvalue magnitude (the spectral radius) of $T$ satisfies $\rho(T) \le \sqrt{\beta}$, and we finish the proof of Lemma 2. $\square$

21 Convergence of HBM

Assume $\beta \ge (1 - \sqrt{\alpha l})^2$. By Lemma 2 we have $\rho(T) = \max_i |\lambda_i(T)| \le \sqrt{\beta}$. By Lemma 1 we have $\|T^k\| \le (\rho(T) + \varepsilon_k)^k$ with $\lim_{k \to \infty} \varepsilon_k = 0$.
Putting Lemma 2 into Lemma 1, we have
$$\|T^k\| \le (\sqrt{\beta} + \varepsilon_k)^k.$$
Lastly, taking $\alpha = \dfrac{4}{(\sqrt{L} + \sqrt{l})^2}$ and $\beta = \Big( \dfrac{\sqrt{L} - \sqrt{l}}{\sqrt{L} + \sqrt{l}} \Big)^2$ in $T$, we have
$$\left\| \begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} \right\| \le \Big( \frac{\sqrt{L} - \sqrt{l}}{\sqrt{L} + \sqrt{l}} + \varepsilon_k \Big)^k \left\| \begin{bmatrix} x_1 - x^* \\ x_0 - x^* \end{bmatrix} \right\|,$$
or
$$\|x_k - x^*\| \le \Big( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} + \varepsilon_k \Big)^k \|x_0 - x^*\|, \qquad \text{where } \kappa = L/l.$$
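Plugging these parameter choices into the earlier heavy_ball sketch (all names hypothetical, defined in the previous snippets):

```python
alpha_opt = 4.0 / (np.sqrt(L) + np.sqrt(l)) ** 2
beta_opt = ((np.sqrt(L) - np.sqrt(l)) / (np.sqrt(L) + np.sqrt(l))) ** 2

_, hb_dists = heavy_ball(x0, alpha_opt, beta_opt, iters=100)
hb_rate = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # asymptotic contraction factor
```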

22 Last page - summary

Gradient descent $x_{k+1} = x_k - t_k \nabla_x f(x_k)$ has convergence
$$\|x_k - x^*\| \le \Big( \frac{\kappa - 1}{\kappa + 1} \Big)^k \|x_0 - x^*\|.$$
The Heavy Ball Method $x_{k+1} = x_k - \alpha_k \nabla_x f(x_k) + \beta_k (x_k - x_{k-1})$ has convergence
$$\|x_k - x^*\| \le \Big( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} + \varepsilon \Big)^k \|x_0 - x^*\|.$$
The improvement is from
$$\Big( \frac{\kappa - 1}{\kappa + 1} \Big)^k = \Big( 1 - \frac{2}{\kappa + 1} \Big)^k \qquad \text{to} \qquad \Big( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} + \varepsilon \Big)^k = \Big( 1 - \frac{2}{\sqrt{\kappa} + 1} + \varepsilon \Big)^k;$$
a numerical comparison of the two factors is sketched below.
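As a closing illustration (my own, not part of the original slides), the two per-step factors at an example condition number $\kappa = 100$:

```python
kappa = 100.0                                       # example condition number
gd = (kappa - 1) / (kappa + 1)                      # ~ 0.980 per step
hb = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # ~ 0.818 per step
print(gd ** 100, hb ** 100)   # roughly 1.3e-1 vs 1.9e-9 after 100 steps
```

End of document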
