SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS

YU QI (B.Sc.(Hons.), BUAA)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2005
Acknowledgements

I am very grateful to my supervisor, Dr Sun Defeng, for his abundant knowledge and kindness. In the past two years, Dr Sun has always helped me when I was in trouble and encouraged me when I lost confidence. This thesis would not have been finished without his invaluable suggestions and patient guidance. I would also like to thank Miss Shi Shengyuan, who has helped me so much in my study. Thanks also to all the staff and friends at the National University of Singapore for their help and support.

Yu Qi
August 2005
Summary

To solve a convex non-differentiable optimization problem, one may introduce a smooth convex function that approximates the non-differentiable objective and solve the resulting smooth convex optimization problem to obtain an approximate solution. Nesterov showed that if the gradient of the smoothing function is Lipschitz continuous, one may obtain an ɛ-approximate solution with the number of iterations bounded by O(1/ɛ) [9]. In [11], Nesterov discussed the problem of minimizing the maximal eigenvalue and the problem of minimizing the spectral radius. Recently, Shi [12] presented a smoothing function for the sum of the κ largest components of a vector. This smoothing function is highly useful because its composition with the eigenvalue functions yields smoothing functions that approximate all eigenvalue functions of a real symmetric matrix. In this thesis, we further study the properties of these smoothing functions for approximating the eigenvalue functions. In particular, we obtain an estimate of the Lipschitz constant of their gradients.
We then consider the problem of minimizing the sum of the κ largest eigenvalues by applying Nesterov's smoothing method [9]. Finally, we extend this algorithm to solve the problem of minimizing the sum of the κ largest absolute values of eigenvalues. These two problems are general forms of the problem of minimizing the maximal eigenvalue and the problem of minimizing the spectral radius, both considered in [11]. We also report some numerical results for both problems.

The organization of this thesis is as follows. In Chapter 1 we first describe the problems discussed in [11] by Nesterov and then extend them to general cases. In Chapters 2 and 3 we discuss some important properties of smoothing functions for approximating the sum of the κ largest eigenvalues and the sum of the κ largest absolute values of eigenvalues of a parametric affine operator, respectively. The smoothing algorithm and computational results are given in Chapter 4.
List of Notation

A, B, ... denote matrices; M_{n,m} denotes the set of n-by-m real matrices. S^m is the set of all m × m real symmetric matrices; O^m is the set of all m × m orthogonal matrices. A superscript T represents the transpose of matrices and vectors. For a matrix M, M_i and M^j represent the ith row and jth column of M, respectively; M_{ij} denotes the (i, j)th entry of M. A diagonal matrix is written as Diag(β_1, ..., β_n). We use ∘ to denote the Hadamard product between matrices, i.e. X ∘ Y = [X_{ij} Y_{ij}]_{i,j=1}^m. Let A_1, ..., A_n ∈ S^m be given; we define the linear operator A : R^n → S^m by

A(x) := Σ_{i=1}^n x_i A_i,  x ∈ R^n. (1)
Let A* : S^m → R^n be the adjoint of the linear operator A : R^n → S^m defined by (1):

⟨d, A*D⟩ = ⟨D, Ad⟩, (d, D) ∈ R^n × S^m.

Hence, for all D ∈ S^m, A*D = (⟨A_1, D⟩, ..., ⟨A_n, D⟩)^T. Denote by G ∈ S^n the matrix with (G)_{ij} = ⟨A_i, A_j⟩. The eigenvalues of X ∈ S^m are designated by λ_i(X), i = 1, ..., m, with λ_1(X) ≥ λ_2(X) ≥ ··· ≥ λ_m(X). We write X = O(α) (respectively, o(α)) if ‖X‖/α is uniformly bounded (respectively, tends to zero) as α → 0. For B ∈ M_{n,m}, the operator Ξ : M_{n,m} → S^{m+n} is defined as

Ξ(B) = [ 0    B ]
       [ B^T  0 ].

We define the linear operator Γ : R^n → S^{2m} as

Γ(x) = Ξ(A(x)). (2)

Let Γ* : S^{2m} → R^n be the adjoint operator of Γ; for any Y ∈ S^{2m},

Γ*(Y) = (2⟨A_1, Y_2⟩, ..., 2⟨A_n, Y_2⟩)^T,

where Y = [ Y_1  Y_2 ; Y_2^T  Y_3 ] with Y_1, Y_3 ∈ S^m and Y_2 ∈ M_{m,m}.
Contents

Acknowledgements
Summary
List of Notation
1 Introduction
2 Properties of the Smoothing Function for the Sum of the κ Largest Eigenvalues
  2.1 The Smoothing Function for the Sum of the κ Largest Components
  2.2 Spectral Functions
  2.3 Smoothing Functions for φ(x)
3 Smoothing Functions for the Sum of the κ Largest Absolute Values of Eigenvalues
  3.1 Preliminaries
  3.2 Smoothing Functions for ψ(x)
4 Numerical Experiments
  4.1 Computational Results
  4.2 Conclusions
Bibliography
Chapter 1
Introduction

Let S^m be the set of real m-by-m symmetric matrices. For X ∈ S^m, let {λ_i(X)}_{i=1}^m be the eigenvalues of X sorted in nonincreasing order, i.e. λ_1(X) ≥ λ_2(X) ≥ ··· ≥ λ_κ(X) ≥ ··· ≥ λ_m(X). Let A_1, ..., A_n ∈ S^m be given. Define the operator A : R^n → S^m by

A(x) = Σ_{i=1}^n x_i A_i,  x ∈ R^n. (1.1)

Let Q be a bounded closed convex set in R^n and C ∈ S^m. In [11], Nesterov considered the following nonsmooth problem:

min_{x∈Q} λ_1(C + A(x)). (1.2)

Nesterov's approach is to replace the nonsmooth function λ_1(C + A(x)) by the smooth function

S_µ(C + A(x)), (1.3)

where µ > 0 is a tolerance parameter and S_µ(X) is µ times the entropy (log-sum-exp) function of the eigenvalues, i.e.

S_µ(X) = µ ln[Σ_{i=1}^m e^{λ_i(X)/µ}]. (1.4)
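As a numerical illustration (not taken from [11]; the matrix X below is an arbitrary example), the entropy smoothing (1.4) can be evaluated with any standard symmetric eigenvalue routine. It always satisfies λ_1(X) ≤ S_µ(X) ≤ λ_1(X) + µ ln m, so it converges to λ_1(X) as µ → 0:

```python
import numpy as np

def S(X, mu):
    """Entropy smoothing (1.4): S_mu(X) = mu * ln(sum_i exp(lambda_i(X)/mu))."""
    lam = np.linalg.eigvalsh(X)
    top = lam.max()                      # shift for a numerically stable log-sum-exp
    return top + mu * np.log(np.exp((lam - top) / mu).sum())

X = np.array([[2.0, 1.0],
              [1.0, 0.5]])
lam1 = np.linalg.eigvalsh(X).max()
# S_mu(X) stays within a band of width mu*ln(m) above lambda_1(X)
for mu in (1.0, 0.1, 0.01):
    assert lam1 <= S(X, mu) <= lam1 + mu * np.log(X.shape[0])
```

The shift by the largest eigenvalue inside the log-sum-exp avoids overflow for small µ, without changing the value of S_µ(X).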
The function S_µ(X) approximates λ_1(X) as µ → 0. Now, let us consider the following smoothed optimization problem:

min_{x∈Q} S_µ(C + A(x)). (1.5)

Note that for any µ > 0, the gradient mapping ∇S_µ(·) is globally Lipschitz continuous. Nesterov suggested using a gradient-based numerical method [9] to solve problem (1.5). For a given matrix X ∈ S^m, we define its spectral radius by

ρ(X) := max_{1≤i≤m} |λ_i(X)| = max{λ_1(X), −λ_m(X)}. (1.6)

Let ϕ(x) := ρ(C + A(x)). Another problem discussed in Nesterov's paper [11] is

min_{x∈Q} ϕ(x) (1.7)

with C ⪰ 0. Nesterov constructed the smoothing function ϕ_p(x) = F_p(A(x)), x ∈ Q, where

F_p(X) = ⟨X^{2p}, I⟩^{1/(2p)}. (1.8)

Nesterov considered the smoothed problem

min_{x∈Q} ϕ_p(x) (1.9)

and used the method of [8] to solve it. In this thesis, we shall extend Nesterov's approach to the following two problems. Let φ(x) be the sum of the κ largest eigenvalues of C + A(x), i.e.

φ(x) = Σ_{i=1}^κ λ_i(C + A(x)). (1.10)
We shall first consider in this thesis the following nonsmooth problem:

min_{x∈Q} φ(x). (1.11)

Clearly, if κ = 1, (1.11) reduces to problem (1.2). Let λ_[κ](X) be the κth largest absolute value of the eigenvalues of X, sorted in nonincreasing order, i.e. λ_[1](X) ≥ λ_[2](X) ≥ ··· ≥ λ_[κ](X) ≥ ··· ≥ λ_[n](X). We define the following function:

ψ(x) = Σ_{i=1}^κ λ_[i](C + A(x)). (1.12)

The second problem that we shall consider in this thesis is

min_{x∈Q} ψ(x), (1.13)

which is a general case of (1.7). We shall construct smoothing functions for problems (1.11) and (1.13), respectively. The gradients of the smoothing functions must satisfy the global Lipschitz condition, which makes it possible for us to apply Nesterov's algorithm [9] to solve the smoothed problems. Nesterov's method solves the optimization problem

min_{x∈Q} θ(x), (1.14)

where θ(·) is a nonsmooth convex function. Our goal is to find an ɛ-solution x̄ ∈ Q, i.e.

θ(x̄) − θ* ≤ ɛ, (1.15)

where θ* = min_{x∈Q} θ(x). We denote by θ_µ(·) a smoothing function for θ(·). The smoothing function θ_µ(·) satisfies the following inequality:

θ_µ(x) ≤ θ(x) ≤ θ_µ(x) + µR, (1.16)
where R is a constant for a specified smoothing function θ_µ(x) (we will give its definition in Chapter 2). Nesterov proved the following inequality [9, Theorem 2]:

θ(x̄) − θ* ≤ θ_µ(x̄) − θ*_µ + µR,

where θ*_µ = min_{x∈Q} θ_µ(x). Let µ = µ(ɛ) = ɛ/(2R) and suppose

θ_µ(x̄) − θ*_µ ≤ ɛ/2. (1.17)

Then θ(x̄) − θ* ≤ ɛ. Nesterov's algorithm improves the bound on the number of iterations to O(1/ɛ), while traditional subgradient algorithms need O(1/ɛ²) iterations.

The remaining part of this thesis is organized as follows. We discuss the properties of smoothing functions for approximating the sum of the κ largest eigenvalues of a parametric affine operator in Chapter 2 and the sum of the κ largest absolute values of eigenvalues in Chapter 3. Computational results are given in Chapter 4.
Chapter 2
Properties of the Smoothing Function for the Sum of the κ Largest Eigenvalues

2.1 The Smoothing Function for the Sum of the κ Largest Components

In her thesis [12], Shi discussed the smoothing function for the sum of the κ largest components of a vector. For x ∈ R^n we denote by x_[κ] the κth largest component of x, i.e., x_[1] ≥ x_[2] ≥ ··· ≥ x_[κ] ≥ ··· ≥ x_[n], sorted in nonincreasing order. Define

f_κ(x) = Σ_{i=1}^κ x_[i]. (2.1)

Denote by Q_κ the convex set in R^n:

Q_κ = {v ∈ R^n : Σ_{i=1}^n v_i = κ, 0 ≤ v_i ≤ 1, i = 1, 2, ..., n}. (2.2)
Let

p(z) = z ln z, z ∈ (0, 1];  p(0) = 0, (2.3)

and

r(v) = Σ_{i=1}^n p(v_i) + Σ_{i=1}^n p(1 − v_i) + R, v ∈ Q_κ, (2.4)

where

R := n ln n − κ ln κ − (n − κ) ln(n − κ), (2.5)

which is the maximal value of r(v). The smoothing function for f_κ(·) is f_κ^µ(·) : R^n → R, defined by

f_κ^µ(x) := max { x^T v − µ r(v) : Σ_{i=1}^n v_i = κ, 0 ≤ v_i ≤ 1, i = 1, ..., n }. (2.6)

Shi also derived the optimal solution of the problem in (2.6):

v_i(µ, x) = e^{(x_i − α(µ,x))/µ} / (1 + e^{(x_i − α(µ,x))/µ}), (2.7)

where α = α(µ, x) satisfies

Σ_{i=1}^n v_i(µ, x) = κ. (2.8)

In order to introduce Lemma 2.1 of [12], we need the definition of the γ function:

γ_i(µ, x) := e^{(α(µ,x) − x_i)/µ} / ( µ (1 + e^{(α(µ,x) − x_i)/µ})² ), i = 1, ..., n. (2.9)

Without causing any confusion, let γ_i := γ_i(µ, x), for i = 1, ..., n, and γ = (γ_1, ..., γ_n)^T.
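The threshold α(µ, x) in (2.7)-(2.8) has no closed form, but Σ_i v_i(µ, x) is strictly decreasing in α, so α can be found by bisection. The sketch below is an illustration only (not Shi's original code, which was written in MATLAB); it computes v(µ, x) and shows that for small µ the weights approach the 0-1 indicator of the κ largest components, so that x^T v approaches f_κ(x):

```python
import numpy as np

def v_opt(x, kappa, mu, iters=200):
    """Optimal solution (2.7) of problem (2.6): v_i = sigmoid((x_i - alpha)/mu),
    with alpha chosen by bisection so that sum_i v_i = kappa (equation (2.8))."""
    def v(alpha):
        z = np.clip((x - alpha) / mu, -700.0, 700.0)   # avoid overflow in exp
        return 1.0 / (1.0 + np.exp(-z))
    # sum(v(lo)) ~ n and sum(v(hi)) ~ 0, so the root is bracketed
    lo, hi = x.min() - 50 * mu, x.max() + 50 * mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if v(mid).sum() > kappa else (lo, mid)
    return v(0.5 * (lo + hi))

x = np.array([3.0, 1.0, 2.0, 0.0])
v = v_opt(x, kappa=2, mu=1e-3)
# v is (numerically) feasible for Q_kappa, and x @ v ~ f_2(x) = 3 + 2 = 5
assert abs(v.sum() - 2.0) < 1e-8
assert abs(x @ v - 5.0) < 1e-2
```

The clipping of the sigmoid argument is a purely numerical safeguard; it does not affect the solution for the accuracies used here.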
Lemma 2.1. The optimal solution to problem (2.6), v(µ, x), is continuously differentiable on R_{++} × R^n, with

∇_x v(µ, x) = Diag(γ_1, ..., γ_n) − (1/Σ_{k=1}^n γ_k) γ γ^T. (2.10)

Proof. From (2.7), for each i = 1, ..., n,

v_i(µ, x)(1 + e^{(α(µ,x) − x_i)/µ}) = 1. (2.11)

Taking derivatives with respect to x on both sides of (2.11), we have

(∇_x v(µ, x))_i = (1/µ) [e^{(α(µ,x) − x_i)/µ} / (1 + e^{(α(µ,x) − x_i)/µ})²] (e_i − ∇_x α(µ, x)) = γ_i (e_i − ∇_x α(µ, x)). (2.12)

From (2.8), we have

∇_x α(µ, x) = (1/Σ_{k=1}^n γ_k) Σ_{i=1}^n γ_i e_i = (1/Σ_{k=1}^n γ_k) (γ_1, ..., γ_n)^T, (2.13)

where e_i ∈ R^n has its ith entry equal to 1 and all other entries zero. From (2.12) and (2.13), we obtain

(∇_x v(µ, x))_i = γ_i e_i − (γ_i / Σ_{k=1}^n γ_k) (γ_1, ..., γ_n)^T. (2.14)

Lemma 2.2. Each γ_i, i = 1, ..., n, given by (2.9) has the bounds

0 ≤ γ_i ≤ 1/(4µ). (2.15)
Proof. Denote p_i(µ, x) = e^{(α(µ,x) − x_i)/µ}, i = 1, ..., n. Then γ_i ≥ 0 and

γ_i = (1/µ) p_i/(1 + p_i)² = (1/µ) [1/4 − ((p_i − 1)/(2(1 + p_i)))²] ≤ 1/(4µ). (2.16)

The following theorem from [12] describes some properties of f_κ^µ(x).

Theorem 2.3. For µ > 0, x ∈ R^n, the function f_κ^µ(x) has the following properties:

1. f_κ^µ(x) is convex;
2. f_κ^µ(x) is continuously differentiable;
3. f_κ^µ(x) ≤ f_κ(x) ≤ f_κ^µ(x) + µR.

From the above theorem, for each µ > 0, f_κ^µ(x) is continuously differentiable. According to [12], the gradient of f_κ^µ(x) is the optimal solution to problem (2.6), i.e., v(µ, x). Therefore the gradient and Hessian of f_κ^µ(x) for any µ > 0 are given by

∇f_κ^µ(x) = v(µ, x); (2.17)
∇²f_κ^µ(x) = ∇_x v(µ, x). (2.18)

Up to now we have reviewed some of Shi's results. Based on these results, we provide an estimate of ∇²f_κ^µ(x) in the following theorem.
Theorem 2.4. For µ > 0, we have the following conclusions on ∇²f_κ^µ(x):

1. For all h ∈ R^n,

0 ≤ ⟨h, ∇²f_κ^µ(x) h⟩ ≤ (1/(4µ)) ‖h‖₂². (2.19)

2. Let (∇²f_κ^µ(x))_{ij} be the (i, j)th entry of ∇²f_κ^µ(x). Then for i ≠ j,

0 ≤ (∇²f_κ^µ(x))_{ii} ≤ 1/(4µ); (2.20)
−1/(4µ) ≤ (∇²f_κ^µ(x))_{ij} ≤ 0. (2.21)

Proof. First we prove part 1. Since f_κ^µ(x) is convex, its Hessian ∇²f_κ^µ(x) is positive semidefinite, i.e.,

⟨h, ∇²f_κ^µ(x) h⟩ ≥ 0.

By Lemma 2.1,

⟨h, ∇²f_κ^µ(x) h⟩ = ⟨h, ∇_x v(µ, x) h⟩ = Σ_{i=1}^n γ_i h_i² − (γ^T h)²/Σ_{k=1}^n γ_k ≤ Σ_{i=1}^n γ_i h_i² ≤ (1/(4µ)) Σ_{i=1}^n h_i² = (1/(4µ)) ‖h‖₂². (2.22)

Now let us prove part 2. For any 1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j,

(∇²f_κ^µ(x))_{ii} = (∇_x v(µ, x))_{ii} = γ_i − γ_i²/Σ_{k=1}^n γ_k = γ_i (1 − γ_i/Σ_{k=1}^n γ_k). (2.23)
According to Lemma 2.2, we have 0 ≤ γ_i ≤ 1/(4µ) and

0 ≤ 1 − γ_i/Σ_{k=1}^n γ_k ≤ 1.

Therefore,

0 ≤ (∇²f_κ^µ(x))_{ii} ≤ 1/(4µ). (2.24)

On the other hand,

(∇²f_κ^µ(x))_{ij} = (∇_x v(µ, x))_{ij} = −γ_i γ_j/Σ_{k=1}^n γ_k. (2.25)

Since

0 ≤ γ_j/Σ_{k=1}^n γ_k ≤ 1, (2.26)

we have

−1/(4µ) ≤ (∇²f_κ^µ(x))_{ij} ≤ 0, i ≠ j. (2.27)

Remark. Nesterov proved in [9, Theorem 1] that for problem (1.14), if the function θ_µ(·) (µ > 0) is continuously differentiable, then its gradient ∇θ_µ(x) = A* v_µ(x) is Lipschitz continuous with Lipschitz constant

L_µ = ‖A‖²_{1,2} / (µ σ₂).
For the smoothing function f_κ^µ(·), σ₂ = 4 and A is the identity operator, so Theorem 2.4 may also be derived from [9, Theorem 1]. Here we have provided a direct proof.

2.2 Spectral Functions

A function F on the space of m-by-m real symmetric matrices is called spectral if it depends only on the eigenvalues of its argument; spectral functions are just symmetric functions of the eigenvalues. A symmetric function is a function that is unchanged by any permutation of its variables. In this thesis, we are interested in functions F of a symmetric matrix argument that are invariant under orthogonal similarity transformations [6]:

F(U^T A U) = F(A), for all U ∈ O, A ∈ S^m, (2.28)

where O is the set of orthogonal matrices. Every such function can be decomposed as F(A) = (f ∘ λ)(A), where λ is the map that gives the eigenvalues of the matrix A and f is a symmetric function. We call such functions F spectral functions because they depend only on the spectrum of the operator A. Therefore, we can regard a spectral function as the composition of a symmetric function and the eigenvalue function. In order to state some preliminary results, we give the following definition. For each X ∈ S^m, define the set of orthonormal eigenvectors of X by

O_X := {P ∈ O : P^T X P = Diag[λ(X)]}.

We now recall the formula for the gradient of a differentiable spectral function [6].

Proposition 2.5. Let f be a symmetric function from R^n to R and X ∈ S^n. Then the following holds:
(a) (f ∘ λ) is differentiable at the point X if and only if f is differentiable at the point λ(X). In that case the gradient of (f ∘ λ) at X is given by

∇(f ∘ λ)(X) = U Diag[∇f(λ(X))] U^T, U ∈ O_X. (2.29)

(b) (f ∘ λ) is continuously differentiable at the point X if and only if f is continuously differentiable at the point λ(X).

Lewis and Sendov [7, Theorems 3.3 and 4.2] proved the following proposition, which gives the formula for calculating the Hessian of a spectral function.

Proposition 2.6. Let f : R^n → R be symmetric. Then for any X ∈ S^n, (f ∘ λ) is twice (continuously) differentiable at X if and only if f is twice (continuously) differentiable at λ(X). Moreover, in this case the Hessian of the spectral function at X is

∇²(f ∘ λ)(X)[H] = U(Diag[∇²f(λ(X)) diag[H̃]] + C(λ(X)) ∘ H̃) U^T, H ∈ S^n, (2.30)

where U is any orthogonal matrix in O_X and H̃ = U^T H U. The matrix C in Proposition 2.6 is defined as follows: C(ω) ∈ R^{n×n} with

(C(ω))_{ij} := 0, if i = j;
(C(ω))_{ij} := (∇²f(ω))_{ii} − (∇²f(ω))_{ij}, if i ≠ j and ω_i = ω_j; (2.31)
(C(ω))_{ij} := ((∇f(ω))_i − (∇f(ω))_j)/(ω_i − ω_j), otherwise.

2.3 Smoothing Functions for φ(x)

Consider the function φ(x) in (1.10); clearly, it is a composite function:

φ(x) = (f_κ ∘ λ)(C + A(x)), (2.32)
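The gradient formula (2.29) is easy to check numerically. In the sketch below (an independent illustration; the symmetric function f(y) = Σ_i e^{y_i} stands in for f_κ^µ), the directional derivative of (f ∘ λ) computed by central differences is compared with ⟨∇(f ∘ λ)(X), H⟩:

```python
import numpy as np

def spectral_value(X):
    # (f o lambda)(X) for the symmetric function f(y) = sum_i exp(y_i)
    return np.exp(np.linalg.eigvalsh(X)).sum()

def spectral_grad(X):
    # formula (2.29): grad (f o lambda)(X) = U Diag[grad f(lambda(X))] U^T
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.exp(lam)) @ U.T

X = np.array([[1.0, 0.3], [0.3, -0.5]])
H = np.array([[0.2, -0.1], [-0.1, 0.4]])
t = 1e-6
fd = (spectral_value(X + t * H) - spectral_value(X - t * H)) / (2 * t)
# the central difference matches <grad, H> to high accuracy
assert abs(fd - np.sum(spectral_grad(X) * H)) < 1e-6
```

Because f is symmetric, the formula does not depend on the ordering of the eigenvalues or on the choice of U ∈ O_X.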
where f_κ(x) is given in (2.1). In [12], Shi investigated the smoothing function f_κ^µ(x) as an approximation of f_κ(x). So it is natural to consider the composite function

φ_µ(x) = (f_κ^µ ∘ λ)(C + A(x)). (2.33)

According to the properties of f_κ^µ(·), for µ > 0,

φ_µ(x) ≤ φ(x) ≤ φ_µ(x) + µR, x ∈ Q. (2.34)

Since f_κ^µ(·) is a symmetric function, (f_κ^µ ∘ λ) is the composition of the symmetric function f_κ^µ(·) : R^m → R and the eigenvalue function λ(·) : S^m → R^m. Hence (f_κ^µ ∘ λ) is a spectral function, and consequently (f_κ^µ ∘ λ) is twice continuously differentiable. We will prove that φ_µ(x) is twice continuously differentiable and provide an estimate of the Lipschitz constant of its gradient. First, we derive the gradient of φ_µ(x). For µ > 0 and any h ∈ R^n, h → 0,

φ_µ(x + h) − φ_µ(x) = f_κ^µ(λ(C + A(x + h))) − f_κ^µ(λ(C + A(x)))
= f_κ^µ(λ(C + A(x) + A(h))) − f_κ^µ(λ(C + A(x)))
= ⟨∇(f_κ^µ ∘ λ)(C + A(x)), A(h)⟩ + O(‖h‖²) (2.35)
= ⟨A*(∇(f_κ^µ ∘ λ)(C + A(x))), h⟩ + O(‖h‖²),

where A*D = (⟨A_1, D⟩, ..., ⟨A_n, D⟩)^T. Thus we have the following proposition.

Proposition 2.7. For µ > 0, φ_µ(·) is continuously differentiable with its gradient given by

∇φ_µ(x) = A*(∇(f_κ^µ ∘ λ)(C + A(x))). (2.36)
Next, we consider the Hessian of φ_µ(x). First we define C_µ(ω) ∈ R^{n×n} as

(C_µ(ω))_{ij} := 0, if i = j;
(C_µ(ω))_{ij} := (∇²f_κ^µ(ω))_{ii} − (∇²f_κ^µ(ω))_{ij}, if i ≠ j and ω_i = ω_j; (2.37)
(C_µ(ω))_{ij} := ((∇f_κ^µ(ω))_i − (∇f_κ^µ(ω))_j)/(ω_i − ω_j), otherwise.

Proposition 2.8. For µ > 0, φ_µ(x) is twice continuously differentiable with its Hessian given by

∇²φ_µ(x)[h] = A*(∇²(f_κ^µ ∘ λ)(C + A(x))[H]), h ∈ R^n, (2.38)

where H := A(h).

Proof. For µ > 0 and any h ∈ R^n, h → 0,

⟨∇φ_µ(x + h) − ∇φ_µ(x), h⟩
= ⟨A*(∇(f_κ^µ ∘ λ)(C + A(x + h))) − A*(∇(f_κ^µ ∘ λ)(C + A(x))), h⟩
= ⟨A*(∇(f_κ^µ ∘ λ)(C + A(x + h)) − ∇(f_κ^µ ∘ λ)(C + A(x))), h⟩
= ⟨∇(f_κ^µ ∘ λ)(C + A(x + h)) − ∇(f_κ^µ ∘ λ)(C + A(x)), A(h)⟩ (2.39)
= ⟨∇(f_κ^µ ∘ λ)(C + A(x) + A(h)) − ∇(f_κ^µ ∘ λ)(C + A(x)), A(h)⟩
= ⟨∇²(f_κ^µ ∘ λ)(C + A(x))A(h), A(h)⟩ + O(‖h‖²)
= ⟨A*(∇²(f_κ^µ ∘ λ)(C + A(x)))H, h⟩ + O(‖h‖²),

which shows that (2.38) holds.

In order to estimate ∇²φ_µ(x), we need an estimate of the matrix C_µ(ω) defined in (2.37). The following lemma is motivated by Lewis and Sendov [7].

Lemma 2.9. For ω ∈ R^n with ω_i ≠ ω_j, i ≠ j, i, j = 1, ..., n, there exist ξ, η ∈ R^n such that

((∇f_κ^µ(ω))_i − (∇f_κ^µ(ω))_j)/(ω_i − ω_j) = (∇²f_κ^µ(ξ))_{ii} − (∇²f_κ^µ(η))_{ij}. (2.40)
Proof. For given ω ∈ R^n, define the two vectors ω̄ and ω̃ ∈ R^n coordinatewise as follows:

ω̄_p = ω_p, p ≠ i;  ω̄_i = ω_j;
ω̃_p = ω_p, p ≠ i, j;  ω̃_i = ω_j;  ω̃_j = ω_i. (2.41)

By the mean value theorem,

((∇f_κ^µ(ω))_i − (∇f_κ^µ(ω))_j)/(ω_i − ω_j)
= [(∇f_κ^µ(ω))_i − (∇f_κ^µ(ω̄))_i + (∇f_κ^µ(ω̄))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j)
= [(ω_i − ω_j)(∇²f_κ^µ(ξ))_{ii} + (∇f_κ^µ(ω̄))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j)
= (∇²f_κ^µ(ξ))_{ii} + [(∇f_κ^µ(ω̄))_i − (∇f_κ^µ(ω̃))_i + (∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j) (2.42)
= (∇²f_κ^µ(ξ))_{ii} + [(ω_j − ω_i)(∇²f_κ^µ(η))_{ij} + (∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j)
= (∇²f_κ^µ(ξ))_{ii} − (∇²f_κ^µ(η))_{ij} + [(∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j),

where ξ is a vector between ω and ω̄, and η is a vector between ω̄ and ω̃. We next consider the term [(∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j). By the definitions, we know

(∇f_κ^µ(ω̃))_i = ∂f_κ^µ(ω̃)/∂ω̃_i and (∇f_κ^µ(ω))_j = ∂f_κ^µ(ω)/∂ω_j. (2.43)

Since f_κ^µ(·) is a symmetric function and ω̃ is obtained from ω by swapping the ith and jth coordinates,

(∇f_κ^µ(ω̃))_i = (∇f_κ^µ(ω))_j. (2.44)

Consequently,

(∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j = 0.

Therefore (2.40) holds.
Lemma 2.10. For any 1 ≤ i, j ≤ n, each entry of C_µ(ω) has the following bound:

0 ≤ (C_µ(ω))_{ij} ≤ 1/(2µ). (2.45)

Proof. By (2.37) and Lemma 2.9,

(C_µ(ω))_{ij} = (∇²f_κ^µ(ω))_{ii} − (∇²f_κ^µ(ω))_{ij}, if i ≠ j and ω_i = ω_j;
(C_µ(ω))_{ij} = (∇²f_κ^µ(ξ))_{ii} − (∇²f_κ^µ(η))_{ij}, if i ≠ j and ω_i ≠ ω_j. (2.46)

For any x, y ∈ R^n, by Theorem 2.4,

(∇²f_κ^µ(x))_{ii} − (∇²f_κ^µ(y))_{ij} ≤ max (∇²f_κ^µ(x))_{ii} − min (∇²f_κ^µ(y))_{ij} = 1/(4µ) + 1/(4µ) = 1/(2µ); (2.47)
(∇²f_κ^µ(x))_{ii} − (∇²f_κ^µ(y))_{ij} ≥ min (∇²f_κ^µ(x))_{ii} − max (∇²f_κ^µ(y))_{ij} = 0. (2.48)

Both cases of (2.46) are of this form (take x = ω, y = ω in the first case and x = ξ, y = η in the second), so 0 ≤ (C_µ(ω))_{ij} ≤ 1/(2µ); the diagonal entries are zero by definition.

Next, we estimate the Lipschitz constant of ∇φ_µ(x). For any h ∈ R^n, we have

⟨h, ∇²φ_µ(x)[h]⟩ = ⟨h, A*(∇²(f_κ^µ ∘ λ)(C + A(x))[H])⟩ = ⟨H, ∇²(f_κ^µ ∘ λ)(C + A(x))[H]⟩, (2.49)

with H = A(h), and

∇²(f_κ^µ ∘ λ)(C + A(x))[H] = U(Diag[∇²f_κ^µ(λ(C + A(x))) diag[H̃]] + C_µ(λ(C + A(x))) ∘ H̃) U^T, (2.50)
where U ∈ O_{C+A(x)} and H̃ = U^T H U. Therefore

⟨h, ∇²φ_µ(x)[h]⟩ = diag(H̃)^T ∇²f_κ^µ(λ(C + A(x))) diag(H̃) + Σ_{i≠j} (C_µ(λ(C + A(x))))_{ij} H̃_{ij}²
≤ (1/(4µ)) Σ_{i=1}^m H̃_{ii}² + (1/(2µ)) Σ_{i≠j} H̃_{ij}²
≤ (1/(2µ)) ⟨H̃, H̃⟩
= (1/(2µ)) ⟨A(h), A(h)⟩
= (1/(2µ)) Σ_{i,j=1}^n h_i h_j ⟨A_i, A_j⟩
= (1/(2µ)) h^T G h ≤ (1/(2µ)) ‖G‖ ‖h‖², (2.51)

where G ∈ S^n, (G)_{ij} = ⟨A_i, A_j⟩, and ‖G‖ := max_{1≤i≤n} |λ_i(G)| = max{λ_1(G), −λ_n(G)}. Thus the Lipschitz constant for the gradient of the smoothing function φ_µ(x) is

L = ‖G‖/(2µ). (2.52)

In particular, if we take µ := µ(ɛ) = ɛ/(2R), where R = m ln m − κ ln κ − (m − κ) ln(m − κ), then we have

L = (R/ɛ) ‖G‖. (2.53)
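To make the bound (2.52) concrete, the following self-contained sketch (an illustration on a hypothetical random instance; Shi's equation (2.8) is solved here by simple bisection) assembles ∇φ_µ from (2.17), (2.29) and (2.36), and checks the Lipschitz inequality ‖∇φ_µ(x) − ∇φ_µ(y)‖ ≤ (‖G‖/(2µ)) ‖x − y‖ at a random pair of points:

```python
import numpy as np

def v_opt(lam, kappa, mu, iters=200):
    # optimal v of (2.6) at the eigenvalue vector lam (bisection on alpha in (2.8))
    def v(a):
        z = np.clip((lam - a) / mu, -700.0, 700.0)
        return 1.0 / (1.0 + np.exp(-z))
    lo, hi = lam.min() - 50 * mu, lam.max() + 50 * mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if v(mid).sum() > kappa else (lo, mid)
    return v(0.5 * (lo + hi))

def grad_phi(x, C, As, kappa, mu):
    # (2.36) with (2.29): grad phi_mu(x) = A*( U Diag[v(mu, lambda)] U^T )
    lam, U = np.linalg.eigh(C + sum(xi * Ai for xi, Ai in zip(x, As)))
    W = U @ np.diag(v_opt(lam, kappa, mu)) @ U.T
    return np.array([np.sum(Ai * W) for Ai in As])

rng = np.random.default_rng(1)
m, n, kappa, mu = 6, 3, 2, 0.05
sym = lambda P: 0.5 * (P + P.T)
C = sym(rng.uniform(-1, 1, (m, m)))
As = [sym(rng.uniform(-1, 1, (m, m))) for _ in range(n)]
Gram = np.array([[np.sum(Ai * Aj) for Aj in As] for Ai in As])  # the matrix G
L = np.abs(np.linalg.eigvalsh(Gram)).max() / (2 * mu)           # (2.52)
x, y = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
gap = np.linalg.norm(grad_phi(x, C, As, kappa, mu) - grad_phi(y, C, As, kappa, mu))
assert gap <= L * np.linalg.norm(x - y) + 1e-9
```

The instance sizes and the random seed here are arbitrary; any symmetric C, A_i would do.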
Chapter 3
Smoothing Functions for the Sum of the κ Largest Absolute Values of Eigenvalues

In order to solve problem (1.13), we need the concept of singular values. Using some properties of singular values, we can obtain a computable smoothing function for ψ(x).

3.1 Preliminaries

Similar to the eigenvalue decomposition of a symmetric matrix, a non-symmetric matrix has a singular value decomposition. Let A ∈ M_{n,m}, and without loss of generality assume n ≤ m. Then there exist orthogonal matrices U ∈ M_{n,n} and V ∈ M_{m,m} such that A has the following singular value decomposition (SVD):

U^T A V = [Σ(A) 0], (3.1)

where Σ(A) = diag(σ_1(A), ..., σ_n(A)) and σ_1(A) ≥ σ_2(A) ≥ ... ≥ σ_n(A) ≥ 0 are the singular values of A [5].
The singular values have the following property:

A A^T = U Σ²(A) U^T = U diag[σ_1(A)², ..., σ_n(A)²] U^T. (3.2)

In particular, if A is symmetric, we have

A A^T = A² = P diag[λ_1(A)², ..., λ_n(A)²] P^T, (3.3)

where λ_1(A), ..., λ_n(A) are the eigenvalues of A. Comparing (3.2) and (3.3), the singular values are the square roots of the corresponding eigenvalues of A A^T, which means that σ_i(A) = |λ_i(A)|, i = 1, ..., n (up to reordering), for every symmetric matrix A. For any W ∈ S^{n+m}, we define

Λ(W) = diag(λ_1(W), ..., λ_n(W), λ_{n+m}(W), ..., λ_{n+1}(W)), (3.4)

where {λ_i(W) : i = 1, ..., n+m} are the eigenvalues of W arranged in decreasing order. Note that the first n diagonal entries of Λ(W) are the n largest eigenvalues of W, arranged in decreasing order, while the last m diagonal entries are the m smallest eigenvalues of W, arranged in increasing order. This arrangement will prove convenient shortly. Define the linear operator Ξ : M_{n,m} → S^{m+n} by

Ξ(B) = [ 0    B ]
       [ B^T  0 ],  B ∈ M_{n,m}. (3.5)

For A ∈ S^m, let it have the eigenvalue decomposition

V^T A V = Σ(A), V ∈ O_A. (3.6)

The following result is derived from Golub and Van Loan [5, Section 8.6].

Proposition 3.1. Suppose that A ∈ S^m has the eigenvalue decomposition (3.6). Then Ξ(A) has the spectral decomposition

Ξ(A) = Q Λ(Ξ(A)) Q^T = Q [ Σ(A)   0    ] Q^T, (3.7)
                          [ 0    −Σ(A) ]
where Q ∈ O^{2m} is defined by

Q = (1/√2) [ V   V ]
           [ V  −V ], (3.8)

i.e. the eigenvalues of Ξ(A) are ±σ_i(A).

3.2 Smoothing Functions for ψ(x)

From Proposition 3.1, we know that

ψ(x) = Σ_{i=1}^κ λ_i(Ξ(C + A(x))). (3.9)

Similar to Chapter 2, we define the smoothing function for ψ(x) by

ψ_µ(x) = (f_κ^µ ∘ λ)(Ξ(C + A(x))), (3.10)

where f_κ^µ(·) is the smoothing function defined in Chapter 2 and κ ≤ m. For x ∈ R^n, we define a linear operator Γ : R^n → S^{2m} as follows:

Γ(x) = Ξ(A(x)). (3.11)

Let us consider the adjoint of Γ. For Y ∈ S^{2m},

⟨Γ(x), Y⟩ = ⟨Ξ(A(x)), Y⟩ = ⟨[ 0  A(x) ; (A(x))^T  0 ], [ Y_1  Y_2 ; Y_2^T  Y_3 ]⟩
= ⟨A(x), Y_2⟩ + ⟨(A(x))^T, Y_2^T⟩ = 2⟨A(x), Y_2⟩ = Σ_{i=1}^n 2 x_i ⟨A_i, Y_2⟩. (3.12)

Thus,

Γ*(Y) = (2⟨A_1, Y_2⟩, ..., 2⟨A_n, Y_2⟩)^T.
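Proposition 3.1 can be verified numerically. The sketch below (an illustration with a hypothetical random symmetric A) builds Ξ(A) from (3.5) and confirms that its spectrum is exactly {±σ_i(A)}:

```python
import numpy as np

def Xi(B):
    """Operator (3.5): Xi(B) = [[0, B], [B^T, 0]]."""
    n, m = B.shape
    return np.block([[np.zeros((n, n)), B], [B.T, np.zeros((m, m))]])

rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, (4, 4))
A = 0.5 * (P + P.T)                       # symmetric test matrix
sigma = np.abs(np.linalg.eigvalsh(A))     # singular values of a symmetric matrix
eigs = np.linalg.eigvalsh(Xi(A))
# the eigenvalues of Xi(A) are exactly {+sigma_i, -sigma_i}
assert np.allclose(np.sort(eigs), np.sort(np.concatenate([sigma, -sigma])))
```

This is the identity that lets the sum of the κ largest absolute eigenvalues of C + A(x) be written as the sum of the κ largest (ordinary) eigenvalues of Ξ(C + A(x)), as in (3.9).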
The smoothing function ψ_µ(x) then takes the form

ψ_µ(x) = (f_κ^µ ∘ λ)(Ξ(C) + Γ(x)). (3.13)

Since we already know that, for any µ > 0, (f_κ^µ ∘ λ) is twice continuously differentiable, ψ_µ(·) is also twice continuously differentiable. We now discuss its gradient and Hessian.

Proposition 3.2. For µ > 0, ψ_µ(·) is continuously differentiable with its gradient given by

∇ψ_µ(x) = Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))). (3.14)

Proof. For µ > 0 and any h ∈ R^n, h → 0,

ψ_µ(x + h) − ψ_µ(x) = f_κ^µ(λ(Ξ(C) + Γ(x + h))) − f_κ^µ(λ(Ξ(C) + Γ(x)))
= f_κ^µ(λ(Ξ(C) + Γ(x) + Γ(h))) − f_κ^µ(λ(Ξ(C) + Γ(x)))
= ⟨∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)), Γ(h)⟩ + O(‖h‖²) (3.15)
= ⟨Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))), h⟩ + O(‖h‖²),

which agrees with (3.14).

Proposition 3.3. For µ > 0, ψ_µ(·) is twice continuously differentiable with its Hessian given by

∇²ψ_µ(x)[h] = Γ*(∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H]), h ∈ R^n, (3.16)

where H := Γ(h).
Proof. For µ > 0 and any h ∈ R^n, h → 0,

⟨∇ψ_µ(x + h) − ∇ψ_µ(x), h⟩
= ⟨Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x + h))) − Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))), h⟩
= ⟨Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x + h)) − ∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))), h⟩
= ⟨∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x + h)) − ∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)), Γ(h)⟩ (3.17)
= ⟨∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x) + Γ(h)) − ∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)), Γ(h)⟩
= ⟨∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))Γ(h), Γ(h)⟩ + O(‖h‖²)
= ⟨Γ*(∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)))H, h⟩ + O(‖h‖²),

which shows that (3.16) holds.

Next, we estimate the Lipschitz constant of ∇ψ_µ(x). For any h ∈ R^n, we have

⟨h, ∇²ψ_µ(x)[h]⟩ = ⟨h, Γ*(∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H])⟩ = ⟨H, ∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H]⟩, (3.18)

with

∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H] = U(Diag[∇²f_κ^µ(λ(Ξ(C) + Γ(x))) diag[H̃]] + C_µ(λ(Ξ(C) + Γ(x))) ∘ H̃) U^T, (3.19)
where U ∈ O_{Ξ(C)+Γ(x)} and H̃ = U^T H U. Therefore

⟨h, ∇²ψ_µ(x)[h]⟩ = diag(H̃)^T ∇²f_κ^µ(λ(Ξ(C) + Γ(x))) diag(H̃) + Σ_{i≠j} (C_µ(λ(Ξ(C) + Γ(x))))_{ij} H̃_{ij}²
≤ (1/(4µ)) Σ_i H̃_{ii}² + (1/(2µ)) Σ_{i≠j} H̃_{ij}²
≤ (1/(2µ)) ⟨H̃, H̃⟩
= (1/(2µ)) ⟨Γ(h), Γ(h)⟩
= (1/µ) ⟨A(h), A(h)⟩ (3.20)
≤ (1/µ) ‖G‖ ‖h‖².

Thus, the Lipschitz constant for the gradient of the smoothing function ψ_µ(x) is

L = ‖G‖/µ. (3.21)

In particular, if we take µ := µ(ɛ) = ɛ/(2R), where R := 2m ln(2m) − κ ln κ − (2m − κ) ln(2m − κ), then we have

L = (2R/ɛ) ‖G‖. (3.22)
Chapter 4
Numerical Experiments

Nesterov's method solves the following optimization problem:

min_{x∈Q} f(x), (4.1)

where f is a convex function whose gradient satisfies the Lipschitz condition

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖, x, y ∈ R^n, (4.2)

with L > 0 a constant. Consider a prox-function d(x) on Q. We assume that d(x) is continuous and strongly convex on Q with convexity parameter σ > 0. Let x_0 be defined by

x_0 = arg min_{x∈Q} d(x). (4.3)

Without loss of generality we assume d(x_0) = 0. Thus, for any x ∈ Q, we have

d(x) ≥ (1/2) σ ‖x − x_0‖². (4.4)

Define

T_Q(x) = arg min_y { ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖² : y ∈ Q }. (4.5)

Now we are ready to give Nesterov's smoothing algorithm [9]. For k ≥ 0 do:
1. Compute f(x_k) and ∇f(x_k).
2. Find y_k = T_Q(x_k).
3. Find z_k = arg min_x { (L/σ) d(x) + Σ_{i=0}^k ((i+1)/2) [f(x_i) + ⟨∇f(x_i), x − x_i⟩] : x ∈ Q }.
4. Set x_{k+1} = (2/(k+3)) z_k + ((k+1)/(k+3)) y_k.

Nesterov proved the following theorem [9, Theorem 3]:

Theorem 4.1. Let the sequences {x_k}_{k=0}^∞ and {y_k}_{k=0}^∞ be generated by the above algorithm. Then for any k ≥ 0, we have

((k+1)(k+2)/4) f(y_k) ≤ min_x { (L/σ) d(x) + Σ_{i=0}^k ((i+1)/2) [f(x_i) + ⟨∇f(x_i), x − x_i⟩] : x ∈ Q }. (4.6)

Therefore,

f(y_k) − f(x*) ≤ 4 L d(x*) / (σ (k+1)(k+2)), (4.7)

where x* is an optimal solution to problem (4.1).

By applying Bregman's distance, Nesterov provided a modified algorithm, which gives a way to compute T_Q(x_k); in the new algorithm, we compute V_Q instead of T_Q. Bregman's distance was introduced in [3] as an extension of the usual metric discrepancy measure (x, y) ↦ ‖x − y‖². For the strongly convex prox-function d(·), the Bregman distance between two points z and x is defined as

ξ(z, x) = d(x) − d(z) − ⟨∇d(z), x − z⟩, x, z ∈ Q. (4.8)

The Bregman distance satisfies

ξ(z, x) ≥ (1/2) σ ‖x − z‖². (4.9)

Define the Bregman projection of a vector h as follows:

V_Q(z, h) = arg min_x { ⟨h, x − z⟩ + ξ(z, x) : x ∈ Q }. (4.10)
The following algorithm is Nesterov's algorithm via the Bregman distance [9]:

1. Choose y_0 = z_0 = arg min_x { (L/σ) d(x) + (1/2) [f(x_0) + ⟨∇f(x_0), x − x_0⟩] : x ∈ Q }.
2. For k ≥ 0 iterate:
   a. Find z_k = arg min_x { (L/σ) d(x) + Σ_{i=0}^k ((i+1)/2) [f(x_i) + ⟨∇f(x_i), x − x_i⟩] : x ∈ Q }.
   b. Set τ_k = 2/(k+3) and x_{k+1} = τ_k z_k + (1 − τ_k) y_k.
   c. Find x̂_{k+1} = V_Q(z_k, (σ/L) τ_k ∇f(x_{k+1})).
   d. Set y_{k+1} = τ_k x̂_{k+1} + (1 − τ_k) y_k.

Nesterov pointed out that Theorem 4.1 also holds for the above method. The computational results shown in the next section are obtained by applying this algorithm.
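For the simplex Q with the entropy prox-function d(x) = ln n + Σ_i x_i ln x_i (so σ = 1), every subproblem above has a closed form: the minimizers in steps 1 and 2a are softmax-type expressions, and the Bregman projection (4.10) is a multiplicative update x_i ∝ z_i e^{−h_i}. The sketch below is an illustrative reimplementation, not the thesis's MATLAB code; the quadratic test objective ½‖x − c‖² is a hypothetical stand-in for the smoothed objective φ_µ:

```python
import numpy as np

def softmax_argmin(s):
    # argmin over the simplex of <s, x> + sum_i x_i ln x_i  ->  x_i ∝ exp(-s_i)
    w = np.exp(s.min() - s)               # shift for numerical stability
    return w / w.sum()

def V(z, h):
    # Bregman projection (4.10) for the entropy distance: x_i ∝ z_i exp(-h_i)
    w = z * np.exp(h.min() - h)
    return w / w.sum()

def nesterov_bregman(grad, L, n, iters, sigma=1.0):
    x0 = np.full(n, 1.0 / n)              # prox center of the entropy d(x)
    acc = 0.5 * grad(x0)                  # running sum of ((i+1)/2) * grad f(x_i)
    y = z = softmax_argmin(sigma * acc / L)          # step 1: y_0 = z_0
    for k in range(iters):
        tau = 2.0 / (k + 3)
        x = tau * z + (1.0 - tau) * y                # step 2b
        yhat = V(z, (sigma / L) * tau * grad(x))     # step 2c
        y = tau * yhat + (1.0 - tau) * y             # step 2d
        acc += 0.5 * (k + 2) * grad(x)               # weight for i = k + 1
        z = softmax_argmin(sigma * acc / L)          # step 2a for the next k
    return y

c = np.array([0.5, 0.3, 0.2])             # minimizer of f over Q (interior point)
y = nesterov_bregman(lambda x: x - c, L=1.0, n=3, iters=300)
assert 0.5 * np.sum((y - c) ** 2) < 1e-3  # f(y_k) -> f* at the O(1/k^2) rate (4.7)
```

The same loop applies to the thesis's problems once `grad` returns ∇φ_µ or ∇ψ_µ and L is set to the constants (2.52) or (3.21).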
4.1 Computational Results

First, we solve the smoothed problem min_{x∈Q} φ_µ(x), where the closed convex set Q is given by

Q = {x ∈ R^n : Σ_{i=1}^n x_i = 1, 0 ≤ x_i ≤ 1, i = 1, 2, ..., n}.

Let P_i, i = 0, 1, ..., n, be m-by-m random matrices with entries in [−1, 1], and define C, A_1, A_2, ..., A_n by C = (1/2)(P_0 + P_0^T) and A_i = (1/2)(P_i + P_i^T), i = 1, ..., n. Let ɛ be the desired accuracy, i.e., φ(x̄) − φ* ≤ ɛ. R is defined by (2.5) and the prox-function is d(x) = ln n + Σ_{i=1}^n x_i ln x_i. According to Theorem 4.1, the iteration bound is N := ⌈(2/ɛ) √(2 R ‖G‖ ln n)⌉. We implemented this method in MATLAB. Tables 4.1, 4.2 and 4.3 report the numerical results for this problem.

Table 4.1: m = 10, n = 4 and different κ (columns: κ, ɛ, N, time, starting value, optimal value)

Table 4.2: m = 30, n = 4 and different κ (columns: κ, ɛ, N, time, starting value, optimal value)

Table 4.3: n = 4, κ = 4, ɛ = 0.01 and different m (columns: m, N, time, starting value, optimal value)
Next, we solve the smoothed problem min_{x∈Q} ψ_µ(x). The parameters Q, C, A_i, i = 1, ..., n, are defined as for φ_µ(x). The maximal number of iterations is N = ⌈(4/ɛ) √(R ‖G‖ ln n)⌉. We implemented this method in MATLAB; Table 4.4 contains the computational results.

Table 4.4: n = 4, κ = 3 and different m (columns: m, ɛ, N, time, starting value, optimal value)

4.2 Conclusions

Note that the two problems we have discussed can both be converted into semidefinite programming problems. One may then consider second-order approaches such as Newton's method to solve them. However, for high-dimensional problems, the efficiency of such approaches is not satisfactory. In this chapter we have reported some experiments, but our final goal is to solve high-dimensional problems. In our experiments, the number of iterations is very high. In order to achieve the required accuracy efficiently, we make the following observations on possible improvements of the smoothing algorithm.
Firstly, we can reduce the time spent on eigenvalue decomposition in each iteration. In our algorithm, we only need the first κ eigenvalues. In many cases κ ≪ m, and we can try a partial eigenvalue decomposition: in each decomposition step, we only compute the l largest eigenvalues, where l is a heuristic parameter with κ < l ≤ m. For instance, let l be κ + 3, κ + 5, or 2κ, 3κ. Then we compare the lth largest eigenvalue with the κth largest. If the lth eigenvalue is far smaller than the κth, the eigenvalues beyond the lth contribute little to the optimal value v^T λ in each iteration, which means that the corresponding v_i are very small for i > l. When we apply (2.29) to compute the gradient of φ_µ(x) or ψ_µ(x), these v_i with i > l also contribute little to the gradient matrix. Hence partial eigenvalue decomposition should not cause a great loss of accuracy. How to choose a proper l in each decomposition step, and how to estimate the resulting loss, remain topics for further research.

Secondly, Nesterov has provided an excessive gap algorithm [10], which is based on upper and lower bounds of the optimal value. In further research, we may try to apply this algorithm to our problems to reduce the total number of iterations.

Finally, in our analysis we proved that the gradients of the smoothing functions for the two classes of problems are Lipschitz continuous and derived the Lipschitz constants. The maximal iteration number depends on the Lipschitz constant, which in turn depends on the norm of the matrix G: if ‖G‖ becomes larger, the Lipschitz constant and hence the iteration number become larger. In Nesterov's analysis ‖A‖ is an operator norm, while in our result ‖G‖ is the spectral norm of the Gram matrix; its properties still need investigation.
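The first observation can be checked directly: with the weights v_i from (2.7), eigenvalues lying a few multiples of µ below the threshold α receive exponentially small weight. The sketch below (an illustration with a hypothetical random spectrum) shows that for l = 3κ the discarded weights Σ_{i>l} v_[i] are numerically negligible:

```python
import numpy as np

def v_opt(lam, kappa, mu, iters=200):
    # optimal v of (2.6)-(2.8) at the eigenvalue vector lam, by bisection on alpha
    def v(a):
        z = np.clip((lam - a) / mu, -700.0, 700.0)
        return 1.0 / (1.0 + np.exp(-z))
    lo, hi = lam.min() - 50 * mu, lam.max() + 50 * mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if v(mid).sum() > kappa else (lo, mid)
    return v(0.5 * (lo + hi))

rng = np.random.default_rng(2)
m, kappa, mu = 40, 3, 0.01
P = rng.uniform(-1, 1, (m, m))
lam = np.sort(np.linalg.eigvalsh(0.5 * (P + P.T)))[::-1]   # nonincreasing spectrum
v = v_opt(lam, kappa, mu)
l = 3 * kappa
# eigenvalues beyond the l-th contribute almost nothing to v (and to the gradient)
assert v[l:].sum() < 1e-6
assert v[:l].sum() > kappa - 1e-6
```

This supports computing only the l largest eigenpairs (e.g. with a Lanczos-type routine) instead of a full decomposition; quantifying the error for spectra without a clear gap is the open question mentioned above.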
38 Bibliography [1] H. H. Bauschke, J. M. Borwein, and P. L. Combettes, Bregman monotone optimization algorithms, SIAM Journal on Control and Optimization, vol. 42, pp , [2] A. Ben-Tal and M. Teboulle, A smoothing technique for nondifferentiable optimization problems, in Optimization. Fifth French German Conference, Lecture Notes in Math. 1405, Springer-Verlag, New York, 1989, [3] L. M. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, U.S.S.R Computational Mathematics and Mathematical Physics, vol. 7, pp , [4] X. Chen, H.D. Qi, L. Qi and K. Teo, Smooth convex approximation to the maximum eigenvalue function to appear in J. of Global Optimization. [5] G.H. Golub and C.F. Van Loan, Matrix Computations, the Johns Hopkins University Press, Baltimore, USA, Third Edition,(1996). 32
[6] A. S. Lewis, Derivatives of spectral functions, Mathematics of Operations Research, 21 (1996).

[7] A. S. Lewis and H. S. Sendov, Twice differentiable spectral functions, SIAM Journal on Matrix Analysis and Applications, 22 (2001).

[8] Yu. Nesterov, Unconstrained convex minimization in relative scale, CORE Discussion Paper 2003/96.

[9] Yu. Nesterov, Smooth minimization of non-smooth functions, CORE Discussion Paper 2003/12, February 2003. Accepted by Mathematical Programming.

[10] Yu. Nesterov, Excessive gap technique in non-smooth convex minimization, CORE Discussion Paper 2003/35, May 2003. Accepted by SIAM Journal on Optimization.

[11] Yu. Nesterov, Smoothing technique and its applications in semidefinite optimization, CORE Discussion Paper 2004/73, October 2004. Accepted by SIAM Journal on Optimization.

[12] S. Shi, Smooth convex approximation and its applications, Master's thesis, National University of Singapore.

[13] D. Sun and J. Sun, Strong semismoothness of eigenvalues of symmetric matrices and its application to inverse eigenvalue problems, SIAM Journal on Numerical Analysis, 40 (2003).
Name: Yu Qi
Degree: Master of Science
Department: Mathematics
Thesis Title: Smoothing Approximations for Two Classes of Convex Eigenvalue Optimization Problems

Abstract

In this thesis, we consider two problems: minimizing the sum of the κ largest eigenvalues, and minimizing the sum of the κ largest absolute values of eigenvalues, of a parametric linear operator. In order to apply Nesterov's smoothing algorithm to these two problems, we construct two computable smoothing functions whose gradients are Lipschitz continuous. The construction is based on Shi's thesis [12] and on new techniques introduced in this thesis. Numerical results on the performance of Nesterov's smoothing algorithm are also reported.

Keywords: Smoothing functions, Lipschitz constants, Smoothing algorithm, Eigenvalue problems.