SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS

YU QI (B.Sc.(Hons.), BUAA)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2005
Acknowledgements

I am very grateful to my supervisor, Dr Sun Defeng, for his abundant knowledge and kindness. In the past two years, Dr Sun has always helped me when I was in trouble and encouraged me when I lost confidence. This thesis would not have been finished without his invaluable suggestions and patient guidance. I would also like to thank Miss Shi Shengyuan, who has helped me so much in my study. Thanks also to all the staff and friends at the National University of Singapore for their help and support.

Yu Qi
August 2005
Summary

To solve a convex non-differentiable optimization problem, one may introduce a smooth convex function that approximates the non-differentiable objective and solve the resulting smooth convex optimization problem to obtain an approximate solution. Nesterov showed that if the gradient of the smoothing function is Lipschitz continuous, one may obtain an ɛ-approximate solution with the number of iterations bounded by O(1/ɛ) [9]. In [11], Nesterov discussed the problem of minimizing the maximal eigenvalue and the problem of minimizing the spectral radius. Recently, Shi [12] presented a smoothing function for the sum of the κ largest components of a vector. This smoothing function is highly useful because its composition with the eigenvalue functions yields smoothing functions that approximate all eigenvalue functions of a real symmetric matrix. In this thesis, we further study the properties of these smoothing functions for approximating the eigenvalue functions. In particular, we obtain an estimate of the Lipschitz constant of their gradients.
We then consider the problem of minimizing the sum of the κ largest eigenvalues by applying Nesterov's smoothing method [9]. Finally, we extend this algorithm to solve the problem of minimizing the sum of the κ largest absolute values of eigenvalues. These two problems are general forms of the problem of minimizing the maximal eigenvalue and the problem of minimizing the spectral radius, both considered in [11]. We also report some numerical results for both problems.

The organization of this thesis is as follows. In Chapter 1 we first describe the problems discussed in [11] by Nesterov and then extend them to general cases. In Chapters 2 and 3 we discuss some important properties of smoothing functions for approximating the sum of the κ largest eigenvalues and the sum of the κ largest absolute values of eigenvalues of a parametric affine operator, respectively. The smoothing algorithm and computational results are given in Chapter 4.
List of Notation

A, B, ... denote matrices; M_{n,m} denotes the set of n-by-m real matrices. S^m is the set of all m × m real symmetric matrices; O^m is the set of all m × m orthogonal matrices. A superscript T represents the transpose of matrices and vectors. For a matrix M, M_i and M^j represent the ith row and jth column of M, respectively; M_{ij} denotes the (i, j)th entry of M. A diagonal matrix is written as Diag(β_1, ..., β_n). We use ∘ to denote the Hadamard product between matrices, i.e. X ∘ Y = [X_{ij} Y_{ij}]_{i,j=1}^m. Let A_1, ..., A_n ∈ S^m be given; we define the linear operator A : R^n → S^m by

A(x) := Σ_{i=1}^n x_i A_i,  x ∈ R^n. (1)
Let A* : S^m → R^n be the adjoint of the linear operator A : R^n → S^m defined by (1):

⟨d, A*D⟩ = ⟨D, Ad⟩, (d, D) ∈ R^n × S^m.

Hence, for all D ∈ S^m, A*D = (⟨A_1, D⟩, ..., ⟨A_n, D⟩)^T. Denote by G ∈ S^n the matrix with (G)_{ij} = ⟨A_i, A_j⟩. The eigenvalues of X ∈ S^m are designated by λ_i(X), i = 1, ..., m, with λ_1(X) ≥ λ_2(X) ≥ ··· ≥ λ_m(X). We write X = O(α) (respectively, o(α)) if ‖X‖/α is uniformly bounded (respectively, tends to zero) as α → 0. For B ∈ M_{n,m}, the operator Ξ : M_{n,m} → S^{m+n} is defined as

Ξ(B) = [ 0    B ]
       [ B^T  0 ].

We define the linear operator Γ : R^n → S^{2m} as

Γ(x) = Ξ(A(x)). (2)

Let Γ* : S^{2m} → R^n be the adjoint operator of Γ; for any Y ∈ S^{2m},

Γ*(Y) = (2⟨A_1, Y_2⟩, ..., 2⟨A_n, Y_2⟩)^T,

where Y = [ Y_1  Y_2 ; Y_2^T  Y_3 ] with Y_1, Y_3 ∈ S^m and Y_2 ∈ M_{m,m}.
Contents

Acknowledgements
Summary
List of Notation
1 Introduction
2 Properties of the Smoothing Function for the Sum of the κ Largest Eigenvalues
  2.1 The Smoothing Function for the Sum of the κ Largest Components
  2.2 Spectral Functions
  2.3 Smoothing Functions for φ(x)
3 Smoothing Functions for the Sum of the κ Largest Absolute Values of Eigenvalues
  3.1 Preliminaries
  3.2 Smoothing Functions for ψ(x)
4 Numerical Experiments
  4.1 Computational Results
  4.2 Conclusions
Bibliography
Chapter 1
Introduction

Let S^m be the set of real m-by-m symmetric matrices. For X ∈ S^m, let {λ_i(X)}_{i=1}^m be the eigenvalues of X sorted in nonincreasing order, i.e. λ_1(X) ≥ λ_2(X) ≥ ··· ≥ λ_κ(X) ≥ ··· ≥ λ_m(X). Let A_1, ..., A_n ∈ S^m be given. Define the operator A : R^n → S^m by

A(x) = Σ_{i=1}^n x_i A_i,  x ∈ R^n. (1.1)

Let Q be a bounded closed convex set in R^n and C ∈ S^m. In [11], Nesterov considered the following nonsmooth problem:

min_{x∈Q} λ_1(C + A(x)). (1.2)

Nesterov's approach is to replace the nonsmooth function λ_1(C + A(x)) by the smooth function

S_µ(C + A(x)), (1.3)

where µ > 0 is a tolerance parameter and S_µ(X) is µ times the entropy (log-sum-exp) function of the eigenvalues, i.e.

S_µ(X) = µ ln[Σ_{i=1}^m e^{λ_i(X)/µ}]. (1.4)
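As a numerical illustration (not taken from [11]; the matrix X below is an arbitrary example), the entropy smoothing (1.4) can be evaluated with any standard symmetric eigenvalue routine. It always satisfies λ_1(X) ≤ S_µ(X) ≤ λ_1(X) + µ ln m, so it converges to λ_1(X) as µ → 0:

```python
import numpy as np

def S(X, mu):
    """Entropy smoothing (1.4): S_mu(X) = mu * ln(sum_i exp(lambda_i(X)/mu))."""
    lam = np.linalg.eigvalsh(X)
    top = lam.max()                      # shift for a numerically stable log-sum-exp
    return top + mu * np.log(np.exp((lam - top) / mu).sum())

X = np.array([[2.0, 1.0],
              [1.0, 0.5]])
lam1 = np.linalg.eigvalsh(X).max()
# S_mu(X) stays within a band of width mu*ln(m) above lambda_1(X)
for mu in (1.0, 0.1, 0.01):
    assert lam1 <= S(X, mu) <= lam1 + mu * np.log(X.shape[0])
```

The shift by the largest eigenvalue inside the log-sum-exp avoids overflow for small µ, without changing the value of S_µ(X).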
The function S_µ(X) approximates λ_1(X) as µ → 0. Now, let us consider the following smoothed optimization problem:

min_{x∈Q} S_µ(C + A(x)). (1.5)

Note that for any µ > 0, the gradient mapping ∇S_µ(·) is globally Lipschitz continuous. Nesterov suggested using a gradient-based numerical method [9] to solve problem (1.5). For a given matrix X ∈ S^m, we define its spectral radius by

ρ(X) := max_{1≤i≤m} |λ_i(X)| = max{λ_1(X), −λ_m(X)}. (1.6)

Let ϕ(x) := ρ(C + A(x)). Another problem discussed in Nesterov's paper [11] is

min_{x∈Q} ϕ(x) (1.7)

with C ⪰ 0. Nesterov constructed the smoothing function ϕ_p(x) = F_p(A(x)), x ∈ Q, where

F_p(X) = ⟨X^{2p}, I⟩^{1/(2p)}. (1.8)

Nesterov considered the smoothed problem

min_{x∈Q} ϕ_p(x) (1.9)

and used the method of [8] to solve it. In this thesis, we shall extend Nesterov's approach to the following two problems. Let φ(x) be the sum of the κ largest eigenvalues of C + A(x), i.e.

φ(x) = Σ_{i=1}^κ λ_i(C + A(x)). (1.10)
We shall first consider in this thesis the following nonsmooth problem:

min_{x∈Q} φ(x). (1.11)

Clearly, if κ = 1, (1.11) reduces to problem (1.2). Let λ_[κ](X) be the κth largest absolute value of the eigenvalues of X, sorted in nonincreasing order, i.e. λ_[1](X) ≥ λ_[2](X) ≥ ··· ≥ λ_[κ](X) ≥ ··· ≥ λ_[n](X). We define the following function:

ψ(x) = Σ_{i=1}^κ λ_[i](C + A(x)). (1.12)

The second problem that we shall consider in this thesis is

min_{x∈Q} ψ(x), (1.13)

which is a general case of (1.7). We shall construct smoothing functions for problems (1.11) and (1.13), respectively. The gradients of the smoothing functions must satisfy the global Lipschitz condition, which makes it possible for us to apply Nesterov's algorithm [9] to solve the smoothed problems. Nesterov's method solves the optimization problem

min_{x∈Q} θ(x), (1.14)

where θ(·) is a nonsmooth convex function. Our goal is to find an ɛ-solution x̄ ∈ Q, i.e.

θ(x̄) − θ* ≤ ɛ, (1.15)

where θ* = min_{x∈Q} θ(x). We denote by θ_µ(·) a smoothing function for θ(·). The smoothing function θ_µ(·) satisfies the following inequality:

θ_µ(x) ≤ θ(x) ≤ θ_µ(x) + µR, (1.16)
where R is a constant for a specified smoothing function θ_µ(x) (we will give its definition in Chapter 2). Nesterov proved the following inequality [9, Theorem 2]:

θ(x̄) − θ* ≤ θ_µ(x̄) − θ*_µ + µR,

where θ*_µ = min_{x∈Q} θ_µ(x). Let µ = µ(ɛ) = ɛ/(2R) and suppose

θ_µ(x̄) − θ*_µ ≤ ɛ/2. (1.17)

Then θ(x̄) − θ* ≤ ɛ. Nesterov's algorithm improves the bound on the number of iterations to O(1/ɛ), while traditional subgradient algorithms need O(1/ɛ²) iterations.

The remaining part of this thesis is organized as follows. We discuss the properties of smoothing functions for approximating the sum of the κ largest eigenvalues of a parametric affine operator in Chapter 2 and the sum of the κ largest absolute values of eigenvalues in Chapter 3. Computational results are given in Chapter 4.
Chapter 2
Properties of the Smoothing Function for the Sum of the κ Largest Eigenvalues

2.1 The Smoothing Function for the Sum of the κ Largest Components

In her thesis [12], Shi discussed the smoothing function for the sum of the κ largest components of a vector. For x ∈ R^n we denote by x_[κ] the κth largest component of x, i.e., x_[1] ≥ x_[2] ≥ ··· ≥ x_[κ] ≥ ··· ≥ x_[n], sorted in nonincreasing order. Define

f_κ(x) = Σ_{i=1}^κ x_[i]. (2.1)

Denote by Q_κ the convex set in R^n:

Q_κ = {v ∈ R^n : Σ_{i=1}^n v_i = κ, 0 ≤ v_i ≤ 1, i = 1, 2, ..., n}. (2.2)
Let

p(z) = z ln z, z ∈ (0, 1];  p(0) = 0, (2.3)

and

r(v) = Σ_{i=1}^n p(v_i) + Σ_{i=1}^n p(1 − v_i) + R, v ∈ Q_κ, (2.4)

where

R := n ln n − κ ln κ − (n − κ) ln(n − κ), (2.5)

which is the maximal value of r(v). The smoothing function for f_κ(·) is f_κ^µ(·) : R^n → R, defined by

f_κ^µ(x) := max { x^T v − µ r(v) : Σ_{i=1}^n v_i = κ, 0 ≤ v_i ≤ 1, i = 1, ..., n }. (2.6)

Shi also derived the optimal solution of the problem in (2.6):

v_i(µ, x) = e^{(x_i − α(µ,x))/µ} / (1 + e^{(x_i − α(µ,x))/µ}), (2.7)

where α = α(µ, x) satisfies

Σ_{i=1}^n v_i(µ, x) = κ. (2.8)

In order to introduce Lemma 2.1 of [12], we need the definition of the γ function:

γ_i(µ, x) := e^{(α(µ,x) − x_i)/µ} / ( µ (1 + e^{(α(µ,x) − x_i)/µ})² ), i = 1, ..., n. (2.9)

Without causing any confusion, let γ_i := γ_i(µ, x), for i = 1, ..., n, and γ = (γ_1, ..., γ_n)^T.
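The threshold α(µ, x) in (2.7)-(2.8) has no closed form, but Σ_i v_i(µ, x) is strictly decreasing in α, so α can be found by bisection. The sketch below is an illustration only (not Shi's original code, which was written in MATLAB); it computes v(µ, x) and shows that for small µ the weights approach the 0-1 indicator of the κ largest components, so that x^T v approaches f_κ(x):

```python
import numpy as np

def v_opt(x, kappa, mu, iters=200):
    """Optimal solution (2.7) of problem (2.6): v_i = sigmoid((x_i - alpha)/mu),
    with alpha chosen by bisection so that sum_i v_i = kappa (equation (2.8))."""
    def v(alpha):
        z = np.clip((x - alpha) / mu, -700.0, 700.0)   # avoid overflow in exp
        return 1.0 / (1.0 + np.exp(-z))
    # sum(v(lo)) ~ n and sum(v(hi)) ~ 0, so the root is bracketed
    lo, hi = x.min() - 50 * mu, x.max() + 50 * mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if v(mid).sum() > kappa else (lo, mid)
    return v(0.5 * (lo + hi))

x = np.array([3.0, 1.0, 2.0, 0.0])
v = v_opt(x, kappa=2, mu=1e-3)
# v is (numerically) feasible for Q_kappa, and x @ v ~ f_2(x) = 3 + 2 = 5
assert abs(v.sum() - 2.0) < 1e-8
assert abs(x @ v - 5.0) < 1e-2
```

The clipping of the sigmoid argument is a purely numerical safeguard; it does not affect the solution for the accuracies used here.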
Lemma 2.1. The optimal solution to problem (2.6), v(µ, x), is continuously differentiable on R_{++} × R^n, with

∇_x v(µ, x) = Diag(γ_1, ..., γ_n) − (1/Σ_{k=1}^n γ_k) γ γ^T. (2.10)

Proof. From (2.7), for each i = 1, ..., n,

v_i(µ, x)(1 + e^{(α(µ,x) − x_i)/µ}) = 1. (2.11)

Taking derivatives with respect to x on both sides of (2.11), we have

(∇_x v(µ, x))_i = (1/µ) [e^{(α(µ,x) − x_i)/µ} / (1 + e^{(α(µ,x) − x_i)/µ})²] (e_i − ∇_x α(µ, x)) = γ_i (e_i − ∇_x α(µ, x)). (2.12)

From (2.8), we have

∇_x α(µ, x) = (1/Σ_{k=1}^n γ_k) Σ_{i=1}^n γ_i e_i = (1/Σ_{k=1}^n γ_k) (γ_1, ..., γ_n)^T, (2.13)

where e_i ∈ R^n has its ith entry equal to 1 and all other entries zero. From (2.12) and (2.13), we obtain

(∇_x v(µ, x))_i = γ_i e_i − (γ_i / Σ_{k=1}^n γ_k) (γ_1, ..., γ_n)^T. (2.14)

Lemma 2.2. Each γ_i, i = 1, ..., n, given by (2.9) has the bounds

0 ≤ γ_i ≤ 1/(4µ). (2.15)
Proof. Denote p_i(µ, x) = e^{(α(µ,x) − x_i)/µ}, i = 1, ..., n. Then γ_i ≥ 0 and

γ_i = (1/µ) p_i/(1 + p_i)² = (1/µ) [1/4 − ((p_i − 1)/(2(1 + p_i)))²] ≤ 1/(4µ). (2.16)

The following theorem from [12] describes some properties of f_κ^µ(x).

Theorem 2.3. For µ > 0, x ∈ R^n, the function f_κ^µ(x) has the following properties:

1. f_κ^µ(x) is convex;
2. f_κ^µ(x) is continuously differentiable;
3. f_κ^µ(x) ≤ f_κ(x) ≤ f_κ^µ(x) + µR.

From the above theorem, for each µ > 0, f_κ^µ(x) is continuously differentiable. According to [12], the gradient of f_κ^µ(x) is the optimal solution to problem (2.6), i.e., v(µ, x). Therefore the gradient and Hessian of f_κ^µ(x) for any µ > 0 are given by

∇f_κ^µ(x) = v(µ, x); (2.17)
∇²f_κ^µ(x) = ∇_x v(µ, x). (2.18)

Up to now we have reviewed some of Shi's results. Based on these results, we provide an estimate of ∇²f_κ^µ(x) in the following theorem.
Theorem 2.4. For µ > 0, we have the following conclusions on ∇²f_κ^µ(x):

1. For all h ∈ R^n,

0 ≤ ⟨h, ∇²f_κ^µ(x) h⟩ ≤ (1/(4µ)) ‖h‖₂². (2.19)

2. Let (∇²f_κ^µ(x))_{ij} be the (i, j)th entry of ∇²f_κ^µ(x). Then for i ≠ j,

0 ≤ (∇²f_κ^µ(x))_{ii} ≤ 1/(4µ); (2.20)
−1/(4µ) ≤ (∇²f_κ^µ(x))_{ij} ≤ 0. (2.21)

Proof. First we prove part 1. Since f_κ^µ(x) is convex, its Hessian ∇²f_κ^µ(x) is positive semidefinite, i.e.,

⟨h, ∇²f_κ^µ(x) h⟩ ≥ 0.

By Lemma 2.1,

⟨h, ∇²f_κ^µ(x) h⟩ = ⟨h, ∇_x v(µ, x) h⟩ = Σ_{i=1}^n γ_i h_i² − (γ^T h)²/Σ_{k=1}^n γ_k ≤ Σ_{i=1}^n γ_i h_i² ≤ (1/(4µ)) Σ_{i=1}^n h_i² = (1/(4µ)) ‖h‖₂². (2.22)

Now let us prove part 2. For any 1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j,

(∇²f_κ^µ(x))_{ii} = (∇_x v(µ, x))_{ii} = γ_i − γ_i²/Σ_{k=1}^n γ_k = γ_i (1 − γ_i/Σ_{k=1}^n γ_k). (2.23)
According to Lemma 2.2, we have 0 ≤ γ_i ≤ 1/(4µ) and

0 ≤ 1 − γ_i/Σ_{k=1}^n γ_k ≤ 1.

Therefore,

0 ≤ (∇²f_κ^µ(x))_{ii} ≤ 1/(4µ). (2.24)

On the other hand,

(∇²f_κ^µ(x))_{ij} = (∇_x v(µ, x))_{ij} = −γ_i γ_j/Σ_{k=1}^n γ_k. (2.25)

Since

0 ≤ γ_j/Σ_{k=1}^n γ_k ≤ 1, (2.26)

we have

−1/(4µ) ≤ (∇²f_κ^µ(x))_{ij} ≤ 0, i ≠ j. (2.27)

Remark. Nesterov proved in [9, Theorem 1] that for problem (1.14), if the function θ_µ(·) (µ > 0) is continuously differentiable, then its gradient ∇θ_µ(x) = A* v_µ(x) is Lipschitz continuous with Lipschitz constant

L_µ = ‖A‖²_{1,2} / (µ σ₂).
For the smoothing function f_κ^µ(·), σ₂ = 4 and A is the identity operator, so Theorem 2.4 may also be derived from [9, Theorem 1]. Here we have provided a direct proof.

2.2 Spectral Functions

A function F on the space of m-by-m real symmetric matrices is called spectral if it depends only on the eigenvalues of its argument; spectral functions are just symmetric functions of the eigenvalues. A symmetric function is a function that is unchanged by any permutation of its variables. In this thesis, we are interested in functions F of a symmetric matrix argument that are invariant under orthogonal similarity transformations [6]:

F(U^T A U) = F(A), for all U ∈ O, A ∈ S^m, (2.28)

where O is the set of orthogonal matrices. Every such function can be decomposed as F(A) = (f ∘ λ)(A), where λ is the map that gives the eigenvalues of the matrix A and f is a symmetric function. We call such functions F spectral functions because they depend only on the spectrum of the operator A. Therefore, we can regard a spectral function as the composition of a symmetric function and the eigenvalue function. In order to state some preliminary results, we give the following definition. For each X ∈ S^m, define the set of orthonormal eigenvectors of X by

O_X := {P ∈ O : P^T X P = Diag[λ(X)]}.

We now recall the formula for the gradient of a differentiable spectral function [6].

Proposition 2.5. Let f be a symmetric function from R^n to R and X ∈ S^n. Then the following holds:
(a) (f ∘ λ) is differentiable at the point X if and only if f is differentiable at the point λ(X). In that case the gradient of (f ∘ λ) at X is given by

∇(f ∘ λ)(X) = U Diag[∇f(λ(X))] U^T, U ∈ O_X. (2.29)

(b) (f ∘ λ) is continuously differentiable at the point X if and only if f is continuously differentiable at the point λ(X).

Lewis and Sendov [7, Theorems 3.3 and 4.2] proved the following proposition, which gives the formula for calculating the Hessian of a spectral function.

Proposition 2.6. Let f : R^n → R be symmetric. Then for any X ∈ S^n, (f ∘ λ) is twice (continuously) differentiable at X if and only if f is twice (continuously) differentiable at λ(X). Moreover, in this case the Hessian of the spectral function at X is

∇²(f ∘ λ)(X)[H] = U(Diag[∇²f(λ(X)) diag[H̃]] + C(λ(X)) ∘ H̃) U^T, H ∈ S^n, (2.30)

where U is any orthogonal matrix in O_X and H̃ = U^T H U. The matrix C in Proposition 2.6 is defined as follows: C(ω) ∈ R^{n×n} with

(C(ω))_{ij} := 0, if i = j;
(C(ω))_{ij} := (∇²f(ω))_{ii} − (∇²f(ω))_{ij}, if i ≠ j and ω_i = ω_j; (2.31)
(C(ω))_{ij} := ((∇f(ω))_i − (∇f(ω))_j)/(ω_i − ω_j), otherwise.

2.3 Smoothing Functions for φ(x)

Consider the function φ(x) in (1.10); clearly, it is a composite function:

φ(x) = (f_κ ∘ λ)(C + A(x)), (2.32)
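The gradient formula (2.29) is easy to check numerically. In the sketch below (an independent illustration; the symmetric function f(y) = Σ_i e^{y_i} stands in for f_κ^µ), the directional derivative of (f ∘ λ) computed by central differences is compared with ⟨∇(f ∘ λ)(X), H⟩:

```python
import numpy as np

def spectral_value(X):
    # (f o lambda)(X) for the symmetric function f(y) = sum_i exp(y_i)
    return np.exp(np.linalg.eigvalsh(X)).sum()

def spectral_grad(X):
    # formula (2.29): grad (f o lambda)(X) = U Diag[grad f(lambda(X))] U^T
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.exp(lam)) @ U.T

X = np.array([[1.0, 0.3], [0.3, -0.5]])
H = np.array([[0.2, -0.1], [-0.1, 0.4]])
t = 1e-6
fd = (spectral_value(X + t * H) - spectral_value(X - t * H)) / (2 * t)
# the central difference matches <grad, H> to high accuracy
assert abs(fd - np.sum(spectral_grad(X) * H)) < 1e-6
```

Because f is symmetric, the formula does not depend on the ordering of the eigenvalues or on the choice of U ∈ O_X.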
where f_κ(x) is given in (2.1). In [12], Shi investigated the smoothing function f_κ^µ(x) as an approximation of f_κ(x). So it is natural to consider the composite function

φ_µ(x) = (f_κ^µ ∘ λ)(C + A(x)). (2.33)

According to the properties of f_κ^µ(·), for µ > 0,

φ_µ(x) ≤ φ(x) ≤ φ_µ(x) + µR, x ∈ Q. (2.34)

Since f_κ^µ(·) is a symmetric function, (f_κ^µ ∘ λ) is the composition of the symmetric function f_κ^µ(·) : R^m → R and the eigenvalue function λ(·) : S^m → R^m. Hence (f_κ^µ ∘ λ) is a spectral function, and consequently (f_κ^µ ∘ λ) is twice continuously differentiable. We will prove that φ_µ(x) is twice continuously differentiable and provide an estimate of the Lipschitz constant of its gradient. First, we derive the gradient of φ_µ(x). For µ > 0 and any h ∈ R^n, h → 0,

φ_µ(x + h) − φ_µ(x) = f_κ^µ(λ(C + A(x + h))) − f_κ^µ(λ(C + A(x)))
= f_κ^µ(λ(C + A(x) + A(h))) − f_κ^µ(λ(C + A(x)))
= ⟨∇(f_κ^µ ∘ λ)(C + A(x)), A(h)⟩ + O(‖h‖²) (2.35)
= ⟨A*(∇(f_κ^µ ∘ λ)(C + A(x))), h⟩ + O(‖h‖²),

where A*D = (⟨A_1, D⟩, ..., ⟨A_n, D⟩)^T. Thus we have the following proposition.

Proposition 2.7. For µ > 0, φ_µ(·) is continuously differentiable with its gradient given by

∇φ_µ(x) = A*(∇(f_κ^µ ∘ λ)(C + A(x))). (2.36)
Next, we consider the Hessian of φ_µ(x). First we define C_µ(ω) ∈ R^{n×n} as

(C_µ(ω))_{ij} := 0, if i = j;
(C_µ(ω))_{ij} := (∇²f_κ^µ(ω))_{ii} − (∇²f_κ^µ(ω))_{ij}, if i ≠ j and ω_i = ω_j; (2.37)
(C_µ(ω))_{ij} := ((∇f_κ^µ(ω))_i − (∇f_κ^µ(ω))_j)/(ω_i − ω_j), otherwise.

Proposition 2.8. For µ > 0, φ_µ(x) is twice continuously differentiable with its Hessian given by

∇²φ_µ(x)[h] = A*(∇²(f_κ^µ ∘ λ)(C + A(x))[H]), h ∈ R^n, (2.38)

where H := A(h).

Proof. For µ > 0 and any h ∈ R^n, h → 0,

⟨∇φ_µ(x + h) − ∇φ_µ(x), h⟩
= ⟨A*(∇(f_κ^µ ∘ λ)(C + A(x + h))) − A*(∇(f_κ^µ ∘ λ)(C + A(x))), h⟩
= ⟨A*(∇(f_κ^µ ∘ λ)(C + A(x + h)) − ∇(f_κ^µ ∘ λ)(C + A(x))), h⟩
= ⟨∇(f_κ^µ ∘ λ)(C + A(x + h)) − ∇(f_κ^µ ∘ λ)(C + A(x)), A(h)⟩ (2.39)
= ⟨∇(f_κ^µ ∘ λ)(C + A(x) + A(h)) − ∇(f_κ^µ ∘ λ)(C + A(x)), A(h)⟩
= ⟨∇²(f_κ^µ ∘ λ)(C + A(x))A(h), A(h)⟩ + O(‖h‖²)
= ⟨A*(∇²(f_κ^µ ∘ λ)(C + A(x)))H, h⟩ + O(‖h‖²),

which shows that (2.38) holds.

In order to estimate ∇²φ_µ(x), we need an estimate of the matrix C_µ(ω) defined in (2.37). The following lemma is motivated by Lewis and Sendov [7].

Lemma 2.9. For ω ∈ R^n with ω_i ≠ ω_j, i ≠ j, i, j = 1, ..., n, there exist ξ, η ∈ R^n such that

((∇f_κ^µ(ω))_i − (∇f_κ^µ(ω))_j)/(ω_i − ω_j) = (∇²f_κ^µ(ξ))_{ii} − (∇²f_κ^µ(η))_{ij}. (2.40)
Proof. For given ω ∈ R^n, define the two vectors ω̄ and ω̃ ∈ R^n coordinatewise as follows:

ω̄_p = ω_p, p ≠ i;  ω̄_i = ω_j;
ω̃_p = ω_p, p ≠ i, j;  ω̃_i = ω_j;  ω̃_j = ω_i. (2.41)

By the mean value theorem,

((∇f_κ^µ(ω))_i − (∇f_κ^µ(ω))_j)/(ω_i − ω_j)
= [(∇f_κ^µ(ω))_i − (∇f_κ^µ(ω̄))_i + (∇f_κ^µ(ω̄))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j)
= [(ω_i − ω_j)(∇²f_κ^µ(ξ))_{ii} + (∇f_κ^µ(ω̄))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j)
= (∇²f_κ^µ(ξ))_{ii} + [(∇f_κ^µ(ω̄))_i − (∇f_κ^µ(ω̃))_i + (∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j) (2.42)
= (∇²f_κ^µ(ξ))_{ii} + [(ω_j − ω_i)(∇²f_κ^µ(η))_{ij} + (∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j)
= (∇²f_κ^µ(ξ))_{ii} − (∇²f_κ^µ(η))_{ij} + [(∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j),

where ξ is a vector between ω and ω̄, and η is a vector between ω̄ and ω̃. We next consider the term [(∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j]/(ω_i − ω_j). By the definitions, we know

(∇f_κ^µ(ω̃))_i = ∂f_κ^µ(ω̃)/∂ω̃_i and (∇f_κ^µ(ω))_j = ∂f_κ^µ(ω)/∂ω_j. (2.43)

Since f_κ^µ(·) is a symmetric function and ω̃ is obtained from ω by swapping the ith and jth coordinates,

(∇f_κ^µ(ω̃))_i = (∇f_κ^µ(ω))_j. (2.44)

Consequently,

(∇f_κ^µ(ω̃))_i − (∇f_κ^µ(ω))_j = 0.

Therefore (2.40) holds.
Lemma 2.10. For any 1 ≤ i, j ≤ n, each entry of C_µ(ω) has the following bound:

0 ≤ (C_µ(ω))_{ij} ≤ 1/(2µ). (2.45)

Proof. By (2.37) and Lemma 2.9,

(C_µ(ω))_{ij} = (∇²f_κ^µ(ω))_{ii} − (∇²f_κ^µ(ω))_{ij}, if i ≠ j and ω_i = ω_j;
(C_µ(ω))_{ij} = (∇²f_κ^µ(ξ))_{ii} − (∇²f_κ^µ(η))_{ij}, if i ≠ j and ω_i ≠ ω_j. (2.46)

For any x, y ∈ R^n, by Theorem 2.4,

(∇²f_κ^µ(x))_{ii} − (∇²f_κ^µ(y))_{ij} ≤ max (∇²f_κ^µ(x))_{ii} − min (∇²f_κ^µ(y))_{ij} = 1/(4µ) + 1/(4µ) = 1/(2µ); (2.47)
(∇²f_κ^µ(x))_{ii} − (∇²f_κ^µ(y))_{ij} ≥ min (∇²f_κ^µ(x))_{ii} − max (∇²f_κ^µ(y))_{ij} = 0. (2.48)

Both cases of (2.46) are of this form (take x = ω, y = ω in the first case and x = ξ, y = η in the second), so 0 ≤ (C_µ(ω))_{ij} ≤ 1/(2µ); the diagonal entries are zero by definition.

Next, we estimate the Lipschitz constant of ∇φ_µ(x). For any h ∈ R^n, we have

⟨h, ∇²φ_µ(x)[h]⟩ = ⟨h, A*(∇²(f_κ^µ ∘ λ)(C + A(x))[H])⟩ = ⟨H, ∇²(f_κ^µ ∘ λ)(C + A(x))[H]⟩, (2.49)

with H = A(h), and

∇²(f_κ^µ ∘ λ)(C + A(x))[H] = U(Diag[∇²f_κ^µ(λ(C + A(x))) diag[H̃]] + C_µ(λ(C + A(x))) ∘ H̃) U^T, (2.50)
where U ∈ O_{C+A(x)} and H̃ = U^T H U. Therefore

⟨h, ∇²φ_µ(x)[h]⟩ = diag(H̃)^T ∇²f_κ^µ(λ(C + A(x))) diag(H̃) + Σ_{i≠j} (C_µ(λ(C + A(x))))_{ij} H̃_{ij}²
≤ (1/(4µ)) Σ_{i=1}^m H̃_{ii}² + (1/(2µ)) Σ_{i≠j} H̃_{ij}²
≤ (1/(2µ)) ⟨H̃, H̃⟩
= (1/(2µ)) ⟨A(h), A(h)⟩
= (1/(2µ)) Σ_{i,j=1}^n h_i h_j ⟨A_i, A_j⟩
= (1/(2µ)) h^T G h ≤ (1/(2µ)) ‖G‖ ‖h‖², (2.51)

where G ∈ S^n, (G)_{ij} = ⟨A_i, A_j⟩, and ‖G‖ := max_{1≤i≤n} |λ_i(G)| = max{λ_1(G), −λ_n(G)}. Thus the Lipschitz constant for the gradient of the smoothing function φ_µ(x) is

L = ‖G‖/(2µ). (2.52)

In particular, if we take µ := µ(ɛ) = ɛ/(2R), where R = m ln m − κ ln κ − (m − κ) ln(m − κ), then we have

L = (R/ɛ) ‖G‖. (2.53)
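To make the bound (2.52) concrete, the following self-contained sketch (an illustration on a hypothetical random instance; Shi's equation (2.8) is solved here by simple bisection) assembles ∇φ_µ from (2.17), (2.29) and (2.36), and checks the Lipschitz inequality ‖∇φ_µ(x) − ∇φ_µ(y)‖ ≤ (‖G‖/(2µ)) ‖x − y‖ at a random pair of points:

```python
import numpy as np

def v_opt(lam, kappa, mu, iters=200):
    # optimal v of (2.6) at the eigenvalue vector lam (bisection on alpha in (2.8))
    def v(a):
        z = np.clip((lam - a) / mu, -700.0, 700.0)
        return 1.0 / (1.0 + np.exp(-z))
    lo, hi = lam.min() - 50 * mu, lam.max() + 50 * mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if v(mid).sum() > kappa else (lo, mid)
    return v(0.5 * (lo + hi))

def grad_phi(x, C, As, kappa, mu):
    # (2.36) with (2.29): grad phi_mu(x) = A*( U Diag[v(mu, lambda)] U^T )
    lam, U = np.linalg.eigh(C + sum(xi * Ai for xi, Ai in zip(x, As)))
    W = U @ np.diag(v_opt(lam, kappa, mu)) @ U.T
    return np.array([np.sum(Ai * W) for Ai in As])

rng = np.random.default_rng(1)
m, n, kappa, mu = 6, 3, 2, 0.05
sym = lambda P: 0.5 * (P + P.T)
C = sym(rng.uniform(-1, 1, (m, m)))
As = [sym(rng.uniform(-1, 1, (m, m))) for _ in range(n)]
Gram = np.array([[np.sum(Ai * Aj) for Aj in As] for Ai in As])  # the matrix G
L = np.abs(np.linalg.eigvalsh(Gram)).max() / (2 * mu)           # (2.52)
x, y = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
gap = np.linalg.norm(grad_phi(x, C, As, kappa, mu) - grad_phi(y, C, As, kappa, mu))
assert gap <= L * np.linalg.norm(x - y) + 1e-9
```

The instance sizes and the random seed here are arbitrary; any symmetric C, A_i would do.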
Chapter 3
Smoothing Functions for the Sum of the κ Largest Absolute Values of Eigenvalues

In order to solve problem (1.13), we need the concept of singular values. Using some properties of singular values, we can obtain a computable smoothing function for ψ(x).

3.1 Preliminaries

Similar to the eigenvalue decomposition of a symmetric matrix, a non-symmetric matrix has a singular value decomposition. Let A ∈ M_{n,m}, and without loss of generality assume n ≤ m. Then there exist orthogonal matrices U ∈ M_{n,n} and V ∈ M_{m,m} such that A has the following singular value decomposition (SVD):

U^T A V = [Σ(A) 0], (3.1)

where Σ(A) = diag(σ_1(A), ..., σ_n(A)) and σ_1(A) ≥ σ_2(A) ≥ ... ≥ σ_n(A) ≥ 0 are the singular values of A [5].
The singular values have the following property:

A A^T = U Σ²(A) U^T = U diag[σ_1(A)², ..., σ_n(A)²] U^T. (3.2)

In particular, if A is symmetric, we have

A A^T = A² = P diag[λ_1(A)², ..., λ_n(A)²] P^T, (3.3)

where λ_1(A), ..., λ_n(A) are the eigenvalues of A. Comparing (3.2) and (3.3), the singular values are the square roots of the corresponding eigenvalues of A A^T, which means that σ_i(A) = |λ_i(A)|, i = 1, ..., n (up to reordering), for every symmetric matrix A. For any W ∈ S^{n+m}, we define

Λ(W) = diag(λ_1(W), ..., λ_n(W), λ_{n+m}(W), ..., λ_{n+1}(W)), (3.4)

where {λ_i(W) : i = 1, ..., n+m} are the eigenvalues of W arranged in decreasing order. Note that the first n diagonal entries of Λ(W) are the n largest eigenvalues of W, arranged in decreasing order, while the last m diagonal entries are the m smallest eigenvalues of W, arranged in increasing order. This arrangement will prove convenient shortly. Define the linear operator Ξ : M_{n,m} → S^{m+n} by

Ξ(B) = [ 0    B ]
       [ B^T  0 ],  B ∈ M_{n,m}. (3.5)

For A ∈ S^m, let it have the eigenvalue decomposition

V^T A V = Σ(A), V ∈ O_A. (3.6)

The following result is derived from Golub and Van Loan [5, Section 8.6].

Proposition 3.1. Suppose that A ∈ S^m has the eigenvalue decomposition (3.6). Then Ξ(A) has the spectral decomposition

Ξ(A) = Q Λ(Ξ(A)) Q^T = Q [ Σ(A)   0    ] Q^T, (3.7)
                          [ 0    −Σ(A) ]
where Q ∈ O^{2m} is defined by

Q = (1/√2) [ V   V ]
           [ V  −V ], (3.8)

i.e. the eigenvalues of Ξ(A) are ±σ_i(A).

3.2 Smoothing Functions for ψ(x)

From Proposition 3.1, we know that

ψ(x) = Σ_{i=1}^κ λ_i(Ξ(C + A(x))). (3.9)

Similar to Chapter 2, we define the smoothing function for ψ(x) by

ψ_µ(x) = (f_κ^µ ∘ λ)(Ξ(C + A(x))), (3.10)

where f_κ^µ(·) is the smoothing function defined in Chapter 2 and κ ≤ m. For x ∈ R^n, we define a linear operator Γ : R^n → S^{2m} as follows:

Γ(x) = Ξ(A(x)). (3.11)

Let us consider the adjoint of Γ. For Y ∈ S^{2m},

⟨Γ(x), Y⟩ = ⟨Ξ(A(x)), Y⟩ = ⟨[ 0  A(x) ; (A(x))^T  0 ], [ Y_1  Y_2 ; Y_2^T  Y_3 ]⟩
= ⟨A(x), Y_2⟩ + ⟨(A(x))^T, Y_2^T⟩ = 2⟨A(x), Y_2⟩ = Σ_{i=1}^n 2 x_i ⟨A_i, Y_2⟩. (3.12)

Thus,

Γ*(Y) = (2⟨A_1, Y_2⟩, ..., 2⟨A_n, Y_2⟩)^T.
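Proposition 3.1 can be verified numerically. The sketch below (an illustration with a hypothetical random symmetric A) builds Ξ(A) from (3.5) and confirms that its spectrum is exactly {±σ_i(A)}:

```python
import numpy as np

def Xi(B):
    """Operator (3.5): Xi(B) = [[0, B], [B^T, 0]]."""
    n, m = B.shape
    return np.block([[np.zeros((n, n)), B], [B.T, np.zeros((m, m))]])

rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, (4, 4))
A = 0.5 * (P + P.T)                       # symmetric test matrix
sigma = np.abs(np.linalg.eigvalsh(A))     # singular values of a symmetric matrix
eigs = np.linalg.eigvalsh(Xi(A))
# the eigenvalues of Xi(A) are exactly {+sigma_i, -sigma_i}
assert np.allclose(np.sort(eigs), np.sort(np.concatenate([sigma, -sigma])))
```

This is the identity that lets the sum of the κ largest absolute eigenvalues of C + A(x) be written as the sum of the κ largest (ordinary) eigenvalues of Ξ(C + A(x)), as in (3.9).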
The smoothing function ψ_µ(x) then takes the form

ψ_µ(x) = (f_κ^µ ∘ λ)(Ξ(C) + Γ(x)). (3.13)

Since we already know that, for any µ > 0, (f_κ^µ ∘ λ) is twice continuously differentiable, ψ_µ(·) is also twice continuously differentiable. We now discuss its gradient and Hessian.

Proposition 3.2. For µ > 0, ψ_µ(·) is continuously differentiable with its gradient given by

∇ψ_µ(x) = Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))). (3.14)

Proof. For µ > 0 and any h ∈ R^n, h → 0,

ψ_µ(x + h) − ψ_µ(x) = f_κ^µ(λ(Ξ(C) + Γ(x + h))) − f_κ^µ(λ(Ξ(C) + Γ(x)))
= f_κ^µ(λ(Ξ(C) + Γ(x) + Γ(h))) − f_κ^µ(λ(Ξ(C) + Γ(x)))
= ⟨∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)), Γ(h)⟩ + O(‖h‖²) (3.15)
= ⟨Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))), h⟩ + O(‖h‖²),

which agrees with (3.14).

Proposition 3.3. For µ > 0, ψ_µ(·) is twice continuously differentiable with its Hessian given by

∇²ψ_µ(x)[h] = Γ*(∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H]), h ∈ R^n, (3.16)

where H := Γ(h).
Proof. For µ > 0 and any h ∈ R^n, h → 0,

⟨∇ψ_µ(x + h) − ∇ψ_µ(x), h⟩
= ⟨Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x + h))) − Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))), h⟩
= ⟨Γ*(∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x + h)) − ∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))), h⟩
= ⟨∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x + h)) − ∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)), Γ(h)⟩ (3.17)
= ⟨∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x) + Γ(h)) − ∇(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)), Γ(h)⟩
= ⟨∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))Γ(h), Γ(h)⟩ + O(‖h‖²)
= ⟨Γ*(∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x)))H, h⟩ + O(‖h‖²),

which shows that (3.16) holds.

Next, we estimate the Lipschitz constant of ∇ψ_µ(x). For any h ∈ R^n, we have

⟨h, ∇²ψ_µ(x)[h]⟩ = ⟨h, Γ*(∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H])⟩ = ⟨H, ∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H]⟩, (3.18)

with

∇²(f_κ^µ ∘ λ)(Ξ(C) + Γ(x))[H] = U(Diag[∇²f_κ^µ(λ(Ξ(C) + Γ(x))) diag[H̃]] + C_µ(λ(Ξ(C) + Γ(x))) ∘ H̃) U^T, (3.19)
where U ∈ O_{Ξ(C)+Γ(x)} and H̃ = U^T H U. Therefore

⟨h, ∇²ψ_µ(x)[h]⟩ = diag(H̃)^T ∇²f_κ^µ(λ(Ξ(C) + Γ(x))) diag(H̃) + Σ_{i≠j} (C_µ(λ(Ξ(C) + Γ(x))))_{ij} H̃_{ij}²
≤ (1/(4µ)) Σ_i H̃_{ii}² + (1/(2µ)) Σ_{i≠j} H̃_{ij}²
≤ (1/(2µ)) ⟨H̃, H̃⟩
= (1/(2µ)) ⟨Γ(h), Γ(h)⟩
= (1/µ) ⟨A(h), A(h)⟩ (3.20)
≤ (1/µ) ‖G‖ ‖h‖².

Thus, the Lipschitz constant for the gradient of the smoothing function ψ_µ(x) is

L = ‖G‖/µ. (3.21)

In particular, if we take µ := µ(ɛ) = ɛ/(2R), where R := 2m ln(2m) − κ ln κ − (2m − κ) ln(2m − κ), then we have

L = (2R/ɛ) ‖G‖. (3.22)
Chapter 4
Numerical Experiments

Nesterov's method solves the following optimization problem:

min_{x∈Q} f(x), (4.1)

where f is a convex function whose gradient satisfies the Lipschitz condition

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖, x, y ∈ R^n, (4.2)

with L > 0 a constant. Consider a prox-function d(x) on Q. We assume that d(x) is continuous and strongly convex on Q with convexity parameter σ > 0. Let x_0 be defined by

x_0 = arg min_{x∈Q} d(x). (4.3)

Without loss of generality we assume d(x_0) = 0. Thus, for any x ∈ Q, we have

d(x) ≥ (1/2) σ ‖x − x_0‖². (4.4)

Define

T_Q(x) = arg min_y { ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖² : y ∈ Q }. (4.5)

Now we are ready to give Nesterov's smoothing algorithm [9]. For k ≥ 0 do:
1. Compute f(x_k) and ∇f(x_k).
2. Find y_k = T_Q(x_k).
3. Find z_k = arg min_x { (L/σ) d(x) + Σ_{i=0}^k ((i+1)/2) [f(x_i) + ⟨∇f(x_i), x − x_i⟩] : x ∈ Q }.
4. Set x_{k+1} = (2/(k+3)) z_k + ((k+1)/(k+3)) y_k.

Nesterov proved the following theorem [9, Theorem 3]:

Theorem 4.1. Let the sequences {x_k}_{k=0}^∞ and {y_k}_{k=0}^∞ be generated by the above algorithm. Then for any k ≥ 0, we have

((k+1)(k+2)/4) f(y_k) ≤ min_x { (L/σ) d(x) + Σ_{i=0}^k ((i+1)/2) [f(x_i) + ⟨∇f(x_i), x − x_i⟩] : x ∈ Q }. (4.6)

Therefore,

f(y_k) − f(x*) ≤ 4 L d(x*) / (σ (k+1)(k+2)), (4.7)

where x* is an optimal solution to problem (4.1).

By applying Bregman's distance, Nesterov provided a modified algorithm, which gives a way to compute T_Q(x_k); in the new algorithm, we compute V_Q instead of T_Q. Bregman's distance was introduced in [3] as an extension of the usual metric discrepancy measure (x, y) ↦ ‖x − y‖². For the strongly convex prox-function d(·), the Bregman distance between two points z and x is defined as

ξ(z, x) = d(x) − d(z) − ⟨∇d(z), x − z⟩, x, z ∈ Q. (4.8)

The Bregman distance satisfies

ξ(z, x) ≥ (1/2) σ ‖x − z‖². (4.9)

Define the Bregman projection of a vector h as follows:

V_Q(z, h) = arg min_x { ⟨h, x − z⟩ + ξ(z, x) : x ∈ Q }. (4.10)
The following algorithm is Nesterov's algorithm via the Bregman distance [9]:

1. Choose y_0 = z_0 = arg min_x { (L/σ) d(x) + (1/2) [f(x_0) + ⟨∇f(x_0), x − x_0⟩] : x ∈ Q }.
2. For k ≥ 0 iterate:
   a. Find z_k = arg min_x { (L/σ) d(x) + Σ_{i=0}^k ((i+1)/2) [f(x_i) + ⟨∇f(x_i), x − x_i⟩] : x ∈ Q }.
   b. Set τ_k = 2/(k+3) and x_{k+1} = τ_k z_k + (1 − τ_k) y_k.
   c. Find x̂_{k+1} = V_Q(z_k, (σ/L) τ_k ∇f(x_{k+1})).
   d. Set y_{k+1} = τ_k x̂_{k+1} + (1 − τ_k) y_k.

Nesterov pointed out that Theorem 4.1 also holds for the above method. The computational results shown in the next section are obtained by applying this algorithm.
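For the simplex Q with the entropy prox-function d(x) = ln n + Σ_i x_i ln x_i (so σ = 1), every subproblem above has a closed form: the minimizers in steps 1 and 2a are softmax-type expressions, and the Bregman projection (4.10) is a multiplicative update x_i ∝ z_i e^{−h_i}. The sketch below is an illustrative reimplementation, not the thesis's MATLAB code; the quadratic test objective ½‖x − c‖² is a hypothetical stand-in for the smoothed objective φ_µ:

```python
import numpy as np

def softmax_argmin(s):
    # argmin over the simplex of <s, x> + sum_i x_i ln x_i  ->  x_i ∝ exp(-s_i)
    w = np.exp(s.min() - s)               # shift for numerical stability
    return w / w.sum()

def V(z, h):
    # Bregman projection (4.10) for the entropy distance: x_i ∝ z_i exp(-h_i)
    w = z * np.exp(h.min() - h)
    return w / w.sum()

def nesterov_bregman(grad, L, n, iters, sigma=1.0):
    x0 = np.full(n, 1.0 / n)              # prox center of the entropy d(x)
    acc = 0.5 * grad(x0)                  # running sum of ((i+1)/2) * grad f(x_i)
    y = z = softmax_argmin(sigma * acc / L)          # step 1: y_0 = z_0
    for k in range(iters):
        tau = 2.0 / (k + 3)
        x = tau * z + (1.0 - tau) * y                # step 2b
        yhat = V(z, (sigma / L) * tau * grad(x))     # step 2c
        y = tau * yhat + (1.0 - tau) * y             # step 2d
        acc += 0.5 * (k + 2) * grad(x)               # weight for i = k + 1
        z = softmax_argmin(sigma * acc / L)          # step 2a for the next k
    return y

c = np.array([0.5, 0.3, 0.2])             # minimizer of f over Q (interior point)
y = nesterov_bregman(lambda x: x - c, L=1.0, n=3, iters=300)
assert 0.5 * np.sum((y - c) ** 2) < 1e-3  # f(y_k) -> f* at the O(1/k^2) rate (4.7)
```

The same loop applies to the thesis's problems once `grad` returns ∇φ_µ or ∇ψ_µ and L is set to the constants (2.52) or (3.21).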
4.1 Computational Results

First, we solve the smoothed problem min_{x∈Q} φ_µ(x), where the closed convex set Q is given by

Q = {x ∈ R^n : Σ_{i=1}^n x_i = 1, 0 ≤ x_i ≤ 1, i = 1, 2, ..., n}.

Let P_i, i = 0, 1, ..., n, be m-by-m random matrices with entries in [−1, 1], and define C, A_1, A_2, ..., A_n by C = (1/2)(P_0 + P_0^T) and A_i = (1/2)(P_i + P_i^T), i = 1, ..., n. Let ɛ be the desired accuracy, i.e., φ(x̄) − φ* ≤ ɛ. R is defined by (2.5) and the prox-function is d(x) = ln n + Σ_{i=1}^n x_i ln x_i. According to Theorem 4.1, the iteration bound is N := ⌈(2/ɛ) √(2 R ‖G‖ ln n)⌉. We implemented this method in MATLAB. Tables 4.1, 4.2 and 4.3 report the numerical results for this problem.

Table 4.1: m = 10, n = 4 and different κ (columns: κ, ɛ, N, time, starting value, optimal value)

Table 4.2: m = 30, n = 4 and different κ (columns: κ, ɛ, N, time, starting value, optimal value)

Table 4.3: n = 4, κ = 4, ɛ = 0.01 and different m (columns: m, N, time, starting value, optimal value)
Next, we solve the smoothed problem min_{x∈Q} ψ_µ(x). The parameters Q, C, A_i, i = 1, ..., n, are defined as for φ_µ(x). The maximal number of iterations is N = ⌈(4/ɛ) √(R ‖G‖ ln n)⌉. We implemented this method in MATLAB; Table 4.4 contains the computational results.

Table 4.4: n = 4, κ = 3 and different m (columns: m, ɛ, N, time, starting value, optimal value)

4.2 Conclusions

Note that the two problems we have discussed can both be converted into semidefinite programming problems. One may then consider second-order approaches such as Newton's method to solve them. However, for high-dimensional problems, the efficiency of such approaches is not satisfactory. In this chapter we have reported some experiments, but our final goal is to solve high-dimensional problems. In our experiments, the number of iterations is very high. In order to achieve the required accuracy efficiently, we make the following observations on possible improvements of the smoothing algorithm.
Firstly, we can reduce the time spent on eigenvalue decomposition in each iteration. In our algorithm, we only need the first κ eigenvalues. In many cases κ ≪ m, and we can try a partial eigenvalue decomposition: in each decomposition step, we only compute the l largest eigenvalues, where l is a heuristic parameter with κ < l ≤ m. For instance, let l be κ + 3, κ + 5, or 2κ, 3κ. Then we compare the lth largest eigenvalue with the κth largest. If the lth eigenvalue is far smaller than the κth, the eigenvalues beyond the lth contribute little to the optimal value v^T λ in each iteration, which means that the corresponding v_i are very small for i > l. When we apply (2.29) to compute the gradient of φ_µ(x) or ψ_µ(x), these v_i with i > l also contribute little to the gradient matrix. Hence partial eigenvalue decomposition should not cause a great loss of accuracy. How to choose a proper l in each decomposition step, and how to estimate the resulting loss, remain topics for further research.

Secondly, Nesterov has provided an excessive gap algorithm [10], which is based on upper and lower bounds of the optimal value. In further research, we may try to apply this algorithm to our problems to reduce the total number of iterations.

Finally, in our analysis we proved that the gradients of the smoothing functions for the two classes of problems are Lipschitz continuous and derived the Lipschitz constants. The maximal iteration number depends on the Lipschitz constant, which in turn depends on the norm of the matrix G: if ‖G‖ becomes larger, the Lipschitz constant and hence the iteration number become larger. In Nesterov's analysis ‖A‖ is an operator norm, while in our result ‖G‖ is the spectral norm of the Gram matrix; its properties still need investigation.
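The first observation can be checked directly: with the weights v_i from (2.7), eigenvalues lying a few multiples of µ below the threshold α receive exponentially small weight. The sketch below (an illustration with a hypothetical random spectrum) shows that for l = 3κ the discarded weights Σ_{i>l} v_[i] are numerically negligible:

```python
import numpy as np

def v_opt(lam, kappa, mu, iters=200):
    # optimal v of (2.6)-(2.8) at the eigenvalue vector lam, by bisection on alpha
    def v(a):
        z = np.clip((lam - a) / mu, -700.0, 700.0)
        return 1.0 / (1.0 + np.exp(-z))
    lo, hi = lam.min() - 50 * mu, lam.max() + 50 * mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if v(mid).sum() > kappa else (lo, mid)
    return v(0.5 * (lo + hi))

rng = np.random.default_rng(2)
m, kappa, mu = 40, 3, 0.01
P = rng.uniform(-1, 1, (m, m))
lam = np.sort(np.linalg.eigvalsh(0.5 * (P + P.T)))[::-1]   # nonincreasing spectrum
v = v_opt(lam, kappa, mu)
l = 3 * kappa
# eigenvalues beyond the l-th contribute almost nothing to v (and to the gradient)
assert v[l:].sum() < 1e-6
assert v[:l].sum() > kappa - 1e-6
```

This supports computing only the l largest eigenpairs (e.g. with a Lanczos-type routine) instead of a full decomposition; quantifying the error for spectra without a clear gap is the open question mentioned above.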
38 Bibliography [1] H. H. Bauschke, J. M. Borwein, and P. L. Combettes, Bregman monotone optimization algorithms, SIAM Journal on Control and Optimization, vol. 42, pp , [2] A. Ben-Tal and M. Teboulle, A smoothing technique for nondifferentiable optimization problems, in Optimization. Fifth French German Conference, Lecture Notes in Math. 1405, Springer-Verlag, New York, 1989, [3] L. M. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, U.S.S.R Computational Mathematics and Mathematical Physics, vol. 7, pp , [4] X. Chen, H.D. Qi, L. Qi and K. Teo, Smooth convex approximation to the maximum eigenvalue function to appear in J. of Global Optimization. [5] G.H. Golub and C.F. Van Loan, Matrix Computations, the Johns Hopkins University Press, Baltimore, USA, Third Edition,(1996). 32
[6] A. S. Lewis, Derivatives of spectral functions, Mathematics of Operations Research, 21 (1996).

[7] A. S. Lewis and H. S. Sendov, Twice differentiable spectral functions, SIAM Journal on Matrix Analysis and Applications, 22 (2001).

[8] Yu. Nesterov, Unconstrained convex minimization in relative scale, CORE Discussion Paper 2003/96.

[9] Yu. Nesterov, Smooth minimization of non-smooth functions, CORE Discussion Paper 2003/12, February 2003. Accepted by Mathematical Programming.

[10] Yu. Nesterov, Excessive gap technique in non-smooth convex minimization, CORE Discussion Paper 2003/35, May 2003. Accepted by SIAM Journal on Optimization.

[11] Yu. Nesterov, Smoothing technique and its applications in semidefinite optimization, CORE Discussion Paper 2004/73, October 2004. Accepted by SIAM Journal on Optimization.

[12] S. Shi, Smooth convex approximation and its applications, Master's thesis, National University of Singapore.

[13] D. Sun and J. Sun, Strong semismoothness of eigenvalues of symmetric matrices and its application to inverse eigenvalue problems, SIAM Journal on Numerical Analysis, 40 (2003).
Name: Yu Qi
Degree: Master of Science
Department: Mathematics
Thesis Title: Smoothing Approximations for Two Classes of Convex Eigenvalue Optimization Problems

Abstract

In this thesis, we consider two problems: minimizing the sum of the κ largest eigenvalues, and minimizing the sum of the κ largest absolute values of eigenvalues, of a parametric linear operator. In order to apply Nesterov's smoothing algorithm to these two problems, we construct two computable smoothing functions whose gradients are Lipschitz continuous. The construction is based on Shi's thesis [12] and on new techniques introduced in this thesis. Numerical results on the performance of Nesterov's smoothing algorithm are also reported.

Keywords: Smoothing functions, Lipschitz constants, Smoothing algorithm, Eigenvalue problems.