Regression With Gaussian Measures

Regression With Gaussian Measures

Michael J. Meyer

Copyright © April 11, 2004

PREFACE

We treat the basics of Gaussian processes, Gaussian measures, reproducing kernel Hilbert spaces and related topics. All mathematical details are included and every effort is made to keep this as self-contained as possible. Only elementary Hilbert space theory and integration theory as well as basic results from probability theory are assumed.

This is a work in progress and has been written up in haste. Undoubtedly there are mistakes. Please email me at spyqqqdia@yahoo.com if you find mistakes or have suggestions.

Michael J. Meyer
April 11, 2004

Contents

1 Introduction

2 Operators on Hilbert Space
    2.1 Hilbert space basics
    2.2 Adjoint operator
    2.3 Selfadjoint and positive operators
    2.4 Compact operators between Banach spaces
    2.5 Compact selfadjoint operators
    2.6 Compact operators between Hilbert spaces
    2.7 Hilbert-Schmidt and trace class operators
    2.8 Inverse problems and regularization
        2.8.1 Regularization
    Kernels and integral operators
        Symmetric kernels
        L²-bounded kernels

3 Reproducing Kernel Hilbert Spaces
    Positive semidefinite kernels
    Translation invariant kernels
    Reproducing kernel Hilbert spaces
    Bilinear kernel expansion
    Characterization of functions in H_K
    Kernel domination
    Approximation in reproducing kernel Hilbert spaces
    Orthonormal bases
    Second description of H

4 Gaussian Measures
    4.1 Probability measures in Hilbert space
    4.2 Gaussian measures on Hilbert space
    Cameron-Martin space
    Regression with Gaussian measures
    Model choices

5 Square Integrable Processes
    Integrable processes
    Processes with sample paths in an RKHS

6 Gaussian random fields
    Definition and construction
    Construction of Gaussian random fields

A Vector Valued Integration
B Conditioning of multinormal Random Vectors
C Orthogonal polynomials
    C.0.2 Legendre polynomials

Chapter 1

Introduction

We will freely use terminology which will be defined later. Let F be a nonempty set and f : F → R a real valued function on F. Consider the following problem: we have observed the values of f at some points x_1, ..., x_n ∈ F as

    y_j = f(x_j),  j = 1, ..., n,    (1.1)

and from this we want to estimate f itself.

We will follow a Bayesian approach. It is assumed that the function f belongs to a real vector space H of functions on F. A prior probability P is placed on H, and the regressor f̂ (the estimate of f in light of the data) is computed as the mean of P conditioned on the data (1.1). The probability P is defined on the σ-field E generated by the continuous linear functionals on H. If I : (H, E, P) → H denotes the H-valued random variable defined as I(f) = f (the identity on H), then the mean of the distribution P on H is the expectation E_P[I] of I under P, that is, the H-valued integral

    E_P[I] = ∫_H I dP = ∫_H f P(df).    (1.2)

Do not worry if this sounds needlessly abstract, since it is not how things are handled in practice. It merely serves to motivate the procedures below. The vector valued integral (1.2) commutes with all continuous linear functionals Λ on H, that is,

    Λ(E_P[I]) = E_P(Λ ∘ I) = ∫_H Λ(f) P(df),

and the same holds true if the ordinary expectation is replaced with a conditional expectation.

The regressor f̂ is the conditional expectation, and so we have

    f̂ = E_P[I | data],    (1.3)
    Λ(f̂) = E_P[Λ | data],    (1.4)

for each continuous linear functional Λ on H (note that Λ ∘ I = Λ). Thus rather than computing the regressor f̂ globally as in (1.3), we compute Λ(f̂) for enough continuous linear functionals Λ on H to obtain a good view of f̂.

For each x ∈ F let E_x : f ∈ H ↦ f(x) ∈ R denote the evaluation functional at the point x. If Λ = E_x, then Λ(f̂) = f̂(x) is our prediction for the value of f at the point x in light of the data (1.1). Note that the data themselves can be written in terms of the evaluation functionals as

    E_j(f) = y_j,  1 ≤ j ≤ n,    (1.5)

where E_j = E_{x_j} is the evaluation functional at the point x_j. With this the regressor f̂ becomes the conditional expectation

    f̂ = E_P[I | E_j = y_j, j ≤ n],
    Λ(f̂) = E_P[Λ | E_j = y_j, j ≤ n],    (1.6)

for each continuous linear functional Λ on H. To make this feasible we have to assume that the evaluation functionals E_x, x ∈ F, are continuous on H.

The computation of (1.6) involves only the finite dimensional distribution of the random vector W = (E_1, ..., E_n, Λ) on R^{n+1} under the probability P. Note that each continuous linear functional on H is a random variable on the probability space (H, E, P). The measure P is called a Gaussian measure on H if every continuous linear functional Λ on H is a normal random variable under P. In this case the distribution of the vector W is automatically Gaussian (multinormal) on R^{n+1} and the computation of the conditional expectation (1.6) involves merely routine computations with the multinormal density.
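A minimal numerical sketch of this routine computation (not from the text; the Gaussian kernel and the data below are illustrative assumptions). Under a zero-mean Gaussian measure the vector W = (E_1, ..., E_n, Λ) is multinormal, so for Λ = E_x the prediction f̂(x) is the conditional mean of one multinormal coordinate given the others:

    import numpy as np

    # Conditioning a zero-mean multinormal vector as in (1.6): with
    # Cov(E_i, E_j) = K(x_i, x_j) and Lambda = E_x the conditional mean is
    # E_P[Lambda | E = y] = c^t Sigma^{-1} y.  The kernel K below is an
    # assumed example choice; any positive semidefinite kernel works.

    def K(s, t, scale=1.0):
        return np.exp(-0.5 * ((s - t) / scale) ** 2)

    def conditional_mean(x_data, y_data, x_new, jitter=1e-10):
        Sigma = K(x_data[:, None], x_data[None, :])   # Cov(E_i, E_j)
        c = K(x_data, x_new)                          # Cov(E_j, Lambda)
        alpha = np.linalg.solve(Sigma + jitter * np.eye(len(x_data)), y_data)
        return c @ alpha

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.sin(x)
    print(conditional_mean(x, y, 1.5))   # prediction for f(1.5)

The jitter term only stabilizes the linear solve; the conditioning itself is the routine multinormal computation mentioned above (see also Appendix B).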

We have chosen the particular form (1.1) for the data because this is the standard in regression problems. Note however that our approach applies to all forms of data and predictions which can be articulated in terms of events involving finitely many continuous linear functionals on H.

Regression with Gaussian processes assumes that f is the trajectory of a Gaussian process Z = Z(x) on F. The mean of the process is assumed to be zero, and thus the process Z is completely determined by its covariance function K(x, y), which is a symmetric positive semidefinite kernel on F. The kernel K : F × F → R is a parameter of the regression procedure. The space H is the product space H = R^F of all functions f : F → R, and the probability P is the distribution of Z on H. Kolmogorov's existence theorem for product measures guarantees the existence of the probability P on H for every symmetric, positive semidefinite kernel K on F.

The space H = R^F is a topological vector space with only one redeeming quality: the evaluation functionals are the coordinate functionals and hence continuous in the product topology on H. Unfortunately there are essentially no other continuous linear functionals on H: every continuous linear functional on H is a finite linear combination of coordinate functionals. Consequently this setup limits us to data presented in the form (1.1) and consequent predictions of values f(x) at other points x ∈ F in a point by point fashion. There are other disadvantages. For example, it requires a substantial effort to extract properties of the admissible functions f, that is, the trajectories of the Gaussian process Z, from properties of the covariance kernel K, and the resulting properties are often weaker than desired.

Consequently we take a slightly different approach. We assume instead that f is an element of a separable Hilbert space H of functions on F. P is a Gaussian measure on H defined in terms of an orthonormal basis {ψ_j} of H and a sequence (σ_j) of positive numbers (which diagonalize the covariance operator Q of P below). We can then proceed as above provided that the evaluation functionals are continuous on H. But we also have other options. The data and predictions can be articulated in any fashion which uses only finitely many continuous linear functionals Λ on H. Point estimates are one possibility. Another possibility is the coefficients Λ_k(f) = (f, ψ_k) of f in the expansion f = Σ_j (f, ψ_j)ψ_j in the basis {ψ_j} of H.
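A hedged illustration of this setup (the basis, the weights σ_j and the truncation level are assumptions chosen for the example, not prescriptions of the text): a random element f of H with distribution P can be simulated by drawing the coefficients in the basis {ψ_j} independently.

    import numpy as np

    # Draw f = sum_j sigma_j Z_j psi_j with Z_j independent standard normal.
    # Example choices (assumptions): H = L^2[0, 2 pi] with the orthonormal
    # sine basis psi_j(t) = sin(j t)/sqrt(pi) and weights sigma_j = 1/j,
    # truncated at J terms.  Square summable weights make the covariance
    # operator Q trace class (see Chapter 4).

    rng = np.random.default_rng(0)
    t = np.linspace(0, 2 * np.pi, 400)
    J = 50

    f = np.zeros_like(t)
    for j in range(1, J + 1):
        psi_j = np.sin(j * t) / np.sqrt(np.pi)   # orthonormal in L^2[0, 2 pi]
        f += (1.0 / j) * rng.standard_normal() * psi_j

    # f now holds one sample path of the measure P evaluated on the grid t.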

Here we had to assume that the evaluation functionals are continuous on H. A Hilbert space of functions on F with this property is called a reproducing kernel Hilbert space on F. Such a Hilbert space H defines a unique symmetric, positive semidefinite kernel K : F × F → R. Conversely, every symmetric, positive semidefinite kernel K : F × F → R determines a unique reproducing kernel Hilbert space. There is an interesting interplay between orthonormal bases of H and the kernel K.

A basic question is how to find an orthonormal basis for H. If F ⊆ R^d is compact and K is continuous, then we have additional structure in the form of the Euclidean topology and Lebesgue measure on F. Associated with the kernel K we have the integral operator T : L²(F) → L²(F) defined by

    (Tf)(x) = ∫_F K(x, y)f(y) dy,  f ∈ L²(F), x ∈ F,

where dy denotes Lebesgue measure on F. It turns out that T is a Hilbert-Schmidt operator. Consequently the orthogonal complement of the null space of T has an orthonormal basis {φ_j} consisting of eigenvectors of T. Let λ_j denote the corresponding eigenvalues. Then the functions ψ_j = √λ_j φ_j are an orthonormal basis for the reproducing kernel Hilbert space H with kernel K. This establishes the connection to the spectral theory of compact, selfadjoint operators on a Hilbert space.

There is another connection. For f ∈ H let Λ_f be the bounded linear functional Λ_f(h) = (h, f) on H. The Gaussian measure P on H defines a unique bounded linear operator Q : H → H such that the covariances of the random variables Λ_f, Λ_g are given as

    Cov_P(Λ_f, Λ_g) = (Qf, g)_H,  f, g ∈ H.    (1.7)

The operator Q is a positive trace class operator. Conversely, for every positive trace class operator Q : H → H there exists a unique Gaussian measure P on H such that (1.7) holds. Thus the material presents an interesting interaction of functional analysis and probability theory.

If you are only interested in the regression problem you need only read Chapter 2, Chapter 3, sections 1-4, 7, 8, and Chapter 4, sections 1, 2, 4.
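The basis recipe above is easy to carry out numerically. A sketch (the kernel min(s, t), the grid, and the Nyström discretization are illustrative assumptions, not part of the text):

    import numpy as np

    # Nystrom approximation of (Tf)(x) = integral of K(x,y) f(y) dy on
    # F = [0,1]: eigenvectors of the scaled kernel matrix approximate the
    # eigenfunctions phi_j, and psi_j = sqrt(lambda_j) phi_j approximate an
    # orthonormal basis of the reproducing kernel Hilbert space H.

    n = 200
    x = (np.arange(n) + 0.5) / n                 # grid on F = [0, 1]
    h = 1.0 / n
    Kmat = np.minimum(x[:, None], x[None, :])    # example kernel K(s,t) = min(s,t)

    lam, U = np.linalg.eigh(h * Kmat)            # discretized operator T
    lam, U = lam[::-1], U[:, ::-1]               # eigenvalues in decreasing order
    phi = U / np.sqrt(h)                         # L^2-normalized eigenfunctions
    psi = np.sqrt(np.clip(lam, 0, None)) * phi   # psi_j = sqrt(lambda_j) phi_j

    print(lam[:4])   # for min(s,t): lambda_j is approx ((j - 1/2) pi)^(-2)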

Chapter 2

Operators on Hilbert Space

In this chapter we develop the spectral theory of compact operators between Hilbert spaces. Our scalars are the reals, that is, we consider only real Hilbert spaces.

2.1 Hilbert space basics

We review the basics of Hilbert space theory. Let H be a (real) Hilbert space with inner product (·, ·). Let

    H_1 = { x ∈ H : ‖x‖ ≤ 1 }

denote the closed unit ball in H and

    S_1(H) = { x ∈ H : ‖x‖ = 1 }

the unit sphere in H. For vectors x, y ∈ H we write x ⊥ y (orthogonal) if (x, y) = 0. For subsets A, B of H we write A ⊥ B if a ⊥ b for all a ∈ A and b ∈ B. We let

    A^⊥ := { x ∈ H : x ⊥ a, for all a ∈ A }.

Then A^⊥ is a closed subspace of H. If V is a closed subspace of H, then H = V + V^⊥; in particular every closed subspace of H is complemented in H. This is the first fundamental fact about Hilbert spaces. Each element x ∈ H has a unique decomposition x = v + v^⊥ with v ∈ V and v^⊥ ∈ V^⊥. We have

    ‖x‖² = ‖v‖² + ‖v^⊥‖²

(the Law of Pythagoras). The map x ↦ v is called the perpendicular projection onto the subspace V and is denoted π_V. If (φ_j) is an ON-basis of V, then

    π_V(x) = Σ_j (x, φ_j)φ_j,  x ∈ H.    (2.1)

The second fundamental property of a Hilbert space H is the fact that the continuous linear functionals on H can be identified with the elements of H: if a ∈ H, then Λ_a : x ∈ H ↦ (x, a) ∈ R defines a continuous linear functional on H. The converse is also true: every continuous linear functional on H has this form (Riesz Representation Theorem).

Bilinear forms. Let X and Y be Hilbert spaces. A function ψ = ψ(x, y) : X × Y → R is called a bilinear form if it is linear in both variables x and y. The bilinear form ψ is called continuous if

    ‖ψ‖ = sup{ |ψ(x, y)| : ‖x‖_X ≤ 1, ‖y‖_Y ≤ 1 } < ∞.    (2.2)

In this case |ψ(x, y)| ≤ ‖ψ‖ ‖x‖ ‖y‖, for all x ∈ X and y ∈ Y. Note that the closed unit balls X_1, Y_1 can be replaced with the unit spheres S_1(X), S_1(Y) with no effect on the definition of the norm of ψ. If A : X → Y is a bounded linear operator, then ψ(x, y) = (Ax, y) defines a continuous bilinear form on X × Y with ‖ψ‖ = ‖A‖. Conversely

Theorem (Lax-Milgram). Let ψ = ψ(x, y) be a continuous bilinear form on X × Y. Then there exists a bounded linear operator A : X → Y such that ψ(x, y) = (Ax, y)_Y, for all x ∈ X and y ∈ Y.

Proof. Fix x ∈ X. Then Λ_x(y) = ψ(x, y) is a continuous linear functional on Y. By the Riesz Representation Theorem there exists an element a ∈ Y with Λ_x(y) = (a, y)_Y, for all y ∈ Y. Clearly a is uniquely determined by x. Write a = Ax. This defines a map A : X → Y which satisfies ψ(x, y) = (Ax, y). The uniqueness of a and the linearity of ψ in the first argument imply that the map A is linear. The continuity of ψ implies that A is continuous.

If X = Y = H, then a bilinear form ψ = ψ(x, y) on X × Y is called a bilinear form on H. Such a bilinear form is called symmetric if it satisfies ψ(x, y) = ψ(y, x), for all x, y ∈ H. In this case

Proposition. Let ψ = ψ(x, y) be a symmetric bilinear form on H. Then

    ‖ψ‖ = sup{ |ψ(x, x)| : ‖x‖ ≤ 1 }.    (2.3)

Proof. Let C denote the right hand side of (2.3). Obviously C ≤ ‖ψ‖ and we have to show only the reverse inequality. Write φ(x) = ψ(x, x). By homogeneity |φ(u)| ≤ C‖u‖², for all u ∈ H. Using the symmetry of ψ we can write

    ψ(x, y) = φ((x + y)/2) − φ((x − y)/2).

Recall that H_1 denotes the closed unit ball in H. If x, y ∈ H_1, then the parallelogram law yields ‖(x + y)/2‖² + ‖(x − y)/2‖² = (‖x‖² + ‖y‖²)/2 ≤ 1, and it follows that

    |ψ(x, y)| ≤ C‖(x + y)/2‖² + C‖(x − y)/2‖² ≤ C.

Taking the sup over all x, y ∈ H_1 now yields ‖ψ‖ ≤ C.

2.2 Adjoint operator

The Lax-Milgram theorem can be used to show the existence of the adjoint operator. Let X, Y be Hilbert spaces and T : X → Y a bounded linear operator. Then ψ(y, x) = (y, Tx)_Y is a continuous bilinear form on Y × X. Consequently there exists a bounded linear operator T* : Y → X such that ψ(y, x) = (T*y, x)_X, for all y ∈ Y and x ∈ X. It is easy to see that the operator T* is uniquely determined by its defining property

    (Tx, y) = (x, T*y),  x ∈ X, y ∈ Y.

Obviously T** = T. We note the following

Proposition 2.2.1 We have
(i) N(T*T) = N(T).
(ii) N(T*) = R(T)^⊥.
(iii) N(T) = R(T*)^⊥.

Proof. (i) If Tx = 0, then T*Tx = 0. Conversely, if T*Tx = 0, then ‖Tx‖² = (Tx, Tx) = (T*Tx, x) = 0, thus x ∈ N(T).
(ii) Let w ∈ N(T*) and y = Tx for some x ∈ X. Then (y, w) = (x, T*w) = 0. Thus w ∈ R(T)^⊥. Conversely, if w ∈ R(T)^⊥, then (T*w, x) = (w, Tx) = 0, for all x ∈ X. This implies T*w = 0 (let x = T*w) and so w ∈ N(T*).
Now (iii) follows from this. Replace T with T* and note that T** = T.

Remark. By taking orthogonal complements in (ii) and (iii) we obtain R(T) ⊆ N(T*)^⊥ and R(T*) ⊆ N(T)^⊥, but we will not have equality in general since R(T) and R(T*) need not be closed.
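In finite dimensions the adjoint is the transpose, and Proposition 2.2.1 can be checked numerically. A small sketch (the random rank-deficient matrix is an arbitrary choice):

    import numpy as np

    # Check N(T*) = R(T)^perp for a matrix T, where T* = T^t: the left
    # singular vectors with nonzero singular value span R(T), and the
    # remaining ones span N(T^t).

    rng = np.random.default_rng(1)
    T = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))   # rank 3

    U, s, Vt = np.linalg.svd(T)
    r = int((s > 1e-10).sum())       # numerical rank
    null_Tt = U[:, r:]               # ON-basis of N(T^t), dimension 5 - r

    print(np.allclose(T.T @ null_Tt, 0.0))   # columns orthogonal to R(T)
    print(r, null_Tt.shape[1])               # rank 3, codimension 2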

For any subset A ⊆ X we have A^⊥ = cl(A)^⊥, where cl(A) denotes the closure of A. Thus (ii) can be written as N(T*) = cl(R(T))^⊥. Note that this implies that T* is one to one on the closure cl(R(T)) of the range of T.

2.3 Selfadjoint and positive operators

A bounded linear operator T on H is called selfadjoint if it satisfies

    (Tx, y) = (x, Ty),    (2.4)

for all x, y ∈ H. In this case the nullspace N(T) = { x ∈ H : Tx = 0 } satisfies N(T) = R(T)^⊥. The converse R(T) = N(T)^⊥ is not true in general, simply because the range R(T) will not in general be closed.

The number λ is called an eigenvalue of T if there is a nonzero vector x ∈ H with Tx = λx, that is, x ∈ N(T − λI), where I is the identity operator on H. We let

    E_λ(T) := N(T − λI) = { x ∈ H : Tx = λx }

denote the eigenspace associated with the eigenvalue λ. Obviously this space is defined whether or not λ is an eigenvalue of T. It is an eigenvalue if and only if E_λ(T) ≠ {0}. The nonzero elements of E_λ(T) are called the eigenvectors associated with the eigenvalue λ.

Proposition 2.3.1 Let T be a selfadjoint operator on H. Then λ ≠ µ implies E_λ(T) ⊥ E_µ(T); in other words, eigenvectors with respect to different eigenvalues are perpendicular to each other.

Proof. Assume that Tx = λx and Ty = µy. Then λ(x, y) = (Tx, y) = (x, Ty) = µ(x, y). Since λ ≠ µ this implies that (x, y) = 0.

If λ = 0 then the eigenspace E_λ(T) is simply the nullspace N(T), and λ = 0 is an eigenvalue of T if and only if T has a nontrivial nullspace. If T is selfadjoint this eigenspace is perpendicular to the range R(T), and so no eigenvector associated with the eigenvalue zero is in the range of T. By contrast, if λ ≠ 0, then E_λ(T) ⊆ R(T), since every eigenvector associated with λ satisfies x = λ⁻¹Tx.

A subspace V ⊆ H is called T-invariant if it satisfies T(V) ⊆ V. In this case the restriction of T to V is a linear operator on V.

Proposition 2.3.2 Let T be a selfadjoint operator on H and V ⊆ H a T-invariant subspace. Then the orthogonal complement V^⊥ is also T-invariant.

Proof. Let x ∈ V^⊥. Then for all y ∈ V we have (Tx, y) = (x, Ty) = 0, since Ty ∈ V. Thus Tx ∈ V^⊥.

Assume that V is a closed T-invariant subspace, write H = V + V^⊥ and let T_1, T_2 denote the restrictions of T to V respectively V^⊥. Then T = T_1 π_V + T_2 π_{V^⊥}, where π_V, π_{V^⊥} are the orthogonal projections onto the subspaces V, V^⊥. Thus the restrictions T_1, T_2 completely determine the operator T.

Every eigenspace E_λ(T) of T, and in particular the null space N(T), is T-invariant. Write H = N(T) + W, where W = N(T)^⊥. Then the restriction of T to W is a linear operator on W and obviously this restriction completely determines the operator T (since the restriction of T to its null space is simply zero). Thus we will often be able to disregard the eigenvectors associated with the eigenvalue zero, that is, the eigenvectors in the nullspace of T.

Proposition 2.3.3 If the operator T on H is selfadjoint, then

    ‖T‖ = sup{ |(Tx, x)| : ‖x‖ = 1 }.    (2.5)

Proof. Clearly it will suffice to show (2.5) with ‖x‖ = 1 replaced with ‖x‖ ≤ 1. Set ψ(x, y) = (x, Ty). Then ψ is a bilinear form with ‖ψ‖ = ‖T‖. Since T is selfadjoint, ψ is symmetric. Now apply (2.3).

Positive operators. A bounded linear operator A on H is called positive if it satisfies (Ax, x) ≥ 0, for all x ∈ H. If strict inequality holds for all nonzero x, then A is called strictly positive. For example, if X and Y are Hilbert spaces and T : X → Y a bounded linear operator, then the operator A = T*T on X is positive:

    (Ax, x) = (T*Tx, x) = (Tx, Tx) = ‖Tx‖² ≥ 0.

Proposition 2.3.4 If the operator A on H is positive, then every eigenvalue λ of A satisfies λ ≥ 0.

Proof. Let x be an eigenvector with eigenvalue λ. Then λ‖x‖² = λ(x, x) = (Ax, x) ≥ 0.

Proposition 2.3.5 If the operator A on H is positive, then the operator αI + A has a bounded inverse on all of H, for each α > 0.

Proof. Let α > 0 and set T = αI + A. Then, for each x ∈ H we have

    ‖Tx‖² = α²‖x‖² + 2α(Ax, x) + ‖Ax‖² ≥ α²‖x‖².

It follows that T is one to one and has closed range. Moreover T is selfadjoint. Thus R(T)^⊥ = N(T) = {0}, and so T has dense range. It follows that R(T) = H and T has an inverse T⁻¹ : H → H as a linear map. The inverse is bounded since ‖Tx‖ ≥ α‖x‖ implies that ‖T⁻¹y‖ ≤ α⁻¹‖y‖.

We will also need the following result:

Proposition 2.3.6 If the operator A on H is positive, then there exists a unique positive operator S on H such that A = S². The operator S is called the (positive) square root of A and is denoted S = √A.

The existence of S is a special case of the so called continuous functional calculus, which is a consequence of the representation theory of commutative C*-algebras. This theory is quite easy and provides the most natural proof. The reader is referred to the literature; a numerical sketch follows below.

2.4 Compact operators between Banach spaces

Let us recall without proof some facts about compact sets in a complete normed space X. A subset A ⊆ X is called relatively compact if the closure of A is compact. The set A is called totally bounded if for each ε > 0 there are finitely many balls B(x_i, ε), x_i ∈ X, of radius ε which cover A. With this

Theorem 2.4.1 For a subset A ⊆ X the following are equivalent:
(i) A is relatively compact.
(ii) A is totally bounded.
(iii) Each sequence (a_n) ⊆ A has a subsequence which converges in X.

The proof is given in every class on metric spaces. The limit of the subsequence in (iii) will be in the closure of A but need not be in A itself.
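As announced above, a sketch of Proposition 2.3.6 in finite dimensions (the test matrix is an arbitrary choice): the square root of a positive matrix comes from a spectral decomposition.

    import numpy as np

    # For a symmetric positive semidefinite A = U diag(lam) U^t the square
    # root is S = U diag(sqrt(lam)) U^t: S is positive and S^2 = A.

    rng = np.random.default_rng(2)
    B = rng.standard_normal((4, 4))
    A = B.T @ B                                   # A = B^t B is positive

    lam, U = np.linalg.eigh(A)
    S = U @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ U.T

    print(np.allclose(S @ S, A))                      # True: S^2 = A
    print(np.all(np.linalg.eigvalsh(S) >= -1e-12))    # S is positive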

Let X, Y be complete normed spaces. A linear operator T : X → Y is called compact if the image T(B) ⊆ Y of the unit ball B ⊆ X is relatively compact in Y. T is called a finite rank operator if the range R(T) := T(X) ⊆ Y is finite dimensional. In this case T has the form

    T(x) = Σ_{j<n} Λ_j(x)φ_j,  x ∈ X,    (2.6)

where n = dim(R(T)), φ_j ∈ Y and the Λ_j are continuous linear functionals on X. Simply let {φ_0, ..., φ_{n−1}} be a basis for R(T) and Λ_j = ψ_j ∘ T, where ψ_j is the coordinate functional associated with the basis vector φ_j, that is, y = Σ_{j<n} ψ_j(y)φ_j, y ∈ R(T). Now set y = Tx. Conversely, every operator of this form is a finite rank operator with R(T) ⊆ span({φ_j}).

Since a bounded set in a finite dimensional space is relatively compact (Bolzano-Weierstrass Theorem), every finite rank operator is compact.

Theorem 2.4.2 Let X, Y be complete normed spaces and T : X → Y a linear operator.
(i) If T is a finite rank operator, then T is compact.
(ii) If T is the limit in operator norm of compact operators, then T is compact.

Proof. Assume that T_n : X → Y is compact, for each n ≥ 1, and T_n → T in operator norm. Let B ⊆ X be the unit ball and ε > 0. Choose n such that ‖T_n − T‖ < ε/2. There exist finitely many balls B(y_i, ε/2) ⊆ Y which cover T_n(B). Then the corresponding balls B(y_i, ε) cover T(B). This shows that T(B) is totally bounded.

Let us introduce the following notation: with B(X, Y) we denote the space of all bounded linear operators T : X → Y. Likewise F(X, Y) and K(X, Y) denote the set of finite rank respectively compact operators in B(X, Y). If X = Y, we write B(X), F(X) and K(X) for B(X, X), F(X, X) and K(X, X). It is easily verified that F(X, Y) and K(X, Y) are in fact subspaces of B(X, Y). Then from (ii)

    cl(F(X, Y)) ⊆ K(X, Y) ⊆ B(X, Y).

The converse of (ii) is not true in general, but it is true if X and Y are Hilbert spaces, as we shall see below. In other words, cl(F(X, Y)) ⊆ K(X, Y) in general, but we have equality in the case of Hilbert spaces X and Y.
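A matrix sketch of this equality (using the singular value decomposition developed in general in section 2.6): truncations of the SVD are finite rank operators converging to T in operator norm.

    import numpy as np

    # Rank-n truncations T_n of the SVD satisfy ||T - T_n|| = sigma_n (the
    # first omitted singular value), hence T_n -> T in operator norm.

    rng = np.random.default_rng(3)
    T = rng.standard_normal((8, 6))
    U, s, Vt = np.linalg.svd(T)

    for n in range(1, 5):
        T_n = (U[:, :n] * s[:n]) @ Vt[:n, :]    # finite rank approximation
        err = np.linalg.norm(T - T_n, ord=2)    # operator norm of the error
        print(n, np.isclose(err, s[n]))         # error equals sigma_n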

For an operator T ∈ F(X, Y) we set rank(T) = dim(R(T)). If T has the form (2.6), then rank(T) = n if φ_0, ..., φ_{n−1} are linearly independent.

Let T ∈ K(X, Y). Then the image T(D) ⊆ Y of each bounded subset D ⊆ X is relatively compact. Using Theorem 2.4.1 we see

Proposition 2.4.3 Let T ∈ B(X, Y). Then T is compact if and only if the sequence (Tx_n) ⊆ Y has a convergent subsequence for each bounded sequence (x_n) in X.

Let A be any set and τ, σ topologies on A with τ ⊆ σ. If τ is a Hausdorff topology and A is compact in the topology σ, then τ = σ. It will suffice to show that each σ-closed set F ⊆ A is τ-closed. Indeed, F is σ-compact and hence τ-compact (every cover with τ-open sets is a cover with σ-open sets). Since τ is Hausdorff it follows that F is τ-closed.

Let X be a normed space and X* the space of all continuous linear functionals on X. Recall that the weak topology on X is the weakest topology in which all functionals F ∈ X* are continuous. Clearly this topology is weaker than the norm topology on X. It is a Hausdorff topology (the continuous linear functionals on a normed space X separate points of X). The observation above shows that the weak topology agrees with the norm topology on every norm compact subset of X. Recall that a sequence (x_n) ⊆ X satisfies x_n → x weakly (in the weak topology) if and only if F(x_n) → F(x), for each continuous linear functional F ∈ X*.

Proposition 2.4.4 Let T ∈ B(X, Y) be compact and (x_n) ⊆ X bounded. If x_n → x ∈ X weakly, then Tx_n → Tx in norm.

Proof. Since T is bounded the weak convergence x_n → x implies the weak convergence Tx_n → Tx. Choose a bounded subset B ⊆ X with (x_n) ⊆ B and x ∈ B. Then K = cl(T(B)) ⊆ Y is compact. Consequently the weak topology agrees with the norm topology on K. Since Tx_n, Tx ∈ K and Tx_n → Tx weakly, it follows that Tx_n → Tx in norm.

Remark. A weakly convergent sequence (x_n) is automatically bounded, that is, the assumption of boundedness above is superfluous, but we do not need this result. If (x_n) is weakly convergent then it is weakly bounded, i.e. sup_n |F(x_n)| < ∞, for each continuous linear functional F ∈ X*. The Uniform Boundedness Principle now implies that the sequence (x_n) is bounded in norm.

Exercise. Let X, Y, Z be complete normed spaces and T : X → Y, S : Y → Z bounded linear operators. If one of S, T is compact then so is the product ST. Hint: regardless of compactness, T maps bounded sets to bounded sets and S maps relatively compact sets to relatively compact sets.

We conclude this section with a characterization of compact operators on Hilbert space:

Theorem 2.4.5 Let X and Y be Hilbert spaces and T ∈ B(X, Y) a bounded linear operator. Then T is compact if and only if ‖Te_n‖ → 0, for each orthonormal sequence (e_n) ⊆ X.

Proof. (⇒) Assume that T is compact and let (e_n) ⊆ X be an orthonormal sequence. Then Σ_n (x, e_n)² ≤ ‖x‖² < ∞ and so (x, e_n) → 0, as n → ∞, for each x ∈ X. By the Riesz representation theorem this means F(e_n) → 0, for each continuous linear functional F ∈ X*, that is, e_n → 0 weakly in X. By Proposition 2.4.4 the compactness of T now implies Te_n → 0 in norm.

(⇐) Recall that N_1 denotes the closed unit ball of a normed space N. Assume that T is not compact and hence T(X_1) ⊆ Y not totally bounded. Let ε > 0 be such that the closure cl(T(X_1)) cannot be covered with finitely many balls of radius 2ε. We construct an orthonormal sequence (e_n) ⊆ X such that ‖Te_n‖ ≥ ε, for all n ≥ 1.

(A) We claim that for every finite dimensional subspace N ⊆ X there exists e ⊥ N with ‖e‖ = 1 and ‖Te‖ ≥ ε. If this were not true, let N ⊆ X be a finite dimensional subspace such that ‖Te‖ < ε, for all e ∈ V := N^⊥ with ‖e‖ ≤ 1, that is T(V_1) ⊆ εY_1. Note that cl(T(N_1)) ⊆ Y is compact and hence can be covered by finitely many balls B(y_j, ε) of radius ε. Since X_1 ⊆ N_1 + V_1 we have T(X_1) ⊆ T(N_1) + T(V_1). It follows that T(X_1) is covered by the balls B(y_j, 2ε), in contradiction to the choice of ε. This shows (A).

(B) Now we can construct the sequence (e_n) by induction. Using (A) with N = {0} find e_0 with ‖Te_0‖ ≥ ε. Given that orthonormal e_0, ..., e_n with ‖Te_j‖ ≥ ε have already been constructed, set N = span({e_0, ..., e_n}) and choose e_{n+1} ⊥ N with ‖e_{n+1}‖ = 1 such that ‖Te_{n+1}‖ ≥ ε. Then the sequence {e_0, ..., e_{n+1}} is orthonormal and the construction continues.
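A numerical sketch of this criterion (the kernel min(s, t), the sine sequence and the grid discretization are illustrative assumptions): for a discretized integral operator on L²[0, 1] the norms ‖Te_n‖ decay along an orthonormal sequence, while the identity operator keeps them at 1 and is indeed not compact.

    import numpy as np

    # Test the criterion on the orthonormal sequence e_n(s) = sqrt(2) sin(n pi s)
    # in L^2[0,1]: ||T e_n|| -> 0 for the integral operator with kernel
    # K(s,t) = min(s,t), while ||I e_n|| = 1 for all n.

    m = 2000
    s = (np.arange(m) + 0.5) / m
    h = 1.0 / m
    T = np.minimum(s[:, None], s[None, :]) * h   # discretized integral operator

    for n in [1, 2, 5, 10, 50]:
        e_n = np.sqrt(2) * np.sin(n * np.pi * s)
        print(n,
              np.sqrt(h) * np.linalg.norm(T @ e_n),   # ||T e_n||, tends to 0
              np.sqrt(h) * np.linalg.norm(e_n))       # ||I e_n||, stays near 1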

2.5 Compact selfadjoint operators

Let T be a compact, selfadjoint operator on a Hilbert space H. Then T can be diagonalized in the sense that there is an orthonormal basis for H consisting of eigenvectors of T. This result makes it very easy to work with such operators. For the proof we need the following

Lemma 2.5.1 Let T be a compact, selfadjoint operator on H. Then at least one of λ = ‖T‖ or λ = −‖T‖ is an eigenvalue of T.

Proof. We may assume that T ≠ 0. From (2.5) we get a sequence of vectors x_n ∈ H with ‖x_n‖ = 1 and a number λ with |λ| = ‖T‖ such that (Tx_n, x_n) → λ, as n → ∞. Then, for each n ≥ 0 we have

    0 ≤ ‖Tx_n − λx_n‖² = ‖Tx_n‖² − 2λ(Tx_n, x_n) + λ²‖x_n‖²    (2.7)
      ≤ ‖T‖² − 2λ(Tx_n, x_n) + λ².    (2.8)

As n → ∞, the rightmost quantity converges to 2λ² − 2λ² = 0. Thus we also have Tx_n − λx_n → 0. Set y_n = Tx_n. By compactness of T the sequence y_n has a convergent subsequence. Passing to this subsequence we may assume that the sequence y_n is itself convergent. But then the sequence x_n = λ⁻¹(y_n − (y_n − λx_n)) converges also. Since Tx_n − λx_n → 0 the limit x = lim_n x_n must satisfy Tx = λx. Since ‖x_n‖ = 1, for all n, we have ‖x‖ = 1.

With this we can now prove the main result about compact selfadjoint operators:

Theorem 2.5.2 Let T be a compact, selfadjoint operator on H. Then there exists an orthonormal basis for H consisting of eigenvectors of T. More precisely, N(T)^⊥ has a countable orthonormal basis (φ_j) consisting of eigenvectors of T, and if λ_j are the associated eigenvalues, then

    Tx = Σ_j λ_j(x, φ_j)φ_j,  x ∈ H,

where the series converges in the norm of H. If the sequence (φ_j) is infinite, then λ_j → 0, as j → ∞.

Proof. By induction we construct a (possibly finite) sequence of numbers λ_j ≠ 0 and orthonormal vectors φ_j such that (i) Tφ_j = λ_jφ_j, (ii) the restriction T_j of T to {φ_0, ..., φ_{j−1}}^⊥ satisfies ‖T_j‖ = |λ_j|, and (iii) T = 0 on {φ_0, φ_1, ...}^⊥.

Since the λ_j are nonzero, each φ_j is in N(T)^⊥, and from (iii) it follows that the φ_j span all of N(T)^⊥ (recall that (A^⊥)^⊥ is the closed linear span of A).

The quantities λ_0 and φ_0 exist by Lemma 2.5.1. Assume that λ_0, ..., λ_j and φ_0, ..., φ_j have already been constructed. Set X_j = {φ_0, ..., φ_j}^⊥. If T = 0 on X_j, then we are finished. Otherwise note that X_j is a closed T-invariant subspace (since span({φ_0, ..., φ_j}) is T-invariant). The restriction T_j of T to X_j is a compact selfadjoint operator on X_j. Applying Lemma 2.5.1 to T_j we see that there is a unit vector φ_{j+1} ∈ X_j and a number λ_{j+1} such that (a) |λ_{j+1}| = ‖T_j‖ and (b) Tφ_{j+1} = T_jφ_{j+1} = λ_{j+1}φ_{j+1}. Obviously φ_{j+1} ⊥ φ_0, ..., φ_j and so the resulting sequence (φ_j) is orthonormal. If T_j = 0 at any time, then (iii) is already satisfied and we are finished.

Assume now that T_j ≠ 0, for all j ≥ 0, set X = {φ_0, φ_1, ...}^⊥ and let S be the restriction of T to X. We must show that S = 0. From (ii) it follows that |λ_0| ≥ |λ_1| ≥ ... ≥ |λ_j| ≥ ‖S‖, for all j ≥ 0, and so it will suffice to show that λ_j → 0 as j → ∞. If λ_j does not converge to 0, we have |λ_j| ≥ ρ for some number ρ > 0. Then the sequence (φ_j/λ_j) ⊆ H is bounded, and by compactness of T the sequence y_j = T(φ_j/λ_j) = φ_j has a convergent subsequence. However this contradicts the fact that the sequence φ_j is orthonormal and hence ‖φ_j − φ_k‖ = √2, for all j ≠ k. Consequently we must have λ_j → 0.

Remark (Spectrum). We claim that the sequence (λ_j) contains all the nonzero eigenvalues of T. If λ ≠ λ_j, λ ≠ 0 were another eigenvalue, the associated eigenspace would be contained in N(T)^⊥ and perpendicular to all the φ_j, which contradicts the fact that the φ_j span N(T)^⊥. It follows that the λ_j contain all the nonzero eigenvalues of T. Note also that the convergence λ_j → 0 implies that the eigenspaces corresponding to nonzero eigenvalues are all finite dimensional.

The sequence (λ_j) contains all nonzero eigenvalues of T, but what about the spectrum of T, that is, the set

    σ(T) = { λ ∈ R : T − λI is not invertible on H }?

Let us assume that H is not finite dimensional. Then the unit ball H_1 is not compact. It follows that T is not invertible, that is, 0 ∈ σ(T) (regardless of whether 0 is an eigenvalue or not). However, if λ ≠ 0 and λ ≠ λ_j, for all j ≥ 0, then it can be shown that the operator T − λI is invertible on H. To compute (T − λI)⁻¹ we must solve

    (T − λI)x = y    (2.9)

for x in terms of y. Write V = N(T) and x = π_V(x) + π_{V^⊥}(x) as well as y = π_V(y) + π_{V^⊥}(y). With this (2.9) becomes

    −λπ_V(x) + (T − λI)π_{V^⊥}(x) = π_V(y) + π_{V^⊥}(y),

and since both V and V^⊥ are T-invariant and hence (T − λI)-invariant, this is equivalent with

    −λπ_V(x) = π_V(y)  and  (T − λI)π_{V^⊥}(x) = π_{V^⊥}(y).    (2.10)

Since the φ_j are an ON-basis for V^⊥ we have π_{V^⊥}(y) = Σ_j (y, φ_j)φ_j and π_{V^⊥}(x) = Σ_j α_jφ_j with α_j to be determined. Note that (T − λI)φ_j = (λ_j − λ)φ_j. With this (2.10) becomes Σ_j α_j(λ_j − λ)φ_j = Σ_j (y, φ_j)φ_j, which solves for α_j = (y, φ_j)/(λ_j − λ), resulting in

    x = π_V(x) + π_{V^⊥}(x) = −(1/λ)π_V(y) + Σ_j (y, φ_j)/(λ_j − λ) φ_j.

The solution x exists for each y and is a continuous linear function of y; in other words

    (T − λI)⁻¹y = −(1/λ)π_V(y) + Σ_j (y, φ_j)/(λ_j − λ) φ_j

exists as a continuous linear operator on H. Consequently the point λ is not in the spectrum of T and we have shown that σ(T) = {λ_j} ∪ {0}.

Remark (Range). The series expansion in Theorem 2.5.2 also allows us to determine the range R(T) quite easily. Let y ∈ H and consider the equation

    Tx = y.    (2.11)

If this equation has a solution x, then y ∈ N(T)^⊥. Assume now that y ∈ N(T)^⊥. Then we have an expansion y = Σ_j (y, φ_j)φ_j. Clearly, to find x ∈ H

with Tx = y we can restrict ourselves to x ∈ N(T)^⊥. Such x will then have an expansion

    x = Σ_j α_jφ_j    (2.12)

with α_j to be determined. In terms of these series expansions (2.11) becomes Σ_j α_jλ_jφ_j = Tx = y = Σ_j (y, φ_j)φ_j, which implies that we must have α_j = λ_j⁻¹(y, φ_j). However, for these α_j the series (2.12) converges exactly if Σ_j λ_j⁻²(y, φ_j)² < ∞. It follows that

    R(T) = { y ∈ N(T)^⊥ : Σ_j λ_j⁻²(y, φ_j)² < ∞ }.

2.6 Compact operators between Hilbert spaces

The case of a general compact operator T : X → Y between Hilbert spaces X and Y can be reduced to the selfadjoint case by observing that the product T*T is a compact, selfadjoint operator on X. The results of the last section then carry over with minimal changes.

Let X and Y be Hilbert spaces, T ∈ B(X, Y). A singular system for T is a sequence (µ_j, φ_j, ξ_j)_j where
(i) µ_0 ≥ µ_1 ≥ ... ≥ µ_n ≥ ... > 0,
(ii) {φ_j} is an ON-basis for N(T)^⊥,
(iii) {ξ_j} is an ON-basis for N(T*)^⊥, and
(iv) Tφ_j = µ_jξ_j and T*ξ_j = µ_jφ_j, for all j ≥ 0.

Assume that (µ_j, φ_j, ξ_j)_j is such a system, set V = N(T)^⊥ and let x ∈ X. Then the orthogonal projection π_V(x) of x on V has an expansion π_V(x) = Σ_j (x, φ_j)φ_j, and applying T to this expansion it follows that

    Tx = Tπ_V(x) = Σ_j µ_j(x, φ_j)ξ_j    (2.13)

with convergence pointwise on X. For φ ∈ X and ξ ∈ Y define the rank one operator S = φ ⊗ ξ as

    Sx = (x, φ)ξ ∈ Y,  x ∈ X.

Then the above expansion for T can be rewritten as

    T = Σ_j µ_j(φ_j ⊗ ξ_j),    (2.14)

where the series converges pointwise on X. Set

    T_n = Σ_{j<n} µ_j(φ_j ⊗ ξ_j)    (2.15)

and let x ∈ X. Using (i) and the orthonormality of the ξ_j we have

    ‖(T − T_n)x‖² = ‖Σ_{j≥n} µ_j(x, φ_j)ξ_j‖² = Σ_{j≥n} µ_j²(x, φ_j)² ≤ µ_n² Σ_{j≥n} (x, φ_j)² ≤ µ_n²‖x‖².

This shows that

    ‖T − T_n‖ ≤ µ_n    (2.16)

in operator norm. Letting x = φ_n above we see that we actually have equality. Consequently, if µ_n → 0, then the series (2.14) converges in operator norm and hence T is compact.

Not every operator T ∈ B(X, Y) has a singular system. However, if X = Y and T ∈ B(X) is compact, selfadjoint and positive, let {φ_j} be the eigenvectors associated with the nonzero eigenvalues λ_j of T arranged in decreasing order. Then (µ_j, φ_j, ξ_j)_j with µ_j = λ_j and ξ_j = φ_j is a singular system for T. This is exactly the content of Theorem 2.5.2.

Now we generalize this fact to all compact operators T ∈ K(X, Y):

Theorem 2.6.1 Let T : X → Y be a compact operator, set A = T*T, note that A is compact and selfadjoint on X, and let {φ_j} be the eigenvectors associated with the nonzero eigenvalues λ_j of A arranged in decreasing order. Then µ_j = √λ_j and ξ_j = µ_j⁻¹Tφ_j defines a singular system (µ_j, φ_j, ξ_j)_j for T. We have µ_n → 0 and hence the series (2.14) converges in operator norm. In particular T is the limit of finite rank operators.

Proof. Note first that N(A) = N(T) according to Proposition 2.2.1. Thus the φ_j are an ON-basis for N(T)^⊥. By definition of (µ_j, φ_j, ξ_j) we have Tφ_j = µ_jξ_j and T*Tφ_j = µ_j²φ_j, and this implies that T*ξ_j = µ_jφ_j. We claim that {ξ_j} is an ON-basis for N(T*)^⊥. Indeed, for j, k ≥ 0 we have

    (ξ_j, ξ_k) = (µ_j⁻¹Tφ_j, µ_k⁻¹Tφ_k) = (µ_jµ_k)⁻¹(T*Tφ_j, φ_k) = δ_jk.    (2.17)

Thus {ξ_j} ⊆ R(T) ⊆ N(T*)^⊥ =: W is an orthonormal system. We claim that this system spans all of W. Let w ∈ W and assume that w ⊥ ξ_j, for all j ≥ 0. Then T*w ∈ R(T*) ⊆ N(T)^⊥ and

    (T*w, φ_j) = (w, Tφ_j) = µ_j(w, ξ_j) = 0,

for all j ≥ 0. Since the {φ_j} are an ON-basis for N(T)^⊥ it follows that T*w = 0, that is, w ∈ N(T*) = W^⊥. Thus w ∈ W ∩ W^⊥ = {0}. This shows that the orthonormal system {ξ_j} in N(T*)^⊥ is complete.

Remark. If T : X → Y is any bounded linear operator and (φ_j) an ON-basis for V = N(T)^⊥, then the expansion (2.1) is valid, and applying T to this expansion yields Tx = Σ_j (x, φ_j)Tφ_j. What makes the expansion (2.13) interesting is the additional information contained in the singular system for T.

Remark (Adjoint). Recall that T** = T. If (µ_j, φ_j, ξ_j)_j is a singular system for T, then (µ_j, ξ_j, φ_j)_j is a singular system for T*, and so we have the expansion T* = Σ_j µ_j(ξ_j ⊗ φ_j). Thus if T is compact then so is the adjoint T*.

Remark (Range). The expansion (2.13) allows us to work with ON-bases just as in the case of a compact selfadjoint operator. As an example we determine the range R(T), that is, we study the equation

    Tx = y.    (2.18)

Fix y ∈ Y. If a solution exists, then y ∈ R(T) ⊆ N(T*)^⊥. Now assume that y ∈ N(T*)^⊥. Then we have an expansion y = Σ_j (y, ξ_j)ξ_j. If there exists any solution x of (2.18) in X, then there exists a solution in V = N(T)^⊥ (in fact π_V(x) is one). Thus we may assume that x ∈ V and have an expansion

    x = Σ_j α_jφ_j.    (2.19)

Applying T to this yields

    Σ_j α_jµ_jξ_j = Tx = y = Σ_j (y, ξ_j)ξ_j.

It follows that we must have α_j = µ_j⁻¹(y, ξ_j). With this the series for x converges exactly if Σ_j µ_j⁻²(y, ξ_j)² < ∞. Consequently

    R(T) = { y ∈ N(T*)^⊥ : Σ_j µ_j⁻²(y, ξ_j)² < ∞ },    (2.20)

exactly as in the selfadjoint case.

2.7 Hilbert-Schmidt and trace class operators

Let X, Y be Hilbert spaces, T ∈ K(X, Y) compact and (µ_j, φ_j, ξ_j)_j a singular system for T. We know from Theorem 2.6.1 that T is the limit in operator norm of finite rank operators. Now we quantify the speed of convergence.

Approximation numbers. Set

    T_n = Σ_{j<n} µ_j(φ_j ⊗ ξ_j).

We have seen that then

    ‖T − T_n‖ ≤ µ_n.    (2.21)

On the other hand we show now that

    ‖T − S‖ ≥ µ_n    (2.22)

for each finite rank operator S ∈ F(X, Y) with rank(S) ≤ n. Set X_n = span({φ_0, ..., φ_n}) and note that

    ‖Tx‖ ≥ µ_n‖x‖, for all x ∈ X_n.    (2.23)

Indeed, let x ∈ X_n. Then x = Σ_{j≤n} (x, φ_j)φ_j and so Tx = Σ_{j≤n} µ_j(x, φ_j)ξ_j. It follows that

    ‖Tx‖² = Σ_{j≤n} µ_j²(x, φ_j)² ≥ µ_n² Σ_{j≤n} (x, φ_j)² = µ_n²‖x‖².

Now let S ∈ F(X, Y) with dim(R(S)) ≤ n. Then S is not one to one on X_n and so there exists a unit vector u ∈ X_n with Su = 0. Using (2.23) we have

    ‖(T − S)u‖ = ‖Tu‖ ≥ µ_n.

Thus ‖T − S‖ ≥ µ_n. The quantities

    a_n(T) := inf{ ‖T − S‖ : S ∈ F(X, Y), rank(S) ≤ n },  n ≥ 0,    (2.24)

are called the approximation numbers of T. Here a_0(T) = ‖T‖. The estimates (2.21) and (2.22) show that

    µ_n = a_n(T)    (2.25)

and that the operator S = T_n provides the best approximation of T in the operator norm among all operators of rank at most n. In particular this shows that the numbers µ_n in a singular system for T are uniquely determined by T and do not depend on the singular system. The µ_n are called the singular values of T. Obviously the vectors φ_n and ξ_n in a singular system for T are not uniquely determined. Consider the selfadjoint case and note that there are many ways to extract an orthonormal basis from each eigenspace of T.

The approximation numbers a_n(T) are defined for each bounded linear operator T ∈ B(X, Y). T is compact if and only if a_n(T) → 0, as n → ∞, and this is the only case of interest. In this case we have a_n(T) = µ_n, where the µ_j are the singular values of T (the square roots of the eigenvalues of T*T).

For each bounded linear operator T ∈ B(X, Y) let

    ‖T‖_p = (Σ_n a_n(T)^p)^{1/p}

and let

    S_p(X, Y) = { T ∈ B(X, Y) : ‖T‖_p < ∞ }.

Clearly each T ∈ S_p(X, Y) is compact. One can show that S_p(X, Y) is a subspace of B(X, Y) which is complete in the norm ‖·‖_p, but we won't need this result. We are only interested in the cases p = 1, 2. We now assume that T ∈ K(X, Y) is compact and (µ_j, φ_j, ξ_j)_j is a singular system for T.

Hilbert-Schmidt operators. The operator T is called a Hilbert-Schmidt operator if T ∈ S_2(X, Y), that is,

    ‖T‖_2² := Σ_n a_n(T)² = Σ_n µ_n² < ∞.

Proposition 2.7.1 If T ∈ K(X, Y) is compact and {e_α} is any ON-basis for X, then ‖T‖_2² = Σ_α ‖Te_α‖².

Remark. It follows that T is a Hilbert-Schmidt operator if and only if Σ_α ‖Te_α‖² < ∞, for some ON-basis {e_α} of X, and in this case the sum is independent of the choice of the basis {e_α}.
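A quick numerical check of Proposition 2.7.1 in the matrix case (the random matrix and the random orthonormal basis are arbitrary choices):

    import numpy as np

    # The Hilbert-Schmidt norm squared equals both sum_n mu_n^2 over the
    # singular values and sum_alpha ||T e_alpha||^2 for any ON-basis.

    rng = np.random.default_rng(4)
    T = rng.standard_normal((6, 5))

    mu = np.linalg.svd(T, compute_uv=False)
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random ON-basis of R^5

    hs_from_mu = np.sum(mu ** 2)
    hs_from_basis = sum(np.linalg.norm(T @ Q[:, k]) ** 2 for k in range(5))

    print(np.isclose(hs_from_mu, hs_from_basis))   # True: basis independent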

We do not assume that X is separable, that is, that the basis {e_α} is countable. However, since T and hence T* are compact, the entire action is essentially separable: both N(T)^⊥ and cl(R(T)) = N(T*)^⊥ have countable ON-bases.

Proof. Let {e_α} be any ON-basis for X. Since {ξ_k} is an ON-basis for cl(R(T)) = N(T*)^⊥, we have

    ‖Te_α‖² = Σ_k (Te_α, ξ_k)² = Σ_k (e_α, T*ξ_k)² = Σ_k µ_k²(e_α, φ_k)²,

for each α. It follows that

    Σ_α ‖Te_α‖² = Σ_k µ_k² (Σ_α (e_α, φ_k)²) = Σ_k µ_k²‖φ_k‖² = Σ_k µ_k² = ‖T‖_2².

Hilbert-Schmidt operators on the space X = L²(ν) of square integrable functions with respect to a finite measure ν will be characterized in terms of integration kernels below.

Trace class operators. We now assume that X and Y have the same orthogonal dimension, that is, ON-bases {e_α} of X and {f_α} of Y can be indexed with the same indices α. Because of the compactness of T we could even assume both spaces to be separable. The operator T is called a trace class operator if T ∈ S_1(X, Y), that is,

    ‖T‖_1 = Σ_n a_n(T) < ∞.

Recall that (µ_j, φ_j, ξ_j)_j denotes a singular system for T. It follows that ‖T‖_1 = Σ_n µ_n.

Proposition 2.7.2 Let T ∈ K(X, Y). Then

    ‖T‖_1 = max Σ_α |(Te_α, f_α)|,    (2.26)

where the maximum is taken over all ON-bases {e_α} of X and {f_α} of Y.

Proof. Let {e_α} and {f_α} be ON-bases of X and Y and write Te_α = Σ_j µ_j(e_α, φ_j)ξ_j. It follows that

    Σ_α |(Te_α, f_α)| ≤ Σ_α Σ_j µ_j |(e_α, φ_j)| |(ξ_j, f_α)|
        = Σ_j µ_j Σ_α |(e_α, φ_j)| |(ξ_j, f_α)|
        ≤ Σ_j µ_j (Σ_α (e_α, φ_j)²)^{1/2} (Σ_α (ξ_j, f_α)²)^{1/2}
        ≤ Σ_j µ_j ‖φ_j‖ ‖ξ_j‖ = Σ_k µ_k = ‖T‖_1.

On the other hand, if we enlarge the bases {φ_j} of N(T)^⊥ ⊆ X and {ξ_j} of cl(R(T)) ⊆ Y to ON-bases {e_α} of X and {f_α} of Y, then T vanishes on all e_α ∉ {φ_j} and the above sum becomes

    Σ_α |(Te_α, f_α)| = Σ_j (Tφ_j, ξ_j) = Σ_j µ_j = ‖T‖_1.

Thus T is a trace class operator if and only if the sum (2.26) is finite for all ON-bases {e_α} of X and {f_α} of Y.

Proposition 2.7.3 Let T ∈ K(X, Y). Then

    ‖T‖_1 = min Σ_n ‖x_n‖ ‖y_n‖,    (2.27)

where the minimum is taken over all sequences (x_n) ⊆ X and (y_n) ⊆ Y such that T = Σ_n x_n ⊗ y_n.

Proof. Assume that T = Σ_n x_n ⊗ y_n. Let {e_α} and {f_α} be ON-bases of X and Y and write Te_α = Σ_n (e_α, x_n)y_n. With this

    Σ_α |(Te_α, f_α)| ≤ Σ_n Σ_α |(e_α, x_n)| |(y_n, f_α)|
        ≤ Σ_n (Σ_α (e_α, x_n)²)^{1/2} (Σ_α (y_n, f_α)²)^{1/2}
        ≤ Σ_n ‖x_n‖ ‖y_n‖.

Taking the sup over all such bases {e_α} and {f_α} yields ‖T‖_1 ≤ Σ_n ‖x_n‖ ‖y_n‖. Conversely, if we set x_n = µ_nφ_n and y_n = ξ_n, then T = Σ_n x_n ⊗ y_n and Σ_n ‖x_n‖ ‖y_n‖ = Σ_n µ_n = ‖T‖_1.

Thus T is a trace class operator if and only if T has the form T = Σ_n x_n ⊗ y_n with Σ_n ‖x_n‖ ‖y_n‖ < ∞.

Trace. Assume now that X = Y and T ∈ K(X) is a trace class operator. Then we define the trace of T as

    tr(T) = Σ_α (Te_α, e_α),    (2.28)

where {e_α} is an ON-basis for X. The series converges absolutely, but we have to show that the sum does not depend on the choice of the basis {e_α}. Fix a representation of T as

    T = Σ_n x_n ⊗ y_n  with  Σ_n ‖x_n‖ ‖y_n‖ < ∞    (2.29)

and let {e_α} be an ON-basis for X. For n ≥ 0 write x_n = Σ_α (x_n, e_α)e_α and y_n = Σ_β (y_n, e_β)e_β. Entering this into the inner product (x_n, y_n) we obtain

    (x_n, y_n) = Σ_{α,β} (x_n, e_α)(y_n, e_β)(e_α, e_β) = Σ_α (x_n, e_α)(y_n, e_α).

Now, for each α, write Te_α = Σ_n (e_α, x_n)y_n. With this we have

    Σ_α (Te_α, e_α) = Σ_α Σ_n (e_α, x_n)(y_n, e_α).

Using Cauchy-Schwarz on the sums along α we see that the double series on the right is absolutely convergent. We can thus rearrange it to obtain

    Σ_α (Te_α, e_α) = Σ_n Σ_α (e_α, x_n)(y_n, e_α) = Σ_n (x_n, y_n).

This shows that the value of the sum Σ_α (Te_α, e_α) does not depend on the choice of basis {e_α}. Thus the trace tr(T) is well defined. But note that the representation (2.29) of T was also arbitrary. Thus

Proposition 2.7.4 Let T = Σ_n x_n ⊗ y_n with Σ_n ‖x_n‖ ‖y_n‖ < ∞. Then tr(T) = Σ_n (x_n, y_n).

We will see concrete examples of trace class operators and compute their traces in the treatment of reproducing kernel Hilbert spaces.
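A small sketch of Proposition 2.7.4 in matrix form (the rank one factors are arbitrary test data): since (x ⊗ y)v = (v, x)y corresponds to the matrix yxᵗ, the trace of a finite rank one sum is Σ_n (x_n, y_n).

    import numpy as np

    # T = sum_n x_n (x) y_n acts as v -> sum_n (v, x_n) y_n, i.e. it is the
    # matrix sum_n outer(y_n, x_n); its trace must equal sum_n (x_n, y_n).

    rng = np.random.default_rng(5)
    xs = [rng.standard_normal(4) for _ in range(3)]
    ys = [rng.standard_normal(4) for _ in range(3)]

    T = sum(np.outer(y, x) for x, y in zip(xs, ys))

    print(np.isclose(np.trace(T), sum(x @ y for x, y in zip(xs, ys))))  # True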

2.8 Inverse problems and regularization

Let X, Y be Hilbert spaces, T ∈ B(X, Y) a bounded linear operator, and w ∈ R(T). We will study the equation

    Tv = w    (2.30)

to be solved for v ∈ X (an inverse problem). Imagine that (2.30) arises from the theoretical study of some physical system and all ingredients are known with perfect precision. We call this problem the clean inverse problem. By assumption this problem has a solution.

Unfortunately we do not know the true data w. We have only a polluted version y of w and must instead solve the practical problem

    Tx = y,    (2.31)

where ‖y − w‖ is small. We do not know whether y ∈ R(T), that is, the equation (2.31) may not have a solution; nonetheless it is all that we have to work with. We are interested in the solution v of (2.30) and not in any solution x of (2.31). We reason as follows: each solution v of (2.30) is an approximate solution of (2.31), that is, ‖Tv − y‖ = ‖w − y‖ is small. Therefore let us seek x ∈ X such that ‖Tx − y‖ is small. Then

    ‖Tx − Tv‖ = ‖Tx − w‖ ≤ ‖Tx − y‖ + ‖y − w‖

will be small, and from this we hope to be able to conclude that ‖x − v‖ is small, that is, x is a good approximation of the solution v of the clean problem (2.30). We are thus led to replace (2.31) with the minimization problem

    min_{x∈X} ‖Tx − y‖.    (2.32)

Obviously each solution of (2.31) will be a minimizer of (2.32), but such solutions need not exist. Recall that our approach is based on the following reasoning: if ‖Tx − Tv‖ is small, then ‖x − v‖ is small. Since "small" is somewhat vague, we might want to require more strongly that ‖x − v‖ → 0 whenever ‖Tx − Tv‖ → 0. This implies that N(T) = {0} (thus T is invertible on R(T)) and that the inverse S = T⁻¹ : R(T) → X is continuous. In this case the problem (2.30) is called well posed, otherwise it is called ill posed. In short, the inverse problem (2.30) is ill posed if

(i) the solution v is not unique, or
(ii) it is unique but is not a continuous function of w ∈ R(T).

The usual definition of well posedness requires that the solution v of (2.30) exist for all w ∈ Y, be uniquely determined and be a continuous function of w. The continuity requirement is then superfluous: it follows from the first two (N(T) = {0} and R(T) = Y) by the Open Mapping Theorem. This fact is usually ignored and creates an awkward situation, as the continuous dependence of the solution on the right hand side is the very essence of well posedness. We are not following this lead here, since it is unreasonable to require the existence of a solution of (2.30) for each right hand side w ∈ Y. In practice we are only dealing with one particular right hand side w ∈ Y, and the natural assumption is that (2.30) have a solution for the given right hand side w, that is, w ∈ R(T). It does not make much sense to seek the solution of a problem which does not have a solution by virtue of the theory underlying the problem.

Condition (ii) is the crucial one. The uniqueness of the solution can always be enforced if we replace X with N(T)^⊥. Unfortunately, in many cases of practical interest the operator T is compact and the space Y infinite dimensional. In this case the continuous dependence in (ii) is guaranteed to fail.

We now devise a strategy to cope with the ill posedness of (2.30) rephrased as the minimization problem

    min_{x∈X} ‖Tx − y‖.    (2.33)

Note that we do not assume that T is compact; it is a general bounded linear operator T ∈ B(X, Y). The minimization problem (2.33) does not always have a solution. This merely means that no minimum is attained. We can still get arbitrarily close to the infimum and thus hope to find x such that ‖Tx − y‖ is small. However, in this case, it is not clear how to find such x. On the other hand, if minimizers do exist we can hope that they have special properties which make them easy to find. Indeed, we will see below that they can be obtained as solutions of the so called normal equation.

Let W = cl(R(T)) = N(T*)^⊥. Then the orthogonal projection b = π_W(y) is the unique element u ∈ W which minimizes the distance ‖y − u‖. Note that T*b = T*y. Since b can be approximated arbitrarily closely with elements in the range of T, an element x ∈ X minimizes (2.33) if and only if Tx = b, and

since the operator T* is one to one on W (Proposition 2.2.1), this is equivalent with

    T*Tx = T*b = T*y.    (2.34)

This equation is called the normal equation associated with the minimization problem (2.33). It has a solution exactly if b ∈ R(T). This condition is automatically satisfied if T has closed range. In this case the minimizers of (2.33) are exactly the solutions of (2.34), plus arbitrary vectors in N(T*T) = N(T). The unique solution in N(T)^⊥ is the solution with minimal norm.

Example (Polynomial least squares interpolation). If X and Y are finite dimensional, then R(T) = cl(R(T)), and thus b ∈ R(T) is automatically satisfied, that is, the normal equations do have a solution. Consider the following example: let n ≥ 1 and assume we are given n pairs of points (x_1, y_1), ..., (x_n, y_n) ∈ R². We want to find a polynomial Q of fixed degree k which minimizes the squared error

    Σ_{j=1}^n |Q(x_j) − y_j|² = ‖Q_x − y‖²,

where Q_x = (Q(x_1), ..., Q(x_n)) and y = (y_1, ..., y_n) are vectors in Rⁿ. The error is computed in the Euclidean norm of Rⁿ, and with this norm Rⁿ is a Hilbert space. Write Q(x) = a_0 + a_1x + ... + a_kx^k and identify the polynomial Q with the vector a = (a_0, ..., a_k) ∈ R^{k+1} of its coefficients. With this identification Q_x = Ta, where T : R^{k+1} → Rⁿ is the linear operator given by the matrix

    T = ( 1  x_1  x_1²  ...  x_1^k )
        ( 1  x_2  x_2²  ...  x_2^k )
        ( ...                      )
        ( 1  x_n  x_n²  ...  x_n^k )

and the normal equations T*Ta = T*y can be solved for the coefficients a of Q. In the special case of linear least squares interpolation (k = 1) we have

    T*T = ( n      Σx_k  )
          ( Σx_k   Σx_k² )

and the normal equations assume the form

    a_0 n + a_1 Σx_k = Σy_k,
    a_0 Σx_k + a_1 Σx_k² = Σx_ky_k.

Divide by n and write Ex = n⁻¹Σx_k, Ey = n⁻¹Σy_k, Exx = n⁻¹Σx_k² and Exy = n⁻¹Σx_ky_k ("E" for expected value) to obtain

    a_0 + a_1 Ex = Ey,
    a_0 Ex + a_1 Exx = Exy,

with solution

    a_0 = (Ex·Exy − Ey·Exx) / ((Ex)² − Exx)  and  a_1 = (Ex·Ey − Exy) / ((Ex)² − Exx).

2.8.1 Regularization

In general the normal equation (2.34) suffers from the same drawbacks as the inverse problem (2.30). If the operator T is compact, then so is T*T, and hence the normal equation (2.34) is an ill posed inverse problem. Now let α > 0 be a small positive number. Then the operator αI + T*T is invertible on X with bounded inverse (Proposition 2.3.5). If we replace the normal equation (2.34) with

    (αI + T*T)x = T*y    (2.35)

we have a well posed problem with solution x = (αI + T*T)⁻¹T*y. The following proposition shows that x is the solution of a modified minimization problem:

Proposition. The solution x = (αI + T*T)⁻¹T*y of the regularized normal equation (2.35) is the unique minimizer of

    min_{x∈X} ( ‖Tx − y‖² + α‖x‖² ).    (2.36)

Proof. Let F := X × Y be the product space endowed with the inner product

    (a ⊕ b, u ⊕ v)_F := α(a, u)_X + (b, v)_Y,  a, u ∈ X, b, v ∈ Y,

with associated norm ‖x ⊕ y‖² := ‖y‖² + α‖x‖², x ∈ X, y ∈ Y.
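The chapter's regularization recipe is easy to try numerically. A hedged sketch (data, degree and α are illustrative assumptions) fits the polynomial example by the normal equation (2.34) and by its regularized version (2.35):

    import numpy as np

    # Fit a degree-k polynomial to polluted data by the normal equation and
    # by Tikhonov regularization; T is the Vandermonde matrix of the example.

    rng = np.random.default_rng(6)
    k, n, alpha = 5, 20, 1e-3
    x = np.linspace(-1, 1, n)
    y = np.sin(3 * x) + 0.05 * rng.standard_normal(n)   # polluted right hand side

    T = np.vander(x, k + 1, increasing=True)            # rows (1, x_j, ..., x_j^k)

    a_ls = np.linalg.solve(T.T @ T, T.T @ y)            # normal equation (2.34)
    a_reg = np.linalg.solve(alpha * np.eye(k + 1) + T.T @ T, T.T @ y)   # (2.35)

    # By (2.36), a_reg minimizes ||T a - y||^2 + alpha ||a||^2: the penalty
    # bounds the coefficient norm at the cost of a slightly larger residual.
    print(np.linalg.norm(a_ls), np.linalg.norm(a_reg))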


Chapter 17. Orthogonal Matrices and Symmetries of Space

Chapter 17. Orthogonal Matrices and Symmetries of Space Chapter 17. Orthogonal Matrices and Symmetries of Space Take a random matrix, say 1 3 A = 4 5 6, 7 8 9 and compare the lengths of e 1 and Ae 1. The vector e 1 has length 1, while Ae 1 = (1, 4, 7) has length

More information

Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

More information

Matrix Representations of Linear Transformations and Changes of Coordinates

Matrix Representations of Linear Transformations and Changes of Coordinates Matrix Representations of Linear Transformations and Changes of Coordinates 01 Subspaces and Bases 011 Definitions A subspace V of R n is a subset of R n that contains the zero element and is closed under

More information

Section 4.4 Inner Product Spaces

Section 4.4 Inner Product Spaces Section 4.4 Inner Product Spaces In our discussion of vector spaces the specific nature of F as a field, other than the fact that it is a field, has played virtually no role. In this section we no longer

More information

1 VECTOR SPACES AND SUBSPACES

1 VECTOR SPACES AND SUBSPACES 1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such

More information

CHAPTER IV - BROWNIAN MOTION

CHAPTER IV - BROWNIAN MOTION CHAPTER IV - BROWNIAN MOTION JOSEPH G. CONLON 1. Construction of Brownian Motion There are two ways in which the idea of a Markov chain on a discrete state space can be generalized: (1) The discrete time

More information

FUNCTIONAL ANALYSIS LECTURE NOTES CHAPTER 2. OPERATORS ON HILBERT SPACES

FUNCTIONAL ANALYSIS LECTURE NOTES CHAPTER 2. OPERATORS ON HILBERT SPACES FUNCTIONAL ANALYSIS LECTURE NOTES CHAPTER 2. OPERATORS ON HILBERT SPACES CHRISTOPHER HEIL 1. Elementary Properties and Examples First recall the basic definitions regarding operators. Definition 1.1 (Continuous

More information

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

More information

1 Norms and Vector Spaces

1 Norms and Vector Spaces 008.10.07.01 1 Norms and Vector Spaces Suppose we have a complex vector space V. A norm is a function f : V R which satisfies (i) f(x) 0 for all x V (ii) f(x + y) f(x) + f(y) for all x,y V (iii) f(λx)

More information

Recall that two vectors in are perpendicular or orthogonal provided that their dot

Recall that two vectors in are perpendicular or orthogonal provided that their dot Orthogonal Complements and Projections Recall that two vectors in are perpendicular or orthogonal provided that their dot product vanishes That is, if and only if Example 1 The vectors in are orthogonal

More information

Applied Linear Algebra I Review page 1

Applied Linear Algebra I Review page 1 Applied Linear Algebra Review 1 I. Determinants A. Definition of a determinant 1. Using sum a. Permutations i. Sign of a permutation ii. Cycle 2. Uniqueness of the determinant function in terms of properties

More information

Mathematical Methods of Engineering Analysis

Mathematical Methods of Engineering Analysis Mathematical Methods of Engineering Analysis Erhan Çinlar Robert J. Vanderbei February 2, 2000 Contents Sets and Functions 1 1 Sets................................... 1 Subsets.............................

More information

Chapter 6. Orthogonality

Chapter 6. Orthogonality 6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix Linear Algebra & Properties of the Covariance Matrix October 3rd, 2012 Estimation of r and C Let rn 1, rn, t..., rn T be the historical return rates on the n th asset. rn 1 rṇ 2 r n =. r T n n = 1, 2,...,

More information

Let H and J be as in the above lemma. The result of the lemma shows that the integral

Let H and J be as in the above lemma. The result of the lemma shows that the integral Let and be as in the above lemma. The result of the lemma shows that the integral ( f(x, y)dy) dx is well defined; we denote it by f(x, y)dydx. By symmetry, also the integral ( f(x, y)dx) dy is well defined;

More information

Undergraduate Notes in Mathematics. Arkansas Tech University Department of Mathematics

Undergraduate Notes in Mathematics. Arkansas Tech University Department of Mathematics Undergraduate Notes in Mathematics Arkansas Tech University Department of Mathematics An Introductory Single Variable Real Analysis: A Learning Approach through Problem Solving Marcel B. Finan c All Rights

More information

1 Sets and Set Notation.

1 Sets and Set Notation. LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most

More information

Lecture Notes on Measure Theory and Functional Analysis

Lecture Notes on Measure Theory and Functional Analysis Lecture Notes on Measure Theory and Functional Analysis P. Cannarsa & T. D Aprile Dipartimento di Matematica Università di Roma Tor Vergata cannarsa@mat.uniroma2.it daprile@mat.uniroma2.it aa 2006/07 Contents

More information

1. Prove that the empty set is a subset of every set.

1. Prove that the empty set is a subset of every set. 1. Prove that the empty set is a subset of every set. Basic Topology Written by Men-Gen Tsai email: b89902089@ntu.edu.tw Proof: For any element x of the empty set, x is also an element of every set since

More information

T ( a i x i ) = a i T (x i ).

T ( a i x i ) = a i T (x i ). Chapter 2 Defn 1. (p. 65) Let V and W be vector spaces (over F ). We call a function T : V W a linear transformation form V to W if, for all x, y V and c F, we have (a) T (x + y) = T (x) + T (y) and (b)

More information

and s n (x) f(x) for all x and s.t. s n is measurable if f is. REAL ANALYSIS Measures. A (positive) measure on a measurable space

and s n (x) f(x) for all x and s.t. s n is measurable if f is. REAL ANALYSIS Measures. A (positive) measure on a measurable space RAL ANALYSIS A survey of MA 641-643, UAB 1999-2000 M. Griesemer Throughout these notes m denotes Lebesgue measure. 1. Abstract Integration σ-algebras. A σ-algebra in X is a non-empty collection of subsets

More information

Chapter 5. Banach Spaces

Chapter 5. Banach Spaces 9 Chapter 5 Banach Spaces Many linear equations may be formulated in terms of a suitable linear operator acting on a Banach space. In this chapter, we study Banach spaces and linear operators acting on

More information

Inner product. Definition of inner product

Inner product. Definition of inner product Math 20F Linear Algebra Lecture 25 1 Inner product Review: Definition of inner product. Slide 1 Norm and distance. Orthogonal vectors. Orthogonal complement. Orthogonal basis. Definition of inner product

More information

MATH 551 - APPLIED MATRIX THEORY

MATH 551 - APPLIED MATRIX THEORY MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

More information

Math 550 Notes. Chapter 7. Jesse Crawford. Department of Mathematics Tarleton State University. Fall 2010

Math 550 Notes. Chapter 7. Jesse Crawford. Department of Mathematics Tarleton State University. Fall 2010 Math 550 Notes Chapter 7 Jesse Crawford Department of Mathematics Tarleton State University Fall 2010 (Tarleton State University) Math 550 Chapter 7 Fall 2010 1 / 34 Outline 1 Self-Adjoint and Normal Operators

More information

MEASURE AND INTEGRATION. Dietmar A. Salamon ETH Zürich

MEASURE AND INTEGRATION. Dietmar A. Salamon ETH Zürich MEASURE AND INTEGRATION Dietmar A. Salamon ETH Zürich 12 May 2016 ii Preface This book is based on notes for the lecture course Measure and Integration held at ETH Zürich in the spring semester 2014. Prerequisites

More information

Methods for Finding Bases

Methods for Finding Bases Methods for Finding Bases Bases for the subspaces of a matrix Row-reduction methods can be used to find bases. Let us now look at an example illustrating how to obtain bases for the row space, null space,

More information

2.3 Convex Constrained Optimization Problems

2.3 Convex Constrained Optimization Problems 42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets.

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets. MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets. Norm The notion of norm generalizes the notion of length of a vector in R n. Definition. Let V be a vector space. A function α

More information

Some stability results of parameter identification in a jump diffusion model

Some stability results of parameter identification in a jump diffusion model Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss

More information

1. Let X and Y be normed spaces and let T B(X, Y ).

1. Let X and Y be normed spaces and let T B(X, Y ). Uppsala Universitet Matematiska Institutionen Andreas Strömbergsson Prov i matematik Funktionalanalys Kurs: NVP, Frist. 2005-03-14 Skrivtid: 9 11.30 Tillåtna hjälpmedel: Manuella skrivdon, Kreyszigs bok

More information

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013 Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

More information

Introduction to Matrix Algebra

Introduction to Matrix Algebra Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

More information

Numerical Methods I Eigenvalue Problems

Numerical Methods I Eigenvalue Problems Numerical Methods I Eigenvalue Problems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001, Fall 2010 September 30th, 2010 A. Donev (Courant Institute)

More information

Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

More information

4: EIGENVALUES, EIGENVECTORS, DIAGONALIZATION

4: EIGENVALUES, EIGENVECTORS, DIAGONALIZATION 4: EIGENVALUES, EIGENVECTORS, DIAGONALIZATION STEVEN HEILMAN Contents 1. Review 1 2. Diagonal Matrices 1 3. Eigenvectors and Eigenvalues 2 4. Characteristic Polynomial 4 5. Diagonalizability 6 6. Appendix:

More information

Introduction: Overview of Kernel Methods

Introduction: Overview of Kernel Methods Introduction: Overview of Kernel Methods Statistical Data Analysis with Positive Definite Kernels Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University

More information

Moving Least Squares Approximation

Moving Least Squares Approximation Chapter 7 Moving Least Squares Approimation An alternative to radial basis function interpolation and approimation is the so-called moving least squares method. As we will see below, in this method the

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2014 Timo Koski () Mathematisk statistik 24.09.2014 1 / 75 Learning outcomes Random vectors, mean vector, covariance

More information

Differential Operators and their Adjoint Operators

Differential Operators and their Adjoint Operators Differential Operators and their Adjoint Operators Differential Operators inear functions from E n to E m may be described, once bases have been selected in both spaces ordinarily one uses the standard

More information

Finite Dimensional Hilbert Spaces and Linear Inverse Problems

Finite Dimensional Hilbert Spaces and Linear Inverse Problems Finite Dimensional Hilbert Spaces and Linear Inverse Problems ECE 174 Lecture Supplement Spring 2009 Ken Kreutz-Delgado Electrical and Computer Engineering Jacobs School of Engineering University of California,

More information

Lecture 1: Schur s Unitary Triangularization Theorem

Lecture 1: Schur s Unitary Triangularization Theorem Lecture 1: Schur s Unitary Triangularization Theorem This lecture introduces the notion of unitary equivalence and presents Schur s theorem and some of its consequences It roughly corresponds to Sections

More information

University of Lille I PC first year list of exercises n 7. Review

University of Lille I PC first year list of exercises n 7. Review University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients

More information

Quotient Rings and Field Extensions

Quotient Rings and Field Extensions Chapter 5 Quotient Rings and Field Extensions In this chapter we describe a method for producing field extension of a given field. If F is a field, then a field extension is a field K that contains F.

More information

Notes on Symmetric Matrices

Notes on Symmetric Matrices CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

More information

Continuity of the Perron Root

Continuity of the Perron Root Linear and Multilinear Algebra http://dx.doi.org/10.1080/03081087.2014.934233 ArXiv: 1407.7564 (http://arxiv.org/abs/1407.7564) Continuity of the Perron Root Carl D. Meyer Department of Mathematics, North

More information

Metric Spaces Joseph Muscat 2003 (Last revised May 2009)

Metric Spaces Joseph Muscat 2003 (Last revised May 2009) 1 Distance J Muscat 1 Metric Spaces Joseph Muscat 2003 (Last revised May 2009) (A revised and expanded version of these notes are now published by Springer.) 1 Distance A metric space can be thought of

More information

3. INNER PRODUCT SPACES

3. INNER PRODUCT SPACES . INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

More information

LINEAR ALGEBRA. September 23, 2010

LINEAR ALGEBRA. September 23, 2010 LINEAR ALGEBRA September 3, 00 Contents 0. LU-decomposition.................................... 0. Inverses and Transposes................................. 0.3 Column Spaces and NullSpaces.............................

More information

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

More information

F. ABTAHI and M. ZARRIN. (Communicated by J. Goldstein)

F. ABTAHI and M. ZARRIN. (Communicated by J. Goldstein) Journal of Algerian Mathematical Society Vol. 1, pp. 1 6 1 CONCERNING THE l p -CONJECTURE FOR DISCRETE SEMIGROUPS F. ABTAHI and M. ZARRIN (Communicated by J. Goldstein) Abstract. For 2 < p

More information

0 <β 1 let u(x) u(y) kuk u := sup u(x) and [u] β := sup

0 <β 1 let u(x) u(y) kuk u := sup u(x) and [u] β := sup 456 BRUCE K. DRIVER 24. Hölder Spaces Notation 24.1. Let Ω be an open subset of R d,bc(ω) and BC( Ω) be the bounded continuous functions on Ω and Ω respectively. By identifying f BC( Ω) with f Ω BC(Ω),

More information

INVARIANT METRICS WITH NONNEGATIVE CURVATURE ON COMPACT LIE GROUPS

INVARIANT METRICS WITH NONNEGATIVE CURVATURE ON COMPACT LIE GROUPS INVARIANT METRICS WITH NONNEGATIVE CURVATURE ON COMPACT LIE GROUPS NATHAN BROWN, RACHEL FINCK, MATTHEW SPENCER, KRISTOPHER TAPP, AND ZHONGTAO WU Abstract. We classify the left-invariant metrics with nonnegative

More information

Introduction to Topology

Introduction to Topology Introduction to Topology Tomoo Matsumura November 30, 2010 Contents 1 Topological spaces 3 1.1 Basis of a Topology......................................... 3 1.2 Comparing Topologies.......................................

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Linear Maps. Isaiah Lankham, Bruno Nachtergaele, Anne Schilling (February 5, 2007)

Linear Maps. Isaiah Lankham, Bruno Nachtergaele, Anne Schilling (February 5, 2007) MAT067 University of California, Davis Winter 2007 Linear Maps Isaiah Lankham, Bruno Nachtergaele, Anne Schilling (February 5, 2007) As we have discussed in the lecture on What is Linear Algebra? one of

More information

Linear Algebra. A vector space (over R) is an ordered quadruple. such that V is a set; 0 V ; and the following eight axioms hold:

Linear Algebra. A vector space (over R) is an ordered quadruple. such that V is a set; 0 V ; and the following eight axioms hold: Linear Algebra A vector space (over R) is an ordered quadruple (V, 0, α, µ) such that V is a set; 0 V ; and the following eight axioms hold: α : V V V and µ : R V V ; (i) α(α(u, v), w) = α(u, α(v, w)),

More information

Lecture L3 - Vectors, Matrices and Coordinate Transformations

Lecture L3 - Vectors, Matrices and Coordinate Transformations S. Widnall 16.07 Dynamics Fall 2009 Lecture notes based on J. Peraire Version 2.0 Lecture L3 - Vectors, Matrices and Coordinate Transformations By using vectors and defining appropriate operations between

More information

1 if 1 x 0 1 if 0 x 1

1 if 1 x 0 1 if 0 x 1 Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

More information

Lecture 18 - Clifford Algebras and Spin groups

Lecture 18 - Clifford Algebras and Spin groups Lecture 18 - Clifford Algebras and Spin groups April 5, 2013 Reference: Lawson and Michelsohn, Spin Geometry. 1 Universal Property If V is a vector space over R or C, let q be any quadratic form, meaning

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

More information

n k=1 k=0 1/k! = e. Example 6.4. The series 1/k 2 converges in R. Indeed, if s n = n then k=1 1/k, then s 2n s n = 1 n + 1 +...

n k=1 k=0 1/k! = e. Example 6.4. The series 1/k 2 converges in R. Indeed, if s n = n then k=1 1/k, then s 2n s n = 1 n + 1 +... 6 Series We call a normed space (X, ) a Banach space provided that every Cauchy sequence (x n ) in X converges. For example, R with the norm = is an example of Banach space. Now let (x n ) be a sequence

More information

[1] Diagonal factorization

[1] Diagonal factorization 8.03 LA.6: Diagonalization and Orthogonal Matrices [ Diagonal factorization [2 Solving systems of first order differential equations [3 Symmetric and Orthonormal Matrices [ Diagonal factorization Recall:

More information