2. Linear Minimum Mean Squared Error Estimation

2.1. Linear minimum mean squared error estimators

Situation considered:
- A random sequence X(1), ..., X(M) whose realizations can be observed.
- A random variable Y which has to be estimated.
- We seek an estimate of Y with a linear estimator of the form:
  $\hat{Y} = h_0 + \sum_{m=1}^{M} h_m X(m)$
- A measure of the goodness of $\hat{Y}$ is the mean squared error (MSE): $E[(\hat{Y} - Y)^2]$.

Covariance and variance of random variables:
Let U and V denote two random variables with expectations $\mu_U \equiv E[U]$ and $\mu_V \equiv E[V]$.
- The covariance of U and V is defined to be:
  $\Sigma_{UV} \equiv E[(U - \mu_U)(V - \mu_V)] = E[UV] - \mu_U \mu_V$
- The variance of U is defined to be:
  $\sigma_U^2 \equiv E[(U - \mu_U)^2] = E[U^2] - (\mu_U)^2 = \Sigma_{UU}$

Let $U \equiv [U(1), ..., U(M)]^T$ and $V \equiv [V(1), ..., V(M')]^T$ denote two random vectors. The covariance matrix of U and V is defined as
  $\Sigma_{UV} \equiv \begin{bmatrix} \Sigma_{U(1)V(1)} & \cdots & \Sigma_{U(1)V(M')} \\ \vdots & & \vdots \\ \Sigma_{U(M)V(1)} & \cdots & \Sigma_{U(M)V(M')} \end{bmatrix}$
A direct way to obtain $\Sigma_{UV}$:
  $\Sigma_{UV} = E[(U - \mu_U)(V - \mu_V)^T] = E[U V^T] - \mu_U (\mu_V)^T$
where
  $\mu_U \equiv E[U] = [E[U(1)], ..., E[U(M)]]^T$ and $\mu_V \equiv E[V]$.

Examples: $U = X \equiv [X(1), ..., X(M)]^T$ and $V = Y$. In the sequel we shall frequently make use of the following covariance matrix and vector:
(i) $\Sigma_{XX} = E[(X - \mu_X)(X - \mu_X)^T] = \begin{bmatrix} \sigma_{X(1)}^2 & \cdots & \Sigma_{X(1)X(M)} \\ \vdots & & \vdots \\ \Sigma_{X(M)X(1)} & \cdots & \sigma_{X(M)}^2 \end{bmatrix}$
(ii) $\Sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = [\Sigma_{X(1)Y}, ..., \Sigma_{X(M)Y}]^T$

Linear minimum mean squared error estimator (LMMSEE):
A LMMSEE of Y is a linear estimator, i.e. an estimator of the form $\hat{Y} = h_0 + \sum_{m=1}^{M} h_m X(m)$, which minimizes the MSE $E[(\hat{Y} - Y)^2]$.
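The following is a minimal numerical sketch (not part of the original notes), assuming Python with numpy; the generating model, sample size and coefficient values are illustrative. It estimates the covariance matrix $\Sigma_{XX}$ and covariance vector $\Sigma_{XY}$ defined above from simulated realizations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: M = 3 observed variables X(1), X(2), X(3) and a target Y
# correlated with them.  N realizations are drawn to estimate the covariance
# quantities Sigma_XX and Sigma_XY defined above.
N, M = 100_000, 3
A = rng.normal(size=(M, M))
X = rng.normal(size=(N, M)) @ A.T                          # correlated observations, one row per realization
Y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=N)    # target variable (assumed model)

mu_X = X.mean(axis=0)                          # sample mean vector of X
mu_Y = Y.mean()                                # sample mean of Y

Sigma_XX = (X - mu_X).T @ (X - mu_X) / N       # M x M sample covariance matrix
Sigma_XY = (X - mu_X).T @ (Y - mu_Y) / N       # M-dimensional covariance vector

print(Sigma_XX)
print(Sigma_XY)
```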
A linear estimator is entirely determined by the (M+1)-dimensional vector $h \equiv [h_0, ..., h_M]^T$.

Orthogonality principle:
A necessary condition for $h \equiv [h_0, ..., h_M]^T$ to be the coefficient vector of the LMMSEE is that its entries fulfil the (M+1) identities:
  $E[Y - \hat{Y}] = E\big[Y - \big(h_0 + \sum_{m=1}^{M} h_m X(m)\big)\big] = 0$   (2.1a)
  $E[(Y - \hat{Y})X(j)] = E\big[\big(Y - \big(h_0 + \sum_{m=1}^{M} h_m X(m)\big)\big)X(j)\big] = 0$, j = 1, ..., M   (2.1b)
Because the coefficient vector of the LMMSEE minimizes $E[(\hat{Y} - Y)^2]$, its components must satisfy the set of equations:
  $\frac{\partial}{\partial h_j} E[(\hat{Y} - Y)^2] = 0$, j = 0, ..., M.
Notice that the two expressions in (2.1) can be rewritten as:
  $E[Y - \hat{Y}] = 0$   (2.2a)
  $E[(Y - \hat{Y})X(j)] = 0$, j = 1, ..., M   (2.2b)

Important consequences of the orthogonality principle:
  $E[(Y - \hat{Y})\hat{Y}] = 0$   (2.3a)
  $E[(Y - \hat{Y})^2] = E[Y^2] - E[\hat{Y}^2] = E[(Y - \hat{Y})Y]$   (2.3b)

Geometrical interpretation:
Let U and V denote two random variables with finite second moment, i.e. $E[U^2] < \infty$ and $E[V^2] < \infty$. Then, the expectation $E[UV]$ can be viewed as the scalar or inner product of U and V. Within this interpretation:
- U and V are uncorrelated, i.e. $E[UV] = 0$, if and only if they are orthogonal,
- $\sqrt{E[U^2]}$ is the norm (length) of U.
(Figure: Y, the estimate $\hat{Y}$ and the error $Y - \hat{Y}$ form a right triangle.)
Interpretation of both equations in (2.3):
- (2.3a): the estimation error $Y - \hat{Y}$ and the estimate $\hat{Y}$ are orthogonal.
- (2.3b): results from Pythagoras' theorem.

Computation of the coefficient vector of the LMMSEE:
The coefficients of the LMMSEE satisfy the relationships:
  $\mu_Y = h_0 + \sum_{m=1}^{M} h_m \mu_{X(m)} = h_0 + (h_-)^T \mu_X$
  $\Sigma_{XY} = \Sigma_{XX}\, h_-$
where $h_- \equiv [h_1, ..., h_M]^T$ and $X \equiv [X(1), ..., X(M)]^T$. Both identities follow by appropriately reformulating relations (2.1a) and (2.1b) and using a matrix notation for the latter one.
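A minimal numerical check of the orthogonality principle (again a sketch assuming numpy; the data-generating model is illustrative): the LMMSEE coefficients are computed from sample covariances and the conditions (2.2a), (2.2b) and (2.3a) are verified empirically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data with nonzero means; the LMMSEE coefficients are
# h_ = Sigma_XX^{-1} Sigma_XY and h_0 = mu_Y - h_^T mu_X.
N, M = 200_000, 3
X = rng.normal(size=(N, M)) @ rng.normal(size=(M, M))
Y = 1.5 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=N)

mu_X, mu_Y = X.mean(axis=0), Y.mean()
Sigma_XX = np.cov(X, rowvar=False, bias=True)
Sigma_XY = (X - mu_X).T @ (Y - mu_Y) / N

h_minus = np.linalg.solve(Sigma_XX, Sigma_XY)   # h_ = Sigma_XX^{-1} Sigma_XY
h0 = mu_Y - h_minus @ mu_X                      # from mu_Y = h0 + h_^T mu_X

Y_hat = h0 + X @ h_minus
err = Y - Y_hat
print(err.mean())                  # ~0, condition (2.2a)
print((err[:, None] * X).mean(0))  # ~0 for every j, condition (2.2b)
print((err * Y_hat).mean())        # ~0, consequence (2.3a)
```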
Thus, provided $(\Sigma_{XX})^{-1}$ exists, the coefficients of the LMMSEE are given by:
  $h_- = (\Sigma_{XX})^{-1} \Sigma_{XY}$   (2.4a)
  $h_0 = \mu_Y - (h_-)^T \mu_X = \mu_Y - (\Sigma_{XY})^T (\Sigma_{XX})^{-1} \mu_X$   (2.4b)

Residual error using a LMMSEE:
The MSE resulting when using a LMMSEE is
  $E[(\hat{Y} - Y)^2] = \sigma_Y^2 - (h_-)^T \Sigma_{XY}$   (2.5)
This follows from $E[(Y - \hat{Y})^2] = E[(Y - \hat{Y})Y] = E[Y^2] - E[\hat{Y}Y]$.

Example: Linear prediction of a WSS process
Let Y(n) denote a WSS process with
- zero mean, i.e. $E[Y(n)] = 0$,
- autocorrelation function $E[Y(n)Y(n+k)] = R_{YY}(k)$.
We seek the LMMSEE for the present value of Y(n) based on the M past observations Y(n-1), ..., Y(n-M) of the process. Hence,
- $Y = Y(n)$
- $X(m) = Y(n-m)$, m = 1, ..., M, i.e. $X = [Y(n-1), ..., Y(n-M)]^T$.
Because $\mu_Y = 0$ and $\mu_X = 0$, it follows from (2.4b) that $h_0 = 0$.
Computation of $\Sigma_{XY}$ and $\Sigma_{XX}$:
- $\Sigma_{XY} = [E[Y(n-1)Y(n)], ..., E[Y(n-M)Y(n)]]^T = [R_{YY}(1), ..., R_{YY}(M)]^T$
- $\Sigma_{XX} = \begin{bmatrix} E[Y(n-1)^2] & E[Y(n-1)Y(n-2)] & \cdots & E[Y(n-1)Y(n-M)] \\ E[Y(n-2)Y(n-1)] & E[Y(n-2)^2] & \cdots & E[Y(n-2)Y(n-M)] \\ \vdots & & & \vdots \\ E[Y(n-M)Y(n-1)] & E[Y(n-M)Y(n-2)] & \cdots & E[Y(n-M)^2] \end{bmatrix} = \begin{bmatrix} R_{YY}(0) & R_{YY}(1) & R_{YY}(2) & \cdots & R_{YY}(M-1) \\ R_{YY}(1) & R_{YY}(0) & R_{YY}(1) & \cdots & R_{YY}(M-2) \\ R_{YY}(2) & R_{YY}(1) & R_{YY}(0) & \cdots & R_{YY}(M-3) \\ \vdots & & & \ddots & \vdots \\ R_{YY}(M-1) & R_{YY}(M-2) & R_{YY}(M-3) & \cdots & R_{YY}(0) \end{bmatrix}$

2.2. Minimum mean squared error estimators

Conditional expectation:
Let U and V denote two random variables. The conditional expectation of V given that U = u is observed is defined to be $E[V|u] \equiv \int v\, p(v|u)\, dv$. Notice that $E[V|U]$ is a random variable. In the sequel we shall make use of the following important property of conditional expectations: $E[E[V|U]] = E[V]$.
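As an illustration of the linear prediction example, the following sketch (assumed, not part of the notes; numpy) solves (2.4a) for an AR(1) process, for which $R_{YY}(k) = \sigma_w^2 a^{|k|}/(1-a^2)$ is known in closed form; the optimal one-step predictor should then be h = [a, 0, ..., 0] with residual MSE equal to $\sigma_w^2$.

```python
import numpy as np

# Assumed example: one-step linear prediction of an AR(1) process
# Y(n) = a*Y(n-1) + W(n), whose autocorrelation is
# R_YY(k) = sigma_w2 * a^|k| / (1 - a^2).
a, sigma_w2, M = 0.8, 1.0, 4

def R_YY(k):
    return sigma_w2 * a ** abs(k) / (1.0 - a ** 2)

Sigma_XX = np.array([[R_YY(i - j) for j in range(M)] for i in range(M)])  # Toeplitz matrix of R_YY
Sigma_XY = np.array([R_YY(m) for m in range(1, M + 1)])                   # [R_YY(1), ..., R_YY(M)]

h = np.linalg.solve(Sigma_XX, Sigma_XY)   # (2.4a)
mse = R_YY(0) - h @ Sigma_XY              # residual error (2.5): sigma_Y^2 - h_^T Sigma_XY
print(h)    # ~[0.8, 0, 0, 0]
print(mse)  # ~sigma_w2 = 1.0
```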
Minimum mean squared error estimator (MMSEE):
The MMSEE of Y based on the observation of X(1), ..., X(M) is of the form:
  $\tilde{Y}(X(1), ..., X(M)) = E[Y \mid X(1), ..., X(M)]$
Hence if X(1) = x(1), ..., X(M) = x(M) is observed, then
  $\tilde{Y}(x(1), ..., x(M)) = E[Y \mid x(1), ..., x(M)] = \int y\, p(y \mid x(1), ..., x(M))\, dy$
Let $\hat{Y}$ denote an arbitrary estimator. Then,
  $E[(\hat{Y} - Y)^2] = E[((\hat{Y} - \tilde{Y}) + (\tilde{Y} - Y))^2] = E[(\hat{Y} - \tilde{Y})^2] + 2 E[(\hat{Y} - \tilde{Y})(\tilde{Y} - Y)] + E[(\tilde{Y} - Y)^2]$
We still have to prove that $E[(\hat{Y} - \tilde{Y})(\tilde{Y} - Y)] = 0$. Thus,
  $E[(\hat{Y} - Y)^2] = E[(\hat{Y} - \tilde{Y})^2] + E[(\tilde{Y} - Y)^2] \ge E[(\tilde{Y} - Y)^2]$
with equality if, and only if, $\hat{Y} = \tilde{Y}$.

Example: Multivariate Gaussian variables:
$[Y, X(1), ..., X(M)]^T \sim N(\mu, \Sigma)$ with
- $\mu \equiv [\mu_Y, \mu_{X(1)}, ..., \mu_{X(M)}]^T$
- $\Sigma = \begin{bmatrix} \sigma_Y^2 & (\Sigma_{XY})^T \\ \Sigma_{XY} & \Sigma_{XX} \end{bmatrix}$
From Equation (6.) in [Shanmugan] it follows that
  $\tilde{Y} = E[Y \mid X] = \mu_Y + (\Sigma_{XY})^T (\Sigma_{XX})^{-1} (X - \mu_X)$

Bivariate case: M = 1, X(1) = X
- $\Sigma_{XX} = \sigma_X^2$,
- $\Sigma_{XY} = \rho \sigma_X \sigma_Y$, where $\rho \equiv \frac{\Sigma_{XY}}{\sigma_X \sigma_Y}$ is the correlation coefficient of Y and X.
In this case,
  $\tilde{Y} = E[Y \mid X] = \mu_Y + \frac{\rho \sigma_Y}{\sigma_X}(X - \mu_X) = \underbrace{\mu_Y - \frac{\rho \sigma_Y}{\sigma_X}\mu_X}_{h_0} + \underbrace{\frac{\rho \sigma_Y}{\sigma_X}}_{h_1} X$
We can observe that $\tilde{Y}$ is linear, i.e. is the LMMSEE: $\hat{Y} = \tilde{Y}$ in the bivariate case. This is also true in the general multivariate Gaussian case. In fact, $\hat{Y} = \tilde{Y}$ if, and only if, $[Y, X(1), ..., X(M)]^T$ is a Gaussian random vector.
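A Monte Carlo sketch (assumed setup, not from the notes; numpy) illustrating the bivariate Gaussian case: the empirical conditional mean of Y for X near a chosen value $x_0$ is compared with the linear expression $\mu_Y + \frac{\rho \sigma_Y}{\sigma_X}(x_0 - \mu_X)$. All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_X, mu_Y, sigma_X, sigma_Y, rho = 1.0, -0.5, 2.0, 1.5, 0.7   # illustrative parameters

N = 1_000_000
X = mu_X + sigma_X * rng.normal(size=N)
# Generate Y jointly Gaussian with X and with correlation coefficient rho.
Y = mu_Y + rho * sigma_Y / sigma_X * (X - mu_X) \
    + sigma_Y * np.sqrt(1 - rho**2) * rng.normal(size=N)

# Empirical conditional mean of Y for observations with X close to x0.
x0 = 2.0
mask = np.abs(X - x0) < 0.05
print(Y[mask].mean())                                   # empirical E[Y | X ~ x0]
print(mu_Y + rho * sigma_Y / sigma_X * (x0 - mu_X))     # closed-form conditional mean
```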
2.3. Time-discrete Wiener filters

Problem: Estimation of a WSS random sequence Y(n) based on the observation of another sequence X(n). Without loss of generality we assume that $E[Y(n)] = E[X(n)] = 0$. The goodness of the estimator $\hat{Y}(n)$ is described by the MSE $E[(\hat{Y}(n) - Y(n))^2]$.

We distinguish between two cases:
- Prediction: $\hat{Y}(n)$ depends on one or several past observations of X(n) only, i.e. $\hat{Y}(n) = \hat{Y}(X(n_1), X(n_2), ...)$ with $n_1, n_2, ... < n$.
- Filtering: $\hat{Y}(n)$ depends on the present observation and/or one or many future observation(s) of X(n), i.e. $\hat{Y}(n) = \hat{Y}(X(n_1), X(n_2), ...)$ where at least one $n_i \ge n$. If all $n_i \le n$, the filter is causal; otherwise it is noncausal.

(Figure: sample timelines of x(n) and y(n) illustrating prediction versus filtering.)

Typical application: WSS signal embedded in additive white noise,
  $X(n) = Y(n) + W(n)$
where
- W(n) is a white noise sequence,
- Y(n) is a WSS process,
- Y(n) and W(n) are uncorrelated.
However, the theoretical treatment is more general as shown below.

2.3.1. Noncausal Wiener filters

Linear Minimum Mean Squared Error Filter:
We seek a linear filter
  $\hat{Y}(n) = \sum_{m=-\infty}^{\infty} h(m) X(n-m) = h(n) * X(n)$
which minimizes the MSE $E[(\hat{Y}(n) - Y(n))^2]$. Such a filter exists. It is called a Wiener filter in honour of its inventor.

Orthogonality principle (time domain):
The coefficients of a Wiener filter satisfy the conditions:
  $E[(Y(n) - \hat{Y}(n)) X(n-k)] = E\big[\big(Y(n) - \sum_{m=-\infty}^{\infty} h(m) X(n-m)\big) X(n-k)\big] = 0$, k = ..., -1, 0, 1, ...
It follows from these identities (see also (2.3)) that
  $E[(Y(n) - \hat{Y}(n))\, \hat{Y}(n)] = 0$
  $E[(Y(n) - \hat{Y}(n))^2] = E[Y(n)^2] - E[\hat{Y}(n)^2] = E[(Y(n) - \hat{Y}(n))\, Y(n)]$

With the definitions $R_{XX}(k) \equiv E[X(n)X(n+k)]$ and $R_{XY}(k) \equiv E[X(n)Y(n+k)]$ we can recast the orthogonality conditions as follows:
  $R_{XY}(k) = \sum_{m=-\infty}^{\infty} h(m) R_{XX}(k-m)$, k = ..., -1, 0, 1, ...
  $R_{XY}(k) = h(k) * R_{XX}(k)$   (Wiener-Hopf equation)
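The following sketch (assumed model, not from the notes; numpy) simulates the typical application X(n) = Y(n) + W(n) with an AR(1) signal and uncorrelated white noise, and checks empirically that $R_{XY}(k) = R_{YY}(k)$ while $R_{XX}(k)$ differs from $R_{YY}(k)$ only at lag k = 0, where the noise variance is added.

```python
import numpy as np

rng = np.random.default_rng(3)
N, a, sigma_w2 = 200_000, 0.8, 0.5    # illustrative parameters

e = rng.normal(size=N)
Y = np.zeros(N)
for n in range(1, N):                 # AR(1) signal process
    Y[n] = a * Y[n - 1] + e[n]
W = np.sqrt(sigma_w2) * rng.normal(size=N)   # white observation noise, uncorrelated with Y
X = Y + W

def corr(u, v, k):
    """Sample estimate of E[u(n) v(n+k)] for k >= 0."""
    return np.mean(u[: N - k] * v[k:]) if k > 0 else np.mean(u * v)

for k in range(4):
    # R_XY(k) ~ R_YY(k); R_XX(k) ~ R_YY(k) + sigma_w2 * delta(k)
    print(k, corr(X, Y, k), corr(Y, Y, k), corr(X, X, k))
```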
Orthogonality principle (frequency domain):
  $S_{XY}(f) = H(f)\, S_{XX}(f)$
where $S_{XY}(f) \equiv F\{R_{XY}(k)\}$.

Transfer function of the Wiener filter:
  $H(f) = \frac{S_{XY}(f)}{S_{XX}(f)}$

MSE of the Wiener filter (time-domain formulation):
  $E[(Y(n) - \hat{Y}(n))^2] = \sigma_Y^2 - \sum_{m=-\infty}^{\infty} h(m) R_{XY}(m)$
This follows from $E[(Y(n) - \hat{Y}(n))^2] = E[Y(n)^2] - E[\hat{Y}(n)Y(n)]$.

MSE of the Wiener filter (frequency-domain formulation):
We can rewrite the above identity as:
  $E[(Y(n) - \hat{Y}(n))^2] = R_{YY}(0) - \sum_{m=-\infty}^{\infty} h(m) R_{YX}(-m)$
$E[(Y(n) - \hat{Y}(n))^2]$ is the value p(0) of the sequence
  $p(k) = R_{YY}(k) - \sum_{m=-\infty}^{\infty} h(m) R_{YX}(k-m) = R_{YY}(k) - h(k) * R_{YX}(k)$
Hence,
  $P(f) = S_{YY}(f) - H(f) S_{YX}(f) = S_{YY}(f) - \frac{|S_{XY}(f)|^2}{S_{XX}(f)}$
  $E[(Y(n) - \hat{Y}(n))^2] = p(0) = \int_{-1/2}^{1/2} P(f)\, df$
  $E[(Y(n) - \hat{Y}(n))^2] = \int_{-1/2}^{1/2} \Big[ S_{YY}(f) - \frac{|S_{XY}(f)|^2}{S_{XX}(f)} \Big] df$

2.3.2. Causal Wiener filters

A. X(n) is a white noise.
We first assume that X(n) is a white noise with unit variance, i.e. $E[X(n)X(n+k)] = \delta(k)$.

Derivation of the causal Wiener filter from the noncausal Wiener filter:
Let us consider the noncausal Wiener filter
  $\hat{Y}(n) = \sum_{m=-\infty}^{\infty} h(m) X(n-m)$
whose transfer function is given by
  $H(f) = \frac{S_{XY}(f)}{S_{XX}(f)} = S_{XY}(f)$.
Then, the causal Wiener filter $\hat{Y}_c(n)$ results by cancelling the noncausal part of the noncausal Wiener filter:
  $\hat{Y}_c(n) = \sum_{m=0}^{\infty} h(m) X(n-m)$
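A frequency-domain sketch (assumed model, not from the notes; numpy): for X(n) = Y(n) + W(n) with Y an AR(1) signal and W white noise uncorrelated with Y, $S_{XY}(f) = S_{YY}(f)$ and $S_{XX}(f) = S_{YY}(f) + \sigma_W^2$, so $H(f) = S_{YY}(f)/(S_{YY}(f) + \sigma_W^2)$; the residual MSE is then obtained by numerically integrating P(f) over one period.

```python
import numpy as np

a, sigma_e2, sigma_w2 = 0.8, 1.0, 0.5          # illustrative parameters
f = np.linspace(-0.5, 0.5, 4001)               # normalized frequency grid
df = f[1] - f[0]

S_YY = sigma_e2 / np.abs(1 - a * np.exp(-2j * np.pi * f)) ** 2   # AR(1) signal spectrum
S_XX = S_YY + sigma_w2                          # observation spectrum
H = S_YY / S_XX                                 # Wiener filter transfer function (S_XY = S_YY here)

P = S_YY - H * S_YY                             # P(f) = S_YY(f) - H(f) S_YX(f), with S_YX = S_YY
mse = ((P[:-1] + P[1:]) / 2 * df).sum()         # trapezoidal approximation of the integral
R0 = ((S_YY[:-1] + S_YY[1:]) / 2 * df).sum()    # R_YY(0), MSE of the trivial estimator Y_hat(n) = 0

print(mse, R0)
```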
Sketch of the proof:
$\hat{Y}(n)$ can be written as
  $\hat{Y}(n) = \sum_{m=-\infty}^{-1} h(m) X(n-m) + \sum_{m=0}^{\infty} h(m) X(n-m)$
Because X(n) is a white noise, the causal part $V = \hat{Y}_c(n)$ and the noncausal part $U = \hat{Y}(n) - \hat{Y}_c(n)$ of $\hat{Y}(n)$ are orthogonal. It follows from this property that the error $Y(n) - \hat{Y}_c(n)$ is orthogonal to $\hat{Y}_c(n)$ and to every observation $X(n-k)$, $k \ge 0$, i.e. that $\hat{Y}_c(n)$ minimizes the MSE within the class of linear causal estimators.
(Figure: orthogonal decomposition of $\hat{Y}(n)$ into $U = \hat{Y}(n) - \hat{Y}_c(n)$ and $V = \hat{Y}_c(n)$, together with Y(n) and the errors $Y(n) - \hat{Y}(n)$ and $Y(n) - \hat{Y}_c(n)$.)

B. X(n) is an arbitrary WSS process whose spectrum satisfies the Paley-Wiener condition.
Usually, the above truncation procedure to obtain $\hat{Y}_c(n)$ does not apply, because U and V are correlated and therefore not orthogonal in the general case.

Causal whitening filter:
However, we can show (see the Spectral Decomposition Theorem below) that, provided $S_{XX}(f)$ satisfies the Paley-Wiener condition (see below), X(n) can be converted into an equivalent white noise sequence Z(n) with unit variance, $E[Z(n)Z(n+k)] = \delta(k)$, by filtering it with an appropriate causal filter g(n):
  X(n) --> [Causal Whitening Filter g(n)] --> Z(n)
This operation is called whitening and the filter g(n) is called a whitening filter. Z(n) is equivalent to X(n) in the sense that there exists another causal filter $\tilde{g}(n)$ so that $X(n) = \tilde{g}(n) * Z(n)$:
  Z(n) --> [$\tilde{g}(n)$] --> X(n)
Notice that, with $G(f) \equiv F\{g(n)\}$ and $\tilde{G}(f) \equiv F\{\tilde{g}(n)\}$,
  $|G(f)|^2 = S_{XX}(f)^{-1}$,  $|\tilde{G}(f)|^2 = S_{XX}(f)$   (2.6)
We shall see that a whitening filter exists such that $\tilde{G}(f) = G(f)^{-1}$.

Causal Wiener filter:
Making use of the result in Part A, the block diagram of the causal Wiener filter is
  X(n) --> [Causal Whitening Filter g(n)] --> Z(n) --> [Causal part of $F^{-1}\{S_{ZY}(f)\}$] --> $\hat{Y}_c(n)$
$S_{ZY}(f)$ is obtained from $S_{XY}(f)$ according to $S_{ZY}(f) = G(f) S_{XY}(f)$.
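The spectral factorization and whitening step can be made concrete for a case where the factors are known in closed form. The sketch below (an assumed example, not from the notes; numpy) uses an MA(1) process X(n) = e(n) + b e(n-1) with |b| < 1, whose spectrum factors as $S_{XX}(f) = |G^+(f)|^2$ with $G^+(f) = \sigma_e (1 + b e^{-j2\pi f})$; filtering X(n) with the causal inverse $1/G^+(f)$ should produce an approximately unit-variance white sequence Z(n).

```python
import numpy as np

rng = np.random.default_rng(4)
b, sigma_e = 0.6, 1.3       # illustrative MA(1) parameters, |b| < 1 (minimum phase)
N = 100_000

e = sigma_e * rng.normal(size=N)
X = e.copy()
X[1:] += b * e[:-1]         # MA(1) observation process X(n) = e(n) + b*e(n-1)

# Check the factorization |G+(f)|^2 = S_XX(f) on a few frequencies,
# using S_XX(f) = sigma_e^2*(1 + b^2) + 2*sigma_e^2*b*cos(2*pi*f) for an MA(1).
f = np.linspace(-0.5, 0.5, 5)
G_plus = sigma_e * (1 + b * np.exp(-2j * np.pi * f))
S_XX = sigma_e**2 * (1 + b**2) + 2 * sigma_e**2 * b * np.cos(2 * np.pi * f)
print(np.abs(G_plus) ** 2)
print(S_XX)

# Causal whitening: Z(f) = X(f)/G+(f)  <=>  z(n) = x(n)/sigma_e - b*z(n-1).
Z = np.zeros(N)
for n in range(N):
    Z[n] = X[n] / sigma_e - (b * Z[n - 1] if n > 0 else 0.0)

print(Z.var())                   # ~1: unit variance
print(np.mean(Z[:-1] * Z[1:]))   # ~0: lag-1 correlation (whiteness check)
```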
Hence, the block diagram of the causal Wiener filter is:
  X(n) --> [Whitening Filter g(n)] --> Z(n) --> [Causal part of $F^{-1}\{G(f) S_{XY}(f)\}$] --> $\hat{Y}_c(n)$

Spectral Decomposition Theorem:
Let $S_{XX}(f)$ satisfy the so-called Paley-Wiener condition:
  $\int_{-1/2}^{1/2} \log(S_{XX}(f))\, df > -\infty$
Then $S_{XX}(f)$ can be written as
  $S_{XX}(f) = G^+(f)\, G^-(f)$
with $G^+(f)$ and $G^-(f)$ satisfying
  $|G^+(f)|^2 = |G^-(f)|^2 = S_{XX}(f)$
Moreover, the sequences
  $g^+(n) \equiv F^{-1}\{G^+(f)\}$, $g^-(n) \equiv F^{-1}\{G^-(f)\}$,
  $g_1^+(n) \equiv F^{-1}\{1/G^+(f)\}$, $g_1^-(n) \equiv F^{-1}\{1/G^-(f)\}$
satisfy
  $g^+(n) = g_1^+(n) = 0$ for $n < 0$   (causal sequences)
  $g^-(n) = g_1^-(n) = 0$ for $n > 0$   (anticausal sequences)

Whitening filter (cont'd):
The sought whitening filter used to obtain Z(n) is $g(n) = g_1^+(n)$, and $\tilde{g}(n) = g^+(n)$. It can be easily verified that both sequences satisfy the identities in (2.6).

2.3.3. Finite Wiener filters

Finite linear filter:
  $\hat{Y}(n) = \sum_{m=-M_1}^{M_2} h(m) X(n-m)$

Wiener-Hopf equation:
By applying the orthogonality principle we obtain the Wiener-Hopf system of equations:
  $\Sigma_{XY} = \Sigma_{XX}\, h$
where
  $h \equiv [h(-M_1), ..., h(M_2)]^T$,  $\Sigma_{XY} \equiv [R_{XY}(-M_1), ..., R_{XY}(M_2)]^T$
and
  $\Sigma_{XX} = \begin{bmatrix} R_{XX}(0) & R_{XX}(1) & R_{XX}(2) & \cdots & R_{XX}(M_1+M_2) \\ R_{XX}(1) & R_{XX}(0) & R_{XX}(1) & \cdots & R_{XX}(M_1+M_2-1) \\ R_{XX}(2) & R_{XX}(1) & R_{XX}(0) & \cdots & R_{XX}(M_1+M_2-2) \\ \vdots & & & \ddots & \vdots \\ R_{XX}(M_1+M_2) & R_{XX}(M_1+M_2-1) & R_{XX}(M_1+M_2-2) & \cdots & R_{XX}(0) \end{bmatrix}$

Coefficient vector of the finite Wiener filter:
  $h = (\Sigma_{XX})^{-1} \Sigma_{XY}$
provided $\Sigma_{XX}$ is invertible.
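A closing numerical sketch (assumed setup, not from the notes; numpy) for the finite Wiener filter: taps m = -M1, ..., M2 are computed from the Wiener-Hopf system for the signal-in-white-noise model, with $R_{YY}$ taken from an AR(1) signal so that all required correlations are available in closed form.

```python
import numpy as np

# Assumed model: X(n) = Y(n) + W(n), Y an AR(1) signal with
# R_YY(k) = a^|k| / (1 - a^2), W white noise of variance sigma_w2 uncorrelated
# with Y, so that R_XY(k) = R_YY(k) and R_XX(k) = R_YY(k) + sigma_w2*delta(k).
a, sigma_w2 = 0.8, 0.5
M1, M2 = 2, 2                      # filter taps from -M1 to +M2

def R_YY(k):
    return a ** abs(k) / (1.0 - a ** 2)

def R_XX(k):
    return R_YY(k) + (sigma_w2 if k == 0 else 0.0)

taps = np.arange(-M1, M2 + 1)
Sigma_XX = np.array([[R_XX(i - j) for j in taps] for i in taps])  # Toeplitz matrix
Sigma_XY = np.array([R_YY(m) for m in taps])                      # R_XY(m) = R_YY(m)

h = np.linalg.solve(Sigma_XX, Sigma_XY)   # h = Sigma_XX^{-1} Sigma_XY
mse = R_YY(0) - h @ Sigma_XY              # residual MSE, cf. (2.5)

print("taps:", taps)
print("h   :", np.round(h, 4))
print("MSE :", float(mse))
```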