ECE 275B Homework # 3 Solutions Winter 2015
1. Kay.

   I) Use of the prior pdf p(θ) = δ(θ − θ₀) means that we are 100% certain that θ = θ₀. In this case one computes θ̂_MSE = θ̂_MAP = θ₀ regardless of the quality and quantity of the subsequently collected data. This provides a model of stubbornness.

   II) Contrariwise, an uninformative prior (typified by the uniform prior) says that we have no a priori knowledge of the parameter. In this case only the data is informative and we obtain θ̂_MAP = θ̂_MLE. (For symmetric, unimodal posterior densities we also have θ̂_MSE = θ̂_MAP = θ̂_MLE.)

   III) Parameters of the prior pdf (θ₀ in the case considered above) are so-called hyperparameters, i.e., meta-level parameters.

   Footnote 1: Note that θ₀ parameterizes the prior pdf and is assumed known and fixed in value. On the other hand, θ is variable and represents any admissible value of the unknown rv Θ(ω).

2. Kay.3

   Let X = {x[0], …, x[N−1]}, x̄ = (1/N) Σ_{k=0}^{N−1} x[k], and x_min = min_k x[k]. For the shifted-exponential data model p(x[k]|θ) = e^{−(x[k]−θ)} χ{x[k] > θ},

       p(x|θ) = Π_{k=0}^{N−1} p(x[k]|θ) = e^{−N(x̄−θ)} χ{θ < x_min}.

   Therefore, with the unit-exponential prior p(θ) = e^{−θ} χ{θ > 0},

       p(x, θ) = p(x|θ) p(θ) = e^{−N x̄ + (N−1)θ} χ{0 < θ < x_min}.

   From

       p(θ|X) = p(θ, X) / p(x) = p(θ, X) / ∫ p(θ, X) dθ,

   we obtain

       θ̂_MSE = E{θ|X} = ∫ θ p(θ|X) dθ
              = ∫₀^{x_min} θ e^{(N−1)θ} dθ / ∫₀^{x_min} e^{(N−1)θ} dθ
              = x_min / (1 − e^{−(N−1) x_min}) − 1/(N−1).

   This formula breaks down for N = 1, and that case must be separately derived.
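The closed-form posterior mean above is easy to sanity-check numerically. The following sketch (ours, not part of the original solution; the function names are illustrative) compares the closed form against direct trapezoidal integration of the unnormalized posterior e^{(N−1)θ} on (0, x_min):

```python
import numpy as np

def theta_mse_exp(x_min, N):
    """Closed form: x_min/(1 - exp(-(N-1)x_min)) - 1/(N-1), valid for N >= 2."""
    c = N - 1
    return x_min / (1.0 - np.exp(-c * x_min)) - 1.0 / c

def theta_mse_exp_numeric(x_min, N, num=100001):
    """Posterior mean by trapezoidal integration of exp((N-1)*theta) on (0, x_min)."""
    theta = np.linspace(0.0, x_min, num)
    w = np.exp((N - 1) * theta)
    w[0] *= 0.5
    w[-1] *= 0.5                      # trapezoid end-point weights
    # The common grid spacing cancels in the ratio of the two integrals.
    return np.sum(theta * w) / np.sum(w)

x_min, N = 0.7, 5
assert abs(theta_mse_exp(x_min, N) - theta_mse_exp_numeric(x_min, N)) < 1e-6
```

Because the posterior weight e^{(N−1)θ} increases with θ, the estimate is pulled toward x_min for large N, as expected.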
3. Kay.4

   Let X = {x[0], …, x[N−1]}, x_max = max_k x[k], and x_min = min_k x[k]. For the uniform data model p(x[k]|θ) = (1/θ) χ{0 ≤ x[k] ≤ θ},

       p(x|θ) = Π_{k=0}^{N−1} p(x[k]|θ) = (1/θ^N) χ{0 ≤ x_min ≤ x_max ≤ θ}.

   And therefore, with the uniform prior p(θ) = (1/β) χ{0 ≤ θ ≤ β},

       p(x, θ) = p(x|θ) p(θ) = (1/(β θ^N)) χ{0 ≤ x_min ≤ x_max ≤ θ ≤ β}.

   It is readily determined that

       p(θ|X) = (1−N) θ^{−N} / (β^{1−N} − x_max^{1−N}) · χ{x_max ≤ θ ≤ β},

   from which is obtained

       θ̂_MSE = E{θ|X} = (1−N)/(2−N) · (β^{2−N} − x_max^{2−N}) / (β^{1−N} − x_max^{1−N}).

   In the limit as β → ∞ (for which the prior becomes uninformative), we get, for N > 2,

       θ̂_MSE = (N−1)/(N−2) x_max.

4. Kay.

   This problem is equivalent to the corresponding Kay Example, and the answer is thus given by the equation derived there et seq. Because the posterior density is symmetric and unimodal (because it is Gaussian), the MMSE and MAP estimators are the same.

5. Kay.2

   Notice that the two Gaussian components of the bimodal mixture distribution have the same variance and are symmetrically located about the origin at the points x and −x. From symmetry considerations, then, it is obvious that for ε = 0.5 we have θ̂_MSE = 0, while θ̂_MAP is unable to distinguish between the two equally probable maxima located at the points x and −x. For the case ε = 0.75 the symmetry is broken, and it is obvious that the Gaussian component located at position x dominates, so that θ̂_MAP = x. It is easily found that the general solution for the MMSE estimator is given by the convex combination

       θ̂_MSE = ε x + (1−ε)(−x) = (2ε−1) x.

   Thus for ε = 0.75 we obtain θ̂_MSE = x/2.
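The uniform-prior posterior mean above, and its uninformative limit, can be checked the same way. The sketch below (ours, not part of the original solution; function names are illustrative) integrates the unnormalized posterior θ^{−N} over (x_max, β) and compares against the closed form:

```python
import numpy as np

def theta_mse_uniform(x_max, N, beta):
    """Closed form: (1-N)/(2-N) * (beta^(2-N) - x_max^(2-N)) / (beta^(1-N) - x_max^(1-N))."""
    num = (beta ** (2 - N) - x_max ** (2 - N)) / (2 - N)
    den = (beta ** (1 - N) - x_max ** (1 - N)) / (1 - N)
    return num / den

def theta_mse_uniform_numeric(x_max, N, beta, num_pts=100001):
    """Posterior mean by trapezoidal integration of theta^(-N) on (x_max, beta)."""
    theta = np.linspace(x_max, beta, num_pts)
    w = theta ** (-float(N))
    w[0] *= 0.5
    w[-1] *= 0.5                      # trapezoid end-point weights
    return np.sum(theta * w) / np.sum(w)

x_max, N, beta = 1.0, 5, 50.0
assert abs(theta_mse_uniform(x_max, N, beta)
           - theta_mse_uniform_numeric(x_max, N, beta)) < 1e-4
# As beta grows, the estimate approaches the uninformative limit (N-1)/(N-2) * x_max:
assert abs(theta_mse_uniform(x_max, N, 1e6) - (N - 1) / (N - 2) * x_max) < 1e-6
```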
6. Kay.

   Recall that consistency is defined to be convergence in probability, and that mean-square convergence is a sufficient condition for convergence in probability. The MAP estimator θ̂(N) is derived in the Kay Example cited in the problem. Note that in the limit as N gets large we have

       θ̂(N) → x̄(N) = (1/N) Σ_{n=0}^{N−1} x[n].

   It is readily shown that x̄(N) is an unbiased estimator of θ with a mean-square error which goes like 1/N as N → ∞. Thus x̄(N) is a consistent estimator of θ. Therefore, by the carry-over property for convergence in probability, the estimator θ̂(N) is a consistent estimator of θ.

   In general, as N → ∞, Bayesian estimators (such as the MMSE and MAP estimators) generally become equivalent to the MLE (as the data eventually overwhelms our prior belief about θ), which, in turn, is generally consistent.

7. Kay.8

   From Kay we know that the MMSE estimate is given by

       ŝ = C (C + σ² I)^{−1} x,

   where C is the autocovariance matrix of the vector s (Footnote 2). Because we have a matrix equation, we must place the above equation into the requested form using noncommutative matrix operations (Footnote 3). We have that

       (C + σ² I) ŝ = (C + σ² I) C (C + σ² I)^{−1} x
                    = (C + σ² I) [ (C + σ² I) − σ² I ] (C + σ² I)^{−1} x
                    = (C + σ² I) [ I − σ² (C + σ² I)^{−1} ] x
                    = [ C + σ² I − σ² I ] x
                    = C x.

   Footnote 2: Recall that the vector s contains samples s[n] of the zero-mean process s(t), and therefore the elements of C are samples of the autocorrelation function of s(t). Similarly, the vector x is made up of samples from the noisy measurement process x(t) = s(t) + w(t).

   Footnote 3: On an exam you would lose points if you assumed commutativity where it is not permissible.

   This equation is known as the Wiener–Hopf (W–H) equation, the solution of which yields the optimal, noncausal Wiener filter. It is straightforward to solve for the (noncausal) solution to the W–H equation using the procedure outlined by Kay. As noted by Kay, the W–H equation can be written out at the component level as shown on page 374. By allowing the lower and upper limits of the shown sums to extend to −∞ and +∞ respectively, we are considering the entire sampled time series of the process s(t). (This is equivalent to extending the N × N matrix C in the vector-matrix W–H equation given above to an infinite-dimensional matrix.) Taking the (formal) Fourier transform of the resulting infinitely-summed equation, we obtain

       Φ_xx(f) H(f) = Φ_sx(f).

   This yields the noncausal Wiener filter

       H(f) = Ŝ(f)/X(f) = Φ_sx(f)/Φ_xx(f) = Φ_ss(f) / (Φ_ss(f) + σ²).

   The last step follows because

       x[n] = s[n] + w[n],

   where s and w are independent and w is a white noise process with variance σ², so that Φ_sx(f) = Φ_ss(f) and Φ_xx(f) = Φ_ss(f) + σ².

   The inverse Fourier transform of H(f) will yield the discrete-time-domain Wiener filter. In general (without additional assumptions or constraints) this filter will be IIR, noncausal, and infinite dimensional. To obtain a causal filter, the (complicated) technique of spectral factorization is used. To obtain a finite-dimensional Wiener filter, the process s is further assumed to have a so-called rational spectrum. (The Kalman filter is a procedure to obtain a finite-dimensional, causal, optimal filter directly in the time domain, thereby side-stepping the difficult issues involved in rational spectral approximation and factorization.) To obtain an FIR Wiener filter, rather than an IIR filter, an FIR filter structure is assumed at the outset, prior to deriving the W–H equation. An excellent introduction to, and discussion of, these issues is given in Introduction to Optimal Estimation, E. W. Kamen and J. K. Su, Springer, 1999. More advanced issues pertaining to the relationship between spectral factorization, covariance factorization, Wiener filtering, and Kalman filtering are discussed in Optimal Filtering, B. D. O. Anderson and J. B. Moore, Prentice-Hall, 1979.

   The Wiener filter can also be written as

       H(f) = 1 / (1 + σ²/Φ_ss(f)).

   Thus the optimal solution is given in the frequency domain by

       Ŝ(f) = [ Φ_ss(f) / (Φ_ss(f) + σ²) ] X(f)        (1)
            = [ 1 / (1 + σ²/Φ_ss(f)) ] X(f).           (2)

   Note from (1) that the optimal Wiener filter attempts to reconstruct the signal S(f) by weighting the measured noisy signal X(f) = S(f) + W(f) proportionately to the fraction of power in X(f) that is due to S(f) at each frequency f (the remaining fraction of the power at that frequency being due to the corrupting noise W(f)). Note from (2) that at frequencies for which there is a high signal-to-noise ratio (SNR), Φ_ss(f)/σ² ≫ 1, the measured signal X(f) (being therefore mostly comprised of the desired signal S(f)) is passed almost unattenuated, while at frequencies with a low SNR, Φ_ss(f)/σ² ≪ 1, the measured signal X(f) is greatly attenuated (being comprised in this instance mainly of the noise W(f)).

8. Note that, via an application of a matrix identity proved last quarter, the first Kay equation cited in the problem can be readily re-expressed as the second. Now note that taking µ_θ = 0 and C_θ^{−1} → 0 (which corresponds to assuming a noninformative prior for θ) yields the classical Gauss–Markov solution of Kay Chapter 6. The cited Kay theorem is the Bayesian form of the Gauss–Markov theorem.

9. Part a). Because of the symmetry of the posterior density function we have that

       θ̂_mmse = θ̂_map = arg max_θ p(θ|y)
               = arg max_θ p(y|θ) p(θ)
               = arg max_θ [ ln p(y|θ) + ln p(θ) ]
               = arg min_θ ‖y − Aθ‖²_W + ‖θ − θ₀‖²_{Σ^{−1}}.

   Thus the problem is equivalent to solving a weighted least-squares problem where the least-squares loss function is given by

       l(θ) = (y − Aθ)ᵀ W (y − Aθ) + (θ − θ₀)ᵀ Σ^{−1} (θ − θ₀)
            = (η − Āθ)ᵀ Λ (η − Āθ)
            = ‖η − Āθ‖²_Λ,

   where

       η = ( y  )     Ā = ( A )     Λ = ( W   0     )
           ( θ₀ ),        ( I ),        ( 0   Σ^{−1} ).

   Part b). We can now solve this problem using the deterministic weighted least-squares theory developed last quarter. The adjoint operator of Ā is given by

       Ā* = Āᵀ Λ = [ Aᵀ W ,  Σ^{−1} ],

   and its pseudoinverse is

       Ā⁺ = ( Aᵀ W A + Σ^{−1} )^{−1} [ Aᵀ W ,  Σ^{−1} ].

   The optimal estimate is therefore

       θ̂_mmse = θ̂_map = Ā⁺ η = ( Aᵀ W A + Σ^{−1} )^{−1} ( Aᵀ W y + Σ^{−1} θ₀ ).
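The equivalence just derived between the closed-form MAP estimate and the augmented weighted least-squares problem can be verified numerically. The sketch below (ours, not part of the original solution; all variable names are illustrative) whitens the augmented system with a square root of Λ and solves it with an ordinary least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta0 = rng.standard_normal(n)                # prior mean
W = np.diag(rng.uniform(0.5, 2.0, m))          # measurement weight matrix
Sigma_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, n))   # inverse prior covariance

# Closed form: (A^T W A + Sigma^{-1})^{-1} (A^T W y + Sigma^{-1} theta0)
theta_map = np.linalg.solve(A.T @ W @ A + Sigma_inv, A.T @ W @ y + Sigma_inv @ theta0)

# Equivalent augmented weighted least squares: minimize ||eta - A_aug theta||^2_Lambda
A_aug = np.vstack([A, np.eye(n)])
eta = np.concatenate([y, theta0])
Lam = np.block([[W, np.zeros((m, n))], [np.zeros((n, m)), Sigma_inv]])

# Whiten: with Lam = L L^T, minimizing ||L^T (eta - A_aug theta)||^2 is the same problem.
L = np.linalg.cholesky(Lam)
theta_wls, *_ = np.linalg.lstsq(L.T @ A_aug, L.T @ eta, rcond=None)

assert np.allclose(theta_map, theta_wls)
```

The whitening step is exactly the "change of inner product" used in the adjoint computation: the Λ-weighted problem becomes an ordinary Euclidean least-squares problem in the transformed coordinates.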
   Note that for Σ = σ² I we have Σ^{−1} = I/σ² → 0 as σ² → ∞. Thus in the case of an uninformative prior we obtain the classical Gauss–Markov MLE solution,

       θ̂_mle = ( Aᵀ W A )^{−1} Aᵀ W y = A⁺ y.

10. Kay 2.

   Optimal Quadratic Estimator. For simplicity, denote the single sample x[0] by x. Note that we wish to approximate the highly nonlinear function of x,

       θ(x) = cos(2πx),

   by the simpler quadratic function of x,

       θ̂(x) = a x² + b x + c.

   A straightforward way to tackle this problem is to take the partial derivatives of the mean-square error

       E{ ( θ(x) − (a x² + b x + c) )² }

   with respect to the unknown parameters a, b, and c respectively, set each of these partial derivatives equal to zero, and solve the resulting three equations for the three unknown parameters.

   Alternatively, let us discuss this problem within the geometric framework that has been a constant key theme of the course. Note that θ lies in the vector space of all real scalar-valued nonlinear measurable functions of x (call this space RV), and that we wish to approximate θ within the subspace of all quadratic functions of x. This three-dimensional subspace is made up of the one-dimensional subspace of the constant random variables (which are all multiples of the number 1), the one-dimensional subspace of rv's which are multiples of x, and the one-dimensional subspace of rv's which are multiples of x². Since all the random variables of interest are square integrable, RV (and each subspace) can be interpreted as a Hilbert space with inner product

       ⟨θ₁, θ₂⟩ = E{θ₁ θ₂}.

   Therefore, we can apply our standard optimization-in-Hilbert-space arguments to obtain the optimal approximation.

   There are actually at least two (equivalent) ways that the problem can be framed within a Hilbert-space framework. In both, we will obtain an approximation of the form

       θ̂ = A y,

   for an appropriately defined data vector y.

   I. The first way that the problem can be posed was described last quarter in ECE 275A. Here we take y = (a b c)ᵀ ∈ ℝ³ and A = (x² x 1) : ℝ³ → RV. From this perspective, we are trying to find the Moore–Penrose pseudoinverse solution, ŷ⁺, to the problem θ = A y. Note that in this setup the linear operator A is known and y is unknown. Precisely this least-squares problem was discussed last quarter. Once ŷ⁺ is known, we can compute the desired approximation to θ as θ̂⁺ = A ŷ⁺. Note that the optimal estimate θ̂⁺ lies in the range of the operator A, R(A) = {quadratic functions of x}, which is a Hilbert subspace of RV.

   II. An alternative way to pose the problem, and the one looked for in the Bayesian framework, is to look for the best linear minimum mean-square estimator of θ from within the Hilbert subspace, L, of RV made up of random variables which are obtained as linear functions of the known random variable y = (x² x 1)ᵀ,

       L = { θ̂ | θ̂ = A y,  A = (a b c) } ⊂ RV.

   In this formulation of the problem, the linear operator A is unknown and y is known. Note that the Hilbert subspaces obtained here and in the previous paragraph are one and the same subspace, viz. the quadratic functions of x. Only our interpretation of the nature of this subspace has changed (and thus, necessarily, our definitions of the quantities A and y). (Footnote 4.)

   We will solve the problem using the second setup. The projection theorem demands that the optimal solution θ̂ₒ = Aₒ y satisfy (θ − θ̂ₒ) ⊥ L. As we discussed in class, because the error must be orthogonal to any random variable of the form A y regardless of the specific value of the operator A, we have the equivalent statement that

       E{ (θ − Aₒ y) yᵀ } = 0.

   Footnote 4: Note that we could alternatively have defined the rv z = (x² x)ᵀ, A = (a b), and then looked for the optimal affine estimator θ̂ = A z + c, the solution of which we know is given by θ̂ = m_θ + Σ_θz Σ_zz^{−1} (z − m_z). Using our second definition of y, we have that y = (zᵀ 1)ᵀ. This shows that by embedding z in a one-dimension-larger space we can restrict our search to the class of linear (rather than affine) estimators in the larger space. This is a standard trick.
   This yields, as expected,

       Aₒ = Σ_θy Σ_yy^{−1}.

   Let the j-th moment of x be µ_j = E{x^j}. Note that x is a zero-mean rv with an even (i.e., symmetric about the origin) pdf. Because x^j is an odd function for j odd, we have that µ₁ = µ₃ = 0. Also note that θ = θ(x) = cos(2πx) is an even function of x, so that x^j θ is an odd function of x for j odd, yielding (for j = 1) E{xθ} = 0. Finally, because x is uniformly and symmetrically distributed about the origin over the span −1/2 ≤ x ≤ 1/2, which is equal to 2π radians of the argument of cos(·), we have that E{θ} = 0. Via some straightforward integrations, it can also be ascertained that µ₂ = 1/12, µ₄ = 1/80, and E{x²θ} = −1/(2π²). These facts yield

       Σ_θy = [ −1/(2π²)   0   0 ],

   and

       Σ_yy = ( 1/80   0      1/12 )
              ( 0      1/12   0    )
              ( 1/12   0      1    ).

   Finally, we obtain

       Aₒ = Σ_θy Σ_yy^{−1} = ( −90/π²   0   15/(2π²) ) = ( âₒ   b̂ₒ   ĉₒ ).

   This yields the optimal quadratic estimator,

       θ̂_quad = −(90/π²) x² + 15/(2π²).

   Optimal Linear Estimator. Now we have y = (x 1)ᵀ and A = (b c). In this case we have that

       Σ_θy = [ E{xθ}   E{θ} ] = [ 0   0 ].

   Thus, regardless of the value of Σ_yy, we have Aₒ = Σ_θy Σ_yy^{−1} = [ 0  0 ], and the optimal linear estimator is given by

       θ̂_lin = 0.

   This makes sense, if you think about it, because the constant function of x, when integrated against θ(x), gives zero (i.e., is orthogonal to θ(x)), while a linear function of x times the even function θ(x) is an odd function of x, and therefore a linear function of x is also orthogonal to θ(x). Thus the projection of θ(x) onto any affine (linear plus constant) function of x must be zero.

   Optimal MMSE. We have that

       θ̂_MSE = E{θ|x} = E{cos(2πx)|x} = cos(2πx) = θ.

   Obviously, because θ is a deterministic function of x, knowledge of the value of x results in complete knowledge of the value of θ.

   Comparison of MSE's:

       MSE(linear) = 0.5 > MSE(quadratic) ≈ 0.038 > MSE(optimal) = 0.

11. Kay 2.8

   I) Note that α = Aθ + b implies that the means are related by m_α = A m_θ + b and the cross-covariances by Σ_αx = A Σ_θx. With these facts we have that

       α̂ = m_α + Σ_αx Σ_xx^{−1} (x − m_x)
          = (A m_θ + b) + A Σ_θx Σ_xx^{−1} (x − m_x)
          = A ( m_θ + Σ_θx Σ_xx^{−1} (x − m_x) ) + b
          = A θ̂ + b.

   II) With α = θ₁ + θ₂, it is straightforward to show that Σ_αx = Σ_θ₁x + Σ_θ₂x and m_α = m_θ₁ + m_θ₂. This yields

       α̂ = m_α + Σ_αx Σ_xx^{−1} (x − m_x)
          = (m_θ₁ + m_θ₂) + (Σ_θ₁x + Σ_θ₂x) Σ_xx^{−1} (x − m_x)
          = ( m_θ₁ + Σ_θ₁x Σ_xx^{−1} (x − m_x) ) + ( m_θ₂ + Σ_θ₂x Σ_xx^{−1} (x − m_x) )
          = θ̂₁ + θ̂₂.

12. Note that all quantities are zero mean. (Footnote 5.)

   (a) Note that the problem is completely symmetric in x and n. Therefore the solution for x yields the solution for n, mutatis mutandis. The optimal linear estimator is generally of the form a y + b, i.e., it is actually affine. However, it is easily shown that we can take b = 0, so that the optimal estimator is drawn from the family of unbiased estimators. Using the Hilbert-space framework, to find the optimal estimator x̂ = a y from within the space of linear functions of y, we invoke the projection theorem to obtain the orthogonality condition

       ⟨(x − a y), y⟩ = E{(x − a y) y} = 0.

   Footnote 5: But what if a particularly difficult examiner disallows this simplifying assumption?
   From this we obtain the scalar Wiener–Hopf equation

       E{y²} a = E{xy},

   from which the solution is determined to be

       x̂ = ( σ_x² / (σ_x² + σ_n²) ) y.

   From symmetry, we obviously have

       n̂ = ( σ_n² / (σ_x² + σ_n²) ) y.

   (b) Note that x, y, and n must all be jointly Gaussian. Again from symmetry, what holds for x must hold for n (and vice versa), mutatis mutandis. For the Gaussian case, the well-known optimal MMSE estimator (Footnote 6),

       x̂ = E{x|y} = m_x + Σ_xy Σ_yy^{−1} (y − m_y),

   is just the optimal linear estimator derived above.

   Footnote 6: Can you prove this at the board, fast, in real time?

   (c) Not all of the possible combinations make sense. Chapter 14 of Kay is a summary review of all of the estimation schemes. For the Gauss–Markov theorem, see Kay Chapter 6.

13. In class lecture and in the texts it is shown that

       x̂(y) = m_x + Σ_xy Σ_yy^{−1} (y − m_y).

   With z = A y + b, A invertible, we have

       m_z = A m_y + b ;
       z − m_z = A (y − m_y) ;
       Σ_xz = Σ_xy Aᵀ ;
       Σ_zz = A Σ_yy Aᵀ.

   Therefore (Footnote 7),

       x̂(z) = m_x + Σ_xz Σ_zz^{−1} (z − m_z)
             = m_x + Σ_xy Aᵀ A^{−T} Σ_yy^{−1} A^{−1} A (y − m_y)
             = m_x + Σ_xy Σ_yy^{−1} (y − m_y)
             = x̂(y).

   Footnote 7: Note the abuse of notation in the statement x̂(z) = x̂(y).

14. No solution provided.
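The scalar gains derived in problem 12(a) are easy to check by Monte Carlo simulation. The sketch below (ours, not part of the original solution) assumes Gaussian x and n for concreteness and estimates the Wiener–Hopf coefficient a = E{xy}/E{y²} empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_x, sigma_n = 1.5, 0.5
M = 200_000
x = rng.normal(0.0, sigma_x, M)
n = rng.normal(0.0, sigma_n, M)
y = x + n                                   # measurement model y = x + n

# Empirical Wiener-Hopf solution a = E{xy}/E{y^2} versus the closed form.
a_hat = np.mean(x * y) / np.mean(y * y)
a_theory = sigma_x**2 / (sigma_x**2 + sigma_n**2)
assert abs(a_hat - a_theory) < 0.01

# The symmetric estimates of x and n exactly decompose the measurement:
b_theory = sigma_n**2 / (sigma_x**2 + sigma_n**2)
assert np.allclose(a_theory * y + b_theory * y, y)
```

The second check reflects the symmetry remark in the solution: since a_theory + b_theory = 1, the two linear estimates x̂ and n̂ sum to the measurement y itself.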
More informationDiscrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 2
CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 2 Proofs Intuitively, the concept of proof should already be familiar We all like to assert things, and few of us
More informationEnhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm
1 Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm Hani Mehrpouyan, Student Member, IEEE, Department of Electrical and Computer Engineering Queen s University, Kingston, Ontario,
More informationMATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform
MATH 433/533, Fourier Analysis Section 11, The Discrete Fourier Transform Now, instead of considering functions defined on a continuous domain, like the interval [, 1) or the whole real line R, we wish
More informationLecture L3 - Vectors, Matrices and Coordinate Transformations
S. Widnall 16.07 Dynamics Fall 2009 Lecture notes based on J. Peraire Version 2.0 Lecture L3 - Vectors, Matrices and Coordinate Transformations By using vectors and defining appropriate operations between
More information1 Review of Least Squares Solutions to Overdetermined Systems
cs4: introduction to numerical analysis /9/0 Lecture 7: Rectangular Systems and Numerical Integration Instructor: Professor Amos Ron Scribes: Mark Cowlishaw, Nathanael Fillmore Review of Least Squares
More informationThe Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
More informationMATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.
MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. Vector space A vector space is a set V equipped with two operations, addition V V (x,y) x + y V and scalar
More informationEE 570: Location and Navigation
EE 570: Location and Navigation On-Line Bayesian Tracking Aly El-Osery 1 Stephen Bruder 2 1 Electrical Engineering Department, New Mexico Tech Socorro, New Mexico, USA 2 Electrical and Computer Engineering
More informationTTT4120 Digital Signal Processing Suggested Solution to Exam Fall 2008
Norwegian University of Science and Technology Department of Electronics and Telecommunications TTT40 Digital Signal Processing Suggested Solution to Exam Fall 008 Problem (a) The input and the input-output
More informationLS.6 Solution Matrices
LS.6 Solution Matrices In the literature, solutions to linear systems often are expressed using square matrices rather than vectors. You need to get used to the terminology. As before, we state the definitions
More informationCCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
More informationLimits and Continuity
Math 20C Multivariable Calculus Lecture Limits and Continuity Slide Review of Limit. Side limits and squeeze theorem. Continuous functions of 2,3 variables. Review: Limits Slide 2 Definition Given a function
More informationLinear and quadratic Taylor polynomials for functions of several variables.
ams/econ 11b supplementary notes ucsc Linear quadratic Taylor polynomials for functions of several variables. c 010, Yonatan Katznelson Finding the extreme (minimum or maximum) values of a function, is
More information1 Norms and Vector Spaces
008.10.07.01 1 Norms and Vector Spaces Suppose we have a complex vector space V. A norm is a function f : V R which satisfies (i) f(x) 0 for all x V (ii) f(x + y) f(x) + f(y) for all x,y V (iii) f(λx)
More informationGaussian Processes in Machine Learning
Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl
More information3.6. Partial Fractions. Introduction. Prerequisites. Learning Outcomes
Partial Fractions 3.6 Introduction It is often helpful to break down a complicated algebraic fraction into a sum of simpler fractions. For 4x + 7 example it can be shown that x 2 + 3x + 2 has the same
More informationMatrix Representations of Linear Transformations and Changes of Coordinates
Matrix Representations of Linear Transformations and Changes of Coordinates 01 Subspaces and Bases 011 Definitions A subspace V of R n is a subset of R n that contains the zero element and is closed under
More informationChristfried Webers. Canberra February June 2015
c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic
More informationAdvanced Signal Processing and Digital Noise Reduction
Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New
More informationMaster s thesis tutorial: part III
for the Autonomous Compliant Research group Tinne De Laet, Wilm Decré, Diederik Verscheure Katholieke Universiteit Leuven, Department of Mechanical Engineering, PMA Division 30 oktober 2006 Outline General
More informationCommunication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 2, FEBRUARY 2002 359 Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel Lizhong Zheng, Student
More informationParametric Statistical Modeling
Parametric Statistical Modeling ECE 275A Statistical Parameter Estimation Ken Kreutz-Delgado ECE Department, UC San Diego Ken Kreutz-Delgado (UC San Diego) ECE 275A SPE Version 1.1 Fall 2012 1 / 12 Why
More informationThe Method of Partial Fractions Math 121 Calculus II Spring 2015
Rational functions. as The Method of Partial Fractions Math 11 Calculus II Spring 015 Recall that a rational function is a quotient of two polynomials such f(x) g(x) = 3x5 + x 3 + 16x x 60. The method
More informationLecture 7: Finding Lyapunov Functions 1
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.243j (Fall 2003): DYNAMICS OF NONLINEAR SYSTEMS by A. Megretski Lecture 7: Finding Lyapunov Functions 1
More information5 Numerical Differentiation
D. Levy 5 Numerical Differentiation 5. Basic Concepts This chapter deals with numerical approximations of derivatives. The first questions that comes up to mind is: why do we need to approximate derivatives
More informationMathematics Course 111: Algebra I Part IV: Vector Spaces
Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are
More informationMath 312 Homework 1 Solutions
Math 31 Homework 1 Solutions Last modified: July 15, 01 This homework is due on Thursday, July 1th, 01 at 1:10pm Please turn it in during class, or in my mailbox in the main math office (next to 4W1) Please
More informationBy choosing to view this document, you agree to all provisions of the copyright laws protecting it.
This material is posted here with permission of the IEEE Such permission of the IEEE does not in any way imply IEEE endorsement of any of Helsinki University of Technology's products or services Internal
More informationCURVE FITTING LEAST SQUARES APPROXIMATION
CURVE FITTING LEAST SQUARES APPROXIMATION Data analysis and curve fitting: Imagine that we are studying a physical system involving two quantities: x and y Also suppose that we expect a linear relationship
More informationPYKC Jan-7-10. Lecture 1 Slide 1
Aims and Objectives E 2.5 Signals & Linear Systems Peter Cheung Department of Electrical & Electronic Engineering Imperial College London! By the end of the course, you would have understood: Basic signal
More informationEconometrics Simple Linear Regression
Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight
More informationThe Bivariate Normal Distribution
The Bivariate Normal Distribution This is Section 4.7 of the st edition (2002) of the book Introduction to Probability, by D. P. Bertsekas and J. N. Tsitsiklis. The material in this section was not included
More informationOverview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series
Sequences and Series Overview Number of instruction days: 4 6 (1 day = 53 minutes) Content to Be Learned Write arithmetic and geometric sequences both recursively and with an explicit formula, use them
More informationAnalysis of Mean-Square Error and Transient Speed of the LMS Adaptive Algorithm
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 1873 Analysis of Mean-Square Error Transient Speed of the LMS Adaptive Algorithm Onkar Dabeer, Student Member, IEEE, Elias Masry, Fellow,
More informationAutocovariance and Autocorrelation
Chapter 3 Autocovariance and Autocorrelation If the {X n } process is weakly stationary, the covariance of X n and X n+k depends only on the lag k. This leads to the following definition of the autocovariance
More information