ORTHOGONALITY CRITERION FOR OPTIMALITY
MINIMUM MEAN-SQUARED ERROR LINEAR ESTIMATION

Consider the task of linearly estimating a random variable Y from random variables X_1,...,X_N with minimum mean squared error (MSE). That is, we wish to choose coefficients a_1,...,a_N such that the estimate

$$\hat{Y} = \sum_{i=1}^{N} a_i X_i$$

has the smallest possible MSE:

$$E(Y-\hat{Y})^2 = E\Big(Y - \sum_{i=1}^{N} a_i X_i\Big)^2 .$$

Optimal linear estimators. We now show how to choose the a_i's to minimize the MSE.

Assume that Y and the X_i's have zero mean. If not, we could subtract their means to obtain zero-mean random variables, apply optimal linear estimation, and then add the mean of Y to the estimate of Y - EY. Equivalently, we could optimize an estimator of the form \sum_{i=1}^{N} a_i X_i + b, which is called affine.

We take X = (X_1,...,X_N)^t and a = (a_1,...,a_N)^t to be column vectors. In this vector notation,

$$\hat{Y} = a^t X .$$

We assume that the N x N covariance matrix K = [E X_i X_j] of X is known, as is the vector of covariances r = (E X_1 Y, ..., E X_N Y)^t.

ORTHOGONALITY CRITERION FOR OPTIMALITY

To find the optimal choice of a we will prove the following.

Orthogonality criterion for linear estimators: a is the vector of optimal linear estimator coefficients if and only if

$$E\big[(Y - a^t X)\,X_j\big] = 0, \qquad j = 1,\dots,N,$$

i.e. if and only if the estimation error is orthogonal to every observation. This equation is called the orthogonality criterion, or orthogonality principle.

Proof. "If" part: Suppose a satisfies the criterion and b is an arbitrary vector of coefficients. Then

$$
\begin{aligned}
E(Y - b^t X)^2 &= E\big((Y - a^t X) + (a^t X - b^t X)\big)^2 \\
&= E(Y - a^t X)^2 + 2\,E\big[(Y - a^t X)(a^t X - b^t X)\big] + E(a^t X - b^t X)^2 \\
&= E(Y - a^t X)^2 + 2\,E\Big[(Y - a^t X)\sum_{j=1}^{N}(a_j - b_j)X_j\Big] + E(a^t X - b^t X)^2 \\
&= E(Y - a^t X)^2 + 2\sum_{j=1}^{N}(a_j - b_j)\,E\big[(Y - a^t X)X_j\big] + E(a^t X - b^t X)^2 \\
&= E(Y - a^t X)^2 + E(a^t X - b^t X)^2 && \text{because } E[(Y - a^t X)X_j] = 0 \\
&\ge E(Y - a^t X)^2 && \text{because } E(a^t X - b^t X)^2 \ge 0 .
\end{aligned}
$$
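As a quick numerical sanity check (not part of the original notes), the following NumPy sketch estimates K and r from simulated zero-mean data, solves for a, and verifies that the resulting error is approximately orthogonal to every X_j and that perturbing the coefficients only increases the MSE. The data model, sample size, and perturbation are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean observations X (N = 3) and a target Y with a linear dependence plus noise.
n_samples, N = 100_000, 3
X = rng.standard_normal((n_samples, N)) @ np.array([[1.0, 0.5, 0.2],
                                                    [0.0, 1.0, 0.3],
                                                    [0.0, 0.0, 1.0]])
Y = X @ np.array([0.8, -0.4, 0.25]) + 0.5 * rng.standard_normal(n_samples)

# Sample estimates of K = E[X X^t] and r = E[X Y].
K = X.T @ X / n_samples
r = X.T @ Y / n_samples

a = np.linalg.solve(K, r)          # optimal coefficients a = K^{-1} r
err = Y - X @ a                    # estimation error Y - a^t X

# Orthogonality: E[(Y - a^t X) X_j] should be ~0 for every j.
print("E[error * X_j]:", X.T @ err / n_samples)

# Any perturbation b != a gives a larger (sample) MSE.
b = a + np.array([0.1, 0.0, -0.05])
print("MSE(a):", np.mean(err**2), " MSE(b):", np.mean((Y - X @ b)**2))
```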
Since this holds for every choice of b, a must be optimal. In other words, if a satisfies the orthogonality criterion, it is optimal.

"Only if" part: Suppose b is optimal and a satisfies the orthogonality criterion. Then, as shown above,

$$E(Y - b^t X)^2 = E(Y - a^t X)^2 + E(a^t X - b^t X)^2 .$$

Since b is optimal, E(Y - b^t X)^2 <= E(Y - a^t X)^2. It follows from these two relations that E(a^t X - b^t X)^2 = 0, i.e. b^t X = a^t X with probability one (and b = a whenever K is nonsingular). Hence E[(Y - b^t X)X_j] = E[(Y - a^t X)X_j] = 0 for every j. That is, if b is optimal, it satisfies the orthogonality criterion.

Corollary: Any linear combination of the X_i's is orthogonal to the error of an optimal estimator. That is, if b_1,...,b_k are arbitrary numbers and a is the vector of coefficients of an optimal estimator of Y based on X, then

$$E\Big[(Y - a^t X)\sum_{j=1}^{k} b_j X_j\Big] = 0 .$$

Proof: With b_1,...,b_k arbitrary and a optimal,

$$E\Big[(Y - a^t X)\sum_{j=1}^{k} b_j X_j\Big] = \sum_{j=1}^{k} b_j\, E\big[(Y - a^t X)X_j\big] = 0$$

by the orthogonality principle.
USING THE ORTHOGONALITY PRINCIPLE TO FIND THE OPTIMAL ESTIMATOR

From the orthogonality principle we know that if a is the vector of coefficients of the optimal linear estimator, then

$$E\big[(Y - a^t X)X_j\big] = 0, \qquad j = 1,\dots,N .$$

This implies

$$E[Y X_j] = E[(a^t X)X_j] = \sum_{i=1}^{N} a_i\, E[X_i X_j], \qquad j = 1,\dots,N,$$

or, collecting these N equations in matrix form,

$$r = \begin{pmatrix} E[Y X_1] \\ \vdots \\ E[Y X_N] \end{pmatrix}
= \begin{pmatrix} E[X_1 X_1] & \cdots & E[X_N X_1] \\ \vdots & & \vdots \\ E[X_1 X_N] & \cdots & E[X_N X_N] \end{pmatrix}
\begin{pmatrix} a_1 \\ \vdots \\ a_N \end{pmatrix} = K\,a,$$

where K is the N x N covariance matrix of X and r = E[XY] = (E[Y X_1],...,E[Y X_N])^t.

If K is invertible, i.e. nonsingular, then

$$a = K^{-1} r$$

is the vector of optimal linear prediction coefficients.

If K is not invertible, i.e. singular, then it can be shown that at least one of the X_i's is a linear combination of the others. Such X_i's can be removed from X, because any estimator that used them could also be computed from the others. With these X_i's eliminated, the new covariance matrix K' is nonsingular and we can find the optimal a by inverting K'.

The MSE of the optimal estimator:

$$\text{MSE} = E Y^2 - r^t K^{-1} r .$$

Derivation:

$$
\begin{aligned}
\text{MSE} &= E(Y - a^t X)^2 = E\big[(Y - a^t X)Y\big] + E\big[(Y - a^t X)(-a^t X)\big] \\
&= E\big[(Y - a^t X)Y\big] \qquad \text{(the second term vanishes by the corollary to the orthogonality principle)} \\
&= E Y^2 - a^t E[XY] = E Y^2 - a^t r = E Y^2 - r^t K^{-1} r,
\end{aligned}
$$

where the last step uses r = Ka for the optimal estimator, so that a^t = (K^{-1} r)^t = r^t K^{-1} since K and K^{-1} are symmetric.
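A brief NumPy sketch of these formulas (my own illustration, not from the notes): it solves the normal equations, checks the closed-form MSE against an empirical estimate, and shows one common way of handling a singular K (a redundant observation) via least squares. All model parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 200_000, 3

X = rng.standard_normal((n, N))
Y = X @ np.array([1.0, -0.5, 0.2]) + rng.standard_normal(n)

K = X.T @ X / n                      # sample covariance matrix of X (zero-mean)
r = X.T @ Y / n                      # sample cross-covariances E[X_j Y]

a = np.linalg.solve(K, r)            # a = K^{-1} r (K is nonsingular here)
mse_formula = np.mean(Y**2) - r @ a  # E Y^2 - r^t K^{-1} r
mse_empirical = np.mean((Y - X @ a) ** 2)
print(mse_formula, mse_empirical)    # should agree closely

# If K were singular (one X_i a linear combination of the others),
# a least-squares solve still gives a valid coefficient vector; the MSE is unchanged.
X_sing = np.hstack([X, X[:, :1] + X[:, 1:2]])     # redundant 4th observation
a_sing, *_ = np.linalg.lstsq(X_sing, Y, rcond=None)
print(np.mean((Y - X_sing @ a_sing) ** 2))
```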
OPTIMAL LINEAR PREDICTION

Let us now apply what we've learned to the task of designing the optimal predictor of X_{N+1} from X_1,...,X_N.

Assume that X is a zero-mean, wide-sense stationary random process with autocorrelation function R_X(k) = E[X_i X_{i+k}]. Linear estimation theory then shows that the Nth-order linear predictor of X_i based on X_{i-1},...,X_{i-N} that minimizes the mean squared prediction error has coefficients a = (a_1,...,a_N)^t given by

$$a = K^{-1} r,$$

where K is the N x N correlation/covariance matrix of X_1,...,X_N, i.e. K_{ij} = E[X_i X_j] = R_X(|i-j|), K^{-1} is its inverse, and r = (R_X(1),...,R_X(N))^t.

It can be shown that if K is not invertible, then the random process X is deterministic in the sense that there are coefficients b_1,...,b_{N-1} such that X_N = b_1 X_1 + ... + b_{N-1} X_{N-1}.

When K is invertible, the resulting minimum mean squared prediction error is

$$M_N = \sigma^2 - a^t r = \sigma^2 - r^t K^{-1} r$$

(the Nth-order prediction error, with σ^2 = R_X(0)), and the prediction gain is

$$G_N = 10 \log_{10} \frac{\sigma^2}{M_N}.$$

BEHAVIOR OF THE Nth-ORDER PREDICTION ERRORS

Clearly, M_N decreases with N to some limit, as illustrated below, and the prediction gains G_N increase with N to some limit.

[Figure: mean squared error of the best Nth-order linear predictor versus N, decreasing toward its limit.]

As mentioned in the transform coding section, it can also be shown that

$$M_N = \frac{|K^{(N+1)}|}{|K^{(N)}|},$$

where K^{(N)} and K^{(N+1)} denote, respectively, the N x N and (N+1) x (N+1) covariance matrices of X, and |K| denotes the determinant of the matrix K. This is derived in supplementary notes.
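To make the a = K^{-1} r recipe concrete, here is a small sketch (my own, not from the notes) that builds the Toeplitz matrix K from an assumed autocorrelation R(k) = ρ^{|k|}, solves for the predictor, and checks both M_N = σ^2 - r^t K^{-1} r and the determinant identity M_N = |K^{(N+1)}|/|K^{(N)}|. The choices ρ = 0.9 and σ_X^2 = 1 are illustrative, and SciPy's toeplitz helper is used for convenience.

```python
import numpy as np
from scipy.linalg import toeplitz

rho, sigma2 = 0.9, 1.0
R = lambda k: sigma2 * rho ** abs(k)            # assumed autocorrelation

def predictor(N):
    """Optimal Nth-order linear predictor coefficients and its MSE M_N."""
    K = toeplitz([R(k) for k in range(N)])       # K_ij = R(|i-j|)
    r = np.array([R(k) for k in range(1, N + 1)])  # r = (R(1),...,R(N))^t
    a = np.linalg.solve(K, r)
    M_N = sigma2 - r @ a                         # sigma^2 - r^t K^{-1} r
    return a, M_N

for N in (1, 2, 5):
    a, M_N = predictor(N)
    # Determinant identity: M_N = |K^(N+1)| / |K^(N)|
    KN  = toeplitz([R(k) for k in range(N)])
    KN1 = toeplitz([R(k) for k in range(N + 1)])
    print(N, M_N, np.linalg.det(KN1) / np.linalg.det(KN))
```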
EXAMPLE: FIRST-ORDER AUTOREGRESSIVE (AR) GAUSSIAN SOURCE

Assume X is stationary and

$$X_i = \rho\, X_{i-1} + Z_i \qquad (*)$$

where -1 < ρ < 1, the Z_i's are IID Gaussian with zero mean and variance σ_Z^2, and Z_i is independent of X_j for j < i. (ρ = .9 is a typical value.) Then

1) E X_i = 0

2) σ_X^2 = E X_i^2 = σ_Z^2 / (1 - ρ^2)   (= 5.26 σ_Z^2 if ρ = .9)

3) R_X(k) = autocorrelation function = E[X_i X_{i+k}] = σ_X^2 ρ^{|k|}, where ρ is the correlation coefficient:

$$\rho = \frac{\mathrm{cov}(X_i, X_{i+1})}{\sigma_X^2} = \frac{R_X(1)}{\sigma_X^2}$$

4) The best predictor of X_i based on X_{i-1}, X_{i-2},..., X_{i-N} is

$$\tilde{X}_i = \rho\, X_{i-1}, \qquad \text{with } M_N = \sigma_Z^2 .$$

Derivation: It is not easy to evaluate a = K^{-1} r directly. Instead, we may use the orthogonality principle to check that X̃_i is optimal. Recall that the orthogonality principle says X̃_i is optimal if and only if E[(X_i - X̃_i) X_{i-j}] = 0 for j = 1,...,N. We find

$$E\big[(X_i - \rho X_{i-1}) X_{i-j}\big] = E\big[(\rho X_{i-1} + Z_i - \rho X_{i-1}) X_{i-j}\big] = E[Z_i X_{i-j}] = E[Z_i]\, E[X_{i-j}] = 0,$$

since Z_i is independent of past X's. It follows from the orthogonality principle that X̃_i = ρ X_{i-1} is optimal. The mean squared prediction error is

$$M_N = E(X_i - \rho X_{i-1})^2 = E(\rho X_{i-1} + Z_i - \rho X_{i-1})^2 = E Z_i^2 = \sigma_Z^2 .$$

Therefore, for N >= 1, the prediction gain (which is the gain of DPCM over SQ) is

$$G_N = 10 \log_{10} \frac{\sigma_X^2}{M_N} = 10 \log_{10} \frac{1}{1-\rho^2} \qquad (= 7.2\ \text{dB if } \rho = .9).$$
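The following short simulation (my own check, with an arbitrary seed and sample size) generates an AR(1) source with ρ = 0.9 and confirms empirically that σ_X^2 ≈ 5.26 σ_Z^2, that the predictor ρX_{i-1} leaves an error of variance ≈ σ_Z^2, and that the prediction gain is ≈ 7.2 dB.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, sigma_z, n = 0.9, 1.0, 500_000

# Simulate the AR(1) source X_i = rho*X_{i-1} + Z_i.
Z = sigma_z * rng.standard_normal(n)
X = np.empty(n)
X[0] = Z[0] / np.sqrt(1 - rho**2)     # start in the stationary distribution
for i in range(1, n):
    X[i] = rho * X[i - 1] + Z[i]

var_x = X.var()
pred_err = X[1:] - rho * X[:-1]       # error of the predictor X~_i = rho*X_{i-1}
M = np.mean(pred_err**2)              # should be close to sigma_z^2

gain_db = 10 * np.log10(var_x / M)
print(var_x, M, gain_db)              # ~5.26, ~1.0, ~7.2 dB
```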
[Figure: prediction gain G_N (dB) versus correlation coefficient ρ.]

TYPICAL PREDICTION GAINS FOR SPEECH

For each of several speakers, R_X(k) was empirically estimated and the prediction gains G_N were computed for various N's. The range of G_N values found is shown in the figure below (Jayant & Noll, p. 271). The speech was lowpass filtered, probably to 3 kHz.

[Figure: range of measured prediction gains G_N for speech versus predictor order N.]
TYPICAL PREDICTION GAINS FOR IMAGES

Prediction gains in interframe image coding, from Jayant and Noll, Fig. 6.8:

B — prediction from the pixel immediately to the left: X̃ = .965 X_B

C — prediction from the corresponding pixel in the previous frame: X̃ = .977 X_C

BC — prediction from both pixels: X̃ = .379 X_B + … X_C

BCD — prediction from the two aforementioned pixels plus the pixel to the left of the corresponding pixel in the previous frame: X̃ = .746 X_B + … X_C + … X_D

For larger values of N, the predictor is based on N pixels from the present and the previous frame.

ASYMPTOTIC LIMIT OF M_N

For a wide-sense stationary random process, it can be shown (see the Transform Coding lectures) that as N → ∞, M_N decreases to

$$Q = \lim_{N\to\infty} M_N = \text{"one-step prediction error"} = \exp\left\{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S_X(\omega)\, d\omega \right\},$$

where S_X(ω) is the power spectral density of X, i.e.

$$S_X(\omega) = \text{discrete-time Fourier transform of } R_X(k) = \sum_{k=-\infty}^{\infty} R_X(k)\, e^{-jk\omega} .$$

For an Nth-order AR process, Q = M_N = M_{N+1} = ....
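As an illustration (not from the notes), the spectral formula for Q can be evaluated numerically for the first-order AR example of the previous pages; since that source is a first-order AR process, the result should equal σ_Z^2 = M_1. The parameter values and grid size below are arbitrary.

```python
import numpy as np

# Evaluate Q = exp( (1/2pi) * integral of ln S_X(w) dw ) for the AR(1) example.
rho, sigma_z2 = 0.9, 1.0
w = np.linspace(-np.pi, np.pi, 200_001)

# Power spectral density of AR(1): S_X(w) = sigma_Z^2 / |1 - rho e^{-jw}|^2
S = sigma_z2 / np.abs(1 - rho * np.exp(-1j * w)) ** 2

# Average of ln S over [-pi, pi] approximates (1/2pi) * integral of ln S dw.
Q = np.exp(np.log(S).mean())
print(Q)   # ~1.0 = sigma_Z^2, as expected for a first-order AR process
```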
ASYMPTOTIC OPTA OF DPCM

In view of the fact that the asymptotic limit of M_N is Q, we have the following. When R is large and the pdf of Ũ is similar to a scaled version of that of X, e.g. when X is Gaussian, the asymptotic OPTA function of DPCM, optimized over N, is

$$\delta_{dpcm}(R) \cong \frac{1}{12}\, Q\, \alpha_X\, 2^{-2R} \cong \frac{Q}{\sigma^2}\, \delta_{sq}(R),$$

or equivalently, in dB,

$$S_{dpcm}(R) \cong S_{sq}(R) + G,$$

where G = 10 log_10(σ^2/Q) dB is the prediction gain. That is, the gain of DPCM over SQ approximately equals the prediction gain.

THE ORIGINS OF THE "DPCM" NAME

The letters "PCM" come from "Pulse Code Modulation", a modulation technique for transmitting analog sources such as speech. In PCM, an analog source, such as speech, is sampled; the samples are quantized; fixed-length binary codewords are produced; and each bit of each binary codeword determines which of two specific pulses is sent (e.g. one pulse might be the negative of the other, or the pulses might be sinusoids with different frequencies). Thus PCM is a modulation technique for transmitting analog sources that competes with AM, FM and various other forms of pulse modulation. Invented in the 1940's, PCM is now viewed as a cascade of three systems: a sampler, a quantizer and a digital modulator. However, the quantizer by itself is often referred to as PCM.

DPCM was originally proposed as an improved PCM-type modulator consisting of a sampler, a DPCM quantizer as we know it, and a digital modulator. Nowadays DPCM usually refers just to the quantization part of the system.

Predictive coding: DPCM is also considered to be a kind of predictive coding.
DELTA MODULATION

Delta modulation is the special case of DPCM in which the quantizer has just two levels, +Δ and -Δ. In its most basic form, the predictor produces, simply, X̃_i = X̂_{i-1}. Fancier predictors are also used. Delta modulation is used when the source is highly correlated, for example speech coding with a high sampling rate.

SPEECH CODING

Adaptive versions of DPCM and delta modulation have been standardized for use in the telephone industry for digitizing telephone speech. The speech is bandlimited to 3 kHz before sampling at 8 kHz. However, this kind of speech coding is no longer considered state-of-the-art.

EXAMPLES OF DPCM IMAGE CODING

Example: Prediction of the present pixel X_{i,j} from its neighbors Y_{i,j-1} (left), Y_{i-1,j} (above) and Y_{i-1,j-1} (above-left):

$$\tilde{X}_{i,j} = a_1 Y_{i,j-1} + a_2 Y_{i-1,j} + a_3 Y_{i-1,j-1}$$

A sketch of this neighborhood-based predictor appears after this slide.

[Figure: rate-distortion curves for DPCM image coding, from R. Jain, Fundamentals of Digital Image Processing, p. 493.]

All but the second curve from the top represent performance predicted with the predictor matched to the stated source. For the second curve from the top, the values of a_1, a_2 and a_3 were designed from the actual measured correlations for a test set of images.

DPCM has been extensively studied for image coding, but it is not often used today; transform coding is more common.
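A minimal sketch of the three-neighbor image predictor above (not from the notes). The coefficient values are placeholders, and for simplicity the prediction is "open loop" (from original pixels); a real DPCM coder predicts from decoded pixels Y.

```python
import numpy as np

def dpcm_2d_predict(img, a1=0.5, a2=0.25, a3=0.25):
    """Three-neighbor DPCM prediction residual for a 2-D image.

    Predicts each pixel from its left, upper, and upper-left neighbors:
        pred[i,j] = a1*img[i,j-1] + a2*img[i-1,j] + a3*img[i-1,j-1]
    Coefficients here are illustrative, not values from the notes.
    """
    img = img.astype(np.float64)
    pred = np.zeros_like(img)
    pred[1:, 1:] = a1 * img[1:, :-1] + a2 * img[:-1, 1:] + a3 * img[:-1, :-1]
    residual = img - pred
    return pred, residual

# Toy usage on a smooth synthetic image: the residual variance is much smaller
# than the image variance, which is where the prediction gain comes from.
x, y = np.meshgrid(np.arange(64), np.arange(64))
img = 128 + 50 * np.sin(x / 10.0) * np.cos(y / 13.0)
_, res = dpcm_2d_predict(img)
print(img.var(), res[1:, 1:].var())
```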
DPCM IN VIDEO CODING

Though not usually called DPCM, the most commonly used video coding methods use a form of DPCM, e.g. MPEG, H.26x, HDTV, satellite "Direct TV", DVD. Prediction is done on a frame-by-frame basis, with the prediction of a frame being the decoded reconstruction of the previous frame, or a "motion-compensated" version thereof. In the latter case, the coder has a forward-adaptive component in addition to the backward adaptation.

Example (Netravali & Haskell, Digital Pictures, p. 331): interframe predictors specified by coefficients for the left pixel of the current frame, the left pixel of the previous frame, and the same pixel of the previous frame, with the resulting MSPE and prediction gain.

  left pixel,    left pixel,    same pixel,
  curr. frame    prev. frame    prev. frame    MSPE(1)    Pred. gain (dB)
  1              -1/2           1/…            …          … dB
  3/4            -1/2           3/…            …          … dB
  7/8            -5/8           3/…            …          … dB

  (1) Based on an educated guess for σ^2.

COMPARISON OF DPCM AND TRANSFORM CODING

Consider a stationary, Gaussian source and large R. For k-dimensional transform coding,

$$\delta_{tr}(k,R) \cong \frac{1}{12}\, |K^{(k)}|^{1/k}\, \alpha_G\, 2^{-2R} .$$

For DPCM with kth-order linear prediction,

$$\delta_{dpcm}(k,R) \cong \frac{1}{12}\, M_k\, \alpha_G\, 2^{-2R},$$

where M_k is the MSPE of optimal kth-order linear prediction of X_i from X_{i-k},...,X_{i-1}.

Fact A: |K^{(k)}|^{1/k} >= M_k.

Proof:

$$|K^{(k)}|^{1/k} = \Big( \sigma_X^2 \prod_{i=1}^{k-1} M_i \Big)^{1/k} \ge M_k,$$

where the first equality was proved in the transform coding notes, and the inequality holds because every factor in the geometric mean (σ_X^2 and M_1,...,M_{k-1}) is >= M_k.

It follows from this fact that DPCM with kth-order prediction is at least as good as k-dimensional transform coding.
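Fact A can be checked numerically; the sketch below (my own, not from the notes) does so for the AR(1) autocorrelation R(k) = ρ^{|k|} with σ_X^2 = 1 and ρ = 0.9, which are illustrative choices.

```python
import numpy as np
from scipy.linalg import toeplitz

rho = 0.9
R = lambda k: rho ** abs(k)          # assumed autocorrelation, sigma_X^2 = 1

def M(order):
    """MSE of the optimal order-th order one-step linear predictor."""
    K = toeplitz([R(i) for i in range(order)])
    r = np.array([R(i) for i in range(1, order + 1)])
    return 1.0 - r @ np.linalg.solve(K, r)

for k in (1, 2, 4, 8):
    Kk = toeplitz([R(i) for i in range(k)])
    geo = np.linalg.det(Kk) ** (1.0 / k)      # |K^(k)|^{1/k}
    print(k, geo, M(k), geo >= M(k) - 1e-12)  # Fact A: geo >= M_k
```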
Fact B:

$$\lim_{k\to\infty} M_k = \lim_{k\to\infty} |K^{(k)}|^{1/k} = \exp\left\{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln S(\omega)\, d\omega \right\} = Q = \text{"one-step prediction error"}.$$

Proof: This was proved in the transform coding notes.

It follows from this fact that for Gaussian sources, large k and large R, DPCM and transform coding have the same performance, i.e.

$$\delta_{dpcm}(\infty, R) = \delta_{tr}(\infty, R).$$

However, consider the example of a first-order AR Gaussian source. Then, since M_1 = lim_{k→∞} M_k = Q,

$$\delta_{dpcm}(1,R) \cong \delta_{tr}(\infty,R).$$

Thus, in this case a simple form of DPCM attains performance as good as high-dimensional transform coding.

In the Gaussian case, the SNR gain in dB of DPCM with first-order linear prediction over k-dimensional transform coding is

$$10 \log_{10} \frac{\delta_{tr}(k,R)}{\delta_{dpcm}(1,R)} = 10 \log_{10} \frac{(1-\rho^2)^{(k-1)/k}}{1-\rho^2} = -\frac{10}{k} \log_{10}(1-\rho^2),$$

which is plotted below.

[Figure: gain (dB) of DPCM over transform coding versus k = transform code dimension / order of the DPCM linear predictor.]

Similarly, for an Nth-order AR Gaussian source,

$$\delta_{dpcm}(N,R) \cong \delta_{tr}(\infty,R).$$
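The plotted quantity is easy to reproduce; the snippet below (my own, using ρ = 0.9 as in the running example) evaluates -(10/k) log10(1 - ρ^2) for a few transform dimensions k.

```python
import numpy as np

# Gain (dB) of first-order DPCM over k-dimensional transform coding
# for a first-order AR Gaussian source with rho = 0.9.
rho = 0.9
for k in (1, 2, 4, 8, 16):
    gain_db = -(10.0 / k) * np.log10(1 - rho**2)
    print(k, round(gain_db, 2))
# k=1 gives ~7.21 dB (DPCM vs scalar quantization); the gain shrinks toward 0
# as the transform dimension k grows.
```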
Given that DPCM is more efficient for the AR Gaussian source, why is transform coding more commonly used nowadays than DPCM, for example for image coding? My feeling is that it is because transform coding can more easily be made to take perceptual criteria into account. For example, with transform image coding it is easy to exploit the fact that the eye is more sensitive to errors at low spatial frequencies than at high spatial frequencies; JPEG does this by using coarser scalar quantizers for the high-frequency coefficients than for the low-frequency coefficients.