Lecture 2  1

During this lecture you will learn about
• The Least Mean Squares algorithm (LMS)
• Convergence analysis of the LMS
• Equalizers (Kanalutjämnare)

Background  2

The method of Steepest Descent (SD), studied in the last lecture, is a recursive algorithm for calculating the Wiener filter when the statistics of the signals are known (knowledge of R and p). The problem is that this information is often unknown!

The LMS is a method based on the same principles as the method of Steepest Descent, but where the statistics are estimated continuously. Since the statistics are estimated continuously, the LMS algorithm can adapt to changes in the signal statistics; the LMS algorithm is thus an adaptive filter.

Because of the estimated statistics, the gradient becomes noisy. The LMS algorithm therefore belongs to a group of methods referred to as stochastic gradient methods, while the method of Steepest Descent belongs to the group of deterministic gradient methods.

The Least Mean Square (LMS) algorithm  3–4

We want to create an algorithm that minimizes E{|e(n)|²}, just like the SD, but based on unknown statistics. One strategy is then to use estimates of the autocorrelation matrix R and the cross-correlation vector p. If the instantaneous estimates

    R̂(n) = u(n)u^H(n)
    p̂(n) = u(n)d*(n)

are chosen, the resulting method is the Least Mean Squares algorithm.

For the SD, the update of the filter weights is given by

    w(n+1) = w(n) + ½µ[−∇J(n)]

where ∇J(n) = −2p + 2Rw(n). In the LMS we use the estimates R̂(n) and p̂(n) to calculate ∇̂J(n). Thus, the updated filter vector also becomes an estimate; it is therefore denoted ŵ(n):

    ∇̂J(n) = −2p̂(n) + 2R̂(n)ŵ(n)
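The simplification that makes the LMS cheap can be checked numerically: plugging the instantaneous estimates R̂(n) and p̂(n) into the gradient expression collapses it to −2u(n)e*(n). A minimal sketch (assuming NumPy; all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4
u = rng.normal(size=M) + 1j * rng.normal(size=M)   # input vector u(n)
d = rng.normal() + 1j * rng.normal()               # desired sample d(n)
w = rng.normal(size=M) + 1j * rng.normal(size=M)   # current weight vector

# Instantaneous estimates of R and p
R_hat = np.outer(u, u.conj())                      # R^(n) = u(n) u^H(n)
p_hat = u * np.conjugate(d)                        # p^(n) = u(n) d*(n)

# Gradient estimate built from the instantaneous statistics
grad_hat = -2 * p_hat + 2 * R_hat @ w

# Equivalent closed form: -2 u(n) e*(n), with e(n) = d(n) - w^H u(n)
e = d - np.vdot(w, u)                              # np.vdot conjugates w: w^H u
grad_direct = -2 * u * np.conjugate(e)

assert np.allclose(grad_hat, grad_direct)
```

The identity holds exactly for every sample, not only in expectation, which is why the LMS update needs no averaging.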
The Least Mean Square (LMS) algorithm  5–6

For the LMS algorithm, the update equation of the filter weights becomes

    ŵ(n+1) = ŵ(n) + µu(n)e*(n)

where e(n) = d(n) − u^H(n)ŵ(n).

Compare with the corresponding equation for the SD:

    w(n+1) = w(n) + µE{u(n)e*(n)}

where e(n) = d(n) − u^H(n)w(n).

The difference is thus that E{u(n)e*(n)} is estimated by u(n)e*(n). This leads to gradient noise.

[Figure: Illustration of gradient noise. R and p from the example in Lecture 1, µ = 0.1. Contours of the cost function in the (w0, w1)-plane with the SD and LMS weight trajectories.]

Because of the noise in the gradient, the LMS never reaches Jmin, which the SD does.

Convergence analysis of the LMS, overview  7

Consider the update equation ŵ(n+1) = ŵ(n) + µu(n)e*(n). We want to know how fast, and towards what, the algorithm converges, in terms of J(n) and ŵ(n).

Strategy:
1. Introduce the filter weight error vector ǫ(n) = ŵ(n) − w_o.
2. Express the update equation in ǫ(n).
3. Express J(n) in ǫ(n). This leads to K(n) = E{ǫ(n)ǫ^H(n)}.
4. Calculate the difference equation in K(n), which controls the convergence.
5. Make a change of variables from K(n) to X(n), which converges equally fast.

1. The filter weight error vector  8

The LMS contains feedback, and just as for the SD there is a risk that the algorithm diverges if the step size µ is not chosen appropriately. In order to investigate the stability, the filter weight error vector ǫ(n) is introduced:

    ǫ(n) = ŵ(n) − w_o        (w_o = R⁻¹p, Wiener–Hopf)

Note that ǫ(n) corresponds to c(n) = w(n) − w_o for the Steepest Descent, but that ǫ(n) is stochastic because of ŵ(n).
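The contrast between the deterministic c(n) of the SD and the stochastic ǫ(n) of the LMS can be seen in a small simulation: on the same toy problem the SD drives its weight error to zero, while the LMS weights keep fluctuating around w_o because of gradient noise. A minimal sketch (assuming NumPy, a real-valued white input so R = I and conjugates drop, and an illustrative w_o):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, mu = 2, 2000, 0.05
w_o = np.array([1.0, -0.5])            # illustrative optimum (Wiener solution)

# White unit-variance input => R = I and p = R w_o = w_o
R = np.eye(M)
p = w_o.copy()

# Steepest descent: deterministic gradient, converges to w_o
w_sd = np.zeros(M)
for _ in range(N):
    w_sd = w_sd + mu * (p - R @ w_sd)

# LMS: noisy instantaneous gradient, fluctuates around w_o
x = rng.normal(size=N + M)
w_lms = np.zeros(M)
for n in range(N):
    u = x[n:n + M][::-1]               # tap-input vector u(n)
    d = w_o @ u + 0.05 * rng.normal()  # desired signal with measurement noise
    e = d - w_lms @ u
    w_lms = w_lms + mu * u * e

print(np.linalg.norm(w_sd - w_o))      # essentially zero
print(np.linalg.norm(w_lms - w_o))     # small but nonzero: gradient noise
```

With this step size the SD error decays geometrically to numerical zero, while the LMS error floor is set by µ and the noise level, in line with the gradient-noise discussion above.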
2. Update equation in ǫ(n)  9

The update equation for ǫ(n) at time n+1 can be expressed recursively as

    ǫ(n+1) = [I − µR̂(n)]ǫ(n) + µ[p̂(n) − R̂(n)w_o]
           = [I − µu(n)u^H(n)]ǫ(n) + µu(n)e_o*(n)

where e_o(n) = d(n) − w_o^H u(n).

Compare to that of the SD:

    c(n+1) = (I − µR)c(n)

3. Express J(n) in ǫ(n)  10

The convergence analysis of the LMS is considerably more complicated than that of the SD. Therefore, two tools (approximations) are needed:
• Independence theory
• The direct-averaging method

3. Express J(n) in ǫ(n), cont.  11–12

Independence theory
1. The input vectors at different time instants, u(1), u(2), ..., u(n), are independent of each other (and thereby uncorrelated).
2. The input signal vector at time n is independent of the desired signal at all earlier time instants: u(n) is independent of d(1), d(2), ..., d(n−1).
3. The desired signal at time n, d(n), is independent of all earlier desired signals, d(1), d(2), ..., d(n−1), but dependent on the input signal vector at time n, u(n).
4. The signals d(n) and u(n) at the same time instant n are mutually normally distributed.

Direct averaging
Direct averaging means that in the update equation for ǫ,

    ǫ(n+1) = [I − µu(n)u^H(n)]ǫ(n) + µu(n)e_o*(n),

the instantaneous estimate R̂(n) = u(n)u^H(n) is substituted with the expectation over the ensemble, R = E{u(n)u^H(n)}, such that

    ǫ(n+1) = (I − µR)ǫ(n) + µu(n)e_o*(n)

results.
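The substitution behind direct averaging, replacing u(n)u^H(n) by R = E{u(n)u^H(n)}, can be made plausible numerically: the time average of the instantaneous estimates approaches the ensemble value. A minimal sketch (assuming NumPy and a real-valued white unit-variance input, for which R = I):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 50000
x = rng.normal(size=N + M)

# Average the instantaneous estimates R^(n) = u(n) u^H(n) over time
R_avg = np.zeros((M, M))
for n in range(N):
    u = x[n:n + M][::-1]           # tap-input vector u(n)
    R_avg += np.outer(u, u)
R_avg /= N

# For white unit-variance input the ensemble value is R = I
print(np.round(R_avg, 2))
```

Each individual R̂(n) is a rank-one matrix and a very poor estimate of R; it is only their average that is close to R, which is why the direct-averaged recursion is a small-µ approximation.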
3. Express J(n) in ǫ(n), cont.  13–14

For the SD we could show that

    J(n) = Jmin + [w(n) − w_o]^H R [w(n) − w_o]

and that the eigenvalues of R control the convergence speed. For the LMS, with ǫ(n) = ŵ(n) − w_o,

    J(n) = E{|e(n)|²} = E{|d(n) − ŵ^H(n)u(n)|²}
         = E{|e_o(n) − ǫ^H(n)u(n)|²}
         = E{|e_o(n)|²} + E{ǫ^H(n) u(n)u^H(n) ǫ(n)}      (i)
         = Jmin + E{ǫ^H(n) R̂(n) ǫ(n)}

Here, (i) indicates that the independence assumption is used. Since ǫ(n) and R̂(n) are estimates, and in addition not independent, the expectation operator cannot simply be moved into R̂(n). The behavior of

    J_ex(n) = J(n) − Jmin = E{ǫ^H(n)R̂(n)ǫ(n)}

determines the convergence. Since the following holds for the trace of a matrix,
1. tr(scalar) = scalar
2. tr(AB) = tr(BA)
3. E{tr(A)} = tr(E{A})

J_ex(n) can be written as

    J_ex(n) = E{tr[R̂(n)ǫ(n)ǫ^H(n)]}          (1, 2)
            = tr[E{R̂(n)ǫ(n)ǫ^H(n)}]          (3)
            = tr[E{R̂(n)}E{ǫ(n)ǫ^H(n)}]       (i)
            = tr[RK(n)]

The convergence thus depends on K(n) = E{ǫ(n)ǫ^H(n)}.

4. Difference equation for K(n)  15

The update equation for the filter weight error is, as shown previously,

    ǫ(n+1) = [I − µu(n)u^H(n)]ǫ(n) + µu(n)e_o*(n)

For small µ, the solution to this stochastic difference equation is close to the solution of

    ǫ(n+1) = (I − µR)ǫ(n) + µu(n)e_o*(n),

which results from direct averaging. This equation is still difficult to solve, but the behavior of K(n) can now be described by the following difference equation:

    K(n+1) = (I − µR)K(n)(I − µR) + µ²Jmin R

5. Change of variables from K(n) to X(n)  16

The change of variables

    Q^H R Q = Λ
    Q^H K(n) Q = X(n)

where Λ is the diagonal eigenvalue matrix of R, and where X(n) normally is not diagonal, yields

    J_ex(n) = tr[RK(n)] = tr[ΛX(n)]

The convergence can be described by

    X(n+1) = (I − µΛ)X(n)(I − µΛ) + µ²Jmin Λ
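The change of variables preserves the excess mean-square error because the trace is invariant under the rotation by Q. A minimal sketch checking tr[RK(n)] = tr[ΛX(n)] on random matrices (assuming NumPy; the matrices only stand in for R and K(n)):

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4

# Random symmetric positive-definite stand-ins for R and K(n)
A = rng.normal(size=(M, M)); R = A @ A.T + M * np.eye(M)
B = rng.normal(size=(M, M)); K = B @ B.T

# Eigendecomposition R = Q Lam Q^H (eigh handles the symmetric case)
lam, Q = np.linalg.eigh(R)
Lam = np.diag(lam)
X = Q.T @ K @ Q                     # change of variables X(n) = Q^H K(n) Q

# J_ex(n) = tr[R K(n)] = tr[Lam X(n)]; only the diagonal of X contributes
assert np.isclose(np.trace(R @ K), np.trace(Lam @ X))
assert np.isclose(np.trace(Lam @ X), np.sum(lam * np.diag(X)))
```

The second assertion is the observation used on the next slide: since Λ is diagonal, tr[ΛX(n)] only involves the diagonal elements x_i(n) of X(n).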
5. Change of variables from K(n) to X(n), cont.  17

J_ex(n) depends only on the diagonal elements of X(n) (because of the trace tr[ΛX(n)]). We can therefore write

    J_ex(n) = Σ_{i=1}^{M} λ_i x_i(n)

The update of the diagonal elements is controlled by

    x_i(n+1) = (1 − µλ_i)² x_i(n) + µ²Jmin λ_i

This is an inhomogeneous difference equation. This means that the solution will contain a transient part, and a stationary part that depends on Jmin and µ. The cost function can therefore be written

    J(n) = Jmin + J_ex(∞) + J_trans(n)

where Jmin + J_ex(∞) and J_trans(n) are the stationary and transient solutions, respectively. At best, the LMS can thus reach Jmin + J_ex(∞).

Properties of the LMS  18

Convergence in the same interval as for the SD:

    0 < µ < 2/λ_max

    J_ex(∞) = Jmin Σ_{i=1}^{M} µλ_i / (2 − µλ_i)

The misadjustment M = J_ex(∞)/Jmin is a measure of how close to the optimal solution the LMS comes (in the mean-square sense).

Rules of thumb for the LMS  19

The eigenvalues of R are rarely known. Therefore, a set of rules of thumb is commonly used, based on the tap-input power

    Σ_{k=0}^{M−1} E{|u(n−k)|²} = M r(0) = tr(R)

The step size:

    0 < µ < 2 / Σ_{k=0}^{M−1} E{|u(n−k)|²}

Misadjustment:

    M ≈ (µ/2) Σ_{k=0}^{M−1} E{|u(n−k)|²}

Average time constant:

    τ_mse,av ≈ 1/(2µλ_av),  where λ_av = (1/M) Σ_{i=1}^{M} λ_i

When the λ_i are unknown, the following approximate relationship between the misadjustment M and τ_mse,av is useful:

    M ≈ M/(4τ_mse,av)

Here, τ_mse,av denotes the time it takes for J(n) to decay by a factor of e⁻¹.

Summary LMS  20

Summary of the iterations:
1. Set start values for the filter coefficients, ŵ(0).
2. Calculate the error e(n) = d(n) − ŵ^H(n)u(n).
3. Update the filter coefficients ŵ(n+1) = ŵ(n) + µu(n)[d*(n) − u^H(n)ŵ(n)].
4. Repeat steps 2 and 3.
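The four summary steps, together with the step-size rule of thumb based on the tap-input power, can be sketched as a complete loop. A minimal sketch (assuming NumPy, a real-valued white input, and an illustrative unknown system w_true to identify):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 4, 5000
w_true = rng.normal(size=M)          # illustrative system to identify

x = rng.normal(size=N + M)
# Rule of thumb: 0 < mu < 2 / (tap-input power), power = M r(0) = tr(R)
P = M * 1.0                          # white unit-variance input: tr(R) = M
mu = 0.1 * 2 / P                     # well inside the stability interval

w = np.zeros(M)                      # step 1: start values w^(0)
for n in range(N):
    u = x[n:n + M][::-1]             # tap-input vector u(n)
    d = w_true @ u + 0.01 * rng.normal()
    e = d - w @ u                    # step 2: error e(n) = d(n) - w^H u(n)
    w = w + mu * u * e               # step 3: weight update
                                     # step 4: repeat steps 2 and 3

print(np.round(np.abs(w - w_true).max(), 3))
```

With real-valued signals the conjugates in the update drop out; the residual weight error that remains after convergence is the misadjustment floor discussed above.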
Example: Channel equalizer in the training phase  21

[Block diagram: a pn-sequence p(n) of ±1 symbols is sent through the channel h(n) = {0, 0.2798, 1.000, 0.2798, 0}; white Gaussian noise v(n) with σ_v² = 0.001 is added, giving u(n). An adaptive FIR filter with M = 11 taps, updated by the LMS, produces d̂(n). The desired signal d(n) is the pn-sequence delayed by z⁻⁷ (delay, Sw. fördröjning), and e(n) = d(n) − d̂(n).]

Sent information in the form of a pn-sequence is affected by a channel (h(n)). An adaptive inverse filter is used to compensate for the channel, such that the total effect can be approximated by a delay. White Gaussian noise (WGN) is added during the transmission.

Example, cont.: Learning curves  22

[Figure: Learning curves, average MSE J(n) versus iteration n (0–1500), based on 200 realizations, for µ = 0.0075, µ = 0.025 and µ = 0.075.]

The curves illustrate learning curves for different values of µ. The decay is related to τ_mse,av. A large µ gives fast convergence, but also a large J_ex(∞).

What to read?  23

Least Mean Squares: Haykin chap. 5.1–5.3, 5.4 (see lecture), 5.5–5.7
Exercises: 4.1, 4.2, 1.2, 4.3, 4.5, (4.4)
Computer exercise, theme: Implement the LMS algorithm in Matlab, and generate the results in Haykin chap. 5.7
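The training-phase equalizer in the example above can be sketched in a few lines. This is a minimal sketch, not the exact experiment from Haykin: it assumes NumPy, a random ±1 training sequence standing in for the pn-sequence, the channel and noise level from the slide, M = 11 taps, a total delay of 7 samples, and µ = 0.025 from the learning-curve figure:

```python
import numpy as np

rng = np.random.default_rng(5)
h = np.array([0.0, 0.2798, 1.0, 0.2798, 0.0])    # channel from the example
M, delay, mu, N = 11, 7, 0.025, 1500
sigma_v = np.sqrt(0.001)                          # WGN with variance 0.001

s = rng.choice([-1.0, 1.0], size=N)               # +/-1 training sequence
u = np.convolve(s, h)[:N] + sigma_v * rng.normal(size=N)

w = np.zeros(M)                                   # equalizer coefficients
J = np.zeros(N)                                   # squared-error curve
for n in range(M - 1, N):
    uv = u[n - M + 1:n + 1][::-1]                 # tap-input vector u(n)
    d = s[n - delay]                              # delayed training symbol
    e = d - w @ uv
    w = w + mu * uv * e                           # LMS update
    J[n] = e**2

print(np.round(np.mean(J[-200:]), 4))             # steady-state squared error
```

A single realization is noisy; averaging J over many independent runs, as in the figure (200 realizations), gives the smooth learning curves shown on the slide.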