IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 1873

Analysis of Mean-Square Error and Transient Speed of the LMS Adaptive Algorithm

Onkar Dabeer, Student Member, IEEE, and Elias Masry, Fellow, IEEE

Abstract—For the least mean square (LMS) algorithm, we analyze the correlation matrix of the filter coefficient estimation error and the signal estimation error in the transient phase as well as in steady state. We establish the convergence of the second-order statistics as the number of iterations increases, and we derive the exact asymptotic expressions for the mean square errors. In particular, the result for the excess signal estimation error gives conditions under which the LMS algorithm outperforms the Wiener filter with the same number of taps. We also analyze a new measure of transient speed. We do not assume a linear regression model: the desired signal and the data process are allowed to be nonlinearly related. The data is assumed to be an instantaneous transformation of a stationary Markov process satisfying certain ergodic conditions.

Index Terms—Asymptotic error, least mean square (LMS) adaptive algorithm, products of random matrices, transient analysis.

I. INTRODUCTION

CONSIDER jointly stationary random processes, taking values in and , respectively. The vector which minimizes is a solution of the Wiener–Hopf equation, and if is invertible, it is given by . Note that all vectors are column vectors, and denotes the transpose of . We can write the estimation error (1) where is orthogonal to the data, that is, (2). In practice, the statistics of the data is seldom known, and has to be estimated based on a single realization of . For example, such a problem arises in system identification and channel equalization [1, Introduction]. A common approach is to use stochastic adaptive algorithms, which recursively update an estimate of as more data becomes available. In this paper, we consider the fixed step-size least mean square (LMS) algorithm, which updates an estimate of using the recursion (3). Here is the fixed step size and is a deterministic initialization.

Manuscript received May 1, 2000; revised July 1, 2001. This work was supported by the Center for Wireless Communications, University of California, San Diego. The material in this paper was presented in part at CISS-2000, Princeton, NJ, March 2000. The authors are with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093 USA (e-mail: onkar@ucsd.edu; masry@ece.ucsd.edu). Communicated by U. Madhow, Associate Editor for Detection and Estimation. Publisher Item Identifier S 0018-9448(02)05156-8.

The LMS algorithm, and its variants, have been used in a variety of applications, and many researchers have analyzed them (see, for example, [2]–[22]). To give a flavor of the analytical results known so far, we briefly mention some previous results for dependent data. In the literature, two commonly used performance criteria are the estimation error in the filter coefficients (also called the deviation error), and the signal estimation error. In [4], convergence in distribution (as ) of is established for bounded uniformly mixing data. In [6], for -dependent data, it is shown that as tends to infinity, is bounded by a multiple of . For bounded, purely nondeterministic regressors, the time average is analyzed in [10] as . In [12], [23], [24], for a general class of adaptive algorithms, asymptotic normality of is established by letting in such a way that remains bounded.
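Because the displayed equations did not survive reproduction, the following is a minimal numerical sketch of the fixed step-size recursion (3) in the standard LMS notation assumed in this discussion (regressor vectors X_n, desired signal d_n, weight vector W_n, step size mu). The function name and data layout are illustrative choices, not part of the paper.

```python
import numpy as np

def lms(X, d, mu, w0=None):
    """Fixed step-size LMS, recursion (3): W_{n+1} = W_n + mu * X_n * (d_n - X_n^T W_n).

    X  : (n_samples, n_taps) array whose rows are the regressor vectors X_n
    d  : (n_samples,) array of desired-signal samples d_n
    mu : fixed step size
    w0 : deterministic initialization (defaults to the zero vector)
    """
    n_samples, n_taps = X.shape
    w = np.zeros(n_taps) if w0 is None else np.asarray(w0, dtype=float).copy()
    trajectory = np.empty((n_samples, n_taps))
    for n in range(n_samples):
        e = d[n] - X[n] @ w          # a priori signal estimation error
        w = w + mu * e * X[n]        # weight update
        trajectory[n] = w
    return w, trajectory
```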
In [13], [21], [22], for specific examples it is shown that as , can be smaller than , that is, in some cases the LMS algorithm outperforms . Many authors have also analyzed the speed of convergence, and the most recent work is that in [15] and [20]. In Section III, we compare our results with some of the above mentioned results; however, we next compare our contribution to the more recent works in this area. Our results are closest in spirit to the recent results in [14], [17, Theorem 5], and [20]. In [14] and [17], a simple approximation is given for when is time-varying. Even when is time-invariant, which is the case in this paper, the results in [14] and [17] are the most general results known for error analysis in the transient phase. However, [14] and [17] impose restrictive conditions on the relationship between the desired signal and the data. In [14], it is assumed that and are independent (4) and that is zero mean and white. (5) Condition (4) implies that (1) is a linear regression model. In the general case, and may be nonlinearly related, and, though (2) holds, and may be dependent. Also, the sequence may be correlated, and (5) may not hold. In [17, Theorem 5], it is assumed that (6). If the data is zero mean Gaussian, then (6) implies that are independent and identically distributed (i.i.d.) and is independent of .
The conditions (4)–(6) are very restrictive, and they are not satisfied in applications such as channel equalization. In this paper, we analyze the LMS algorithm without putting any strong restriction on the relationship between the desired signal and the data. We not only extend the results in [14] and [17], but we also prove additional new results. Specifically, we provide a comprehensive quadratic-mean analysis of the LMS algorithm for dependent data, under conditions considerably weaker than those in the literature.

1) We approximate by a matrix , which is specified by a simple recursion depending on the statistics of the data. The approximation error vanishes as . In the general case of dependent data and a nonlinear regression model, itself does not have a simple recursion. This result extends the result in [14] and [17] by removing (4)–(6).

2) For small , we prove the convergence of the algorithm in the following sense: there exists such that for , exists. To the best of our knowledge, for dependent data, convergence of second-order statistics of the LMS algorithm has not been established before.

3) We show that the limit exists, and it satisfies the Lyapunov equation given in [23].

4) We study the excess signal estimation error . We approximate by a simple expression, which can be computed using a simple recursion. The approximation error is small when is small. This result generalizes a result in [14], where (4) and (5) are assumed.

5) We show that for sufficiently small step size, the limit exists, and we derive the limit in terms of the statistics of the data. This expression is new. In particular, our result shows that under certain conditions, the LMS algorithm gives a smaller signal estimation error than the Wiener filter. No previous results explain this phenomenon.

6) We analyze a new measure of transient speed.

We assume to be an instantaneous vector transformation of a stationary Markov process satisfying certain ergodic conditions. Our assumptions are satisfied for many examples of practical importance, and they allow applications such as channel equalization, where (4)–(6) are not true (see Section II). We note that itself is not Markovian, and it need not even be uniformly mixing (see Section II). Most papers dealing with the convergence of the LMS algorithm establish some form of exponential convergence of products of random matrices. In Lemma 11 of Appendix I, we obtain exponential convergence for products of random matrices using the operator-theoretic framework for Markov processes [25, Ch. 16]. Lemma 11 establishes refinements of some results in [14], which are critical to prove Theorems 1–4 without assuming (4) and (5). Since we do not assume (4) and (5), the analysis in this paper is substantially different from that in [14].

The paper is organized as follows. In Section II, we present our assumptions and provide examples. In Section III, we state and discuss the main results, and we also compare our results with previously known results. In Section IV, we prove the main results using a series of lemmas which are proved in Section V. In Section VI, we present the conclusion. Lemma 11 and a few preliminary results are proved in Appendix I. In Appendix II, we prove related refinements of some results in [14].

II. ASSUMPTIONS AND EXAMPLES

In this section, we first state our assumptions, and then we give detailed explanations and examples.
( ) is an aperiodic, -irreducible (see [25]), stationary Markov process with state space, there exists a measurable function such that a) ; b) for any measurable function with, there exists such that ( ) a), is measurable,, is measurable. b), is a,,. c). d),. e) is positive definite. Discussion of : Using the terminology in [25], assumption states that is a -uniformly ergodic Markov process. This also implies that is -uniformly ergodic. (We have stated this assumption for instead of in order to avoid carrying a square root sign throughout the paper.) Assumption b) implies that for a class of functions, the conditional expectation converges exponentially fast (as ) to the unconditional expectation. Though at first sight this assumption looks difficult to verify, [25, Theorem 16.0.1 Lemma 15.2.8] give a simple criteria to verify it: find a function such that,. It turns out that in most applications, can be chosen to be a polynomial in, or an exponential function of a power of. For example, consider, are i.i.d. In [14, Sec. 3.1] it is shown that if for some, the eigenvalues of are strictly inside the unit circle, is controllable, has an every positive density, then is satisfied with the
choice . Similarly, if for some , then there exists such that is satisfied with the choice .

Discussion of a): This assumption states that the data is an instantaneous transformation of a Markov process. Note that the data itself need not be Markovian. This assumption has been used before in [14] and [23]. We give examples later to show how this assumption arises in practice. From [26, Theorem 4.3], we know that is absolutely regular. Absolute regularity is a popular form of mixing, which is weaker than uniform mixing (φ-mixing) and stronger than strong mixing (α-mixing). The assumption of uniformly mixing data has been used before (see, for example, [4] and [16]). We note that the Markov process is in general not uniformly mixing: from [26, Sec. 4, Theorem 6.0.2] we know that is uniformly mixing if and only if for all .

Discussion of b): For , b) is the same as [14, Sec. 4, ]. The extra assumption we make for is not at all restrictive, and it is dealt with exactly as in [14]. This assumption puts a restriction on the growth of with respect to the growth of the drift . If is a polynomial in , then without any further assumptions, this condition requires the data to be bounded. However, for an exponential drift , the components of are permitted to be any polynomial, and even a function with suitably slow exponential growth is allowed. We note that this assumption allows data for which all the moments are not finite (see the examples below).

Discussion of c)–e): c) is satisfied if are bounded by a multiple of . This is a very mild condition, especially for an exponential drift function. Under this restriction on the growth of , assumption d) is satisfied provided . Such growth conditions arise because we wish to apply the convergence of the conditional expectation (see b)) for these functions. The positive definiteness of is commonly used to guarantee the existence of a unique .

Example 1: Let be a Gaussian vector ARMA( ) process. Then by stacking together the vectors , we can define . As in [27, Example 12.1.5, p. 468], we can write . Under the assumptions on stated in the discussion of above, an exponential drift can be chosen. Hence linear functions of are allowed, and in particular satisfies our assumptions. Similarly, we can also take to be any polynomial transformation, or even a suitable exponential transformation, of . In the latter case, all the moments of the data may not be finite. Also note that and are in general nonlinearly related.

Example 2: For the case of channel equalization, the conditions (4)–(6) are not satisfied. We now show that our assumptions are satisfied for this application. Let the channel output be

Here and are independent information sources of two users experiencing channels with impulse responses and , respectively. The channel noise is independent of the information sources. The aim is to estimate based on in the presence of interference due to user 2, and in the presence of channel noise. The information sources are usually modeled as irreducible, aperiodic Markov chains with a finite state space, and the noise is modeled as zero mean, i.i.d. Assuming , it is easy to see that our assumptions are satisfied by choosing . This example can clearly be extended to any finite number of users, and using the method in Example 1, an additional narrowband interference satisfying a scalar autoregressive moving average (ARMA) model can also be included. More examples can be found in [14, Sec. 3] and [25].
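As a concrete illustration of Example 2, the sketch below simulates a two-user channel with binary Markov information sources and i.i.d. Gaussian noise and runs an LMS equalizer for user 1. All parameter values (transition probability, impulse responses, noise level, number of taps, step size) are assumptions chosen for illustration only and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def markov_pm1(n, p_stay=0.9):
    """+1/-1 source driven by a two-state, irreducible, aperiodic Markov chain."""
    s = np.empty(n)
    s[0] = 1.0
    for k in range(1, n):
        s[k] = s[k - 1] if rng.random() < p_stay else -s[k - 1]
    return s

n = 50000
b1, b2 = markov_pm1(n), markov_pm1(n)                   # two users' information sources
h1, h2 = np.array([1.0, 0.5]), np.array([0.7, -0.3])    # assumed channel impulse responses
y = np.convolve(b1, h1)[:n] + np.convolve(b2, h2)[:n] + 0.1 * rng.standard_normal(n)

# LMS equalizer for user 1: the regressor is a window of recent channel outputs,
# and the desired signal is user 1's current symbol.
taps, mu, w = 5, 0.01, np.zeros(5)
for k in range(taps - 1, n):
    X = y[k - taps + 1:k + 1][::-1]
    w += mu * (b1[k] - X @ w) * X
```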
Since some of our results are similar in nature to those in [14] [17, Theorem 5], we state below the main differences in our assumptions. We do not assume (4) (5), which are assumed in [14]. As indicated in the discussion of our assumptions above, the extra assumptions we make over [14] are mild. We do not assume (6), which is assumed in [17, Theorem 5]. Also, the moment restrictions in [17] are more stringent. For example, for i.i.d. data, condition (15) in [17] requires that the density of the data should decay like a Gaussian or faster than a Gaussian. On the other h, our assumptions allow data which can be an exponential transformation of Gaussian data (see Example 1). In particular, if are i.i.d., are finite, then it is easy to verify that our assumptions are satisfied with. III. MAIN RESULTS Notation: Let. For a square matrix, denotes the trace, denotes the matrix norm induced by the Euclidean norm. By the phrase sufficiently small, we mean such that for. In many of the expressions, denotes a term bounded by a multiple of, the bound is uniform in other parameters that may be involved in the expressions. For an matrix, a matrix, denotes the Kronecker product which is defined to be the matrix whose th block is the matrix,,. For an matrix, denotes the -dimensional column vector obtained by stacking the columns of into a single vector. A. Analysis of the Mean of the Deviation Error Under assumptions (4) (5), it is shown in [28, Ch. 6] [14, Theorem 3] that converges exponentially fast to zero as, that is, the filter coefficient estimate based on the LMS algorithm is asymptotically unbiased. However, this is not always the case when (4) (5) are violated. Consider the following example.
1876 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 Example 3: Let be i.i.d. Gaussian rom variables with zero mean unit variance. Let the desired signal. Then by simple calculations using (3), it follows that exists, (9) (10) Hence for,, that is, the filter coefficient estimate based on the LMS algorithm is not asymptotically unbiased. In the general case, we have the following result. Theorem 1: Suppose assumptions are satisfied. Then for sufficiently small is uniform in. Also exists for sufficiently small. Further, if is independent of, then can be replaced by. The proof of this result is much simpler than the proof of the other results in this paper, the method of proof is also similar to the proof of the other results. Hence, we do not present the proof of Theorem 1 in this paper. The interested reader is referred to the Ph.D. dissertation [29]. B. Analysis of the Correlation Matrix of the Deviation Error Theorem 2: Suppose assumptions are satisfied. Then we have the following results. 1) For sufficiently small (7), are uniform in, for In the above recursion the infinite series converges absolutely. 2) For sufficiently small, exists, exists, it satisfies the Lyapunov equation. The proof is given in Section IV-A. Before discussing the above result, we state two corollaries. Corollary 1: Suppose assumptions are satisfied. Then for sufficiently small step-size (8) the infinite series converges absolutely. The proof is given in Section V-H. Corollary 2: Suppose assumptions are satisfied. Then for sufficiently small is uniform in. The proof is given in Section V-I. In the initial phase of the algorithm, is large compared to, while the approximation error in Part 1 of Theorem 2 is of the order of. When is large, the approximation error is of the order of, while is of the order of. Hence, Part 1 of Theorem 2 implies that for small, is a good approximation to in the transient phase as well as in steady state. Note that the recursion for is easy to analyze, while for the general case of dependent data nonlinear regression model, itself does not have a simple expression. If we assume (6) (or (4) (5)), then. The corresponding simplified form of Part 1 of Theorem 2 is similar to the result obtained by applying [14, Theorem 4] (also see [17, Theorem 5]) to the special case of the LMS algorithm being used to estimate a time-invariant parameter. However, we have a better rate for the error term than in [14] [17], the error term is. Thus, we not only remove assumptions (4) (6), but we also obtain a better rate for the approximation error. The literature on the performance analysis of the LMS algorithm, its variants, can be divided into two categories: that which studies quantities like the asymptotic error by assuming the convergence of the algorithm, that which first establishes the convergence of the algorithm. Our work is in the spirit of the second category. In Part 2 of Theorem 2, we first establish the existence of the limit, then we study its behavior for small. Note that the convergence of as does not follow from Part 1 of Theorem 2. To the best of our knowledge, for dependent data, convergence of second-order statistics of the LMS algorithm has not been shown before. In order to explicitly indicate the dependence of on, we denote it by. In [23, Theorem 15, p. 335], under the assumption that the data is an instantaneous function of a geometrically ergodic Markov process, an asymptotic normality result for general adaptive algorithms is established. 
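The recursion defining the approximating matrix in Part 1 of Theorem 2 was lost in reproduction, so it is not restated here. The sketch below only illustrates, by direct Monte Carlo averaging over independent runs, the second-order statistic E[(W_n - W*)(W_n - W*)^T] whose transient and limiting behavior Theorem 2 describes. The AR(1) regressor model, step size, and run counts are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, taps, n_iter, n_runs, rho = 0.02, 2, 2000, 200, 0.8

def make_data(n):
    """Dependent data: a Gaussian AR(1) path x, 2-tap regressors, one-step-ahead target."""
    x = np.empty(n + 2)
    x[0] = rng.standard_normal()
    for k in range(1, n + 2):
        x[k] = rho * x[k - 1] + rng.standard_normal()
    X = np.column_stack([x[1:n + 1], x[:n]])   # row i is the regressor (x_{i+1}, x_i)
    d = x[2:]                                  # desired signal: the next sample
    return X, d

# Wiener solution estimated from a long independent sample.
Xl, dl = make_data(200000)
w_star = np.linalg.solve(Xl.T @ Xl / len(dl), Xl.T @ dl / len(dl))

# Monte Carlo estimate of Gamma_n = E[(W_n - W*)(W_n - W*)^T] across independent runs.
acc = np.zeros((n_iter, taps, taps))
for _ in range(n_runs):
    X, d = make_data(n_iter)
    w = np.zeros(taps)
    for k in range(n_iter):
        w += mu * (d[k] - X[k] @ w) * X[k]
        v = w - w_star
        acc[k] += np.outer(v, v)
Gamma = acc / n_runs        # Gamma[k] settles toward a limiting matrix as k grows
```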
(Similar results have also been proven in [24] [12].) For the case of the LMS algorithm, [23, Theorem 15, p. 335] implies that if we let such that remains bounded, then converges in distribution to a normal distribution with zero mean covariance matrix satisfying. This weak convergence result rules out the case of practical interest: fixed large. Also, this asymptotic distribution result
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1877 does not imply convergence of the second moment. In contrast, in Theorem 2 we study the second-order statistics of. In Part 1, we present a result about the behavior of during the transient phase. In Part 2, for fixed, we prove the convergence (as ) of, then we study for small. Also B(ii), B(iii), B(iv) of [23, p. 335] are assumptions on the sequence, these appear to be difficult to verify. On the other h, all our assumptions are on the data process. Using [30, eq. (12), p. 20] the positive definiteness of, it is easy to show that the unique solution to is specified by If is the eigenvector of corresponding to eigenvalue,, then can also be written as In [5], under the assumption that the process -dependent, it is shown that for sufficiently small In comparison, Corollary 1 is a stronger result: we show that the limiting value is well defined for sufficiently small, we establish the precise rate at which the asymptotic error converges to as, we obtain the asymptotic in terms of the statistics of the data. Furthermore, we remove the assumption of -dependent data. For small,, is the on the right-h side of (10). Corollary 2 is a consequence of Theorems 1 2. As per this result, for small, can be approximated by the deterministic vector. This approximation is meaningful during the initial phase of the algorithm when is large compared to. A similar result was suggested in [20] without proof. C. Analysis of the Excess Signal Estimation Error Theorem 3: Suppose assumptions are satisfied. Then we have the following results. 1) For sufficiently small,, are uniform in. The matrices are as defined in Theorem 2. 2) For sufficiently small, exists, the infinite series converges absolutely. is (11) The proof is given in Section IV-B. Part 1 of Theorem 3 provides a simple approximation to the excess signal estimation error. For small, the approximation is good in the transient phase as well as in steady state. If we assume (4) (5), then in the recursion (8) for, is replaced by. This simplified result is similar to that obtained by using [14, Theorem 4], the error term is. Thus, for the case of a time-invariant parameter,we not only remove assumptions 4) 5) made in [14] without any additional restrictive assumptions, but we also obtain a better rate for the error term. If we assume (4) (5), then the limit in (11) simplifies to. This simplified result has been established in [15], under the assumption of i.i.d. Gaussian it can also be derived from the exact expression for given in [8]. In comparison, our result is valid even when are nonlinearly related, the data is dependent. Equation (11) implies that for small,, is the on the right-h side of (11). For two specific examples, it was shown in [21] that the LMS algorithm asymptotically results in a smaller signal estimation error than the optimal Wiener filter. A similar result was also stated for one-step prediction of an AR process for a normalized LMS algorithm in [13]. The result (11) of Theorem 3 can be used to investigate this phenomenon under general conditions. Let denote the on the right-h side of (11). If is negative, then for sufficiently small,, that is, the LMS algorithm gives a smaller signal estimation error than the optimal Wiener filter. As explained in [21], the reason for this is that the LMS algorithm, due to its recursive nature, utilizes all the past data, while the Wiener filter only uses the present data. 
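The following sketch gives one empirical way to probe this phenomenon: it estimates, by simulation, the signal estimation error of a one-tap Wiener predictor and of a one-tap LMS predictor for a strongly correlated Gaussian AR(2) process (for which the best causal linear predictor needs two taps). The process coefficients and step size are illustrative assumptions, not the paper's Example 4, and whether the LMS error actually drops below the one-tap Wiener error depends, per (11), on the statistics of the data and on the step size.

```python
import numpy as np

rng = np.random.default_rng(2)

def ar2(n, a1=1.6, a2=-0.8, burn=1000):
    """Zero-mean stationary Gaussian AR(2) path (assumed illustrative coefficients)."""
    x = np.zeros(n + burn)
    for k in range(2, n + burn):
        x[k] = a1 * x[k - 1] + a2 * x[k - 2] + rng.standard_normal()
    return x[burn:]

n, mu = 400000, 0.01
x = ar2(n)
d, X = x[1:], x[:-1]                     # predict x_{k+1} from the single tap x_k

# One-tap Wiener predictor and its signal estimation error E[(d_k - w* X_k)^2].
w_star = np.dot(X, d) / np.dot(X, X)
mse_wiener = np.mean((d - w_star * X) ** 2)

# One-tap LMS predictor; the error is measured after an initial transient.
w, sq_err = 0.0, np.empty(len(d))
for k in range(len(d)):
    e = d[k] - w * X[k]
    w += mu * e * X[k]
    sq_err[k] = e * e
mse_lms = np.mean(sq_err[10000:])

print(f"one-tap Wiener MSE: {mse_wiener:.4f}   one-tap LMS MSE (mu={mu}): {mse_lms:.4f}")
```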
Unlike [21] [13], which only deal with specific examples, (11) holds for a large class of processes. The is always positive if (6) is true, or (4) (5) are true. Hence, conventional analysis based on these assumptions does not indicate this phenomenon. In [31], lower bounds on the estimation error of causal estimators are derived using Kalman filtering. These bounds are of interest to lower-bound the signal estimation error of the LMS algorithm. Bounds based on causal Wiener filters are also given in [22]. Comparison of our results with these lower bounds is not feasible for the general case. However, we give a comparison for the example below for small. Example 4: Let,, be a zero mean, stationary Gaussian AR process with. Let the desired signal be.for Thus, if the s are sufficiently positively correlated, then the LMS algorithm outperforms the Wiener filter. For this example, the lower bound in [31, Theorem 2.3] is. The smallest signal estimation error is achieved by the causal Wiener filter using two taps. For small, the LMS algorithm employing one tap gives the signal estimation error. As,, hence choosing for
1878 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002, for close to one,. Thus, for close to one, the LMS algorithm with one tap comes close to the optimal predictor, provided the step size is chosen appropriately. D. Analysis of the Measure of Transient Speed Consider the measure of transient speed This is a measure of how fast converges to its steadystate value. Since we sum over all, this measure also takes into account the transient phase of the algorithm. The smaller the value of, the faster is the speed of convergence. We have the following result. Theorem 4: Suppose assumptions are satisfied. Then for sufficiently small, the measure of transient speed is finite it satisfies (12) The proof is given in Section IV-C. Let, is the eigenvector of corresponding to the eigenvalue. Then Using a heuristic argument, [20] proposes a measure of transient speed (see [20, eq. (8)] previous sections, which, even for infinitesimal, depend on the dependence structure of the data. In [19], the following scheme is suggested for improving the speed of the LMS algorithm: use in place of in (3), is an matrix obtained by throwing away the rows of an orthogonal matrix. Using the normalized measure of transient speed defined above, we show in the Ph.D. dissertation [29] that the speed of convergence increases for an obtained from the Karhunen Loeve transform matrix (a matrix whose rows are eigenvectors of ). However, no single orthogonal transformation improves speed for all. Remark 1: Consider the measure of transient speed which is based on the signal estimation error. Under assumptions, it can be shown that. For the normalized measure of transient speed,,. The highest value of this (see [30, eq. (1), p. 72]) is, as in the case above, it coincides with the measure of speed proposed in [15]. IV. PROOF OF MAIN RESULTS Notation: To maintain simplicity of expressions, we define.by we denote the matrix product for, for the empty product is to be interpreted as the identity matrix. Let denote a function such that For small, behaves similar to. Thus, the analysis of the intuitively appealing measure of speed, leads to a rigorous justification of [20, eq. (8)]. Consider the normalized measure of transient speed Since let.by we denote an -dimensional column vector with at th position otherwise,. We use Kronecker product of matrices Lemma 13 summarizes the properties we need. Many of the basic properties mentioned in Lemma 13 are used frequently, except for the first few instances, we do not refer to them. By simple calculations using (1) (3), we get the recursion The highest value of this (see [30, eq. (1), p. 72]),, corresponds to the slowest speed of convergence, in this case, for small,. For the LMS algorithm, [15, Theorem 1] implies that the speed of convergence is inversely proportional to. Thus, the measure of speed proposed in [15] corresponds to the worst case previously mentioned. Unlike [15], we do not assume (4) (5). Also, we provide a rigorous proof, while the proof in [15, Theorem 1] is based on the independence assumption. The depends only on. We get the same if we assume to be i.i.d. Thus, for small, the transient speed is not affected by the presence of correlation amongst the s. Hence, analysis based on the independence assumption [1] leads to the same result. This is in contrast to the results in the Note that is is. 
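The displayed recursion referred to above did not survive reproduction. In the standard notation assumed in this discussion (W_n the LMS weight vector, W* the Wiener solution, and e_n the Wiener error of (1) and (2)), substituting (1) into (3) gives the usual deviation-error recursion; the following is a reconstruction in that assumed notation rather than a verbatim restatement of (13).

```latex
% Assumed notation: \widetilde{W}_n = W_n - W^*, \quad e_n = d_n - X_n^T W^*.
\widetilde{W}_{n+1}
  = \widetilde{W}_n + \mu X_n \bigl(d_n - X_n^T W_n\bigr)
  = \bigl(I - \mu X_n X_n^T\bigr)\,\widetilde{W}_n + \mu\, e_n X_n .
```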
In our calculations we need the following expression for which is obtained by repeated application of the above recursion (13) (14) We first study the expression through a series of lemmas, then we prove our main results. Using (13)
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1879 (15) (16) Note that in the preceding equation, in the equations that follow, is to be interpreted as. Let the diagonal term ( )be (17) Using Lemma 13 i), ii) we can write (18) (20) (19) let the off-diagonal term ( )be Here is, is, is. Lemma 1: If assumptions are satisfied, then for sufficiently small (21) Further, for sufficiently small, Lemma 3: If assumptions are satisfied, then for sufficiently small The proof is given in Section V-A. Lemma 2: If assumptions are satisfied, then for sufficiently small Further, for sufficiently small, exists, exists Further, for sufficiently small,,, The proof is given in Section V-D. The analysis of requires more decomposition. From (14) (18) The proof is given in Section V-B. We have the following result for the off-diagonal term. Lemma 4: If assumptions are satisfied, then for sufficiently small Further, for sufficiently small, exists, exists, The proof is given in Section V-C.
1880 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 A. Proof of Theorem 2 Suppose we choose is an -dimensional column vector with at th position otherwise. Here, take values from. Then,. Thus, by choosing different values of, we can get all the entries in. Also, since,. Hence, we can apply the result of Lemmas 1 4. Using, we obtain, B. Proof of Theorem 3 From the definition of,,, weget. Hence, To prove Theorem 3, we derive the contribution of each of these terms separately. Lemma 5: If assumptions are satisfied, then for sufficiently small Further, for sufficiently small, exists Let is as defined in Theorem 2. Therefore, (22) (23) the infinite series converges absolutely. The proof is given in Section V-E. The cross term also contributes to the asymptotic estimation error we have the following lemma. Lemma 6: If assumptions are satisfied, then for sufficiently small Further, for sufficiently small, exists For sufficiently small, by Lemma 12 vi), for some. Hence, it follows that the infinite series converges absolutely. The lemma is proved in Section V-F. Theorem 3 now follows directly from Lemmas 5 6. It is easy to see that satisfies the recursion (8). For sufficiently small, the eigenvalues of are in hence C. Proof of Theorem 4 If we choose, then by Lemma 13 iii),. Now consider is well-defined. From (8) it follows that (24) (25) Also from Lemmas 1 4, exists for sufficiently small is well-defined. Further, so Note that in we do not take the absolute value of hence we can separately calculate the contribution of,,, to. From Lemmas 1 4 Lemma 13 vi), we get that,. Since, we get (26) In Section V, the proof of Lemmas 1 4 yields Corollary 3 which shows that (27) Hence, from (25) we get that. satisfies the Lyapunov equation Equation (12) now follows from (26) (27).
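The limiting matrix in Theorem 2 is characterized by a Lyapunov equation, and the vec/Kronecker identity vec(AXB) = (B^T ⊗ A) vec(X) used in the notation of Section III turns such an equation into an ordinary linear system. The sketch below solves a generic Lyapunov equation of the common form A Q + Q A^T + C = 0 in this way; the matrices are placeholders, since the coefficient matrices of the paper's equation were lost in reproduction, and the form shown is not necessarily the exact one appearing in [23].

```python
import numpy as np

def solve_lyapunov(A, C):
    """Solve A Q + Q A^T + C = 0 for Q via the column-stacking vec identity:
    vec(A Q I + I Q A^T) = (I kron A + A kron I) vec(Q).
    Placeholder coefficient matrices only; not the paper's specific equation.
    """
    n = A.shape[0]
    I = np.eye(n)
    K = np.kron(I, A) + np.kron(A, I)
    q = np.linalg.solve(K, -C.flatten(order="F"))   # column-stacking convention for vec
    return q.reshape((n, n), order="F")

# Example with an assumed stable A (eigenvalues with negative real part) and symmetric C.
A = np.array([[-1.0, 0.3], [0.0, -0.5]])
C = np.array([[1.0, 0.2], [0.2, 0.5]])
Q = solve_lyapunov(A, C)
print(np.allclose(A @ Q + Q @ A.T + C, 0))          # True
```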
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1881 V. PROOF OF LEMMAS By (59) of Lemma 11, for sufficiently small we get Notation: We define. With these definitions. For a vector, denotes the th component of. For a vector function, denotes the th component function. Indicator function of a set is denoted by. for a matrix, is the corresponding induced norm. By we denote the Banach space of measurable functions such that, (29) (28) Note that for the function, a fixed vector in,, hence, it belongs to.if is a bounded operator on, then by we denote the operator norm induced by.by we denote the function obtained by the action of the operator on the function.by, a fixed vector in, we mean that the operator is acting on the function. By we refer to a family of operators such that. In many of the inequalities we use to refer to a that does not depend on the parameters involved in the inequalities. This section is organized as follows. Lemma 1 is proved in Section V-A. The proof of Lemma 2 follows the same main steps as that of Lemma 4 but the details are much simpler. Hence, we first prove Lemmas 3 4 in Sections V-B V-C, respectively. Lemma 2 is proved in Section V-D. Lemma 5 is proved in Section V-E Lemma 6 is proved in Section V-F. Equation (27) is proved in Section V-G Corollaries 1 2 are proved in Sections V-H V-I, respectively. All the proofs follow the following main theme: split the term under consideration into further terms by applying Lemma 11 deal with each of these terms separately; for the term under consideration, say, identify the order of the term as a function of, show that exists for sufficiently small, show that exists; study. We frequently use the preliminary results stated in Lemmas 12 13, except for the first few instances, we do not refer to them. (30) (31) are bounded operators on such that (32) Contribution of : By Lemma 11, we are free to choose. We choose for reasons which become clear in Corollary 3. Thus, by Lemma 13 iv) it has eigenvalues,. For all the eigenvalues of are in hence for sufficiently small,. Consider. By [30, eq. (11), p. 107] Hence, Contribution of : From (30) Now by (28), a), (32) as A. Proof of Lemma 1 By Lemma 11, defined by is a bounded operator on. Also. Hence, from (19), using stationarity Lemma 12 vii) Therefore, for sufficiently small. Consider as
1882 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 Contribution of : The analysis of follows exactly the same steps as above we get, for sufficiently small, as Contribution of : From (34) Lemma 1 follows directly from the analysis of,,. B. Proof of Lemma 3 Using stationarity in (20) followed by a change of variable we get Using the Cauchy Schwarz inequality for the inner product By the definition of Lemma 12 iv) we obtain As in Section V-A, we can apply Lemma 11. Using (58) we get, (33) Hence exists for sufficiently small ; the series involved converges absolutely. Further,. Consider (34) (35) are bounded operators on satisfying (32). Contribution of : For the eigenvalues of are in hence is well defined. Using Lemma 12 iv) Contribution of : The proof for follows exactly the same steps as we get, exists for sufficiently small, as Hence using Lemma 3 follows directly from the analysis of,. Thus, exists. Consider C. Proof of Lemma 4 The off-diagonal term given by (21) can be split into two terms:, in which the inner sum in (21) is over,, in which the inner sum in (21) is over. Using stationarity in we get
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1883 Substituting in place of we get (42) Note that unlike in Section V-A, we have chosen for convenience. We deal with each of these terms separately. The proof is very long we split it into three subsections. Lemma 4 then follows directly from Lemmas 7 10 proved in the following subsections. 1) Analysis of : we have used Lemma 13 ii) Now by changing the order of summation substituting in place of we get, (36) From (40), we can write (43) By similar calculations (37) (44) Lemma 7: If assumptions are satisfied, then for sufficiently small (38) (39) Further, for sufficiently small, exists, exists, Due to similarity of, in order to study we only need to study. We provide a proof for, the proof for follows almost identically. From (37), using Lemma 12 vii) Remark 2: For the term in we have From (58) of Lemma 11 we get, (40) Proof: Let are -dimensional column vectors. Note that does not depend on. Substituting in (43) for from (36) (41) (45)
1884 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 In the last step we have used the definition of Kronecker product we have split each -dimensional vector into vectors of dimension. The argument is independent of it suffices to consider the th term only. Let Similarly, term is uniform in. It follows that the are uniform in.so, exists (the series involved converges absolutely), is uniform in. Now consider Contribution of : By Lemma 12 vi) Hence, From the analysis of it follows that exists for sufficiently small as Now consider Further as Contribution of : Let Using Lemma 12 vi) hence exists (note that does not depend on ). Also, In the proof of Lemma 1 in Section V-A, we showed that Hence. Note that Also by Lemma 12 ii). Thus, defined by. Using the operators Further, by (2), follows that. Hence, by (62) of Lemma 11, it The norm of the second term is hence By (2), the first term is zero. Also, By Lemma 12 vi) we know that
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1885 It follows that,. Hence,. Using the same bounds as Thus, has been dealt with. Repeating the same steps as for small. Also,,weget for sufficiently This completes the proof. Lemma 8: If assumptions are satisfied, then for sufficiently small,. Further,. Proof: Now let are -dimensional column vectors. Following exactly the same steps as in the proof of Lemma 7 considering only the th term has been ana- Since the argument did not depend on, lyzed the proof of Lemma 8 is complete. 2) Analysis of : Lemma 9: If assumptions are satisfied, then for sufficiently small, exists. Further, Proof: From (41) Using (60) of Lemma 11 with,, We need the following bound: For we have the following bound: (46) the does not depend on,,. We write Hence,, are -dimensional column vectors. Using (36)
1886 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 Consider the th term In this case By Lemma 12 iii), hence, for the above bound, is finite for sufficiently small. Therefore,. Further, we can use dominated convergence as in (52) if we show that exists almost surely. But which is finite by Lemma 12 ii) (46). Hence, (47) the does not depend on,,,. Thus,. Using (60) of Lemma 11 with we get, (48) (49) a.s. since almost surely by a). Thus, (50) Remark 3: In [14], is assumed only for instead of. We have stated for (which implies for ). This allows us to apply Lemma 11 on the space as above. If this assumption is not made, then in order to apply Lemma 11 we need. This can be guaranteed only under the restrictive condition of bounded data. By orthogonality (2) is well-defined almost surely Applying dominated convergence to (52) as Thus, is well-defined. Using Lemma 12 iii) Let (47). Then by From (49) Consider (51) (52) Remark 4: The refinement of [14, Theorem 2], Lemma 11, permits us to show that exists. If instead [14, Theorem 2] is used, we only get that nothing can be said about the limit. Consider
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1887.So D. Proof of Lemma 2 Now consider. Substituting (14) in (16) using Lemma 13 i), ii) we get For (53) Similarly, for (54) Here is, is, is. Similarly, we have (55) Using these bounds (56) as Thus, has been dealt with. The analysis of is similar to that of. We get that exists for sufficiently small, the limit,, given by (53) (55), respectively, are similar they can be treated in the same way. As can be seen from (53), (54), (36), (37), corresponds to the term in except that is replaced by. Therefore, the main steps in the proof are similar to those in the proof of Lemma 4. But overall, the proof is simpler due to the absence of the summation present in (37). Since the proof does not involve any new ideas, we do not present it here. The interested reader is referred to the Ph.D. dissertation [29]. E. Proof of Lemma 5 Using Lemma 13 i) the fact that depends only on Thus, has been dealt with hence has been dealt with. Since the argument did not depend on, the proof of the lemma is complete. 3) Analysis of : Lemma 10: If assumptions are satisfied, then for sufficiently small, exists. Further, Note that since, by assumptions b) a). By Markovianity Thus, which is similar to the expression considered in Section IV. In this case, hence from Lemmas 1 4 The proof is very similar to that of Lemma 9 is not be presented here. The interested reader is referred to the Ph.D. dissertation [29].
1888 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 Using, (22),, it follows that. As shown in the proof of Theorem 2 in Section IV-A,,. Hence, we obtain that. From Lemmas 1 4, exists for sufficiently small. In the proof of Theorem 2 in Section IV-B, we showed that exists. Using the expression for above using, we get By orthogonality (2) zero. Further hence the first term is const. By the same argument, is bounded by a that does not depend on for sufficiently small. Hence, for sufficiently small. We now obtain the contribution due to. Using stationarity followed by a change of variable By Lemma 12 vi),. Hence, This completes the proof. F. Proof of Lemma 6 Using (13) By Markovianity with as in case of above Using (14), we can write We first show that does not contribute to for sufficiently small. Since, using Markovianity then stationarity we are using the notation of Lemma 11. Note that. Hence, by (62) of Lemma 11. By Lemma 12 ii), the function. Also, by [14, Lemma 3], the conditional expectation is a bounded operator on. Hence we can apply (61) of Lemma 11 with. Thus, we obtain By (2), the first term is zero. Now Similarly
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1889 Hence,. With the choice of made,. Hence from Lemma 13 i), iii) we obtain,. Further from the proof of Lemma 1 in Section V-A It follows that exists that. From the analysis of, we get that we have used Lemma 13 vi). Hence, exists for sufficiently small. Further The desired result now follows from (57). From Lemma 12 vi) we can write H. Proof of Corollary 1 From Theorem 2 it follows that,. Multiplying both sides by taking the trace,. Using the definition of, it follows that This completes the proof of the lemma. G. Proof of (27) Corollary 3: Under assumptions Note that in the last step we have used the fact that the infinite series in converges absolutely. The desired result follows by using again. Proof: Choosing Lemma 13 iii), from Section IV we get then using Here,,, are as in Section IV but with as defined in this section. Hence, I. Proof of Corollary 2 For sufficiently small, the eigenvalues of are in. Hence from (23) we obtain, is uniform in. Hence, from Theorem 2, we obtain that. Using this (7), it follows that therefore, From Lemmas 1 4 we get From the proof of Lemma 1 in Section V-A we have (57) VI. CONCLUSION In this paper, we analyzed the LMS algorithm in the transient phase as well as in steady state. We provided simple approximations for the estimation error in the filter coefficients, for the excess signal estimation error. For sufficiently small, we also proved the convergence of the LMS algorithm in the sense that the second-order statistics attain limiting values as the number of iterations increases to infinity. Further, we analyzed the steady-state errors as. The result for the excess signal estimation error shows that for sufficiently small, the LMS algorithm can result in a smaller signal estimation error than the Wiener filter. We also studied a measure of transient speed for small step size. Our analysis shows that for small, the transient speed does not depend on the signal estimation error the correlation amongst the s.
1890 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 Our result can also be used to analyze the scheme suggested in [19] for improving the speed of convergence. Unlike many of the previous works, we do not assume a linear regression model. We also consider dependent data. Our assumptions permit many data processes of practical importance. APPENDIX I PRELIMINARY LEMMAS In this section, we first prove an exponential convergence result for certain products of rom matrices, which are encountered in the analysis of the LMS algorithm. Lemma 11: Suppose, a) b) hold. Then is a bounded operator on,. Further, the operator satisfies (58) (59),, is any matrix for sufficiently small, the family of operators on, satisfies, the family satisfies. Similarly, Proof: Equations (58) (62) follow directly from Lemma 16, we only have to verify the assumptions of Lemma 16. First we note that by, is -uniformly ergodic, hence by [25, Lemma 15.2.9], it is also -uniformly ergodic. For a matrix let. Consider const. From these bounds b), assumption (64) of Appendix II follows. The fact that are bounded operators follows from Lemma 14. The extra assumption that the matrices be symmetric is satisfied as is symmetric. We also need to verify that the eigenvalues of these matrices are strictly negative. This follows from the assumption that is positive definite. We now state a few preliminary results. Lemma 12: Assumptions imply the following. i), for. ii). iii). iv). v). vi),,. vii) For,, is a bounded operator on as well as it satisfies (60) Proof: Taking expectation on both sides of using a) we obtain i). b) hence, (61),, the operators are as described in the notation at the beginning of Section V. Also, if, then ii) now follows from c). Similarly Hence iii) follows from i) d). Similarly (62)
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1891 By i),. Since vii) For symmetric invertible by d). By c) which is finite by i). Thus, iv) follows. v) follows because which is finite by assumption By [25, eq. (16.17), p. 388] c). (63), is as in assumption a). By ii), for some, hence,,. By orthogonality (2),. Therefore, by setting applying (63) the desired result follows. vii) follows from Lemma 14 of Appendix II with Properties i) iv) can be found in [30] the references therein. v) vii) are proved in the Ph.D. dissertation [29]. APPENDIX II ANALYSIS OF PRODUCT OF CERTAIN RANDOM MATRICES In this appendix, we prove Lemma 16 which not only refines [14, Theorem 2], but also establishes a new result. We assume the Markov process to be -uniformly ergodic consider products of rom matrices which are functions of. The framework in this section is more general than that for the LMS algorithm. In [32] [33], a multiplicative ergodic theorem has been established for bounded functions of geometrically ergodic Markov processes. Lemma 16 gives a multiplicative ergodic theorem for certain matrices which are in general unbounded functions of an ergodic Markov process. In addition, this result also gives us control over the error term, which is critical for performance analysis of the LMS algorithm. Consider the space the operator We only have to verify the assumptions of Lemma 14: of Appendix II. is part of the hypothesis while has been verified in the proof of Lemma 11 above. Lemma 13: i). ii) Here,, are matrices of real-valued measurable functions on such that (64) Lemma 14: Suppose is -uniformly ergodic let (64) hold. Then, for any for all iii) are -dimensional vectors Further, is a bounded operator on, (65),. iv) Let denote the eigenvalues of with eigenvectors,. Then has eigenvalues with eigenvectors,. v) Let, be the eigenvalues of with eigenvectors. Then, has eigenvalues with eigenvectors,. vi) For symmetric, invertible Proof: Let (66) (67). For simplicity of notation, let. We get
1892 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 The step above follows from the Markov property, inequality follows from the definition of, inequality follows from (64), inequality follows by repeatedly applying inequality. This completes the proof of (65). Consider Using (64) the definition of, it follows that is a bounded operator for, hence is a bounded operator on. Finally, we prove (67). Since Due to space constraints we keep the proofs short. More detailed proofs are given in the Ph.D. dissertation [29]. Lemma 15: Suppose is -uniformly ergodic let (64) be true. In addition, suppose is symmetric has nonzero eigenvalues with corresponding eigen-projection,,,, for. Then, the operator has the decomposition such that,,. The operator satisfies For the operator, the eigenvalues,, the corresponding eigen-projections, the corresponding eigen-nilpotents satisfy (68) (69) (70) Remark 5: Equations (68) (70) are improvements of [14, eqs. (14) (16)] while (69) is the same as [14, eq. (15)]. Proof: The decomposition of the operator follows as in [14]. We now establish (68). From the proof of [14, Theorem 1] we have has nonzero eigenvalues,. We show in what follows that under the additional assumption of symmetric, are semisimple eigenvalues of. Equation (68) then follows from [34, eq. (2.41), p. 82]. To prove that the eigenvalues are semi-simple consider By (65), the above expression is equal to hence by Markovianity Since is symmetric, by spectral decomposition we have used. Thus, we have shown (67) for. In the same manner, using induction, it can be proved for. Reference [14, Theorem 2] establishes an exponential convergence result for thus provides a tool to deal with products of rom matrices like those considered in (67). In order to derive a refinement of [14, Theorem 2], we prove a lemma that improves [14, Theorem 1] under an additional assumption. But first we introduce some more operators. The expectation defines a bounded operator on.it is shown in the proof of [14, Theorem 1] that is a projection onto the subspace of functions which is a finite-dimensional subspace of. For any deterministic matrix,. Since expectation conditional expectation commute, we have. We use to denote the identity operator on. It is assumed that the reader is familiar with the proofs of [14, Theorem 1] [14, Theorem 2]. For the definitions of eigen-projections, eigen-nilpotents, semisimple eigenvalues we refer the reader to [34, p. 41]. Using, it is easy to show that, that is, is the resolvent (see [34, p. 36]) of. Using the spectral decomposition Further, the range of is finite dimensional for hence, are eigenvalues of with corresponding eigen-projections (see [34, p. 181]). The corresponding eigen-nilpotent hence, are semisimple eigenvalues the proof of (68) is complete. Equation (69) is the same as [14, eq. (15)]. We now prove (70). Let
DABEER AND MASRY: ANALYSIS OF MEAN-SQUARE ERROR AND TRANSIENT SPEED OF THE LMS ADAPTIVE ALGORITHM 1893 we have used (69). Since are eigen-projections eigen-nilpotents, (see [34, p. 39]). Further, the eigen-projections commute with each other with. Hence, (71) Now using (68) (69) it can be shown that. Since, the desired result follows. We now prove the main result of this appendix. Lemma 16: Suppose is -uniformly ergodic let (64) be true. Also, assume that is symmetric its eigenvalues,,, for. Let is any matrix. Then, for sufficiently small, for sufficiently small const (72) (73) (74) Remark 6: Equations (72) (73) refine [14, Theorem 2]. Equation (74) is new. Proof: From the proof of [14, Theorem 2, p. 595], to prove (72), we only need to show that By [14, Equation (17)] for Lemma 15, spectral decomposition for is the algebraic multiplicity of as an eigenvalue of. Using Lemma 15 using the assumption that, it is shown in [29] that each of the above terms is of order. Equation (73) can be proved in exactly the same way as [14, eq. (19)] but by using Lemma 15 instead of [14, Theorem 1]. Finally, we prove (74). Since,,,. Using we get that (75) Using,,, we obtain,. It is shown in [14] that for some. Using this it can be easily shown that (76) Using,, (75), (76), (see [14]), This completes the proof. REFERENCES [1] S. Haykin, Adaptive Filter Theory, 3rd ed, ser. Information System Sciences Series. Englewood Cliffs, NJ: Prentice Hall, 1996. [2] B. Widrow, J. M. McCool, M. G. Larimoore, C. R. Johnson, Stationary nonstationary learning characteristics of the LMS adaptive filter, Proc. IEEE, vol. 64, pp. 1151 1162, Aug. 1976. [3] R. R. Bitmead B. D. O. Anderson, Performance of adaptive estimation algorithms in dependent rom environments, IEEE Trans. Automat. Contr., vol. AC-25, pp. 788 794, Aug. 1980. [4] R. R. Bitmead, Convergence in distribution of LMS-type adaptive parameter estimates, IEEE Trans. Automat. Contr., vol. AC-28, pp. 54 60, Jan. 1983. [5] O. Macchi E. Eweda, Second-order convergence analysis of stochastic adaptive linear filtering, IEEE Trans. Automat. Contr., vol. AC-28, pp. 76 85, Jan. 1983. [6], Convergence analysis of self-adaptive equalizers, IEEE Trans. Inform. Theory, vol. IT-30, pp. 161 176, Mar. 1984. [7] W. Gardner, Learning characteristics of stochastic gradient-descent-algorithms: A general study, analysis critique, Signal Processing, vol. 6, no. 2, pp. 113 133, Apr. 1984. [8] A. Feur E. Weinstein, Convergence analysis of LMS filters with uncorrelated Gaussian data, IEEE Trans. Acoust., Speech. Signal Processing, vol. ASSP-33, pp. 222 229, Feb. 1985. [9] N. J. Bershad L. Z. Qu, On the probability density function of the complex scalar LMS adaptive weights, IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 43 56, Jan. 1989. [10] V. Solo, The limiting behavior of LMS, IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1909 1922, Dec. 1989. [11], The error variance of LMS with time-varying weights, IEEE Trans. Signal Processing, vol. 40, pp. 803 813, Apr. 1992. [12] J. Bucklew, T. Kurtz, W. Sethares, Weak convergence local stability properties of fixed step size recursive algorithms, IEEE Trans. Inform. Theory, vol. IT-30, pp. 966 978, May 1993. [13] J. J. Fuchs B. Delyon, When is adaptive better than optimal?, IEEE Trans. Automat. Contr., vol. 38, pp. 1700 1703, Nov. 1993. [14] G. V. Moustakides, Exponential convergence of products of rom matrices: Application to adaptive algorithms, Int. J. Adapt. Contr. Signal Processing, vol. 12, pp. 579 597, Nov. 
1998. [15], Locally optimum adaptive signal processing algorithms, IEEE Trans. Signal Processing, vol. 46, pp. 3315–3325, Dec. 1998. [16] L. Guo and L. Ljung, Performance analysis of general tracking algorithms, IEEE Trans. Automat. Contr., vol. 40, pp. 1388–1402, Aug. 1995. [17] L. Guo, L. Ljung, and G. Wang, Necessary and sufficient conditions for stability of LMS, IEEE Trans. Automat. Contr., vol. 42, p. 761, June 1996. [18] A. H. Sayed and M. Rupp, Error-energy bounds for adaptive gradient algorithms, IEEE Trans. Signal Processing, vol. 44, pp. 1982–1989, Aug. 1996.
[19] N. Erdol and F. Basbug, Wavelet transform based adaptive filters: Analysis and new results, IEEE Trans. Signal Processing, vol. 44, pp. 2163–2171, Sept. 1996. [20] J. Homer, R. R. Bitmead, and I. Mareels, Quantifying the effects of dimension on the convergence rate of the LMS adaptive FIR estimator, IEEE Trans. Signal Processing, vol. 46, pp. 2611–2615, Oct. 1998. [21] M. Reuter and J. R. Zeidler, Nonlinear effects in LMS adaptive equalizers, IEEE Trans. Signal Processing, vol. 47, pp. 1570–1579, June 1999. [22] K. J. Quirk, L. B. Milstein, and J. R. Zeidler, A performance bound for the LMS estimator, IEEE Trans. Inform. Theory, vol. 46, pp. 1150–1158, May 2000. [23] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximation. New York: Springer-Verlag, 1990. [24] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications. New York: Springer-Verlag, 1997. [25] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability. London, U.K.: Springer-Verlag, 1993. [26] R. C. Bradley, Basic properties of strong mixing conditions, in Dependence in Probability and Statistics. Boston, MA: Birkhauser, 1986, pp. 165–191. [27] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods, 2nd ed., ser. Springer Series in Statistics. New York: Springer-Verlag, 1991. [28] O. Macchi, Adaptive Processing: The Least Mean Squares Approach With Applications in Transmission. New York: Wiley, 1995. [29] O. J. Dabeer, Convergence analysis of the LMS and the modulus algorithms, Ph.D. dissertation, Univ. California, San Diego, 2002. [30] H. Lütkepohl, Handbook of Matrices. New York: Wiley, 1996. [31] R. Ravikanth and S. P. Meyn, Bounds on the achievable performance in the identification and adaptive control of time-varying systems, IEEE Trans. Automat. Contr., vol. 44, pp. 670–682, Apr. 1999. [32] S. Balaji and S. P. Meyn, Multiplicative ergodic theorems and large deviations for an irreducible Markov chain, Stochastic Processes and Their Applic., vol. 90, no. 1, pp. 123–144, 2000. [33] I. Kontoyiannis and S. Meyn, Spectral theory and limit theorems for geometrically ergodic Markov processes, paper, submitted for publication. [34] T. Kato, Perturbation theory for linear operators, in Classics in Mathematics. Berlin, Germany: Springer-Verlag, 1995, reprint of 1980 edition.