The p-norm generalization of the LMS algorithm for adaptive filtering

1 The p-norm generalization of the LMS algorithm for adaptive filtering
Jyrki Kivinen, University of Helsinki
Manfred Warmuth, University of California, Santa Cruz
Babak Hassibi, California Institute of Technology

2 Least Mean Squares (LMS) update
Pick learning rate $\eta > 0$. Initialize $w_0 = 0 \in \mathbb{R}^n$.
At time $t$, for $t = 1, \dots, T$, the algorithm
- observes input $x_t \in \mathbb{R}^n$,
- makes prediction $w_{t-1} \cdot x_t \in \mathbb{R}$,
- observes feedback $y_t \in \mathbb{R}$, and
- updates its hypothesis as $w_t = w_{t-1} - \eta (w_{t-1} \cdot x_t - y_t) x_t$.
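The LMS protocol above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk: the function names `dot` and `lms` and the toy data are assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def lms(examples: Sequence[Tuple[Sequence[float], float]],
        eta: float) -> Tuple[List[float], List[float]]:
    """Run LMS over (x_t, y_t) pairs; return final weights and all predictions."""
    n = len(examples[0][0])
    w = [0.0] * n                      # w_0 = 0
    predictions = []
    for x, y in examples:
        y_hat = dot(w, x)              # prediction is made with w_{t-1}
        predictions.append(y_hat)
        # w_t = w_{t-1} - eta * (w_{t-1}.x_t - y_t) * x_t
        w = [wi - eta * (y_hat - y) * xi for wi, xi in zip(w, x)]
    return w, predictions


# Hypothetical toy data generated by the noise-free target u = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w_final, preds = lms(data, eta=0.5)
print(w_final, preds)
```

Each prediction uses the weights from before the current example is seen, which is what makes the later bounds statements about a priori errors.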

3 Main Results
- Techniques from machine learning lead to generalizations of LMS
- $H^\infty$-optimal filtering in signal processing is similar to relative on-line loss bounds in machine learning

4 Motivation
- Non-Gaussian modeling
- Get away from rotation-invariant algorithms
- Develop algorithms that work well when instances are orthogonal and target weight vectors are sparse

5 Expected Bounds for LMS
Assume $y_t = u \cdot x_t + \nu_t$, where the $\nu_t$ are i.i.d. with $E[\nu_t^2] = \varepsilon$. Then
$$E\left[\frac{1}{T} \sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2\right] \le \varepsilon + \frac{1}{T} X_2^2 \|u\|_2^2$$
- Better algorithms exist for the probabilistic setting
- However, our goal is to weaken the assumptions

6 $H^\infty$ bound for LMS [HSK96]
Assume $\|x_t\|_2 \le X_2$ for all $t$. Choose $\eta = 1/X_2^2$. For any $u \in \mathbb{R}^n$,
$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (u \cdot x_t - y_t)^2 + X_2^2 \|u\|_2^2$$
- If some $u$ with small norm is a good predictor, then LMS must approximate the predictions of $u$
- Bound holds for any $u$ and $(x_t, y_t)$
- No probabilistic assumptions
- LMS is $H^\infty$-optimal: no algorithm can achieve ratio $< 1$ for all $u$ and $(x_t, y_t)$
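As a sanity check, the bound can be evaluated numerically on synthetic data. The sketch below is an illustration under assumptions made here (Gaussian inputs, noise level 0.1, the helper name `hinf_sides`); it takes $X_2$ to be the largest input norm in the sequence, which is the tightest value the assumption allows.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def hinf_sides(examples, u):
    """Run LMS with eta = 1/X_2^2 and return both sides of the bound."""
    x2_sq = max(dot(x, x) for x, _ in examples)   # X_2^2 = max_t ||x_t||_2^2
    eta = 1.0 / x2_sq
    w = [0.0] * len(u)
    lhs = 0.0          # sum_t (u.x_t - w_{t-1}.x_t)^2
    comparator = 0.0   # sum_t (u.x_t - y_t)^2
    for x, y in examples:
        y_hat = dot(w, x)
        lhs += (dot(u, x) - y_hat) ** 2
        comparator += (dot(u, x) - y) ** 2
        w = [wi - eta * (y_hat - y) * xi for wi, xi in zip(w, x)]
    return lhs, comparator + x2_sq * dot(u, u)


# Hypothetical synthetic data: Gaussian inputs, noisy linear labels.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(5)]
examples = []
for _ in range(200):
    x = [rng.gauss(0.0, 1.0) for _ in range(5)]
    examples.append((x, dot(u, x) + rng.gauss(0.0, 0.1)))

lhs, rhs = hinf_sides(examples, u)
print(lhs, rhs, lhs <= rhs)
```

Because the bound is stated for any comparator, `hinf_sides` can be called with any other vector in place of `u`, including the zero vector.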

7 Two related problems
- A priori filtering (control theory): try to match $u \cdot x_t$; the loss is $\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2$
- Prediction (on-line learning): try to match $y_t$; the loss is $\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2$

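8 Comparison of known LMS-related bounds
For $\eta = \alpha / \|x\|_2^2$,
$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (u \cdot x_t - y_t)^2 + X_2^2 \|u\|_2^2$$
For $\eta = \alpha / \|x\|_2^2$ ($0 < \alpha < 1$),
$$\sum_{t=1}^{T} (y_t - w_{t-1} \cdot x_t)^2 \le \frac{1}{1 - \alpha} \underbrace{\sum_{t=1}^{T} (u \cdot x_t - y_t)^2}_{\mathrm{Loss}_u} + \frac{1}{\alpha} X_2^2 \|u\|_2^2$$
With tuned $\alpha$, the right-hand side becomes
$$\mathrm{Loss}_u + 2 \sqrt{\mathrm{Loss}_u}\, X_2 \|u\|_2 + X_2^2 \|u\|_2^2 \quad \text{[CBLW96]}$$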

9 Generalizing the LMS bound
- Replace $\|x\|_2 \|u\|_2$ by $\|x\|_p \|u\|_q$, where $1/p + 1/q = 1$ and $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
- Instead of comparing predictions to $u \cdot x_t$ for a fixed target $u$, compare to $u_t \cdot x_t$, where $u_t$ may change
- Replace $(\cdot)^2$ by a more general loss

10 Basic LMS
The update step $-\eta (w_{t-1} \cdot x_t - y_t) x_t$ is added to $w_{t-1}$ to give $w_t$.
[Diagram: $w_{t-1} \to w_t$]

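11 p-norm LMS
Write $\theta_t = f(w_t)$.
[Diagram: $f$ maps $w_{t-1}$ in $W$-space to $\theta_{t-1}$ in $\Theta$-space, the update step is added there to give $\theta_t$, and $f^{-1}$ maps $\theta_t$ back to $w_t$ in $W$-space.]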

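12 p-norm LMS based on [GLS01]
$$w_t = f^{-1}\left(f(w_{t-1}) - \eta (w_{t-1} \cdot x_t - y_t) x_t\right)$$
where $f: \mathbb{R}^n \to \mathbb{R}^n$ is given by
$$f_i(w) = \frac{\mathrm{sign}(w_i) |w_i|^{q-1}}{\|w\|_q^{q-2}} \quad \text{and} \quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i) |\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$
- When $p = q = 2$, then $f(w) = w$: LMS
- For large $p$, $f^{-1}$ emphasizes differences in components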
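The link function and one p-norm LMS step can be sketched as follows. The helper names (`p_norm`, `link`, `p_norm_lms_step`) are assumptions made for this illustration; the code relies on the fact that $f$ and $f^{-1}$ have the same form with exponents $q$ and $p$ swapped, and it maps the zero vector to zero to avoid dividing by a zero norm.

```python
import math


def p_norm(v, p):
    """(sum_i |v_i|^p)^(1/p)."""
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def link(v, r):
    """sign(v_i) |v_i|^(r-1) / ||v||_r^(r-2); f uses r = q, f^{-1} uses r = p."""
    norm = p_norm(v, r)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(vi) ** (r - 1), vi) / norm ** (r - 2) for vi in v]


def p_norm_lms_step(w, x, y, eta, p):
    """One step: w_t = f^{-1}(f(w_{t-1}) - eta (w_{t-1}.x_t - y_t) x_t)."""
    q = p / (p - 1.0)                              # dual exponent: 1/p + 1/q = 1
    y_hat = sum(wi * xi for wi, xi in zip(w, x))   # prediction with w_{t-1}
    theta = link(w, q)                             # move to Theta-space
    theta = [th - eta * (y_hat - y) * xi for th, xi in zip(theta, x)]
    return link(theta, p)                          # map back to W-space


# With p = 2 the step coincides with plain LMS.
print(p_norm_lms_step([1.0, -0.5], [1.0, 1.0], 1.0, 0.5, 2.0))
# With p = 4 the same example gives a different weight vector.
print(p_norm_lms_step([1.0, -0.5], [1.0, 1.0], 1.0, 0.5, 4.0))
```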

13 A priori filtering bound
Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1) X_p^2)$. Then for any $u$ the p-norm algorithm satisfies
$$\sum_{t=1}^{T} (u \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 + (p-1) X_p^2 \|u\|_q^2$$
where $1/p + 1/q = 1$ and $2 \le p < \infty$, $1 < q \le 2$.
How do we get the dual norm pair $(\infty, 1)$ (where $\|x\|_\infty = \max_i |x_i|$)? For $p = 2 \ln n$,
$$(p-1) \|x\|_p^2 \|u\|_q^2 \le (2e \ln n) \|x\|_\infty^2 \|u\|_1^2$$
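The last inequality follows from $\|x\|_p \le n^{1/p} \|x\|_\infty$, $\|u\|_q \le \|u\|_1$, and $n^{2/p} = e$ when $p = 2 \ln n$. A small numerical illustration, with the dimension and the random vectors chosen here purely as an example:

```python
import math
import random


def p_norm(v, p):
    """(sum_i |v_i|^p)^(1/p)."""
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


n = 1000
p = 2.0 * math.log(n)          # p = 2 ln n, roughly 13.8 for n = 1000
q = p / (p - 1.0)              # dual exponent: 1/p + 1/q = 1

# Hypothetical test vectors with entries drawn uniformly from [-1, 1].
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
u = [rng.uniform(-1.0, 1.0) for _ in range(n)]

p_side = (p - 1.0) * p_norm(x, p) ** 2 * p_norm(u, q) ** 2
inf_side = (2.0 * math.e * math.log(n)
            * max(abs(xi) for xi in x) ** 2
            * sum(abs(ui) for ui in u) ** 2)

print(n ** (2.0 / p))          # the factor n^(2/p), equal to e up to rounding
print(p_side <= inf_side)
```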

14 Comparison with basic LMS
New bounds are incomparable with the old ones because, for $p > 2$ and $q < 2$, $\|x\|_p \le \|x\|_2$ and $\|u\|_q \ge \|u\|_2$.
Compare $p = 2$ and $p = O(\log n)$ in two extreme cases:
- Sparse target, dense instances: let $u = (1, 0, \dots, 0)$ and $x = (1, \dots, 1)$. Then $\|x\|_2^2 \|u\|_2^2 = n \gg (\log n) \|x\|_\infty^2 \|u\|_1^2 = \log n$. Thus large $p$ is better.
- Dense target, sparse instances: let $u = (1, 1, \dots, 1)$ and $x = (1, 0, \dots, 0)$. Then $\|x\|_2^2 \|u\|_2^2 = n \ll (\log n) \|x\|_\infty^2 \|u\|_1^2 = n^2 \log n$. Thus $p = 2$ is better.
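The two extreme cases are easy to reproduce numerically. In this sketch the helper name `extreme_case` and the choice $n = 1024$ are assumptions, and the natural logarithm stands in for the unspecified $\log$.

```python
import math


def extreme_case(u, x):
    """Return (||x||_2^2 ||u||_2^2, (log n) ||x||_inf^2 ||u||_1^2)."""
    n = len(u)
    l2_term = sum(xi * xi for xi in x) * sum(ui * ui for ui in u)
    linf_l1_term = (math.log(n)
                    * max(abs(xi) for xi in x) ** 2
                    * sum(abs(ui) for ui in u) ** 2)
    return l2_term, linf_l1_term


n = 1024
e1 = [1.0] + [0.0] * (n - 1)     # (1, 0, ..., 0)
ones = [1.0] * n                 # (1, 1, ..., 1)

sparse_target = extreme_case(u=e1, x=ones)   # n versus log n
dense_target = extreme_case(u=ones, x=e1)    # n versus n^2 log n
print(sparse_target, dense_target)
```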

15 The p-norm LMS can behave like EG
Hadamard matrix (figure labels: instances, targets)
- Instances are orthogonal
- Target weight vectors are unit vectors
- LMS: error $1 - k/n$
- p-norm LMS with $p = O(\log n)$: error $(\ln n)/k$

16 Time-varying target (following [HW01])
Up to now, the model has been $y_t = u \cdot x_t + \text{noise}$, where the target $u$ is fixed.
Generalize this to $y_t = u_t \cdot x_t + \text{noise}$, where the target $u_t$ may vary over time.
- Example 1: target makes one jump. Choose $a, b \in \mathbb{R}^n$ and take
$$u_t = \begin{cases} a & \text{for } 1 \le t \le T/2 \\ b & \text{for } T/2 < t \le T \end{cases}$$
- Example 2: target moves steadily. Choose $a, b \in \mathbb{R}^n$ and take
$$u_t = \frac{T - t}{T - 1}\, a + \frac{t - 1}{T - 1}\, b$$

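17 Algorithms for time-varying target
Old update:
$$w_t = f^{-1}\left(f(w_{t-1}) - \eta (w_{t-1} \cdot x_t - y_t) x_t\right)$$
Bounding update: after computing $w_t$ by the old update, set
$$w_t \leftarrow \begin{cases} w_t & \text{if } \|w_t\|_q \le U_q \\ U_q \dfrac{w_t}{\|w_t\|_q} & \text{otherwise} \end{cases}$$
where $U_q > 0$ is a norm bound.
We rescale the weight vector whenever its q-norm is larger than $U_q$.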
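The rescaling step on its own is a few lines. The function name `bound_weights` is an assumption made for this sketch; it would be applied to the result of each p-norm update.

```python
def p_norm(v, p):
    """(sum_i |v_i|^p)^(1/p)."""
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def bound_weights(w, q, U_q):
    """Return w unchanged if ||w||_q <= U_q, otherwise rescale it to q-norm U_q."""
    norm = p_norm(w, q)
    if norm <= U_q:
        return list(w)
    return [U_q * wi / norm for wi in w]


inside = bound_weights([0.3, 0.4], q=2.0, U_q=1.0)    # norm 0.5, left as is
outside = bound_weights([3.0, 4.0], q=2.0, U_q=1.0)   # norm 5, rescaled to norm 1
print(inside, outside)
```

Rescaling keeps the direction of the weight vector and changes only its length, so predictions shrink by a common factor.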

18 Bound for time-varying target
Theorem. Assume $\|x_t\|_p \le X_p$ for all $t$, and let $\eta = 1/((p-1) X_p^2)$. Then if $\|u_t\|_q \le U_q$ for all $t$, the bounded p-norm LMS satisfies
$$\sum_{t=1}^{T} (u_t \cdot x_t - w_{t-1} \cdot x_t)^2 \le \sum_{t=1}^{T} (y_t - u_t \cdot x_t)^2 + (p-1) X_p^2 U_q^2 + 2 (p-1) X_p^2 U_q \sum_{t=1}^{T-1} \|u_{t+1} - u_t\|_q$$
- Only the total distance $\sum_t \|u_{t+1} - u_t\|_q$ traveled by the target matters
- Cost $2 (p-1) X_p^2 U_q$ per unit of target movement
- For a fixed target, $u_{t+1} = u_t$, we recover the previous bound
- However, $U_q$ needs to be known in advance
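To illustrate that only the total distance matters, the sketch below computes $\sum_t \|u_{t+1} - u_t\|_q$ for the one-jump and the steadily moving targets from the earlier time-varying target slide. The function names, the endpoints `a` and `b`, and the values of `T` and `q` are assumptions chosen for the example. Both targets travel the same total distance $\|b - a\|_q$, so they incur the same movement penalty in the bound.

```python
def p_norm(v, p):
    """(sum_i |v_i|^p)^(1/p)."""
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def jump_target(a, b, t, T):
    """Example 1: u_t = a for 1 <= t <= T/2, and b for T/2 < t <= T."""
    return list(a) if t <= T / 2 else list(b)


def drift_target(a, b, t, T):
    """Example 2: u_t = ((T - t)/(T - 1)) a + ((t - 1)/(T - 1)) b."""
    lam = (t - 1) / (T - 1)
    return [(1.0 - lam) * ai + lam * bi for ai, bi in zip(a, b)]


def total_movement(target, a, b, T, q):
    """sum_{t=1}^{T-1} ||u_{t+1} - u_t||_q for the given target schedule."""
    total = 0.0
    for t in range(1, T):
        u_now = target(a, b, t, T)
        u_next = target(a, b, t + 1, T)
        total += p_norm([un - uc for un, uc in zip(u_next, u_now)], q)
    return total


a, b, T, q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 100, 1.5
jump_dist = total_movement(jump_target, a, b, T, q)
drift_dist = total_movement(drift_target, a, b, T, q)
print(jump_dist, drift_dist)
```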

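19 Bregman divergences
Key tool in analyzing and understanding the algorithms.
Fix a strictly convex differentiable $F: \mathbb{R}^n \to \mathbb{R}$. Denote the gradient by $f = \nabla F$.
Now the Bregman divergence $d_F: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is
$$d_F(u, w) = F(u) - F(w) - f(w) \cdot (u - w)$$
$d_F(u, w)$ is the error of the first-order Taylor approximation of $F(u)$ around $w$.
[Figure: graph of $F$ with the gap $d_F(u, w)$ marked between $F(u)$ and the tangent at $w$]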

20 Basic properties of Bregman divergences
- $d_F(u, w) \ge 0$, and $d_F(u, w) = 0$ iff $u = w$
- not symmetrical (in general)
- does not satisfy the triangle inequality
- $d_F(u, w)$ is convex in $u$, not necessarily in $w$
Connection to exponential families (roughly):
- $F$ is the cumulant function, $f$ is the link function
- $w$ is the expectation parameter, $f(w)$ the canonical parameter
- $d_F(u, w)$ is the KL divergence between distributions parameterized by $u$ and $w$

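21 Example: p-norm divergence [GLS01]
$$F(w) = \frac{1}{2} \|w\|_q^2$$
Then the gradient $f = \nabla F$ satisfies
$$f_i(w) = \frac{\mathrm{sign}(w_i) |w_i|^{q-1}}{\|w\|_q^{q-2}} \quad \text{and} \quad f_i^{-1}(\theta) = \frac{\mathrm{sign}(\theta_i) |\theta_i|^{p-1}}{\|\theta\|_p^{p-2}}$$
The divergence is
$$d_F(u, w) = \frac{1}{2} \|u\|_q^2 - \frac{1}{2} \|w\|_q^2 - f(w) \cdot (u - w)$$
The special case $p = q = 2$ gives $d_F(u, w) = \frac{1}{2} \|u - w\|_2^2$.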
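The p-norm divergence can be evaluated directly from its definition. The sketch below uses function names introduced only for this illustration (`F`, `f`, `bregman`). It checks the $q = 2$ special case on one pair of vectors and shows that the divergence is not symmetric for $q = 1.5$.

```python
import math


def p_norm(v, p):
    """(sum_i |v_i|^p)^(1/p)."""
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)


def F(w, q):
    """Potential F(w) = (1/2) ||w||_q^2."""
    return 0.5 * p_norm(w, q) ** 2


def f(w, q):
    """Gradient of F: sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2), with f(0) = 0."""
    norm = p_norm(w, q)
    if norm == 0.0:
        return [0.0] * len(w)
    return [math.copysign(abs(wi) ** (q - 1), wi) / norm ** (q - 2) for wi in w]


def bregman(u, w, q):
    """d_F(u, w) = F(u) - F(w) - f(w).(u - w)."""
    inner = sum(fi * (ui - wi) for fi, ui, wi in zip(f(w, q), u, w))
    return F(u, q) - F(w, q) - inner


u, w = [1.0, 0.0], [0.5, 0.5]
d_euclid = bregman(u, w, 2.0)      # should match (1/2) ||u - w||_2^2
d_uw = bregman(u, w, 1.5)
d_wu = bregman(w, u, 1.5)          # differs from d_uw: no symmetry
print(d_euclid, d_uw, d_wu)
```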

22 Deriving the updates
Define a regularized instantaneous loss
$$C_t(w) = d_F(w, w_{t-1}) + \frac{\eta}{2} (y_t - w \cdot x_t)^2$$
Basic aim is to have $w_t = \arg\min_w C_t(w)$.
Minimize by setting $\nabla C_t(w_t) = 0$, obtaining the implicit update
$$f(w_t) = f(w_{t-1}) - \eta (w_t \cdot x_t - y_t) x_t$$
Approximate $w_t \cdot x_t \approx w_{t-1} \cdot x_t$ to obtain the update
$$f(w_t) = f(w_{t-1}) - \eta (w_{t-1} \cdot x_t - y_t) x_t$$
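For $p = q = 2$, where $f$ is the identity and $d_F(w, w_{t-1}) = \frac{1}{2} \|w - w_{t-1}\|_2^2$, the implicit update can be solved in closed form: taking the inner product of both sides with $x_t$ gives $w_t \cdot x_t - y_t = (w_{t-1} \cdot x_t - y_t) / (1 + \eta \|x_t\|_2^2)$. The sketch below, with function names and example values assumed for illustration, compares the implicit and the approximate (explicit) steps and evaluates $\nabla C_t$ at both.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def explicit_step(w, x, y, eta):
    """Approximate update: w - eta (w.x - y) x, residual taken at the old w."""
    y_hat = dot(w, x)
    return [wi - eta * (y_hat - y) * xi for wi, xi in zip(w, x)]


def implicit_step(w, x, y, eta):
    """Exact minimizer of C_t for p = q = 2, via the closed form above."""
    y_hat = dot(w, x)
    scale = eta / (1.0 + eta * dot(x, x))    # effective step size
    return [wi - scale * (y_hat - y) * xi for wi, xi in zip(w, x)]


def grad_C(w_new, w_old, x, y, eta):
    """Gradient of C_t(w) = (1/2)||w - w_old||^2 + (eta/2)(y - w.x)^2 at w_new."""
    residual = dot(w_new, x) - y
    return [(a - b) + eta * residual * xi for a, b, xi in zip(w_new, w_old, x)]


w_old, x, y, eta = [1.0, -0.5], [1.0, 1.0], 1.0, 0.5
w_imp = implicit_step(w_old, x, y, eta)
w_exp = explicit_step(w_old, x, y, eta)
print(w_imp, grad_C(w_imp, w_old, x, y, eta))   # gradient vanishes here
print(w_exp, grad_C(w_exp, w_old, x, y, eta))   # gradient is nonzero here
```

For general $p$ the implicit equation has no such simple solution, which is one reason to use the approximate update.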

23 Analyzing the update
Measure of progress:
$$d_F(u, w_{t-1}) - d_F(u, w_t) = \eta (y_t - w_{t-1} \cdot x_t)\, x_t \cdot (u - w_{t-1}) - d_F(w_{t-1}, w_t)$$
- Massage the term $(y_t - w_{t-1} \cdot x_t)\, x_t \cdot (u - w_{t-1})$ until $(u \cdot x_t - w_{t-1} \cdot x_t)^2$ and $(y_t - u \cdot x_t)^2$ appear; throw the rest away
- Estimate $d_F(w_{t-1}, w_t)$ in terms of $\|x_t\|_p$
- Sum over $t = 1, \dots, T$
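The progress identity holds for every Bregman divergence once $f(w_t) - f(w_{t-1}) = \eta (y_t - w_{t-1} \cdot x_t) x_t$ is substituted. It can be checked numerically for the p-norm divergence. In this sketch all helper names and the example values ($p = 4$, the vectors, $\eta$) are assumptions made for illustration.

```python
import math


def p_norm(v, r):
    """(sum_i |v_i|^r)^(1/r)."""
    return sum(abs(vi) ** r for vi in v) ** (1.0 / r)


def link(v, r):
    """sign(v_i) |v_i|^(r-1) / ||v||_r^(r-2); f uses r = q, f^{-1} uses r = p."""
    norm = p_norm(v, r)
    if norm == 0.0:
        return [0.0] * len(v)
    return [math.copysign(abs(vi) ** (r - 1), vi) / norm ** (r - 2) for vi in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def bregman(u, w, q):
    """d_F(u, w) for F(w) = (1/2) ||w||_q^2."""
    diff = [ui - wi for ui, wi in zip(u, w)]
    return 0.5 * p_norm(u, q) ** 2 - 0.5 * p_norm(w, q) ** 2 - dot(link(w, q), diff)


p = 4.0
q = p / (p - 1.0)
eta = 0.1
u = [1.0, 0.0, 0.0]                       # comparator
w_old = [0.3, -0.2, 0.5]                  # w_{t-1}
x, y = [1.0, -1.0, 0.5], 0.7              # example (x_t, y_t)

# One p-norm LMS step: w_t = f^{-1}(f(w_{t-1}) - eta (w_{t-1}.x_t - y_t) x_t)
y_hat = dot(w_old, x)
theta = [th - eta * (y_hat - y) * xi for th, xi in zip(link(w_old, q), x)]
w_new = link(theta, p)

progress = bregman(u, w_old, q) - bregman(u, w_new, q)
u_minus_w = [ui - wi for ui, wi in zip(u, w_old)]
identity_rhs = eta * (y - y_hat) * dot(x, u_minus_w) - bregman(w_old, w_new, q)
print(progress, identity_rhs)
```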

24 Conclusion
- LMS and normalized LMS can be derived from an optimization problem involving a certain Bregman divergence
- Different Bregman divergences lead to different algorithms, with loss bounds in terms of different norms
- Bounds can be generalized to time-varying targets (and to generalized linear models, not presented in the talk); the proofs are easy
- Algorithms for $p = 2$ can be kernelized; for $p > 2$ probably not
- Bottom line: machinery from on-line machine learning carries over to $H^\infty$-optimal filtering

25 Where are we headed?
- Develop a p-norm Kalman filter
- Prove relative loss bounds
