Adaptive Classification by Hybrid EKF with Truncated Filtering: Brain Computer Interfacing

Ji Won Yoon 1, Stephen J. Roberts 1, Matthew Dyson 2, and John Q. Gan 2

1 Pattern Analysis and Machine Learning Group, Department of Engineering Science, University of Oxford, UK
{jwyoon,sjrob}@robots.ox.ac.uk
http://www.robots.ox.ac.uk/~parg
2 Department of Computing and Electronic Systems, University of Essex, UK
{mdyson,jqgan}@essex.ac.uk
http://dces.essex.ac.uk/staff/jqgan/

Abstract. This paper proposes a robust algorithm for adaptive modelling of EEG signal classification using a modified Extended Kalman Filter (EKF). The modified EKF combines Radial Basis Functions (RBF) and autoregressive (AR) modelling, and obtains better classification performance by truncating the filtering distribution when new observations are very informative.

Keywords: Extended Kalman Filter, Informative Observation, Logistic Classification, Truncated Filtering.

1 Introduction

The goal of Brain Computer Interfacing (BCI) is to enable people with severe neurological disabilities to operate computers by manipulating the brain's electrical activity rather than by physical means. The generation and control of electrical brain activity (the electroencephalogram, or EEG) for a BCI system often requires extensive subject training before a reliable communication channel can be formed between a human subject and a computer interface [1,2]. In order to reduce the overheads of training and, importantly, to cope with new subjects, adaptive approaches to the core data modelling have been developed for BCI systems. Such adaptive approaches differ from the typical methodology, in which an algorithm is trained off-line on retrospective data, in that learning takes place continuously rather than being confined to a section of training data. In the signal processing and machine learning communities this is referred to as sequential classification.
Previous research in this area applied to BCI data has used state space modelling of the time series [3,4,5]. In particular, Extended Kalman Filter (EKF) approaches have been effective [3,5]. In these previous models, based upon the EKF, the variances of the observation noise and hidden state noise are re-estimated using

C. Fyfe et al. (Eds.): IDEAL 2008, LNCS 5326, pp. 370–377, 2008. © Springer-Verlag Berlin Heidelberg 2008
a maximum-likelihood style framework, ultimately due to the work of Jazwinski [7]. In previous work [6], we developed a simple, computationally practical modification of the conventional EKF approach in which we marginalise (integrate out) the unknown variance parameters by applying Bayesian conjugate priors [8]. This enables us to avoid spurious parameter estimation steps, which decrease the robustness of the algorithm in highly non-stationary and noisy regions of data. The EKF offers one framework for approximating the non-linear linkage between observations and decisions (via the logistic function). In all this prior work, however, the output function assumed the target labels (the decision classes) to be time independent, i.e. there is no explicit Markov dependency from one decision to the next. This is a poor model for the characteristics of the BCI interface at the sample rate we operate at (over 0.2 s intervals), in which sequences of labels of the same class persist for several seconds before making a transition. The algorithm presented here employs both a set of non-linear basis functions (thus making it a dynamic Radial Basis Function (RBF) classifier) and an autoregressive (AR) model to tackle this Markov dependency of the labelling. In addition, our approach modifies the filtering step by truncating the filtering distribution in the light of new observations. This modified filtering step provides a robust algorithm when given labels (i.e. as training data) are very informative. Moreover, the algorithm does not have to approximate the convolution of a logistic function and a normal distribution using a probit function, an approximation which can lead to poor performance for the EKF [5,6].

2 Mathematical Model

We consider an observed input stream (i.e. a time series) $x_t$ at time $t = 1, 2, \ldots, T$, which we project into a non-linear basis space, $x_t \rightarrow \varphi(x_t)$.
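As an illustrative sketch of this basis-space projection (not the authors' code: the Gaussian centres `C`, the width `s`, and the number of centres are our own hypothetical choices; the paper selects its basis functions at random):

```python
import numpy as np

# Illustrative sketch of projecting an input x_t into a non-linear basis
# space phi(x_t) using Gaussian radial basis functions. The centres C and
# width s are hypothetical choices, not values from the paper.

def gaussian_rbf_features(x, C, s):
    """Return [x, exp(-||x - c_i||^2 / (2 s^2)) for each centre c_i]."""
    d2 = ((C - x) ** 2).sum(axis=1)          # squared distance to each centre
    return np.concatenate([x, np.exp(-d2 / (2.0 * s ** 2))])

rng = np.random.default_rng(0)
C = rng.normal(size=(10, 2))                 # 10 random centres (toy choice)
x_t = np.array([0.3, -0.1])
phi_t = gaussian_rbf_features(x_t, C, s=1.0)
print(phi_t.shape)                           # -> (12,): 2 raw + 10 responses
```

Keeping the raw features alongside the basis responses, as here, mirrors the feature vector defined in Eq. (2) below.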
We also (partially) observe $z_t \in \{0, 1\}$ such that $\Pr(z_t = 1 \mid x_t, w_t) = g(f(x_t, w_t))$, where $g(\cdot)$ can be a logistic or probit link function which takes the latent variables to the outcome decision variable. In this paper we use the logistic function, i.e. $g(s) = 1/(1 + e^{-s})$. The state space model for the BCI system can hence be regarded as a hierarchical model, in that the noise of the observations influences the model indirectly through the logistic regression model $g(\cdot)$. Such indirect influence makes for a more complicated model, but we can circumvent much of this complexity by forming a three stage state space model and by introducing an auxiliary variable $y_t$. The latter variable acts so as to link the indirect relationships between the observations and the logistic regression model, given by

$$
\begin{aligned}
p(z_t \mid y_t) &= g(y_t)^{z_t}\,(1 - g(y_t))^{1 - z_t} \\
y_t &= a\,\varphi_t(x_t)^T B\, w_t + (1 - a)\,y_{t-1} + v_t \\
w_t &= A\, w_{t-1} + L\, h_t
\end{aligned}
\qquad (1)
$$

where $v_t \sim N(0, \kappa)$ and $h_t \sim N(0, \tau I)$. The variable $a$ denotes a convex mixing ratio between the Radial Basis Function (RBF) and autoregressive (AR) models. To simplify notation, we write $\varphi_t$ for $\varphi_t(x_t)$, i.e.
$$
\varphi_t = \varphi_t(x_t) = \left[\, x_t^T,\ \{\phi^{(1)}_t(x_t)\}^T,\ \ldots,\ \{\phi^{(N_b)}_t(x_t)\}^T \,\right]^T
\qquad (2)
$$

where $\phi^{(i)}_t(x_t)$ is the response of the $i$th Gaussian basis function, $i \in \{1, \ldots, N_b\}$, at time $t$. Here $N_b$ is the number of Gaussian basis functions, and the basis functions are chosen at random.

In adaptive classification, there are two steps in our state space model which perform model inference:

Prediction: $p(w_t, y_t \mid z_{1:t-1}, x_{1:t})$
Filtering: $p(w_t, y_t \mid z_{1:t}, x_{1:t})$

after which we obtain

$$
\bar{y}_{t|t} = \int_{y_t} \int_{w_t} y_t \, p(w_t, y_t \mid z_{1:t}, x_{1:t}) \, dw_t \, dy_t .
$$

2.1 Prediction

$$
\begin{aligned}
p(w_t, y_t \mid z_{1:t-1}, x_{1:t})
&= \int\!\!\int p(w_t, y_t, w_{t-1}, y_{t-1} \mid z_{1:t-1}, x_{1:t}) \, dw_{t-1}\, dy_{t-1} \\
&= \int\!\!\int p(y_t \mid w_t, y_{t-1})\, p(w_t \mid w_{t-1})\, p(w_{t-1}, y_{t-1} \mid z_{1:t-1}, x_{1:t-1}) \, dw_{t-1}\, dy_{t-1} \\
&= \int\!\!\int \left[ \int_\kappa p(y_t \mid w_t, y_{t-1}, \kappa)\, p(\kappa)\, d\kappa \right]
   \left[ \int_\tau p(w_t \mid w_{t-1}, \tau)\, p(\tau)\, d\tau \right] \\
&\qquad\quad \times\; p(w_{t-1}, y_{t-1} \mid z_{1:t-1}, x_{1:t-1}) \, dw_{t-1}\, dy_{t-1} \\
&\approx \int\!\!\int N\!\left( y_t;\ a\varphi_t^T B w_t + (1-a) y_{t-1},\ \left(\tfrac{\alpha_\kappa}{\beta_\kappa}\right)^{-1} \right)
   N\!\left( w_t;\ A w_{t-1},\ \left(\tfrac{\alpha_\tau}{\beta_\tau}\right)^{-1} L L^T \right) \\
&\qquad\quad \times\; N\!\left( \begin{bmatrix} y_{t-1} \\ w_{t-1} \end{bmatrix};\ \mu_{t-1|t-1},\ \Sigma_{t-1|t-1} \right) dw_{t-1}\, dy_{t-1} \\
&= N\!\left( \begin{bmatrix} y_t \\ w_t \end{bmatrix};\ \mu_{t|t-1},\ \Sigma_{t|t-1} \right),
\qquad \mu_{t|t-1} = S \mu_{t-1|t-1}, \quad \Sigma_{t|t-1} = \Sigma + S \Sigma_{t-1|t-1} S^T
\end{aligned}
\qquad (3)
$$

where the Student-t densities arising from the marginalisations over $\kappa$ and $\tau$ are approximated by the Gaussians above via the Laplace approximation (see appendix), and

$$
S = \begin{bmatrix} 1-a & a\varphi_t^T B A \\ 0 & A \end{bmatrix},
\qquad
\Sigma = \begin{bmatrix} \Sigma_y & \Sigma_c \\ \Sigma_c^T & \Sigma_w \end{bmatrix},
$$

$$
\begin{aligned}
\Sigma_y &= E[(y_t - E(y_t))(y_t - E(y_t))^T] = a^2 \varphi_t^T B L \left(\tfrac{\alpha_\tau}{\beta_\tau}\right)^{-1} L^T B^T \varphi_t + \left(\tfrac{\alpha_\kappa}{\beta_\kappa}\right)^{-1}, \\
\Sigma_w &= E[(w_t - E(w_t))(w_t - E(w_t))^T] = L \left(\tfrac{\alpha_\tau}{\beta_\tau}\right)^{-1} L^T, \\
\Sigma_c &= E[(y_t - E(y_t))(w_t - E(w_t))^T] = a \varphi_t^T B L \left(\tfrac{\alpha_\tau}{\beta_\tau}\right)^{-1} L^T .
\end{aligned}
$$

2.2 Filtering

In the filtering step, our EKF uses the new observation $z_t$. In general, $p(w_t, y_t \mid z_{1:t}, x_{1:t})$ takes the following form:
$$
p(w_t, y_t \mid z_{1:t}, x_{1:t}) \;\propto\; p(z_t, w_t, y_t \mid z_{1:t-1}, x_{1:t}) \;=\; p(z_t \mid y_t)\, p(w_t, y_t \mid z_{1:t-1}, x_{1:t}).
\qquad (4)
$$

It is difficult to obtain the full distribution in the EKF filtering step, so previous approaches approximated the convolution between the normal distribution and the logistic function using the probit function [5,6]. However, if the new observation is very informative, filtering can be improved without this approximation, which is one of the major sources of non-linearity in logistic EKF filtering. We can obtain better filtering distributions by truncation, since the convolution approximation is then no longer needed. Hence we obtain

$$
p(w_t, y_t \mid z_{1:t}, x_{1:t}) = p(w_t, y_t \mid z_{1:t-1}, x_{1:t}, z_t)
\;\propto\;
\begin{cases}
p(w_t, y_t \mid z_{1:t-1}, x_{1:t})\, I_{y_t \ge 0}(y_t) & \text{if } z_t = 1 \\
p(w_t, y_t \mid z_{1:t-1}, x_{1:t})\, I_{y_t < 0}(y_t) & \text{if } z_t = 0
\end{cases}
$$

where $I_C(y_t) = 1$ if $y_t$ satisfies $C$, and $0$ otherwise. Factorising,

$$
=
\begin{cases}
p(w_t \mid y_t, z_{1:t-1}, x_{1:t}) \left[ p(y_t \mid z_{1:t-1}, x_{1:t})\, I_{y_t \ge 0}(y_t) \right] & \text{if } z_t = 1 \\
p(w_t \mid y_t, z_{1:t-1}, x_{1:t}) \left[ p(y_t \mid z_{1:t-1}, x_{1:t})\, I_{y_t < 0}(y_t) \right] & \text{if } z_t = 0
\end{cases}
$$

$$
=
\begin{cases}
N(w_t; \hat{\mu}_{w,t|t-1}, \hat{\Sigma}_{w,t|t-1}) \left[ N(y_t; \mu_{y,t|t-1}, \Sigma_{y,t|t-1})\, I_{y_t \ge 0}(y_t) \right] & \text{if } z_t = 1 \\
N(w_t; \hat{\mu}_{w,t|t-1}, \hat{\Sigma}_{w,t|t-1}) \left[ N(y_t; \mu_{y,t|t-1}, \Sigma_{y,t|t-1})\, I_{y_t < 0}(y_t) \right] & \text{if } z_t = 0
\end{cases}
$$

where

$$
\hat{\mu}_{w,t|t-1} = \mu_{w,t|t-1} + \Sigma_{c,t|t-1}^T \Sigma_{y,t|t-1}^{-1} (y_t - \mu_{y,t|t-1}),
\qquad
\hat{\Sigma}_{w,t|t-1} = \Sigma_{w,t|t-1} - \Sigma_{c,t|t-1}^T \Sigma_{y,t|t-1}^{-1} \Sigma_{c,t|t-1}.
$$

Approximating the truncated normal over $y_t$ by a Gaussian with matched moments,

$$
\approx\; N(w_t; \hat{\mu}_{w,t|t-1}, \hat{\Sigma}_{w,t|t-1})\; N(y_t; \mu_{y,z_t}, \Sigma_{y,z_t}), \quad z_t \in \{0, 1\},
\qquad (5)
$$

where

$$
\begin{aligned}
\mu_{y,z_t=1} &= E_{N(y_t; \mu_{y,t|t-1}, \Sigma_{y,t|t-1}) I_{y_t \ge 0}(y_t)}(y_t), &
\mu_{y,z_t=0} &= E_{N(y_t; \mu_{y,t|t-1}, \Sigma_{y,t|t-1}) I_{y_t < 0}(y_t)}(y_t), \\
\Sigma_{y,z_t=1} &= V_{N(y_t; \mu_{y,t|t-1}, \Sigma_{y,t|t-1}) I_{y_t \ge 0}(y_t)}(y_t), &
\Sigma_{y,z_t=0} &= V_{N(y_t; \mu_{y,t|t-1}, \Sigma_{y,t|t-1}) I_{y_t < 0}(y_t)}(y_t),
\end{aligned}
$$

and $E(\cdot)$ and $V(\cdot)$ denote the mean and covariance of the truncated normal distribution respectively. The calculation of the mean and covariance of the truncated normal distribution is described in the appendix.
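One prediction plus truncated-filtering cycle can be sketched numerically as follows. This is a simplified illustration, not the authors' implementation: it assumes a scalar $y_t$, the random-walk choice $A = B = L = I$ used in the paper's experiments, fixed noise variances $\tau$ and $\kappa$ in place of the marginalised priors, and the zero-threshold truncated-normal moments from the appendix; all numerical values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

# Sketch of one predict + truncated-filter cycle (illustrative only).
# State: joint Gaussian over [y_t; w_t] with mean mu and covariance Sigma.
# Assumes A = B = L = I (random-walk transition) and scalar y_t.

def predict(mu, Sigma, phi, a, tau, kappa):
    """Prediction step in the spirit of Eq. (3): mu' = S mu, Sigma' = S Sigma S^T + Q."""
    n = mu.size - 1
    S = np.zeros((n + 1, n + 1))
    S[0, 0] = 1.0 - a                 # AR part of y_t
    S[0, 1:] = a * phi                # RBF part of y_t (A = B = I)
    S[1:, 1:] = np.eye(n)             # random-walk weights
    Q = np.zeros((n + 1, n + 1))
    Q[0, 0] = a**2 * (phi @ phi) * tau + kappa
    Q[0, 1:] = Q[1:, 0] = a * phi * tau
    Q[1:, 1:] = tau * np.eye(n)
    return S @ mu, S @ Sigma @ S.T + Q

def truncated_filter_update(mu, Sigma, z):
    """Truncated filtering: moment-match y_t, then condition w_t on y_t."""
    mu_y, mu_w = mu[0], mu[1:]
    S_y, S_c, S_w = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]
    sd = np.sqrt(S_y)
    r = mu_y / sd
    if z == 1:                                    # keep y_t >= 0
        lam = norm.pdf(r) / norm.cdf(r)
        m_y, v_y = mu_y + sd * lam, S_y * (1.0 - lam * (lam + r))
    else:                                         # keep y_t < 0
        gam = norm.pdf(r) / norm.cdf(-r)
        m_y, v_y = mu_y - sd * gam, S_y * (1.0 - gam * (gam - r))
    gain = S_c / S_y                              # Gaussian conditioning gain
    m_w = mu_w + gain * (m_y - mu_y)
    V_w = S_w - np.outer(gain, S_c) + np.outer(gain, gain) * v_y
    return np.concatenate(([m_y], m_w)), v_y, V_w

# One cycle on toy numbers.
a, tau, kappa = 0.5, 0.1, 0.2
phi = np.array([0.8, -0.4])
mu, Sigma = np.array([0.0, 0.1, -0.2]), np.eye(3)
mu_p, Sig_p = predict(mu, Sigma, phi, a, tau, kappa)
m, v_y, V_w = truncated_filter_update(mu_p, Sig_p, z=1)
```

Observing $z_t = 1$ pushes the mean of $y_t$ upward and shrinks its variance, which is exactly the effect the truncation is designed to exploit when labels are informative.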
Therefore, we have

$$
p(w_t, y_t \mid z_{1:t}, x_{1:t})
= N\!\left( \begin{bmatrix} y_t \\ w_t \end{bmatrix};\ \mu_{t|t},\ \Sigma_{t|t} \right)
= N\!\left( \begin{bmatrix} y_t \\ w_t \end{bmatrix};\ \begin{bmatrix} \mu_{y,t|t} \\ \mu_{w,t|t} \end{bmatrix},\ \begin{bmatrix} \Sigma_{y,t|t} & \Sigma_{c,t|t} \\ \Sigma_{c,t|t}^T & \Sigma_{w,t|t} \end{bmatrix} \right)
\qquad (6)
$$
where

$$
\begin{aligned}
\mu_{y,t|t} &= \mu_{y,z_t} \\
\Sigma_{y,t|t} &= \Sigma_{y,z_t} \\
\Sigma_{c,t|t} &= \Sigma_{c,t|t-1}\, \Sigma_{y,t|t-1}^{-1}\, \Sigma_{y,z_t} \\
\mu_{w,t|t} &= \mu_{w,t|t-1} + \Sigma_{c,t|t-1}^T \Sigma_{y,t|t-1}^{-1} \left( \mu_{y,z_t} - \mu_{y,t|t-1} \right) \\
\Sigma_{w,t|t} &= \Sigma_{w,t|t-1} + \Sigma_{c,t|t-1}^T \Sigma_{y,t|t-1}^{-1} \Sigma_{y,z_t} \Sigma_{y,t|t-1}^{-1} \Sigma_{c,t|t-1} - \Sigma_{c,t|t-1}^T \Sigma_{y,t|t-1}^{-1} \Sigma_{c,t|t-1}
\end{aligned}
\qquad (7)
$$

3 Results for Experimental Data Set

3.1 Data Acquisition

Data used in this experiment consisted of two channels of EEG, recorded at 256 Hz over the central portion of the head, and one channel of muscle electrical activity (EMG), recorded at 1024 Hz over the muscles of the right forearm. The EMG was then down-sampled to 256 Hz, and muscle contraction strength for movement and non-movement detection was evaluated via a simple windowed peak and trough detection; this then formed a movement / non-movement label.

3.2 Feature Extraction and Basis Formation

The second reflection coefficient of a second-order autoregressive (AR) model [10] was calculated over each EEG signal once every 78 ms using a sliding one-second-long window, forming a set of feature vectors $x_t$. These vectors were projected into a non-linear latent space using a set of $N_b$ Gaussian basis functions, providing the stream of $\varphi_t$.

Fig. 1. Comparison of different EKF algorithms: a) trajectories with true labels (blue) and missing labels (red), b) standard EKF (fixed τ and κ), c) modified EKF (marginalising both τ and κ), d) hybrid EKF using marginalisation and truncation

3.3 Performance Comparison

In this paper, the hyper-parameters defining the prior distributions over $\tau$ and $\kappa$ are given by $\alpha_\tau = 2$, $\beta_\tau = 1$, $\alpha_\kappa = 5$, and $\beta_\kappa = 0.1$. Since there is no proper dynamical model for our system, we used a transition kernel based on random walks, so that the matrices of Eq. (1) are $A = B = L = I_{N_w}$, where $I_{N_w}$ denotes an $N_w \times N_w$ identity matrix and $N_w$ is the number of elements of $w_t$. For comparison, fixed $\tau$ and $\kappa$ are chosen by $\tau = E(\tau \mid \alpha_\tau, \beta_\tau) = \int \tau\, p(\tau \mid \alpha_\tau, \beta_\tau)\, d\tau$ and $\kappa = E(\kappa \mid \alpha_\kappa, \beta_\kappa) = \int \kappa\, p(\kappa \mid \alpha_\kappa, \beta_\kappa)\, d\kappa$. Also, $p(w_0) = N(w_0; \mu_0, \Sigma_0)$, where $\mu_0 = 0$ and $\Sigma_0 = \tau_0 I$; here $\tau_0 = 2$.

Fig. 1 shows the comparison of several methods. The first plot of the figure shows the labels: inactive ($z_t = 0$), active ($z_t = 1$) and missing labels for the movement respectively. The algorithm receives no label information in the later portion of the data (the region marked in red in Fig. 1a).

Fig. 2. Comparison of cross entropy: the method proposed in this paper outperforms the other EKF approaches (standard EKF, marginalised EKF, hybrid EKF with a = 0.5)

Fig. 2 demonstrates the comparison of the cross entropy of the several methods. The cross entropy of the $t$th sample is defined by

$$
E^{cross}_t = -\log p(\tilde{z}_t \mid \hat{z}_t) = -\left\{ \tilde{z}_t \log \hat{z}_t + (1 - \tilde{z}_t) \log (1 - \hat{z}_t) \right\}
\qquad (8)
$$

where $\tilde{z}_t$ and $\hat{z}_t$ represent the underlying label and the posterior of the estimated label respectively.

4 Conclusion and Future Work

In this paper, we propose a dynamic Bayesian model for adaptive classification in time series based on a Bayesian version of the modified Extended Kalman filter
which combines a Radial Basis Function and an autoregressive model. The AR model captures the Markov dependency of the output function and thus offers an improvement over conventional EKF classification. The proposed algorithm also employs truncation of the filtering distribution to obtain a better classifier when the observation is very informative. The critical data-specific sensitivity of our model is to the hyper-parameter $a$. In principle this can be inferred from a section of labelled data, or set via knowledge of the typical within-state times. This requires more extensive testing against a larger number of data sets than presented in this pilot study, which will be the focus of future work.

Acknowledgment

This project is supported by UK EPSRC funding, to whom we are most grateful.

References

1. Wolpaw, J.R., McFarland, D.J., Neat, G.W., Forneris, C.A.: An EEG-based brain-computer interface for cursor control. Electroencephalogr. Clin. Neurophysiol. 78, 252–259 (1991)
2. Pfurtscheller, G., Flotzinger, D., Kalcher, J.: Brain-computer interface: a new communication device for handicapped persons. Journal of Microcomputer Applications 16(3), 293–299 (1993)
3. Sykacek, P., Roberts, S.J., Stokes, M.: Adaptive BCI based on variational Bayesian Kalman filtering: an empirical evaluation. IEEE Transactions on Biomedical Engineering 51(5), 719–727 (2004)
4. Penny, W.D., Roberts, S.J., Curran, E.A., Stokes, M.: EEG-based communication: a pattern recognition approach. IEEE Transactions on Rehabilitation Engineering 8(2), 214–216 (2000)
5. Lowne, D.R., Roberts, S.J., Garnett, R.: Sequential non-stationary dynamic classification. Machine Learning (submitted, 2008)
6. Yoon, J., Roberts, S.J., Dyson, M., Gan, J.Q.: Sequential Bayesian estimation for adaptive classification. In: Multisensor Fusion and Integration for Intelligent Systems (accepted for publication, 2008)
7. Jazwinski, A.H., Bailie, A.E.: Adaptive filtering: interim report.
NASA Technical Reports (March 1967)
8. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. Wiley, Chichester (1994)
9. Bishop, C.M.: Pattern Recognition and Machine Learning, 1st edn. Springer, Heidelberg (2006)
10. Pardey, J., Roberts, S.J., Tarassenko, L.: A review of parametric modelling techniques for EEG analysis. Med. Eng. Phys. 18(1), 2–11 (1996)

Appendix

4.1 Laplace Approximation

In Eq. (3), the non-Gaussian distributions generated from the marginalisation and the logistic model are approximated by Gaussian distributions using the Laplace
approximation. Let $\pi(w_t)$ be the non-Gaussian distribution of interest. Then we have

$$
\pi(w_t) = (2\pi)^{-N_w/2}\, \frac{\beta_\tau^{\alpha_\tau}\, \Gamma(\alpha_q)}{\Gamma(\alpha_\tau)} \left[ \beta_\tau + \tfrac{1}{2} (w_t - w_{t-1})^T (w_t - w_{t-1}) \right]^{-\alpha_q}
\qquad (9)
$$

where $\alpha_q = \alpha_\tau + N_w/2$. In order to find the mode of $\pi(w_t)$, we take the log of the distribution,

$$
L_{w_t} = \log \pi(w_t) = -\alpha_q \log \left[ \beta_\tau + \tfrac{1}{2} (w_t - w_{t-1})^T (w_t - w_{t-1}) \right] + c
\qquad (10)
$$

where $c = -\tfrac{N_w}{2} \log(2\pi) + \alpha_\tau \log \beta_\tau - \log \Gamma(\alpha_\tau) + \log \Gamma(\alpha_q)$. Using the first derivative of $L_{w_t}$ with respect to $w_t$, we have

$$
\frac{dL}{dw_t} = -\frac{\alpha_q}{\beta_\tau + \tfrac{1}{2}(w_t - w_{t-1})^T (w_t - w_{t-1})}\, (w_t - w_{t-1}) = 0
\qquad (11)
$$

and the mode is easily obtained as $w_t^* = w_{t-1}$. In addition, using the second derivative of $L_{w_t}$, we can obtain the covariance of the approximating distribution:

$$
\left. \frac{d^2 L}{dw_t^2} \right|_{w_t = w_t^*} = -\frac{\alpha_q}{\beta_\tau}\, I .
\qquad (12)
$$

Now we have the approximating distribution

$$
\pi(w_t) \approx N\!\left( w_t;\ w_{t-1},\ \left(\tfrac{\alpha_q}{\beta_\tau}\right)^{-1} I \right).
\qquad (13)
$$

4.2 Truncated Normal Distribution

Suppose we condition on $Y_t \in [\theta_1, \theta_2]$, where $-\infty \le \theta_1 < \theta_2 \le \infty$. When $Y_t$ has a normal distribution with mean $\mu = \mu_{y,t|t-1}$ and standard deviation $\sigma = \Sigma_{y,t|t-1}^{1/2}$, the conditional density of $Y_t$ is

$$
p(y_t \mid \theta) = \frac{\tfrac{1}{\sigma}\, \phi\!\left( \tfrac{y_t - \mu}{\sigma} \right)}{\Phi\!\left( \tfrac{\theta_2 - \mu}{\sigma} \right) - \Phi\!\left( \tfrac{\theta_1 - \mu}{\sigma} \right)}, \qquad \theta_1 \le y_t \le \theta_2,
\qquad (14)
$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal density and cumulative distribution functions respectively. With the truncation point at zero, and writing $\lambda = \phi(\mu/\sigma)/\Phi(\mu/\sigma)$ and $\gamma = \phi(\mu/\sigma)/\Phi(-\mu/\sigma)$, the required moments are

$$
\begin{aligned}
\mu_{y,z_t=1} &= E[Y_t \mid Y_t \ge 0] = \mu + \sigma \lambda \\
\mu_{y,z_t=0} &= E[Y_t \mid Y_t < 0] = \mu - \sigma \gamma \\
\Sigma_{y,z_t=1} &= V[Y_t \mid Y_t \ge 0] = \sigma^2 \left[ 1 - \lambda \left( \lambda + \tfrac{\mu}{\sigma} \right) \right] \\
\Sigma_{y,z_t=0} &= V[Y_t \mid Y_t < 0] = \sigma^2 \left[ 1 - \gamma \left( \gamma - \tfrac{\mu}{\sigma} \right) \right].
\end{aligned}
\qquad (15)
$$
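As a numerical sanity check of these moment formulas (an illustration of ours, not part of the paper; `scipy.stats.truncnorm` is used only as a reference implementation), the closed forms can be compared against a generic truncated normal:

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Numerical check of the zero-threshold truncated-normal moments:
# E and V of N(mu, sigma^2) restricted to y >= 0 (z=1) or y < 0 (z=0).

def trunc_moments(mu, sigma, z):
    """Closed-form mean and variance of the zero-truncated normal."""
    r = mu / sigma
    if z == 1:                                    # truncate below at 0
        lam = norm.pdf(r) / norm.cdf(r)
        return mu + sigma * lam, sigma**2 * (1.0 - lam * (lam + r))
    gam = norm.pdf(r) / norm.cdf(-r)              # truncate above at 0
    return mu - sigma * gam, sigma**2 * (1.0 - gam * (gam - r))

mu, sigma = 0.4, 1.3
for z, (lo, hi) in [(1, (0.0, np.inf)), (0, (-np.inf, 0.0))]:
    m, v = trunc_moments(mu, sigma, z)
    ref = truncnorm((lo - mu) / sigma, (hi - mu) / sigma, loc=mu, scale=sigma)
    assert abs(m - ref.mean()) < 1e-9 and abs(v - ref.var()) < 1e-9
print("closed-form moments match the reference truncated normal")
```

At $\mu = 0$ both truncations reduce to the half-normal, recovering the familiar mean $\sigma\sqrt{2/\pi}$ and variance $\sigma^2(1 - 2/\pi)$.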