Global Optimisation of Neural Network Models via Sequential Sampling

João F. G. de Freitas (jfgf@eng.cam.ac.uk) [Corresponding author]
Mahesan Niranjan (niranjan@eng.cam.ac.uk)
Arnaud Doucet (ad2@eng.cam.ac.uk)
Andrew H. Gee (ahg@eng.cam.ac.uk)

Abstract

We propose a novel strategy for training neural networks using sequential sampling-importance resampling algorithms. This global optimisation strategy allows us to learn the probability distribution of the network weights in a sequential framework. It is well suited to applications involving on-line, nonlinear, non-Gaussian or non-stationary signal processing.

1 INTRODUCTION

This paper addresses sequential training of neural networks using powerful sampling techniques. Sequential techniques are important in many applications of neural networks involving real-time signal processing, where data arrival is inherently sequential. Furthermore, one might wish to adopt a sequential training strategy to deal with non-stationarity in signals, so that information from the recent past is given more credence than information from the distant past.

One way to estimate neural network models sequentially is to use a state space formulation and the extended Kalman filter (Singhal and Wu 1988, de Freitas, Niranjan and Gee 1998). This involves local linearisation of the output equation, which can be performed easily, since we only need the derivatives of the output with respect to the unknown parameters. This approach has been employed by several authors, including ourselves.

However, local linearisation leading to the EKF algorithm is a gross simplification of the probability densities involved. Nonlinearity of the output model induces multimodality of the resulting distributions. Gaussian approximation of these densities will lose important details. The approach we adopt in this paper is one of sampling. In particular, we discuss the use of 'sampling-importance resampling' and 'sequential importance sampling' algorithms, also known as particle filters (Gordon, Salmond and Smith 1993, Pitt and Shephard 1997), to train multi-layer neural networks.

2 STATE SPACE NEURAL NETWORK MODELLING

We start from a state space representation to model the neural network's evolution in time. A transition equation describes the evolution of the network weights, while a measurement equation describes the nonlinear relation between the inputs and outputs of a particular physical process, as follows:

$$w_{k+1} = w_k + d_k \quad (1)$$
$$y_k = g(w_k, x_k) + v_k \quad (2)$$

where $y_k \in \mathbb{R}^o$ denotes the output measurements, $x_k \in \mathbb{R}^d$ the input measurements and $w_k \in \mathbb{R}^m$ the neural network weights. The nonlinear measurement mapping $g(\cdot)$ is approximated by a multi-layer perceptron (MLP). The measurements are assumed to be corrupted by noise $v_k$. In the sequential Monte Carlo framework, the probability distribution of the noise is specified by the user. In our examples we shall choose a zero mean Gaussian distribution with covariance $R$. The measurement noise is assumed to be uncorrelated with the network weights and initial conditions.

We model the evolution of the network weights by assuming that they depend on the previous value $w_k$ and a stochastic component $d_k$. The process noise $d_k$ may represent our uncertainty in how the parameters evolve, modelling errors or unknown inputs. We assume the process noise to be a zero mean Gaussian process with covariance $Q$; however, other distributions can also be adopted. This choice of distributions for the network weights requires further research. The process noise is also assumed to be uncorrelated with the network weights.

The posterior density $p(W_k | Y_k)$, where $Y_k = \{y_1, y_2, \ldots, y_k\}$ and $W_k = \{w_1, w_2, \ldots, w_k\}$, constitutes the complete solution to the sequential estimation problem. In many applications, such as tracking, it is of interest to estimate one of its marginals, namely the filtering density $p(w_k | Y_k)$. By computing the filtering density recursively, we do not need to keep track of the complete history of the weights. Thus, from a storage point of view, the filtering density turns out to be more parsimonious than the full posterior density function. If we know the filtering density of the network weights, we can easily derive various estimates of the network weights, including centroids, modes, medians and confidence intervals.
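To make the notation concrete, the sketch below shows one way (a minimal Python illustration of our own; the weight-packing scheme, network size and noise variances are assumptions, not the paper's code) to implement the mapping $g(w_k, x_k)$ as an MLP whose weights are stacked into a single state vector, together with one step of equations (1) and (2):

import numpy as np

def mlp_output(w, x, n_in, n_hidden, n_out):
    """g(w, x): a one-hidden-layer MLP whose weights are packed into the single vector w."""
    i = 0
    W1 = w[i:i + n_hidden * n_in].reshape(n_hidden, n_in);   i += n_hidden * n_in
    b1 = w[i:i + n_hidden];                                   i += n_hidden
    W2 = w[i:i + n_out * n_hidden].reshape(n_out, n_hidden);  i += n_out * n_hidden
    b2 = w[i:i + n_out]
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))      # logistic hidden layer
    return W2 @ h + b2                            # linear output layer

# One step of the state space model (1)-(2): random-walk weights, noisy MLP observation.
rng = np.random.default_rng(0)
m = 1 * 2 + 1 + 1 * 1 + 1                         # weight count for 2 inputs, 1 hidden, 1 output
w_k = rng.normal(size=m)
x_k = rng.normal(size=2)
y_k = mlp_output(w_k, x_k, 2, 1, 1) + rng.normal(0.0, np.sqrt(1e-4))   # y_k = g(w_k, x_k) + v_k
w_next = w_k + rng.normal(0.0, np.sqrt(1e-3), size=m)                  # w_{k+1} = w_k + d_k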

3 SEQUENTIAL IMPORTANCE SAMPLING

In the sequential importance sampling optimisation framework, a set of representative samples is used to describe the posterior density function of the network parameters. Each sample consists of a complete set of network parameters. More specifically, we make use of the following Monte Carlo approximation:

$$p(W_k | Y_k) \approx \frac{1}{N} \sum_{i=1}^{N} \delta\left(W_k - W_k^{(i)}\right)$$

where $W_k^{(i)}$ represents the samples used to describe the posterior density and $\delta(\cdot)$ denotes the Dirac delta function. Consequently, any expectations of the form:

$$E[f_k(W_k)] = \int f_k(W_k)\, p(W_k | Y_k)\, dW_k$$

may be approximated by the following estimate:

$$E[f_k(W_k)] \approx \frac{1}{N} \sum_{i=1}^{N} f_k\left(W_k^{(i)}\right)$$

where the samples $W_k^{(i)}$ are drawn from the posterior density function. Typically, one cannot draw samples directly from the posterior density. Yet, if we can draw samples from a proposal density function $\pi(W_k | Y_k)$, we can transform the expectation under $p(W_k | Y_k)$ into an expectation under $\pi(W_k | Y_k)$ as follows:

$$E[f_k(W_k)] = \int f_k(W_k)\, \frac{p(W_k | Y_k)}{\pi(W_k | Y_k)}\, \pi(W_k | Y_k)\, dW_k
= \frac{\int f_k(W_k)\, q_k(W_k)\, \pi(W_k | Y_k)\, dW_k}{\int q_k(W_k)\, \pi(W_k | Y_k)\, dW_k}
= \frac{E_\pi[q_k(W_k)\, f_k(W_k)]}{E_\pi[q_k(W_k)]}$$

where the variables $q_k(W_k)$ are known as the unnormalised importance ratios:

$$q_k = \frac{p(Y_k | W_k)\, p(W_k)}{\pi(W_k | Y_k)}$$

Hence, by drawing samples from the proposal function $\pi(\cdot)$, we can approximate the expectations of interest by the following estimate:

$$E[f_k(W_k)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} f_k(W_k^{(i)})\, q_k(W_k^{(i)})}{\frac{1}{N} \sum_{i=1}^{N} q_k(W_k^{(i)})} = \sum_{i=1}^{N} f_k(W_k^{(i)})\, \tilde{q}_k^{(i)} \quad (3)$$

where the normalised importance ratios $\tilde{q}_k^{(i)}$ are given by:

$$\tilde{q}_k^{(i)} = \frac{q_k^{(i)}}{\sum_{j=1}^{N} q_k^{(j)}} \quad (4)$$

It is not difficult to show (de Freitas, Niranjan, Gee and Doucet 1998) that, if we assume $w$ to be a hidden Markov process with initial density $p(w_0)$ and transition density $p(w_k | w_{k-1})$, various recursive algorithms can be derived. One of these algorithms (HySIR), which we derive in (de Freitas, Niranjan, Gee and Doucet 1998), has been shown to perform well in neural network training. Here we extend the algorithm to deal with multiple noise levels.
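As an illustration of equations (3) and (4), the following Python fragment (a minimal sketch; the particle array, the stand-in log ratios and all names are assumptions, not the authors' code) computes normalised importance ratios and uses them to estimate a posterior expectation such as the mean of the weights:

import numpy as np

def normalised_importance_ratios(log_q):
    """Normalise unnormalised importance ratios supplied on the log scale (eq. 4)."""
    log_q = log_q - np.max(log_q)          # subtract the maximum for numerical stability
    q = np.exp(log_q)
    return q / np.sum(q)

def posterior_expectation(particles, q_tilde, f=lambda w: w):
    """Importance sampling estimate of E[f(W_k) | Y_k] (eq. 3)."""
    vals = np.array([f(w) for w in particles])   # per-sample statistics f(W_k^(i))
    return np.tensordot(q_tilde, vals, axes=1)   # sum_i q_tilde_i * f(W_k^(i))

# Example: N particles, each a complete set of m network weights.
rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 5))                 # hypothetical N = 100, m = 5
log_q = -0.5 * np.sum(particles**2, axis=1)           # stand-in log importance ratios
q_tilde = normalised_importance_ratios(log_q)
w_mean = posterior_expectation(particles, q_tilde)    # posterior mean of the weights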

The pseudo-code for the HySIR algorithm with EKF updating is as follows (software implementing HySIR is available at http://svr-www.eng.cam.ac.uk/~jfgf/software.html):

1. INITIALISE NETWORK WEIGHTS (k = 0).

2. For k = 1, ..., L:

   (a) SAMPLING STAGE: For i = 1, ..., N:
       - Predict via the dynamics equation:
         $$\widehat{w}_{k+1}^{(i)} = w_k^{(i)} + d_k^{(i)}$$
         where $d_k^{(i)}$ is a sample from $p(d_k)$ ($\mathcal{N}(0, Q_k)$ in our case).
       - Update the samples with the EKF equations.
       - Evaluate the importance ratios:
         $$q_{k+1}^{(i)} = q_k^{(i)}\, p(y_{k+1} | \widehat{w}_{k+1}^{(i)}) = q_k^{(i)}\, \mathcal{N}\!\left(g(x_{k+1}, \widehat{w}_{k+1}^{(i)}),\, R_k\right)$$
       - Normalise the importance ratios.

   (b) RESAMPLING STAGE: For i = 1, ..., N:
       - If $N_{\mathrm{eff}} \geq$ Threshold:
         $$w_{k+1}^{(i)} = \widehat{w}_{k+1}^{(i)}, \qquad P_{k+1}^{(i)} = \widehat{P}_{k+1}^{(i)}, \qquad Q_{k+1}^{*(i)} = \widehat{Q}_{k+1}^{*(i)}$$
       - Else resample an index $j$ with probability proportional to the importance ratios and set:
         $$w_{k+1}^{(i)} = \widehat{w}_{k+1}^{(j)}, \qquad q_{k+1}^{(i)} = \frac{1}{N}$$

Here $K_{k+1}$ is known as the Kalman gain matrix, $I_{m \times m}$ denotes the identity matrix of size $m \times m$, and $R^*$ and $Q^*$ are two tuning parameters, whose roles are explained in detail in (de Freitas, Niranjan and Gee 1997). $G$ represents the Jacobian matrix and, strictly speaking, $P_k$ is an approximation to the covariance matrix of the network weights. The resampling stage is used to eliminate samples with low probability and multiply samples with high probability. Various authors have described efficient algorithms for accomplishing this task in $O(N)$ operations (Pitt and Shephard 1997, Carpenter, Clifford and Fearnhead 1997, Doucet 1998).
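To make the structure of the loop concrete, here is a compact Python sketch of one HySIR time step under simplifying assumptions: scalar output, a generic EKF measurement update with $R^*$ used as the linearisation noise and $Q^*$ as additive covariance inflation, and multinomial resampling. The function names, finite-difference Jacobian and these particular update formulas are illustrative choices, not a transcription of the authors' implementation.

import numpy as np

def hysir_step(particles, covs, log_q, x, y, g, Q, R, R_star, Q_star, rng, threshold=0.5):
    """One HySIR time step: sample, EKF-update, weight, and (optionally) resample N particles."""
    N, m = particles.shape
    for i in range(N):
        w = particles[i] + rng.multivariate_normal(np.zeros(m), Q)        # predict via the dynamics (Q is m x m)
        P = covs[i] + Q_star * np.eye(m)                                  # assumed covariance inflation via Q*
        G = jacobian(lambda v: g(v, x), w)                                # d g / d w, shape (1, m)
        S = G @ P @ G.T + R_star                                          # innovation covariance
        K = P @ G.T / S                                                   # Kalman gain (scalar output)
        innov = float(y - g(w, x))
        particles[i] = w + (K * innov).ravel()                            # EKF measurement update
        covs[i] = P - K @ G @ P
        log_q[i] += -0.5 * innov**2 / R - 0.5 * np.log(2 * np.pi * R)     # Gaussian likelihood weight
    q = np.exp(log_q - log_q.max()); q /= q.sum()
    if 1.0 / np.sum(q**2) < threshold * N:                                # effective sample size test
        idx = rng.choice(N, size=N, p=q)                                  # multinomial resampling
        particles, covs = particles[idx], covs[idx]
        log_q = np.full(N, -np.log(N))                                    # reset to uniform weights 1/N
    return particles, covs, log_q

def jacobian(f, w, eps=1e-6):
    """Finite-difference Jacobian of a scalar-output function f at w, returned as a (1, m) row."""
    f0 = f(w)
    return np.array([[(f(w + eps * e) - f0) / eps for e in np.eye(w.size)]])

In the experiment below, g would be the MLP mapping g(w, x) of Section 2, and Q, R, R* and Q* play the roles of the tuning parameters listed there.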

4 EXPERIMENT

To assess the ability of the hybrid algorithm to estimate time-varying hidden parameters, we generated input-output data from a logistic function followed by a linear scaling and a displacement, as shown in Figure 1. This simple model is equivalent to an MLP with one hidden neuron and a linear output neuron. We applied two Gaussian ($\mathcal{N}(0, 10)$) input sequences to the model and corrupted the weights and output values with Gaussian noise ($\mathcal{N}(0, 1 \times 10^{-3})$ and $\mathcal{N}(0, 1 \times 10^{-4})$ respectively). We then trained a second model with the same structure using the input-output data generated by the first model.

Figure 1: Logistic function with linear scaling and displacement used in the experiment. The weights were chosen as follows: $w_1(k) = 1 + k/100$, $w_2(k) = \sin(0.06k) - 2$, $w_3(k) = 0.1$, $w_4(k) = 1$, $w_5(k) = -0.5$.

In so doing, we chose 100 sampling trajectories and set $R$ to 10, $Q$ to $1 \times 10^{-3}\, I_{5 \times 5}$, the initial weights variance to 5, $P_0$ to $100\, I_{5 \times 5}$, and $R^*$ to $1 \times 10^{-5}$. The process noise parameter $Q^*$ was set to three levels: $5 \times 10^{-3}$, $1 \times 10^{-3}$ and $1 \times 10^{-10}$, as shown in the plot of Figure 2 at time zero.

Figure 2: Noise level estimation with the HySIR algorithm.

In the training phase, of 200 time steps, we allowed the model weights to vary with time. During this phase, the HySIR algorithm was used to track the input-output training data and estimate the latent model weights. In addition, we assumed three possible noise variance levels at the beginning of the training session. After the 200-th time step, we fixed the values of the weights and generated another 200 input-output test data sets from the original model. The input test data was then fed to the trained model, using the weight values estimated at the 200-th time step.
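As a rough illustration of this setup, the sketch below generates synthetic training data under our own reading of Figure 1: the model is taken to be $y = w_4\,\sigma(w_1 x_1 + w_2 x_2 + w_3) + w_5$ with a logistic $\sigma$. This particular assignment of the five weights, and all names below, are assumptions rather than the paper's code.

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate_data(T=200, seed=0):
    """Generate input-output data from the time-varying one-hidden-neuron model (assumed form)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(10.0), size=(T, 2))        # two Gaussian N(0, 10) input sequences
    ys, ws = [], []
    for k in range(T):
        w = np.array([1 + k / 100.0,                        # w1(k)
                      np.sin(0.06 * k) - 2.0,               # w2(k)
                      0.1, 1.0, -0.5])                      # w3, w4, w5
        w += rng.normal(0.0, np.sqrt(1e-3), size=5)         # weight (process) noise
        y = w[3] * logistic(w[0] * x[k, 0] + w[1] * x[k, 1] + w[2]) + w[4]
        ys.append(y + rng.normal(0.0, np.sqrt(1e-4)))       # output (measurement) noise
        ws.append(w)
    return x, np.array(ys), np.array(ws)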

Subsequently, the output prediction of the trained model was compared to the output data from the original model to assess the generalisation performance of the training process.

As shown in Figure 2, the noise level of the trajectories converged to the true value ($1 \times 10^{-3}$). In addition, it was possible to track the network weights and obtain accurate output predictions, as shown in Figures 3 and 4.

Figure 3: One step ahead predictions during the training phase (left) and stationary predictions in the test phase (right).

Figure 4: Weights tracking performance with the HySIR algorithm. As indicated by the histograms of $w_2$, the algorithm performs a global search in parameter space.

5 CONCLUSIONS

In this paper, we have presented a sequential Monte Carlo approach for training neural networks in a Bayesian setting. In particular, we proposed an algorithm (HySIR) that makes use of both gradient and sampling information. HySIR can be interpreted as a Gaussian mixture filter, in that only a few sampling trajectories need to be employed. Yet, as the number of trajectories increases, the computational requirements increase only linearly. Therefore, the method is also suitable as a sampling strategy for approximating multi-modal distributions. Further avenues of research include the design of algorithms for adapting the noise covariances R and Q, studying the effect of different noise models for the network weights and improving the computational efficiency of the algorithms.

ACKNOWLEDGEMENTS

João F. G. de Freitas is financially supported by two University of the Witwatersrand Merit Scholarships, a Foundation for Research Development Scholarship (South Africa), an ORS award and a Trinity College External Studentship (Cambridge).

References

Carpenter, J., Clifford, P. and Fearnhead, P. (1997). An improved particle filter for non-linear problems, Technical report, Department of Statistics, Oxford University, England. Available at http://www.stats.ox.ac.uk/~clifford/index.htm.

de Freitas, J. F. G., Niranjan, M. and Gee, A. H. (1997). Hierarchical Bayesian Kalman models for regularisation and ARD in sequential learning, Technical Report CUED/F-INFENG/TR 307, Cambridge University Engineering Department. Available at http://svr-www.eng.cam.ac.uk/~jfgf.

de Freitas, J. F. G., Niranjan, M. and Gee, A. H. (1998). Regularisation in sequential learning algorithms, in M. I. Jordan, M. J. Kearns and S. A. Solla (eds), Advances in Neural Information Processing Systems, Vol. 10, MIT Press.

de Freitas, J. F. G., Niranjan, M., Gee, A. H. and Doucet, A. (1998). Sequential Monte Carlo methods for optimisation of neural network models, Technical Report CUED/F-INFENG/TR 328, Cambridge University Engineering Department. Available at http://svr-www.eng.cam.ac.uk/~jfgf.

Doucet, A. (1998). On sequential simulation-based methods for Bayesian filtering, Technical Report CUED/F-INFENG/TR 310, Cambridge University Engineering Department. Available at http://www.stats.bris.ac.uk:81/mcmc/pages/list.html.

Gordon, N. J., Salmond, D. J. and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings-F 140(2): 107-113.

Pitt, M. K. and Shephard, N. (1997). Filtering via simulation: Auxiliary particle filters, Technical report, Department of Statistics, Imperial College, London, England. Available at http://www.nuff.ox.ac.uk/economics/papers.

Singhal, S. and Wu, L. (1988). Training multilayer perceptrons with the extended Kalman algorithm, in D. S. Touretzky (ed.), Advances in Neural Information Processing Systems, Vol. 1, Morgan Kaufmann, San Mateo, CA, pp. 133-140.