Latent variable and deep modeling with Gaussian processes; application to system identification. Andreas Damianou, Department of Computer Science, University of Sheffield, UK. Brown University, 17 Feb. 2016.

Outline
Part 1: Introduction. Recap from yesterday.
Part 2: Incorporating latent variables. Unsupervised GP learning; structure in the latent space (representation learning); Bayesian Automatic Relevance Determination; deep GP learning; multi-view GP learning.
Part 3: Dynamical systems. Policy learning; autoregressive dynamics; going deeper: Deep Recurrent Gaussian Process; regressive dynamics with deep GPs.

Part 1: GP take-home messages. Gaussian processes, as infinite-dimensional Gaussian distributions, can be used as priors over functions. They are non-parametric: the training data act as parameters. They provide uncertainty quantification and can learn from scarce data.
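As a concrete illustration of these take-home messages, here is a minimal numpy sketch (not from the talk; the RBF kernel settings and toy data are arbitrary illustrations) that draws functions from a GP prior and conditions on a few noisy observations:

```python
import numpy as np

def rbf(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Test inputs where we evaluate the (prior and posterior) process
X_star = np.linspace(-3, 3, 100)

# A GP prior over functions: any finite set of evaluations is jointly Gaussian
K_ss = rbf(X_star, X_star)
prior_samples = np.random.multivariate_normal(
    np.zeros(len(X_star)), K_ss + 1e-8 * np.eye(len(X_star)), size=3)

# Condition on a handful of noisy observations (the "training data act as parameters" view)
X = np.array([-2.0, -0.5, 1.0, 2.5])
y = np.sin(X) + 0.1 * np.random.randn(len(X))
sigma2 = 0.01                                   # observation noise variance
K = rbf(X, X) + sigma2 * np.eye(len(X))
K_s = rbf(X, X_star)

# Standard GP posterior mean and covariance: uncertainty comes for free
mean = K_s.T @ np.linalg.solve(K, y)
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
i0 = np.argmin(np.abs(X_star))
print("posterior mean at x=0:", mean[i0])
print("posterior std  at x=0:", np.sqrt(np.diag(cov))[i0])
```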

Notation and graphical models. [Graphical models: x → y, and x → f → y.] White nodes: observed variables. Shaded nodes: unobserved, or latent, variables. Convention: sometimes the latent function will be placed next to the arrow; sometimes I'll explicitly include the collection of instantiations f.

Part 2: Incorporating latent variables

Unsupervised learning: GP-LVM. If X is unobserved, treat it as a parameter and optimize over it. The GP-LVM can be interpreted as non-linear, non-parametric probabilistic PCA (Lawrence, JMLR 2005). The objective (likelihood function) is similar to that of GPs,
$$p(y \mid x) = \int p(y \mid f)\, p(f \mid x)\, df = \mathcal{N}(y \mid 0, K_{ff} + \sigma^2 I),$$
but now the x's are optimized too. Neil Lawrence, JMLR, 2005.
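A minimal sketch of the GP-LVM idea (my own illustration, not the talk's code; toy data, fixed noise variance, and a simple PCA initialisation are assumptions): treat the latent inputs X as parameters and optimize the same GP marginal likelihood with respect to them.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, variance=1.0, lengthscale=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

Y = np.random.randn(30, 5)          # toy observed data: N=30 points, D=5 dimensions
N, D = Y.shape
Q = 2                               # chosen latent dimensionality
sigma2 = 0.1                        # fixed noise variance (would normally be learned too)

def neg_log_marglik(x_flat):
    """Negative GP-LVM objective: -sum_d log N(y_d | 0, K + sigma2*I) over output dims."""
    X = x_flat.reshape(N, Q)
    K = rbf(X, X) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # K^{-1} Y
    logdet = 2 * np.log(np.diag(L)).sum()
    return 0.5 * (np.sum(Y * alpha) + D * logdet + N * D * np.log(2 * np.pi))

# Initialise the latent points with PCA, then optimize them directly
X0 = np.linalg.svd(Y - Y.mean(0), full_matrices=False)[0][:, :Q]
res = minimize(neg_log_marglik, X0.ravel(), method="L-BFGS-B")
X_learned = res.x.reshape(N, Q)
print("final objective:", res.fun)
```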

Fitting the GP-LVM. [Figure: GP-LVM fit; figure credits: C. H. Ek.] Additional difficulty: the x's are also missing! Improvement: invoke the Bayesian methodology to find the x's too.

Bayesian GP-LVM. Additionally integrate out the latent space:
$$p(y) = \int p(y \mid f)\, p(f \mid x)\, p(x)\, df\, dx, \qquad \text{where } p(x) = \mathcal{N}(x \mid 0, I).$$
Titsias and Lawrence 2010, AISTATS; Damianou et al. 2015, JMLR

Automatic dimensionality detection. In general, X = [x^{(1)}, ..., x^{(N)}] with x^{(n)} ∈ R^Q. Automatic dimensionality detection is achieved by employing automatic relevance determination (ARD) priors for the mapping f: f ~ GP(0, k_f) with
$$k_f\big(x^{(i)}, x^{(j)}\big) = \sigma^2 \exp\Big(-\tfrac{1}{2} \sum_{q=1}^{Q} w_q \big(x^{(i,q)} - x^{(j,q)}\big)^2\Big).$$
Example: [figure]
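A small sketch of the ARD kernel above (illustrative, not the slide's code): one weight w_q per latent dimension, so dimensions whose weight is driven towards zero are effectively switched off.

```python
import numpy as np

def ard_rbf(X1, X2, variance=1.0, weights=None):
    """ARD squared-exponential: k(x, x') = var * exp(-0.5 * sum_q w_q (x_q - x'_q)^2)."""
    Q = X1.shape[1]
    w = np.ones(Q) if weights is None else np.asarray(weights)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2 * w).sum(-1)
    return variance * np.exp(-0.5 * sq)

X = np.random.randn(5, 3)                          # 3 latent dimensions
K_all = ard_rbf(X, X, weights=[1.0, 1.0, 1.0])
K_ard = ard_rbf(X, X, weights=[1.0, 1.0, 1e-6])    # third dimension effectively pruned
# With w_3 ~ 0 the kernel (and hence the model) ignores the third latent dimension
print(np.allclose(K_ard, ard_rbf(X[:, :2], X[:, :2])))   # ~True up to the tiny weight
```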

Sampling from a deep GP. [Figure: input → f → unobserved layer → f → output Y.]

Model:
$$x = \text{given}, \qquad f_h \sim \mathcal{GP}(0, k_h(x, x)), \quad h = f_h(x) + \epsilon_h, \qquad f_y \sim \mathcal{GP}(0, k_f(h, h)), \quad y = f_y(h) + \epsilon_y.$$
So: $y = f_y(f_h(x) + \epsilon_h) + \epsilon_y$.
Objective:
$$p(y \mid x) = \int p(y \mid f_y)\, p(f_y \mid h)\, p(h \mid f_h)\, p(f_h \mid x)\, df_y\, df_h\, dh$$
Damianou and Lawrence, AISTATS 2013; Damianou, PhD Thesis, 2015
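The composition y = f_y(f_h(x) + ε_h) + ε_y can be illustrated by ancestral sampling: draw a GP sample h over the inputs, then draw a second GP sample over h. A rough sketch (my own, with arbitrary kernel and noise settings):

```python
import numpy as np

def rbf(X1, X2, variance=1.0, lengthscale=1.0):
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp(inputs, lengthscale):
    """Draw one function sample evaluated at the given 1-D inputs."""
    K = rbf(inputs, inputs, lengthscale=lengthscale) + 1e-8 * np.eye(len(inputs))
    return np.random.multivariate_normal(np.zeros(len(inputs)), K)

x = np.linspace(-2, 2, 200)                                            # given inputs
h = sample_gp(x, lengthscale=1.0) + 0.01 * np.random.randn(len(x))     # h = f_h(x) + eps_h
y = sample_gp(h, lengthscale=0.3) + 0.01 * np.random.randn(len(x))     # y = f_y(h) + eps_y
# y is a draw from a 1-hidden-layer deep GP prior: typically far less smooth and
# less stationary than a draw from a single GP over x.
```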

Deep GP: step function (credit for the idea: J. Hensman, R. Calandra, M. Deisenroth). [Figures: standard GP; deep GP, 1 hidden layer; deep GP, 3 hidden layers.]

[Figure: deep GP with three hidden plus one warping layer vs. standard GP.]

Non-linear feature learning: stacked GP (nonlinear) vs. stacked PCA (linear). Successive warping creates knots which act as features. Features discovered in one layer remain in the next ones (i.e. knots are not un-tied): f_1(f_2(f_3(x))). With linear warpings (stacked PCA) we can't achieve this effect: W_1(W_2(W_3 x)) = W x. Damianou, PhD Thesis, 2015
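The point about linear vs. non-linear warpings can be checked numerically; a quick sketch (my own illustration, using tanh as a stand-in for a GP-distributed warping):

```python
import numpy as np

np.random.seed(0)
W1, W2, W3 = (np.random.randn(4, 4) for _ in range(3))
x = np.random.randn(4, 100)

# Stacked linear maps collapse to a single linear map: no extra representational power
composed = W1 @ (W2 @ (W3 @ x))
W = W1 @ W2 @ W3
print(np.allclose(composed, W @ x))          # True

# Stacked non-linear warpings do not collapse to a single application of the same warp
f = np.tanh                                  # stand-in for a GP-distributed warping
deep = f(f(f(x)))
print(np.allclose(deep, f(x)))               # False in general
```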

Deep GP: MNIST example https://youtu.be/e8-vxt8wxbu (Live sampling video)

Multi-view: Manifold Relevance Determination (MRD). Multi-view data arise from multiple information sources. These sources naturally contain some overlapping, or shared, signal (since they describe the same phenomenon), but also have some private signal. MRD: model such data using overlapping sets of latent variables. Demos: https://youtu.be/ripx3ciohky (faces), https://youtu.be/xoxnjaavgle (mocap-silhouette). Damianou et al., ICML 2012; Damianou, PhD Thesis, 2015
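MRD's shared/private split can be read off the per-view ARD weights: a latent dimension is shared if both views assign it non-negligible weight, and private to a view if only that view does. A toy sketch of this bookkeeping (the weight values and threshold below are made up for illustration, not from the talk):

```python
import numpy as np

# ARD weights learned by each view over a common Q=5 latent space (hypothetical numbers)
w_view1 = np.array([2.1, 1.8, 0.02, 1.5, 0.01])   # e.g. image view
w_view2 = np.array([1.9, 0.03, 1.7, 1.4, 0.02])   # e.g. pose / mocap view

threshold = 0.1
active1, active2 = w_view1 > threshold, w_view2 > threshold
shared   = np.where(active1 & active2)[0]          # dimensions relevant to both views
private1 = np.where(active1 & ~active2)[0]         # relevant only to view 1
private2 = np.where(~active1 & active2)[0]         # relevant only to view 2
print("shared:", shared, "private to view 1:", private1, "private to view 2:", private2)
```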

Deep GPs: another multi-view example. [Figure: latent spaces X for the two views.]

Automatic structure discovery summary: ARD and MRD. Tools: ARD eliminates unnecessary nodes/connections; MRD gives conditional independencies; approximating the evidence addresses the number of layers (?).

More deep GP representation learning examples... https://youtu.be/s4zath1tjg8 (iCub interaction), https://youtu.be/usq0vxlcfvu (faces autoencoder)

Part 3: Dynamical systems

Dynamical systems examples. Through a model, we wish to learn to: perform free simulation by learning patterns coming from the latent generation process (a mechanistic system we do not know); perform inter-/extrapolation in time-series data which are very high-dimensional (e.g. video); detect outliers in data coming from a dynamical system; optimize policies for control based on a model of the data.

Policy learning Model-based policy learning https://www.youtube.com/watch?v=xiigtgkzfks (Cart-pole) http://mlg.eng.cam.ac.uk/?portfolio=andrew-mchutchon (Unicycle) Work by: Marc Deisenroth, Andrew McHutchon, Carl Rasmussen

NARX model. A standard NARX model considers an input vector x_i ∈ R^D comprised of the L_y past observed outputs y_i ∈ R and the L_u past exogenous inputs u_i ∈ R:
$$x_i = [y_{i-1}, \dots, y_{i-L_y}, u_{i-1}, \dots, u_{i-L_u}], \qquad y_i = f(x_i) + \epsilon^{(y)}_i, \qquad \epsilon^{(y)}_i \sim \mathcal{N}(\epsilon^{(y)}_i \mid 0, \sigma_y^2).$$
Latent auto-regressive GP model:
$$x_i = f(x_{i-1}, \dots, x_{i-L_x}, u_{i-1}, \dots, u_{i-L_u}) + \epsilon^{(x)}_i, \qquad y_i = x_i + \epsilon^{(y)}_i.$$
Contribution 1: simultaneous auto-regressive and representation learning. Contribution 2: latents avoid the feedback of possibly corrupted observations into the dynamics. Mattos, Damianou, Barreto, Lawrence, 2016
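To make the NARX construction concrete, here is a small sketch (my own, not the paper's code) that builds the lagged regressor vectors x_i and shows a free-simulation loop in which predicted outputs are fed back in place of observations; the toy system and the dummy predictor are illustrative assumptions.

```python
import numpy as np

def build_narx_inputs(y, u, Ly, Lu):
    """Stack lagged outputs and exogenous inputs into NARX regressors x_i."""
    start = max(Ly, Lu)
    X, T = [], []
    for i in range(start, len(y)):
        X.append(np.concatenate([y[i - Ly:i][::-1], u[i - Lu:i][::-1]]))
        T.append(y[i])
    return np.array(X), np.array(T)

def free_simulation(predict, y_init, u, Ly, Lu):
    """Roll the model forward, feeding its own predictions back (no observed y used)."""
    y_sim = list(y_init)                            # first few values to get started
    for i in range(len(y_init), len(u)):
        x_i = np.concatenate([np.array(y_sim[i - Ly:i])[::-1], u[i - Lu:i][::-1]])
        y_sim.append(predict(x_i))                  # predict() could be a trained GP mean
    return np.array(y_sim)

# Toy usage: y_i = 0.8 * u_{i-1}, and a dummy "model" that knows this rule
u = np.sin(np.linspace(0, 10, 200))
y = np.roll(u, 1) * 0.8
X, T = build_narx_inputs(y, u, Ly=2, Lu=2)
y_hat = free_simulation(lambda x: 0.8 * x[2], y[:2], u, Ly=2, Lu=2)   # x[2] = u_{i-1}
print("free-sim RMSE:", np.sqrt(np.mean((y_hat[2:] - y[2:]) ** 2)))
```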

Robustness to outliers. Latent auto-regressive GP model:
$$x_i = f(x_{i-1}, \dots, x_{i-L_x}, u_{i-1}, \dots, u_{i-L_u}) + \epsilon^{(x)}_i, \qquad \epsilon^{(x)}_i \sim \mathcal{N}(\epsilon^{(x)}_i \mid 0, \sigma_x^2),$$
$$y_i = x_i + \epsilon^{(y)}_i, \qquad \epsilon^{(y)}_i \sim \mathcal{N}(\epsilon^{(y)}_i \mid 0, \tau_i^{-1}), \qquad \tau_i \sim \Gamma(\tau_i \mid \alpha, \beta).$$
Contribution 3: switching off outliers by including the above Student-t likelihood for the noise. Mattos, Damianou, Barreto, Lawrence, DYCOPS 2016

Robust GP autoregressive model: demonstration. [Figure: RMSE values for free simulation on test data with different levels of contamination by outliers; panels (a)-(e): Artificial datasets 1-5.]

Going deeper: Deep Recurrent Gaussian Process. Figure 1: RGP graphical model with H hidden layers, u → x^{(1)} → x^{(2)} → ... → x^{(H)} → y. x is the lagged latent function values augmented with the lagged exogenous inputs. Mattos, Dai, Damianou, Barreto, Lawrence, ICLR 2016

Inference is tricky... The variational lower bound on log p(y) is analytic but lengthy: it sums, over all H+1 layers, data-fit terms built from the kernel statistics Ψ_0^{(h)}, Ψ_1^{(h)}, Ψ_2^{(h)}, the inducing-point covariances K_z^{(h)} and noise variances σ_h^2, together with the entropy and prior terms of the variational distributions q(x_i^{(h)}).

Results in nonlinear system identification: 1. artificial dataset; 2. drive dataset: generated by a system with two electric motors that drive a pulley using a flexible belt; input: the sum of voltages applied to the motors; output: the speed of the belt.

[Results figures comparing RGP, GPNARX, MLP-NARX and LSTM.]

Avatar control. Figure: the generated motion with a step function signal, starting with walking (blue), switching to running (red) and switching back to walking (blue). Videos: https://youtu.be/fr-oegxv6yy (switching between learned speeds), https://youtu.be/at0hmtopgjc (interpolating (un)seen speed), https://youtu.be/fuf-uz83vmw (constant unseen speed).

Regressive dynamics with deep GPs. Instead of coupling the f's by encoding the Markov property, we couple them by coupling the f's inputs through another GP with time as input:
$$y = f(x) + \epsilon, \qquad x \sim \mathcal{GP}(0, k_x(t, t)), \qquad f \sim \mathcal{GP}(0, k_f(x, x)).$$
Damianou et al., NIPS 2011; Damianou, Titsias and Lawrence, JMLR, 2015

Dynamics. The dynamics are encoded in the covariance matrix K = k(t, t). We can consider special forms for K, e.g. to model individual sequences or to model periodic data. https://www.youtube.com/watch?v=i9teoyxabxq (missa), https://www.youtube.com/watch?v=muy1xhpnocu (dog), https://www.youtube.com/watch?v=fhdwlojtgk8 (mocap)
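As an illustration of encoding dynamics in K = k(t, t), here is a short sketch (not from the talk; kernel parameters are arbitrary) that draws latent trajectories from a GP over time with an RBF kernel versus a periodic kernel, the latter being the kind of special form one would pick for periodic data:

```python
import numpy as np

t = np.linspace(0, 10, 300)

def rbf_time(t, lengthscale=1.0):
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def periodic_time(t, period=2.0, lengthscale=0.5):
    """Exp-sine-squared covariance: correlations repeat with the given period."""
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

jitter = 1e-8 * np.eye(len(t))
x_smooth   = np.random.multivariate_normal(np.zeros(len(t)), rbf_time(t) + jitter)
x_periodic = np.random.multivariate_normal(np.zeros(len(t)), periodic_time(t) + jitter)
# x(t) is then fed as the input of the next GP layer, f ~ GP(0, k_f(x, x)), so the
# temporal structure chosen in K propagates into the observed sequence y.
```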

Summary. Data-driven, model-based approach to real and control problems. Gaussian processes: uncertainty quantification / propagation gives an advantage. Deep Gaussian processes: representation learning + dynamics learning. Future work: deep Gaussian processes + mechanistic information; consider real applications. Thanks!

Thanks Thanks to: Neil Lawrence, Michalis Titsias, Carl Henrik Ek, James Hensman, Cesar Lincoln Mattos, Zhenwen Dai, Javier Gonzalez, Tony Prescott, Uriel Martinez-Hernandez, Luke Boorman Thanks to George Karniadakis, Paris Perdikaris for hosting me

BACKUP SLIDES 1: Deep GP Optimization - variational inference

MAP optimisation? Graphical chain: x → f_1 → h_1 → f_2 → h_2 → f_3 → y. Joint distribution = p(y | h_2) p(h_2 | h_1) p(h_1 | x). MAP optimization is extremely problematic because: the dimensionality of the h's has to be decided a priori; it is prone to overfitting if the h's are treated as parameters; deep structures are not supported by the model's objective but have to be forced [Lawrence & Moore '07]. We want: to use the marginal likelihood as the objective,
$$\text{marg. lik.} = \int_{h_2, h_1} p(y \mid h_2)\, p(h_2 \mid h_1)\, p(h_1 \mid x),$$
plus further regularization tools.

Marginal likelihood is intractable. Let's try to marginalize out the top layer only:
$$p(h_2) = \int p(h_2 \mid h_1)\, p(h_1)\, dh_1 = \int p(h_2 \mid f_2)\, p(f_2 \mid h_1)\, p(h_1)\, df_2\, dh_1 = \int p(h_2 \mid f_2) \underbrace{\Big[ \int p(f_2 \mid h_1)\, p(h_1)\, dh_1 \Big]}_{\text{Intractable!}} df_2.$$
Intractability: h_1 appears non-linearly in p(f_2 | h_1), inside K^{-1} (and also in the determinant term), where K = k(h_1, h_1).

Solution: Variational Inference. Similar issues arise for 1-layer models; the solution was given by Titsias and Lawrence, 2010. A small modification to that solution does the trick in deep GPs too. Extend Titsias' method for variational learning of inducing variables in sparse GPs. Analytic variational bound F ≤ log p(y | x). Approximately marginalise out h, and hence obtain the approximate posterior q(h).

Inducing points: sparseness, tractability and Big Data. [Figure: latent inputs h^{(1)}, ..., h^{(N)} with function values f^{(1)}, ..., f^{(N)}, augmented with inducing variables u^{(i)} at inducing inputs h^{(i)}.] Inducing points were originally introduced for faster (sparse) GPs. But they also induce tractability in our models, due to the conditional independencies assumed. Viewing them as global variables gives an extension to Big Data [Hensman et al., UAI 2013].
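A rough sketch of how M inducing points cut the cost of GP regression from O(N^3) to O(NM^2) (my own illustration, using the simplest deterministic/Nyström-style predictive mean rather than the full variational treatment; data and inducing locations are arbitrary):

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

N, M, sigma2 = 2000, 30, 0.1
X = np.random.uniform(-3, 3, N)
y = np.sin(X) + np.sqrt(sigma2) * np.random.randn(N)
Z = np.linspace(-3, 3, M)                     # inducing input locations

Kmm = rbf(Z, Z) + 1e-8 * np.eye(M)
Kmn = rbf(Z, X)

# Work only with M x M matrices: A = Kmm + Kmn Knm / sigma2 (never form the N x N Knn)
A = Kmm + Kmn @ Kmn.T / sigma2
x_star = np.array([0.5])
k_star_m = rbf(Z, x_star)
mean_star = (k_star_m.T @ np.linalg.solve(A, Kmn @ y)) / sigma2
print("approx. predictive mean at 0.5:", mean_star.ravel()[0], "(true ~", np.sin(0.5), ")")
```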

Bayesian regularization:
$$\mathcal{F} = \text{Data fit} - \underbrace{\mathrm{KL}\big(q(h_1)\,\|\,p(h_1)\big)}_{\text{Regularisation}} + \sum_{l=2}^{L} \underbrace{\mathcal{H}\big(q(h_l)\big)}_{\text{Regularisation}}$$
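For concreteness, the two regularisation ingredients above have closed forms when q is Gaussian; a small sketch (illustrative only, assuming diagonal covariances):

```python
import numpy as np

def kl_gaussian_diag(mu_q, var_q, mu_p=0.0, var_p=1.0):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, var_p * I) ), summed over dimensions."""
    return 0.5 * np.sum(var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                        - 1.0 + np.log(var_p) - np.log(var_q))

def entropy_gaussian_diag(var_q):
    """Entropy of a diagonal Gaussian: H(q) = 0.5 * sum(log(2*pi*e*var))."""
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * var_q))

mu, var = np.random.randn(10), np.abs(np.random.randn(10)) + 0.1
print("KL(q(h1) || p(h1)) =", kl_gaussian_diag(mu, var))
print("H(q(h_l))          =", entropy_gaussian_diag(var))
```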