Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure




Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure. Belyaev Mikhail (1,2,3), Burnaev Evgeny (1,2,3), Kapushev Yermek (1,2). 1: Institute for Information Transmission Problems, Moscow, Russia; 2: DATADVANCE, llc, Moscow, Russia; 3: PreMoLab, MIPT, Dolgoprudny, Russia. ICML 2014 workshop on New Learning Frameworks and Models for Big Data, Beijing, 2014.

Approximation task. Problem statement: let $y = g(x)$ be some unknown function. The training sample is given: $D = \{x_i, y_i\}$, $g(x_i) = y_i$, $i = 1, \ldots, N$. The task is to construct $\hat f(x)$ such that $\hat f(x) \approx g(x)$.

Factorial Design of Experiments. Factors: $s_k = \{x_{i_k}^k \in X_k,\ i_k = 1, \ldots, n_k\}$, $X_k \subseteq \mathbb{R}^{d_k}$, $k = 1, \ldots, K$; $d_k$ is the dimensionality of the factor $s_k$. Factorial Design of Experiments: $S = \{x_i\}_{i=1}^N = s_1 \times s_2 \times \cdots \times s_K$. Dimensionality of $x \in S$: $d = \sum_{k=1}^K d_k$. Sample size: $N = \prod_{k=1}^K n_k$. Figure: a two-dimensional factorial DoE on $[0, 1]^2$ (axes $x_1$, $x_2$).
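
A minimal sketch (the helper name `full_factorial` and the NumPy layout are illustrative, not from the slides) of assembling such a design as the Cartesian product of per-factor point sets:

```python
import itertools
import numpy as np

def full_factorial(factors):
    """Build a full factorial design from a list of per-factor point sets.

    Each factor is an array of points: 1-D of length n_k for a scalar factor,
    or shape (n_k, d_k) for a multidimensional one. The resulting design has
    N = prod(n_k) points of dimension d = sum(d_k).
    """
    rows = []
    for combo in itertools.product(*factors):
        rows.append(np.concatenate([np.atleast_1d(c) for c in combo]))
    return np.array(rows)

# Example: two one-dimensional factors of sizes 3 and 4 -> N = 12 points, d = 2.
s1 = np.linspace(0.0, 1.0, 3)
s2 = np.linspace(0.0, 1.0, 4)
S = full_factorial([s1, s2])
print(S.shape)  # (12, 2)
```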

Example of Factorial DoE. Figure: factorial DoE with a multidimensional factor (3D scatter plot over $x_1$, $x_2$, $x_3$).

DoE in engineering problems:
1. Independent groups of variables (factors).
2. Training sample generation procedure: fix the values of the first factor; perform experiments varying the values of the other factors; fix another value of the first factor; perform a new series of experiments.
3. Knowledge from the subject domain is taken into account.
Data properties: the data has a special structure, the sample size can be large, and the factor sizes can differ significantly.

Example of Factorial DoE. Modeling of the pressure distribution over an aircraft wing. Angle of attack: $s_1 = \{0, 0.8, 1.6, 2.4, 3.2, 4.0\}$. Mach number: $s_2 = \{0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83\}$. Wing point coordinates: $s_3$, 5000 points in $\mathbb{R}^3$. The training sample: $S = s_1 \times s_2 \times s_3$. Dimensionality $d = 1 + 1 + 3 = 5$. Sample size $N = 6 \cdot 7 \cdot 5000 = 210000$.

Existing solutions.
Universal techniques. Disadvantages: they do not take the sample structure into account, which leads to low approximation quality and high computational complexity.
Multivariate Adaptive Regression Splines [Friedman, 1991]. Disadvantages: discontinuous derivatives, non-physical behaviour.
Tensor products of splines [Stone et al., 1997, Xiao et al., 2013]. Disadvantages: only one-dimensional factors, no accuracy evaluation procedure.
Gaussian Processes on a lattice [Dietrich and Newsam, 1997, Stroud et al., 2014]. Disadvantages: restricted to a two-dimensional grid with equidistant points.

The aim is to develop a computationally efficient algorithm that takes into account the special features of factorial Designs of Experiments.

Gaussian Process Regression. Function model: $g(x) = f(x) + \varepsilon(x)$, where $f(x)$ is a Gaussian process (GP) and $\varepsilon(x)$ is Gaussian white noise. A GP is fully defined by its mean and covariance function. The covariance function of $f(x)$:
$$K_f(x, x') = \sigma_f^2 \exp\left(-\sum_{i=1}^{d} \theta_i^2 \left(x^{(i)} - x'^{(i)}\right)^2\right),$$
where $x^{(i)}$ is the $i$-th component of the vector $x$ and $\theta = (\sigma_f^2, \theta_1, \ldots, \theta_d)$ are the parameters of the covariance function. The covariance function of $g(x)$:
$$K_g(x, x') = K_f(x, x') + \sigma_{noise}^2\, \delta(x, x'),$$
where $\delta(x, x')$ is the Kronecker delta.
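
A small sketch of this squared-exponential covariance with one inverse length-scale $\theta_i$ per input dimension (function names are illustrative, not from the slides):

```python
import numpy as np

def se_cov(X1, X2, sigma_f, theta):
    """K_f(x, x') = sigma_f^2 * exp(-sum_i theta_i^2 (x^(i) - x'^(i))^2).

    X1: (n1, d), X2: (n2, d), theta: (d,) inverse length-scales.
    Returns the (n1, n2) covariance matrix.
    """
    diff = X1[:, None, :] - X2[None, :, :]           # (n1, n2, d)
    sq = np.sum((theta ** 2) * diff ** 2, axis=-1)   # (n1, n2)
    return sigma_f ** 2 * np.exp(-sq)

def noisy_cov(X, sigma_f, theta, sigma_noise):
    """K_g on a finite point set: K_f plus noise variance on the diagonal."""
    return se_cov(X, X, sigma_f, theta) + sigma_noise ** 2 * np.eye(len(X))
```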

Choosing parameters: Maximum Likelihood Estimation. Log-likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi,$$
where $|K_g|$ is the determinant of the matrix $K_g = \left[K_g(x_i, x_j)\right]_{i,j=1}^N$, $x_i, x_j \in S$. The parameters $\theta$ are chosen such that
$$\theta^* = \arg\max_\theta \log p(y \mid X, \theta).$$
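
A direct $O(N^3)$ reference implementation of this log-likelihood, as a sketch for small $N$ only; the Cholesky-based form is a standard choice, not something prescribed by the slides:

```python
import numpy as np

def gp_log_likelihood(y, Kg):
    """log p(y | X, theta) for a zero-mean GP with covariance matrix Kg (N x N)."""
    N = len(y)
    L = np.linalg.cholesky(Kg)                           # Kg = L L^T, O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Kg^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |Kg|
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2.0 * np.pi)
```

Maximizing this over the parameters (for example with a generic numerical optimizer on the negative log-likelihood) gives the MLE; the rest of the talk is about making this evaluation much cheaper for factorial designs.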

Final model. Prediction of $g(x)$ at a point $x$:
$$\hat f(x) = k^T K_g^{-1} y, \quad \text{where } k = \left(K_f(x_1, x), \ldots, K_f(x_N, x)\right)^T.$$
Posterior variance:
$$\sigma^2(x) = K_f(x, x) - k^T K_g^{-1} k.$$
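
These formulas translate directly into code; a sketch assuming the covariance matrices are precomputed (for instance with the `se_cov` helper sketched earlier):

```python
import numpy as np

def gp_predict(Kg, Ks, kss_diag, y):
    """Posterior mean and variance at M test points.

    Kg: (N, N) training covariance K_f + sigma_noise^2 I;  Ks: (N, M) cross-covariances
    k = K_f(x_train, x_test);  kss_diag: (M,) prior variances K_f(x, x) at the test points.
    """
    Kg_inv_y = np.linalg.solve(Kg, y)
    mean = Ks.T @ Kg_inv_y                                           # k^T Kg^{-1} y
    var = kss_diag - np.sum(Ks * np.linalg.solve(Kg, Ks), axis=0)    # K_f(x,x) - k^T Kg^{-1} k
    return mean, var
```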

Gaussian Processes for factorial DoE. Issues:
1. High computational complexity: $O(N^3)$. In the case of a factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.

Estimation of covariance function parameters. Log-likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi.$$
Derivatives:
$$\frac{\partial}{\partial\theta} \log p(y \mid X, \sigma_f, \sigma_{noise}) = -\frac{1}{2} \mathrm{Tr}\left(K_g^{-1} K'\right) + \frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y,$$
where $\theta$ is a parameter of the covariance function (one of $\theta_i$, $i = 1, \ldots, d$, $\sigma_{noise}$ or $\sigma_f$) and $K' = \frac{\partial K_g}{\partial \theta}$.

Tensors and the Kronecker product.
Definition. A tensor $\mathcal{Y}$ is a $K$-dimensional matrix of size $n_1 \times n_2 \times \cdots \times n_K$:
$$\mathcal{Y} = \left\{ y_{i_1, i_2, \ldots, i_K},\ \{i_k = 1, \ldots, n_k\}_{k=1}^K \right\}.$$
Definition. The Kronecker product of matrices $A$ and $B$ is the block matrix
$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}.$$

Related operations. The vec operation:
$$\mathrm{vec}(\mathcal{Y}) = \left[ Y_{1,1,\ldots,1}, Y_{2,1,\ldots,1}, \ldots, Y_{n_1,1,\ldots,1}, Y_{1,2,\ldots,1}, \ldots, Y_{n_1,n_2,\ldots,n_K} \right].$$
Multiplication of a tensor by a matrix along the $k$-th direction, $\mathcal{Z} = \mathcal{Y} \otimes_k B$:
$$Z_{i_1,\ldots,i_{k-1},j,i_{k+1},\ldots,i_K} = \sum_{i_k} Y_{i_1,\ldots,i_k,\ldots,i_K} B_{i_k j}.$$
Connection between tensors and the Kronecker product:
$$\mathrm{vec}(\mathcal{Y} \otimes_1 B_1 \otimes_2 \cdots \otimes_K B_K) = (B_1 \otimes \cdots \otimes B_K)\,\mathrm{vec}(\mathcal{Y}). \quad (1)$$
The complexity of computing the left-hand side is $O(N \sum_k n_k)$, of the right-hand side $O(N^2)$.
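
A small numerical check of the saving behind identity (1) for $K = 2$; the exact ordering of the Kronecker factors depends on the vec convention, and this sketch uses NumPy's row-major reshape, under which the product appears as `np.kron(B1, B2)`:

```python
import numpy as np

def kron_matvec(B1, B2, y):
    """Compute (B1 kron B2) @ y without forming the Kronecker product.

    With row-major (C-order) reshape, index i1*n2 + i2 of y corresponds to Y[i1, i2],
    so (B1 kron B2) y = vec(B1 @ Y @ B2.T): O(N * (n1 + n2)) instead of O(N^2).
    """
    Y = y.reshape(B1.shape[1], B2.shape[1])
    return (B1 @ Y @ B2.T).reshape(-1)

rng = np.random.default_rng(0)
B1, B2 = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
y = rng.standard_normal(12)
assert np.allclose(kron_matvec(B1, B2, y), np.kron(B1, B2) @ y)
```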

Covariance function. Form of the covariance function:
$$K_f(x, y) = \prod_{i=1}^K k_i\left(x^i, y^i\right), \quad x^i, y^i \in s_i,$$
where $k_i$ is an arbitrary covariance function for the $i$-th factor. Covariance matrix:
$$K = \bigotimes_{i=1}^K K_i,$$
where $K_i$ is the covariance matrix of the $i$-th factor.

Fast computation of the log-likelihood.
Proposition. Let $K_i = U_i D_i U_i^T$ be the Singular Value Decomposition (SVD) of the matrix $K_i$, where $U_i$ is an orthogonal matrix and $D_i$ is diagonal. Then
$$|K_g| = \prod_{i_1, \ldots, i_K} \mathcal{D}_{i_1, \ldots, i_K},$$
$$K_g^{-1} = \left(\bigotimes_k U_k^T\right) \left(\bigotimes_k D_k + \sigma_{noise}^2 I\right)^{-1} \left(\bigotimes_k U_k\right), \quad (2)$$
$$K_g^{-1} y = \mathrm{vec}\left[\left(\left(\mathcal{Y} \otimes_1 U_1 \otimes_2 \cdots \otimes_K U_K\right) \odot \mathcal{D}^{-1}\right) \otimes_1 U_1^T \otimes_2 \cdots \otimes_K U_K^T\right],$$
where $\mathcal{D}$ is the tensor of the diagonal elements of the matrix $\sigma_{noise}^2 I + \bigotimes_k D_k$, $\mathcal{D}^{-1}$ is its elementwise inverse, and $\odot$ denotes the elementwise product.
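
A sketch of formula (2) for $K = 2$ on a toy example (names are illustrative; the dense matrix is formed only to verify the result). The per-factor eigendecompositions cost $O(\sum_i n_i^3)$, and each matrix-vector product reuses the reshaping trick from the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cov(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)            # symmetric positive definite factor covariance

K1, K2, sigma2 = random_cov(5), random_cov(6), 0.1
y = rng.standard_normal(30)

# Per-factor eigendecompositions K_i = U_i D_i U_i^T: never touches the N x N matrix.
d1, U1 = np.linalg.eigh(K1)
d2, U2 = np.linalg.eigh(K2)
d = np.kron(d1, d2) + sigma2                  # diagonal of (D1 kron D2) + sigma^2 I

def kron_matvec(B1, B2, v):                   # (B1 kron B2) v via reshaping, O(N (n1 + n2))
    V = v.reshape(B1.shape[1], B2.shape[1])
    return (B1 @ V @ B2.T).reshape(-1)

alpha = kron_matvec(U1.T, U2.T, y) / d        # rotate and scale by the diagonal ...
x_fast = kron_matvec(U1, U2, alpha)           # ... then rotate back: x = K_g^{-1} y

# Reference dense solve and log-determinant for comparison.
Kg = np.kron(K1, K2) + sigma2 * np.eye(30)
assert np.allclose(x_fast, np.linalg.solve(Kg, y))
assert np.isclose(np.sum(np.log(d)), np.linalg.slogdet(Kg)[1])   # log|K_g| = sum of log D
```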

Computational complexity.
Proposition. Calculation of the log-likelihood using (2) has computational complexity
$$O\left(N \sum_{i=1}^K n_i + \sum_{i=1}^K n_i^3\right).$$
Assuming $n_i \approx \sqrt[K]{N}$ (the number of factors is large and their sizes are close) we get
$$O\left(N \sum_i n_i\right) = O\left(N^{1 + \frac{1}{K}}\right).$$

Fast computation of derivatives.
Proposition. The following statements hold:
$$\mathrm{Tr}\left(K_g^{-1} K'\right) = \left\langle \mathrm{diag}\left(\hat{\mathcal{D}}^{-1}\right),\ \bigotimes_{i=1}^K \mathrm{diag}\left(U_i K_i' U_i^T\right) \right\rangle,$$
$$\frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y = \frac{1}{2} \left\langle \mathcal{A},\ \mathcal{A} \otimes_1 K_1^T \otimes_2 \cdots \otimes_{i-1} K_{i-1}^T \otimes_i \frac{\partial K_i^T}{\partial\theta} \otimes_{i+1} K_{i+1}^T \otimes_{i+2} \cdots \otimes_K K_K^T \right\rangle,$$
where $\hat{\mathcal{D}} = \sigma_{noise}^2 I + \bigotimes_k D_k$ and $\mathrm{vec}(\mathcal{A}) = K_g^{-1} y$. The computational complexity is
$$O\left(N \sum_{i=1}^K n_i + \sum_{i=1}^K n_i^3\right).$$

Gaussian Processes for factorial DoE. Issues:
1. High computational complexity: $O(N^3)$. In the case of a factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.

Degeneracy. Example of degeneracy. Figure: the original (true) function with the training set. Figure: the approximation obtained using GP from the GPML toolbox.

Regularization. Prior distribution:
$$\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} \sim \mathrm{Be}(\alpha, \beta), \quad \{i = 1, \ldots, d_k\}_{k=1}^K,$$
$$a_k^{(i)} = \frac{c_k}{\max_{x, y \in s_k}\left(x^{(i)} - y^{(i)}\right)}, \qquad b_k^{(i)} = \frac{C_k}{\min_{x, y \in s_k,\, x \neq y}\left(x^{(i)} - y^{(i)}\right)},$$
where $\mathrm{Be}(\alpha, \beta)$ is the Beta distribution, and $c_k$ and $C_k$ are parameters of the algorithm (we use $c_k = 0.01$ and $C_k = 2$).
Initialization:
$$\theta_k^{(i)} = \left[\frac{1}{n_k}\left(\max_{x \in s_k} x^{(i)} - \min_{x \in s_k} x^{(i)}\right)\right]^{-1}.$$

Regularization. Regularized log-likelihood:
$$\log p(y \mid X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi$$
$$+ \sum_{k,i} \left[ (\alpha - 1) \log\left(\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\right) + (\beta - 1) \log\left(1 - \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\right) \right] - d \log B(\alpha, \beta).$$
Figure: penalty function, log-pdf of the Beta prior with $a = b = 2$.
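
A sketch of the penalty term, assuming the per-parameter bounds $a$, $b$ have already been computed from the factor values as on the previous slide (the function name and vectorized layout are illustrative, not from the slides):

```python
import numpy as np
from scipy.special import betaln

def log_beta_prior_penalty(theta, a, b, alpha=2.0, beta=2.0):
    """Penalty added to the GP log-likelihood: Beta(alpha, beta) log-density of the
    rescaled length-scale parameters t = (theta - a) / (b - a), summed over all of them."""
    t = (theta - a) / (b - a)                      # assumes a < theta < b elementwise
    return (np.sum((alpha - 1.0) * np.log(t) + (beta - 1.0) * np.log(1.0 - t))
            - theta.size * betaln(alpha, beta))

# The regularized objective is the GP log-likelihood plus log_beta_prior_penalty(theta, a, b).
```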

Regularization. Example of regularized approximation. Figure: the original (true) function with the training set. Figure: the approximation obtained using the developed algorithm with regularization (tensorgp-reg).

Toy problems. A set of 34 functions (standard artificial test functions used for testing global optimization algorithms); dimensionality from 2 to 6; sample sizes from 80 to 216000 with full factorial DoE. Quality criteria: training time and approximation error,
$$\mathrm{MSE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat f(x_i) - g(x_i)\right)^2.$$
Tested algorithms:
MARS: Multivariate Adaptive Regression Splines;
SparseGP: sparse Gaussian Processes (GPML toolbox);
tensorgp: the developed algorithm without regularization;
tensorgp-reg: the developed algorithm with regularization.

Dolan-Moré curves. There are $T$ problems and $A$ algorithms; $e_{ta}$ is the approximation error (or training time) of the $a$-th algorithm on the $t$-th problem, and $\tilde e_t = \min_a e_{ta}$. Then
$$\rho_a(\tau) = \frac{\#\{t : e_{ta} \leq \tau\, \tilde e_t\}}{T}.$$
The higher the curve, the better the corresponding algorithm performs; $\rho_a(1)$ is the fraction of problems on which the $a$-th algorithm performed best.
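
A short sketch of computing these performance profiles from a problems-by-algorithms error matrix (the non-strict inequality makes $\rho_a(1)$ count the problems where the algorithm is best):

```python
import numpy as np

def dolan_more_curve(errors, taus):
    """rho_a(tau) for an errors matrix of shape (T problems, A algorithms)."""
    best = errors.min(axis=1, keepdims=True)       # e~_t = min_a e_ta
    ratios = errors / best                         # performance ratio of each algorithm
    return np.array([[np.mean(ratios[:, a] <= tau) for a in range(errors.shape[1])]
                     for tau in taus])

# Example: 5 problems, 2 algorithms.
errors = np.array([[1.0, 2.0], [0.5, 0.4], [3.0, 3.1], [1.2, 2.4], [0.9, 0.8]])
print(dolan_more_curve(errors, taus=[1.0, 2.0, 4.0]))
```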

Results of experiments: training time. Figure: Dolan-Moré curves; quality criterion is training time.

Results of experiments: approximation quality. Figure: Dolan-Moré curves; quality criterion is MSE.

Real problem: rotating disc of an impeller. Objective functions: $p_1$ (contact pressure), $S_{rmax}$ (maximum radial stress), $w$ (weight of the disc). The geometrical shape of the disc is parametrized by 6 input variables $x = (h_1, h_2, h_3, h_4, r_2, r_3)$ ($r_1$ and $r_4$ are fixed). Training sample (generated from a computational physical model): sample size 14400, factor sizes $\{1, 8, 8, 3, 15, 5\}$. Figure: rotating disc of an impeller.

Results of the experiment.

Table: Approximation errors of $p_1$

              MAE       MSE       RRMS
MARS          4.4644    6.5120    0.1166
SparseGP      76.9313   86.7034   1.5530
tensorgp-reg  0.3020    0.3981    0.0070

$$\mathrm{MAE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left|\hat f(x_i) - g(x_i)\right|, \qquad \mathrm{MSE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat f(x_i) - g(x_i)\right)^2,$$
$$\mathrm{RRMS} = \sqrt{\frac{\sum_{i=1}^{N_{test}} \left(\hat f(x_i) - g(x_i)\right)^2}{\sum_{i=1}^{N_{test}} \left(\bar y - g(x_i)\right)^2}}, \qquad \bar y = \frac{1}{N} \sum_{i=1}^{N} y_i.$$
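
A sketch of the three error measures as they would be computed on a held-out test set (names are illustrative):

```python
import numpy as np

def error_metrics(f_hat, g_true):
    """MAE, MSE and RRMS of predictions f_hat against true values g_true on a test set."""
    resid = f_hat - g_true
    mae = np.mean(np.abs(resid))
    mse = np.mean(resid ** 2)
    rrms = np.sqrt(np.sum(resid ** 2) / np.sum((np.mean(g_true) - g_true) ** 2))
    return mae, mse, rrms
```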

Results of approximation. 2D slices of the approximations of $p_1$ over $x_5$ and $x_6$ (the other parameters are fixed). Figure: GPML SparseGP is applied. Figure: the approximation obtained using the developed algorithm with regularization.

Further research: missing data. Reasons for missing points: data generation is still in progress (each point is time-consuming to compute); the data generator failed at some points.

Incomplete factorial DoE. $S_{full}$ is the full factorial DoE, $N_{full} = |S_{full}|$; $S$ is the incomplete factorial DoE, $N = |S|$, $S \subset S_{full}$. The covariance matrix is no longer a Kronecker product:
$$K_f \neq \bigotimes_{i=1}^K K_i.$$
Figure: an incomplete two-dimensional factorial design (axes $x_1$, $x_2$).

Computation of the log-likelihood. Notation: $\{x_i\}_{i=1}^{N_{full}} = S_{full}$; $\tilde K_f$ is the covariance matrix for the full factorial DoE $S_{full}$; $W$ is a diagonal matrix such that $W_{ii} = 1$ if $x_i \in S$ and $W_{ii} = 0$ if $x_i \notin S$; $\tilde y$ is the vector of outputs $y$ extended by arbitrary values at the missing points.
Proposition. Let $\tilde z$ be a solution of
$$\left(\tilde K_f W \tilde K_f + \sigma_{noise}^2 \tilde K_f\right) \tilde z = \tilde K_f W \tilde y. \quad (3)$$
Then the solution $z$ of $(K_f + \sigma_{noise}^2 I) z = y$, i.e. $z = (K_f + \sigma_{noise}^2 I)^{-1} y$, has the form $z = (\tilde z_{i_1}, \ldots, \tilde z_{i_N})$, where $i_k \in \{j : x_j \in S\}$, $k = 1, \ldots, N$.
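
A tiny dense-matrix check of Proposition (3), purely to illustrate the statement rather than the fast solver; the grid, noise level and index names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Work on a 4 x 5 grid with 6 points removed.
A1, A2 = rng.standard_normal((4, 4)), rng.standard_normal((5, 5))
K1, K2 = A1 @ A1.T + 4 * np.eye(4), A2 @ A2.T + 5 * np.eye(5)
Kf_full = np.kron(K1, K2)                       # covariance on the full grid S_full
sigma2, N_full = 0.1, 20

observed = np.sort(rng.choice(N_full, size=14, replace=False))
W = np.zeros((N_full, N_full))
W[observed, observed] = 1.0                     # W_ii = 1 iff x_i is observed
y = rng.standard_normal(14)                     # outputs at the observed points
y_tilde = np.zeros(N_full)                      # y extended by arbitrary values (zeros here)
y_tilde[observed] = y

# Solve (K~_f W K~_f + sigma^2 K~_f) z~ = K~_f W y~ ...
z_tilde = np.linalg.solve(Kf_full @ W @ Kf_full + sigma2 * Kf_full, Kf_full @ W @ y_tilde)
# ... and compare its observed entries with (K_f + sigma^2 I)^{-1} y on the incomplete design.
Kf = Kf_full[np.ix_(observed, observed)]
assert np.allclose(z_tilde[observed], np.linalg.solve(Kf + sigma2 * np.eye(14), y))
```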

Computation of the log-likelihood. $R = N_{full} - N$ is the number of missing points; $\tilde U = \bigotimes_{i=1}^K \tilde U_i$, $\tilde D = \bigotimes_{i=1}^K \tilde D_i$, $\hat D = \tilde D + \sigma_{noise}^2 I$. The change of variables:
$$\alpha = \begin{cases} \left(\hat D\right)^{-\frac{1}{2}} \tilde D\, \tilde U^T \tilde z, & \text{if } R < N, \\[4pt] \left(\tilde D\right)^{\frac{1}{2}} \sigma_{noise}^{-2}\, \tilde U^T \tilde z, & \text{if } R \geq N. \end{cases} \quad (4)$$
Proposition. The system of linear equations (3) can be solved using the change of variables (4) and the Conjugate Gradient Method in at most $\min(R + 1, N + 1)$ iterations. The computational complexity of each iteration is $O\left(N_{full} \sum_k n_k\right)$.

Computation of the determinant. Order the points of $S_{full}$ so that the points of $S$ come first; then
$$\tilde K_f + \sigma_{noise}^2 I = \begin{pmatrix} K_g & A \\ A^T & B \end{pmatrix},$$
and for the determinant
$$|K_g| = \frac{\left|\tilde K_f + \sigma_{noise}^2 I\right|}{\left|B - A^T K_g^{-1} A\right|}. \quad (5)$$
Proposition. $\left|\tilde K_f + \sigma_{noise}^2 I\right|$ is computed using the formulae for the full factorial case; $\left|B - A^T K_g^{-1} A\right|$ is computed numerically. The complexity of computing the determinant using (5) is $O\left(\min\{R + 1, N + 1\}\, R\, N_{full} \sum_k n_k\right)$.
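
A matching dense check of formula (5) on a small incomplete grid (again only an illustration; in the algorithm the full-grid determinant comes from the Kronecker formulas and only the $R \times R$ correction is computed numerically):

```python
import numpy as np

rng = np.random.default_rng(3)

A1, A2 = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
Kf_full = np.kron(A1 @ A1.T + 3 * np.eye(3), A2 @ A2.T + 4 * np.eye(4))
sigma2, N_full = 0.1, 12
observed = np.sort(rng.choice(N_full, size=8, replace=False))
missing = np.setdiff1d(np.arange(N_full), observed)

M = Kf_full + sigma2 * np.eye(N_full)            # full-grid matrix: cheap Kronecker determinant
Kg = M[np.ix_(observed, observed)]               # covariance matrix on the incomplete design
A = M[np.ix_(observed, missing)]
B = M[np.ix_(missing, missing)]

lhs = np.linalg.det(Kg)
rhs = np.linalg.det(M) / np.linalg.det(B - A.T @ np.linalg.solve(Kg, A))
assert np.isclose(lhs, rhs)
```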

Conclusion. The developed algorithm is computationally efficient; can handle large samples; takes into account the special features of the given data; and has been shown to be efficient on a large set of toy problems as well as on real-world problems.

Thank you for your attention! More details are given in: Belyaev, M., Burnaev, E., and Kapushev, Y. (2014). Exact inference for Gaussian process regression in case of big data with the Cartesian product structure. arXiv preprint arXiv:1403.6573.

References
Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput., 18(4):1088-1107.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1):1-141.
Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 25:1371-1470.
Stroud, J. R., Stein, M. L., and Lysen, S. (2014). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. arXiv preprint arXiv:1402.4281.
Xiao, L., Li, Y., and Ruppert, D. (2013). Fast bivariate P-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):577-599.