Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Mikhail Belyaev (1,2,3), Evgeny Burnaev (1,2,3), Yermek Kapushev (1,2)
(1) Institute for Information Transmission Problems, Moscow, Russia
(2) DATADVANCE, llc, Moscow, Russia
(3) PreMoLab, MIPT, Dolgoprudny, Russia

ICML 2014 Workshop on New Learning Frameworks and Models for Big Data, Beijing, 2014
Approximation task

Problem statement. Let $y = g(x)$ be some unknown function. The training sample is given: $D = \{(x_i, y_i)\}$, $g(x_i) = y_i$, $i = 1, \dots, N$. The task is to construct $\hat{f}(x)$ such that $\hat{f}(x) \approx g(x)$.
Factorial Design of Experiments

Factors: $s_k = \{x^k_{i_k} \in X_k,\ i_k = 1, \dots, n_k\}$, $X_k \subseteq \mathbb{R}^{d_k}$, $k = 1, \dots, K$, where $d_k$ is the dimensionality of the factor $s_k$.
Factorial Design of Experiments: $S = \{x_i\}_{i=1}^N = s_1 \times s_2 \times \dots \times s_K$.
Dimensionality of $x \in S$: $d = \sum_{k=1}^K d_k$. Sample size: $N = \prod_{k=1}^K n_k$.
[Figure: a two-dimensional factorial grid in $(x_1, x_2)$.]
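As an illustration, a full factorial design is just the Cartesian product of the factors; a minimal NumPy sketch (the factor sizes and dimensions here are made up for the example):

```python
import itertools
import numpy as np

# Two factors: a 1-D factor with n_1 = 5 points and a 2-D factor with
# n_2 = 4 points; the full design is their Cartesian product.
s1 = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
s2 = np.random.rand(4, 2)
S = np.array([np.concatenate(p) for p in itertools.product(s1, s2)])
assert S.shape == (5 * 4, 1 + 2)   # N = prod(n_k), d = sum(d_k)
```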
Example of Factorial DoE

[Figure: factorial DoE with a multidimensional factor, shown as a three-dimensional grid over $(x_1, x_2, x_3)$.]
DoE in engineering problems

1. Independent groups of variables are factors.
2. Training sample generation procedure: fix a value of the first factor; perform experiments varying the values of the other factors; fix another value of the first factor; perform a new series of experiments.
3. Take into account knowledge from the subject domain.

Data properties: the data has a special structure, can have a large sample size, and the factor sizes can differ significantly.
Example of Factorial DoE

Modeling of the pressure distribution over an aircraft wing:
Angle of attack: $s_1 = \{0, 0.8, 1.6, 2.4, 3.2, 4.0\}$.
Mach number: $s_2 = \{0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83\}$.
Wing point coordinates: $s_3$ — 5000 points in $\mathbb{R}^3$.
The training sample: $S = s_1 \times s_2 \times s_3$. Dimensionality $d = 1 + 1 + 3 = 5$. Sample size $N = 6 \cdot 7 \cdot 5000 = 210000$.
Existing solutions

Universal techniques. Disadvantages: don't take the sample structure into account; low approximation quality; high computational complexity.
Multivariate Adaptive Regression Splines [Friedman, 1991]. Disadvantages: discontinuous derivatives, non-physical behaviour.
Tensor products of splines [Stone et al., 1997; Xiao et al., 2013]. Disadvantages: only one-dimensional factors, no accuracy evaluation procedure.
Gaussian Processes on a lattice [Dietrich and Newsam, 1997; Stroud et al., 2014]. Disadvantages: restricted to a two-dimensional grid with equidistant points.
The aim is to develop a computationally efficient algorithm that takes into account the special features of factorial Designs of Experiments.
Gaussian Process Regression

Function model: $g(x) = f(x) + \varepsilon(x)$, where $f(x)$ is a Gaussian process (GP) and $\varepsilon(x)$ is Gaussian white noise. A GP is fully defined by its mean and covariance function. The covariance function of $f(x)$:
$K_f(x, x') = \sigma_f^2 \exp\left(-\sum_{i=1}^d \theta_i^2 \left(x^{(i)} - x'^{(i)}\right)^2\right)$,
where $x^{(i)}$ is the $i$-th component of the vector and $\theta = (\sigma_f^2, \theta_1, \dots, \theta_d)$ are the parameters of the covariance function. The covariance function of $g(x)$:
$K_g(x, x') = K_f(x, x') + \sigma_{noise}^2 \delta(x, x')$,
where $\delta(x, x')$ is the Kronecker delta.
Choosing parameters

Maximum Likelihood Estimation. Loglikelihood:
$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi$,
where $|K_g|$ is the determinant of the matrix $K_g = \left(K_g(x_i, x_j)\right)_{i,j=1}^N$, $x_i, x_j \in S$. The parameters $\theta$ are chosen such that
$\theta^* = \arg\max_\theta \log p(y \mid X, \theta)$.
Final model

Prediction of $g(x)$ at a point $x$: $\hat{f}(x) = k^T K_g^{-1} y$, where $k = \left(K_f(x_1, x), \dots, K_f(x_N, x)\right)^T$.
Posterior variance: $\sigma^2(x) = K_f(x, x) - k^T K_g^{-1} k$.
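For reference, a direct (cubic-cost) implementation of these formulas — a minimal sketch assuming the squared-exponential covariance from the previous slide; all names are illustrative:

```python
import numpy as np

def gp_predict(X, y, X_new, theta, sigma_f2, sigma_n2):
    """Baseline O(N^3) GP prediction; the structured algorithm on the
    following slides avoids this cost for factorial designs."""
    def kf(A, B):  # K_f(x,x') = sigma_f^2 exp(-sum_i theta_i^2 (x_i - x'_i)^2)
        d2 = ((A[:, None, :] - B[None, :, :]) * theta) ** 2
        return sigma_f2 * np.exp(-d2.sum(-1))
    Kg = kf(X, X) + sigma_n2 * np.eye(len(X))
    L = np.linalg.cholesky(Kg)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_g^{-1} y
    k = kf(X, X_new)                                      # cross-covariances
    mean = k.T @ alpha                                    # k^T K_g^{-1} y
    v = np.linalg.solve(L, k)
    var = sigma_f2 - (v ** 2).sum(axis=0)                 # K_f(x,x) = sigma_f^2
    return mean, var
```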
Gaussian Processes for factorial DoE

Issues:
1. High computational complexity, $O(N^3)$: in the case of a factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.
Estimation of covariance function parameters

Loglikelihood:
$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi$.
Derivatives:
$\frac{\partial}{\partial \theta} \log p(y \mid X, \sigma_f, \sigma_{noise}) = -\frac{1}{2} \mathrm{Tr}\left(K_g^{-1} K'\right) + \frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y$,
where $\theta$ is a parameter of the covariance function (a component $\theta_i$, $i = 1, \dots, d$, $\sigma_{noise}$, or $\sigma_f$) and $K' = \frac{\partial K_g}{\partial \theta}$.
Tensors and Kronecker product

Definition. A tensor $\mathcal{Y}$ is a $K$-dimensional matrix of size $n_1 \times n_2 \times \dots \times n_K$:
$\mathcal{Y} = \left\{y_{i_1, i_2, \dots, i_K},\ \{i_k = 1, \dots, n_k\}_{k=1}^K\right\}$.

Definition. The Kronecker product of matrices $A$ and $B$ is the block matrix
$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}$.
Related operations

The operation vec:
$\mathrm{vec}(\mathcal{Y}) = \left[\mathcal{Y}_{1,1,\dots,1}, \mathcal{Y}_{2,1,\dots,1}, \dots, \mathcal{Y}_{n_1,1,\dots,1}, \mathcal{Y}_{1,2,\dots,1}, \dots, \mathcal{Y}_{n_1,n_2,\dots,n_K}\right]$.
Multiplication of a tensor by a matrix along the $k$-th direction, $\mathcal{Z} = \mathcal{Y} \times_k B$:
$\mathcal{Z}_{i_1,\dots,i_{k-1},j,i_{k+1},\dots,i_K} = \sum_{i_k} \mathcal{Y}_{i_1,\dots,i_k,\dots,i_K} B_{i_k j}$.
Connection between tensors and the Kronecker product:
$\mathrm{vec}(\mathcal{Y} \times_1 B_1 \cdots \times_K B_K) = (B_1 \otimes \cdots \otimes B_K)\,\mathrm{vec}(\mathcal{Y})$.   (1)
Computing the left-hand side costs $O(N \sum_k n_k)$; computing the right-hand side costs $O(N^2)$.
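A quick NumPy check of identity (1). Note that the exact order of the Kronecker factors (and any transposes) depends on the vec and mode-product conventions; with the Fortran-order (first-index-fastest) flatten and the mode product used below, the factors appear in reverse mode order:

```python
import numpy as np

n = (2, 3, 4)
Y = np.random.randn(*n)
Bs = [np.random.randn(nk, nk) for nk in n]

def mult_along(T, B, k):
    """Multiply tensor T by matrix B along the k-th direction
    (contracting with the second index of B)."""
    return np.moveaxis(np.tensordot(B, T, axes=(1, k)), 0, k)

Z = Y
for k, B in enumerate(Bs):                   # left-hand side: O(N * sum(n_k))
    Z = mult_along(Z, B, k)

vec = lambda T: T.flatten(order="F")         # first index runs fastest
K = np.kron(np.kron(Bs[2], Bs[1]), Bs[0])    # right-hand side: O(N^2)
assert np.allclose(vec(Z), K @ vec(Y))
```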
Covariance function

Form of the covariance function:
$K_f(x, y) = \prod_{i=1}^K k_i(x^i, y^i)$, $x^i, y^i \in s_i$,
where $k_i$ is an arbitrary covariance function for the $i$-th factor. The covariance matrix:
$K = \bigotimes_{i=1}^K K_i$,
where $K_i$ is the covariance matrix of the $i$-th factor.
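A small sanity check of this factorization on two toy 1-D factors with an RBF kernel per factor (all names and sizes illustrative); the full grid must enumerate the last factor fastest for this Kronecker ordering:

```python
import itertools
from functools import reduce
import numpy as np

k_rbf = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2)
s = [np.random.rand(3), np.random.rand(4)]   # two 1-D factors
Ks = [k_rbf(si, si) for si in s]             # per-factor covariance matrices
grid = list(itertools.product(*s))           # full factorial design
# Product covariance evaluated point by point over the full grid ...
K_full = np.array([[np.prod([np.exp(-(u - v) ** 2) for u, v in zip(p, q)])
                    for q in grid] for p in grid])
# ... equals the Kronecker product of the per-factor matrices.
assert np.allclose(K_full, reduce(np.kron, Ks))
```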
Fast computation of loglikelihood

Proposition. Let $K_i = U_i D_i U_i^T$ be a Singular Value Decomposition (SVD) of the matrix $K_i$, where $U_i$ is an orthogonal matrix and $D_i$ is diagonal. Then:
$|K_g| = \prod_{i_1, \dots, i_K} \mathcal{D}_{i_1, \dots, i_K}$,
$K_g^{-1} = \left(\bigotimes_k U_k^T\right) \left(\bigotimes_k D_k + \sigma_{noise}^2 I\right)^{-1} \left(\bigotimes_k U_k\right)$,   (2)
$K_g^{-1} y = \mathrm{vec}\left[\left((\mathcal{Y} \times_1 U_1 \cdots \times_K U_K) \odot \mathcal{D}^{-1}\right) \times_1 U_1^T \cdots \times_K U_K^T\right]$,
where $\mathcal{D}$ is the tensor of the diagonal elements of the matrix $\sigma_{noise}^2 I + \bigotimes_k D_k$ (with $\mathcal{D}^{-1}$ taken element-wise) and $\mathcal{Y}$ is $y$ reshaped into a tensor.
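A minimal sketch of the fast solve behind (2), assuming an eigendecomposition per factor (np.linalg.eigh plays the role of the SVD for symmetric $K_i$); the tensor layout conventions are those of NumPy, not necessarily the paper's:

```python
from functools import reduce
import numpy as np

def mult_along(T, B, k):
    # Multiply tensor T by matrix B along mode k.
    return np.moveaxis(np.tensordot(B, T, axes=(1, k)), 0, k)

def kron_gp_solve(Ks, sigma2, Y):
    """Computes Z with vec(Z) = (kron of Ks + sigma2*I)^{-1} vec(Y)
    in O(N sum(n_k) + sum(n_k^3)), never forming the N x N matrix."""
    eigs, Us = zip(*[np.linalg.eigh(K) for K in Ks])   # K_k = U_k D_k U_k^T
    Z = Y
    for k, U in enumerate(Us):
        Z = mult_along(Z, U.T, k)                      # rotate to eigenbasis
    D = reduce(np.multiply.outer, eigs) + sigma2       # eigenvalue tensor + noise
    Z = Z / D                                          # diagonal solve
    # log|K_g| comes for free here as np.sum(np.log(D))
    for k, U in enumerate(Us):
        Z = mult_along(Z, U, k)                        # rotate back
    return Z
```

For small factor sizes the result can be verified against a dense solve built with np.kron.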
Computational complexity

Proposition. Calculating the loglikelihood using (2) has the following computational complexity:
$O\left(N \sum_{i=1}^K n_i + \sum_{i=1}^K n_i^3\right)$.
Assuming $n_i \approx \sqrt[K]{N}$ (the number of factors is large and their sizes are comparable), we get
$O\left(N \sum_i n_i\right) = O\left(N^{1 + \frac{1}{K}}\right)$.
Fast computation of derivatives

Proposition. The following statements hold:
$\mathrm{Tr}\left(K_g^{-1} K'\right) = \left\langle \mathrm{diag}\left(\hat{\mathcal{D}}^{-1}\right),\ \bigotimes_{i=1}^K \mathrm{diag}\left(U_i^T K_i' U_i\right) \right\rangle$,
$\frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y = \frac{1}{2} \left\langle \mathcal{A},\ \mathcal{A} \times_1 K_1^T \cdots \times_{i-1} K_{i-1}^T \times_i \frac{\partial K_i^T}{\partial \theta} \times_{i+1} K_{i+1}^T \cdots \times_K K_K^T \right\rangle$,
where $\hat{\mathcal{D}} = \sigma_{noise}^2 I + \bigotimes_k D_k$, $\mathrm{vec}(\mathcal{A}) = K_g^{-1} y$, and $K_i'$ denotes the $i$-th factor of $K'$ (equal to $K_i$ except for the differentiated factor). The computational complexity is
$O\left(N \sum_{i=1}^K n_i + \sum_{i=1}^K n_i^3\right)$.
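As an illustration of the first identity, the trace term can be assembled from per-factor diagonals and the eigenvalue tensor; a sketch under the same assumptions as the solve above (dKi denotes the derivative of the $i$-th factor's covariance matrix):

```python
from functools import reduce
import numpy as np

def kron_trace_term(Ks, dKi, i, sigma2):
    """Tr(K_g^{-1} K') for K' = K_1 kron ... kron dKi kron ... kron K_K,
    without forming any N x N matrix."""
    eigs, Us = zip(*[np.linalg.eigh(K) for K in Ks])
    D_hat = reduce(np.multiply.outer, eigs) + sigma2        # diag of hat D
    diags = [np.einsum('ja,jk,ka->a', U, dKi if k == i else Ks[k], U)
             for k, U in enumerate(Us)]                     # diag(U_k^T K~_k U_k)
    return np.sum(reduce(np.multiply.outer, diags) / D_hat)
```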
Gaussian Processes for factorial DoE

Issues:
1. High computational complexity, $O(N^3)$: in the case of a factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.
Degeneracy

Example of degeneracy.
[Figures: the original function (left) and the approximation obtained using GP from the GPML toolbox (right), with the training set shown on both surfaces.]
Regularization

Prior distribution:
$\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} \sim Be(\alpha, \beta)$, $\{i = 1, \dots, d_k\}_{k=1}^K$,
$a_k^{(i)} = \frac{c_k}{\max_{x, y \in s_k} \left(x^{(i)} - y^{(i)}\right)}$, $b_k^{(i)} = \frac{C_k}{\min_{x, y \in s_k, x \neq y} \left(x^{(i)} - y^{(i)}\right)}$,
where $Be(\alpha, \beta)$ is the Beta distribution, and $c_k$ and $C_k$ are parameters of the algorithm (we use $c_k = 0.01$ and $C_k = 2$).
Initialization:
$\theta_k^{(i)} = \left[n_k \left(\max_{x \in s_k} x^{(i)} - \min_{x \in s_k} x^{(i)}\right)\right]^{-1}$.
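A sketch of the resulting penalty term, assuming the bounds above have already been computed per dimension (scipy.special.betaln gives $\log B(\alpha, \beta)$; all names are illustrative):

```python
import numpy as np
from scipy.special import betaln

def log_beta_prior(theta, a, b, alpha=2.0, beta=2.0):
    """Log Beta(alpha, beta) prior on the rescaled length-scales
    t = (theta - a) / (b - a); added to the loglikelihood as a penalty."""
    t = (theta - a) / (b - a)
    return (np.sum((alpha - 1) * np.log(t) + (beta - 1) * np.log(1 - t))
            - betaln(alpha, beta) * theta.size)
```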
Regularization

Regularized loglikelihood:
$\log p(y \mid X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi + \sum_{k,i} \left[(\alpha - 1) \log\left(\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\right) + (\beta - 1) \log\left(1 - \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\right)\right] - d \log B(\alpha, \beta)$.
[Figure: the penalty function (log-density) for $\alpha = \beta = 2$.]
Regularization

Example of a regularized approximation.
[Figures: the original function (left) and the approximation obtained using the developed algorithm with regularization, tensorGP-reg (right), with the training set shown on both surfaces.]
Toy problems

A set of 34 functions (typical artificial test functions used for testing global optimization algorithms); dimensionality from 2 to 6; sample sizes from 80 to 216000 with full factorial DoE. Quality criteria: training time and approximation error,
$\mathrm{MSE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2$.
Tested algorithms:
MARS — Multivariate Adaptive Regression Splines;
SparseGP — sparse Gaussian Processes (GPML toolbox);
tensorGP — the developed algorithm without regularization;
tensorGP-reg — the developed algorithm with regularization.
Dolan-Moré curves

$T$ problems, $A$ algorithms. $e_{ta}$ is the approximation error (or training time) of the $a$-th algorithm on the $t$-th problem; $\tilde{e}_t = \min_a e_{ta}$.
$\rho_a(\tau) = \frac{\#\{t : e_{ta} \leq \tau \tilde{e}_t\}}{T}$.
The higher the curve, the better the corresponding algorithm works; $\rho_a(1)$ is the fraction of problems on which the $a$-th algorithm worked best.
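These curves (performance profiles) are straightforward to compute; a minimal sketch, where errors is a $T \times A$ array of per-problem, per-algorithm errors or times:

```python
import numpy as np

def dolan_more(errors, taus):
    """rho_a(tau) = fraction of problems where algorithm a is within a
    factor tau of the best result; rows are problems, columns algorithms."""
    ratios = errors / errors.min(axis=1, keepdims=True)    # e_ta / e~_t
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])
```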
Results of experiments: training time

[Figure: Dolan-Moré curves; the quality criterion is training time.]
Results of experiments: approximation quality

[Figure: Dolan-Moré curves; the quality criterion is MSE.]
Real problem: rotating disc of an impeller

Objective functions: $p_1$ — contact pressure; $S_{rmax}$ — maximum radial stress; $w$ — weight of the disc. The geometrical shape of the disc is parametrized by 6 input variables $x = (h_1, h_2, h_3, h_4, r_2, r_3)$ ($r_1$ and $r_4$ are fixed). Training sample (generated from a computational physical model): sample size 14400; factor sizes $\{1, 8, 8, 3, 15, 5\}$.
[Figure: rotating disc of an impeller.]
Results of experiment

Table: approximation errors of $p_1$.

              MAE       MSE       RRMS
MARS          4.4644    6.5120    0.1166
SparseGP      76.9313   86.7034   1.5530
tensorGP-reg  0.3020    0.3981    0.0070

$\mathrm{MAE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left|\hat{f}(x_i) - g(x_i)\right|$, $\mathrm{MSE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2$,
$\mathrm{RRMS} = \sqrt{\frac{\sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2}{\sum_{i=1}^{N_{test}} \left(\bar{y} - g(x_i)\right)^2}}$, $\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i$.
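The three error measures in code form (a sketch; y_hat are predictions on the test set, y_true the reference values, and y_bar the training-sample mean used by RRMS):

```python
import numpy as np

def error_measures(y_hat, y_true, y_bar):
    mae = np.mean(np.abs(y_hat - y_true))
    mse = np.mean((y_hat - y_true) ** 2)
    rrms = np.sqrt(np.sum((y_hat - y_true) ** 2)
                   / np.sum((y_bar - y_true) ** 2))
    return mae, mse, rrms
```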
Results of approximation

2D slices of the approximations (the other parameters are fixed).
[Figures: slices of $p_1$ over $x_5$ and $x_6$ with training and test sets; left — GPML sparse GP; right — the approximation obtained using the developed algorithm with regularization.]
Further research: missing data

Reasons for missing points: data generation is still in progress (each point is time-consuming to compute); the data generator failed at some points.
Incomplete Factorial DoE

$S_{full}$ — full factorial DoE, $N_{full} = |S_{full}|$. $S$ — incomplete factorial DoE, $N = |S|$, $S \subset S_{full}$. The covariance matrix is no longer a Kronecker product:
$K_f \neq \bigotimes_{i=1}^K K_i$.
[Figure: a two-dimensional factorial grid with missing points.]
Computation of loglikelihood

Notation: $\{x_i\}_{i=1}^{N_{full}} = S_{full}$; $K_f$ — the covariance matrix for the full factorial DoE $S_{full}$; $W$ — the diagonal matrix with $W_{ii} = 1$ if $x_i \in S$ and $W_{ii} = 0$ if $x_i \notin S$; $\tilde{y}$ — the vector of outputs $y$ extended by arbitrary values at the missing points.

Proposition. Let $\tilde{z}$ be a solution of
$\left(K_f W K_f + \sigma_{noise}^2 K_f\right) \tilde{z} = K_f W \tilde{y}$.   (3)
Then the solution $z$ of $(K_f + \sigma_{noise}^2 I) z = y$, i.e. $z = (K_f + \sigma_{noise}^2 I)^{-1} y$, has the form $z = (\tilde{z}_{i_1}, \dots, \tilde{z}_{i_N})$, where $i_k \in \{j : x_j \in S\}$, $k = 1, \dots, N$.
Computation of loglikelihood

$R = N_{full} - N$ — the number of missing points. $\tilde{U} = \bigotimes_{i=1}^K \tilde{U}_i$, $\tilde{D} = \bigotimes_{i=1}^K \tilde{D}_i$, $\hat{\tilde{D}} = \tilde{D} + \sigma_{noise}^2 I$. The change of variables:
$\alpha = \begin{cases} \left(\hat{\tilde{D}} \tilde{D}\right)^{\frac{1}{2}} \tilde{U}^T \tilde{z} & \text{if } R < N, \\ \tilde{D}^{\frac{1}{2}} \sigma_{noise}^2\, \tilde{U}^T \tilde{z} & \text{if } R \geq N. \end{cases}$   (4)

Proposition. The system of linear equations (3) can be solved using the change of variables (4) and the Conjugate Gradient Method in at most $\min(R + 1, N + 1)$ iterations. The computational complexity of each iteration is $O\left(N_{full} \sum_k n_k\right)$.
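A minimal sketch of solving the extended system (3) with conjugate gradients, using only Kronecker matrix-vector products; the change of variables (4) that bounds the iteration count is omitted here for brevity, so this plain version may need more iterations:

```python
from functools import reduce
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def incomplete_solve(Ks, sigma2, w, y_tilde):
    """Solve (K_f W K_f + sigma2 K_f) z~ = K_f W y~ on the full grid.
    Ks: per-factor covariance matrices; w: 0/1 mask of observed points
    (flattened in Fortran order); y_tilde: outputs padded at missing points."""
    shape = tuple(K.shape[0] for K in Ks)
    n_full = int(np.prod(shape))

    def Kf_mv(v):                        # K_f v via mode-wise products
        T = v.reshape(shape, order="F")
        for k, K in enumerate(Ks):
            T = np.moveaxis(np.tensordot(K, T, axes=(1, k)), 0, k)
        return T.ravel(order="F")

    def mv(z):                           # (K_f W K_f + sigma2 K_f) z
        Kz = Kf_mv(z)
        return Kf_mv(w * Kz) + sigma2 * Kz

    A = LinearOperator((n_full, n_full), matvec=mv)
    z, info = cg(A, Kf_mv(w * y_tilde))
    return z                             # restrict to observed indices for z
```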
Computation of determinant

Arrange the full-grid matrix in block form, with $K_g$ the block over the observed points:
$K_f + \sigma_{noise}^2 I = \begin{pmatrix} K_g & A \\ A^T & B \end{pmatrix}$.

Proposition.
$|K_g| = \frac{\left|K_f + \sigma_{noise}^2 I\right|}{\left|B - A^T K_g^{-1} A\right|}$.   (5)
$\left|K_f + \sigma_{noise}^2 I\right|$ is computed using the formulae for the full factorial case; $\left|B - A^T K_g^{-1} A\right|$ is computed numerically. The complexity of computing the determinant using (5) is
$O\left(\min\{R + 1, N + 1\}\, R\, N_{full} \sum_k n_k\right)$.
Conclusion

The developed algorithm is computationally efficient; can handle large samples; takes into account the special features of the given data; and has proved efficient on a large set of toy problems as well as on real-world problems.
Thank you for your attention!

More details are given in: Belyaev, M., Burnaev, E., and Kapushev, Y. (2014). Exact inference for Gaussian process regression in case of big data with the Cartesian product structure. arXiv preprint arXiv:1403.6573.
References

Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput., 18(4):1088-1107.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1):1-141.
Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 25:1371-1470.
Stroud, J. R., Stein, M. L., and Lysen, S. (2014). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. arXiv preprint arXiv:1402.4281.
Xiao, L., Li, Y., and Ruppert, D. (2013). Fast bivariate P-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):577-599.