Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure
1 Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Belyaev Mikhail (1,2,3), Burnaev Evgeny (1,2,3), Kapushev Yermek (1,2)
1 Institute for Information Transmission Problems, Moscow, Russia
2 DATADVANCE, llc, Moscow, Russia
3 PreMoLab, MIPT, Dolgoprudny, Russia

ICML 2014 workshop on New Learning Frameworks and Models for Big Data, Beijing

Burnaev Evgeny
2 Approximation task

Problem statement. Let $y = g(x)$ be some unknown function. The training sample is $D = \{(x_i, y_i)\}$, $g(x_i) = y_i$, $i = 1, \dots, N$. The task is to construct $\hat f(x)$ such that $\hat f(x) \approx g(x)$.
3 Factorial Design of Experiments

Factors: $s_k = \{x^k_{i_k} \in X_k,\ i_k = 1, \dots, n_k\}$, $X_k \subseteq \mathbb{R}^{d_k}$, $k = 1, \dots, K$, where $d_k$ is the dimensionality of the factor $s_k$.

Factorial Design of Experiments: $S = \{x_i\}_{i=1}^N = s_1 \times s_2 \times \dots \times s_K$.

Dimensionality of $x \in S$: $d = \sum_{k=1}^K d_k$. Sample size: $N = \prod_{k=1}^K n_k$.
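The construction above can be sketched in a few lines: a factorial design is the Cartesian product of the factor sets, with the coordinates of each combination concatenated. The factor values below are hypothetical, chosen only to make the sizes easy to check.

```python
import itertools

# Hypothetical factors: one 1-D factor with 3 levels, one 2-D factor with 2 levels
s1 = [(0.0,), (0.5,), (1.0,)]        # n_1 = 3, d_1 = 1
s2 = [(0.0, 0.0), (1.0, 1.0)]        # n_2 = 2, d_2 = 2

# Factorial design S = s1 x s2: Cartesian product, coordinates concatenated
S = [tuple(c for factor_value in combo for c in factor_value)
     for combo in itertools.product(s1, s2)]

assert len(S) == 3 * 2               # N = n_1 * n_2
assert len(S[0]) == 1 + 2            # d = d_1 + d_2
```

The sample size multiplies across factors while the dimensionality only adds, which is why factorial designs get large so quickly.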
4 Example of Factorial DoE

Figure: Factorial DoE with a multidimensional factor.
5 DoE in engineering problems

1. Independent groups of variables (factors).
2. Training-sample generation procedure: fix a value of the first factor and perform experiments varying the values of the other factors; then fix another value of the first factor and perform a new series of experiments.
3. Take into account knowledge from the subject domain.

Data properties: the data has a special structure, can have a large sample size, and factor sizes can differ significantly.
6 Example of Factorial DoE

Modeling of the pressure distribution over an aircraft wing:
- Angle of attack: $s_1 = \{0, 0.8, 1.6, 2.4, 3.2, 4.0\}$.
- Mach number: $s_2 = \{0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83\}$.
- Wing point coordinates: $s_3$, points in $\mathbb{R}^3$.

The training sample: $S = s_1 \times s_2 \times s_3$. Dimensionality $d = 1 + 1 + 3 = 5$. Sample size $N = 6 \cdot 7 \cdot n_3$.
7 Existing solutions

- Universal techniques. Disadvantages: do not take into account the sample structure; low approximation quality; high computational complexity.
- Multivariate Adaptive Regression Splines [Friedman, 1991]. Disadvantages: discontinuous derivatives, non-physical behaviour.
- Tensor products of splines [Stone et al., 1997, Xiao et al., 2013]. Disadvantages: only one-dimensional factors, no accuracy evaluation procedure.
- Gaussian Processes on a lattice [Dietrich and Newsam, 1997, Stroud et al., 2014]. Disadvantages: two-dimensional grid with equidistant points.
8 The aim is to develop a computationally efficient algorithm that takes into account the special features of factorial Designs of Experiments.
9 Gaussian Process Regression

Function model: $g(x) = f(x) + \varepsilon(x)$, where $f(x)$ is a Gaussian process (GP) and $\varepsilon(x)$ is Gaussian white noise. A GP is fully defined by its mean and covariance function.

The covariance function of $f(x)$:
$$K_f(x, x') = \sigma_f^2 \exp\Big(-\sum_{i=1}^d \theta_i^2 \big(x^{(i)} - x'^{(i)}\big)^2\Big),$$
where $x^{(i)}$ is the $i$-th component of the vector $x$ and $\theta = (\sigma_f^2, \theta_1, \dots, \theta_d)$ are the parameters of the covariance function.

The covariance function of $g(x)$:
$$K_g(x, x') = K_f(x, x') + \sigma_{noise}^2 \delta(x, x'),$$
where $\delta(x, x')$ is the Kronecker delta.
10 Choosing parameters

Maximum Likelihood Estimation. Loglikelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi,$$
where $|K_g|$ is the determinant of the matrix $K_g = \big(K_g(x_i, x_j)\big)_{i,j=1}^N$, $x_i, x_j \in S$. The parameters $\theta$ are chosen as
$$\theta^* = \arg\max_\theta \log p(y \mid X, \theta).$$
11 Final model

Prediction of $g(x)$ at a point $x$:
$$\hat f(x) = k^T K_g^{-1} y, \quad k = \big(K_f(x_1, x), \dots, K_f(x_N, x)\big)^T.$$
Posterior variance:
$$\sigma^2(x) = K_f(x, x) - k^T K_g^{-1} k.$$
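The prediction and posterior-variance formulas above can be sketched directly with dense linear algebra (the $O(N^3)$ baseline that the rest of the talk speeds up). This is a minimal 1-D illustration in numpy; the training function, $\theta = 10$ and $\sigma_{noise} = 10^{-3}$ are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def sq_exp(A, B, sigma_f=1.0, theta=10.0):
    # K_f(x, x') = sigma_f^2 * exp(-theta^2 (x - x')^2), 1-D inputs
    return sigma_f**2 * np.exp(-theta**2 * (A[:, None] - B[None, :]) ** 2)

X = np.linspace(0.0, 1.0, 20)          # training inputs
y = np.sin(2 * np.pi * X)              # toy responses

sigma_noise = 1e-3
Kg = sq_exp(X, X) + sigma_noise**2 * np.eye(len(X))   # K_g = K_f + sigma_noise^2 I

x_star = np.array([0.25])
k = sq_exp(X, x_star)[:, 0]            # k_i = K_f(x_i, x*)

f_hat = k @ np.linalg.solve(Kg, y)     # prediction: k^T K_g^{-1} y
var = sq_exp(x_star, x_star)[0, 0] - k @ np.linalg.solve(Kg, k)  # posterior variance
```

With a near-noiseless smooth target the prediction at $x_* = 0.25$ lands close to $\sin(\pi/2) = 1$ and the posterior variance between grid points is small but non-negative.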
12 Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: $O(N^3)$. In case of factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.
14 Estimation of covariance function parameters

Loglikelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi.$$
Derivatives:
$$\frac{\partial}{\partial\theta} \log p(y \mid X, \sigma_f, \sigma_{noise}) = -\frac{1}{2} \operatorname{Tr}\big(K_g^{-1} K'\big) + \frac{1}{2}\, y^T K_g^{-1} K' K_g^{-1} y,$$
where $\theta$ is a parameter of the covariance function (one of $\theta_i$, $i = 1, \dots, d$, $\sigma_{noise}$ or $\sigma_f$) and $K' = \frac{\partial K_g}{\partial \theta}$.
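The derivative formula above can be checked against a finite difference on a small dense example. This is a sketch with a single length-scale parameter $\theta$ of the squared-exponential covariance; the data and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=8)                  # toy 1-D inputs
y = rng.normal(size=8)                   # toy outputs
sigma_f, sigma_noise = 1.0, 0.1

def log_likelihood(theta):
    d2 = (X[:, None] - X[None, :]) ** 2
    Kg = sigma_f**2 * np.exp(-theta**2 * d2) + sigma_noise**2 * np.eye(len(X))
    return (-0.5 * y @ np.linalg.solve(Kg, y)
            - 0.5 * np.linalg.slogdet(Kg)[1]
            - 0.5 * len(X) * np.log(2 * np.pi))

theta = 1.3
d2 = (X[:, None] - X[None, :]) ** 2
Kf = sigma_f**2 * np.exp(-theta**2 * d2)
Kg = Kf + sigma_noise**2 * np.eye(len(X))
Kp = Kf * (-2.0 * theta * d2)            # K' = dK_g / dtheta
a = np.linalg.solve(Kg, y)               # K_g^{-1} y

# Analytic gradient: -1/2 Tr(K_g^{-1} K') + 1/2 y^T K_g^{-1} K' K_g^{-1} y
grad = -0.5 * np.trace(np.linalg.solve(Kg, Kp)) + 0.5 * a @ Kp @ a

# Central finite difference for comparison
eps = 1e-6
grad_fd = (log_likelihood(theta + eps) - log_likelihood(theta - eps)) / (2 * eps)
```

The two values agree to finite-difference accuracy, which is the standard sanity check before plugging the gradient into an optimizer.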
15 Tensors and Kronecker product

Definition. A tensor $\mathcal{Y}$ is a $K$-dimensional matrix of size $n_1 \times n_2 \times \dots \times n_K$:
$$\mathcal{Y} = \big\{y_{i_1, i_2, \dots, i_K},\ i_k = 1, \dots, n_k,\ k = 1, \dots, K\big\}.$$

Definition. The Kronecker product of matrices $A$ and $B$ is the block matrix
$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}.$$
16 Related operations

The vec operation:
$$\operatorname{vec}(\mathcal{Y}) = [\mathcal{Y}_{1,1,\dots,1},\ \mathcal{Y}_{2,1,\dots,1},\ \dots,\ \mathcal{Y}_{n_1,1,\dots,1},\ \mathcal{Y}_{1,2,\dots,1},\ \dots,\ \mathcal{Y}_{n_1,n_2,\dots,n_K}]^T.$$
Multiplication of a tensor by a matrix along the $k$-th direction, $\mathcal{Z} = \mathcal{Y} \times_k B$:
$$\mathcal{Z}_{i_1,\dots,i_{k-1},j,i_{k+1},\dots,i_K} = \sum_{i_k} \mathcal{Y}_{i_1,\dots,i_k,\dots,i_K}\, B_{i_k j}.$$
Connection between tensors and the Kronecker product:
$$\operatorname{vec}(\mathcal{Y} \times_1 B_1 \cdots \times_K B_K) = (B_1 \otimes \cdots \otimes B_K)\operatorname{vec}(\mathcal{Y}). \tag{1}$$
Computing the left-hand side costs $O\big(N \sum_k n_k\big)$; computing the right-hand side costs $O(N^2)$.
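Identity (1) is easy to check numerically for $K = 2$, as long as the vec ordering and the Kronecker factor order are kept consistent: with numpy's row-major `flatten()`, `np.kron(B1, B2)` acts as $B_1$ along the first tensor direction and $B_2$ along the second, and the mode products reduce to ordinary matrix multiplications. This is a sketch of that consistency check (and of the $O(N(n_1 + n_2))$ vs $O(N^2)$ cost gap), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 4
Y = rng.normal(size=(n1, n2))
B1 = rng.normal(size=(n1, n1))
B2 = rng.normal(size=(n2, n2))

# Right-hand side of (1): form the full N x N Kronecker matrix, O(N^2) memory/time
rhs = np.kron(B1, B2) @ Y.flatten()

# Left-hand side of (1): multiply along each direction, O(N (n1 + n2))
# For K = 2 the two mode products are just B1 @ Y @ B2^T
lhs = (B1 @ Y @ B2.T).flatten()

assert np.allclose(lhs, rhs)
```

For a Kronecker-structured covariance matrix this is exactly why matrix-vector products with $K = \bigotimes_i K_i$ never require materializing $K$ itself.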
17 Covariance function

Form of the covariance function:
$$K_f(x, y) = \prod_{i=1}^K k_i(x^i, y^i), \quad x^i, y^i \in s_i,$$
where $k_i$ is an arbitrary covariance function for the $i$-th factor. Covariance matrix:
$$K = \bigotimes_{i=1}^K K_i,$$
where $K_i$ is the covariance matrix of the $i$-th factor.
18 Fast computation of the loglikelihood

Proposition. Let $K_i = U_i D_i U_i^T$ be a Singular Value Decomposition (SVD) of the matrix $K_i$, where $U_i$ is orthogonal and $D_i$ is diagonal. Then:
$$|K_g| = \prod_{i_1,\dots,i_K} \mathcal{D}_{i_1,\dots,i_K},$$
$$K_g^{-1} = \Big(\bigotimes_k U_k\Big)\Big(\bigotimes_k D_k + \sigma_{noise}^2 I\Big)^{-1}\Big(\bigotimes_k U_k^T\Big), \tag{2}$$
$$K_g^{-1} y = \operatorname{vec}\Big[\big((\mathcal{Y} \times_1 U_1 \cdots \times_K U_K) \oslash \mathcal{D}\big) \times_1 U_1^T \cdots \times_K U_K^T\Big],$$
where $\oslash$ denotes element-wise division and $\mathcal{D}$ is the tensor of diagonal elements of the matrix $\sigma_{noise}^2 I + \bigotimes_k D_k$.
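The proposition can be verified on a small $K = 2$ example: decompose each factor matrix once, combine the per-factor eigenvalues into the tensor $\mathcal{D}$, and compare against a dense $O(N^3)$ computation. The sketch below uses `np.linalg.eigh` for the per-factor decomposition (for symmetric positive semi-definite covariance matrices the eigendecomposition and the SVD coincide); the factor sizes and squared-exponential kernels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def sq_exp(x):
    return np.exp(-(x[:, None] - x[None, :]) ** 2)

K1, K2 = sq_exp(rng.uniform(size=5)), sq_exp(rng.uniform(size=7))
sigma2 = 0.1                              # noise variance sigma_noise^2
Y = rng.normal(size=(5, 7))               # outputs arranged as a tensor
y = Y.flatten()                           # row-major vec

# Per-factor decompositions K_i = U_i diag(d_i) U_i^T
d1, U1 = np.linalg.eigh(K1)
d2, U2 = np.linalg.eigh(K2)

D = np.outer(d1, d2) + sigma2             # tensor of eigenvalues of K_g

# log|K_g| is the sum of log-eigenvalues
logdet_fast = np.log(D).sum()

# K_g^{-1} y: rotate into the eigenbasis, divide element-wise by D, rotate back
A = U1.T @ Y @ U2
Kg_inv_y = (U1 @ (A / D) @ U2.T).flatten()

# Dense O(N^3) reference
Kg = np.kron(K1, K2) + sigma2 * np.eye(35)
assert np.isclose(logdet_fast, np.linalg.slogdet(Kg)[1])
assert np.allclose(Kg_inv_y, np.linalg.solve(Kg, y))
```

Only the small $n_i \times n_i$ factor matrices are ever decomposed; the $N \times N$ matrix $K_g$ is built here purely as a correctness reference.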
19 Computational complexity

Proposition. Calculation of the loglikelihood using (2) has computational complexity
$$O\Big(N \sum_{i=1}^K n_i + \sum_{i=1}^K n_i^3\Big).$$
Assuming $n_i \approx \sqrt[K]{N}$ (the number of factors is large and their sizes are close) we get
$$O\Big(N \sum_i n_i\Big) = O\big(N^{1 + \frac{1}{K}}\big).$$
20 Fast computation of derivatives

Proposition. For a parameter $\theta$ of the $i$-th factor the following statements hold:
$$\operatorname{Tr}\big(K_g^{-1} K'\big) = \Big\langle \operatorname{diag}\big(\hat{\mathcal{D}}^{-1}\big),\ \operatorname{diag}(D_1) \otimes \cdots \otimes \operatorname{diag}\Big(U_i^T \frac{\partial K_i}{\partial \theta} U_i\Big) \otimes \cdots \otimes \operatorname{diag}(D_K) \Big\rangle,$$
$$\frac{1}{2}\, y^T K_g^{-1} K' K_g^{-1} y = \frac{1}{2} \Big\langle \mathcal{A},\ \mathcal{A} \times_1 K_1^T \cdots \times_{i-1} K_{i-1}^T \times_i \Big(\frac{\partial K_i}{\partial \theta}\Big)^T \times_{i+1} K_{i+1}^T \cdots \times_K K_K^T \Big\rangle,$$
where $\hat{\mathcal{D}} = \sigma_{noise}^2 I + \bigotimes_k D_k$ and $\operatorname{vec}(\mathcal{A}) = K_g^{-1} y$. The computational complexity is
$$O\Big(N \sum_{i=1}^K n_i + \sum_{i=1}^K n_i^3\Big).$$
21 Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: $O(N^3)$. In case of factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.
23 Degeneracy

Example of degeneracy.
Figure: Original function (true function and training set).
Figure: Approximation obtained using GP from the GPML toolbox (GP regression and training set).
24 Regularization

Prior distribution:
$$\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} \sim \mathrm{Be}(\alpha, \beta), \quad i = 1, \dots, d_k,\ k = 1, \dots, K,$$
$$a_k^{(i)} = \frac{c_k}{\max_{x,y \in s_k} \big|x^{(i)} - y^{(i)}\big|}, \qquad b_k^{(i)} = \frac{C_k}{\min_{x,y \in s_k,\, x \neq y} \big|x^{(i)} - y^{(i)}\big|},$$
where $\mathrm{Be}(\alpha, \beta)$ is the Beta distribution, and $c_k$ and $C_k$ are parameters of the algorithm (we use $c_k = 0.01$ and $C_k = 2$).

Initialization:
$$\theta_k^{(i)} = \Big[\frac{1}{n_k}\Big(\max_{x \in s_k} x^{(i)} - \min_{x \in s_k} x^{(i)}\Big)\Big]^{-1}.$$
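The prior turns into an additive log-density penalty on each rescaled parameter $t = (\theta - a)/(b - a) \in (0, 1)$. A minimal sketch of that penalty term, using only the standard library ($\alpha = \beta = 2$ is an assumed illustrative choice; the slides leave $\alpha, \beta$ unspecified):

```python
import math

def log_beta_prior(theta, a, b, alpha=2.0, beta=2.0):
    # log-density of Beta(alpha, beta) evaluated at t = (theta - a) / (b - a);
    # returns -inf outside (a, b), which keeps theta inside its bounds
    t = (theta - a) / (b - a)
    if not 0.0 < t < 1.0:
        return -math.inf
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1) * math.log(t) + (beta - 1) * math.log(1 - t) - log_B
```

Summing this term over all $\theta_k^{(i)}$ and adding it to the loglikelihood penalizes parameters drifting toward the degenerate extremes of their feasible range.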
25 Regularization

Loglikelihood with the penalty:
$$\log p(y \mid X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi$$
$$+ \sum_{k,i} \Big[(\alpha - 1) \log \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} + (\beta - 1) \log\Big(1 - \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\Big)\Big] - d \log B(\alpha, \beta).$$
26 Regularization

Figure: Penalty function (log pdf of the prior).
27 Regularization

Example of a regularized approximation.
Figure: Original function (true function and training set).
Figure: Approximation obtained using the developed algorithm with regularization (tensorGP-reg and training set).
28 Toy problems

A set of 34 functions (artificial test functions commonly used for testing global optimization algorithms); dimensionality from 2 to 6; sample sizes from 80 up to … with full factorial DoE.

Quality criteria: training time and approximation error,
$$\mathrm{MSE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \big(\hat f(x_i) - g(x_i)\big)^2.$$

Tested algorithms:
- MARS: Multivariate Adaptive Regression Splines;
- SparseGP: sparse Gaussian Processes (GPML toolbox);
- tensorGP: the developed algorithm without regularization;
- tensorGP-reg: the developed algorithm with regularization.
29 Dolan-Moré curves

$T$ problems, $A$ algorithms. $e_{ta}$ is the approximation error (or training time) of the $a$-th algorithm on the $t$-th problem, and $\tilde e_t = \min_a e_{ta}$. Then
$$\rho_a(\tau) = \frac{\#\{t : e_{ta} \leq \tau \tilde e_t\}}{T}.$$
The higher the curve, the better the corresponding algorithm works; $\rho_a(1)$ is the fraction of problems on which the $a$-th algorithm worked best.
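Computing such a performance profile from an error matrix takes only a few lines. This is a sketch (not the authors' evaluation code); it counts ties with the best as "best", so $\rho_a(1)$ is the fraction of problems where algorithm $a$ is at least tied for the lowest error.

```python
import numpy as np

def dolan_more(errors, taus):
    # errors: (T problems x A algorithms) matrix of positive errors or times
    e_tilde = errors.min(axis=1)                 # best result per problem
    # rho_a(tau): fraction of problems where algorithm a is within factor tau of best
    return np.array([[np.mean(errors[:, a] <= tau * e_tilde)
                      for a in range(errors.shape[1])] for tau in taus])

# Hypothetical errors for 3 problems and 2 algorithms
errors = np.array([[1.0, 2.0],
                   [3.0, 1.5],
                   [2.0, 2.0]])
rho = dolan_more(errors, taus=[1.0, 2.0])
```

Plotting $\rho_a(\tau)$ against $\tau$ (one curve per algorithm) gives the Dolan-Moré plot: curves that rise faster correspond to algorithms that are closer to the per-problem best more often.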
30 Results of experiments: training time

Figure: Dolan-Moré curves; quality criterion: training time.
31 Results of experiments: approximation quality

Figure: Dolan-Moré curves; quality criterion: MSE.
32 Real problem: rotating disc of an impeller

Objective functions: $p_1$ — contact pressure; $S_{rmax}$ — maximum radial stress; $w$ — weight of the disc. The geometrical shape of the disc is parametrized by 6 input variables $x = (h_1, h_2, h_3, h_4, r_2, r_3)$ ($r_1$ and $r_4$ are fixed).

Training sample (generated from a computational physical model): factor sizes $\{1, 8, 8, 3, 15, 5\}$.

Figure: Rotating disc of an impeller.
33 Results of experiment

Table: Approximation errors of $p_1$ for MARS, SparseGP and tensorGP-reg, measured by MAE, MSE and RRMS, where
$$\mathrm{MAE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \big|\hat f(x_i) - g(x_i)\big|, \qquad \mathrm{MSE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \big(\hat f(x_i) - g(x_i)\big)^2,$$
$$\mathrm{RRMS} = \sqrt{\frac{\sum_{i=1}^{N_{test}} \big(\hat f(x_i) - g(x_i)\big)^2}{\sum_{i=1}^{N_{test}} \big(\bar y - g(x_i)\big)^2}}, \qquad \bar y = \frac{1}{N} \sum_{i=1}^N y_i.$$
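The three error metrics above are straightforward to implement; a minimal numpy sketch:

```python
import numpy as np

def mae(f, g):
    return np.mean(np.abs(f - g))

def mse(f, g):
    return np.mean((f - g) ** 2)

def rrms(f, g):
    # squared error normalized by the deviation of g from its mean:
    # RRMS < 1 means the model beats the constant mean predictor
    return np.sqrt(np.sum((f - g) ** 2) / np.sum((g.mean() - g) ** 2))
```

RRMS is the most interpretable of the three for model comparison, since it is scale-free: a value near 0 is an accurate surrogate, while a value near 1 is no better than predicting the average output.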
34 Results of approximation

2D slices of the approximations (other parameters are fixed).
Figure: GPML SparseGP applied.
Figure: Approximation obtained using the developed algorithm with regularization (tensorGP-reg, training and test sets).
35 Further research: missing data

Reasons for missing points:
- Data generation is in progress (each point is time-consuming to compute).
- The data generator failed at some points.
36 Incomplete factorial DoE

$S_{full}$ — full factorial DoE, $N_{full} = |S_{full}|$. $S$ — incomplete factorial DoE, $N = |S|$, $S \subset S_{full}$.

The covariance matrix is no longer a Kronecker product:
$$K_f \neq \bigotimes_{i=1}^K K_i.$$
37 Computation of the loglikelihood

Notation: $\{x_i\}_{i=1}^{N_{full}} = S_{full}$; $\tilde K_f$ — the covariance matrix for the full factorial DoE $S_{full}$; $W$ — the diagonal matrix with
$$W_{ii} = \begin{cases} 1, & x_i \in S, \\ 0, & x_i \notin S; \end{cases}$$
$\tilde y$ — the vector of outputs $y$ extended by arbitrary values at the missing points.

Proposition. Let $\tilde z$ be a solution of
$$\big(\tilde K_f W \tilde K_f + \sigma_{noise}^2 \tilde K_f\big) \tilde z = \tilde K_f W \tilde y. \tag{3}$$
Then the solution $z$ of $(K_f + \sigma_{noise}^2 I) z = y$, i.e. $z = (K_f + \sigma_{noise}^2 I)^{-1} y$, has the form $z = (\tilde z_{i_1}, \dots, \tilde z_{i_N})$, where $\{i_k\}_{k=1}^N = \{j : x_j \in S\}$.
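The proposition can be checked on a small dense example: build a Kronecker-structured covariance for a full grid, drop some points, solve the extended system (3), and compare its restriction to the observed indices against a direct solve on the observed sub-matrix. Grid sizes, kernels and the fill values for missing outputs are all hypothetical; the dense solves stand in for the structured solver the slides describe.

```python
import numpy as np

rng = np.random.default_rng(4)

def sq_exp(x):
    return np.exp(-(x[:, None] - x[None, :]) ** 2)

# Full factorial grid with 4 x 3 = 12 points; N = 9 of them are observed
K_full = np.kron(sq_exp(rng.uniform(size=4)), sq_exp(rng.uniform(size=3)))
N_full, sigma2 = 12, 0.01
obs = np.sort(rng.choice(N_full, size=9, replace=False))
W = np.zeros((N_full, N_full))
W[obs, obs] = 1.0                      # W_ii = 1 iff point i is observed

y_obs = rng.normal(size=9)
y_ext = rng.normal(size=N_full)        # arbitrary values at missing points
y_ext[obs] = y_obs

# Extended system (3): (K~ W K~ + sigma^2 K~) z~ = K~ W y~
z_ext = np.linalg.solve(K_full @ W @ K_full + sigma2 * K_full,
                        K_full @ W @ y_ext)

# Direct solve on the observed sub-matrix
K_obs = K_full[np.ix_(obs, obs)]
z_direct = np.linalg.solve(K_obs + sigma2 * np.eye(9), y_obs)

assert np.allclose(z_ext[obs], z_direct)
```

Because $W$ zeroes out the rows of the missing points, the arbitrary fill values in $\tilde y$ never influence the restricted solution, which is what makes the extended, Kronecker-friendly system usable.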
38 Computation of the loglikelihood

$R = N_{full} - N$ — the number of missing points; $\tilde U = \bigotimes_{i=1}^K U_i$, $\tilde D = \bigotimes_{i=1}^K D_i$, $\hat{\tilde D} = \tilde D + \sigma_{noise}^2 I$. The change of variables:
$$\alpha = \begin{cases} \big(\hat{\tilde D}\, \tilde D\big)^{1/2}\, \tilde U^T \tilde z, & R < N, \\ \sigma_{noise}^{-2}\, \tilde D^{1/2}\, \tilde U^T \tilde z, & R \geq N. \end{cases} \tag{4}$$

Proposition. The system of linear equations (3) can be solved using the change of variables (4) and the Conjugate Gradient method in at most $\min(R + 1, N + 1)$ iterations. The computational complexity of each iteration is $O\big(N_{full} \sum_k n_k\big)$.
39 Computation of the determinant

Write
$$\tilde K_f + \sigma_{noise}^2 I = \begin{pmatrix} K_g & A \\ A^T & B \end{pmatrix},$$
where $K_g$ is the block corresponding to the observed points. Then
$$\big|\tilde K_f + \sigma_{noise}^2 I\big| = |K_g| \cdot \big|B - A^T K_g^{-1} A\big|. \tag{5}$$

Proposition. $\big|\tilde K_f + \sigma_{noise}^2 I\big|$ is computed using the formulae for the full factorial case; $B - A^T K_g^{-1} A$ is computed numerically. The complexity of computing the determinant using (5) is $O\big(\min\{R + 1, N + 1\}\, R\, N_{full} \sum_k n_k\big)$.
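The block-determinant identity (5) rearranges to $\log|K_g| = \log|\tilde K_f + \sigma_{noise}^2 I| - \log|B - A^T K_g^{-1} A|$, which a small dense check confirms. The grid and kernels below are hypothetical, and for simplicity the observed points are taken to be the first block of the ordering.

```python
import numpy as np

rng = np.random.default_rng(5)

def sq_exp(x):
    return np.exp(-(x[:, None] - x[None, :]) ** 2)

K_full = np.kron(sq_exp(rng.uniform(size=4)), sq_exp(rng.uniform(size=3)))
N_full, sigma2 = 12, 0.01
obs = np.arange(9)                     # observed points first in the ordering
mis = np.arange(9, 12)                 # R = 3 missing points last

M = K_full + sigma2 * np.eye(N_full)   # full-grid matrix, Kronecker-friendly
Kg = M[np.ix_(obs, obs)]               # covariance block of the observed points
A = M[np.ix_(obs, mis)]
B = M[np.ix_(mis, mis)]

# |M| = |K_g| * |B - A^T K_g^{-1} A|  =>  log|K_g| = log|M| - log|Schur complement|
schur = B - A.T @ np.linalg.solve(Kg, A)
logdet_Kg = np.linalg.slogdet(M)[1] - np.linalg.slogdet(schur)[1]

assert np.isclose(logdet_Kg, np.linalg.slogdet(Kg)[1])
```

The point of the rearrangement is that $\log|M|$ has the cheap full-factorial formula from slide 18, while the Schur complement is only $R \times R$.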
40 Conclusion

The developed algorithm is computationally efficient; can handle large samples; takes into account the special features of the given data; and is shown to be efficient on a large set of toy problems as well as on real-world problems.
41 Thank you for your attention!

More details are given in: Belyaev, M., Burnaev, E., and Kapushev, Y. (2014). Exact inference for Gaussian process regression in case of big data with the Cartesian product structure. arXiv preprint.
42 References

Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput., 18(4).

Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1).

Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 25.

Stroud, J. R., Stein, M. L., and Lysen, S. (2014). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. arXiv preprint.

Xiao, L., Li, Y., and Ruppert, D. (2013). Fast bivariate P-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3).
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationAn explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach
Intro B, W, M, & R SPDE/GMRF Example End An explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach Finn Lindgren 1 Håvard Rue 1 Johan
More informationLecture 6: Logistic Regression
Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,
More informationExtrinsic geometric flows
On joint work with Vladimir Rovenski from Haifa Paweł Walczak Uniwersytet Łódzki CRM, Bellaterra, July 16, 2010 Setting Throughout this talk: (M, F, g 0 ) is a (compact, complete, any) foliated, Riemannian
More informationTime Series Analysis
Time Series Analysis hm@imm.dtu.dk Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Identification of univariate time series models, cont.:
More informationDesigning a learning system
Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing
More informationRegression Analysis. Regression Analysis MIT 18.S096. Dr. Kempthorne. Fall 2013
Lecture 6: Regression Analysis MIT 18.S096 Dr. Kempthorne Fall 2013 MIT 18.S096 Regression Analysis 1 Outline Regression Analysis 1 Regression Analysis MIT 18.S096 Regression Analysis 2 Multiple Linear
More informationMath 241, Exam 1 Information.
Math 241, Exam 1 Information. 9/24/12, LC 310, 11:15-12:05. Exam 1 will be based on: Sections 12.1-12.5, 14.1-14.3. The corresponding assigned homework problems (see http://www.math.sc.edu/ boylan/sccourses/241fa12/241.html)
More informationTutorial on Markov Chain Monte Carlo
Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,
More informationDetection of changes in variance using binary segmentation and optimal partitioning
Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the
More information1 VECTOR SPACES AND SUBSPACES
1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such
More information5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
More informationRecall the basic property of the transpose (for any A): v A t Aw = v w, v, w R n.
ORTHOGONAL MATRICES Informally, an orthogonal n n matrix is the n-dimensional analogue of the rotation matrices R θ in R 2. When does a linear transformation of R 3 (or R n ) deserve to be called a rotation?
More informationMAT188H1S Lec0101 Burbulla
Winter 206 Linear Transformations A linear transformation T : R m R n is a function that takes vectors in R m to vectors in R n such that and T (u + v) T (u) + T (v) T (k v) k T (v), for all vectors u
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationSmoothing and Non-Parametric Regression
Smoothing and Non-Parametric Regression Germán Rodríguez grodri@princeton.edu Spring, 2001 Objective: to estimate the effects of covariates X on a response y nonparametrically, letting the data suggest
More informationA General Approach to Variance Estimation under Imputation for Missing Survey Data
A General Approach to Variance Estimation under Imputation for Missing Survey Data J.N.K. Rao Carleton University Ottawa, Canada 1 2 1 Joint work with J.K. Kim at Iowa State University. 2 Workshop on Survey
More informationIntroduction to the Finite Element Method
Introduction to the Finite Element Method 09.06.2009 Outline Motivation Partial Differential Equations (PDEs) Finite Difference Method (FDM) Finite Element Method (FEM) References Motivation Figure: cross
More informationBayesian Image Super-Resolution
Bayesian Image Super-Resolution Michael E. Tipping and Christopher M. Bishop Microsoft Research, Cambridge, U.K..................................................................... Published as: Bayesian
More informationAnalysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
More informationHowHow To Choose A Good Stock Broker
AN EMPIRICAL LIELIHOOD METHOD FOR DATA AIDED CHANNEL IDENTIFICATION IN UNNOWN NOISE FIELD Frédéric Pascal 1, Jean-Pierre Barbot 1, Hugo Harari-ermadec, Ricardo Suyama 3, and Pascal Larzabal 1 1 SATIE,
More informationClassification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
More informationDerivative Free Optimization
Department of Mathematics Derivative Free Optimization M.J.D. Powell LiTH-MAT-R--2014/02--SE Department of Mathematics Linköping University S-581 83 Linköping, Sweden. Three lectures 1 on Derivative Free
More informationO(n 3 ) O(4 d nlog 2d n) O(n) O(4 d log 2d n)
O(n 3 )O(4 d nlog 2d n) O(n)O(4 d log 2d n) O(n 3 ) O(n) O(n 3 ) O(4 d nlog 2d n) O(n)O(4 d log 2d n) 1. Fast Gaussian Filtering Algorithm 2. Binary Classification Using The Fast Gaussian Filtering Algorithm
More informationComplex Numbers. w = f(z) z. Examples
omple Numbers Geometrical Transformations in the omple Plane For functions of a real variable such as f( sin, g( 2 +2 etc ou are used to illustrating these geometricall, usuall on a cartesian graph. If
More information