Gaussian Processes for Big Data Problems


1 Gaussian Processes for Big Data Problems
Marc Deisenroth, Department of Computing, Imperial College London
Machine Learning Summer School, Chalmers University, 14 April 2015

2 Gaussian Processes

3-4 Problem Setting
(Figures: noisy observations y of a latent function f(x).)
Objective: for a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a distribution over functions p(f) that explains the data.
This is a probabilistic regression problem.

5 (Some) Relevant Application Areas
Global black-box optimization and experimental design
Autonomous learning in robotics
Probabilistic dimensionality reduction and data visualization

6-7 Table of Contents
Introduction: Gaussian Distribution
Gaussian Processes: Definition and Derivation, Inference, Covariance Functions, Model Selection, Limitations
Scaling Gaussian Processes to Large Data Sets: Sparse Gaussian Processes, Distributed Gaussian Processes

8-10 Key Concepts in Probability Theory
Two fundamental rules:
  Sum rule (marginalization property): p(x) = ∫ p(x, y) dy
  Product rule: p(x, y) = p(y | x) p(x)
Bayes' theorem (probabilistic inverse):
  p(x | y) = p(y | x) p(x) / p(y),   x: hypothesis, y: measurement
Terminology: p(x | y) is the posterior belief, p(x) the prior belief, p(y | x) the likelihood (measurement model), and p(y) the marginal likelihood (normalization constant).

11-13 The Gaussian Distribution
  p(x | µ, Σ) = (2π)^(-D/2) |Σ|^(-1/2) exp(-½ (x - µ)ᵀ Σ⁻¹ (x - µ))
Mean vector µ: average of the data
Covariance matrix Σ: spread of the data
(Figures: a 1D Gaussian density p(x) and a 2D Gaussian over (x₁, x₂), each shown with data, mean, and 95% confidence bounds.)

14-16 Sampling from a Multivariate Gaussian
Objective: generate a random sample y ∼ N(µ, Σ) from a D-dimensional joint Gaussian with covariance matrix Σ and mean vector µ. However, we only have access to a random number generator that can sample x from N(0, I)...
Exploit that affine transformations y = Ax + b of Gaussians remain Gaussian:
  Mean: E_x[Ax + b] = A E_x[x] + b
  Covariance: V_x[Ax + b] = A V_x[x] Aᵀ
1. Find conditions on A, b that match the mean of y
2. Find conditions on A, b that match the covariance of y

17-18 Sampling from a Multivariate Gaussian (2)
Objective: generate a random sample y ∼ N(µ, Σ) from a D-dimensional joint Gaussian with covariance matrix Σ and mean vector µ.
  x = randn(D,1);       % sample x ~ N(0, I)
  y = chol(Σ)'*x + µ;   % scale and shift x
Here chol(Σ) is the Cholesky factor L, such that LᵀL = Σ (MATLAB's chol returns an upper-triangular factor, hence the transpose in the code).
Therefore, the mean and covariance of y are
  E[y] = ȳ = E[Lᵀx + µ] = Lᵀ E[x] + µ = µ
  Cov[y] = E[(y - ȳ)(y - ȳ)ᵀ] = E[Lᵀ x xᵀ L] = Lᵀ E[x xᵀ] L = LᵀL = Σ
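The same recipe as a minimal NumPy sketch (note that np.linalg.cholesky returns the lower-triangular factor L with LLᵀ = Σ, so no transpose is needed there; function and variable names are illustrative):

    import numpy as np

    def sample_mvn(mu, Sigma, n_samples=1, rng=None):
        """Draw samples from N(mu, Sigma) via the Cholesky factor of Sigma."""
        rng = np.random.default_rng() if rng is None else rng
        D = mu.shape[0]
        L = np.linalg.cholesky(Sigma)             # lower-triangular L with L @ L.T = Sigma
        x = rng.standard_normal((D, n_samples))   # x ~ N(0, I)
        return (mu[:, None] + L @ x).T            # affine transformation y = L x + mu

    # Usage: the sample mean and covariance should be close to mu and Sigma
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    y = sample_mvn(mu, Sigma, n_samples=10000)
    print(y.mean(axis=0), np.cov(y.T))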

19-21 Conditional
(Figures: joint density p(x, y), an observation of y, and the resulting conditional p(x | y).)
Joint:
  p(x, y) = N( [µ_x; µ_y], [Σ_xx Σ_xy; Σ_yx Σ_yy] )
Conditioning on an observed y gives
  p(x | y) = N(µ_{x|y}, Σ_{x|y})
  µ_{x|y} = µ_x + Σ_xy Σ_yy⁻¹ (y - µ_y)
  Σ_{x|y} = Σ_xx - Σ_xy Σ_yy⁻¹ Σ_yx
The conditional p(x | y) is also Gaussian, which is computationally convenient.

22 Marginal
(Figure: joint density p(x, y) and the marginal p(x).)
  p(x, y) = N( [µ_x; µ_y], [Σ_xx Σ_xy; Σ_yx Σ_yy] )
  p(x) = ∫ p(x, y) dy = N(µ_x, Σ_xx)
The marginal of a joint Gaussian distribution is Gaussian.
Intuitively: ignore (integrate out) everything you are not interested in.
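As a quick numerical check of these marginalization and conditioning formulas, a small NumPy sketch (the block means and covariances are made up for illustration):

    import numpy as np

    # Joint Gaussian over (x, y): block means and covariances (illustrative values)
    mu_x, mu_y = np.array([0.0]), np.array([1.0])
    S_xx, S_xy = np.array([[2.0]]), np.array([[0.9]])
    S_yx, S_yy = S_xy.T, np.array([[1.0]])

    # Marginal p(x) = N(mu_x, S_xx): simply read off the corresponding blocks
    print(mu_x, S_xx)

    # Conditional p(x | y) for an observed y, using the formulas above
    y_obs = np.array([2.5])
    mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)   # mu_x + S_xy S_yy^{-1} (y - mu_y)
    S_cond = S_xx - S_xy @ np.linalg.solve(S_yy, S_yx)            # S_xx - S_xy S_yy^{-1} S_yx
    print(mu_cond, S_cond)                                        # N(1.35, 1.19)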

23 Table of Contents
Introduction: Gaussian Distribution
Gaussian Processes: Definition and Derivation, Inference, Covariance Functions, Model Selection, Limitations
Scaling Gaussian Processes to Large Data Sets: Sparse Gaussian Processes, Distributed Gaussian Processes

24-28 Gaussian Process
Definition: a Gaussian process (GP) is a collection of random variables x₁, x₂, ..., any finite number of which is Gaussian distributed.
Unexpected (?) example: the linear dynamical system
  x_{t+1} = A x_t + w,   w ∼ N(0, Q)
Here x₁, x₂, ... is a collection of random variables, any finite number of which is Gaussian distributed. This is a GP.
But we will use the Gaussian process for a different purpose, not for time series.

29-31 The Gaussian in the Limit
Consider the joint Gaussian distribution p(x, x̃), where x ∈ R^D and x̃ ∈ R^k, k → ∞. Then
  p(x, x̃) = N( [µ_x; µ_x̃], [Σ_xx Σ_xx̃; Σ_x̃x Σ_x̃x̃] )
where Σ_x̃x̃ ∈ R^{k×k} and Σ_xx̃ ∈ R^{D×k}, k → ∞.
However, the marginal remains finite:
  p(x) = ∫ p(x, x̃) dx̃ = N(µ_x, Σ_xx)
where we integrate out an infinite number of variables x̃_i.

32-36 Marginal and Conditional in the Limit
In practice, we consider finite training and test data x_train, x_test.
Then x = {x_train, x_test, x_other}, where x_other plays the role of x̃ from the previous slide:
  p(x) = N( [µ_train; µ_test; µ_other],
            [Σ_train        Σ_train,test   Σ_train,other;
             Σ_test,train   Σ_test         Σ_test,other;
             Σ_other,train  Σ_other,test   Σ_other] )
Marginal:
  p(x_train, x_test) = ∫ p(x_train, x_test, x_other) dx_other
Conditional:
  p(x_test | x_train) = N(µ*, Σ*)
  µ* = µ_test + Σ_test,train Σ_train⁻¹ (x_train - µ_train)
  Σ* = Σ_test - Σ_test,train Σ_train⁻¹ Σ_train,test

37-39 Back to Regression: Distribution over Functions
We are not really interested in a distribution over inputs x, but rather in a distribution over function values f(x).
Let us replace x with function values f = f(x), treating a function as a long vector of function values:
  p(f, f̃) = N( [µ_f; µ_f̃], [Σ_ff Σ_ff̃; Σ_f̃f Σ_f̃f̃] )
where Σ_f̃f̃ ∈ R^{k×k} and Σ_ff̃ ∈ R^{N×k}, k → ∞.
Again, the marginal remains finite:
  p(f) = ∫ p(f, f̃) df̃ = N(µ_f, Σ_ff)

40-44 Marginal and Conditional over Functions
Define f* := f_test and f := f_train.
Marginal:
  p(f, f*) = N( [µ_f; µ*], [Σ_ff Σ_f*; Σ_*f Σ_**] )
Conditional (predictive distribution):
  p(f* | f) = N(m*, S*)
  m* = µ* + Σ_*f Σ_ff⁻¹ (f - µ_f)
  S* = Σ_** - Σ_*f Σ_ff⁻¹ Σ_f*
We need to compute (cross-)covariances between unknown function values. The kernel trick helps us out.

45-48 Kernelization
  p(f, f*) = N( [µ_f; µ*], [Σ_ff Σ_f*; Σ_*f Σ_**] )
  p(f* | f) = N(m*, S*),   m* = µ* + Σ_*f Σ_ff⁻¹ (f - µ_f),   S* = Σ_** - Σ_*f Σ_ff⁻¹ Σ_f*
A kernel function k is symmetric and positive definite.
The kernel computes covariances between unknown function values f(x_i) and f(x_j) by just looking at the corresponding inputs x_i, x_j:
  Σ_ff^(i,j) = Cov[f(x_i), f(x_j)] = k(x_i, x_j)
This yields the predictive distribution
  p(f* | f, X, X*) = N(m*, S*)
  m* = µ* + k(X*, X) k(X, X)⁻¹ (f - µ_f)
  S* = K_** - k(X*, X) k(X, X)⁻¹ k(X, X*)

49-55 GP Regression as a Bayesian Inference Problem
Objective: for a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a distribution over functions p(f) that explains the data.
Training data: X, y. Bayes' theorem yields
  p(f | X, y) = p(y | f, X) p(f) / p(y | X)
Prior: p(f) = GP(m, k). Specify a mean function m and a kernel k.
Likelihood (noise model): p(y | f, X) = N(f(X), σ_ε² I)
Marginal likelihood (evidence): p(y | X) = ∫ p(y | f, X) p(f) df
Posterior: p(f | y, X) = GP(m_post, k_post), with
  m_post(x_i) = m(x_i) + k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ (y - m(X))
  k_post(x_i, x_j) = k(x_i, x_j) - k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ k(X, x_j)

56-58 GP Regression as a Bayesian Inference Problem (2)
Posterior Gaussian process:
  m_post(x_i) = m(x_i) + k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ (y - m(X))
  k_post(x_i, x_j) = k(x_i, x_j) - k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ k(X, x_j)
Predictive distribution p(f* | X, y, x*) at test inputs x*:
  p(f* | X, y, x*) = N(E[f*], V[f*])
  E[f* | X, y, x*] = m_post(x*) = m(x*) + k(X, x*)ᵀ (K + σ_ε² I)⁻¹ (y - m(X))
  V[f* | X, y, x*] = k_post(x*, x*) = k(x*, x*) - k(X, x*)ᵀ (K + σ_ε² I)⁻¹ k(X, x*)
From now on: set the prior mean function m ≡ 0.
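Putting these prediction equations into code, a minimal NumPy sketch of GP regression with zero prior mean and the Gaussian kernel introduced on the following slides (hyper-parameters fixed by hand; all names and data are illustrative):

    import numpy as np

    def k_gauss(A, B, theta1=1.0, theta2=1.0):
        """Gaussian kernel k(x_i, x_j) = theta1^2 exp(-||x_i - x_j||^2 / theta2^2)."""
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return theta1**2 * np.exp(-sq / theta2**2)

    def gp_predict(X, y, X_star, sigma_eps=0.1, **kp):
        """Posterior mean and marginal variance at test inputs X_star (zero prior mean)."""
        K = k_gauss(X, X, **kp) + sigma_eps**2 * np.eye(len(X))   # K + sigma_eps^2 I
        k_star = k_gauss(X, X_star, **kp)                         # k(X, x*)
        alpha = np.linalg.solve(K, y)                             # (K + sigma_eps^2 I)^{-1} y
        mean = k_star.T @ alpha                                   # posterior mean
        v = np.linalg.solve(K, k_star)
        var = np.diag(k_gauss(X_star, X_star, **kp)) - np.sum(k_star * v, axis=0)
        return mean, var

    # Toy usage on noisy observations of a sine function
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (20, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)
    X_star = np.linspace(-4, 4, 5).reshape(-1, 1)
    print(gp_predict(X, y, X_star))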

59-69 Illustration
(Figures: GP mean and 95% confidence region, first under the prior and then under the posterior as more and more observations are added.)
Prior belief about the function; predictive (marginal) mean and variance:
  E[f(x*) | x*] = m(x*) = 0
  V[f(x*) | x*] = σ²(x*) = k(x*, x*)
Posterior belief about the function after observing X, y:
  E[f(x*) | x*, X, y] = m(x*) = k(X, x*)ᵀ (K + σ_ε² I)⁻¹ y
  V[f(x*) | x*, X, y] = σ²(x*) = k(x*, x*) - k(X, x*)ᵀ (K + σ_ε² I)⁻¹ k(X, x*)

70 Covariance Function
A Gaussian process is fully specified by a mean function m and a kernel/covariance function k.
The covariance function encodes high-level structural assumptions about the latent function f (e.g., smoothness, differentiability, periodicity).

71 Gaussian Covariance Function
  k_Gauss(x_i, x_j) = θ₁² exp(-(x_i - x_j)ᵀ(x_i - x_j) / θ₂²)
θ₁: amplitude of the latent function
θ₂: length scale, i.e., how far we have to move in input space before the function value changes significantly (smoothness parameter)

72 Matérn Covariance Function
  k_Mat,3/2(x_i, x_j) = θ₁² (1 + √3 |x_i - x_j| / θ₂) exp(-√3 |x_i - x_j| / θ₂)

73 Periodic Covariance Function
  k_per(x_i, x_j) = θ₁² exp(-2 sin²(π (x_i - x_j) / θ₃) / θ₂²) = k_Gauss(u(x_i), u(x_j)),   u(x) = [cos(x), sin(x)]ᵀ
θ₃: periodicity parameter
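For concreteness, the three covariance functions as NumPy functions of the distance r = |x_i - x_j| (the periodic kernel below uses one common parameterization of the period θ₃, taken as an assumption here):

    import numpy as np

    def k_gauss(r, theta1=1.0, theta2=1.0):
        """Gaussian kernel: theta1 = amplitude, theta2 = length scale."""
        return theta1**2 * np.exp(-r**2 / theta2**2)

    def k_matern32(r, theta1=1.0, theta2=1.0):
        """Matern 3/2 kernel."""
        s = np.sqrt(3.0) * r / theta2
        return theta1**2 * (1.0 + s) * np.exp(-s)

    def k_periodic(r, theta1=1.0, theta2=1.0, theta3=2 * np.pi):
        """Periodic kernel with period theta3 (one common parameterization)."""
        return theta1**2 * np.exp(-2.0 * np.sin(np.pi * r / theta3)**2 / theta2**2)

    r = np.linspace(0.0, 5.0, 6)
    print(k_gauss(r), k_matern32(r), k_periodic(r))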

74 Model Selection
A GP is fully specified by a mean function m and a covariance function (kernel) k. Both functions possess parameters θ := {θ_m, θ_k}.
How do we find good parameters θ? How do we choose m and k? This is a model selection problem.

75-78 Model Selection I: Length-Scales
Length scales determine how wiggly the function is.
(Figures: the same data set fit with different length scales, ranging from very smooth to very wiggly.)

79-82 Model Selection: Hyper-Parameters
GP training: find good GP hyper-parameters θ (kernel and mean function parameters).
Assign a prior p(θ) over hyper-parameters. The posterior over hyper-parameters is
  p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),   p(y | X, θ) = ∫ p(y | f(X)) p(f | θ) df
Choose MAP hyper-parameters θ*, such that
  θ* ∈ arg max_θ  log p(θ) + log p(y | X, θ)
This amounts to maximizing the marginal likelihood if p(θ) = U (uniform prior).

83-85 Training via Marginal Likelihood Maximization
GP training: maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters):
  θ* ∈ arg max_θ  log p(y | X, θ)
  log p(y | X, θ) = -½ yᵀ K_θ⁻¹ y - ½ log |K_θ| + const
Automatic trade-off between data fit and model complexity.
Gradient-based optimization is possible:
  ∂ log p(y | X, θ) / ∂θ = ½ yᵀ K⁻¹ (∂K/∂θ) K⁻¹ y - ½ tr(K⁻¹ ∂K/∂θ)
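A minimal sketch of marginal-likelihood training for a zero-mean GP with the Gaussian kernel and a learned noise standard deviation (for brevity, scipy's numerical gradients are used instead of the analytic gradient above; names and data are illustrative):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_marginal_likelihood(log_theta, X, y):
        """-log p(y | X, theta); log_theta = log([amplitude, length scale, noise std])."""
        amp, ell, noise = np.exp(log_theta)
        sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
        K = amp**2 * np.exp(-sq / ell**2) + noise**2 * np.eye(len(X))
        L = np.linalg.cholesky(K)                                 # K = L L^T
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K^{-1} y
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 30).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)
    res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
    print(np.exp(res.x))   # learned amplitude, length scale, and noise standard deviation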

86 Example
(Figure: three fits of the same data with different hyper-parameters, each annotated with its log-marginal-likelihood (LML) value.)
Compromise between fitting the data and simplicity of the model.
Not fitting the data is explained by a larger noise variance (treated as an additional hyper-parameter and learned jointly).

87-91 Model Selection II: Mean Function and Kernel
Assume we have a finite set of models M_i, each one specifying a mean function m_i and a kernel k_i. How do we find the best one?
Assign a prior p(M_i) to each model. The posterior model probability is
  p(M_i | X, y) = p(M_i) p(y | X, M_i) / p(y | X)
Choose the MAP model M*, such that
  M* ∈ arg max_M  log p(M) + log p(y | X, M)
This amounts to comparing marginal likelihoods if p(M_i) = 1/|M|.
The marginal likelihood is central for model selection.
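A sketch of this comparison for two candidate kernels with fixed (rather than optimized) hyper-parameters, assuming a zero mean function and a uniform model prior; the model with the larger log-marginal likelihood is preferred:

    import numpy as np

    def log_marginal_likelihood(K, y, sigma_eps=0.1):
        """log p(y | X, M) for a zero-mean GP with kernel matrix K and noise variance sigma_eps^2."""
        Ky = K + sigma_eps**2 * np.eye(len(y))
        L = np.linalg.cholesky(Ky)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, (40, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

    K_linear = X @ X.T                                           # linear kernel
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K_gauss = np.exp(-sq)                                        # Gaussian kernel, unit hyper-parameters

    for name, K in [("linear", K_linear), ("Gaussian", K_gauss)]:
        print(name, log_marginal_likelihood(K, y))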

92-96 Example
(Figures: fits with a constant, a linear, a Matérn, and a Gaussian kernel, each annotated with its log-marginal-likelihood (LML) value.)
Four different kernels (mean function fixed to m ≡ 0), MAP hyper-parameters for each kernel, and log-marginal-likelihood values for each (optimized) model.

97 Quick Summary
Gaussian processes are extremely flexible models with consistent confidence bounds.
They are based on simple manipulations of plain Gaussian distributions.
Few hyper-parameters; generally easy to train.
A first choice for black-box regression.

98 Application Areas
Reinforcement learning and robotics: model value functions and/or dynamics with GPs
Bayesian optimization (experimental design): model unknown utility functions with GPs
Geostatistics: spatial modeling (e.g., landscapes, resources)
Sensor networks

99 Limitations of Gaussian Processes
Computational and memory complexity:
  Training scales in O(N³)
  Prediction (variances) scales in O(N²)
  Memory requirement: O(ND + N²)
Practical limit: N ≈ 10,000

100 Table of Contents
Introduction: Gaussian Distribution
Gaussian Processes: Definition and Derivation, Inference, Covariance Functions, Model Selection, Limitations
Scaling Gaussian Processes to Large Data Sets: Sparse Gaussian Processes, Distributed Gaussian Processes

101-102 GP Factor Graph
(Figure: probabilistic graphical model (factor graph) of a GP with function values f(x₁), ..., f(x_N) for the training data and f*₁, ..., f*_L for the test data.)
All function values are jointly Gaussian distributed (e.g., training and test function values).
GP prior:
  p(f, f*) = N( [0; 0], [K_ff K_f*; K_*f K_**] )

103-104 Inducing Variables
(Figure: the factor graph extended with inducing function values f(u₁), ..., f(u_M).)
Introduce inducing function values f_u: hypothetical function values.
All function values are still jointly Gaussian distributed (e.g., training, test, and inducing function values).

105-106 Central Approximation Scheme
Approximation: the training and test sets are conditionally independent given the inducing function values, f ⊥ f* | f_u.
Then the effective GP prior is
  q(f, f*) = ∫ p(f | f_u) p(f* | f_u) p(f_u) df_u = N( [0; 0], [K_ff Q_f*; Q_*f K_**] ),   Q_ab := K_{a,f_u} K_{f_u,f_u}⁻¹ K_{f_u,b}

107-109 FI(T)C Sparse Approximation
Assume that the training (and test) sets are fully independent given the inducing variables (Snelson & Ghahramani, 2006).
Effective GP prior with this approximation:
  q(f, f*) = N( [0; 0], [Q_ff - diag(Q_ff - K_ff)  Q_f*; Q_*f  K_**] )
FIC: Q_** - diag(Q_** - K_**) can be used instead of K_**.
Training: O(NM²), prediction: O(M²).
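A naive NumPy sketch of FITC predictions with inducing inputs U; it follows the equations above directly and is written for clarity rather than for the O(NM²) cost (a practical implementation would use the matrix-inversion lemma and never build the full N×N matrices):

    import numpy as np

    def k_gauss(A, B, theta1=1.0, theta2=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return theta1**2 * np.exp(-sq / theta2**2)

    def fitc_predict(X, y, U, X_star, sigma_eps=0.1):
        """FITC predictive mean and marginal variance with inducing inputs U (naive O(N^3) version)."""
        K_uu = k_gauss(U, U) + 1e-8 * np.eye(len(U))     # jitter for numerical stability
        K_uf, K_us = k_gauss(U, X), k_gauss(U, X_star)
        Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)      # Q_ff = K_fu K_uu^{-1} K_uf
        Lam = np.diag(np.diag(k_gauss(X, X) - Q_ff)) + sigma_eps**2 * np.eye(len(X))
        A = Q_ff + Lam                                   # FITC training covariance
        Q_sf = K_us.T @ np.linalg.solve(K_uu, K_uf)      # Q_*f
        mean = Q_sf @ np.linalg.solve(A, y)
        cov = k_gauss(X_star, X_star) - Q_sf @ np.linalg.solve(A, Q_sf.T)
        return mean, np.diag(cov)

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, (200, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
    U = np.linspace(-3, 3, 10).reshape(-1, 1)            # M = 10 inducing inputs
    print(fitc_predict(X, y, U, np.array([[0.0], [2.0]])))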

110-114 Inducing Inputs
The FI(T)C sparse approximation relies on inducing function values f(u_i), where the u_i are the corresponding inputs.
These inputs are unknown a priori, so we have to find good ones.
Find them by maximizing the FI(T)C marginal likelihood with respect to the inducing inputs (and the standard hyper-parameters):
  u*_{1:M} ∈ arg max_{u_{1:M}}  q_FITC(y | X, u_{1:M}, θ)
Intuitively, the marginal likelihood is parameterized not only by the hyper-parameters θ but also by the inducing inputs u_{1:M}.
We end up with a high-dimensional, non-convex optimization problem with MD additional parameters.

115 FITC Example
(Figure from Ed Snelson: pink: original data; red crosses: initialization of the inducing inputs; blue crosses: location of the inducing inputs after optimization.)
Efficient compression of the original data set.

116-122 Summary: Sparse Gaussian Processes
Sparse approximations typically approximate a GP with N data points by a model with M ≪ N data points.
Selection of these M data points can be tricky and may involve non-trivial computations (e.g., optimizing inducing inputs).
Simple (random) subset selection is fast and generally robust (Chalupka et al., 2013).
Computational complexity: O(M³) or O(NM²) for training.
Practical limit: M of the order of 10⁴; often M ∈ O(10²) in the case of inducing variables.
If we set M = N/100, i.e., each inducing function value summarizes 100 real function values, the practical limit is N ∈ O(10⁶).
Let's try something different.

123-128 An Orthogonal Approximation: Distributed GPs
(Figure: the full kernel matrix approximated by a block-diagonal matrix.)
Randomly split the full data set into M chunks.
Place M independent GP models (experts) on these small chunks.
Independent computations can be distributed.
This amounts to a block-diagonal approximation of the kernel matrix K.
Combine the independent computations into an overall result.

129-131 Training the Distributed GP
Split the data set of size N into M chunks of size P.
Independence of the experts leads to a factorization of the marginal likelihood:
  log p(y | X, θ) ≈ Σ_{k=1}^M log p_k(y^(k) | X^(k), θ)
Distributed optimization and training are straightforward.
Computational complexity: O(MP³) [instead of O(N³)], and the computations are distributed over many machines.
Memory footprint: O(MP² + ND) [instead of O(N² + ND)].
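A minimal sketch of this factorized training objective: the data are split randomly into chunks, each expert contributes its own NLML with shared hyper-parameters, and the terms are summed (here in a simple loop; in practice each term can be evaluated on a different machine):

    import numpy as np
    from scipy.optimize import minimize

    def expert_nlml(log_theta, X, y):
        """NLML of one GP expert (zero mean, Gaussian kernel, shared hyper-parameters)."""
        amp, ell, noise = np.exp(log_theta)
        sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
        K = amp**2 * np.exp(-sq / ell**2) + noise**2 * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    def distributed_nlml(log_theta, chunks):
        """Approximate NLML = sum of the experts' NLMLs."""
        return sum(expert_nlml(log_theta, Xk, yk) for Xk, yk in chunks)

    rng = np.random.default_rng(2)
    X = rng.uniform(-3, 3, (2000, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(2000)
    chunks = [(X[idx], y[idx]) for idx in np.array_split(rng.permutation(len(X)), 10)]  # M = 10, P = 200
    res = minimize(distributed_nlml, x0=np.log([1.0, 1.0, 0.1]), args=(chunks,))
    print(np.exp(res.x))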

132-133 Empirical Training Time
(Figure: computation time of the NLML and its gradients versus training set size for the distributed GP (with varying numbers of experts), the full GP, and FITC.)
The NLML computation time is proportional to the training time.
Full GP (16K training points) ≈ sparse GP (50K training points) ≈ distributed GP (16M training points).
This pushes the practical limit by order(s) of magnitude.

134-136 Practical Training Times
Training* with N = 10⁶, D = 1 on a laptop: ≈ 30 min
Training* with N = 5 × 10⁶, D = 8 on a workstation: ≈ 4 hours
*: maximize the marginal likelihood, stop when converged**
**: convergence often after a limited number of line searches***
***: one line search ≈ 2-3 evaluations of the marginal likelihood and its gradient (usually O(N³))

137-138 Predictions with the Distributed GP
(Figure: expert predictions (µ₁, σ₁), (µ₂, σ₂), (µ₃, σ₃) combined into an overall prediction (µ, σ).)
The prediction of each GP expert is Gaussian, N(µ_i, σ_i²). How do we combine them into an overall prediction N(µ, σ²)?
Product-of-GP-experts approaches:
  PoE (product of experts) (Ng & Deisenroth, 2014)
  gPoE (generalized product of experts) (Cao & Fleet, 2014)
  BCM (Bayesian committee machine) (Tresp, 2000)
  rBCM (robust BCM) (Deisenroth & Ng, 2015)

139-142 Objectives
(Figure: two computational graphs, a flat one and a hierarchical one, combining the expert predictions (µ_i, σ_i) into an overall prediction (µ, σ).)
Scale to large data sets.
Good approximation of the full GP ("ground truth").
Predictions independent of the computational graph; runs on heterogeneous computing infrastructures (laptop, cluster, ...).
Reasonable predictive variances.

143 Running Example
Investigate various product-of-experts models.
Same training procedure, but different mechanisms for predictions.

144-146 Product of GP Experts
Prediction model (independent predictors):
  p(f* | x*, D) = ∏_{k=1}^M p_k(f* | x*, D^(k)),   p_k(f* | x*, D^(k)) = N(f* | µ_k(x*), σ_k²(x*))
Predictive precision (inverse variance) and mean:
  σ_poe⁻²(x*) = Σ_k σ_k⁻²(x*)
  µ_poe(x*) = σ_poe²(x*) Σ_k σ_k⁻²(x*) µ_k(x*)
Independent of the computational graph.

147-148 Product of GP Experts
Unreasonable variances for M > 1:
  σ_poe⁻²(x*) = Σ_k σ_k⁻²(x*)
The more experts, the more certain the prediction, even if every expert itself is very uncertain. The model cannot fall back to the prior.
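A small sketch of the PoE combination rule, which also exposes the overconfidence problem just described (expert means and variances are made up):

    import numpy as np

    def poe_combine(means, variances):
        """Product of GP experts: precisions add, the mean is the precision-weighted average."""
        means, variances = np.asarray(means, float), np.asarray(variances, float)
        precision = np.sum(1.0 / variances)                 # sum_k sigma_k^{-2}(x*)
        mean = np.sum(means / variances) / precision
        return mean, 1.0 / precision

    # Three experts predicting at a single test input
    mu, var = [0.9, 1.1, 1.0], [0.5, 0.4, 0.6]
    print(poe_combine(mu, var))   # combined variance is smaller than every expert's variance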

149-150 Generalized Product of GP Experts
Weight the responsibility of each expert in the PoE with β_k.
Prediction model (independent predictors):
  p(f* | x*, D) = ∏_{k=1}^M p_k^{β_k}(f* | x*, D^(k)),   p_k(f* | x*, D^(k)) = N(f* | µ_k(x*), σ_k²(x*))
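As a hedged sketch of how these weighted predictions can be combined, the standard gPoE form (Cao & Fleet, 2014) with the common choice β_k = 1/M; unlike the plain PoE, the combined prediction no longer becomes more confident merely because more experts are added:

    import numpy as np

    def gpoe_combine(means, variances, beta=None):
        """Generalized product of GP experts with weights beta_k (default beta_k = 1/M)."""
        means, variances = np.asarray(means, float), np.asarray(variances, float)
        beta = np.full(len(means), 1.0 / len(means)) if beta is None else np.asarray(beta, float)
        precision = np.sum(beta / variances)                 # sum_k beta_k sigma_k^{-2}(x*)
        mean = np.sum(beta * means / variances) / precision  # precision-weighted mean
        return mean, 1.0 / precision

    # Same three experts as before: the combined variance stays on the scale of a single expert
    print(gpoe_combine([0.9, 1.1, 1.0], [0.5, 0.4, 0.6]))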


MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Flexible and efficient Gaussian process models for machine learning

Flexible and efficient Gaussian process models for machine learning Flexible and efficient Gaussian process models for machine learning Edward Lloyd Snelson M.A., M.Sci., Physics, University of Cambridge, UK (2001) Gatsby Computational Neuroscience Unit University College

More information

Designing a learning system

Designing a learning system Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

More information

Natural Conjugate Gradient in Variational Inference

Natural Conjugate Gradient in Variational Inference Natural Conjugate Gradient in Variational Inference Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen Adaptive Informatics Research Centre, Helsinki University of Technology P.O. Box 5400, FI-02015

More information

Parametric Statistical Modeling

Parametric Statistical Modeling Parametric Statistical Modeling ECE 275A Statistical Parameter Estimation Ken Kreutz-Delgado ECE Department, UC San Diego Ken Kreutz-Delgado (UC San Diego) ECE 275A SPE Version 1.1 Fall 2012 1 / 12 Why

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x θ) Likelihood: eskmate

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Lecture 6: Logistic Regression

Lecture 6: Logistic Regression Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

More information

Chapter 22: The Electric Field. Read Chapter 22 Do Ch. 22 Questions 3, 5, 7, 9 Do Ch. 22 Problems 5, 19, 24

Chapter 22: The Electric Field. Read Chapter 22 Do Ch. 22 Questions 3, 5, 7, 9 Do Ch. 22 Problems 5, 19, 24 Chapter : The Electric Field Read Chapter Do Ch. Questions 3, 5, 7, 9 Do Ch. Problems 5, 19, 4 The Electric Field Replaces action-at-a-distance Instead of Q 1 exerting a force directly on Q at a distance,

More information

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

1 Teaching notes on GMM 1.

1 Teaching notes on GMM 1. Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in

More information

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n Principles of Data Mining Pham Tho Hoan hoanpt@hnue.edu.vn References [1] David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT press, 2002 [2] Jiawei Han and Micheline Kamber,

More information

Linear Regression. Guy Lebanon

Linear Regression. Guy Lebanon Linear Regression Guy Lebanon Linear Regression Model and Least Squares Estimation Linear regression is probably the most popular model for predicting a RV Y R based on multiple RVs X 1,..., X d R. It

More information

Some probability and statistics

Some probability and statistics Appendix A Some probability and statistics A Probabilities, random variables and their distribution We summarize a few of the basic concepts of random variables, usually denoted by capital letters, X,Y,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Probability for Estimation (review)

Probability for Estimation (review) Probability for Estimation (review) In general, we want to develop an estimator for systems of the form: x = f x, u + η(t); y = h x + ω(t); ggggg y, ffff x We will primarily focus on discrete time linear

More information

Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour. Patrick Lam Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

More information

Graphical Modeling for Genomic Data

Graphical Modeling for Genomic Data Graphical Modeling for Genomic Data Carel F.W. Peeters cf.peeters@vumc.nl Joint work with: Wessel N. van Wieringen Mark A. van de Wiel Molecular Biostatistics Unit Dept. of Epidemiology & Biostatistics

More information

Gaussian Process Training with Input Noise

Gaussian Process Training with Input Noise Gaussian Process Training with Input Noise Andrew McHutchon Department of Engineering Cambridge University Cambridge, CB PZ ajm57@cam.ac.uk Carl Edward Rasmussen Department of Engineering Cambridge University

More information

Master s Theory Exam Spring 2006

Master s Theory Exam Spring 2006 Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

More information

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Multivariate Statistical Inference and Applications

Multivariate Statistical Inference and Applications Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim

More information

Variations of Statistical Models

Variations of Statistical Models 38. Statistics 1 38. STATISTICS Revised September 2013 by G. Cowan (RHUL). This chapter gives an overview of statistical methods used in high-energy physics. In statistics, we are interested in using a

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

Classification Problems

Classification Problems Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Math 241, Exam 1 Information.

Math 241, Exam 1 Information. Math 241, Exam 1 Information. 9/24/12, LC 310, 11:15-12:05. Exam 1 will be based on: Sections 12.1-12.5, 14.1-14.3. The corresponding assigned homework problems (see http://www.math.sc.edu/ boylan/sccourses/241fa12/241.html)

More information

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing Communications for Statistical Applications and Methods 2013, Vol 20, No 4, 321 327 DOI: http://dxdoiorg/105351/csam2013204321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance

More information

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

Traffic Driven Analysis of Cellular Data Networks

Traffic Driven Analysis of Cellular Data Networks Traffic Driven Analysis of Cellular Data Networks Samir R. Das Computer Science Department Stony Brook University Joint work with Utpal Paul, Luis Ortiz (Stony Brook U), Milind Buddhikot, Anand Prabhu

More information

A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS Eusebio GÓMEZ, Miguel A. GÓMEZ-VILLEGAS and J. Miguel MARÍN Abstract In this paper it is taken up a revision and characterization of the class of

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References

More information

Dongfeng Li. Autumn 2010

Dongfeng Li. Autumn 2010 Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

More information

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

More information

Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Section 5. Stan for Big Data. Bob Carpenter. Columbia University

Section 5. Stan for Big Data. Bob Carpenter. Columbia University Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

More information

The Kelly Betting System for Favorable Games.

The Kelly Betting System for Favorable Games. The Kelly Betting System for Favorable Games. Thomas Ferguson, Statistics Department, UCLA A Simple Example. Suppose that each day you are offered a gamble with probability 2/3 of winning and probability

More information

3. The Junction Tree Algorithms

3. The Junction Tree Algorithms A Short Course on Graphical Models 3. The Junction Tree Algorithms Mark Paskin mark@paskin.org 1 Review: conditional independence Two random variables X and Y are independent (written X Y ) iff p X ( )

More information