Gaussian Processes for Big Data Problems


1 Gaussian Processes for Big Data Problems
Marc Deisenroth, Department of Computing, Imperial College London
Machine Learning Summer School, Chalmers University, 14 April 2015

2 Gaussian Processes

3-4 Problem Setting
(Figures: noisy observations y of a latent function f(x).)
Objective: for a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a distribution over functions p(f) that explains the data.
This is a probabilistic regression problem.

5 (Some) Relevant Application Areas
Global black-box optimization and experimental design
Autonomous learning in robotics
Probabilistic dimensionality reduction and data visualization

6-7 Table of Contents
Introduction: Gaussian Distribution
Gaussian Processes: Definition and Derivation, Inference, Covariance Functions, Model Selection, Limitations
Scaling Gaussian Processes to Large Data Sets: Sparse Gaussian Processes, Distributed Gaussian Processes

8-10 Key Concepts in Probability Theory
Two fundamental rules:
  Sum rule (marginalization property): p(x) = ∫ p(x, y) dy
  Product rule: p(x, y) = p(y | x) p(x)
Bayes' theorem (probabilistic inverse):
  p(x | y) = p(y | x) p(x) / p(y),   x: hypothesis, y: measurement
Terminology: p(x | y) is the posterior belief, p(x) the prior belief, p(y | x) the likelihood (measurement model), and p(y) the marginal likelihood (normalization constant).

11-13 The Gaussian Distribution
  p(x | µ, Σ) = (2π)^(-D/2) |Σ|^(-1/2) exp(-½ (x - µ)ᵀ Σ⁻¹ (x - µ))
Mean vector µ: average of the data
Covariance matrix Σ: spread of the data
(Figures: a 1D Gaussian density p(x) and a 2D Gaussian over (x₁, x₂), each shown with data, mean, and 95% confidence bounds.)

14-16 Sampling from a Multivariate Gaussian
Objective: generate a random sample y ∼ N(µ, Σ) from a D-dimensional joint Gaussian with covariance matrix Σ and mean vector µ. However, we only have access to a random number generator that can sample x from N(0, I)...
Exploit that affine transformations y = Ax + b of Gaussians remain Gaussian:
  Mean: E_x[Ax + b] = A E_x[x] + b
  Covariance: V_x[Ax + b] = A V_x[x] Aᵀ
1. Find conditions on A, b that match the mean of y
2. Find conditions on A, b that match the covariance of y

17-18 Sampling from a Multivariate Gaussian (2)
Objective: generate a random sample y ∼ N(µ, Σ) from a D-dimensional joint Gaussian with covariance matrix Σ and mean vector µ.
  x = randn(D,1);       % sample x ~ N(0, I)
  y = chol(Σ)'*x + µ;   % scale and shift x
Here chol(Σ) is the Cholesky factor L, such that LᵀL = Σ (MATLAB's chol returns an upper-triangular factor, hence the transpose in the code).
Therefore, the mean and covariance of y are
  E[y] = ȳ = E[Lᵀx + µ] = Lᵀ E[x] + µ = µ
  Cov[y] = E[(y - ȳ)(y - ȳ)ᵀ] = E[Lᵀ x xᵀ L] = Lᵀ E[x xᵀ] L = LᵀL = Σ
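The same recipe as a minimal NumPy sketch (note that np.linalg.cholesky returns the lower-triangular factor L with LLᵀ = Σ, so no transpose is needed there; function and variable names are illustrative):

    import numpy as np

    def sample_mvn(mu, Sigma, n_samples=1, rng=None):
        """Draw samples from N(mu, Sigma) via the Cholesky factor of Sigma."""
        rng = np.random.default_rng() if rng is None else rng
        D = mu.shape[0]
        L = np.linalg.cholesky(Sigma)             # lower-triangular L with L @ L.T = Sigma
        x = rng.standard_normal((D, n_samples))   # x ~ N(0, I)
        return (mu[:, None] + L @ x).T            # affine transformation y = L x + mu

    # Usage: the sample mean and covariance should be close to mu and Sigma
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    y = sample_mvn(mu, Sigma, n_samples=10000)
    print(y.mean(axis=0), np.cov(y.T))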

19-21 Conditional
(Figures: joint density p(x, y), an observation of y, and the resulting conditional p(x | y).)
Joint:
  p(x, y) = N( [µ_x; µ_y], [Σ_xx Σ_xy; Σ_yx Σ_yy] )
Conditioning on an observed y gives
  p(x | y) = N(µ_{x|y}, Σ_{x|y})
  µ_{x|y} = µ_x + Σ_xy Σ_yy⁻¹ (y - µ_y)
  Σ_{x|y} = Σ_xx - Σ_xy Σ_yy⁻¹ Σ_yx
The conditional p(x | y) is also Gaussian, which is computationally convenient.

22 Marginal
(Figure: joint density p(x, y) and the marginal p(x).)
  p(x, y) = N( [µ_x; µ_y], [Σ_xx Σ_xy; Σ_yx Σ_yy] )
  p(x) = ∫ p(x, y) dy = N(µ_x, Σ_xx)
The marginal of a joint Gaussian distribution is Gaussian.
Intuitively: ignore (integrate out) everything you are not interested in.
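As a quick numerical check of these marginalization and conditioning formulas, a small NumPy sketch (the block means and covariances are made up for illustration):

    import numpy as np

    # Joint Gaussian over (x, y): block means and covariances (illustrative values)
    mu_x, mu_y = np.array([0.0]), np.array([1.0])
    S_xx, S_xy = np.array([[2.0]]), np.array([[0.9]])
    S_yx, S_yy = S_xy.T, np.array([[1.0]])

    # Marginal p(x) = N(mu_x, S_xx): simply read off the corresponding blocks
    print(mu_x, S_xx)

    # Conditional p(x | y) for an observed y, using the formulas above
    y_obs = np.array([2.5])
    mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)   # mu_x + S_xy S_yy^{-1} (y - mu_y)
    S_cond = S_xx - S_xy @ np.linalg.solve(S_yy, S_yx)            # S_xx - S_xy S_yy^{-1} S_yx
    print(mu_cond, S_cond)                                        # N(1.35, 1.19)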

23 Table of Contents
Introduction: Gaussian Distribution
Gaussian Processes: Definition and Derivation, Inference, Covariance Functions, Model Selection, Limitations
Scaling Gaussian Processes to Large Data Sets: Sparse Gaussian Processes, Distributed Gaussian Processes

24-28 Gaussian Process
Definition: a Gaussian process (GP) is a collection of random variables x₁, x₂, ..., any finite number of which is Gaussian distributed.
Unexpected (?) example: the linear dynamical system
  x_{t+1} = A x_t + w,   w ∼ N(0, Q)
Here x₁, x₂, ... is a collection of random variables, any finite number of which is Gaussian distributed. This is a GP.
But we will use the Gaussian process for a different purpose, not for time series.

29-31 The Gaussian in the Limit
Consider the joint Gaussian distribution p(x, x̃), where x ∈ R^D and x̃ ∈ R^k, k → ∞. Then
  p(x, x̃) = N( [µ_x; µ_x̃], [Σ_xx Σ_xx̃; Σ_x̃x Σ_x̃x̃] )
where Σ_x̃x̃ ∈ R^{k×k} and Σ_xx̃ ∈ R^{D×k}, k → ∞.
However, the marginal remains finite:
  p(x) = ∫ p(x, x̃) dx̃ = N(µ_x, Σ_xx)
where we integrate out an infinite number of variables x̃_i.

32-36 Marginal and Conditional in the Limit
In practice, we consider finite training and test data x_train, x_test.
Then x = {x_train, x_test, x_other}, where x_other plays the role of x̃ from the previous slide:
  p(x) = N( [µ_train; µ_test; µ_other],
            [Σ_train        Σ_train,test   Σ_train,other;
             Σ_test,train   Σ_test         Σ_test,other;
             Σ_other,train  Σ_other,test   Σ_other] )
Marginal:
  p(x_train, x_test) = ∫ p(x_train, x_test, x_other) dx_other
Conditional:
  p(x_test | x_train) = N(µ*, Σ*)
  µ* = µ_test + Σ_test,train Σ_train⁻¹ (x_train - µ_train)
  Σ* = Σ_test - Σ_test,train Σ_train⁻¹ Σ_train,test

37-39 Back to Regression: Distribution over Functions
We are not really interested in a distribution over inputs x, but rather in a distribution over function values f(x).
Let us replace x with function values f = f(x), treating a function as a long vector of function values:
  p(f, f̃) = N( [µ_f; µ_f̃], [Σ_ff Σ_ff̃; Σ_f̃f Σ_f̃f̃] )
where Σ_f̃f̃ ∈ R^{k×k} and Σ_ff̃ ∈ R^{N×k}, k → ∞.
Again, the marginal remains finite:
  p(f) = ∫ p(f, f̃) df̃ = N(µ_f, Σ_ff)

40-44 Marginal and Conditional over Functions
Define f* := f_test and f := f_train.
Marginal:
  p(f, f*) = N( [µ_f; µ*], [Σ_ff Σ_f*; Σ_*f Σ_**] )
Conditional (predictive distribution):
  p(f* | f) = N(m*, S*)
  m* = µ* + Σ_*f Σ_ff⁻¹ (f - µ_f)
  S* = Σ_** - Σ_*f Σ_ff⁻¹ Σ_f*
We need to compute (cross-)covariances between unknown function values. The kernel trick helps us out.

45-48 Kernelization
  p(f, f*) = N( [µ_f; µ*], [Σ_ff Σ_f*; Σ_*f Σ_**] )
  p(f* | f) = N(m*, S*),   m* = µ* + Σ_*f Σ_ff⁻¹ (f - µ_f),   S* = Σ_** - Σ_*f Σ_ff⁻¹ Σ_f*
A kernel function k is symmetric and positive definite.
The kernel computes covariances between unknown function values f(x_i) and f(x_j) by just looking at the corresponding inputs x_i, x_j:
  Σ_ff^(i,j) = Cov[f(x_i), f(x_j)] = k(x_i, x_j)
This yields the predictive distribution
  p(f* | f, X, X*) = N(m*, S*)
  m* = µ* + k(X*, X) k(X, X)⁻¹ (f - µ_f)
  S* = K_** - k(X*, X) k(X, X)⁻¹ k(X, X*)

49-55 GP Regression as a Bayesian Inference Problem
Objective: for a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a distribution over functions p(f) that explains the data.
Training data: X, y. Bayes' theorem yields
  p(f | X, y) = p(y | f, X) p(f) / p(y | X)
Prior: p(f) = GP(m, k). Specify a mean function m and a kernel k.
Likelihood (noise model): p(y | f, X) = N(f(X), σ_ε² I)
Marginal likelihood (evidence): p(y | X) = ∫ p(y | f, X) p(f) df
Posterior: p(f | y, X) = GP(m_post, k_post), with
  m_post(x_i) = m(x_i) + k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ (y - m(X))
  k_post(x_i, x_j) = k(x_i, x_j) - k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ k(X, x_j)

56-58 GP Regression as a Bayesian Inference Problem (2)
Posterior Gaussian process:
  m_post(x_i) = m(x_i) + k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ (y - m(X))
  k_post(x_i, x_j) = k(x_i, x_j) - k(X, x_i)ᵀ (K + σ_ε² I)⁻¹ k(X, x_j)
Predictive distribution p(f* | X, y, x*) at test inputs x*:
  p(f* | X, y, x*) = N(E[f*], V[f*])
  E[f* | X, y, x*] = m_post(x*) = m(x*) + k(X, x*)ᵀ (K + σ_ε² I)⁻¹ (y - m(X))
  V[f* | X, y, x*] = k_post(x*, x*) = k(x*, x*) - k(X, x*)ᵀ (K + σ_ε² I)⁻¹ k(X, x*)
From now on: set the prior mean function m ≡ 0.
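Putting these prediction equations into code, a minimal NumPy sketch of GP regression with zero prior mean and the Gaussian kernel introduced on the following slides (hyper-parameters fixed by hand; all names and data are illustrative):

    import numpy as np

    def k_gauss(A, B, theta1=1.0, theta2=1.0):
        """Gaussian kernel k(x_i, x_j) = theta1^2 exp(-||x_i - x_j||^2 / theta2^2)."""
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return theta1**2 * np.exp(-sq / theta2**2)

    def gp_predict(X, y, X_star, sigma_eps=0.1, **kp):
        """Posterior mean and marginal variance at test inputs X_star (zero prior mean)."""
        K = k_gauss(X, X, **kp) + sigma_eps**2 * np.eye(len(X))   # K + sigma_eps^2 I
        k_star = k_gauss(X, X_star, **kp)                         # k(X, x*)
        alpha = np.linalg.solve(K, y)                             # (K + sigma_eps^2 I)^{-1} y
        mean = k_star.T @ alpha                                   # posterior mean
        v = np.linalg.solve(K, k_star)
        var = np.diag(k_gauss(X_star, X_star, **kp)) - np.sum(k_star * v, axis=0)
        return mean, var

    # Toy usage on noisy observations of a sine function
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (20, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)
    X_star = np.linspace(-4, 4, 5).reshape(-1, 1)
    print(gp_predict(X, y, X_star))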

59-69 Illustration
(Figures: GP mean and 95% confidence region, first under the prior and then under the posterior as more and more observations are added.)
Prior belief about the function; predictive (marginal) mean and variance:
  E[f(x*) | x*] = m(x*) = 0
  V[f(x*) | x*] = σ²(x*) = k(x*, x*)
Posterior belief about the function after observing X, y:
  E[f(x*) | x*, X, y] = m(x*) = k(X, x*)ᵀ (K + σ_ε² I)⁻¹ y
  V[f(x*) | x*, X, y] = σ²(x*) = k(x*, x*) - k(X, x*)ᵀ (K + σ_ε² I)⁻¹ k(X, x*)

70 Covariance Function
A Gaussian process is fully specified by a mean function m and a kernel/covariance function k.
The covariance function encodes high-level structural assumptions about the latent function f (e.g., smoothness, differentiability, periodicity).

71 Gaussian Covariance Function
  k_Gauss(x_i, x_j) = θ₁² exp(-(x_i - x_j)ᵀ(x_i - x_j) / θ₂²)
θ₁: amplitude of the latent function
θ₂: length scale, i.e., how far we have to move in input space before the function value changes significantly (smoothness parameter)

72 Matérn Covariance Function
  k_Mat,3/2(x_i, x_j) = θ₁² (1 + √3 |x_i - x_j| / θ₂) exp(-√3 |x_i - x_j| / θ₂)

73 Periodic Covariance Function
  k_per(x_i, x_j) = θ₁² exp(-2 sin²(π (x_i - x_j) / θ₃) / θ₂²) = k_Gauss(u(x_i), u(x_j)),   u(x) = [cos(x), sin(x)]ᵀ
θ₃: periodicity parameter
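For concreteness, the three covariance functions as NumPy functions of the distance r = |x_i - x_j| (the periodic kernel below uses one common parameterization of the period θ₃, taken as an assumption here):

    import numpy as np

    def k_gauss(r, theta1=1.0, theta2=1.0):
        """Gaussian kernel: theta1 = amplitude, theta2 = length scale."""
        return theta1**2 * np.exp(-r**2 / theta2**2)

    def k_matern32(r, theta1=1.0, theta2=1.0):
        """Matern 3/2 kernel."""
        s = np.sqrt(3.0) * r / theta2
        return theta1**2 * (1.0 + s) * np.exp(-s)

    def k_periodic(r, theta1=1.0, theta2=1.0, theta3=2 * np.pi):
        """Periodic kernel with period theta3 (one common parameterization)."""
        return theta1**2 * np.exp(-2.0 * np.sin(np.pi * r / theta3)**2 / theta2**2)

    r = np.linspace(0.0, 5.0, 6)
    print(k_gauss(r), k_matern32(r), k_periodic(r))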

74 Model Selection
A GP is fully specified by a mean function m and a covariance function (kernel) k. Both functions possess parameters θ := {θ_m, θ_k}.
How do we find good parameters θ? How do we choose m and k? This is a model selection problem.

75-78 Model Selection I: Length-Scales
Length scales determine how wiggly the function is.
(Figures: the same data set fit with different length scales, ranging from very smooth to very wiggly.)

79-82 Model Selection: Hyper-Parameters
GP training: find good GP hyper-parameters θ (kernel and mean function parameters).
Assign a prior p(θ) over hyper-parameters. The posterior over hyper-parameters is
  p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),   p(y | X, θ) = ∫ p(y | f(X)) p(f | θ) df
Choose MAP hyper-parameters θ*, such that
  θ* ∈ arg max_θ  log p(θ) + log p(y | X, θ)
This amounts to maximizing the marginal likelihood if p(θ) = U (uniform prior).

83-85 Training via Marginal Likelihood Maximization
GP training: maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters):
  θ* ∈ arg max_θ  log p(y | X, θ)
  log p(y | X, θ) = -½ yᵀ K_θ⁻¹ y - ½ log |K_θ| + const
Automatic trade-off between data fit and model complexity.
Gradient-based optimization is possible:
  ∂ log p(y | X, θ) / ∂θ = ½ yᵀ K⁻¹ (∂K/∂θ) K⁻¹ y - ½ tr(K⁻¹ ∂K/∂θ)
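A minimal sketch of marginal-likelihood training for a zero-mean GP with the Gaussian kernel and a learned noise standard deviation (for brevity, scipy's numerical gradients are used instead of the analytic gradient above; names and data are illustrative):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_marginal_likelihood(log_theta, X, y):
        """-log p(y | X, theta); log_theta = log([amplitude, length scale, noise std])."""
        amp, ell, noise = np.exp(log_theta)
        sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
        K = amp**2 * np.exp(-sq / ell**2) + noise**2 * np.eye(len(X))
        L = np.linalg.cholesky(K)                                 # K = L L^T
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K^{-1} y
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 30).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)
    res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
    print(np.exp(res.x))   # learned amplitude, length scale, and noise standard deviation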

86 Example
(Figure: three fits of the same data with different hyper-parameters, each annotated with its log-marginal-likelihood (LML) value.)
Compromise between fitting the data and simplicity of the model.
Not fitting the data is explained by a larger noise variance (treated as an additional hyper-parameter and learned jointly).

87-91 Model Selection II: Mean Function and Kernel
Assume we have a finite set of models M_i, each one specifying a mean function m_i and a kernel k_i. How do we find the best one?
Assign a prior p(M_i) to each model. The posterior model probability is
  p(M_i | X, y) = p(M_i) p(y | X, M_i) / p(y | X)
Choose the MAP model M*, such that
  M* ∈ arg max_M  log p(M) + log p(y | X, M)
This amounts to comparing marginal likelihoods if p(M_i) = 1/|M|.
The marginal likelihood is central for model selection.
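A sketch of this comparison for two candidate kernels with fixed (rather than optimized) hyper-parameters, assuming a zero mean function and a uniform model prior; the model with the larger log-marginal likelihood is preferred:

    import numpy as np

    def log_marginal_likelihood(K, y, sigma_eps=0.1):
        """log p(y | X, M) for a zero-mean GP with kernel matrix K and noise variance sigma_eps^2."""
        Ky = K + sigma_eps**2 * np.eye(len(y))
        L = np.linalg.cholesky(Ky)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, (40, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

    K_linear = X @ X.T                                           # linear kernel
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K_gauss = np.exp(-sq)                                        # Gaussian kernel, unit hyper-parameters

    for name, K in [("linear", K_linear), ("Gaussian", K_gauss)]:
        print(name, log_marginal_likelihood(K, y))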

92-96 Example
(Figures: fits with a constant, a linear, a Matérn, and a Gaussian kernel, each annotated with its log-marginal-likelihood (LML) value.)
Four different kernels (mean function fixed to m ≡ 0), MAP hyper-parameters for each kernel, and log-marginal-likelihood values for each (optimized) model.

97 Quick Summary
Gaussian processes are extremely flexible models with consistent confidence bounds.
They are based on simple manipulations of plain Gaussian distributions.
Few hyper-parameters; generally easy to train.
A first choice for black-box regression.

98 Application Areas
Reinforcement learning and robotics: model value functions and/or dynamics with GPs
Bayesian optimization (experimental design): model unknown utility functions with GPs
Geostatistics: spatial modeling (e.g., landscapes, resources)
Sensor networks

99 Limitations of Gaussian Processes
Computational and memory complexity:
  Training scales in O(N³)
  Prediction (variances) scales in O(N²)
  Memory requirement: O(ND + N²)
Practical limit: N ≈ 10,000

100 Table of Contents
Introduction: Gaussian Distribution
Gaussian Processes: Definition and Derivation, Inference, Covariance Functions, Model Selection, Limitations
Scaling Gaussian Processes to Large Data Sets: Sparse Gaussian Processes, Distributed Gaussian Processes

101-102 GP Factor Graph
(Figure: probabilistic graphical model (factor graph) of a GP with function values f(x₁), ..., f(x_N) for the training data and f*₁, ..., f*_L for the test data.)
All function values are jointly Gaussian distributed (e.g., training and test function values).
GP prior:
  p(f, f*) = N( [0; 0], [K_ff K_f*; K_*f K_**] )

103-104 Inducing Variables
(Figure: the factor graph extended with inducing function values f(u₁), ..., f(u_M).)
Introduce inducing function values f_u: hypothetical function values.
All function values are still jointly Gaussian distributed (e.g., training, test, and inducing function values).

105-106 Central Approximation Scheme
Approximation: the training and test sets are conditionally independent given the inducing function values, f ⊥ f* | f_u.
Then the effective GP prior is
  q(f, f*) = ∫ p(f | f_u) p(f* | f_u) p(f_u) df_u = N( [0; 0], [K_ff Q_f*; Q_*f K_**] ),   Q_ab := K_{a,f_u} K_{f_u,f_u}⁻¹ K_{f_u,b}

107-109 FI(T)C Sparse Approximation
Assume that the training (and test) sets are fully independent given the inducing variables (Snelson & Ghahramani, 2006).
Effective GP prior with this approximation:
  q(f, f*) = N( [0; 0], [Q_ff - diag(Q_ff - K_ff)  Q_f*; Q_*f  K_**] )
FIC: Q_** - diag(Q_** - K_**) can be used instead of K_**.
Training: O(NM²), prediction: O(M²).
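A naive NumPy sketch of FITC predictions with inducing inputs U; it follows the equations above directly and is written for clarity rather than for the O(NM²) cost (a practical implementation would use the matrix-inversion lemma and never build the full N×N matrices):

    import numpy as np

    def k_gauss(A, B, theta1=1.0, theta2=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return theta1**2 * np.exp(-sq / theta2**2)

    def fitc_predict(X, y, U, X_star, sigma_eps=0.1):
        """FITC predictive mean and marginal variance with inducing inputs U (naive O(N^3) version)."""
        K_uu = k_gauss(U, U) + 1e-8 * np.eye(len(U))     # jitter for numerical stability
        K_uf, K_us = k_gauss(U, X), k_gauss(U, X_star)
        Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)      # Q_ff = K_fu K_uu^{-1} K_uf
        Lam = np.diag(np.diag(k_gauss(X, X) - Q_ff)) + sigma_eps**2 * np.eye(len(X))
        A = Q_ff + Lam                                   # FITC training covariance
        Q_sf = K_us.T @ np.linalg.solve(K_uu, K_uf)      # Q_*f
        mean = Q_sf @ np.linalg.solve(A, y)
        cov = k_gauss(X_star, X_star) - Q_sf @ np.linalg.solve(A, Q_sf.T)
        return mean, np.diag(cov)

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, (200, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
    U = np.linspace(-3, 3, 10).reshape(-1, 1)            # M = 10 inducing inputs
    print(fitc_predict(X, y, U, np.array([[0.0], [2.0]])))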

110-114 Inducing Inputs
The FI(T)C sparse approximation relies on inducing function values f(u_i), where the u_i are the corresponding inputs.
These inputs are unknown a priori, so we have to find good ones.
Find them by maximizing the FI(T)C marginal likelihood with respect to the inducing inputs (and the standard hyper-parameters):
  u*_{1:M} ∈ arg max_{u_{1:M}}  q_FITC(y | X, u_{1:M}, θ)
Intuitively, the marginal likelihood is parameterized not only by the hyper-parameters θ but also by the inducing inputs u_{1:M}.
We end up with a high-dimensional, non-convex optimization problem with MD additional parameters.

115 FITC Example
(Figure from Ed Snelson: pink: original data; red crosses: initialization of the inducing inputs; blue crosses: location of the inducing inputs after optimization.)
Efficient compression of the original data set.

116-122 Summary: Sparse Gaussian Processes
Sparse approximations typically approximate a GP with N data points by a model with M ≪ N data points.
Selection of these M data points can be tricky and may involve non-trivial computations (e.g., optimizing inducing inputs).
Simple (random) subset selection is fast and generally robust (Chalupka et al., 2013).
Computational complexity: O(M³) or O(NM²) for training.
Practical limit: M of the order of 10⁴; often M ∈ O(10²) in the case of inducing variables.
If we set M = N/100, i.e., each inducing function value summarizes 100 real function values, the practical limit is N ∈ O(10⁶).
Let's try something different.

123-128 An Orthogonal Approximation: Distributed GPs
(Figure: the full kernel matrix approximated by a block-diagonal matrix.)
Randomly split the full data set into M chunks.
Place M independent GP models (experts) on these small chunks.
Independent computations can be distributed.
This amounts to a block-diagonal approximation of the kernel matrix K.
Combine the independent computations into an overall result.

129-131 Training the Distributed GP
Split the data set of size N into M chunks of size P.
Independence of the experts leads to a factorization of the marginal likelihood:
  log p(y | X, θ) ≈ Σ_{k=1}^M log p_k(y^(k) | X^(k), θ)
Distributed optimization and training are straightforward.
Computational complexity: O(MP³) [instead of O(N³)], and the computations are distributed over many machines.
Memory footprint: O(MP² + ND) [instead of O(N² + ND)].
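A minimal sketch of this factorized training objective: the data are split randomly into chunks, each expert contributes its own NLML with shared hyper-parameters, and the terms are summed (here in a simple loop; in practice each term can be evaluated on a different machine):

    import numpy as np
    from scipy.optimize import minimize

    def expert_nlml(log_theta, X, y):
        """NLML of one GP expert (zero mean, Gaussian kernel, shared hyper-parameters)."""
        amp, ell, noise = np.exp(log_theta)
        sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
        K = amp**2 * np.exp(-sq / ell**2) + noise**2 * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    def distributed_nlml(log_theta, chunks):
        """Approximate NLML = sum of the experts' NLMLs."""
        return sum(expert_nlml(log_theta, Xk, yk) for Xk, yk in chunks)

    rng = np.random.default_rng(2)
    X = rng.uniform(-3, 3, (2000, 1)); y = np.sin(X).ravel() + 0.1 * rng.standard_normal(2000)
    chunks = [(X[idx], y[idx]) for idx in np.array_split(rng.permutation(len(X)), 10)]  # M = 10, P = 200
    res = minimize(distributed_nlml, x0=np.log([1.0, 1.0, 0.1]), args=(chunks,))
    print(np.exp(res.x))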

132-133 Empirical Training Time
(Figure: computation time of the NLML and its gradients versus training set size for the distributed GP (with varying numbers of experts), the full GP, and FITC.)
The NLML computation time is proportional to the training time.
Full GP (16K training points) ≈ sparse GP (50K training points) ≈ distributed GP (16M training points).
This pushes the practical limit by order(s) of magnitude.

134-136 Practical Training Times
Training* with N = 10⁶, D = 1 on a laptop: ≈ 30 min
Training* with N = 5 × 10⁶, D = 8 on a workstation: ≈ 4 hours
*: maximize the marginal likelihood, stop when converged**
**: convergence often after a limited number of line searches***
***: one line search ≈ 2-3 evaluations of the marginal likelihood and its gradient (usually O(N³))

137-138 Predictions with the Distributed GP
(Figure: expert predictions (µ₁, σ₁), (µ₂, σ₂), (µ₃, σ₃) combined into an overall prediction (µ, σ).)
The prediction of each GP expert is Gaussian, N(µ_i, σ_i²). How do we combine them into an overall prediction N(µ, σ²)?
Product-of-GP-experts approaches:
  PoE (product of experts) (Ng & Deisenroth, 2014)
  gPoE (generalized product of experts) (Cao & Fleet, 2014)
  BCM (Bayesian committee machine) (Tresp, 2000)
  rBCM (robust BCM) (Deisenroth & Ng, 2015)

139-142 Objectives
(Figure: two computational graphs, a flat one and a hierarchical one, combining the expert predictions (µ_i, σ_i) into an overall prediction (µ, σ).)
Scale to large data sets.
Good approximation of the full GP ("ground truth").
Predictions independent of the computational graph; runs on heterogeneous computing infrastructures (laptop, cluster, ...).
Reasonable predictive variances.

143 Running Example
Investigate various product-of-experts models.
Same training procedure, but different mechanisms for predictions.

144-146 Product of GP Experts
Prediction model (independent predictors):
  p(f* | x*, D) = ∏_{k=1}^M p_k(f* | x*, D^(k)),   p_k(f* | x*, D^(k)) = N(f* | µ_k(x*), σ_k²(x*))
Predictive precision (inverse variance) and mean:
  σ_poe⁻²(x*) = Σ_k σ_k⁻²(x*)
  µ_poe(x*) = σ_poe²(x*) Σ_k σ_k⁻²(x*) µ_k(x*)
Independent of the computational graph.

147-148 Product of GP Experts
Unreasonable variances for M > 1:
  σ_poe⁻²(x*) = Σ_k σ_k⁻²(x*)
The more experts, the more certain the prediction, even if every expert itself is very uncertain. The model cannot fall back to the prior.
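A small sketch of the PoE combination rule, which also exposes the overconfidence problem just described (expert means and variances are made up):

    import numpy as np

    def poe_combine(means, variances):
        """Product of GP experts: precisions add, the mean is the precision-weighted average."""
        means, variances = np.asarray(means, float), np.asarray(variances, float)
        precision = np.sum(1.0 / variances)                 # sum_k sigma_k^{-2}(x*)
        mean = np.sum(means / variances) / precision
        return mean, 1.0 / precision

    # Three experts predicting at a single test input
    mu, var = [0.9, 1.1, 1.0], [0.5, 0.4, 0.6]
    print(poe_combine(mu, var))   # combined variance is smaller than every expert's variance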

149-150 Generalized Product of GP Experts
Weight the responsibility of each expert in the PoE with β_k.
Prediction model (independent predictors):
  p(f* | x*, D) = ∏_{k=1}^M p_k^{β_k}(f* | x*, D^(k)),   p_k(f* | x*, D^(k)) = N(f* | µ_k(x*), σ_k²(x*))
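As a hedged sketch of how these weighted predictions can be combined, the standard gPoE form (Cao & Fleet, 2014) with the common choice β_k = 1/M; unlike the plain PoE, the combined prediction no longer becomes more confident merely because more experts are added:

    import numpy as np

    def gpoe_combine(means, variances, beta=None):
        """Generalized product of GP experts with weights beta_k (default beta_k = 1/M)."""
        means, variances = np.asarray(means, float), np.asarray(variances, float)
        beta = np.full(len(means), 1.0 / len(means)) if beta is None else np.asarray(beta, float)
        precision = np.sum(beta / variances)                 # sum_k beta_k sigma_k^{-2}(x*)
        mean = np.sum(beta * means / variances) / precision  # precision-weighted mean
        return mean, 1.0 / precision

    # Same three experts as before: the combined variance stays on the scale of a single expert
    print(gpoe_combine([0.9, 1.1, 1.0], [0.5, 0.4, 0.6]))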


MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Flexible and efficient Gaussian process models for machine learning

Flexible and efficient Gaussian process models for machine learning Flexible and efficient Gaussian process models for machine learning Edward Lloyd Snelson M.A., M.Sci., Physics, University of Cambridge, UK (2001) Gatsby Computational Neuroscience Unit University College

More information

Designing a learning system

Designing a learning system Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

More information

Natural Conjugate Gradient in Variational Inference

Natural Conjugate Gradient in Variational Inference Natural Conjugate Gradient in Variational Inference Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen Adaptive Informatics Research Centre, Helsinki University of Technology P.O. Box 5400, FI-02015

More information

Parametric Statistical Modeling

Parametric Statistical Modeling Parametric Statistical Modeling ECE 275A Statistical Parameter Estimation Ken Kreutz-Delgado ECE Department, UC San Diego Ken Kreutz-Delgado (UC San Diego) ECE 275A SPE Version 1.1 Fall 2012 1 / 12 Why

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x θ) Likelihood: eskmate

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Lecture 6: Logistic Regression

Lecture 6: Logistic Regression Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

More information

Chapter 22: The Electric Field. Read Chapter 22 Do Ch. 22 Questions 3, 5, 7, 9 Do Ch. 22 Problems 5, 19, 24

Chapter 22: The Electric Field. Read Chapter 22 Do Ch. 22 Questions 3, 5, 7, 9 Do Ch. 22 Problems 5, 19, 24 Chapter : The Electric Field Read Chapter Do Ch. Questions 3, 5, 7, 9 Do Ch. Problems 5, 19, 4 The Electric Field Replaces action-at-a-distance Instead of Q 1 exerting a force directly on Q at a distance,

More information

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

1 Teaching notes on GMM 1.

1 Teaching notes on GMM 1. Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in

More information

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n Principles of Data Mining Pham Tho Hoan hoanpt@hnue.edu.vn References [1] David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT press, 2002 [2] Jiawei Han and Micheline Kamber,

More information

Linear Regression. Guy Lebanon

Linear Regression. Guy Lebanon Linear Regression Guy Lebanon Linear Regression Model and Least Squares Estimation Linear regression is probably the most popular model for predicting a RV Y R based on multiple RVs X 1,..., X d R. It

More information

Some probability and statistics

Some probability and statistics Appendix A Some probability and statistics A Probabilities, random variables and their distribution We summarize a few of the basic concepts of random variables, usually denoted by capital letters, X,Y,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Probability for Estimation (review)

Probability for Estimation (review) Probability for Estimation (review) In general, we want to develop an estimator for systems of the form: x = f x, u + η(t); y = h x + ω(t); ggggg y, ffff x We will primarily focus on discrete time linear

More information

Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour. Patrick Lam Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

More information

Graphical Modeling for Genomic Data

Graphical Modeling for Genomic Data Graphical Modeling for Genomic Data Carel F.W. Peeters cf.peeters@vumc.nl Joint work with: Wessel N. van Wieringen Mark A. van de Wiel Molecular Biostatistics Unit Dept. of Epidemiology & Biostatistics

More information

Gaussian Process Training with Input Noise

Gaussian Process Training with Input Noise Gaussian Process Training with Input Noise Andrew McHutchon Department of Engineering Cambridge University Cambridge, CB PZ ajm57@cam.ac.uk Carl Edward Rasmussen Department of Engineering Cambridge University

More information

Master s Theory Exam Spring 2006

Master s Theory Exam Spring 2006 Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

More information

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Multivariate Statistical Inference and Applications

Multivariate Statistical Inference and Applications Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim

More information

Variations of Statistical Models

Variations of Statistical Models 38. Statistics 1 38. STATISTICS Revised September 2013 by G. Cowan (RHUL). This chapter gives an overview of statistical methods used in high-energy physics. In statistics, we are interested in using a

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

Classification Problems

Classification Problems Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Math 241, Exam 1 Information.

Math 241, Exam 1 Information. Math 241, Exam 1 Information. 9/24/12, LC 310, 11:15-12:05. Exam 1 will be based on: Sections 12.1-12.5, 14.1-14.3. The corresponding assigned homework problems (see http://www.math.sc.edu/ boylan/sccourses/241fa12/241.html)

More information

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing Communications for Statistical Applications and Methods 2013, Vol 20, No 4, 321 327 DOI: http://dxdoiorg/105351/csam2013204321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance

More information

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

Traffic Driven Analysis of Cellular Data Networks

Traffic Driven Analysis of Cellular Data Networks Traffic Driven Analysis of Cellular Data Networks Samir R. Das Computer Science Department Stony Brook University Joint work with Utpal Paul, Luis Ortiz (Stony Brook U), Milind Buddhikot, Anand Prabhu

More information

A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS Eusebio GÓMEZ, Miguel A. GÓMEZ-VILLEGAS and J. Miguel MARÍN Abstract In this paper it is taken up a revision and characterization of the class of

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References

More information

Dongfeng Li. Autumn 2010

Dongfeng Li. Autumn 2010 Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

More information

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

More information

Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Section 5. Stan for Big Data. Bob Carpenter. Columbia University

Section 5. Stan for Big Data. Bob Carpenter. Columbia University Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

More information

The Kelly Betting System for Favorable Games.

The Kelly Betting System for Favorable Games. The Kelly Betting System for Favorable Games. Thomas Ferguson, Statistics Department, UCLA A Simple Example. Suppose that each day you are offered a gamble with probability 2/3 of winning and probability

More information

3. The Junction Tree Algorithms

3. The Junction Tree Algorithms A Short Course on Graphical Models 3. The Junction Tree Algorithms Mark Paskin mark@paskin.org 1 Review: conditional independence Two random variables X and Y are independent (written X Y ) iff p X ( )

More information