Big Data need Big Model 1/44

Transcription

1 Big Data need Big Model 1/44

2 Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell,... Department of Statistics, Columbia University, New York (and other places) 10 Nov /44

3 3/44

4 4/44

5 5/44

6 Ordered probit data { int<lower=2> K; int<lower=0> N; int<lower=1> D; int<lower=1,upper=k> y[n]; row_vector[d] x[n]; } parameters { vector[d] beta; ordered[k-1] c; } model { vector[k] theta; for (n in 1:N) { real eta; eta <- x[n] * beta; theta[1] <- 1 - Phi(eta - c[1]); for (k in 2:(K-1)) theta[k] <- Phi(eta - c[k-1]) - Phi(eta - c[k]); theta[k] <- Phi(eta - c[k-1]); y[n] ~ categorical(theta); } } 6/44

7 Measurement error model data {... real x_meas[n]; real<lower=0> tau; } parameters { real x[n]; real mu_x; real sigma_x;... } model { } // measurement of x // measurement noise // unknown true value // prior location // prior scale x ~ normal(mu_x, sigma_x); x_meas ~ normal(x, tau); y ~ normal(alpha + beta * x, sigma);... // prior // measurement model 7/44

8 Stan overview 8/44

9 Stan overview Fit open-ended Bayesian models 8/44

10 Stan overview Fit open-ended Bayesian models Specify log posterior density in C++ 8/44

11 Stan overview Fit open-ended Bayesian models Specify log posterior density in C++ Code a distribution once, then use it everywhere 8/44

12 Stan overview Fit open-ended Bayesian models Specify log posterior density in C++ Code a distribution once, then use it everywhere Hamiltonian No-U-Turn sampler 8/44

13 Stan overview Fit open-ended Bayesian models Specify log posterior density in C++ Code a distribution once, then use it everywhere Hamiltonian No-U-Turn sampler Autodiff 8/44

14 Stan overview Fit open-ended Bayesian models Specify log posterior density in C++ Code a distribution once, then use it everywhere Hamiltonian No-U-Turn sampler Autodiff Runs from R, Python, Matlab, Julia; postprocessing 8/44

15 People 9/44

16 People Stan core (15) 9/44

17 People Stan core (15) Research collaborators (30) Developers (100) 9/44

18 People Stan core (15) Research collaborators (30) Developers (100) User community (1000) 9/44

19 People Stan core (15) Research collaborators (30) Developers (100) User community (1000) Users (10000) 9/44

20 Funding National Science Foundation Institute for Education Sciences Department of Energy Novartis YouGov 10/44

21 Roles of Stan 11/44

22 Roles of Stan Bayesian inference for unsophisticated users (alternative to Stata, Bugs, etc.) 11/44

23 Roles of Stan Bayesian inference for unsophisticated users (alternative to Stata, Bugs, etc.) Bayesian inference for sophisticated users (alternative to programming it yourself) 11/44

24 Roles of Stan Bayesian inference for unsophisticated users (alternative to Stata, Bugs, etc.) Bayesian inference for sophisticated users (alternative to programming it yourself) Fast and scalable gradient computation 11/44

25 Roles of Stan Bayesian inference for unsophisticated users (alternative to Stata, Bugs, etc.) Bayesian inference for sophisticated users (alternative to programming it yourself) Fast and scalable gradient computation Environment for developing new algorithms 11/44

26 12/44

27 13/44

28 14/44

29 This week, the New York Times and CBS News published a story using, in part, information from a non-probability, opt-in survey sparking concern among many in the polling community. In general, these methods have little grounding in theory and the results can vary widely based on the particular method used. Michael Link, President, American Association for Public Opinion Research 15/44

30 16/44

31 17/44

32 Xbox estimates, adjusting for demographics 18/44

33 19/44

34 Karl Rove, Wall Street Journal, 7 Oct: Mr. Romney s bounce is significant. Nate Silver, New York Times, 6 Oct: Mr. Romney has not only improved his own standing but also taken voters away from Mr. Obama s column. 20/44

35 Xbox estimates, adjusting for demographics and partisanship 21/44

36 Jimmy Carter Republicans and George W. Bush Democrats 22/44

37 23/44

38 24/44

39 Toxicology 25/44

40 Earth science 26/44

41 Lots of other applications Astronomy, ecology, linguistics, epidemiology, soil science,... 27/44

42 Steps of Bayesian data analysis 28/44

43 Steps of Bayesian data analysis Model building 28/44

44 Steps of Bayesian data analysis Model building Inference 28/44

45 Steps of Bayesian data analysis Model building Inference Model checking 28/44

46 Steps of Bayesian data analysis Model building Inference Model checking Model understanding and improvement 28/44

47 Background on Bayesian computation 29/44

48 Background on Bayesian computation Point estimates and standard errors 29/44

49 Background on Bayesian computation Point estimates and standard errors Hierarchical models 29/44

50 Background on Bayesian computation Point estimates and standard errors Hierarchical models Posterior simulation 29/44

51 Background on Bayesian computation Point estimates and standard errors Hierarchical models Posterior simulation Markov chain Monte Carlo (Gibbs sampler and Metropolis algorithm) 29/44

52 Background on Bayesian computation Point estimates and standard errors Hierarchical models Posterior simulation Markov chain Monte Carlo (Gibbs sampler and Metropolis algorithm) Hamiltonian Monte Carlo 29/44

53 Solving problems 30/44

54 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific 30/44

55 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo 30/44

56 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale 30/44

57 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation 30/44

58 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation Problem: Need gradients of log posterior for HMC 30/44

59 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation Problem: Need gradients of log posterior for HMC Solution: Reverse-mode algorithmic differentation 30/44

60 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation Problem: Need gradients of log posterior for HMC Solution: Reverse-mode algorithmic differentation Problem: Existing algo-diff slow, limited, unextensible 30/44

61 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation Problem: Need gradients of log posterior for HMC Solution: Reverse-mode algorithmic differentation Problem: Existing algo-diff slow, limited, unextensible Solution: Our own algo-diff 30/44

62 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation Problem: Need gradients of log posterior for HMC Solution: Reverse-mode algorithmic differentation Problem: Existing algo-diff slow, limited, unextensible Solution: Our own algo-diff Problem: Algo-diff requires fully templated functions 30/44

63 Solving problems Problem: Gibbs too slow, Metropolis too problem-specific Solution: Hamiltonian Monte Carlo Problem: Interpreters too slow, won t scale Solution: Compilation Problem: Need gradients of log posterior for HMC Solution: Reverse-mode algorithmic differentation Problem: Existing algo-diff slow, limited, unextensible Solution: Our own algo-diff Problem: Algo-diff requires fully templated functions Solution: Our own density library, Eigen linear algebra 30/44

64 Radford Neal (2011) on Hamiltonian Monte Carlo One practical impediment to the use of Hamiltonian Monte Carlo is the need to select suitable values for the leapfrog stepsize, ɛ, and the number of leapfrog steps L... Tuning HMC will usually require preliminary runs with trial values for ɛ and L... Unfortunately, preliminary runs can be misleading... 31/44

65 The No U-Turn Sampler Created by Matt Hoffman Run the HMC steps until they start to turn around (bend with an angle > 180 ) Computationally efficient Requires no tuning of #steps Complications to preserve detailed balance 32/44

66 Hoffman and Gelman NUTS Example Trajectory Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse is a contour of the target distribution, the black open circles are the positions θ traced out by the leapfrog integrator and associated with elements of the set of visited states B, the black solid circle is the starting position, the red solid circles are positions associated with states that must be excluded from the set C of possible next samples because their joint probability is below the slice variable u, Gelman Carpenter and thehoffman positionslee withgoodrich a red x through Betancourt them. correspond.. Stan: to states A platform that mustfor be Bayesian inference 33/44

67 Hoffman and Gelman NUTS Example Trajectory Blue ellipse is contour of target distribution Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse is a contour of the target distribution, the black open circles are the positions θ traced out by the leapfrog integrator and associated with elements of the set of visited states B, the black solid circle is the starting position, the red solid circles are positions associated with states that must be excluded from the set C of possible next samples because their joint probability is below the slice variable u, Gelman Carpenter and thehoffman positionslee withgoodrich a red x through Betancourt them. correspond.. Stan: to states A platform that mustfor be Bayesian inference 33/44

68 Hoffman and Gelman NUTS Example Trajectory Blue ellipse is contour of target distribution Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse is a contour of the target distribution, the black open circles are the positions θ traced Initial out byposition the leapfrog integrator at black and associated solid with elements of the set of visited states B, the black solid circle is the starting position, the red solid circles are positions associated with states that must be excluded from the set C of possible next samples because their joint probability is below the slice variable u, Gelman Carpenter and thehoffman positionslee withgoodrich a red x through Betancourt them. correspond.. Stan: to states A platform that mustfor be Bayesian inference 33/44

69 Hoffman and Gelman NUTS Example Trajectory Blue ellipse is contour of target distribution Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse is a contour of the target distribution, the black open circles are the positions θ traced Initial out byposition the leapfrog integrator at black and associated solid with elements of the set of visited states B, the black solid circle is the starting position, the red solid circles are Arrows positions associated indicate with astates U-turn that mustinbe momentum excluded from the set C of possible next samples because their joint probability is below the slice variable u, Gelman Carpenter and thehoffman positionslee withgoodrich a red x through Betancourt them. correspond.. Stan: to states A platform that mustfor be Bayesian inference 33/44

70 NUTS vs. Gibbs andthe Metropolis No-U-Turn Sampler igure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots compare 1,000 independent draws from a highly correlated 250-dimensional distribution (right) with 1,000,000 samples (thinned to 1,000 samples for display) generated by random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here. 34/44

71 NUTS vs. Gibbs andthe Metropolis No-U-Turn Sampler Two dimensions of highly correlated 250-dim distribution igure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots compare 1,000 independent draws from a highly correlated 250-dimensional distribution (right) with 1,000,000 samples (thinned to 1,000 samples for display) generated by random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here. 34/44

72 NUTS vs. Gibbs andthe Metropolis No-U-Turn Sampler Two dimensions of highly correlated 250-dim distribution igure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots compare 1M samples 1,000 independent from Metropolis, draws from1ma highly from correlated Gibbs (thin 250-dimensional to 1K) distribution (right) with 1,000,000 samples (thinned to 1,000 samples for display) generated by random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here. 34/44

73 NUTS vs. Gibbs andthe Metropolis No-U-Turn Sampler Two dimensions of highly correlated 250-dim distribution igure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots compare 1M samples 1,000 independent from Metropolis, draws from1ma highly from correlated Gibbs (thin 250-dimensional to 1K) distribu- 1K(right) samples with 1,000,000 from NUTS, samples 1K(thinned independent to 1,000 samples draws for display) generated by tion random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here. 34/44

74 NUTS vs. Basic HMC 250-D normal and logistic regression models Vertical axis shows effective #sims (big is good) (Left) NUTS; (Right) HMC with increasing t = ɛl 35/44

75 NUTS vs. Basic HMC II Hierarchical logistic regression and stochastic volatility Simulation time is step size ɛ times #steps L NUTS can beat optimally tuned HMC 36/44

76 Solving more problems in Stan 37/44

77 Solving more problems in Stan Problem: Need ease of use of BUGS 37/44

78 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language 37/44

79 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC 37/44

80 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation 37/44

81 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs 37/44

82 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming 37/44

83 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery 37/44

84 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery Solution: Static model typing, informative exceptions 37/44

85 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery Solution: Static model typing, informative exceptions Problem: Poor boundary behavior 37/44

86 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery Solution: Static model typing, informative exceptions Problem: Poor boundary behavior Solution: Calculate limits (e.g. lim x 0 x log x) 37/44

87 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery Solution: Static model typing, informative exceptions Problem: Poor boundary behavior Solution: Calculate limits (e.g. lim x 0 x log x) 37/44

88 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery Solution: Static model typing, informative exceptions Problem: Poor boundary behavior Solution: Calculate limits (e.g. lim x 0 x log x) Problem: Restrictive licensing (e.g., closed, GPL, etc.) 37/44

89 Solving more problems in Stan Problem: Need ease of use of BUGS Solution: Compile directed graphical model language Problem: Need to tune parameters for HMC Solution: Auto tuning, adaptation Problem: Efficient up-to-proportion density calcs Solution: Density template metaprogramming Problem: Limited error checking, recovery Solution: Static model typing, informative exceptions Problem: Poor boundary behavior Solution: Calculate limits (e.g. lim x 0 x log x) Problem: Restrictive licensing (e.g., closed, GPL, etc.) Solution: Open-source, BSD license 37/44

90 New stuff: Differential equation models Simple harmonic oscillator: dz 1 dt dz 2 dt = z 2 = z 1 θz 2 with observations (y 1, y 2 ) t, t = 1,..., T : y 1t N(z 1 (t), σ 2 1) y 2t N(z 2 (t), σ 2 2) Given data (y 1, y 2 ) t, t = 1,..., T, estimate initial state (y 1, y 2 ) t=0 and parameter θ 38/44

91 Stan program: 1 functions { real[] sho(real t, real[] y, real[] theta, real[] x_r, int[] x_i) { real dydt[2]; dydt[1] <- y[2]; dydt[2] <- -y[1] - theta[1] * y[2]; return dydt; } } data { int<lower=1> T; real y[t,2]; real t0; real ts[t]; } transformed data { real x_r[0]; int x_i[0]; } 39/44

92 Stan program: 2 parameters { real y0[2]; vector<lower=0>[2] sigma; real theta[1]; } model { real z[t,2]; sigma ~ cauchy(0,2.5); theta ~ normal(0,1); y0 ~ normal(0,1); z <- integrate_ode(sho, y0, t0, ts, theta, x_r, x_i); for (t in 1:T) y[t] ~ normal(z[t], sigma); } 40/44

93 Stan output Run RStan with data simulated from θ = 0.15, y 0 = (1, 0), and σ = 0.1: Inference for Stan model: sho. 4 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=4000. mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat y0[1] y0[2] sigma[1] sigma[2] theta[1] lp /44

94 Big Data, Big Model, Scalable Computing Items 100 Items 1000 Items Seconds / Sample Total Ratings 42/44

95 Thinking about scalability 43/44

96 Thinking about scalability Hierarchical item response model: Stan JAGS # items # raters # groups # data time memory time memory 20 2, ,000 :02m 16MB :03m 220MB 40 8, ,000 :16m 92MB :40m 1400MB 80 32, ,560,000 4h:10m 580MB :??m?mb 43/44

97 Thinking about scalability Hierarchical item response model: Stan JAGS # items # raters # groups # data time memory time memory 20 2, ,000 :02m 16MB :03m 220MB 40 8, ,000 :16m 92MB :40m 1400MB 80 32, ,560,000 4h:10m 580MB :??m?mb Also, Stan generated 4x effective sample size per iteration 43/44

98 Future work 44/44

99 Future work Programming 44/44

100 Future work Programming Faster gradients and higher-order derivatives 44/44

101 Future work Programming Faster gradients and higher-order derivatives Functions 44/44

102 Future work Programming Faster gradients and higher-order derivatives Functions Statistical algorithms 44/44

103 Future work Programming Faster gradients and higher-order derivatives Functions Statistical algorithms Riemannian Hamiltonian Monte Carlo 44/44

104 Future work Programming Faster gradients and higher-order derivatives Functions Statistical algorithms Riemannian Hamiltonian Monte Carlo (Penalized) mle 44/44

105 Future work Programming Faster gradients and higher-order derivatives Functions Statistical algorithms Riemannian Hamiltonian Monte Carlo (Penalized) mle (Penalized) marginal mle 44/44

106 Future work Programming Faster gradients and higher-order derivatives Functions Statistical algorithms Riemannian Hamiltonian Monte Carlo (Penalized) mle (Penalized) marginal mle Black-box variational Bayes 44/44

107 Future work Programming Faster gradients and higher-order derivatives Functions Statistical algorithms Riemannian Hamiltonian Monte Carlo (Penalized) mle (Penalized) marginal mle Black-box variational Bayes Data partitioning and expectation propagation 44/44