Section 5. Stan for Big Data. Bob Carpenter. Columbia University




Part I Overview

Scaling and Evaluation

[Figure: schematic of model complexity vs. data size (bytes, 1e6 up to 1e18), with regions for big data, big model, and the combined Big Model and Big Data approach pushing toward the state of the art.]

Types of Scaling: data, parameters, models

Riemannian Manifold HMC

- Best-mixing MCMC method (for a fixed number of continuous parameters)
- Moves on a Riemannian manifold rather than a Euclidean one
  - adapts to position-dependent curvature
- geoNUTS generalizes NUTS to RHMC (Betancourt, arXiv)
- SoftAbs metric (Betancourt, arXiv)
  - eigendecompose the Hessian and condition it
  - computationally feasible alternative to the original Fisher information metric of Girolami and Calderhead (JRSS, Series B), which requires third-order derivatives and an implicit integrator
- Code complete; awaiting higher-order auto-diff

Adiabatic Sampling

- Physically motivated alternative to simulated annealing and tempering (not really simulated!)
- Supplies an external heat bath
- Operates through a contact manifold
- System relaxes more naturally between energy levels
- Betancourt paper on arXiv
- Prototype complete

Black Box Variational Inference

- Black box, so it can fit any Stan model
- Multivariate normal approximation to the unconstrained posterior
  - covariance: diagonal mean-field or full rank
  - not a Laplace approximation: centered at the posterior mean, not the mode
  - transformed back to the constrained space (built-in Jacobians)
- Stochastic gradient-descent optimization
  - ELBO gradient estimated via Monte Carlo + autodiff
- Returns approximate posterior mean / covariance
- Returns a sample transformed to the constrained space

Black Box EP

- Fast, approximate inference (like VB)
  - VB and EP minimize KL-divergence in opposite directions
  - especially useful for Gaussian processes
- Asynchronous, data-parallel expectation propagation (EP)
- Cavity distributions control subsample variance
- Prototype stage; collaborating with Seth Flaxman, Aki Vehtari, Pasi Jylänki, John Cunningham, Nicholas Chopin, Christian Robert

Maximum Marginal Likelihood

- Fast, approximate inference for hierarchical models
- Marginalize out the lower-level parameters
- Optimize the higher-level parameters and fix them
- Optimize the lower-level parameters given the higher-level ones (see the sketch below)
- Errors estimated as in MLE
- aka "empirical Bayes", but not fully Bayesian, and no more empirical than full Bayes
- Design complete; awaiting parameter tagging
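A minimal sketch of this two-stage scheme in Python, for a toy normal hierarchical model whose marginal likelihood is analytic; the data values and function names are illustrative, not Stan's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy data: group estimates y[j] with known standard errors sigma[j].
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def neg_log_marginal(eta):
    # Lower-level theta[j] ~ Normal(mu, tau^2) marginalized out analytically:
    # y[j] ~ Normal(mu, sigma[j]^2 + tau^2).
    mu, log_tau = eta
    tau = np.exp(log_tau)  # optimize tau on the log scale
    return -np.sum(norm.logpdf(y, mu, np.sqrt(sigma**2 + tau**2)))

# Step 1: optimize the higher-level parameters (mu, tau) and fix them.
fit = minimize(neg_log_marginal, x0=np.zeros(2), method="L-BFGS-B")
mu_hat, tau_hat = fit.x[0], np.exp(fit.x[1])

# Step 2: optimize the lower-level theta[j] given (mu_hat, tau_hat);
# the conditional mode is a precision-weighted average.
w = tau_hat**2 / (tau_hat**2 + sigma**2)
theta_hat = w * y + (1 - w) * mu_hat
```

Marginalizing before optimizing is what distinguishes this from joint MAP estimation, which in hierarchical models can degenerate by driving tau toward zero.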

Part II Posterior Modes & Laplace Approximation

Laplace Approximation

- Maximum (penalized) likelihood as approximate Bayes
- Laplace approximation to the posterior
- Compute the posterior mode via optimization
- Estimate the posterior as

  \theta^* = \arg\max_\theta \; p(\theta \mid y)

  p(\theta \mid y) \approx \mathsf{MultiNormal}(\theta \mid \theta^*, \, -H^{-1})

- H is the Hessian of the log posterior:

  H_{i,j} = \frac{\partial^2}{\partial\theta_i \, \partial\theta_j} \log p(\theta \mid y)

Stan's Laplace Approximation

- L-BFGS to compute the posterior mode θ* (sketched below)
- Automatic differentiation to compute H
  - currently: finite differences of gradients
  - soon: second-order automatic differentiation
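A minimal sketch of the procedure in Python under these assumptions: scipy's L-BFGS finds the mode of a user-supplied log posterior, and the Hessian comes from central differences of numerical gradients. The example density is a hypothetical stand-in, not Stan's code.

```python
import numpy as np
from scipy.optimize import minimize, approx_fprime

def log_posterior(theta):
    # Hypothetical stand-in: unnormalized log density of a correlated Gaussian.
    prec = np.array([[2.0, 0.5], [0.5, 1.0]])
    return -0.5 * theta @ prec @ theta

def laplace_approximation(log_p, theta0, eps=1e-5):
    neg_log_p = lambda t: -log_p(t)
    # Step 1: posterior mode via L-BFGS, as on the slide above.
    mode = minimize(neg_log_p, theta0, method="L-BFGS-B").x
    # Step 2: Hessian of -log p via central differences of gradients
    # (mirroring Stan's current approach, pending second-order autodiff).
    d = len(mode)
    grad = lambda t: approx_fprime(t, neg_log_p, eps)
    H = np.empty((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        H[:, j] = (grad(mode + step) - grad(mode - step)) / (2 * eps)
    H = 0.5 * (H + H.T)  # symmetrize away numerical noise
    # Step 3: approximate posterior as MultiNormal(mode, inverse of the
    # negative Hessian of log p), i.e. -H^{-1} in the slide's notation.
    return mode, np.linalg.inv(H)

mode, cov = laplace_approximation(log_posterior, np.zeros(2))
```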

Part III Variational Bayes

VB in a Nutshell

- y is the observed data, θ the parameters
- Goal is to approximate the posterior p(θ | y) with a convenient approximating density g(θ | φ)
- φ is a vector of parameters of the approximating density
- Given data y, VB computes the φ* minimizing the KL-divergence from the approximation g(θ | φ) to the posterior p(θ | y):

  \phi^* = \arg\min_\phi \, \mathrm{KL}\!\left[ g(\theta \mid \phi) \,\middle\|\, p(\theta \mid y) \right]
         = \arg\max_\phi \int_\Theta g(\theta \mid \phi) \log \frac{p(\theta \mid y)}{g(\theta \mid \phi)} \, d\theta

KL-Divergence Example

[Figure: Bishop (2006) Pattern Recognition and Machine Learning, fig. 10.2 — green contours show a correlated Gaussian p(z) over variables z1 and z2; red contours show a factorized Gaussian approximation, fit by minimizing the KL-divergence in each of its two directions.]

- Green: true distribution p; red: approximating distribution g
- (a) VB-like: KL[g || p]; (b) EP-like: KL[p || g]
- KL[p || g] is used in an alternative approximate inference framework, expectation propagation
- VB systematically underestimates posterior variance

VB vs. Laplace

[Figure: Bishop (2006) Pattern Recognition and Machine Learning, fig. 10.1 — left panel shows the target distribution (yellow) with the Laplace (red) and variational (green) approximations; right panel shows the negative logarithms of the corresponding curves.]

- solid yellow: target; red: Laplace; green: VB
- VB approximates the posterior mean; Laplace the posterior mode

Stan's Black-Box VB

- VB typically requires a custom g() per model, based on conjugacy and analytic updates
- Stan instead uses black-box VB with a multivariate Gaussian g,

  g(\theta \mid \phi) = \mathsf{MultiNormal}(\theta \mid \mu, \Sigma),

  for the unconstrained posterior
  - e.g., scales σ are log-transformed, with built-in Jacobians
- Stan provides two versions
  - Mean field: Σ diagonal
  - General: Σ dense

Stan's VB: Computation

- Use L-BFGS optimization to optimize φ
- Requires the KL[g(θ | φ) || p(θ | y)] to be differentiable, but only up to a constant (i.e., use the evidence lower bound, ELBO)
- Approximate the KL-divergence and its gradient via Monte Carlo (sketched below)
  - the KL-divergence is an expectation w.r.t. the approximation g(θ | φ)
  - Monte Carlo draws are i.i.d. from the approximating multivariate normal
  - only an approximate gradient calculation is needed for soundness, so a few Monte Carlo iterations are enough
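A minimal sketch of the Monte Carlo ELBO gradient for the mean-field Gaussian, via the reparameterization θ = µ + exp(ω) z. It uses plain stochastic gradient ascent with a user-supplied gradient of the log posterior standing in for Stan's autodiff — illustrative, not Stan's implementation.

```python
import numpy as np

def advi_step(mu, omega, grad_log_p, lr=0.05, num_draws=5,
              rng=np.random.default_rng()):
    # Reparameterize: theta = mu + exp(omega) * z with z ~ Normal(0, I),
    # so the ELBO gradient becomes an expectation we can sample.
    z = rng.standard_normal((num_draws, len(mu)))
    theta = mu + np.exp(omega) * z
    g = np.array([grad_log_p(t) for t in theta])  # d/dtheta log p(theta, y)
    # Monte Carlo ELBO gradients; the +1 is the entropy term's gradient.
    grad_mu = g.mean(axis=0)
    grad_omega = (g * z).mean(axis=0) * np.exp(omega) + 1.0
    # One stochastic-gradient *ascent* step on the ELBO.
    return mu + lr * grad_mu, omega + lr * grad_omega

# Usage on a hypothetical target: a standard normal posterior, whose
# log-density gradient is -theta.
mu, omega = np.zeros(2), np.zeros(2)
for _ in range(500):
    mu, omega = advi_step(mu, omega, lambda t: -t)
# mu -> approximate posterior mean; exp(omega) -> approximate posterior scales
```

Because the step direction only needs to be right on average, a handful of draws per step suffices, as the slide notes.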

Stan's VB: Computation (cont.)

- To support compatible plug-in inference (sketched below)
  - draw a Monte Carlo sample θ(1), ..., θ(M) with θ(m) ~ MultiNormal(µ*, Σ*)
  - inverse-transform from the unconstrained to the constrained scale
  - report to the user in the same way as MCMC draws
- Future: reweight the θ(m) via importance sampling with respect to the true posterior to improve expectation calculations
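A minimal sketch of these plug-in draws, assuming the first unconstrained coordinate is a log-transformed scale parameter; the fitted values are toy numbers, not output from Stan.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star = np.array([0.2, 1.5])        # fitted variational mean (toy values)
Sigma_star = np.diag([0.04, 0.09])    # mean-field: diagonal covariance

# Draw on the unconstrained scale, exactly as the fitted MultiNormal.
draws = rng.multivariate_normal(mu_star, Sigma_star, size=1000)

# Inverse-transform back to the constrained scale before reporting:
sigma_draws = np.exp(draws[:, 0])     # log-scale parameter -> positive scale
beta_draws = draws[:, 1]              # already unconstrained; unchanged
```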

Near Future: Stochastic VB

- Data-streaming form of VB
- Scales to billions of observations
- Hoffman et al. (2013) Stochastic variational inference. JMLR 14.
- Mashup of stochastic gradient (Robbins and Monro 1951) and VB (see the sketch below)
  - subsample the data (e.g., stream in minibatches)
  - upweight each minibatch to the full data-set size
  - use it to make an unbiased estimate of the true gradient
  - take a gradient step to minimize the KL-divergence
- Prototype code complete
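A minimal sketch of that mashup under stated assumptions: user-supplied per-observation and prior/entropy gradient functions stand in for the real ELBO gradient, and each minibatch is upweighted by N / batch size so the noisy gradient stays unbiased. All names are illustrative.

```python
import numpy as np

def stochastic_vb(phi, data, grad_term, grad_prior, steps=1000, batch=32):
    # grad_term(phi, y_n): gradient contribution of one observation;
    # grad_prior(phi): gradient of the prior/entropy part of the objective.
    N = len(data)
    for t in range(1, steps + 1):
        idx = np.random.choice(N, size=batch)      # subsample a minibatch
        g = (N / batch) * sum(grad_term(phi, data[n]) for n in idx) \
            + grad_prior(phi)                      # upweighted -> unbiased
        # Ascend the ELBO (equivalently, descend the KL-divergence) with
        # Robbins-Monro step sizes: sum diverges, sum of squares converges.
        phi = phi + g / (t + 10.0)
    return phi
```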

The End (Section 5)