# bayesian inference: principles and practice


1 bayesian inference: principles and practice (july 9, 2009)

2-3 motivation

we would like models that:

- provide predictive and explanatory power
- are complex enough to describe observed phenomena
- are simple enough to generalize to future observations

claim: bayesian inference provides a systematic framework to infer such models from observed data

4 motivation

the ideas behind the bayesian interpretation of probability and bayesian inference are well established (bayes, laplace, etc., 18th century); recent advances in mathematical techniques and computational resources have enabled successful applications of these ideas to real-world problems

5 motivating example: a bayesian approach to network modularity

6 outline

1. (what we'd like to do)
    - background: joint, marginal, and conditional probabilities
    - bayes theorem: inverting conditional probabilities
    - bayesian probability: unknowns as random variables
    - bayesian inference: bayesian probability + bayes theorem
2. (what we're able to do)
    - monte carlo methods: representative samples
    - variational methods: bound optimization

7 joint, marginal, and conditional probabilities

- joint distribution p(X = x, Y = y): probability that X = x and Y = y
- conditional distribution p(X = x | Y = y): probability that X = x given Y = y
- marginal distribution p(X = x): probability that X = x (regardless of Y)

8 sum and product rules

sum rule (sum out settings of irrelevant variables):

    p(x) = Σ_{y ∈ Ω_Y} p(x, y)    (1)

product rule (the joint as the product of the conditional and the marginal):

    p(x, y) = p(x|y) p(y)    (2)
            = p(y|x) p(x)    (3)
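The sum and product rules can be checked numerically on a small discrete joint distribution. The following is a minimal python sketch; the 2×2 table and all names are my own, not from the slides:

```python
# a hypothetical 2x2 joint distribution p(x, y)
p_joint = {
    ("x0", "y0"): 0.1, ("x0", "y1"): 0.3,
    ("x1", "y0"): 0.2, ("x1", "y1"): 0.4,
}

# sum rule (1): marginalize by summing the joint over the other variable
p_x, p_y = {}, {}
for (x, y), p in p_joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# product rule (2): p(x, y) = p(x|y) p(y), so p(x|y) = p(x, y) / p(y)
def p_x_given_y(x, y):
    return p_joint[(x, y)] / p_y[y]
```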


10 inverting conditional probabilities

equate the far right- and left-hand sides of the product rule,

    p(y|x) p(x) = p(x, y) = p(x|y) p(y)    (4)

and divide to obtain bayes theorem (bayes and price 1763), which gives the probability of Y given X from the probability of X given Y:

    p(y|x) = p(x|y) p(y) / p(x)    (5)

where p(x) = Σ_{y ∈ Ω_Y} p(x|y) p(y) is the normalization constant

11-12 example: diagnoses a la bayes

- population of 10,000, of which 1% has a (rare) disease
- the test is 99% (relatively) effective, i.e.
    - given a patient is sick, 99% test positive
    - given a patient is healthy, 99% test negative

given a positive test, what is the probability that the patient is sick? (follows wiggins 2006)

13-14 example: diagnoses a la bayes

- sick population (100 ppl): 99% test positive (99 ppl), 1% test negative (1 ppl)
- healthy population (9900 ppl): 1% test positive (99 ppl), 99% test negative (9801 ppl)

99 sick patients test positive and 99 healthy patients test positive, so given a positive test there is a 50% probability that the patient is sick

15-16 example: diagnoses a la bayes

we know the probability of testing positive/negative given sick/healthy; use bayes theorem to invert this to the probability of sick/healthy given a positive/negative test:

    p(sick | test +) = p(test + | sick) p(sick) / p(test +)
                     = (99/100 × 1/100) / (198/100²)
                     = 1/2    (6)

most of the work is in calculating the denominator (normalization)
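The arithmetic in (6) is easy to reproduce. A minimal python sketch (variable names are my own):

```python
# diagnosis example: 1% prevalence, 99% sensitivity and specificity
p_sick = 0.01
p_pos_given_sick = 0.99      # sick patients testing positive
p_pos_given_healthy = 0.01   # healthy patients testing positive

# denominator p(test +) via the sum rule: most of the work is here
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1.0 - p_sick)

# bayes theorem: p(sick | test +)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
```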


18-19 interpretations of probabilities (just enough philosophy)

- frequentists: the limit of the relative frequency of events for a large number of trials
- bayesians: a measure of a state of knowledge, quantifying degrees of belief (jaynes 2003)

key difference: bayesians permit the assignment of probabilities to unknown/unobservable hypotheses (frequentists do not)

20 interpretations of probabilities (just enough philosophy)

e.g., inferring model parameters Θ from observed data D:

- frequentist approach: calculate the parameter setting that maximizes the likelihood of the data (a point estimate),

    Θ* = argmax_Θ p(D|Θ)    (7)

- bayesian approach: calculate a distribution over parameter settings given the data,

    p(Θ|D) = ?    (8)


22 bayesian inference: bayesian probability + bayes theorem

treat unknown quantities as random variables and use bayes theorem to systematically update prior knowledge in the presence of observed data:

    posterior = likelihood × prior / evidence

    p(Θ|D) = p(D|Θ) p(Θ) / p(D)    (9)

23-28 example: coin flipping

- observe independent coin flips (bernoulli trials); infer a distribution over the coin bias
- start with a prior p(Θ) over the coin bias, before observing any flips
- observe flips: HTHHHTTHHHH
- update the posterior p(Θ|D) using bayes theorem
- observe more flips: HHHHHHHHHHHHTHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHH HHHHHHTHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHT
- update the posterior p(Θ|D) again using bayes theorem
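For a beta prior on the bias this posterior update has a closed form: a beta(a, b) prior combined with h observed heads and t tails yields a beta(a + h, b + t) posterior. A small python sketch under that assumption (the conjugate-update shortcut is standard, though the slides don't name it):

```python
# conjugate beta-bernoulli update for the coin bias Θ
def update(a, b, flips):
    h = flips.count("H")
    t = flips.count("T")
    return a + h, b + t

a, b = 1.0, 1.0                      # beta(1, 1), a uniform prior on [0, 1]
a, b = update(a, b, "HTHHHTTHHHH")   # observe the first batch of flips
posterior_mean = a / (a + b)         # E[Θ|D] = a / (a + b) for a beta posterior
```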

29-30 quantities of interest

bayesian inference maintains full posterior distributions over unknowns; many quantities of interest require expectations under these posteriors, e.g. the posterior mean and the predictive distribution:

    Θ̄ = E_{p(Θ|D)}[Θ] = ∫ dΘ Θ p(Θ|D)    (10)

    p(x|D) = E_{p(Θ|D)}[p(x|Θ, D)] = ∫ dΘ p(x|Θ, D) p(Θ|D)    (11)

often we can't compute the posterior (normalization), let alone expectations with respect to it, so we turn to approximation methods


32-34 representative samples

general approach: approximate intractable expectations via a sum over representative samples (follows mackay 2003):

    Φ = E_{p(x)}[φ(x)] = ∫ dx φ(x) p(x)    (12)

where φ(x) is an arbitrary function and p(x) is the target density; estimate

    Φ̂ = (1/R) Σ_{r=1}^R φ(x^(r))    (13)

this shifts the problem to finding good samples
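As a concrete instance of (12)-(13), the sketch below estimates Φ = E[x²] under a standard normal target, whose true value is 1 (the example is mine, not from the slides):

```python
import random

random.seed(0)
R = 200_000

# draw representative samples x^(r) directly from the target p(x) = N(0, 1)
samples = [random.gauss(0.0, 1.0) for _ in range(R)]

# estimator (13): Φ̂ = (1/R) Σ φ(x^(r)) with φ(x) = x²
phi_hat = sum(x * x for x in samples) / R
```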

35 representative samples

further complication: in general we can only evaluate the target density to within a multiplicative (normalization) constant, i.e.

    p(x) = p*(x) / Z    (14)

where p*(x) can be evaluated but Z is unknown

36 monte carlo methods

- uniform sampling
- importance sampling
- rejection sampling
- ...
- markov chain monte carlo (mcmc) methods
    - metropolis-hastings
    - gibbs sampling
    - ...

37-38 uniform sampling

- sample uniformly from the state space of all x values
- evaluate the non-normalized density p*(x^(r)) at each x^(r)
- approximate the normalization constant as

    Z_R = Σ_{r=1}^R p*(x^(r))    (15)

- estimate the expectation as

    Φ̂ = Σ_{r=1}^R φ(x^(r)) p*(x^(r)) / Z_R    (16)

requires a prohibitively large number of samples in high dimensions or with concentrated density
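A one-dimensional sketch of (15)-(16), using a uniform sampler over [-5, 5] and the unnormalized target p*(x) = exp(-x²/2) (a standard normal whose Z the sampler never uses); the self-normalized estimate of E[x²] should come out close to 1. The interval and names are my own:

```python
import math
import random

random.seed(1)
R = 200_000

xs = [random.uniform(-5.0, 5.0) for _ in range(R)]   # uniform samples
p_star = [math.exp(-x * x / 2.0) for x in xs]        # unnormalized target

z_r = sum(p_star)                                            # eq. (15)
phi_hat = sum(x * x * p for x, p in zip(xs, p_star)) / z_r   # eq. (16)
```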

39 importance sampling

modify uniform sampling by introducing a sampler density q(x) = q*(x)/Z_Q; choose q(x) simple enough that q*(x) can be sampled from, with the hope that q(x) is a reasonable approximation to p(x)

[figure: functions involved in importance sampling, from mackay (2003)]

40-41 importance sampling

adjust the estimator by weighting the importance of each sample:

    Φ̂ = Σ_{r=1}^R w_r φ(x^(r)) / Σ_{r=1}^R w_r    (17)

where

    w_r = p*(x^(r)) / q*(x^(r))    (18)

it is difficult to choose a good q*(x), as well as to estimate the reliability of the estimator
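A sketch of (17)-(18) with the same unnormalized target p*(x) = exp(-x²/2) and a wider gaussian sampler q = N(0, 2²); the weighted estimate of E[x²] should again be close to 1 (the sampler choice is mine):

```python
import math
import random

random.seed(2)
R = 100_000
sigma_q = 2.0

def p_star(x):   # unnormalized target density
    return math.exp(-x * x / 2.0)

def q_star(x):   # unnormalized sampler density, N(0, sigma_q²)
    return math.exp(-x * x / (2.0 * sigma_q ** 2))

xs = [random.gauss(0.0, sigma_q) for _ in range(R)]
ws = [p_star(x) / q_star(x) for x in xs]                     # eq. (18)
phi_hat = sum(w * x * x for w, x in zip(ws, xs)) / sum(ws)   # eq. (17)
```

Note that a sampler narrower than the target would make the weights (18) heavy-tailed, which is exactly the reliability problem the slide warns about.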

42 rejection sampling

similar to importance sampling, but the proposal density strictly bounds the target density, i.e.

    c q*(x) > p*(x)    (19)

for some known value c and all x

[figure: p*(x) bounded above by c q*(x), from mackay (2003)]

43-44 rejection sampling

- generate a sample x from q*(x)
- generate a uniformly random number u from [0, c q*(x)]
- add x to the set {x^(r)} if p*(x) > u
- estimate the expectation as

    Φ̂ = (1/R) Σ_{r=1}^R φ(x^(r))

c is often prohibitively large for a poor choice of q*(x) or in high dimensions

[figure: a sample is kept when u falls in the area under p*(x), from mackay (2003)]
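A sketch of the procedure for the same target p*(x) = exp(-x²/2), with an unnormalized laplace proposal q*(x) = exp(-|x|); since |x| - x²/2 ≤ 1/2 for all x, c = e^(1/2) satisfies (19). The proposal choice is mine:

```python
import math
import random

random.seed(3)
c = math.exp(0.5)   # c q*(x) >= p*(x) holds since |x| - x²/2 <= 1/2

def p_star(x):      # unnormalized target
    return math.exp(-x * x / 2.0)

def q_star(x):      # unnormalized laplace proposal
    return math.exp(-abs(x))

def sample_q():     # inverse-cdf sample from the normalized laplace density
    u = random.random() - 0.5
    return -math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

accepted = []
while len(accepted) < 50_000:
    x = sample_q()
    u = random.uniform(0.0, c * q_star(x))
    if p_star(x) > u:           # keep x if u falls under p*(x)
        accepted.append(x)

phi_hat = sum(x * x for x in accepted) / len(accepted)   # ≈ E[x²] = 1
```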

45 metropolis-hastings

in general, use a local proposal density q(x'; x^(t)) that depends on the current state x^(t), and construct a markov chain through state space that converges to the target density

note: the proposal density needn't closely approximate the target density

46-47 metropolis-hastings

- at time t, generate a tentative state x' from q(x'; x^(t))
- evaluate the acceptance ratio

    a = p*(x') q(x^(t); x') / [p*(x^(t)) q(x'; x^(t))]    (20)

- if a ≥ 1, accept the new state; else accept the new state with probability a
- if the new state is rejected, set x^(t+1) = x^(t)

effective for high-dimensional problems, but it is difficult to assess convergence of the markov chain (see neal 1993)
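A sketch of the algorithm for the target p*(x) = exp(-x²/2) with a symmetric gaussian random-walk proposal; for a symmetric proposal the q terms in (20) cancel and a reduces to p*(x')/p*(x^(t)). The example and step size are mine:

```python
import math
import random

random.seed(4)

def p_star(x):   # unnormalized target
    return math.exp(-x * x / 2.0)

x = 0.0
chain = []
for t in range(100_000):
    x_prime = x + random.gauss(0.0, 1.0)   # tentative state from q(x'; x^(t))
    a = p_star(x_prime) / p_star(x)        # acceptance ratio (symmetric q)
    if a >= 1.0 or random.random() < a:
        x = x_prime                        # accept the new state
    chain.append(x)                        # on rejection, x^(t+1) = x^(t)

burn_in = chain[10_000:]                   # discard early, unconverged samples
second_moment = sum(v * v for v in burn_in) / len(burn_in)   # ≈ E[x²] = 1
```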

48-49 gibbs sampling

a metropolis method where the proposal density is chosen as the conditional distribution, i.e.

    q(x_i'; x^(t)) = p(x_i' | {x_j^(t)}_{j ≠ i})    (21)

useful when the joint density factorizes, as in a sparse graphical model (see wainwright & jordan 2008)

similar difficulties to metropolis, but no concerns about adjustable parameters
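A sketch for a bivariate gaussian with unit variances and correlation ρ, where each full conditional (21) is itself gaussian: x₁|x₂ ~ N(ρx₂, 1-ρ²), and symmetrically for x₂. This is a standard textbook case; the example is mine:

```python
import math
import random

random.seed(5)
rho = 0.8
cond_sd = math.sqrt(1.0 - rho * rho)   # sd of each full conditional

x1, x2 = 0.0, 0.0
pairs = []
for t in range(100_000):
    x1 = random.gauss(rho * x2, cond_sd)   # sample from p(x1 | x2)
    x2 = random.gauss(rho * x1, cond_sd)   # sample from p(x2 | x1)
    pairs.append((x1, x2))

burn_in = pairs[10_000:]
corr_hat = sum(a * b for a, b in burn_in) / len(burn_in)   # ≈ rho
```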


51 bound optimization

general approach: replace integration with optimization; construct an auxiliary function upper-bounded by the log-evidence, then maximize the auxiliary function

[figure: the EM algorithm alternately computes a lower bound L(q, θ) on the log likelihood ln p(X|θ) for the current parameter values, then maximizes this bound to obtain the new parameter values θ_new from θ_old, from bishop (2006)]

52 variational bayes

bound the log of an expected value by the expected value of the log using jensen's inequality:

    ln p(D) = ln ∫ dΘ p(D|Θ) p(Θ)
            = ln ∫ dΘ q(Θ) [p(D|Θ) p(Θ) / q(Θ)]
            ≥ ∫ dΘ q(Θ) ln [p(D|Θ) p(Θ) / q(Θ)]

for a sufficiently simple (i.e. factorized) approximating distribution q(Θ), the right-hand side can be easily evaluated and optimized
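The direction of jensen's inequality can be checked numerically: for any positive random variable z, concavity of ln gives ln E[z] ≥ E[ln z]. A tiny sketch, with z uniform on [0.5, 2] (the choice of distribution is arbitrary and mine):

```python
import math
import random

random.seed(6)
zs = [random.uniform(0.5, 2.0) for _ in range(10_000)]   # positive z's

log_of_mean = math.log(sum(zs) / len(zs))            # ln E[z]
mean_of_log = sum(math.log(z) for z in zs) / len(zs)  # E[ln z]
jensen_gap = log_of_mean - mean_of_log               # >= 0 by jensen's inequality
```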

53-54 variational bayes

- an iterative coordinate ascent algorithm provides controlled analytic approximations to the posterior and the evidence
- the approximate posterior q(Θ) minimizes the kullback-leibler divergence to the true posterior
- the resulting deterministic algorithm is often fast and scalable

but:

- the complexity of the approximation is often limited (to, e.g., mean-field theory, assuming weak interactions between unknowns)
- the iterative algorithm requires restarts, and there are no guarantees on the quality of the approximation

55-59 example: a bayesian approach to network modularity

- nodes: authors; edges: co-authored papers
- can we infer (community) structure in the giant component?
- inferred topological communities correspond to sub-disciplines


61 references

- information theory, inference, and learning algorithms, mackay (2003)
- pattern recognition and machine learning, bishop (2006)
- bayesian data analysis, gelman et al. (2003)
- probabilistic inference using markov chain monte carlo methods, neal (1993)
- graphical models, exponential families, and variational inference, wainwright & jordan (2008)
- probability theory: the logic of science, jaynes (2003)
- what is bayes theorem..., wiggins (2006)
- bayesian inference view on cran
- variational-bayes.org
- variational bayesian inference for network modularity


### Section 5. Stan for Big Data. Bob Carpenter. Columbia University

Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

### Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber 2011 1

Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2011 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields

### Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

1 Learning Goals Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1. Be able to apply Bayes theorem to compute probabilities. 2. Be able to identify

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Bayesian Phylogeny and Measures of Branch Support

Bayesian Phylogeny and Measures of Branch Support Bayesian Statistics Imagine we have a bag containing 100 dice of which we know that 90 are fair and 10 are biased. The

### Un point de vue bayésien pour des algorithmes de bandit plus performants

Un point de vue bayésien pour des algorithmes de bandit plus performants Emilie Kaufmann, Telecom ParisTech Rencontre des Jeunes Statisticiens, Aussois, 28 août 2013 Emilie Kaufmann (Telecom ParisTech)

### Lecture 9: Bayesian hypothesis testing

Lecture 9: Bayesian hypothesis testing 5 November 27 In this lecture we ll learn about Bayesian hypothesis testing. 1 Introduction to Bayesian hypothesis testing Before we go into the details of Bayesian

### Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Be able to explain the difference between the p-value and a posterior

### STAT3016 Introduction to Bayesian Data Analysis

STAT3016 Introduction to Bayesian Data Analysis Course Description The Bayesian approach to statistics assigns probability distributions to both the data and unknown parameters in the problem. This way,

### Review for Calculus Rational Functions, Logarithms & Exponentials

Definition and Domain of Rational Functions A rational function is defined as the quotient of two polynomial functions. F(x) = P(x) / Q(x) The domain of F is the set of all real numbers except those for

### Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

### PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

### Towards running complex models on big data

Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### Variational Mean Field for Graphical Models

Variational Mean Field for Graphical Models CS/CNS/EE 155 Baback Moghaddam Machine Learning Group baback @ jpl.nasa.gov Approximate Inference Consider general UGs (i.e., not tree-structured) All basic

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### More details on the inputs, functionality, and output can be found below.

Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

### AALBORG UNIVERSITY. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants

AALBORG UNIVERSITY An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants by Jesper Møller, Anthony N. Pettitt, Kasper K. Berthelsen and Robert W. Reeves

### 11 TRANSFORMING DENSITY FUNCTIONS

11 TRANSFORMING DENSITY FUNCTIONS It can be expedient to use a transformation function to transform one probability density function into another. As an introduction to this topic, it is helpful to recapitulate

### Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach

Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach Rémi Bardenet Arnaud Doucet Chris Holmes Department of Statistics, University of Oxford, Oxford OX1 3TG, UK REMI.BARDENET@GMAIL.COM

### Tracking Algorithms. Lecture17: Stochastic Tracking. Joint Probability and Graphical Model. Probabilistic Tracking

Tracking Algorithms (2015S) Lecture17: Stochastic Tracking Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Deterministic methods Given input video and current state, tracking result is always same. Local

### Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

### Tutorial on variational approximation methods. Tommi S. Jaakkola MIT AI Lab

Tutorial on variational approximation methods Tommi S. Jaakkola MIT AI Lab tommi@ai.mit.edu Tutorial topics A bit of history Examples of variational methods A brief intro to graphical models Variational

### BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

### Estimating the evidence for statistical models

Estimating the evidence for statistical models Nial Friel University College Dublin nial.friel@ucd.ie March, 2011 Introduction Bayesian model choice Given data y and competing models: m 1,..., m l, each

### Estimation and comparison of multiple change-point models

Journal of Econometrics 86 (1998) 221 241 Estimation and comparison of multiple change-point models Siddhartha Chib* John M. Olin School of Business, Washington University, 1 Brookings Drive, Campus Box

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### Supplement to Regional climate model assessment. using statistical upscaling and downscaling techniques

Supplement to Regional climate model assessment using statistical upscaling and downscaling techniques Veronica J. Berrocal 1,2, Peter F. Craigmile 3, Peter Guttorp 4,5 1 Department of Biostatistics, University

### Random graphs with a given degree sequence

Sourav Chatterjee (NYU) Persi Diaconis (Stanford) Allan Sly (Microsoft) Let G be an undirected simple graph on n vertices. Let d 1,..., d n be the degrees of the vertices of G arranged in descending order.

### Basic Bayesian Methods

6 Basic Bayesian Methods Mark E. Glickman and David A. van Dyk Summary In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood

### Probabilistic Fitting

Probabilistic Fitting Shape Modeling April 2016 Sandro Schönborn University of Basel 1 Probabilistic Inference for Face Model Fitting Approximate Inference with Markov Chain Monte Carlo 2 Face Image Manipulation

### Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

### Basics of Probability

Basics of Probability 1 Sample spaces, events and probabilities Begin with a set Ω the sample space e.g., 6 possible rolls of a die. ω Ω is a sample point/possible world/atomic event A probability space

### Incorporating cost in Bayesian Variable Selection, with application to cost-effective measurement of quality of health care.

Incorporating cost in Bayesian Variable Selection, with application to cost-effective measurement of quality of health care University of Florida 10th Annual Winter Workshop: Bayesian Model Selection and

### Optimising and Adapting the Metropolis Algorithm

Optimising and Adapting the Metropolis Algorithm by Jeffrey S. Rosenthal 1 (February 2013; revised March 2013 and May 2013 and June 2013) 1 Introduction Many modern scientific questions involve high dimensional

### A Bayesian Antidote Against Strategy Sprawl

A Bayesian Antidote Against Strategy Sprawl Benjamin Scheibehenne (benjamin.scheibehenne@unibas.ch) University of Basel, Missionsstrasse 62a 4055 Basel, Switzerland & Jörg Rieskamp (joerg.rieskamp@unibas.ch)

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

### Bayesian probability theory

Bayesian probability theory Bruno A. Olshausen arch 1, 2004 Abstract Bayesian probability theory provides a mathematical framework for peforming inference, or reasoning, using probability. The foundations

### 11. Time series and dynamic linear models

11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd

### Imperfect Debugging in Software Reliability

Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health

### Monte Carlo-based statistical methods (MASM11/FMS091)

Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 6 Sequential Monte Carlo methods II February 3, 2012 Changes in HA1 Problem

### BAYESIAN ANALYSIS OF AN AGGREGATE CLAIM MODEL USING VARIOUS LOSS DISTRIBUTIONS. by Claire Dudley

BAYESIAN ANALYSIS OF AN AGGREGATE CLAIM MODEL USING VARIOUS LOSS DISTRIBUTIONS by Claire Dudley A dissertation submitted for the award of the degree of Master of Science in Actuarial Management School