Bayesian probability theory



Bruno A. Olshausen
March 1, 2004

Abstract

Bayesian probability theory provides a mathematical framework for performing inference, or reasoning, using probability. The foundations of Bayesian probability theory were laid down some 200 years ago by people such as Bernoulli, Bayes, and Laplace, but it has long been held suspect or controversial by modern statisticians. The last few decades, though, have seen a Bayesian revolution, and Bayesian probability theory is now commonly employed (oftentimes with stunning success) in many scientific disciplines, from astrophysics to neuroscience. It is most often used to judge the relative validity of hypotheses in the face of noisy, sparse, or uncertain data, or to adjust the parameters of a specific model. Here we discuss the basics of Bayesian probability theory and show how it has been utilized in models of cortical processing and neural decoding. (Much of this material is adapted from the excellent treatise by T. J. Loredo, "From Laplace to supernova SN 1987A: Bayesian inference in astrophysics," in Maximum Entropy and Bayesian Methods, Kluwer, 1989.)

Bayes rule

Bayes rule really involves nothing more than the manipulation of conditional probabilities. Remember that the joint probability of two events, A and B, can be expressed as

P(A, B) = P(A \mid B) P(B)    (1)
        = P(B \mid A) P(A)    (2)

In Bayesian probability theory, one of these events is the hypothesis, H, and the other is the data, D, and we wish to judge the relative truth of the hypothesis given the data. According to Bayes rule, we do this via the relation

P(H \mid D) = \frac{P(D \mid H) P(H)}{P(D)}    (3)

The term P(D|H) is called the likelihood function, and it assesses the probability of the observed data arising from the hypothesis. Usually this is known by the
experimenter, as it expresses one's knowledge of how one expects the data to look given that the hypothesis is true. The term P(H) is called the prior, as it reflects one's prior knowledge before the data are considered. The specification of the prior is often the most subjective aspect of Bayesian probability theory, and it is one of the reasons statisticians held Bayesian inference in contempt. But closer examination of traditional statistical methods reveals that they all have their own hidden assumptions and tricks built into them. Indeed, one of the advantages of Bayesian probability theory is that one's assumptions are made up front, and any element of subjectivity in the reasoning process is directly exposed. The term P(D) is obtained by integrating (or summing) P(D|H) P(H) over all H, and usually plays the role of an ignorable normalizing constant. Finally, the term P(H|D) is known as the posterior, and as its name suggests, it reflects the probability of the hypothesis after consideration of the data.

Another way of looking at Bayes rule is that it represents learning: the transformation from the prior, P(H), to the posterior, P(H|D), formally reflects what we have learned about the validity of the hypothesis from consideration of the data. Now let's see how all of this plays out in a rather simplified example.

A simple example

A classic example of where Bayesian inference is employed is the problem of estimation, where we must guess the value of an underlying parameter from an observation that is corrupted by noise. Let's say we have some quantity in the world, x, and our observation of this quantity, y, is corrupted by additive Gaussian noise, n, with zero mean:

y = x + n    (4)

Our job is to make the best guess as to the value of x given the observed value y. If we knew the probability distribution of x given y, P(x|y), then we might want to pick the value of x that maximizes this distribution:

\hat{x} = \arg\max_x P(x \mid y)    (5)

Alternatively, if we want to minimize the mean squared error of our guesses, then we should pick the mean of P(x|y):

\hat{x} = \int x \, P(x \mid y) \, dx    (6)

So, if only we knew P(x|y), then we could make an optimal guess. Bayes rule tells us how to calculate P(x|y):

P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)}    (7)

The two main things we need to specify here are P(y|x) and P(x). The first is easy, since we specified the noise, n, to be Gaussian, zero-mean, and additive. Thus,

P(y \mid x) = P(n + x \mid x)    (8)
= \frac{1}{\sqrt{2\pi}\,\sigma_n} e^{-(y-x)^2 / 2\sigma_n^2}    (9)

where \sigma_n^2 is the variance of the noise. For the prior, we have to draw upon our existing knowledge of x. Let's say x is the voltage of a car battery, and as an experienced mechanic you have observed the voltages on thousands of cars, using very accurate voltage meters, to have a mean of 12 volts and a variance of 1, with an approximately Gaussian distribution. Thus,

P(x) = \frac{1}{\sqrt{2\pi}} e^{-(x-12)^2 / 2}    (10)

Now we are in a position to write down the posterior on x:

P(x \mid y) \propto P(y \mid x) P(x)    (11)
= e^{-(y-x)^2 / 2\sigma_n^2} \, e^{-(x-12)^2 / 2}    (12)
= e^{-\frac{1}{2}\left[(y-x)^2/\sigma_n^2 + (x-12)^2\right]}    (13)

The x which maximizes P(x|y) is the same as that which minimizes the exponent in brackets, which may be found by simple algebraic manipulation to be

\hat{x} = \frac{y + 12\sigma_n^2}{1 + \sigma_n^2}    (14)

The entire inference process is depicted graphically in terms of the probability distributions in figure 1.

[Figure 1: A simple example of Bayesian inference, showing the prior P(x), the likelihood P(y|x) for an observed value y = 13, and the resulting estimated value \hat{x}.]

Generative models

Generative models, also known as latent variable models or causal models, provide a way of modeling how a set of observed data could have arisen from a set of
underlying causes. Such models have been commonly employed in the social sciences, usually in the guise of factor analysis, to make inferences about the causes leading to various social conditions or personality traits. More recently, generative models have been used to model the function of the cerebral cortex. The reason for this is that the cortex can be seen as solving a very complex inference problem, where it must select the best hypothesis for what's out there based on the massive data stream (gigabits per second) present on the sensory receptors.

The basic idea behind a generative model is illustrated in figure 2. A set of multivariate data, D, is explained in terms of a set of underlying causes, \alpha. For instance, the data on the left may be the symptoms of a patient, the causes on the right might be various diseases, and the links would represent how the diseases give rise to the symptoms and how the diseases interact with each other. Alternatively, the data may be a retinal image array, in which case the causes would be the objects present in the scene, along with their positions and that of the lighting source, and the links would represent the rendering operation for creating an image from these causes. In general the links may be linear (as is the case in factor analysis), or more generally they may instantiate highly non-linear interactions among the causes or between the causes and the data.

[Figure 2: A generative model. A model M with causes \alpha gives rise to the observed data D; inference runs from the data back to the causes.]

There are two fundamental problems to solve in a generative model. One is to infer the best set of causes to represent a specific data item, D_i. The other is to learn the best model, M, for explaining the entire set of data, D = {D_1, D_2, ..., D_n}. For example, each D_i might correspond to a specific image, and D would represent the set of all observed images in nature. In modeling the operations of the cortex, the first problem may be seen as one of perception, while the second is one of adaptation.
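Before turning to inference in generative models, the closed-form estimate (14) from the simple example above can be checked numerically against a direct maximization of the posterior exponent (13). A minimal sketch (the particular values of y and \sigma_n below are invented for illustration):

```python
import numpy as np

# Numerical check of the simple estimation example: y = x + n with
# n ~ N(0, sigma_n^2) and the car-battery prior x ~ N(12, 1).
# The values of y and sigma_n are invented for illustration.
sigma_n = 0.5
y = 13.0

# Closed-form MAP estimate, equation (14)
x_hat = (y + 12 * sigma_n**2) / (1 + sigma_n**2)

# Direct maximization of the log-posterior exponent, equation (13), on a grid
x = np.linspace(8.0, 16.0, 200001)
log_post = -0.5 * ((y - x)**2 / sigma_n**2 + (x - 12.0)**2)
x_grid = x[np.argmax(log_post)]

print(x_hat, x_grid)  # the two estimates agree to grid resolution
```

With these numbers the observation y = 13 is pulled toward the prior mean of 12, by an amount set by the noise variance.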

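The generative modeling idea can be made concrete with a toy linear model; the linear links, sparsity level, dimensions, and noise scale below are illustrative assumptions, not details taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generative model: sparse causes alpha give rise to data
# D = W @ alpha + noise. All specifics here are invented for illustration.
n_causes, n_data = 4, 8
W = rng.normal(size=(n_data, n_causes))  # the "links" from causes to data

def sample_data():
    # Sparse, factorial prior: each cause is independently active with
    # probability 0.3; active causes take Gaussian values.
    active = rng.random(n_causes) < 0.3
    alpha = np.where(active, rng.normal(size=n_causes), 0.0)
    noise = 0.1 * rng.normal(size=n_data)
    return W @ alpha + noise, alpha

D_i, alpha = sample_data()  # one data item and the causes that generated it
print(D_i.shape, alpha.shape)
```

Inference would run this process backwards, recovering alpha from D_i; learning would adjust W to make the observed data probable.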
Inference (perception)

Inferring the best set of causes to explain a given piece of data usually involves maximizing the posterior over \alpha (or alternatively computing its mean), similar to the computation of \hat{x} in the simplified example above:

\hat{\alpha} = \arg\max_\alpha P(\alpha \mid D_i, M)    (15)
= \arg\max_\alpha P(D_i \mid \alpha, M) P(\alpha \mid M)    (16)

Note that the denominator P(D_i) may be dropped here since D_i is fixed. All quantities are conditioned on the model, M, which specifies the overall architecture (such as depicted in the figure) within which the causes, \alpha, are defined.

Learning (adaptation)

The model, M, specifies the set of potential causes, their prior probabilities, and the generative process by which they give rise to the data. Learning the specific model, M, that best accounts for all the data is accomplished by maximizing the posterior distribution over models, which according to Bayes rule is

P(M \mid D) \propto P(D \mid M) P(M)    (17)

Oftentimes, though, we are agnostic in the prior over the model, and so we may simply choose the model that maximizes the likelihood, P(D|M). The total probability of all the data under the model is

P(D \mid M) = P(D_1 \mid M) \, P(D_2 \mid M) \cdots P(D_n \mid M)    (18)
= \prod_i P(D_i \mid M)    (19)

where D_i denotes an individual data item (e.g., a particular image). The probability of an individual data item is obtained by summing over all of the possible causes for the data:

P(D_i \mid M) = \sum_\alpha P(D_i \mid \alpha, M) P(\alpha \mid M)    (20)

It is this summation that forms the most computationally formidable aspect of learning in the Bayesian framework. Usually there are ways we can approximate this sum, though, and much effort goes into this step. One typically maximizes the log-likelihood of the model, for reasons stemming from information theory as well as the fact that many probability distributions are naturally expressed as exponentials. In this case, we seek a model, M, such that

\hat{M} = \arg\max_M \log P(D \mid M)    (21)
= \arg\max_M \log \prod_i P(D_i \mid M)    (22)
= \arg\max_M \sum_i \log P(D_i \mid M)    (23)
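The learning computation of equations (20)-(23) can be sketched for the simplest case of a binary cause; the unit-variance Gaussian likelihoods and all the particular numbers below are invented for illustration. Each candidate model specifies a prior over the cause and a likelihood for the data given the cause, and we select the model with the highest total log-likelihood:

```python
import math

# Each model M = (p1, (mu0, mu1)) specifies P(alpha=1 | M) and a unit-variance
# Gaussian likelihood P(D_i | alpha, M) = N(D_i; mu_alpha, 1). All numbers are
# invented for illustration.
def log_likelihood(model, data):
    p1, mu = model
    total = 0.0
    for d in data:
        # Equation (20): P(D_i | M) = sum over alpha of P(D_i | alpha, M) P(alpha | M)
        p = 0.0
        for alpha, p_alpha in ((0, 1.0 - p1), (1, p1)):
            lik = math.exp(-0.5 * (d - mu[alpha])**2) / math.sqrt(2 * math.pi)
            p += lik * p_alpha
        total += math.log(p)  # equation (23): sum of log P(D_i | M)
    return total

data = [0.1, -0.2, 2.1, 1.9, 0.0]  # samples clustered around 0 and 2
models = {"M1": (0.4, (0.0, 2.0)), "M2": (0.4, (0.0, 5.0))}

best = max(models, key=lambda m: log_likelihood(models[m], data))
print(best)  # prints "M1": its second cause sits where the data actually cluster
```

With only two causes the sum in equation (20) is trivial; the computational burden the text refers to arises when the number of possible cause configurations grows combinatorially.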

An example of where this approach has been applied in modeling the cortex is in understanding the receptive field properties of so-called simple cells, which are found in the primary visual cortex of all mammals. These cells have long been noted for their spatially localized, oriented, and bandpass receptive fields, but until recently there was no explanation, in terms of a single quantitative theory, for why cells are built this way. In terms of Bayesian probability theory, one can understand the function of these cells as forming a model of natural images based on a linear superposition of sparse, statistically independent events. That is, if one utilizes a linear generative model in which the prior on the causes is factorial and sparse, the set of linear weighting functions that emerge from the adaptation process (23) are localized, oriented, and bandpass, similar to the receptive fields observed in cortical cells and also to the basis functions of commonly used wavelet transforms (Olshausen & Field, 1996, Nature, 381:607-9).

Neural decoding

Another problem in neuroscience where Bayesian probability theory has proven useful is the decoding of spike trains. One of the most common experimental paradigms in sensory neurophysiology is to record the activity of a neuron while a stimulus is played to the animal. Usually the goal in these experiments is to understand the function served by the neuron in mediating perception. One plays a stimulus, s, and observes a spike train, t, and the goal is to infer what t says about s. In terms of Bayesian probability theory, we can state the problem as

P(s \mid t) \propto P(t \mid s) P(s)    (24)

This is similar to the problem faced by the nervous system itself: given messages received in the form of spike trains, it must make inferences about the actual signals present in the world.
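Equation (24) can be sketched with an assumed Poisson spike-count likelihood; the encoding model, the two stimuli, and the rates below are invented for illustration, not taken from the text. Each stimulus drives the neuron at a different mean rate, and we decode a stimulus from an observed spike count t:

```python
import math

# Assumed encoding model: P(t | s) is Poisson with a stimulus-dependent mean
# rate. The stimuli, rates, and prior are invented for illustration.
rates = {"sA": 2.0, "sB": 8.0}   # mean spike count per trial, per stimulus
prior = {"sA": 0.5, "sB": 0.5}   # P(s): flat prior over the two stimuli

def posterior(t):
    # Equation (24): P(s | t) proportional to P(t | s) P(s)
    unnorm = {s: rates[s]**t * math.exp(-rates[s]) / math.factorial(t) * prior[s]
              for s in rates}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

post = posterior(7)             # 7 spikes observed on this trial
print(max(post, key=post.get))  # prints "sB": a high count favors the high-rate stimulus
```

The normalizing constant z plays the role of P(t) and, as in the earlier examples, drops out of the arg max.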
Bialek and colleagues pioneered the technique of stimulus reconstruction, which allows one to assess via this relation (24) the amount of information relayed by a neuron about a particular stimulus. Equation (24) also demonstrates an important use of the prior in Bayesian probability theory, as it makes explicit the fact that optimal inferences about the world, given the information in the spike train, depend on what assumptions are made about the structure of the world, P(s). More likely than not, P(s) is crucial for neurons in making inferences about the world, and so it is also important for experimentalists to consider the influence of this term when trying to ascertain the function of neurons from their responses to controlled stimuli.
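To make the role of the prior concrete, here is a sketch with an assumed Poisson spike-count encoding model (the stimuli, rates, and counts are invented for illustration): the very same spike count yields different inferences about the world under different choices of P(s).

```python
import math

# Assumed Poisson encoding model: P(t | s) has a stimulus-dependent mean rate.
# All specifics are invented for illustration.
rates = {"sA": 2.0, "sB": 8.0}  # mean spike count per stimulus

def map_stimulus(t, prior):
    # arg max over s of P(t | s) P(s), per equation (24)
    score = {s: rates[s]**t * math.exp(-rates[s]) / math.factorial(t) * prior[s]
             for s in rates}
    return max(score, key=score.get)

t = 5  # an ambiguous count, between the two mean rates
print(map_stimulus(t, {"sA": 0.5, "sB": 0.5}))    # flat prior: decodes "sB"
print(map_stimulus(t, {"sA": 0.99, "sB": 0.01}))  # strong prior for sA: decodes "sA"
```

An experimenter who ignores P(s) would attribute the flipped readout entirely to the neuron's response properties, when it in fact reflects the assumed structure of the world.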