
Bayesian parameter estimation
Nuno Vasconcelos, UCSD

Maximum likelihood parameter estimation in three steps:
1. Choose a parametric model for the probabilities. To make this clear we denote the vector of parameters by $\Theta$ and write $P_X(x; \Theta)$; note that this means that $\Theta$ is NOT a random variable.
2. Assemble a dataset $D = \{x_1, \ldots, x_n\}$ of examples drawn independently.
3. Select the parameters that maximize the probability of the data:
$$\Theta^* = \arg\max_\Theta P_X(D; \Theta) = \arg\max_\Theta \log P_X(D; \Theta)$$
$P_X(D; \Theta)$ is the likelihood of the parameter $\Theta$ with respect to the data.
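A minimal sketch in Python (not part of the original slides) of the three steps above, assuming a 1-D Gaussian model with $\Theta = (\mu, \sigma)$ and synthetic data; the model choice, data, and seed are illustrative assumptions.

```python
import numpy as np

# Step 1 (assumed model): P_X(x; Theta) is a 1-D Gaussian, Theta = (mu, sigma).
rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=1000)   # Step 2: i.i.d. examples (synthetic)

def log_likelihood(theta, data):
    """log P_X(D; Theta) for the Gaussian model."""
    mu, sigma = theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

# Step 3: for the Gaussian the maximizer has a closed form (sample mean / std).
theta_star = (D.mean(), D.std())
print(theta_star, log_likelihood(theta_star, D))
```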

Least squares
There are interesting connections between ML estimation and least squares methods. E.g. in a regression problem we have:
- two random variables X and Y
- a dataset of examples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
- a parametric model of the form $y = f(x; \Theta) + \varepsilon$, where $\Theta$ is a parameter vector and $\varepsilon$ a random variable that accounts for noise, e.g. $\varepsilon \sim N(0, \sigma^2)$

Least squares
Assuming that the family of models is known, e.g. the polynomial $f(x; \Theta) = \sum_{i=0}^{K} \theta_i x^i$, this is really just a problem of parameter estimation where the data is distributed as
$$P_{Z|X}(z \mid x; \Theta) = G(z, f(x; \Theta), \sigma^2),$$
i.e. a Gaussian whose mean is a function of x and $\Theta$; note that X is always known. In the homework, you will show that
$$\Theta^* = (\Gamma^T \Gamma)^{-1} \Gamma^T y.$$

Least squares
where
$$\Gamma = \begin{pmatrix} 1 & x_1 & \cdots & x_1^K \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^K \end{pmatrix}.$$
Conclusion: least squares estimation is really just ML estimation under the assumption of Gaussian noise, with an independent, identically distributed sample $\varepsilon_i \sim N(0, \sigma^2)$. Once again, the probabilistic view makes the assumptions explicit.
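A minimal Python sketch (not from the slides) of the formula $\Theta^* = (\Gamma^T \Gamma)^{-1} \Gamma^T y$ for a polynomial model; the degree, true coefficients, noise level, and seed are illustrative assumptions.

```python
import numpy as np

# Least-squares / ML fit of a degree-K polynomial under Gaussian noise,
# using the normal equations with the Vandermonde matrix Gamma.
rng = np.random.default_rng(1)
K = 2
x = np.linspace(-1, 1, 50)
y = 0.5 - x + 2 * x**2 + rng.normal(scale=0.1, size=x.size)  # assumed true model

Gamma = np.vander(x, N=K + 1, increasing=True)   # rows [1, x_i, ..., x_i^K]
theta_star = np.linalg.solve(Gamma.T @ Gamma, Gamma.T @ y)
print(theta_star)   # should be close to [0.5, -1, 2]
```

In practice one would use `np.linalg.lstsq` instead of forming $\Gamma^T \Gamma$ explicitly, for better numerical stability.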

Least squares solution
Due to the connection to parameter estimation we can also talk about the quality of the least squares solution. In particular, we know that:
- it is unbiased, and its variance goes to zero as the number of points increases
- it is the BLUE (best linear unbiased) estimator for $f(x; \Theta)$
Under the statistical formulation we can also see how the optimal estimator changes with the assumptions. ML estimation can also lead to (homework):
- weighted least squares
- minimization of $L_p$ norms
- robust estimators

Bayesian parameter estimation
Bayesian parameter estimation is an alternative framework for parameter estimation. It turns out that the division between Bayesian and ML methods is quite fundamental: it stems from a different way of interpreting probabilities (frequentist vs. Bayesian). There is a long debate about which is best, and this debate goes to the core of what probabilities mean. To understand it, we have to distinguish two components:
- the definition of probability (this does not change)
- the assessment of probability (this changes)
Let's start with a brief review of the part that does not change.

Probability
Probability is a language to deal with processes that are non-deterministic. Examples:
- if I flip a coin 100 times, how many heads can I expect to see?
- what is the weather going to be like tomorrow?
- are my stocks going to be up or down?
- am I in front of a classroom, or is this just a picture of it?

Sample space
The most important concept is that of a sample space: our process defines a set of events, which are the outcomes or states of the process. Example: we roll a pair of dice and call the value on the up face at the n-th toss $x_n$. Note that possible events such as
- an odd number on the second throw
- two sixes
- $x_1 = 2$ and $x_2 = 6$
can all be expressed as combinations of the sample space events.
[Figure: the 6 x 6 grid of outcomes $(x_1, x_2)$.]

Sample space
The sample space is the list of possible events that satisfies the following properties:
- finest grain: all possible distinguishable events are listed separately
- mutually exclusive: if one event happens the others do not (if $x_1 = 5$ it cannot be anything else)
- collectively exhaustive: any possible outcome can be expressed as a union of sample space events
The mutually exclusive property simplifies the calculation of the probability of complex events; collectively exhaustive means that there is no possible outcome to which we cannot assign a probability.
[Figure: the 6 x 6 grid of outcomes $(x_1, x_2)$.]

Probability measure
Probability of an event: a number expressing the chance that the event will be the outcome of the process. A probability measure satisfies three axioms:
- $P(A) \geq 0$ for any event A
- $P(\text{universal event}) = 1$
- if $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$
All of this has to do with the definition of probability, which is the same under the Bayesian and frequentist views. What changes is how probabilities are assessed.

Frequentist view
Under the frequentist view, probabilities are relative frequencies: I throw my dice n times, and in m of those the sum is 5, so I say that
$$P(\text{sum} = 5) = \frac{m}{n}.$$
This is intimately connected with the ML method: it is the ML estimate for the probability of a Bernoulli process with states (sum = 5, everything else). Everything makes sense when we have a lot of observations: no bias, decreasing variance, convergence to the true probability.
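A minimal Python sketch (not from the slides) of the relative-frequency estimate $m/n$ for $P(\text{sum} = 5)$ with simulated dice throws; the sample sizes and seed are illustrative assumptions. The true value is $4/36 \approx 0.111$.

```python
import numpy as np

# Frequentist / ML estimate of P(sum = 5) for a pair of dice, for growing n.
rng = np.random.default_rng(2)
for n in (10, 100, 10_000, 1_000_000):
    throws = rng.integers(1, 7, size=(n, 2)).sum(axis=1)  # n throws of two dice
    m = np.count_nonzero(throws == 5)
    print(n, m / n)   # converges to 4/36 as n grows
```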

Problems
There are many instances where we do not have a large number of observations. Consider the problem of crossing a street. This is a decision problem with two states:
- Y = 0: I am going to get hurt
- Y = 1: I will make it safely
The optimal decision is computable by the Bayes decision rule: collect some measurements that are informative, e.g. X = {size, distance, speed} of incoming cars, collect examples under both states, and estimate all the probabilities. Somehow this does not sound like a great idea!

Problems
Under the frequentist view you need to repeat an experiment a large number of times to estimate any probabilities. Yet people are very good at estimating probabilities for problems in which it is impossible to set up such experiments, for example:
- will I die if I join the army?
- will Democrats or Republicans win the next election?
- is there a God?
- will I graduate in two years?
to the point where they make life-changing decisions based on these probability estimates (enlisting in the army, etc.).

Subjective probability
This motivates an alternative definition of probabilities. Note that this has more to do with how probabilities are assessed than with the definition of probability itself: we still have a sample space, a probability measure, etc. However, the probabilities are not equated to relative counts. This is usually referred to as subjective probability:
- probabilities are degrees of belief in the outcomes of the experiment
- they are individual (they vary from person to person)
- they are not ratios of experimental outcomes
E.g. for a very religious person P(God exists) ~ 1; for a casual churchgoer P(God exists) ~ 0.8 (e.g. accepts evolution, etc.); for a non-religious person P(God exists) ~ 0.

Problems
In practice, why do we care about this? Under the notion of subjective probability, the entire ML framework makes little sense:
- there is a magic number that is estimated from the world and determines our beliefs
- to evaluate my estimates I have to run experiments over and over again and measure quantities like bias and variance
- this is not how people behave: when we make estimates we attach a degree of confidence to them, without further experiments
- there is only one model (the ML model) for the probability of the data, no multiple explanations
- there is no way to specify that some models are, a priori, better than others

Bayesian parameter estimation
The main difference with respect to ML is that in the Bayesian case $\Theta$ is a random variable. Basic concepts:
- training set $D = \{x_1, \ldots, x_n\}$ of examples drawn independently
- probability density for the observations given the parameter, $P_{X|\Theta}(x \mid \theta)$
- prior distribution for parameter configurations, $P_\Theta(\theta)$, that encodes prior beliefs about them
- goal: to compute the posterior distribution $P_{\Theta|T}(\theta \mid D)$

Bayes vs. ML
There are a number of significant differences between Bayesian and ML estimates.
Difference 1: ML produces a number, the best estimate. To measure its goodness we need to measure bias and variance, and this can only be done with repeated experiments. Bayes produces a complete characterization of the parameter from the single dataset: in addition to the most probable estimate, we obtain a characterization of the uncertainty.
[Figure: two posteriors, one with lower uncertainty, one with higher uncertainty.]

Bayes vs. ML
Difference 2: the optimal estimate. Under ML there is one best estimate. Under Bayes there is no best estimate, only a random variable that takes different values with different probabilities; technically speaking, it makes no sense to talk about the best estimate.
Difference 3: predictions. Remember that we do not really care about the parameters themselves; they are needed only in the sense that they allow us to build models that can be used to make predictions (e.g. the BDR). Unlike ML, Bayes uses ALL the information in the training set to make predictions.

Bayes vs. ML
Let's consider the BDR under the 0-1 loss and an independent sample $D = \{x_1, \ldots, x_n\}$.
ML-BDR: pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i; \theta_i^*)\, P_Y(i), \quad \text{where} \quad \theta_i^* = \arg\max_\theta P_{X|Y}(D \mid i; \theta).$$
Two steps: find $\theta_i^*$, then plug it into the BDR. All information not captured by $\theta_i^*$ is lost and is not used at decision time.
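A minimal Python sketch (not from the slides) of the two-step ML-BDR for two classes with 1-D Gaussian class-conditionals; the class means, variances, sample sizes, and seed are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Step 1: fit theta_i* = (mu_i, sigma_i) by ML within each class.
# Step 2: plug the estimates into the BDR, i*(x) = argmax_i P(x|i; theta_i*) P_Y(i).
rng = np.random.default_rng(5)
D0 = rng.normal(loc=-1.0, scale=1.0, size=200)   # class 0 training examples
D1 = rng.normal(loc=+1.5, scale=0.5, size=100)   # class 1 training examples

theta = [(D0.mean(), D0.std()), (D1.mean(), D1.std())]   # ML estimates per class
prior = [len(D0), len(D1)]
prior = [p / sum(prior) for p in prior]                  # P_Y(i) from class counts

def bdr(x):
    scores = [norm.pdf(x, mu, s) * p for (mu, s), p in zip(theta, prior)]
    return int(np.argmax(scores))

print(bdr(-2.0), bdr(2.0))   # expected: 0 and 1
```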

Bayes vs. ML
Note that we know information is lost: e.g. we cannot even know how good an estimate $\theta^*$ is unless we run multiple experiments and measure bias/variance.
Bayesian BDR: under the Bayesian framework, everything is conditioned on the training data. Denote by $T = \{X_1, \ldots, X_n\}$ the set of random variables from which the training sample $D = \{x_1, \ldots, x_n\}$ is drawn.
B-BDR: pick i if
$$i^*(x) = \arg\max_i P_{X|Y,T}(x \mid i, D)\, P_Y(i).$$
The decision is conditioned on the entire training set.

Bayesian BDR
To compute the conditional probabilities, we use the marginalization equation
$$P_{X|Y,T}(x \mid i, D) = \int P_{X|\Theta,Y,T}(x \mid \theta, i, D)\, P_{\Theta|Y,T}(\theta \mid i, D)\, d\theta.$$
Note 1: when the parameter value is known, x no longer depends on T, e.g. $X \mid \Theta \sim N(\theta, \sigma^2)$, so we can simplify the equation above into
$$P_{X|Y,T}(x \mid i, D) = \int P_{X|\Theta,Y}(x \mid \theta, i)\, P_{\Theta|Y,T}(\theta \mid i, D)\, d\theta.$$
Note 2: once again this can be done in two steps (per class):
- find $P_{\Theta|T}(\theta \mid D)$
- compute $P_{X|Y,T}(x \mid i, D)$ and plug it into the BDR
No training information is lost.

Bayesian BDR
In summary, pick i if
$$i^*(x) = \arg\max_i P_{X|Y,T}(x \mid i, D)\, P_Y(i), \quad \text{where} \quad P_{X|Y,T}(x \mid i, D) = \int P_{X|Y,\Theta}(x \mid i, \theta)\, P_{\Theta|Y,T}(\theta \mid i, D)\, d\theta.$$
Note: as before, the bottom equation is repeated for each class. Hence we can drop the dependence on the class and consider the more general problem of estimating
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, P_{\Theta|T}(\theta \mid D)\, d\theta.$$

The predictive distribution
The distribution
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, P_{\Theta|T}(\theta \mid D)\, d\theta$$
is known as the predictive distribution. This follows from the fact that it allows us to predict the value of x given ALL the information available in the training set. Note that it can also be written as
$$P_{X|T}(x \mid D) = E_{\Theta|T}\!\left[\, P_{X|\Theta}(x \mid \theta) \mid D \,\right].$$
Since each parameter value defines a model, this is an expectation over all possible models; each model is weighted by its posterior probability given the training data.

The predictive distribution
Suppose that $P_{X|\Theta}(x \mid \theta) \sim N(\theta, 1)$ and $P_{\Theta|T}(\theta \mid D) \sim N(\mu, \sigma^2)$. The predictive distribution is an average of all these Gaussians:
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, P_{\Theta|T}(\theta \mid D)\, d\theta.$$
[Figure: the posterior $P_{\Theta|T}(\theta \mid D)$ assigns weights $\pi_1, \pi_2, \ldots$ to parameter values $\mu_1, \mu_2, \ldots$; the corresponding Gaussians $N(\mu_i, 1)$ are averaged with those weights to form $P_{X|T}(x \mid D)$.]
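A minimal Python sketch (not from the slides) of this averaging, approximating the integral by a weighted sum of Gaussians over a grid of $\theta$ values; the posterior parameters and grid are illustrative assumptions. For this conjugate case the exact answer is $N(\mu, 1 + \sigma^2)$, which the sketch uses as a check.

```python
import numpy as np
from scipy.stats import norm

# Predictive distribution P(x|D) = ∫ N(x; theta, 1) N(theta; mu, sigma^2) d(theta),
# approximated by a discretized average of Gaussians weighted by the posterior.
mu, sigma = 2.0, 0.5                       # assumed posterior parameters
thetas = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 2001)
weights = norm.pdf(thetas, mu, sigma)
weights /= weights.sum()                   # discretized posterior weights

x = np.linspace(-3, 7, 500)
predictive = sum(w * norm.pdf(x, th, 1.0) for w, th in zip(weights, thetas))

exact = norm.pdf(x, mu, np.sqrt(1 + sigma**2))   # closed form for this case
print(np.max(np.abs(predictive - exact)))        # should be small
```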

The predictive distribution
Bayes vs. ML:
- ML: pick one model
- Bayes: average all models
Are Bayesian predictions very different from those of ML? They can be, unless the posterior $P_{\Theta|T}(\theta \mid D)$ is narrow.
[Figure: a narrow posterior peaked at its maximum (Bayes ~ ML) vs. a broad posterior (very different).]

The predictive distribution
Hence ML can be seen as a special case of Bayes: when you are very confident about the model, picking one is good enough. In coming lectures we will see that if the sample is quite large, the posterior tends to be narrow. Intuitively, given a lot of training data there is little uncertainty about what the model is; Bayes can make a difference when there is little data. We have already seen that this is the important case, since the variance of ML tends to go down as the sample increases. Overall, Bayes:
- regularizes the ML estimate when this is uncertain
- converges to ML when there is a lot of certainty

MAP approximation
This sounds good, so why use ML at all? The main problem with Bayes is that the integral
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, P_{\Theta|T}(\theta \mid D)\, d\theta$$
can be quite nasty. In practice one is frequently forced to use approximations. One possibility is to do something similar to ML, i.e. pick only one model. This can be made to account for the prior by picking the model that has the largest posterior probability given the training data:
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta \mid D).$$

MAP approximation
This can usually be computed, since
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta \mid D) = \arg\max_\theta P_{T|\Theta}(D \mid \theta)\, P_\Theta(\theta),$$
and corresponds to approximating the posterior by a delta function centered at its maximum.
[Figure: the posterior $P_{\Theta|T}(\theta \mid D)$ and its delta-function approximation at $\theta_{MAP}$.]
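A minimal Python sketch (not from the slides) of $\theta_{MAP} = \arg\max_\theta P_{T|\Theta}(D \mid \theta)\, P_\Theta(\theta)$, for the mean of a Gaussian with known unit variance and an assumed $N(0, \tau^2)$ prior; the data, prior width, and seed are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# MAP estimate found numerically by maximizing the log-posterior
# log P(D|theta) + log P(theta) (up to constants).
rng = np.random.default_rng(3)
D = rng.normal(loc=1.0, scale=1.0, size=5)     # small dataset, so the prior matters
tau = 0.5                                      # assumed prior standard deviation

def neg_log_posterior(theta):
    log_lik = np.sum(-0.5 * (D - theta) ** 2)      # log P(D|theta), Gaussian model
    log_prior = -0.5 * theta**2 / tau**2           # log P(theta), N(0, tau^2) prior
    return -(log_lik + log_prior)

theta_map = minimize_scalar(neg_log_posterior).x
theta_ml = D.mean()
print(theta_ml, theta_map)   # the MAP estimate is shrunk toward the prior mean 0
```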

MAP approximation
In this case
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, \delta(\theta - \theta_{MAP})\, d\theta = P_{X|\Theta}(x \mid \theta_{MAP}),$$
and the BDR becomes: pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i; \theta_i^{MAP})\, P_Y(i), \quad \text{where} \quad \theta_i^{MAP} = \arg\max_\theta P_{T|Y,\Theta}(D \mid i, \theta)\, P_{\Theta|Y}(\theta \mid i).$$
When compared to ML, this has the advantage of still accounting for the prior (although only approximately).

MAP vs. ML
ML-BDR: pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i; \theta_i^*)\, P_Y(i), \quad \text{where} \quad \theta_i^* = \arg\max_\theta P_{X|Y}(D \mid i; \theta).$$
Bayes MAP-BDR: pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i; \theta_i^{MAP})\, P_Y(i), \quad \text{where} \quad \theta_i^{MAP} = \arg\max_\theta P_{T|Y,\Theta}(D \mid i, \theta)\, P_{\Theta|Y}(\theta \mid i).$$
The difference is non-negligible only when the dataset is small. There are better alternative approximations.

The Laplace approximation
This is a method for approximating any distribution $P_X(x)$. It consists of approximating $P_X(x)$ by a Gaussian centered at its peak. Let's assume that
$$P_X(x) = \frac{1}{Z} g(x),$$
where g(x) is an unnormalized distribution (g(x) > 0 for all x) and Z is the normalization constant
$$Z = \int g(x)\, dx.$$
We make a Taylor series approximation of $\log g(x)$ at its maximum $x_0$.

Laplace approximation
The Taylor expansion is
$$\log g(x) = \log g(x_0) - \frac{c}{2}(x - x_0)^2 + \cdots$$
(the first-order term is zero because $x_0$ is a maximum), with
$$c = -\left.\frac{\partial^2}{\partial x^2} \log g(x)\right|_{x = x_0}.$$
We approximate g(x) by the unnormalized Gaussian
$$g'(x) = g(x_0)\, \exp\!\left\{-\frac{c}{2}(x - x_0)^2\right\}$$
and then compute the normalization constant
$$Z = g(x_0)\, \sqrt{\frac{2\pi}{c}}.$$

Laplace approximation
This can obviously be extended to the multivariate case. The approximation is
$$\log g(x) = \log g(x_0) - \frac{1}{2}(x - x_0)^T A\, (x - x_0),$$
with A the negative Hessian of $\log g(x)$ at $x_0$,
$$A_{ij} = -\left.\frac{\partial^2}{\partial x_i \partial x_j} \log g(x)\right|_{x = x_0},$$
and the normalization constant is
$$Z = g(x_0)\, \frac{(2\pi)^{d/2}}{\sqrt{|A|}}.$$
In physics this is also called a saddle-point approximation.
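A minimal 1-D Python sketch (not from the slides) of the Laplace approximation to the normalization constant, using the assumed unnormalized density $g(x) = x^a e^{-bx}$ for $x > 0$, whose exact Z is $\Gamma(a+1)/b^{a+1}$; the parameters a, b are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

# Laplace approximation: Z ≈ g(x0) * sqrt(2*pi / c), where x0 is the mode of g
# and c = -d^2/dx^2 log g(x) at x0. For g(x) = x^a exp(-b x): x0 = a/b, c = a/x0^2.
a, b = 5.0, 2.0                       # assumed shape/rate parameters
g = lambda x: x**a * np.exp(-b * x)

x0 = a / b                            # mode of g
c = a / x0**2                         # curvature of -log g at the mode
Z_laplace = g(x0) * np.sqrt(2 * np.pi / c)

Z_exact = gamma(a + 1) / b**(a + 1)   # closed form
Z_numeric, _ = quad(g, 0, np.inf)     # numerical check
print(Z_laplace, Z_exact, Z_numeric)  # Laplace value is close to the exact 1.875
```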

Laplace approximation
Note that the approximation can be made for the predictive distribution itself,
$$P_{X|T}(x \mid D) = G(x, x^*, A_{X|T}^{-1}),$$
or for the parameter posterior,
$$P_{\Theta|T}(\theta \mid D) = G(\theta, \theta_{MAP}, A_{\Theta|T}^{-1}),$$
in which case
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, G(\theta, \theta_{MAP}, A_{\Theta|T}^{-1})\, d\theta.$$
This is clearly superior to the MAP approximation
$$P_{X|T}(x \mid D) = \int P_{X|\Theta}(x \mid \theta)\, \delta(\theta - \theta_{MAP})\, d\theta.$$

Other methods
There are two other main alternatives when this is not enough:
- variational approximations
- sampling methods (Markov chain Monte Carlo)
Variational approximations consist of bounding the intractable function and searching for the best bound. Sampling methods consist of designing a Markov chain that has the desired distribution as its equilibrium distribution and sampling from this chain. Sampling methods converge to the true distribution, but convergence is slow and hard to detect.
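A minimal Python sketch (not from the slides) of the simplest such sampler, random-walk Metropolis, whose chain has the target distribution as its equilibrium distribution; the target here reuses the assumed Gaussian-mean posterior from the MAP sketch above, and the step size, chain length, and seed are illustrative.

```python
import numpy as np

# Random-walk Metropolis: propose a symmetric step, accept with probability
# min(1, p(proposal) / p(current)); the chain's equilibrium is the target.
rng = np.random.default_rng(4)
D = rng.normal(loc=1.0, scale=1.0, size=5)
tau = 0.5

def log_target(theta):
    # unnormalized log-posterior: Gaussian likelihood + N(0, tau^2) prior
    return -0.5 * np.sum((D - theta) ** 2) - 0.5 * theta**2 / tau**2

samples, theta = [], 0.0
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.5)            # symmetric random walk
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal                                 # accept the move
    samples.append(theta)

burned = samples[5_000:]                                 # discard burn-in
print(np.mean(burned), np.std(burned))                   # posterior mean / std
```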
