MAP for Gaussian mean and variance

MAP for Gaussian mean and variance
Conjugate priors:
Mean: Gaussian prior
Variance: Wishart distribution
Prior for mean: P(μ) = N(η, λ²)

MAP for Gaussian mean (assuming known variance σ²)
MAP under the Gauss-Wishart prior: homework
Note: the MAP estimate is independent of σ² when λ² is a fixed multiple of σ² (e.g., λ² = σ²)
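
For reference, here is the standard posterior-mean result under these assumptions (prior μ ~ N(η, λ²), n observations x_1, ..., x_n with known variance σ²); since the posterior is Gaussian, its mode (the MAP estimate) equals its mean. This is the textbook conjugate-prior formula, not taken verbatim from the slide:

    \hat{\mu}_{MAP}
      = \frac{\frac{\eta}{\lambda^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i}
             {\frac{1}{\lambda^2} + \frac{n}{\sigma^2}}
      = \frac{\sigma^2 \eta + \lambda^2 \sum_{i=1}^{n} x_i}{\sigma^2 + n\lambda^2}

If λ² = σ², this reduces to (η + Σ_i x_i)/(n + 1): σ² cancels and the prior mean acts like one extra observation, which is the sense in which the estimate becomes independent of σ².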

Bayes Optimal Classifier
Aarti Singh
Machine Learning 10-701/15-781
Sept 15, 2010

Classification
Goal: e.g., classify documents as Sports / Science / News
Features X, labels Y
Probability of error

Optimal Classification
Optimal predictor (Bayes classifier): f* = argmin_f P(f(X) ≠ Y)
[Figure: plot of the class posterior (0 to 1, threshold 0.5) illustrating the Bayes risk]
Even the optimal classifier makes mistakes: R(f*) > 0
The optimal classifier depends on the unknown distribution

Optimal Classifier
Bayes rule: P(Y=y | X=x) = P(X=x | Y=y) P(Y=y) / P(X=x)
Optimal classifier: f*(x) = argmax_y P(X=x | Y=y) P(Y=y)
(class conditional density times class prior)

Example Decision Boundaries
Gaussian class conditional densities (1 dimension/feature)
[Figure: two class-conditional Gaussians and the resulting decision boundary]

Example Decision Boundaries
Gaussian class conditional densities (2 dimensions/features)
[Figure: 2-D Gaussian class-conditional densities and the resulting decision boundary]

Learning the Optimal Classifier
Optimal classifier: f*(x) = argmax_y P(X=x | Y=y) P(Y=y) (class conditional density times class prior)
Need to know:
the prior P(Y=y) for all y
the likelihood P(X=x | Y=y) for all x, y

Learning the Optimal Classifier
Task: predict whether or not a picnic spot is enjoyable
Training data: X = (X_1, X_2, X_3, ..., X_d), Y; n rows
Let's learn P(Y | X). How many parameters?
Prior P(Y=y) for all y: K − 1 parameters if there are K labels
Likelihood P(X=x | Y=y) for all x, y: (2^d − 1)K parameters if there are d binary features

Learning the Optimal Classifier
Task: predict whether or not a picnic spot is enjoyable
Training data: X = (X_1, X_2, X_3, ..., X_d), Y; n rows
Let's learn P(Y | X). How many parameters?
2^d·K − 1 in total (K classes, d binary features)
Need n >> 2^d·K − 1 training examples to learn all the parameters

Conditional Independence
X is conditionally independent of Y given Z: the probability distribution governing X is independent of the value of Y, given the value of Z:
P(X | Y, Z) = P(X | Z)
Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Note: this does NOT mean Thunder is independent of Rain

Conditional vs. Marginal Independence
C calls A and B separately and tells them a number n ∈ {1, ..., 10}
Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was. A thinks the number was n_a and B thinks it was n_b.
Are n_a and n_b marginally independent?
No; we expect, e.g., P(n_a = 1 | n_b = 1) > P(n_a = 1)
Are n_a and n_b conditionally independent given n?
Yes; if we know the true number, the outcomes n_a and n_b are determined purely by the noise in each phone:
P(n_a = 1 | n_b = 1, n = 2) = P(n_a = 1 | n = 2)
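
To make the phone example concrete, here is a small Monte Carlo sketch. The noise model (each listener hears the true number with probability 0.7, otherwise a uniformly random number) is an assumption for illustration, not part of the slide:

    import random

    def hear(n, p_correct=0.7):
        # Assumed noise model: correct with prob 0.7, otherwise uniform over 1..10.
        return n if random.random() < p_correct else random.randint(1, 10)

    random.seed(0)
    trials = 200_000
    samples = []
    for _ in range(trials):
        n = random.randint(1, 10)              # C picks a number
        samples.append((n, hear(n), hear(n)))  # A and B listen independently given n

    # Marginal: compare P(n_a = 1 | n_b = 1) with P(n_a = 1)
    p_na1 = sum(na == 1 for _, na, _ in samples) / trials
    nb1 = [na for _, na, nb in samples if nb == 1]
    p_na1_given_nb1 = sum(na == 1 for na in nb1) / len(nb1)

    # Conditional: compare P(n_a = 1 | n_b = 1, n = 2) with P(n_a = 1 | n = 2)
    n2 = [(na, nb) for n, na, nb in samples if n == 2]
    p_na1_given_n2 = sum(na == 1 for na, _ in n2) / len(n2)
    p_na1_given_nb1_n2 = sum(na == 1 for na, nb in n2 if nb == 1) / sum(nb == 1 for _, nb in n2)

    print(p_na1, p_na1_given_nb1)              # clearly different: marginally dependent
    print(p_na1_given_n2, p_na1_given_nb1_n2)  # approximately equal: conditionally independent given n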

Prediction using Conditional Independence
Predict Lightning from two conditionally independent features: Thunder, Rain
Number of parameters needed to learn the likelihood P(T, R | L): (2² − 1)·2 = 6
With the conditional independence assumption P(T, R | L) = P(T | L) P(R | L): (2 − 1)·2 + (2 − 1)·2 = 4

Naïve Bayes Assumption
Naïve Bayes assumption: features are independent given the class:
P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
More generally: P(X_1, ..., X_d | Y) = ∏_i P(X_i | Y)
Suppose X is composed of d binary features. How many parameters now? (2 − 1)dK vs. (2^d − 1)K
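
As a quick sanity check of these counts, here is a tiny Python calculation with an illustrative d and K (the numbers are only for illustration):

    # Likelihood parameters for d binary features and K classes.
    d, K = 30, 2
    full_joint  = (2**d - 1) * K   # P(X | Y) with no independence assumption
    naive_bayes = (2 - 1) * d * K  # P(X_i | Y) under the naive Bayes assumption
    print(full_joint)    # 2147483646
    print(naive_bayes)   # 60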

Naïve Bayes Classifier
Given:
the class prior P(Y)
d conditionally independent features X_1, ..., X_d given the class Y
for each X_i, the likelihood P(X_i | Y)
Decision rule: y* = argmax_y P(Y=y) ∏_i P(X_i = x_i | Y=y)
If the conditional independence assumption holds, NB is the optimal classifier; otherwise it can do worse.
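
A minimal sketch of this decision rule in log space (the dictionary layout for the prior and per-feature likelihoods is an assumption made for the example):

    import math

    def nb_predict(x, prior, likelihood):
        # x: tuple of feature values; prior: {y: P(Y=y)};
        # likelihood: {(i, value, y): P(X_i = value | Y = y)}
        def score(y):
            return math.log(prior[y]) + sum(
                math.log(likelihood[(i, xi, y)]) for i, xi in enumerate(x))
        return max(prior, key=score)   # argmax over classes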

Naïve Bayes Algorithm (discrete features)
Training data: n examples (x^(j), y^(j))
Maximum likelihood estimates:
class prior: P̂(Y=y) = (number of examples with y^(j) = y) / n
likelihood: P̂(X_i = x | Y=y) = (number of examples with x_i^(j) = x and y^(j) = y) / (number of examples with y^(j) = y)
NB prediction for test data: y* = argmax_y P̂(Y=y) ∏_i P̂(X_i = x_i | Y=y)
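
A hedged sketch of these counting estimates, using the same dictionary layout as nb_predict above (the data format and helper names are assumptions for illustration):

    from collections import Counter, defaultdict

    def nb_train_mle(X, Y):
        # X: list of feature tuples, Y: list of class labels.
        n = len(Y)
        class_counts = Counter(Y)
        prior = {y: c / n for y, c in class_counts.items()}

        feat_counts = defaultdict(Counter)          # (i, y) -> counts of values of X_i
        for x, y in zip(X, Y):
            for i, xi in enumerate(x):
                feat_counts[(i, y)][xi] += 1

        likelihood = {(i, xi, y): c / class_counts[y]
                      for (i, y), counter in feat_counts.items()
                      for xi, c in counter.items()}
        return prior, likelihood

    # Usage: prior, likelihood = nb_train_mle(X_train, Y_train)
    #        y_hat = nb_predict(x_test, prior, likelihood)

Note that the MLE table has no entry (i.e., probability zero) for a feature value never seen with a class, which is exactly the problem Subtlety 2 below points out.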

Subtlety 1: Violation of the NB Assumption
Usually, features are not conditionally independent: P(X_1, ..., X_d | Y) ≠ ∏_i P(X_i | Y)
The estimated posteriors P(Y | X) are then often pushed towards 0 or 1 (why?)
Nonetheless, NB is among the most widely used classifiers
NB often performs well even when the assumption is violated; [Domingos & Pazzani 96] discuss conditions for good performance

Subtlety 2: Insufficient training data
What if you never see a training instance where X_1 = a when Y = b?
e.g., Y = SpamEmail, X_1 = 'Earn'
Then the MLE gives P̂(X_1 = a | Y = b) = 0
Thus, no matter what values X_2, ..., X_d take: P(Y = b | X_1 = a, X_2, ..., X_d) = 0
What now?

MLE vs. MAP
What if we toss the coin too few times?
You say: probability the next toss is a head = 0
Billionaire says: "You're fired!" (with probability 1)
A Beta prior is equivalent to extra coin flips
As N → ∞, the prior is forgotten
But for small sample sizes, the prior is important!
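
For concreteness, here are the standard Beta-Bernoulli formulas this slide alludes to, with α_H and α_T denoting the prior's pseudo-counts (notation assumed, not from the slide):

    \hat{\theta}_{MLE} = \frac{N_H}{N_H + N_T},
    \qquad
    \hat{\theta}_{MAP} = \frac{N_H + \alpha_H - 1}{N_H + N_T + \alpha_H + \alpha_T - 2}
    \quad \text{for } \theta \sim \mathrm{Beta}(\alpha_H, \alpha_T)

The α's act exactly like extra coin flips; as N = N_H + N_T → ∞ the two estimates coincide, but for small N the prior keeps the estimate away from 0.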

Naïve Bayes Algorithm (discrete features)
Training data as before, but use Maximum A Posteriori estimates: add m virtual examples
Assume priors on the parameters; the MAP estimate adds m virtual examples with Y = b to the counts
Now, even if you never observe a feature value together with some class, the estimated probability is never zero.
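
A hedged sketch of the smoothed estimates, using additive (Laplace-style) smoothing in which every (feature value, class) pair receives m virtual examples; this is one common choice of prior, not necessarily the exact one used on the slide:

    from collections import Counter, defaultdict

    def nb_train_map(X, Y, feature_values, m=1):
        # feature_values[i]: the set of possible values of X_i (assumed known).
        n = len(Y)
        class_counts = Counter(Y)
        prior = {y: c / n for y, c in class_counts.items()}

        feat_counts = defaultdict(Counter)
        for x, y in zip(X, Y):
            for i, xi in enumerate(x):
                feat_counts[(i, y)][xi] += 1

        likelihood = {}
        for y in class_counts:
            for i, values in enumerate(feature_values):
                for v in values:                 # m virtual examples per value: never zero
                    likelihood[(i, v, y)] = (feat_counts[(i, y)][v] + m) / (
                        class_counts[y] + m * len(values))
        return prior, likelihood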

Case Study: Text Classification
Classify e-mails: Y = {Spam, NotSpam}
Classify news articles: Y = {what is the topic of the article?}
Classify webpages: Y = {Student, Professor, Project, ...}
What about the features X? The text!

Features X are the entire document: X_i is the i-th word in the article

NB for Text Classification
P(X | Y) is huge! An article has at least 1000 words: X = {X_1, ..., X_1000}
X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc.
The NB assumption helps a lot: P(X_i = x_i | Y = y) is just the probability of observing word x_i at the i-th position in a document on topic y

Bag of Words Model
Typical additional assumption: position in the document doesn't matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y)
Bag of words model: the order of words on the page is ignored
Sounds really silly, but often works very well!
Example: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room."

Bag of Words Model
Typical additional assumption: position in the document doesn't matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y)
Bag of words model: the order of words on the page is ignored
Sounds really silly, but often works very well!
The same sentence as a bag of words: "in is lecture lecture next over person remember room sitting the the the to to up wake when you"

Bag of Words Approach
Example word counts for one document:
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
Zaire 0
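
A minimal bag-of-words sketch that turns a document into word counts over a fixed vocabulary (the crude tokenizer and the toy vocabulary are assumptions for illustration):

    import re
    from collections import Counter

    def bag_of_words(text, vocabulary):
        words = re.findall(r"[a-z']+", text.lower())          # crude tokenizer
        counts = Counter(w for w in words if w in vocabulary)
        return {w: counts[w] for w in vocabulary}

    vocab = ["aardvark", "about", "lecture", "person", "the", "wake"]
    doc = ("When the lecture is over, remember to wake up the person "
           "sitting next to you in the lecture room.")
    print(bag_of_words(doc, vocab))   # {'aardvark': 0, ..., 'lecture': 2, 'the': 3, ...}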

NB with Bag of Words for Text Classification
Learning phase: estimate the class prior P(Y) and the word likelihoods P(X_i | Y) (explore in HW)
Test phase: for each document, apply the naïve Bayes decision rule
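
Putting the pieces together, a hedged sketch of the test-phase decision rule over word counts; it assumes the learning phase produced prior[y] and word_prob[y][w] = P(word w | Y = y), estimated with smoothed counts so that no probability is zero:

    import math

    def classify_document(word_counts, prior, word_prob):
        # word_counts: {word: count} for one document (e.g. from bag_of_words above)
        def score(y):
            return math.log(prior[y]) + sum(
                c * math.log(word_prob[y][w]) for w, c in word_counts.items())
        return max(prior, key=score)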

Twenty Newsgroups results
[Figure: classification results on the 20 Newsgroups dataset]

Learning curve for Twenty Newsgroups
[Figure: accuracy vs. amount of training data on the 20 Newsgroups dataset]

What if features are continuous?
E.g., character recognition: X_i is the intensity at the i-th pixel
Gaussian Naïve Bayes (GNB): P(X_i = x | Y = k) = N(x; μ_ik, σ_ik²)
Different mean and variance for each class k and each pixel i.
Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ)

Estimating parameters: Y discrete, X_i continuous
Maximum likelihood estimates (x_i^(j) is the i-th pixel of the j-th training image, k indexes the class):
μ̂_ik = average of x_i^(j) over the training images j with y^(j) = k
σ̂_ik² = average of (x_i^(j) − μ̂_ik)² over the training images j with y^(j) = k
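
A hedged end-to-end sketch of Gaussian NB with these MLE formulas (plain Python lists, no external libraries; the small variance floor is an added safeguard, not from the slide):

    import math
    from collections import defaultdict

    def gnb_train(X, Y):
        # X: list of feature lists (continuous values), Y: list of class labels.
        by_class = defaultdict(list)
        for x, y in zip(X, Y):
            by_class[y].append(x)
        prior, params = {}, {}
        for y, rows in by_class.items():
            prior[y] = len(rows) / len(Y)
            cols = list(zip(*rows))                        # one tuple per feature i
            means = [sum(c) / len(c) for c in cols]
            variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                         for c, m in zip(cols, means)]     # 1e-9 avoids zero variance
            params[y] = (means, variances)
        return prior, params

    def gnb_predict(x, prior, params):
        def log_gauss(v, m, s2):
            return -0.5 * (math.log(2 * math.pi * s2) + (v - m) ** 2 / s2)
        def score(y):
            means, variances = params[y]
            return math.log(prior[y]) + sum(
                log_gauss(v, m, s2) for v, m, s2 in zip(x, means, variances))
        return max(prior, key=score)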

Example: GNB for classifying mental states [Mitchell et al.]
fMRI images: ~1 mm resolution, ~2 images per second, 15,000 voxels per image
Non-invasive and safe; measures the Blood Oxygen Level Dependent (BOLD) response

Gaussian Naïve Bayes: learned μ_(voxel, word) [Mitchell et al.]
15,000 voxels (features); 10 training examples or subjects per class

Learned Naïve Bayes Models: means for P(BrainActivity | WordCategory) [Mitchell et al.]
Pairwise classification accuracy: 85%
[Figure: learned mean activations for "People" words vs. "Animal" words]

What you should know
Optimal decision using the Bayes classifier
Naïve Bayes classifier: what the assumption is, why we use it, how we learn it, why Bayesian estimation is important
Text classification: bag of words model
Gaussian NB: features are still conditionally independent; each feature has a Gaussian distribution given the class