A crash course in probability and Naïve Bayes classification

Chapter 9

Probability theory

Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: a person's height, the outcome of a coin toss. We distinguish between discrete and continuous random variables.

The distribution of a discrete random variable is given by the probabilities of each value it can take. Notation: P(X = x_i). These numbers satisfy

Σ_i P(X = x_i) = 1

A joint probability distribution for two variables is a table, from which marginal and conditional probabilities can be read off. If the two variables are binary, how many parameters does it have? (Three: the four entries must sum to 1.) What about the joint probability of d variables, P(X_1, ..., X_d)? How many parameters does it have if each variable is binary? (2^d − 1.)
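
To make these definitions concrete, here is a small sketch (not part of the original slides) that stores a joint distribution over two binary variables as a 2×2 table and derives the marginal and conditional probabilities from it; the particular numbers are made up for illustration.

```python
import numpy as np

# Joint distribution P(X, Y) for two binary variables, stored as a 2x2 table.
# The entries are arbitrary illustrative values; they just need to sum to 1.
joint = np.array([[0.3, 0.1],   # P(X=0, Y=0), P(X=0, Y=1)
                  [0.2, 0.4]])  # P(X=1, Y=0), P(X=1, Y=1)

# Marginalization: P(X) = sum_Y P(X, Y), and P(Y) = sum_X P(X, Y)
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Conditional probability: P(Y | X) = P(X, Y) / P(X)
p_y_given_x = joint / p_x[:, None]

print("P(X):", p_x)                   # [0.4 0.6]
print("P(Y):", p_y)                   # [0.5 0.5]
print("P(Y | X=1):", p_y_given_x[1])  # [0.333... 0.666...]
```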

The Rules of Probability

Marginalization: P(X) = Σ_Y P(X, Y)
Product rule: P(X, Y) = P(Y | X) P(X)
Independence: X and Y are independent if P(Y | X) = P(Y). This implies P(X, Y) = P(X) P(Y).

Using probability in learning

We are interested in P(Y | X), where X are the features and Y is the label. For example, when classifying spam, we could estimate P(Y | Viagra, lottery) and classify an example as spam if P(Y | X) > 0.5. However, it is usually easier to model P(X | Y).

Maximum likelihood

Fit a probabilistic model P(x | θ) to data by estimating θ. Given independent, identically distributed (i.i.d.) data X = (x_1, x_2, ..., x_n):

Likelihood: P(X | θ) = P(x_1 | θ) P(x_2 | θ) ··· P(x_n | θ)
Log likelihood: ln P(X | θ) = Σ_{i=1}^n ln P(x_i | θ)

Maximum likelihood solution: the parameters θ that maximize ln P(X | θ).
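
The maximum likelihood idea can be illustrated with a short sketch (my addition, not from the slides): evaluate the Bernoulli log likelihood of some made-up coin-toss data on a grid of candidate parameters and pick the maximizer. The closed-form answer derived on the next slide is simply the fraction of heads.

```python
import numpy as np

# A minimal sketch of maximum likelihood estimation for a Bernoulli model.
# The data below are invented: 1 = heads, 0 = tails.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(theta, x):
    """ln P(X | theta) = sum_i ln P(x_i | theta) for i.i.d. Bernoulli data."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Evaluate the log likelihood on a grid of candidate parameters and pick the best.
grid = np.linspace(0.01, 0.99, 99)
theta_ml = grid[np.argmax([log_likelihood(t, data) for t in grid])]

print(theta_ml)     # ~0.7, i.e. the fraction of heads in the data
print(data.mean())  # 0.7, the closed-form answer derived on the next slide
```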

Example: coin toss

Estimate the probability p that a coin lands heads using the result of n coin tosses, h of which resulted in heads.

The likelihood of the data: P(X | p) = p^h (1 − p)^(n−h)
Log likelihood: ln P(X | p) = h ln p + (n − h) ln(1 − p)
Taking the derivative and setting it to 0:
∂ ln P(X | p) / ∂p = h/p − (n − h)/(1 − p) = 0  ⇒  p = h/n

Bayes rule

From the product rule, P(Y, X) = P(Y | X) P(X) and P(Y, X) = P(X | Y) P(Y). Therefore

P(Y | X) = P(X | Y) P(Y) / P(X)

This is known as Bayes rule: the posterior equals the likelihood times the prior, divided by the evidence P(X). P(X) can be computed as

P(X) = Σ_Y P(X | Y) P(Y)

but it is not important for inferring a label.

Maximum a-posteriori and maximum likelihood

The maximum a posteriori (MAP) rule:

y_MAP = arg max_Y P(Y | X) = arg max_Y P(X | Y) P(Y) / P(X) = arg max_Y P(X | Y) P(Y)

If we ignore the prior distribution or assume it is uniform, we obtain the maximum likelihood rule:

y_ML = arg max_Y P(X | Y)

A classifier that has access to P(Y | X) is a Bayes optimal classifier.
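
As a small illustration of Bayes rule and of the difference between the MAP and ML decision rules, here is a sketch with invented priors and likelihoods (the numbers are not from the slides):

```python
# Bayes rule and the MAP vs. ML decision rules on made-up numbers.
prior = {"spam": 0.2, "ham": 0.8}        # P(Y)
likelihood = {"spam": 0.7, "ham": 0.3}   # P(X | Y) for one observed X

# Evidence: P(X) = sum_Y P(X | Y) P(Y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior via Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

y_map = max(prior, key=lambda y: likelihood[y] * prior[y])  # MAP: uses the prior
y_ml = max(prior, key=lambda y: likelihood[y])              # ML: ignores the prior

print(posterior)    # {'spam': 0.368..., 'ham': 0.631...}
print(y_map, y_ml)  # 'ham' under MAP, 'spam' under ML: the prior can flip the decision
```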

Naïve Bayes classifier

We would like to model P(X | Y), where X is a feature vector and Y is its associated label. Example task: predict whether or not a picnic spot is enjoyable, given training data of n rows of the form X = (X_1, X_2, X_3, ..., X_d) with label Y.

How many parameters does this require? Prior P(Y): k − 1 parameters for k classes. Likelihood P(X | Y): (2^d − 1)k parameters for binary features.

Simplifying assumption: conditional independence. Given the class label, the features are independent, i.e.

P(X | Y) = P(x_1 | Y) P(x_2 | Y) ··· P(x_d | Y)

How many parameters now? dk + k − 1.

Naïve Bayes decision rule:

y_NB = arg max_Y P(X | Y) P(Y) = arg max_Y P(Y) Π_{i=1}^d P(x_i | Y)

If conditional independence holds, NB is an optimal classifier!
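
The parameter counts above can be checked with a tiny sketch (assuming binary features, following the counting on the slide):

```python
# Number of parameters with and without the conditional independence assumption,
# for d binary features and k classes.
def full_joint_params(d, k):
    """Prior: k - 1; likelihood P(X | Y): (2^d - 1) parameters per class."""
    return (k - 1) + (2**d - 1) * k

def naive_bayes_params(d, k):
    """Prior: k - 1; likelihood: one Bernoulli parameter per feature per class."""
    return (k - 1) + d * k

for d in (10, 20, 30):
    print(d, full_joint_params(d, 2), naive_bayes_params(d, 2))
# At d = 30 the full joint already needs over two billion parameters,
# but only 61 under the Naive Bayes assumption.
```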

Training a Naïve Bayes classifier

Training data: feature matrix X (n × d) and labels y_1, ..., y_n.

Maximum likelihood estimates:

Class prior: P̂(y) = |{i : y_i = y}| / n
Likelihood:  P̂(x_j | y) = P̂(x_j, y) / P̂(y) = ( |{i : X_ij = x_j, y_i = y}| / n ) / ( |{i : y_i = y}| / n )

Email classification

Suppose our vocabulary contains three words a, b and c, and we use a multivariate Bernoulli model for our e-mails, with parameters

θ⁺ = (0.5, 0.67, 0.33) for spam (+) and θ⁻ = (0.67, 0.33, 0.33) for ham (−).

This means, for example, that the presence of b is twice as likely in spam (+) compared with ham (−). The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain the likelihoods

P(x | +) = 0.5 × 0.67 × (1 − 0.33) = 0.222
P(x | −) = 0.67 × 0.33 × (1 − 0.33) = 0.148

The ML classification of x is thus spam.

Email classification: training data

E-mail  a?  b?  c?  Class
e1      0   1   0   +
e2      0   1   1   +
e3      1   0   0   +
e4      1   1   0   +
e5      1   1   0   −
e6      1   0   1   −
e7      1   0   0   −
e8      0   0   0   −

What are the parameters of the model?
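
A short sketch reproducing the likelihood computation in this example; it uses the exact fractions 2/3 and 1/3 that the slide rounds to 0.67 and 0.33, so the printed values match the slide's 0.222 and 0.148.

```python
import numpy as np

# Multivariate Bernoulli likelihoods for the e-mail example (vocabulary {a, b, c}).
theta_spam = np.array([0.5, 2/3, 1/3])  # P(word present | spam), rounded on the slide
theta_ham  = np.array([2/3, 1/3, 1/3])  # P(word present | ham)

x = np.array([1, 1, 0])  # contains a and b, but not c

def bernoulli_likelihood(x, theta):
    """P(x | class) = prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i)"""
    return np.prod(theta**x * (1 - theta)**(1 - x))

print(bernoulli_likelihood(x, theta_spam))  # 0.222...
print(bernoulli_likelihood(x, theta_ham))   # 0.148...
# The maximum likelihood classification of x is therefore spam.
```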

Email classification: training data (continued)

Applying the maximum likelihood estimates above to this data set, the parameters of the model are:

P(+) = 0.5,      P(−) = 0.5
P(a | +) = 0.5,  P(a | −) = 0.75
P(b | +) = 0.75, P(b | −) = 0.25
P(c | +) = 0.25, P(c | −) = 0.25

Comments on Naïve Bayes

Usually the features are not conditionally independent, i.e.

P(X | Y) ≠ P(x_1 | Y) P(x_2 | Y) ··· P(x_d | Y)

And yet it is one of the most widely used classifiers. It is easy to train, and it often performs well even when the assumption is violated.

Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning, 29, 103-130.

When there are few training examples

What if you never see a training example where x_1 = a (i.e., word a is present) when y = spam? Then P(a | spam) = 0, and hence

P(x | spam) = P(a | spam) P(b | spam) P(c | spam) = 0

for any e-mail containing a. What to do? Add virtual examples for which x_1 = a when y = spam.
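
The "virtual examples" fix is commonly implemented as Laplace (add-one) smoothing. Here is a sketch on the training data above; the function name and the α parameter are illustrative choices, not from the slides.

```python
import numpy as np

# Add alpha virtual examples for each feature value in each class, so that no
# estimated probability is ever exactly 0.
X = np.array([[0,1,0],[0,1,1],[1,0,0],[1,1,0],   # spam (+)
              [1,1,0],[1,0,1],[1,0,0],[0,0,0]])  # ham  (-)
y = np.array([1,1,1,1,0,0,0,0])

def smoothed_estimates(X, y, label, alpha=1):
    """P(word present | class), with alpha virtual examples per feature value."""
    Xc = X[y == label]
    return (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)

print(smoothed_estimates(X, y, 1))           # smoothed P(a|+), P(b|+), P(c|+)
print(smoothed_estimates(X, y, 1, alpha=0))  # unsmoothed ML estimates: 0.5, 0.75, 0.25
```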

Naïve Bayes for continuous variables

Need to talk about continuous distributions! The probability of the random variable assuming a value within some given interval from x_1 to x_2 is defined to be the area under the graph of the probability density function between x_1 and x_2.

(Figures: probability density functions f(x) of the uniform and the normal/Gaussian distributions, with the interval from x_1 to x_2 marked.)

Expectations

Discrete variables:   E[f] = Σ_x P(x) f(x)
Continuous variables: E[f] = ∫ p(x) f(x) dx
Conditional expectation (discrete): E_x[f | y] = Σ_x P(x | y) f(x)
Approximate expectation (discrete and continuous): E[f] ≈ (1/N) Σ_{n=1}^N f(x_n), where the x_n are drawn from the distribution.

The Gaussian (normal) distribution:

N(x | μ, σ²) = 1/√(2πσ²) exp( −(x − μ)² / (2σ²) )
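
A quick numerical sketch of the approximate expectation for a Gaussian random variable; the distribution parameters and the function f are arbitrary choices for illustration.

```python
import numpy as np

# Approximate E[f(x)] by the sample mean (1/N) * sum_n f(x_n),
# with x_n drawn from a Gaussian distribution.
rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
samples = rng.normal(mu, sigma, size=100_000)

f = lambda x: x**2
print(np.mean(f(samples)))  # ~4.25, close to the exact E[x^2] = mu^2 + sigma^2
```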

Properties of the Gaussian distribution

About 68.26% of draws lie within one standard deviation of the mean (μ ± 1σ), 95.44% within two (μ ± 2σ), and 99.72% within three (μ ± 3σ).

Standard Normal Distribution

A random variable having a normal distribution with a mean of 0 and a standard deviation of 1 is said to have a standard normal probability distribution.

Converting to the Standard Normal Distribution:

z = (x − μ) / σ

We can think of z as a measure of the number of standard deviations x is from μ.

Gaussian Parameter Estimation

Likelihood function for N i.i.d. observations x_1, ..., x_N:

p(x_1, ..., x_N | μ, σ²) = Π_{n=1}^N N(x_n | μ, σ²)
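
A small numerical check of the percentages above and of the z-score transform, using simulated data (the mean and standard deviation are arbitrary):

```python
import numpy as np

# Verify the 68-95-99.7 rule empirically via the standard normal transform.
rng = np.random.default_rng(1)
mu, sigma = 10.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

z = (x - mu) / sigma  # convert to the standard normal scale
for k in (1, 2, 3):
    print(k, np.mean(np.abs(z) <= k))  # ~0.683, ~0.954, ~0.997
```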

Maximum (Log) Likelihood

Maximizing the log likelihood with respect to μ and σ² gives the sample mean and the sample variance:

μ_ML = (1/N) Σ_{n=1}^N x_n,   σ²_ML = (1/N) Σ_{n=1}^N (x_n − μ_ML)²

Gaussian models

Assume we have data that belongs to three classes, and assume a likelihood that follows a Gaussian distribution.

Gaussian Naïve Bayes

Likelihood function:

P(X_i = x | Y = y_k) = 1/√(2πσ²_ik) exp( −(x − μ_ik)² / (2σ²_ik) )

We need to estimate a mean and a variance for each feature in each class.
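
Putting the pieces together, here is a minimal Gaussian Naïve Bayes sketch. It uses two classes of made-up two-dimensional data (rather than the three classes mentioned above): estimate a mean and a variance per feature per class, then classify with arg max over log P(y) + Σ_i log N(x_i | μ_iy, σ²_iy).

```python
import numpy as np

# Toy data: two classes, two features, invented for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Maximum likelihood estimates: prior, and a mean and variance per feature per class.
classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
means  = np.array([X[y == c].mean(axis=0) for c in classes])
vars_  = np.array([X[y == c].var(axis=0) for c in classes])

def predict(x):
    # log P(y) + sum_i log N(x_i | mu, var), computed in log space for stability
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * vars_), axis=1)
                - 0.5 * np.sum((x - means)**2 / vars_, axis=1))
    return classes[np.argmax(log_post)]

print(predict(np.array([0.2, -0.1])))  # expected: class 0
print(predict(np.array([2.8, 3.3])))   # expected: class 1
```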

Summary

Naïve Bayes classifier:
- What the assumption is
- Why we make it
- How we learn it

Naïve Bayes for discrete data

Gaussian Naïve Bayes