COS 424: Interacting with Data
Lecturer: David Blei
Lecture #4
Scribes: Wei Ho, Michael Ye
February 14, 2008

1 Maximum likelihood estimation

1.1 MLE of a Bernoulli random variable (coin flips)

Given N flips of a coin, the maximum likelihood estimate (MLE) of the bias of the coin is

    \hat{\pi} = \frac{\text{number of heads}}{N}    (1)

One of the reasons we like to use the MLE is that it is consistent: in the example above, as the number of coin flips N approaches infinity, the MLE \hat{\pi} approaches the true bias \pi, as we can see from the graph below.

1.2 MLE of a Gaussian random variable

The parameters of a Gaussian distribution are the mean (\mu) and variance (\sigma^2). Given observations x_1, ..., x_N, the likelihood of those observations for a given \mu and \sigma^2 (assuming the observations came from a Gaussian distribution) is

    p(x_1, ..., x_N \mid \mu, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{(x_n - \mu)^2}{2\sigma^2} \right\}    (2)

and the log likelihood is

    L(\mu, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2    (3)

We can then find the values of \mu and \sigma^2 that maximize the log likelihood by taking the derivative with respect to the desired variable and solving the resulting equation. Doing so, we find that the MLE of the mean is

    \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n    (4)

and the MLE of the variance is

    \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2    (5)
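The estimators in equations (1), (4), and (5) can be checked with a few lines of Python on a made-up sample (all numbers below are hypothetical):

```python
# Hypothetical data: ten coin flips (1 = heads) and a small real-valued sample.
flips = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
xs = [2.0, 4.0, 6.0, 8.0]

# Equation (1): the MLE of the Bernoulli bias is the fraction of heads.
pi_hat = sum(flips) / len(flips)

# Equations (4) and (5): the MLE of the Gaussian mean and variance.
N = len(xs)
mu_hat = sum(xs) / N
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / N

print(pi_hat)       # 0.7
print(mu_hat)       # 5.0
print(sigma2_hat)   # 5.0
```

Note that equation (5) divides by N rather than N - 1; the MLE of the variance is biased, which is why it differs from the usual sample variance.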

1.3 Gaussian MLE case study

In the graph above, we have plotted the annual presidential approval ratings along with the Gaussian distribution fitted to the sample mean and variance. However, there are three main problems with using a Gaussian model to analyze this data. First, the approval ratings are restricted to between 0 and 100, while the Gaussian density is defined over all real numbers. Second, treating the approval ratings as separate data points ignores the sequential nature of the data. Lastly, the Gaussian model assumes that the data points are independent and identically distributed, when in fact we need to account for the fact that the term of a presidency is four or eight years, so there are years in which the ratings are correlated. All models are merely cartoons of the data we are analyzing, and it is hard to find a model that describes all of the data. As the statistician George Box once said, "All models are wrong, but some models are useful."

2 Naive Bayes

2.1 Classification

Classification algorithms work on labeled training data. A learning algorithm takes the data and produces a classifier: a function that takes in new unlabeled test examples and outputs predictions about their labels, based on the patterns in the training data. An example is spam classification, where the data are e-mails and the labels are whether each e-mail is spam or not (non-spam is also called ham).

2.1.1 Binary text classification

In binary text classification, we are given documents and their classes {w_{d,1:N}, c_d}, where w_{d,1:N} are the N words in document d and c_d is the class of the document. For example, if the documents are e-mails, then the classes are whether each e-mail is spam or ham. Our goal is to build a classifier that can predict the class of a new document. To do so, we divide the data into a training set and a testing set.
The purpose of holding out a portion of the data as a testing set is to allow us to validate the classifier on data it has never seen. After all, it would be cheating to both train and test the classifier on the same data set. Thus, we use the labeled testing set, which the classifier did not see during training, to

evaluate the classifier. One good evaluation metric is accuracy, defined as

    \text{accuracy} = \frac{\text{\# correctly predicted labels in the test set}}{\text{\# of instances in the test set}}    (6)

2.2 Naive Bayes model

The naive Bayes model defines a joint distribution over words and classes, that is, the probability that a sequence of words belongs to some class c:

    p(c, w_{1:N} \mid \pi, \theta) = p(c \mid \pi) \prod_{n=1}^{N} p(w_n \mid \theta_c)    (7)

Here \pi is the probability distribution over the classes (in this case, the probability of seeing spam e-mails and of seeing ham e-mails); it is a discrete distribution over the classes that sums to 1. \theta_c is the class-conditional distribution: given a class, the probability distribution over the words in the vocabulary. Each class has a probability for each word in the vocabulary (in this case, one set of word probabilities for the spam class and one for the ham class). Given a new test example, we can classify it under the model by computing the conditional probability of each class given the words in the e-mail, p(c \mid w_{1:N}), and seeing which class is most likely.

There is an implicit independence assumption behind the model: since we multiply the p(w_n \mid \theta_c) terms together, the words in a document are assumed to be conditionally independent given the class.

2.3 Class prediction with naive Bayes

We classify using the posterior distribution of the classes given the words, which is proportional to the joint distribution, since P(X \mid Y) = P(X, Y)/P(Y) \propto P(X, Y). The posterior distribution is

    p(c \mid w_{1:N}, \pi, \theta) \propto p(c \mid \pi) \prod_{n=1}^{N} p(w_n \mid \theta_c)    (8)

Note that we do not need the normalizing constant because we only care about which probability is larger. The classifier assigns the e-mail to the class with the highest posterior probability.

2.4 Fitting a naive Bayes model with maximum likelihood

We find the parameters used in class prediction by learning from the labeled training data (in this case, the ham and spam e-mails).
We use the maximum likelihood method, finding the parameters that maximize the likelihood of the observed data set. Given data {w_{d,1:N}, c_d}_{d=1}^{D}, the likelihood under the model with parameters (\theta_{1:C}, \pi) is

    p(\mathcal{D} \mid \theta_{1:C}, \pi) = \prod_{d=1}^{D} \left( p(c_d \mid \pi) \prod_{n=1}^{N} p(w_{d,n} \mid \theta_{c_d}) \right)

    = \prod_{d=1}^{D} \prod_{c=1}^{C} \left[ \pi_c^{1(c_d = c)} \prod_{n=1}^{N} \prod_{v=1}^{V} \theta_{c,v}^{1(c_d = c)\,1(w_{d,n} = v)} \right]

which is a product over the e-mails (D), the categories (C), the words in each e-mail (N), and the words in the vocabulary (V). The 1(c_d = c) term filters out e-mails that are not in the given category, and the 1(w_{d,n} = v) term filters out the vocabulary words that do not appear in the e-mail. Taking the logarithm of each side gives

    L(\pi, \theta_{1:C}; \mathcal{D}) = \sum_{d=1}^{D} \sum_{c=1}^{C} 1(c_d = c) \log \pi_c + \sum_{d=1}^{D} \sum_{c=1}^{C} \sum_{n=1}^{N} \sum_{v=1}^{V} 1(c_d = c)\,1(w_{d,n} = v) \log \theta_{c,v}

and since the logarithm is a monotonic function, maximizing the log likelihood is the same as maximizing the likelihood of the data. Taking the log decomposes the likelihood into the two separate parts of the model: one term contains only \pi and the other contains only \theta_{c,v}, so taking the derivative with respect to one parameter eliminates the other. To find the maximizing values of \pi and \theta_{c,v}, take the derivative with respect to each and set it to zero. This leads to the maximum likelihood estimates

    \hat{\pi}_c = \frac{n_c}{D}    (9)

and

    \hat{\theta}_{c,v} = \frac{n_{c,v}}{\sum_{v'} n_{c,v'}}    (10)

where n_c is the number of times class c appears and n_{c,v} is the number of times word v appears in class c.

2.5 Full procedure

The general outline for building and using the model is as follows:

- Estimate the model from the training set.
- Predict the class of each test example.
- Compute the accuracy.

2.6 Naive Bayes case study

We consider an example of using the naive Bayes model for spam filtering. The e-mail data set comes from Enron's subpoenaed e-mails. The training set contains 10,000 e-mails (all labeled as spam or ham) and the test set contains 1,000 e-mails. Preprocessing removed common words such as "the" from the data set. In the graph below, the x-axis represents training size and the y-axis represents accuracy. Note the sensitivity of model performance to training set size.
Test accuracy starts out poor, improves as the training size increases, and converges at around 70% accuracy. The training accuracy, on the other hand, starts out very high and then drops off. This is due to overfitting: when training on only a few e-mails, the word distributions can perfectly describe the training e-mails.
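The full procedure of Section 2.5 can be sketched in Python on a toy data set. Everything below (documents, words, labels) is hypothetical. The prediction follows equation (8) but is computed in log space to avoid numerical underflow, a standard implementation practice not covered in the notes, and the word probabilities use the \lambda-smoothed estimate of equation (11) from the next section, since the unsmoothed MLE of equation (10) would assign zero probability to unseen word/class pairs:

```python
import math
from collections import Counter, defaultdict

# Toy labeled training set (entirely hypothetical); each document is a word list.
train = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "cheap", "watches"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["lunch", "tomorrow"], "ham"),
]
vocab = sorted({w for words, _ in train for w in words})
V = len(vocab)
lam = 1.0  # Laplace smoothing, anticipating equation (11)

# n_c: number of documents per class; n_{c,v}: word counts per class.
n_c = Counter(c for _, c in train)
n_cv = defaultdict(Counter)
for words, c in train:
    n_cv[c].update(words)

# Equation (9): class priors; equation (11): smoothed word probabilities.
pi_hat = {c: n_c[c] / len(train) for c in n_c}
theta_hat = {
    c: {v: (n_cv[c][v] + lam) / (sum(n_cv[c].values()) + V * lam) for v in vocab}
    for c in n_c
}

def predict(words):
    # Equation (8): argmax_c of log p(c) + sum_n log p(w_n | c).
    # Words outside the training vocabulary are simply skipped.
    scores = {
        c: math.log(pi_hat[c])
        + sum(math.log(theta_hat[c][w]) for w in words if w in theta_hat[c])
        for c in pi_hat
    }
    return max(scores, key=scores.get)

print(predict(["buy", "cheap", "pills"]))  # spam
print(predict(["agenda", "tomorrow"]))     # ham
```

Computing accuracy as in equation (6) then amounts to running predict over a held-out test set and counting agreements with the true labels.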

2.7 Smoothing

Suppose that in the above case study a rare word like "peanut" appears in one spam e-mail and in none of the ham e-mails. In that case, \theta_{spam,peanut} will be some small value, but \theta_{ham,peanut} will be 0. Since the probabilities are multiplied together, any e-mail containing the word "peanut" can never be classified as ham, which is clearly a ludicrous conclusion. To resolve this, we use smoothing to make the probability distribution more reasonable by adding some number \lambda to the per-word class counts. As a result, all words have non-zero probabilities, while the more frequent words have slightly decreased probabilities, leading to a smoother overall distribution. The new estimate is

    \hat{\theta}_{c,w} = \frac{n_{c,w} + \lambda}{\sum_{w'} n_{c,w'} + V\lambda}    (11)

When \lambda = 1, this is called Laplace smoothing; when \lambda = 0.5, it is called Jeffreys smoothing. In the graph below, the x-axis is \lambda and the y-axis is accuracy. We can see that adding even a little smoothing results in a large increase in accuracy. It seems that the problem of

misclassification based on the appearance of rare words is fairly significant.

2.8 Questions about naive Bayes

This model assumes that the future looks like the past, that is, that the test cases come from the same distribution as the training cases. What other assumptions are made? What effect do the assumptions have on the classifier? What is strange about the naive Bayes model of text? Can naive Bayes be adapted to other kinds of data, e.g., vectors of reals? We will address these questions next time.
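To make the zero-probability problem of Section 2.7 concrete, here is a small numeric sketch of equation (11); all the counts below are hypothetical:

```python
# Hypothetical counts: "peanut" appears in no ham e-mails, with a
# 1,000-word vocabulary and 5,000 total word occurrences in the ham class.
V = 1000
n_ham_total = 5000
n_ham_peanut = 0

def theta(n_cw, n_c_total, lam):
    # Equation (11): smoothed class-conditional word probability.
    return (n_cw + lam) / (n_c_total + V * lam)

print(theta(n_ham_peanut, n_ham_total, 0.0))  # 0.0: "peanut" can never occur in ham
print(theta(n_ham_peanut, n_ham_total, 1.0))  # small but non-zero (Laplace smoothing)
```

With \lambda = 0 the estimate collapses to the unsmoothed MLE of equation (10) and any e-mail containing "peanut" is barred from the ham class; with \lambda = 1 the word instead gets a small non-zero probability of 1/6000.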