Bayes and Naïve Bayes. CS534 Machine Learning





Bayes Classifier
A generative model learns P(x | y) and P(y). Prediction is made by y* = argmax_y P(y | x), where P(y | x) ∝ P(x | y) P(y). This is often referred to as the Bayes classifier because of its use of Bayes' rule. In LDA we assumed that P(x | y) is Gaussian with different means and a shared covariance matrix.

Joint Density Estimation
More generally, learning P(x | y) is a density estimation problem, which is a challenging task. Consider the case where x is a d-dimensional binary vector. Learning the joint distribution of x involves estimating 2^d − 1 parameters per class. For large d this number is prohibitively large; with d = 30 binary features there are already over a billion parameters. Typically we don't have enough data to estimate the joint distribution accurately. It is common to encounter the situation where no training example has a particular value combination (x_1, x_2, ..., x_d); the estimated probability of that combination is then 0.

Naïve Bayes Assumption
Assume that each feature is independent of the others given the class label. Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, i.e., P(X | Y, Z) = P(X | Z). This is often denoted X ⊥ Y | Z. For example, if X_1 and X_2 are conditionally independent given Y, we have P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y).

Naïve Bayes Classifier
Under the Naïve Bayes assumption, we have P(x | y) = P(x_1 | y) P(x_2 | y) ... P(x_d | y). Thus there is no need to estimate the joint distribution; instead, we only need to estimate P(x_i | y) for each feature. Example: with d binary features, we reduce the number of parameters per class from 2^d − 1 to d. This significantly reduces overfitting.
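To make the product-of-conditionals prediction concrete, here is a minimal Python sketch (not from the slides); the names priors and cond_prob and the toy numbers are purely illustrative.

import numpy as np

def predict_bernoulli_nb(x, priors, cond_prob):
    """Predict the class of a binary feature vector x.

    priors[c]       -- estimate of P(y = c)
    cond_prob[c][i] -- estimate of P(x_i = 1 | y = c)
    A sum of logs is used instead of a product of probabilities
    to avoid numerical underflow for long feature vectors.
    """
    best_class, best_score = None, -np.inf
    for c, prior in priors.items():
        score = np.log(prior)
        for xi, p in zip(x, cond_prob[c]):
            score += np.log(p) if xi == 1 else np.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# toy parameters for two classes (0 = non-spam, 1 = spam)
priors = {0: 0.7, 1: 0.3}
cond_prob = {0: [0.05, 0.40, 0.20], 1: [0.60, 0.30, 0.80]}
print(predict_bernoulli_nb([1, 0, 1], priors, cond_prob))   # -> 1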

Case Study: Spam Filtering
Use a bag-of-words representation to describe emails: represent an email by a vector whose dimension equals the number of words in our vocabulary. Example with Bernoulli features: x_i = 1 if the i-th word is present in the email, and x_i = 0 if it is not. The ordering/position of the words does not matter.
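As an illustration (not part of the slides), a Bernoulli bag-of-words vector could be built as follows; the four-word vocabulary is a toy example.

def email_to_bernoulli_vector(email_text, vocabulary):
    """x[i] = 1 iff the i-th vocabulary word appears in the email at least once."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["free", "money", "meeting", "tomorrow"]
print(email_to_bernoulli_vector("free money claim your free prize", vocabulary))
# -> [1, 1, 0, 0]   only presence/absence is recorded, not counts or positions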

MLE for Naïve Bayes with the Bernoulli Model
Suppose our training set contains N emails, of which N_1 are spam (y = 1) and N_0 are non-spam (y = 0). The maximum likelihood estimates of the parameters are:
P(y = 1) = N_1 / N, the fraction of emails that are spam;
P(x_i = 1 | y = 1) = (# of spam emails in which word i appears) / N_1, i.e., the fraction of spam emails where x_i appeared;
P(x_i = 1 | y = 0) = (# of non-spam emails in which word i appears) / N_0, i.e., the fraction of non-spam emails where x_i appeared.
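A minimal sketch of these estimates in code, assuming the emails have already been encoded as a 0/1 matrix X with labels y (names and toy data are illustrative):

import numpy as np

def mle_bernoulli_nb(X, y):
    """Maximum likelihood estimates for Bernoulli Naive Bayes.
    X : (N, d) array of 0/1 features; y : (N,) array of 0/1 labels."""
    X, y = np.asarray(X), np.asarray(y)
    prior_spam = y.mean()                        # P(y = 1) = N_1 / N
    p_word_given_spam = X[y == 1].mean(axis=0)   # P(x_i = 1 | y = 1)
    p_word_given_ham = X[y == 0].mean(axis=0)    # P(x_i = 1 | y = 0)
    return prior_spam, p_word_given_spam, p_word_given_ham

X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
print(mle_bernoulli_nb(X, y))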

Multinomial Model
Instead of treating the absence/presence of each word in the dictionary as a Bernoulli variable, we can treat each word in the email as the roll of a V-sided die, where V is the total size of the dictionary. Each of the V words has a fixed probability of being selected. This allows us to take the counts of the words into consideration.

MLE for Naïve Bayes with the Multinomial Model
The likelihood of observing one email consisting of words w_1, ..., w_n is P(y) ∏_{j=1}^{n} P(w_j | y). The MLE estimate for the k-th word in the dictionary is P(w_k | y) = (# of occurrences of word k in emails of class y) / (total # of words in emails of class y). Total number of parameters: a distribution over the V words for each class (V − 1 free parameters per class), plus the class prior.
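A sketch of these multinomial MLEs, assuming a word-count matrix counts of shape (N, V); names and toy data are illustrative.

import numpy as np

def mle_multinomial_nb(counts, y):
    """MLE word probabilities for multinomial Naive Bayes.
    counts : (N, V) array, counts[n, k] = # of times word k occurs in email n
    y      : (N,) array of 0/1 labels."""
    counts, y = np.asarray(counts), np.asarray(y)
    theta = {}
    for c in (0, 1):
        word_totals = counts[y == c].sum(axis=0)    # occurrences of each word in class c
        theta[c] = word_totals / word_totals.sum()  # P(w_k | y = c)
    return theta

counts = [[2, 0, 1], [1, 1, 0], [0, 0, 3], [0, 2, 1]]
y = [1, 1, 0, 0]
print(mle_multinomial_nb(counts, y))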

Problem with MLE
Suppose you picked up the new word "Mahalanobis" in your class and started using it in your emails. Because "Mahalanobis" (say it is the (n+1)-th word in the vocabulary) has never appeared in any of the training emails, the probability estimates for this word will be P(x_{n+1} = 1 | y = 1) = 0 and P(x_{n+1} = 1 | y = 0) = 0. Now P(x | y) = ∏_i P(x_i | y) = 0 for both y = 0 and y = 1, so the classifier cannot decide. Given limited training data, MLE can result in probabilities of 0 or 1. Such extreme probabilities are too strong and cause problems.

Bayesian Parameter Estimation
In the Bayesian framework, we consider the parameters we are trying to estimate to be random variables. We give them a prior distribution, which captures our prior belief about the parameters. When the data is sparse, this allows us to fall back on the prior and avoid the issues faced by MLE.

Example: Bernoulli
Given a possibly unfair coin, we want to estimate θ, the probability of heads. We toss the coin N times and observe m heads. The MLE estimate is θ̂ = m / N. Now let's go Bayesian and assume that θ is a random variable with a prior distribution. For reasons that will become clear later, we assume a Beta prior for θ: θ ~ Beta(α, β), with density p(θ; α, β) ∝ θ^(α − 1) (1 − θ)^(β − 1).

Beta Distribution
[Figure: Beta densities for several (α, β) settings: uniform when α = β = 1, U-shaped when α, β < 1, unimodal and symmetric when α = β > 1.]
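To see these shapes numerically, the Beta density can be evaluated directly; a small sketch using scipy, where the particular (α, β) values are just examples:

from scipy.stats import beta

settings = [(1, 1, "uniform"), (0.5, 0.5, "U-shaped"), (3, 3, "unimodal, symmetric")]
for a, b, label in settings:
    # density at a few points in (0, 1); larger values at the edges indicate a U shape
    densities = [round(beta.pdf(x, a, b), 2) for x in (0.1, 0.5, 0.9)]
    print(f"Beta({a}, {b}) {label}: {densities}")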

Posterior Distribution of θ
p(θ | D) ∝ p(D | θ) p(θ) ∝ θ^m (1 − θ)^(N − m) θ^(α − 1) (1 − θ)^(β − 1) = θ^(m + α − 1) (1 − θ)^(N − m + β − 1), so θ | D ~ Beta(m + α, N − m + β). Note that the posterior has exactly the same form as the prior. This is not a coincidence; it is due to the careful selection of the prior distribution (a conjugate prior).
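A small sketch of this conjugate update in code (the prior and data values are assumed examples):

def beta_posterior(alpha, beta, heads, tosses):
    """Beta(alpha, beta) prior plus Bernoulli observations gives a Beta posterior."""
    return alpha + heads, beta + (tosses - heads)

# prior Beta(2, 2), observe 7 heads in 10 tosses -> posterior Beta(9, 5)
print(beta_posterior(2, 2, 7, 10))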

Maximum a Posteriori (MAP)
A full Bayesian treatment does not care about a point estimate of the parameter; instead, it uses the entire posterior distribution to make predictions for the target variable. But if we do need a concrete estimate of the parameter, we can use maximum a posteriori (MAP) estimation, that is, θ_MAP = argmax_θ p(θ | D).

MAP for Bernoulli
The posterior is θ | D ~ Beta(m + α, N − m + β). Mode: the maximum point of a Beta(a, b) distribution is (a − 1) / (a + b − 2). In this case we have θ_MAP = (m + α − 1) / (N + α + β − 2). Letting α = β = 2, we get θ_MAP = (m + 1) / (N + 2). Compared with the MLE m / N, this is like adding some fake coin tosses (one head and one tail) to the observed data; this is also sometimes called Laplace smoothing.

Laplace Smoothing
Suppose we estimate a probability P(z) and we have n_0 examples where z = 0 and n_1 examples where z = 1. The MLE estimate is P(z = 1) = n_1 / (n_0 + n_1). Laplace smoothing: add 1 to the numerator and 2 to the denominator, giving P(z = 1) = (n_1 + 1) / (n_0 + n_1 + 2). If we don't observe any examples, we expect P(z = 1) = 0.5, but our belief is weak (equivalent to seeing one example of each outcome).
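A one-function sketch of this estimator (toy numbers only):

def laplace_estimate(n1, n0):
    """Smoothed estimate of P(z = 1) from n1 positive and n0 negative examples."""
    return (n1 + 1) / (n0 + n1 + 2)

print(laplace_estimate(0, 0))   # 0.5, falling back to the uniform prior when there is no data
print(laplace_estimate(7, 3))   # 0.666..., pulled slightly toward 0.5 compared with the MLE 0.7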

MAP Estimation for Multinomials
The conjugate prior for the multinomial is the Dirichlet distribution. For K outcomes, a Dirichlet distribution has K parameters α_1, ..., α_K, each serving a similar purpose to the Beta distribution's parameters: they act as fake observations for the corresponding outcome, with the count depending on the value of the parameter. Laplace smoothing for this case: P(z = k) = (n_k + 1) / (n + K).
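A sketch of the K-outcome version; setting every Dirichlet parameter to 2 (matching the Beta(2, 2) choice above) recovers the Laplace-smoothed estimate, as the toy example below checks.

import numpy as np

def dirichlet_map(counts, alpha):
    """MAP estimate of a categorical distribution under a Dirichlet prior.
    counts[k] = n_k observed outcomes of type k; alpha[k] = Dirichlet parameter."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(counts))

counts = np.array([5, 3, 0])                 # the third outcome was never observed
print(dirichlet_map(counts, [2, 2, 2]))      # MAP under a Dirichlet(2, 2, 2) prior
print((counts + 1) / (counts.sum() + 3))     # Laplace smoothing (n_k + 1) / (n + K): identical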

MAP for the Naïve Bayes Spam Filter
When estimating P(x_i = 1 | y = 1) and P(x_i = 1 | y = 0):
Bernoulli case: the MLE is P(x_i = 1 | y = 0) = (# of non-spam emails where word i appears) / (# of non-spam emails); the MAP estimate is P(x_i = 1 | y = 0) = (# of non-spam emails where word i appears + 1) / (# of non-spam emails + 2), and similarly for y = 1.
Multinomial case: the MLE is P(w_i | y = 0) = (total # of occurrences of word i in non-spam emails) / (total # of words in non-spam emails); the MAP estimate is P(w_i | y = 0) = (total # of occurrences of word i in non-spam emails + 1) / (total # of words in non-spam emails + V), where V is the size of the vocabulary.
When we encounter a new word that has not appeared in the training set, the probabilities no longer go to zero.
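Putting the smoothed estimates together, here is a sketch of the MAP (Laplace-smoothed) parameter estimates for both variants; the array shapes, names, and toy data are illustrative.

import numpy as np

def map_bernoulli_nb(X, y):
    """Laplace-smoothed P(x_i = 1 | y = c) for a Bernoulli Naive Bayes spam filter.
    X : (N, d) 0/1 matrix; y : (N,) 0/1 labels."""
    X, y = np.asarray(X), np.asarray(y)
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    return params

def map_multinomial_nb(counts, y):
    """Laplace-smoothed P(w_k | y = c) for a multinomial Naive Bayes spam filter.
    counts : (N, V) word-count matrix; y : (N,) 0/1 labels."""
    counts, y = np.asarray(counts), np.asarray(y)
    V = counts.shape[1]
    params = {}
    for c in (0, 1):
        word_totals = counts[y == c].sum(axis=0)
        params[c] = (word_totals + 1) / (word_totals.sum() + V)
    return params

# words never seen in one class now get a small but nonzero probability
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0])
print(map_bernoulli_nb(X, y))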

Naïve Bayes Summary
A generative classifier: learn P(x | y) and P(y), then use Bayes' rule to compute P(y | x) for classification. It assumes conditional independence between features given the class label, which greatly reduces the number of parameters to learn. MAP estimation (or Laplace smoothing) is necessary to avoid overfitting and extreme probability values.