Log-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models




Log-Linear Models (a.k.a. Logistic Regression, Maximum Entropy Models)
Natural Language Processing, CS 6120, Spring 2014, Northeastern University
David Smith (some slides from Jason Eisner and Dan Klein)

Probability is Useful (summary of half of the course: statistics)

We love probability distributions! We've learned how to define & use p(·) functions.

Pick best output text T from a set of candidates
- speech recognition; machine translation; OCR; spell correction...
- maximize p1(T) for some appropriate distribution p1

Pick best annotation T for a fixed input I
- text categorization; parsing; POS tagging; language ID
- maximize p(T | I); equivalently, maximize the joint probability p(I, T)
- often define p(I, T) by a noisy channel: p(I, T) = p(T) * p(I | T)
- speech recognition & the other tasks above are cases of this too: we're maximizing an appropriate p1(T) defined by p(T | I)

Pick best probability distribution (a meta-problem)
- really, pick best parameters θ: train HMM, PCFG, n-grams, clusters
- maximum likelihood; smoothing; EM if unsupervised (incomplete data)
- Bayesian smoothing: max p(θ | data) = max p(θ, data) = p(θ) p(data | θ)
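The noisy-channel recipe above (maximize log p(T) + log p(I | T) over candidates) can be sketched as a tiny decoder. All words and probabilities below are invented illustration values, not output of a real language or channel model:

```python
import math

def best_candidate(candidates, prior, likelihood, observed):
    """argmax over T of log p(T) + log p(observed | T)."""
    return max(candidates,
               key=lambda t: math.log(prior[t]) + math.log(likelihood[(observed, t)]))

# Noisy observed input and two candidate corrections (spell-correction flavor):
observed = "ther"
candidates = ["there", "three"]
prior = {"there": 0.008, "three": 0.002}      # language model p(T), made up
likelihood = {("ther", "there"): 0.05,        # channel model p(I | T), made up
              ("ther", "three"): 0.01}

print(best_candidate(candidates, prior, likelihood, observed))  # -> there
```

Working in log space keeps the products from underflowing, and makes the "nuthin' but adding weights" view below literal: the decoder just sums two weights per candidate.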

Probability is Flexible (summary of the other half of the course: linguistics)

We love probability distributions! We've learned how to define & use p(·) functions. We want p(·) to define the probability of linguistic objects:
- Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
- Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best-paths, forward-backward, collocations)
- Vectors (clusters)

NLP also includes some not-so-probabilistic stuff:
- Syntactic features, morphology. Could be stochasticized?
- Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking

But probabilities have wormed their way into most things, and p(·) has to capture our intuitions about the linguistic data.

An Alternative Tradition (really so alternative?)

Old AI hacking technique: possible parses (or whatever) have scores; pick the one with the best score. How do you define the score?
- Completely ad hoc: throw anything you want into the stew
- Add a bonus for this, a penalty for that, etc.
- "Learns" over time as you adjust bonuses and penalties by hand to improve performance. :)
- Total kludge, but totally flexible too: can throw in any intuitions you might have

[Mock headline on the slide] "Exposé at 9: Probabilistic Revolution Not Really a Revolution, Critics Say. Log-probabilities no more than scores in disguise. 'We're just adding stuff up like the old corrupt regime did,' admits spokesperson."

Nuthin' but adding weights

- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...

Note: today we'll use +logprob, not -logprob; i.e., bigger weights are better.

Can regard any linguistic object as a collection of features (here, a tree = a collection of context-free rules). Weight of the object = total weight of its features. Our weights have always been conditional log-probs (≤ 0), but that is going to change in a few minutes.
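The "object = bag of features, weight = sum of feature weights" view can be made concrete in a few lines. The rules and log-prob values here are illustrative, not from any trained grammar:

```python
from collections import Counter

def weight(features, w):
    """Total weight of an object = sum of the weights of its features."""
    return sum(w[f] * count for f, count in features.items())

# A tiny "tree" represented as a multiset of context-free rules:
tree = Counter({"S -> NP VP": 1, "NP -> Papa": 1, "VP -> VP PP": 1})

# When each feature weight is a conditional log-prob log p(rhs | lhs),
# weight(tree) is exactly the log-probability of the tree under the PCFG.
w = {"S -> NP VP": -0.5, "NP -> Papa": -3.2, "VP -> VP PP": -2.1}

print(weight(tree, w))  # sum of the three log-probs
```

Nothing in `weight` requires the weights to be log-probs; that is precisely the loophole the rest of the lecture exploits.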

"83% of" Probabilists Rally Behind Paradigm

".2, .4, .6, .8! We're not gonna take your bait!"

1. We can estimate our parameters automatically: e.g., log p(t7 | t5, t6) (trigram tag probability) from supervised or unsupervised data.
2. Our results are more meaningful: we can use probabilities to place bets and quantify risk. E.g., how sure are we that this is the correct parse?
3. Our results can be meaningfully combined (modularity!). Multiply independent conditional probs; these are normalized, unlike scores:
   p(English text) * p(English phonemes | English text) * p(Japanese phonemes | English phonemes) * p(Japanese text | Japanese phonemes)
   p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology)

Probabilists Regret Being Bound by Principle

The ad-hoc approach does have one advantage. Consider e.g. Naïve Bayes for text categorization:

  "Buy this supercalifragilistic Ginsu knife set for only $39 today..."

Some useful features:
- Contains "Buy"
- Contains "supercalifragilistic"
- Contains a dollar amount under $100
- Contains an imperative sentence
- Reading level = 8th grade
- Mentions money (use word classes and/or regexps to detect this)

Naïve Bayes: pick C maximizing p(C) * p(feat1 | C) * ...
What assumption does Naïve Bayes make? Is it true here?
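The Naïve Bayes decision rule, "pick C maximizing p(C) * p(feat1 | C) * ...", computed as a sum of logs, might look like this. The feature likelihoods reuse the slide's spam/ling numbers; the class priors are invented for illustration:

```python
import math

def nb_classify(feats, prior, cond):
    """Pick the class C maximizing log p(C) + sum_i log p(feat_i | C)."""
    return max(prior,
               key=lambda c: math.log(prior[c]) +
                             sum(math.log(cond[(f, c)]) for f in feats))

prior = {"spam": 0.6, "ling": 0.4}  # hypothetical class priors
cond = {("dollar_under_100", "spam"): 0.5, ("dollar_under_100", "ling"): 0.02,
        ("mentions_money", "spam"): 0.9, ("mentions_money", "ling"): 0.1}

print(nb_classify(["dollar_under_100", "mentions_money"], prior, cond))  # -> spam
```

The sum over features is where the conditional-independence assumption hides: the code multiplies (adds, in log space) per-feature likelihoods as if the features were independent given the class.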

Focus on two of those features, with their per-class probabilities:

                                         spam    ling
  Contains a dollar amount under $100     .5      .02    (50% of spam has this: 25x more likely than in ling)
  Mentions money                          .9      .1     (90% of spam has this: 9x more likely than in ling)

Naïve Bayes claims .5 * .9 = 45% of spam has both features, i.e. 25 * 9 = 225x more likely than in ling. But here are the emails with both features: only 25x!

Naïve Bayes: pick C maximizing p(C) * p(feat1 | C) * ... What assumption does Naïve Bayes make? Is it true here?
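The 225x-vs-25x gap is easy to reproduce numerically. One scenario consistent with the slide's numbers (an assumption on my part, not stated on the slide) is that every message containing a dollar amount under $100 also mentions money, so the true probability of seeing both features equals the probability of the dollar feature alone:

```python
# Per-class marginal probabilities from the slide:
p_dollar = {"spam": 0.5, "ling": 0.02}
p_money  = {"spam": 0.9, "ling": 0.10}

# Naive Bayes pretends the features are independent given the class:
nb_both = {c: p_dollar[c] * p_money[c] for c in ("spam", "ling")}
print(nb_both["spam"])                        # 0.45: "45% of spam has both"
print(nb_both["spam"] / nb_both["ling"])      # roughly 225x more likely in spam

# True joint under the dollar-implies-money scenario: seeing both features
# is the same event as seeing the dollar feature.
true_both = dict(p_dollar)
print(true_both["spam"] / true_both["ling"])  # only about 25x
```

Under this scenario Naïve Bayes effectively counts the "money" evidence twice, which is exactly the double-counting the slide is complaining about.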

But the ad-hoc approach can adjust scores to compensate for feature overlap. Some useful features of this message:

                                         prob            log prob        adjusted
                                         spam   ling     spam    ling    spam    ling
  Contains a dollar amount under $100     .5     .02     -1      -5.6    -.85    -2.3
  Mentions money                          .9     .1      -.15    -3.3    -.15    -3.3

Naïve Bayes: pick C maximizing p(C) * p(feat1 | C) * ... What assumption does Naïve Bayes make? Is it true here?
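The "log prob" and "adjusted" columns are consistent with base-2 logarithms, and with the same dollar-implies-money reading as above (both are my inferences from the numbers, not stated on the slide): the unadjusted weights sum to the log of the Naïve Bayes joint, while the adjusted weights sum to the log of the true joint. A quick sanity check:

```python
import math

log2 = math.log2

# "log prob" column matches base-2 logs of the probabilities (to slide precision):
assert abs(log2(0.5)  - (-1))    < 0.05   # dollar amount, spam
assert abs(log2(0.02) - (-5.6))  < 0.05   # dollar amount, ling
assert abs(log2(0.9)  - (-0.15)) < 0.05   # mentions money, spam
assert abs(log2(0.1)  - (-3.3))  < 0.05   # mentions money, ling

# "adjusted" column: the two weights now sum to the TRUE joint log-prob
# (.5 for spam, .02 for ling, if the dollar feature implies the money feature).
assert abs((-0.85 + -0.15) - log2(0.5))  < 0.05
assert abs((-2.3  + -3.3)  - log2(0.02)) < 0.05
print("adjusted weights reproduce the true joint log-probs")
```

So hand-adjusting scores is implicitly repairing the broken independence assumption; the rest of the lecture shows how to do this repair by fitting the weights automatically.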

Revolution Corrupted by Bourgeois Values

- Naïve Bayes needs overlapping but independent features.
- But it's not clear how to restructure features like these so that they are independent:
    Contains "Buy"                                                      +4
    Contains "supercalifragilistic"                                     +0.2
    Contains a dollar amount under $100                                 +1
    Contains an imperative sentence                                     +2
    Reading level = 7th grade                                           -3
    Mentions money (use word classes and/or regexps to detect this)     +5
                                                                  total: 5.77
- Boy, we'd like to be able to throw all that useful stuff in without worrying about feature overlap/independence.
- Well, maybe we can add up scores and pretend we got a log probability: log p(feats | spam) = 5.77.
- Oops: then p(feats | spam) = exp 5.77 = 320.5.

Renormalize by 1/Z to get a Log-Linear Model

p(feats | spam) = exp 5.77 = 320.5, so scale down so that everything is < 1 and sums to 1:

  p(m | spam) = (1/Z(λ)) exp Σ_i λ_i f_i(m)

where:
- m is the email message
- λ_i is the weight of feature i
- f_i(m) ∈ {0,1} according to whether m has feature i (more generally, allow f_i(m) = count or strength of the feature)
- 1/Z(λ) is a normalizing factor making Σ_m p(m | spam) = 1 (summed over all possible messages m — hard to find!)

The weights we add up are basically arbitrary. They don't have to mean anything, so long as they give us a good probability. Why is it called "log-linear"?
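The unnormalized score is just a weighted sum. A minimal sketch using the hypothetical weights from the previous slide (the feature vector for the message is made up here):

```python
import math

weights = [4.0, 0.2, 1.0, 2.0, -3.0, 5.0]   # lambda_i, from the slide
f = [1, 0, 1, 0, 1, 0]                      # f_i(m) for one hypothetical message m

score = sum(lam * fi for lam, fi in zip(weights, f))   # sum_i lambda_i f_i(m)
unnormalized = math.exp(score)   # the exp of a score need not be < 1,
                                 # hence the 1/Z(lambda) factor
print(score)          # 2.0
print(unnormalized)   # ~7.39
```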

Why Bother?

- Gives us probabilities, not just scores. We can use them to bet, or combine them with other probabilities.
- We can now learn weights from data: choose the weights λ_i that maximize the log-probability of the labeled training data, = log Π_j p(c_j) p(m_j | c_j), where c_j ∈ {ling, spam} is the classification of message m_j and p(m_j | c_j) is the log-linear model from the previous slide.
- This is a convex function, hence easy to maximize. (Why?)
- But: computing p(m_j | c_j) for a given λ requires Z(λ): hard!

Attempt to Cancel out Z

- Set weights to maximize Π_j p(c_j) p(m_j | c_j), where p(m | spam) = (1/Z(λ)) exp Σ_i λ_i f_i(m).
- But the normalizer Z(λ) is an awful sum over all possible emails.
- So instead, maximize Π_j p(c_j | m_j):
  - Doesn't model the emails m_j, only their classifications c_j.
  - Makes more sense anyway given our feature set.
- p(spam | m) = p(spam) p(m | spam) / (p(spam) p(m | spam) + p(ling) p(m | ling))
- Z appears in both numerator and denominator. Alas, it doesn't cancel out, because Z differs for the spam and ling models.
- But we can fix this ...

So: Modify Setup a Bit

- Instead of having separate models p(m | spam) * p(spam) vs. p(m | ling) * p(ling), have just one joint model p(m,c) that gives us both p(m, spam) and p(m, ling).
- Equivalent to changing the feature set to:
    spam                                        ← weight of this feature is log p(spam) + a constant
    spam and Contains "Buy"                     ← old spam model's weight for "contains Buy"
    spam and Contains "supercalifragilistic"
    ling                                        ← weight of this feature is log p(ling) + a constant
    ling and Contains "Buy"                     ← old ling model's weight for "contains Buy"
    ling and Contains "supercalifragilistic"
- No real change, but the 2 categories now share a single feature set and a single value of Z(λ).

Now We Can Cancel out Z

- Now p(m,c) = (1/Z(λ)) exp Σ_i λ_i f_i(m,c), where c ∈ {ling, spam}.
- Old: choose weights λ_i that maximize the probability of the labeled training data = Π_j p(m_j, c_j).
- New: choose weights λ_i that maximize the probability of the labels given the messages = Π_j p(c_j | m_j).
- Now Z cancels out of the conditional probability:
    p(spam | m) = p(m, spam) / (p(m, spam) + p(m, ling))
                = exp Σ_i λ_i f_i(m, spam) / (exp Σ_i λ_i f_i(m, spam) + exp Σ_i λ_i f_i(m, ling))
- Easy to compute now.
- Π_j p(c_j | m_j) is still convex, so it's easy to maximize too.
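The canceled-Z computation is just a softmax over per-class scores. A minimal sketch (the per-class scores for the message are hypothetical):

```python
import math

def p_class_given_message(scores):
    """scores[c] = sum_i lambda_i * f_i(m, c); returns p(c | m)."""
    exps = {c: math.exp(s) for c, s in scores.items()}
    z = sum(exps.values())   # per-message normalizer; the global Z(lambda) canceled
    return {c: e / z for c, e in exps.items()}

scores = {"spam": 2.3, "ling": 0.5}   # hypothetical scores for one message
probs = p_class_given_message(scores)
print(probs["spam"])   # ~0.858
```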

Generative vs. Conditional

Which of these questions can each kind of model answer?
- What is the most likely label for a given input?
- How likely is a given label for a given input?
- What is the most likely input value?
- How likely is a given input value?
- How likely is a given input value with a given label?
- What is the most likely label for an input that might have one of two values (but we don't know which)?

Maximum Entropy

- Suppose there are 10 classes, A through J. I don't give you any other information.
- Question: given message m, what is your guess for p(c | m)?
- Suppose I tell you that 55% of all messages are in class A.
- Question: now what is your guess for p(c | m)?
- Suppose I also tell you that 10% of all messages contain "Buy", and 80% of these are in class A or C.
- Question: now what is your guess for p(c | m), if m contains "Buy"?
- OUCH!

Maximum Entropy

           A      B       C       D       E       F       G       H       I       J
  Buy    0.051  0.0025  0.029   0.0025  0.0025  0.0025  0.0025  0.0025  0.0025  0.0025
  Other  0.499  0.0446  0.0446  0.0446  0.0446  0.0446  0.0446  0.0446  0.0446  0.0446

- Column A sums to 0.55 ("55% of all messages are in class A").
- Row "Buy" sums to 0.1 ("10% of all messages contain Buy").
- The (Buy, A) and (Buy, C) cells sum to 0.08 ("80% of the 10%").
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy (related to cross-entropy, perplexity).
- Entropy = -.051 log .051 - .0025 log .0025 - .029 log .029 - ...
- Entropy is largest when the probabilities are evenly distributed.

Maximum Entropy

Given the same table and constraints, maximizing the entropy yields p(Buy, C) = .029, and hence p(C | Buy) = .29. We got a compromise: p(C | Buy) < p(A | Buy) < .55.
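We can verify that the table's values really satisfy all three constraints (cell values copied from the slide):

```python
# The max-entropy table: p(row, class) for rows {Buy, Other} and classes A..J.
classes = list("ABCDEFGHIJ")
buy   = dict(zip(classes, [0.051, 0.0025, 0.029] + [0.0025] * 7))
other = dict(zip(classes, [0.499] + [0.0446] * 9))

col_A      = buy["A"] + other["A"]   # constraint 1: 55% of messages in class A
row_buy    = sum(buy.values())       # constraint 2: 10% of messages contain "Buy"
buy_A_or_C = buy["A"] + buy["C"]     # constraint 3: 80% of those are in A or C

p_C_given_buy = buy["C"] / row_buy   # the slide's p(C | Buy)
print(round(col_A, 3), round(row_buy, 3), round(buy_A_or_C, 3), round(p_C_given_buy, 3))
```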

Generalizing to More Features

Now split the table by a second feature as well, "contains a dollar amount under $100" (<$100 vs. Other), in addition to "Buy" vs. Other:

           A      B       C       D       E       F       G       H
  Buy    0.051  0.0025  0.029   0.0025  0.0025  0.0025  0.0025  0.0025
  Other  0.499  0.0446  0.0446  0.0446  0.0446  0.0446  0.0446  0.0446

What We Just Did

- For each feature ("contains Buy"), see what fraction of the training data has it.
- Many distributions p(c,m) would predict these fractions (including the unsmoothed one, where all mass goes to the feature combinations we've actually seen).
- Of these, pick the distribution that has maximum entropy.
- Amazing Theorem: this distribution has the form p(m,c) = (1/Z(λ)) exp Σ_i λ_i f_i(m,c), so it is log-linear. In fact, it is the same log-linear distribution that maximizes Π_j p(m_j, c_j) as before!
- This gives another motivation for the log-linear approach.

Log-Linear Form: Derivation

Say we are given some constraints in the form of feature expectations: Σ_x p(x) f_i(x) = b_i for each feature i. In general, there may be many distributions p(x) that satisfy the constraints. Which one to pick? The one with maximum entropy, making the fewest possible additional assumptions (Occam's Razor). This yields an optimization problem.
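The optimization problem can be solved with Lagrange multipliers; a standard sketch of the derivation (notation mine) shows where the log-linear form comes from:

```latex
\max_p \; H(p) = -\sum_x p(x)\log p(x)
\quad \text{s.t.} \quad \sum_x p(x)\,f_i(x) = b_i \;\;(\forall i),
\qquad \sum_x p(x) = 1.

% Lagrangian, with multipliers \lambda_i for the feature constraints
% and \mu for normalization:
\mathcal{L} = -\sum_x p(x)\log p(x)
  + \sum_i \lambda_i \Big(\sum_x p(x)f_i(x) - b_i\Big)
  + \mu \Big(\sum_x p(x) - 1\Big).

% Setting \partial \mathcal{L} / \partial p(x) = 0:
-\log p(x) - 1 + \sum_i \lambda_i f_i(x) + \mu = 0
\;\;\Longrightarrow\;\;
p(x) = \frac{1}{Z(\lambda)} \exp\Big(\sum_i \lambda_i f_i(x)\Big),
\qquad
Z(\lambda) = \sum_x \exp\Big(\sum_i \lambda_i f_i(x)\Big).
```

The normalizer Z(λ) absorbs the constant exp(μ − 1), which is fixed by the requirement that p sum to 1.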

MaxEnt = Max Likelihood

The maximum-entropy distribution subject to these feature-expectation constraints is exactly the log-linear distribution that maximizes the likelihood of the training data.

The objective is maximized numerically, by gradient ascent or conjugate gradient.

Overfitting

- If we have too many features, we can choose weights to model the training data perfectly.
- If we have a feature that appears only in spam training, never in ling training, it will get as large a weight as possible, pushing p(spam | feature) to 1.
- These behaviors overfit the training data, and will probably do poorly on test data.

Solutions to Overfitting

1. Throw out rare features. Require every feature to occur > 4 times, and to occur > 0 times with ling and > 0 times with spam.
2. Only keep 1000 features. Add one at a time, always greedily picking the one that most improves performance on held-out data.
3. Smooth the observed feature counts.
4. Smooth the weights by using a prior: max p(λ | data) = max p(λ, data) = p(λ) p(data | λ); decree p(λ) to be high when most weights are close to 0.
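Solution 4 in code form: with a zero-mean Gaussian prior, log p(λ) adds (up to a constant) an L2 penalty to the objective. A toy sketch, where sigma and the example numbers are made up:

```python
def penalized_objective(log_likelihood, weights, sigma=1.0):
    """log p(data | lambda) + log p(lambda), Gaussian prior, constants dropped."""
    log_prior = -sum(w * w for w in weights) / (2 * sigma ** 2)
    return log_likelihood + log_prior

# Two hypothetical weight settings with equal data likelihood:
small_weights = penalized_objective(-10.0, [0.1, -0.2])
large_weights = penalized_objective(-10.0, [5.0, -8.0])
print(small_weights > large_weights)  # True: the prior prefers weights near 0
```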

Recipe for a Conditional MaxEnt Classifier

1. Gather constraints from the training data: the observed count of each feature.
2. Initialize all parameters to zero.
3. Classify the training data with the current parameters; calculate the expected count of each feature under the model.
4. The gradient is the observed feature counts minus the expected feature counts.
5. Take a step in the direction of the gradient.
6. Until convergence, return to step 3.
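A minimal end-to-end sketch of the recipe. The toy dataset, the word-indicator feature template, the step size, and the iteration count are all made up for illustration:

```python
import math

CLASSES = ["spam", "ling"]
train = [({"buy", "money"}, "spam"), ({"grammar", "syntax"}, "ling"),
         ({"buy", "cheap"}, "spam"), ({"money", "grammar"}, "ling")]

def feats(words, c):
    # Joint features as on the earlier slide: "class is c AND message contains w".
    return {(c, w): 1.0 for w in words}

def predict(weights, words):
    # p(c | m): Z(lambda) has canceled, leaving a softmax over per-class scores.
    scores = {c: sum(weights.get(k, 0.0) for k in feats(words, c)) for c in CLASSES}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(scores[c]) / z for c in CLASSES}

# Step 1: gather constraints -- observed feature counts on the training data.
observed = {}
for words, c in train:
    for k, v in feats(words, c).items():
        observed[k] = observed.get(k, 0.0) + v

weights = {}  # Step 2: all parameters start at zero.

for _ in range(200):  # Steps 3-6: iterate to (approximate) convergence.
    expected = {}     # Step 3: expected feature counts under the current model.
    for words, _ in train:
        probs = predict(weights, words)
        for c in CLASSES:
            for k, v in feats(words, c).items():
                expected[k] = expected.get(k, 0.0) + probs[c] * v
    for k in observed.keys() | expected.keys():
        grad = observed.get(k, 0.0) - expected.get(k, 0.0)  # Step 4: obs - exp
        weights[k] = weights.get(k, 0.0) + 0.1 * grad       # Step 5: gradient step

print(predict(weights, {"buy", "cheap"})["spam"])  # high, close to 1
```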