Natural Language Processing, Lecture 13, 10/6/2015
Jim Martin

Multinomial Logistic Regression

Today

Multinomial logistic regression, a.k.a. log-linear models or maximum entropy (maxent) models:
- Components of the model
- Learning the parameters

Logistic Regression Models

- Estimate P(c|d) directly (a discriminative model)
- Features
- A model with weights on the features
- Classification

Features

The kind of features used in NLP-oriented machine learning systems typically involve:
- Binary values: think of a feature as being on or off rather than as having a value.
- Values that are relative to an object/class pair, rather than being a function of the object alone.
- Lots and lots of features (hundreds of thousands of features isn't unusual).

POS Features

(Slide: example part-of-speech tagging features; not transcribed.)

Sentiment Features

(Slide: example sentiment classification features; not transcribed.)
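To make the object/class pairing concrete, here is a minimal sketch of binary, class-relative features. The words, classes, and feature definitions are hypothetical, not from the lecture slides:

```python
# A minimal sketch of binary, class-relative features (the words, classes,
# and feature definitions here are hypothetical, not from the lecture).
# Each feature is on (1) or off (0) for a particular (class, document) pair,
# not for the document alone.

def f1(c, words):
    # fires only when the candidate class is '+' AND the document says "great"
    return 1 if c == "+" and "great" in words else 0

def f2(c, words):
    # fires only when the candidate class is '-' AND the document says "awful"
    return 1 if c == "-" and "awful" in words else 0

doc = "the acting was great but the plot was awful".split()
for c in ("+", "-"):
    print(c, [f(c, doc) for f in (f1, f2)])
```

Note that the same document turns on different features depending on which class it is paired with; that pairing is what makes the features usable in a single multiclass model.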

Sentiment Features w/ Weights

(Slide: the sentiment features above, with example weights 1.9, .9, .7, and -.8.)

Logistic Regression Model
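The model equations on these slides are images in the transcript; the standard multinomial logistic regression (maxent) form they correspond to is:

```latex
P(c \mid d) \;=\;
  \frac{\exp\!\Big(\sum_{i=1}^{N} w_i \, f_i(c,d)\Big)}
       {\sum_{c' \in C} \exp\!\Big(\sum_{i=1}^{N} w_i \, f_i(c',d)\Big)}
```

Each feature f_i(c, d) pairs a property of the document with a candidate class, and the denominator normalizes over all classes so the result is a proper distribution.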

Argmax

If we didn't really care about the probabilities, then classification reduces to just a comparison of the sums of the weights for each class. With the example weights above:

1.9 > .9 + .7 + (-.8)
1.9 > .8

(A runnable version of this comparison is sketched below.)

Model

So, back to P(C|D). Given:
- a training set of documents (D),
- an associated set of labels (C),
- a finite set of features of the documents and classes, and
- a weight vector over the features (w),

this produces a proper probability distribution over the classes given a document.
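A minimal sketch of that reduction, using the numbers from the slide (the pairing of weights to classes is an assumption for illustration):

```python
import math

# Weights from the slide: 1.9 for the feature active with one class, and
# .9 + .7 + (-.8) for the features active with the other (which weights
# belong to which class is assumed here, not stated in the transcript).
scores = {"+": 1.9, "-": 0.9 + 0.7 - 0.8}

# If we don't care about probabilities, classification is just the argmax
# over the summed weights: 1.9 > .8, so '+' wins.
print(max(scores, key=scores.get))

# The full model exponentiates and normalizes the same sums, yielding a
# proper probability distribution over the classes.
z = sum(math.exp(s) for s in scores.values())
print({c: math.exp(s) / z for c, s in scores.items()})
```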

Learning the Weights

We have a training set of labels and documents (y, x). We're going to use a maximum likelihood approach: choose the parameters (weights) that maximize the probability of the labels given the observations (the features derived from the documents). And we'll do that in log space, so we maximize the log probability of the labels given the data.

So choose the weights that maximize that log probability; the objective from the slides is shown below.
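In standard notation (the slide equations themselves are images in this transcript), that conditional maximum-likelihood objective is:

```latex
\hat{w} \;=\; \arg\max_{w} \; \sum_{j} \log P\big(y^{(j)} \mid x^{(j)}\big)
```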

Fortunately, maximizing that objective corresponds neatly to a convex optimization problem, for which there are many possible approaches.

Convex Problems

What does convex/concave mean, and why does it matter? A convex (for minimization; concave for maximization) objective has no local optima apart from the global one, so simple gradient methods cannot get stuck at a bad solution.

Derivatives

(Slide: the derivative of the objective with respect to each weight; reconstructed below.)
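For this model, the partial derivative with respect to each weight has the standard observed-minus-expected form: the count of the feature in the training data minus the count the model expects:

```latex
\frac{\partial}{\partial w_i} \sum_{j} \log P\big(y^{(j)} \mid x^{(j)}\big)
  \;=\; \sum_{j} \Big( f_i\big(y^{(j)}, x^{(j)}\big)
        \;-\; \sum_{c \in C} P\big(c \mid x^{(j)}\big)\, f_i\big(c, x^{(j)}\big) \Big)
```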

Gradient Ascent

- Initialize the weights to zero: w = 0.
- Loop until convergence:
  - Calculate the derivative δ_k for each of the k features.
  - Set w_k = w_k + β·δ_k for each feature k.

(A runnable sketch of this loop follows below.)

Optimization

In practice, plain gradient ascent can be slow to converge, because the algorithm can take steps that are either too small, and hence take us too long to get where we're going, or too large, which leads us to overshoot the target and wander around too much. Fortunately, you don't have to worry about this: lots of packages are available where you only need to specify L and L′ (the objective and its derivative) and you're done, e.g. L-BFGS.

Overfitting

A problem with the approach as we've described it is that it is overly eager to match the training set as closely as it can. With thousands or millions of features, this can lead to poor performance on new data. With logistic models, overfitting usually manifests itself as extremely large weights on a limited set of features.
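A minimal runnable sketch of that loop, using the gradient above. The toy dataset, feature encoding, step size β, and iteration count are assumptions for illustration, not from the lecture:

```python
# Gradient ascent for multinomial logistic regression (a sketch under
# assumed data and hyperparameters, not the lecture's own code).
import numpy as np

def feats(c, x, n_classes, n_feats):
    """Features for a (class, input) pair: the input vector placed in the
    block of positions belonging to class c (one common encoding)."""
    v = np.zeros(n_classes * n_feats)
    v[c * n_feats:(c + 1) * n_feats] = x
    return v

def train(X, y, n_classes, beta=0.1, iters=500):
    n_feats = X.shape[1]
    w = np.zeros(n_classes * n_feats)          # initialize weights to zero
    for _ in range(iters):                     # loop (fixed budget here)
        grad = np.zeros_like(w)
        for x, c in zip(X, y):
            fs = np.array([feats(k, x, n_classes, n_feats)
                           for k in range(n_classes)])
            scores = fs @ w
            p = np.exp(scores - scores.max())
            p /= p.sum()                       # P(c | x) for each class
            # observed feature counts minus expected counts (the gradient)
            grad += fs[c] - p @ fs
        w += beta * grad                       # take a step uphill
    return w

# Toy usage: two classes separable on two features.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(train(X, y, n_classes=2).round(2))
```

Note that this fixed-iteration loop stands in for a real convergence test, which is exactly the step-size bookkeeping that packaged optimizers such as L-BFGS handle for you.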

Regularization

The solution is to add a penalty term to the objective function. The job of the penalty term is to squash the weights of features that are getting out of control. The penalty is scaled by a parameter greater than 0 that is learned from held-out dev data.

Regularization and Learning

With regularization, the parameter learning has to balance finding models that match the training data well against keeping the weights small.
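The penalized objective on the slide is an image in the transcript; with an L2 penalty and writing the tuning parameter as α (both assumptions about the slide's notation), it takes the standard form:

```latex
\hat{w} \;=\; \arg\max_{w} \;
  \sum_{j} \log P\big(y^{(j)} \mid x^{(j)}\big) \;-\; \alpha \sum_{i=1}^{N} w_i^2
```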

Review

- Finite-state methods
- Practical issues in segmentation and tokenization
- N-gram language models
- HMMs
- HMMs applied to POS tagging
- Logistic regression and text classification