Machine Learning for NLP

Machine Learning for NLP: Perceptron Learning
Joakim Nivre, Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research

Introduction: Linear Classifiers
Classifiers covered so far: decision trees, nearest neighbor.
Next three lectures: linear classifiers.
Statistics from Google Scholar (October 2009):
Maximum Entropy & NLP: 2660 hits, 141 before 2000
SVM & NLP: 2210 hits, 16 before 2000
Perceptron & NLP: 947 hits, 118 before 2000
All are linear classifiers and basic tools in NLP.
They are also the bridge to deep learning techniques.

Introduction: Outline
Today: preliminaries (input/output, features, etc.), the perceptron, Assignment 2.
Next time: large-margin classifiers (SVMs, MIRA), logistic regression (maximum entropy).
Final session: Naive Bayes classifiers, generative and discriminative models.

Preliminaries: Inputs and Outputs
Input: x ∈ X, e.g., a document or sentence with some words x = w_1 ... w_n, or a series of previous actions.
Output: y ∈ Y, e.g., a parse tree, document class, part-of-speech tags, or word sense.
Input/output pair: (x, y) ∈ X × Y, e.g., a document x and its label y.
Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x.

Preliminaries: Feature Representations
We assume a mapping from input/output pairs (x, y) to a high-dimensional feature vector f(x, y) : X × Y → R^m.
In some cases, e.g., binary classification with Y = {-1, +1}, we can map only from the input to the feature space: f(x) : X → R^m.
However, most problems in NLP require more than two classes, so we focus on the multi-class case.
For any vector v ∈ R^m, let v_j be the j-th value.

Preliminaries: Features and Classes
All features must be numerical:
Numerical features are represented directly as f_i(x, y) ∈ R.
Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}.
Multinomial (categorical) features must be binarized:
Instead of f_i(x, y) ∈ {v_0, ..., v_p},
we have f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1},
such that f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j.
We also need distinct features for distinct output classes:
Instead of f_i(x) (1 ≤ i ≤ m),
we have f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N},
such that f_{i+jm}(x, y) = f_i(x) iff y = y_j.
(A small sketch of the binarization step follows below.)
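As an illustration of the binarization step, here is a minimal Python sketch (not part of the course material; the function name and the noun/verb/adj value set are made up for the example):

```python
# Minimal sketch (not course material): one-hot "binarization" of a
# multinomial feature with value set {v_0, ..., v_p}.

def binarize(value, value_set):
    """Replace one categorical feature by p+1 binary features,
    where feature j is 1 iff the original value equals v_j."""
    return [1 if value == v else 0 for v in value_set]

# Hypothetical categorical feature with values noun/verb/adj:
print(binarize("verb", ["noun", "verb", "adj"]))  # [0, 1, 0]
```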

Preliminaries: Examples
x is a document and y is a label:
f_j(x, y) = 1 if x contains the word "interest" and y = financial (0 otherwise)
f_j(x, y) = % of words in x with punctuation, if y = scientific (0 otherwise)
x is a word and y is a part-of-speech tag:
f_j(x, y) = 1 if x = "bank" and y = Verb (0 otherwise)

Preliminaries: Examples
x is a name and y is a label classifying the name (each feature is 1 in the stated case, 0 otherwise):
f_0(x, y) = 1 if x contains "George" and y = Person
f_1(x, y) = 1 if x contains "Washington" and y = Person
f_2(x, y) = 1 if x contains "Bridge" and y = Person
f_3(x, y) = 1 if x contains "General" and y = Person
f_4(x, y) = 1 if x contains "George" and y = Object
f_5(x, y) = 1 if x contains "Washington" and y = Object
f_6(x, y) = 1 if x contains "Bridge" and y = Object
f_7(x, y) = 1 if x contains "General" and y = Object
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
x = "George Washington Bridge", y = Object: f(x, y) = [0 0 0 0 1 1 1 0]
x = "George Washington George", y = Object: f(x, y) = [0 0 0 0 1 1 0 0]

Preliminaries: Block Feature Vectors
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
x = "George Washington Bridge", y = Object: f(x, y) = [0 0 0 0 1 1 1 0]
x = "George Washington George", y = Object: f(x, y) = [0 0 0 0 1 1 0 0]
There is one equal-size block of the feature vector for each label.
The input features are duplicated in each block.
Non-zero values are allowed only in one block.
(A small sketch of how such block vectors can be built follows below.)
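The block structure can be made concrete with a small Python sketch (not the course starter code; the word list, label names, and helper functions are assumptions introduced for this running example):

```python
# Minimal sketch (not course starter code): block feature vectors for the
# name-classification example, with one block per label.

WORDS = ["George", "Washington", "Bridge", "General"]   # input features f(x)
LABELS = ["Person", "Object"]                           # output classes

def input_features(x):
    """f(x): one binary feature per word in WORDS."""
    tokens = x.split()
    return [1 if w in tokens else 0 for w in WORDS]

def joint_features(x, y):
    """f(x, y): copy f(x) into the block for label y; all other blocks stay zero."""
    f_x = input_features(x)
    f_xy = [0] * (len(WORDS) * len(LABELS))
    offset = LABELS.index(y) * len(WORDS)
    f_xy[offset:offset + len(WORDS)] = f_x
    return f_xy

print(joint_features("General George Washington", "Person"))  # [1, 1, 0, 1, 0, 0, 0, 0]
print(joint_features("George Washington Bridge", "Object"))   # [0, 0, 0, 0, 1, 1, 1, 0]
```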

Linear Classifiers
Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights.
Let w ∈ R^m be a high-dimensional weight vector.
If we assume that w is known, then we define our classifier as follows.
Multiclass classification, Y = {0, 1, ..., N}:
y = argmax_y w · f(x, y) = argmax_y Σ_j w_j f_j(x, y)
Binary classification is just a special case of multiclass.
(A sketch of this argmax classifier follows below.)
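A minimal prediction sketch, reusing joint_features and LABELS from the block-vector sketch above; the weight values are hypothetical:

```python
# Minimal sketch: multiclass prediction y* = argmax_y w · f(x, y),
# reusing joint_features and LABELS from the previous sketch.

def dot(w, f):
    """Dot product of a weight vector and a feature vector."""
    return sum(wj * fj for wj, fj in zip(w, f))

def predict(w, x, labels, features):
    """Return the label whose score w · f(x, y) is highest."""
    return max(labels, key=lambda y: dot(w, features(x, y)))

# Hypothetical weights: Person block followed by Object block.
w = [0.5, 0.2, -1.0, 1.0,   0.1, 0.1, 2.0, -0.5]
print(predict(w, "George Washington Bridge", LABELS, joint_features))  # Object
```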

Linear Classifiers: Bias Terms
Linear classifiers are often presented as y = argmax_y (Σ_j w_j f_j(x, y) + b_y), where b_y is a bias or offset term for class y.
But this can be folded into f:
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 1 0 0 0 0 0]
x = "General George Washington", y = Object: f(x, y) = [0 0 0 0 0 1 1 0 1 1]
f_4(x, y) = 1 if y = Person (0 otherwise)
f_9(x, y) = 1 if y = Object (0 otherwise)
w_4 and w_9 are now the bias terms for the labels.

Binary Linear Classifier
A binary linear classifier divides all points into two regions (figure).

Multiclass Linear Classifier
Defines regions of space (figure): e.g., + marks all points x for which + = argmax_y w · f(x, y).

Separability
A set of points is separable if there exists a w such that classification is perfect (figure panels: "Separable" vs. "Not Separable").
This can also be defined mathematically (and we will shortly).

Supervised Learning: How to Find w
Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}.
Input: feature representation f.
Output: w that maximizes/minimizes some important function on the training set:
minimize error (perceptron, SVMs, boosting)
maximize likelihood of the data (logistic regression, Naive Bayes)
Assumption: the training data is separable.
This is not necessary, it just makes life easier; there is a lot of good work in machine learning on the non-separable case.

Perceptron
Choose a w that minimizes error:
w = argmin_w Σ_t (1 - 1[y_t = argmax_y w · f(x_t, y)])
where 1[p] = 1 if p is true, and 0 otherwise.
This is a 0-1 loss function.
Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions.
(A small sketch of computing this 0-1 error follows below.)
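The 0-1 training error can be computed directly; a minimal sketch reusing predict, joint_features, LABELS, and the hypothetical weights w from the earlier sketches (the tiny data set is made up):

```python
# Minimal sketch: 0-1 training error, i.e., the number of training examples
# whose predicted label differs from the gold label.

def zero_one_error(w, data, labels, features):
    return sum(1 for x, y in data if predict(w, x, labels, features) != y)

data = [("General George Washington", "Person"),
        ("George Washington Bridge", "Object")]
print(zero_one_error(w, data, LABELS, joint_features))  # 0 with the weights above
```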

Perceptron Learning Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) - f(x_t, y')
7.       i = i + 1
8. return w^(i)
(A runnable Python version of this loop is sketched below.)
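A runnable Python version of the loop, reusing dot, joint_features, and LABELS from the earlier sketches; this is a minimal sketch under those assumptions, not the course starter code:

```python
# Minimal sketch of the perceptron training loop (step numbers refer to the
# pseudocode above).

def train_perceptron(data, labels, features, dim, epochs=10):
    """data: list of (x, y) pairs; features(x, y) returns a vector of length dim."""
    w = [0.0] * dim                                   # 1. w = 0
    for _ in range(epochs):                           # 2. for n : 1..N
        for x, y_gold in data:                        # 3. for t : 1..|T|
            y_pred = max(labels, key=lambda y: dot(w, features(x, y)))  # 4.
            if y_pred != y_gold:                      # 5. if y' != y_t
                f_gold = features(x, y_gold)
                f_pred = features(x, y_pred)
                w = [wj + gj - pj                     # 6. w += f(x, y_t) - f(x, y')
                     for wj, gj, pj in zip(w, f_gold, f_pred)]
    return w                                          # 8. return w

data = [("General George Washington", "Person"),
        ("George Washington Bridge", "Object")]
w = train_perceptron(data, LABELS, joint_features, dim=8)
```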

Perceptron: Separability and Margin
Given a training instance (x_t, y_t), define Ȳ_t = Y - {y_t}, i.e., Ȳ_t is the set of incorrect labels for x_t.
A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that u · f(x_t, y_t) - u · f(x_t, y') ≥ γ for all y' ∈ Ȳ_t, where ||u|| = sqrt(Σ_j u_j²) (the Euclidean or L2 norm).
Assumption: the training set is separable with margin γ.

Perceptron: Main Theorem
Theorem: for any training set separable with a margin of γ, the following holds for the perceptron algorithm:
number of mistakes made during training ≤ R²/γ²
where R ≥ ||f(x_t, y_t) - f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t.
Thus, after a finite number of training iterations, the error on the training set will converge to zero.
For a proof, see Collins (2002).

Practical Considerations
The perceptron is sensitive to the order of the training examples; consider a training set of 500 positive instances and 500 negative instances.
Shuffling: randomly permute the training instances between iterations.
Voting and averaging: let w_1, ..., w_n be all the weight vectors seen during training.
The voted perceptron predicts the majority vote of w_1, ..., w_n.
The averaged perceptron predicts using the average vector w̄ = (1/n) Σ_i w_i.
(A sketch of the averaged perceptron, with shuffling, is given below.)
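A minimal sketch of the averaged variant, again reusing dot, joint_features, LABELS, and data from the earlier sketches; it accumulates the weight vector after every instance, whether or not an update was made:

```python
# Minimal sketch of the averaged perceptron: keep a running sum of every
# weight vector seen during training and return the average.

import random

def train_averaged_perceptron(data, labels, features, dim, epochs=10, shuffle=True):
    data = list(data)            # copy so shuffling does not affect the caller
    w = [0.0] * dim
    total = [0.0] * dim          # running sum of all weight vectors seen
    n = 0
    for _ in range(epochs):
        if shuffle:
            random.shuffle(data)     # Improvement 1: permute instances between iterations
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: dot(w, features(x, y)))
            if y_pred != y_gold:
                f_gold, f_pred = features(x, y_gold), features(x, y_pred)
                w = [wj + gj - pj for wj, gj, pj in zip(w, f_gold, f_pred)]
            total = [tj + wj for tj, wj in zip(total, w)]   # Improvement 2: averaging
            n += 1
    return [tj / n for tj in total]  # the averaged weight vector

w_avg = train_averaged_perceptron(data, LABELS, joint_features, dim=8)
```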

Perceptron Summary
Learns a linear classifier that minimizes error.
Guaranteed to find a separating w in a finite amount of time, given separable training data.
Improvement 1: shuffle the training data between iterations.
Improvement 2: average the weight vectors seen during training.
The perceptron is an example of an online learning algorithm: w is updated based on a single training instance in isolation, w^(i+1) = w^(i) + f(x_t, y_t) - f(x_t, y').
Compare decision trees, which perform batch learning: all training instances are used to find the best split.

Assignment 2
Implement the perceptron (starter code in Python).
Three subtasks:
1. Implement the linear classifier (dot product).
2. Implement the perceptron update.
3. Evaluate on the spambase data set.
For VG: implement the averaged perceptron.

Appendix: Proofs and Derivations

Convergence Proof for the Perceptron
Let w^(k-1) be the weights before the k-th mistake, and suppose the k-th mistake is made at the t-th example (x_t, y_t), so that
y' = argmax_y w^(k-1) · f(x_t, y), y' ≠ y_t, and w^(k) = w^(k-1) + f(x_t, y_t) - f(x_t, y').
Now: u · w^(k) = u · w^(k-1) + u · (f(x_t, y_t) - f(x_t, y')) ≥ u · w^(k-1) + γ.
Now: since w^(0) = 0 and u · w^(0) = 0, by induction on k, u · w^(k) ≥ kγ.
Now: since ||u|| ||w^(k)|| ≥ u · w^(k) and ||u|| = 1, it follows that ||w^(k)|| ≥ kγ.
Now: ||w^(k)||² = ||w^(k-1)||² + ||f(x_t, y_t) - f(x_t, y')||² + 2 w^(k-1) · (f(x_t, y_t) - f(x_t, y')),
so ||w^(k)||² ≤ ||w^(k-1)||² + R² (since R ≥ ||f(x_t, y_t) - f(x_t, y')|| and w^(k-1) · f(x_t, y_t) - w^(k-1) · f(x_t, y') ≤ 0).

Convergence Proof for the Perceptron (continued)
We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k-1)||² + R².
By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0, it follows that ||w^(k)||² ≤ kR².
Therefore, k²γ² ≤ ||w^(k)||² ≤ kR², and solving for k gives k ≤ R²/γ².
Therefore, the number of errors is bounded!