Machine Learning for NLP
Perceptron Learning
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research
Introduction: Linear Classifiers

Classifiers covered so far:
  - Decision trees
  - Nearest neighbor
Next three lectures: linear classifiers
Statistics from Google Scholar (October 2009):
  - "Maximum Entropy" & NLP: 2660 hits, 141 before 2000
  - "SVM" & NLP: 2210 hits, 16 before 2000
  - "Perceptron" & NLP: 947 hits, 118 before 2000
All are linear classifiers and basic tools in NLP
They are also the bridge to deep learning techniques
Introduction: Outline

Today:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
Next time:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
Final session:
  - Naive Bayes classifiers
  - Generative and discriminative models
Preliminaries: Inputs and Outputs

Input: $x \in \mathcal{X}$
  - e.g., a document or sentence with some words $x = w_1 \ldots w_n$, or a series of previous actions
Output: $y \in \mathcal{Y}$
  - e.g., a parse tree, document class, part-of-speech tag, or word sense
Input/output pair: $(x, y) \in \mathcal{X} \times \mathcal{Y}$
  - e.g., a document $x$ and its label $y$
Sometimes $x$ is explicit in $y$, e.g., a parse tree $y$ will contain the sentence $x$
Preliminaries: Feature Representations

We assume a mapping from input-output pairs $(x, y)$ to a high-dimensional feature vector
  - $f(x, y) : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^m$
In some cases, e.g., binary classification with $\mathcal{Y} = \{-1, +1\}$, we can map only from the input to the feature space
  - $f(x) : \mathcal{X} \rightarrow \mathbb{R}^m$
However, most problems in NLP require more than two classes, so we focus on the multi-class case
For any vector $\mathbf{v} \in \mathbb{R}^m$, let $v_j$ be the $j$th value
Preliminaries: Features and Classes

All features must be numerical
  - Numerical features are represented directly as $f_i(x, y) \in \mathbb{R}$
  - Binary (boolean) features are represented as $f_i(x, y) \in \{0, 1\}$
Multinomial (categorical) features must be binarized (see the sketch below)
  - Instead of: $f_i(x, y) \in \{v_0, \ldots, v_p\}$
  - We have: $f_{i+0}(x, y) \in \{0, 1\}, \ldots, f_{i+p}(x, y) \in \{0, 1\}$
  - Such that: $f_{i+j}(x, y) = 1$ iff the original feature has value $v_j$ (and 0 otherwise)
We need distinct features for distinct output classes
  - Instead of: $f_i(x)$ ($1 \leq i \leq m$)
  - We have: $f_{i+0m}(x, y), \ldots, f_{i+Nm}(x, y)$ for $\mathcal{Y} = \{0, \ldots, N\}$
  - Such that: $f_{i+jm}(x, y) = f_i(x)$ iff $y = y_j$ (and 0 otherwise)
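A minimal sketch of binarizing one categorical feature into indicator features, using an illustrative dictionary representation (the function and feature names are not part of the assignment's starter code):

```python
def binarize(feature_name, value, possible_values):
    """Turn one categorical feature into a group of 0/1 indicator features."""
    return {f"{feature_name}={v}": (1 if value == v else 0) for v in possible_values}

# Example: a word's part-of-speech as a categorical feature with three values
print(binarize("pos", "Noun", ["Noun", "Verb", "Adj"]))
# {'pos=Noun': 1, 'pos=Verb': 0, 'pos=Adj': 0}
```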
Preliminaries: Examples

x is a document and y is a label
  - f_j(x, y) = 1 if x contains the word "interest" and y = financial (0 otherwise)
  - f_j(x, y) = % of words in x with punctuation, if y = scientific (0 otherwise)
x is a word and y is a part-of-speech tag
  - f_j(x, y) = 1 if x = "bank" and y = Verb (0 otherwise)
Preliminaries: Examples

x is a name, y is a label classifying the name

  f_0(x, y) = 1 if x contains "George" and y = Person
  f_1(x, y) = 1 if x contains "Washington" and y = Person
  f_2(x, y) = 1 if x contains "Bridge" and y = Person
  f_3(x, y) = 1 if x contains "General" and y = Person
  f_4(x, y) = 1 if x contains "George" and y = Object
  f_5(x, y) = 1 if x contains "Washington" and y = Object
  f_6(x, y) = 1 if x contains "Bridge" and y = Object
  f_7(x, y) = 1 if x contains "General" and y = Object
  (each feature is 0 otherwise)

  x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
  x = "George Washington Bridge", y = Object:  f(x, y) = [0 0 0 0 1 1 1 0]
  x = "George Washington George", y = Object:  f(x, y) = [0 0 0 0 1 1 0 0]
Preliminaries: Block Feature Vectors

  x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
  x = "George Washington Bridge", y = Object:  f(x, y) = [0 0 0 0 1 1 1 0]
  x = "George Washington George", y = Object:  f(x, y) = [0 0 0 0 1 1 0 0]

One equal-size block of the feature vector for each label
Input features duplicated in each block
Non-zero values allowed only in one block (see the sketch below)
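A minimal sketch of the block construction, assuming a fixed list of input features and labels (the helper and names are illustrative):

```python
def block_feature_vector(x, y, input_features, labels):
    """Place the input feature values in the block belonging to label y;
    all other blocks stay zero."""
    m = len(input_features)
    vec = [0] * (m * len(labels))
    offset = labels.index(y) * m
    for i, feat in enumerate(input_features):
        vec[offset + i] = feat(x)
    return vec

# Input features: does the name contain a given word?
input_features = [lambda x, w=w: 1 if w in x.split() else 0
                  for w in ["George", "Washington", "Bridge", "General"]]
labels = ["Person", "Object"]

print(block_feature_vector("General George Washington", "Person", input_features, labels))
# [1, 1, 0, 1, 0, 0, 0, 0]
```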
Linear Classifiers

Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
Let $\mathbf{w} \in \mathbb{R}^m$ be a high-dimensional weight vector
If we assume that $\mathbf{w}$ is known, then we define our classifier as follows
Multiclass classification: $\mathcal{Y} = \{0, 1, \ldots, N\}$

  $\hat{y} = \arg\max_y \mathbf{w} \cdot f(x, y) = \arg\max_y \sum_{j=0}^{m} w_j f_j(x, y)$

Binary classification is just a special case of multiclass
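A minimal sketch of multiclass prediction with a sparse dictionary representation; the helper names (`score`, `predict`, `feature_map`) are illustrative and not the assignment's starter code:

```python
def score(w, features):
    """Dot product between a weight dict and a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in features.items())

def predict(w, x, labels, feature_map):
    """Return the label y with the highest score w . f(x, y)."""
    return max(labels, key=lambda y: score(w, feature_map(x, y)))

# Toy example in the spirit of the name-classification features above
def feature_map(x, y):
    return {f"{word}&{y}": 1 for word in x.split()}

w = {"George&Person": 1.0, "Bridge&Object": 2.0}
print(predict(w, "George Washington Bridge", ["Person", "Object"], feature_map))
# Object  (score 2.0 vs. 1.0 for Person)
```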
Linear Classifiers: Bias Terms

Linear classifiers are often presented as

  $\hat{y} = \arg\max_y \sum_{j=0}^{m} w_j f_j(x, y) + b_y$

where $b_y$ is a bias or offset term
But this can be folded into $f$:

  x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = "General George Washington", y = Object: f(x, y) = [0 0 0 0 0 1 1 0 1 1]

  f_4(x, y) = 1 if y = Person (0 otherwise)
  f_9(x, y) = 1 if y = Object (0 otherwise)

$w_4$ and $w_9$ are now the bias terms for the labels
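A minimal sketch of folding the bias into the feature map, extending the illustrative `feature_map` from the previous sketch with one always-on feature per label:

```python
def feature_map_with_bias(x, y):
    feats = feature_map(x, y)   # illustrative word&label features from the sketch above
    feats[f"bias&{y}"] = 1      # always-on feature; its weight plays the role of b_y
    return feats
```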
Binary Linear Classifier

Divides all points: [figure]
Multiclass Linear Classifier

Defines regions of space: [figure]
  - i.e., the "+" region contains all points $x$ for which $+ = \arg\max_y \mathbf{w} \cdot f(x, y)$
Separability

A set of points is separable if there exists a $\mathbf{w}$ such that classification is perfect
  [figures: separable vs. not separable]
This can also be defined mathematically (and we will shortly)
Supervised Learning: How to Find w

Input: training examples $\mathcal{T} = \{(x_t, y_t)\}_{t=1}^{|\mathcal{T}|}$
Input: feature representation $f$
Output: $\mathbf{w}$ that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression, Naive Bayes)
Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case
Perceptron

Choose a $\mathbf{w}$ that minimizes error:

  $\mathbf{w} = \arg\min_{\mathbf{w}} \sum_t 1[y_t \neq \arg\max_y \mathbf{w} \cdot f(x_t, y)]$

  where $1[p] = 1$ if $p$ is true and $0$ otherwise

This is a 0-1 loss function
Aside: when minimizing error, people tend to use hinge loss or other, smoother loss functions
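A minimal sketch of the 0-1 training error that the perceptron tries to drive to zero, reusing the illustrative `predict` helper from the earlier sketch (not part of the starter code):

```python
def zero_one_error(w, data, labels, feature_map):
    """Count training instances whose highest-scoring label is not the gold label."""
    return sum(1 for x, y_gold in data
               if predict(w, x, labels, feature_map) != y_gold)
```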
Perceptron Learning Algorithm

Training data: $\mathcal{T} = \{(x_t, y_t)\}_{t=1}^{|\mathcal{T}|}$

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..T
  4.     Let y' = arg max_y w^(i) . f(x_t, y)
  5.     if y' != y_t
  6.       w^(i+1) = w^(i) + f(x_t, y_t) - f(x_t, y')
  7.       i = i + 1
  8. return w^(i)
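A minimal Python sketch of this training loop, again with the illustrative sparse-dictionary representation and the `score` helper from the earlier sketch (the assignment's starter code may organize things differently):

```python
from collections import defaultdict

def train_perceptron(data, labels, feature_map, epochs=5):
    """Multiclass perceptron: on each mistake, add the gold-label features and
    subtract the features of the wrongly predicted label."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: score(w, feature_map(x, y)))
            if y_pred != y_gold:
                for name, value in feature_map(x, y_gold).items():
                    w[name] += value
                for name, value in feature_map(x, y_pred).items():
                    w[name] -= value
    return w
```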
Perceptron: Separability and Margin

Given a training instance $(x_t, y_t)$, define:
  - $\bar{\mathcal{Y}}_t = \mathcal{Y} - \{y_t\}$
  - i.e., $\bar{\mathcal{Y}}_t$ is the set of incorrect labels for $x_t$
A training set $\mathcal{T}$ is separable with margin $\gamma > 0$ if there exists a vector $\mathbf{u}$ with $\|\mathbf{u}\| = 1$ such that:
  - $\mathbf{u} \cdot f(x_t, y_t) - \mathbf{u} \cdot f(x_t, y') \geq \gamma$ for all $y' \in \bar{\mathcal{Y}}_t$
  - where $\|\mathbf{u}\| = \sqrt{\sum_j u_j^2}$ (Euclidean or L2 norm)
Assumption: the training set is separable with margin $\gamma$
Perceptron: Main Theorem

Theorem: For any training set separable with a margin of $\gamma$, the following holds for the perceptron algorithm:

  number of mistakes made during training $\leq \dfrac{R^2}{\gamma^2}$

where $R \geq \|f(x_t, y_t) - f(x_t, y')\|$ for all $(x_t, y_t) \in \mathcal{T}$ and $y' \in \bar{\mathcal{Y}}_t$
Thus, after a finite number of training iterations, the error on the training set will converge to zero
For the proof, see Collins (2002)
Practical Considerations

The perceptron is sensitive to the order of the training examples
  - Consider a training set of 500 positive instances followed by 500 negative instances
Shuffling:
  - Randomly permute the training instances between iterations
Voting and averaging (see the sketch below):
  - Let $\mathbf{w}_1, \ldots, \mathbf{w}_n$ be all the weight vectors seen during training
  - The voted perceptron predicts the majority vote of $\mathbf{w}_1, \ldots, \mathbf{w}_n$
  - The averaged perceptron predicts using the average vector:
    $\bar{\mathbf{w}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i$
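A minimal sketch of the averaging idea, building on the illustrative `score` and `feature_map` helpers from the earlier sketches; instead of storing every intermediate vector, it keeps a running sum of the weights after each training instance (a common equivalent formulation):

```python
from collections import defaultdict

def train_averaged_perceptron(data, labels, feature_map, epochs=5):
    """Averaged perceptron: return the mean of the weight vectors seen
    after every training instance (whether or not a mistake was made)."""
    w = defaultdict(float)
    w_sum = defaultdict(float)
    n = 0
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = max(labels, key=lambda y: score(w, feature_map(x, y)))
            if y_pred != y_gold:
                for name, value in feature_map(x, y_gold).items():
                    w[name] += value
                for name, value in feature_map(x, y_pred).items():
                    w[name] -= value
            for name, value in w.items():
                w_sum[name] += value
            n += 1
    return {name: value / n for name, value in w_sum.items()}
```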
Perceptron Summary

Learns a linear classifier that minimizes error
  - Guaranteed to find such a $\mathbf{w}$ in a finite amount of time (given the separability assumption)
Improvement 1: shuffle the training data between iterations
Improvement 2: average the weight vectors seen during training
The perceptron is an example of an online learning algorithm
  - $\mathbf{w}$ is updated based on a single training instance in isolation:
    $\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} + f(x_t, y_t) - f(x_t, y')$
Compare decision trees, which perform batch learning
  - All training instances are used to find the best split
Assignment 2

Implement the perceptron (starter code in Python)
Three subtasks:
  - Implement the linear classifier (dot product)
  - Implement the perceptron update
  - Evaluate on the spambase data set
For VG:
  - Implement the averaged perceptron
Appendix

Proofs and Derivations
Perceptron Learning Algorithm (repeated)

Training data: $\mathcal{T} = \{(x_t, y_t)\}_{t=1}^{|\mathcal{T}|}$

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..T
  4.     Let y' = arg max_y w^(i) . f(x_t, y)
  5.     if y' != y_t
  6.       w^(i+1) = w^(i) + f(x_t, y_t) - f(x_t, y')
  7.       i = i + 1
  8. return w^(i)

Convergence Proof for Perceptron

Let $\mathbf{w}^{(k-1)}$ be the weights before the $k$th mistake
Suppose the $k$th mistake is made at the $t$th example $(x_t, y_t)$:
  - $y' = \arg\max_y \mathbf{w}^{(k-1)} \cdot f(x_t, y)$, with $y' \neq y_t$
  - $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + f(x_t, y_t) - f(x_t, y')$
Now: $\mathbf{u} \cdot \mathbf{w}^{(k)} = \mathbf{u} \cdot \mathbf{w}^{(k-1)} + \mathbf{u} \cdot (f(x_t, y_t) - f(x_t, y')) \geq \mathbf{u} \cdot \mathbf{w}^{(k-1)} + \gamma$
Now: since $\mathbf{w}^{(0)} = 0$ and $\mathbf{u} \cdot \mathbf{w}^{(0)} = 0$, by induction on $k$, $\mathbf{u} \cdot \mathbf{w}^{(k)} \geq k\gamma$
Now: since $\mathbf{u} \cdot \mathbf{w}^{(k)} \leq \|\mathbf{u}\| \, \|\mathbf{w}^{(k)}\|$ and $\|\mathbf{u}\| = 1$, then $\|\mathbf{w}^{(k)}\| \geq k\gamma$
Now: $\|\mathbf{w}^{(k)}\|^2 = \|\mathbf{w}^{(k-1)}\|^2 + \|f(x_t, y_t) - f(x_t, y')\|^2 + 2\,\mathbf{w}^{(k-1)} \cdot (f(x_t, y_t) - f(x_t, y'))$
  $\|\mathbf{w}^{(k)}\|^2 \leq \|\mathbf{w}^{(k-1)}\|^2 + R^2$
  (since $R \geq \|f(x_t, y_t) - f(x_t, y')\|$ and $\mathbf{w}^{(k-1)} \cdot f(x_t, y_t) - \mathbf{w}^{(k-1)} \cdot f(x_t, y') \leq 0$)
Convergence Proof for Perceptron (continued)

We have just shown that $\|\mathbf{w}^{(k)}\| \geq k\gamma$ and $\|\mathbf{w}^{(k)}\|^2 \leq \|\mathbf{w}^{(k-1)}\|^2 + R^2$
By induction on $k$, and since $\mathbf{w}^{(0)} = 0$ and $\|\mathbf{w}^{(0)}\|^2 = 0$:
  $\|\mathbf{w}^{(k)}\|^2 \leq kR^2$
Therefore:
  $k^2\gamma^2 \leq \|\mathbf{w}^{(k)}\|^2 \leq kR^2$
and solving for $k$:
  $k \leq \dfrac{R^2}{\gamma^2}$
Therefore the number of errors is bounded!