Part of Speech (POS) Tagging
L645, Dept. of Linguistics, Indiana University, Fall 2009

Def.: POS tagging = assigning word class information to words.

Example:

  the          man    bought  a           book
  determiner   noun   verb    determiner  noun

Linguistic Questions
- How do we divide the text into individual word tokens?
- How do we choose a tagset to represent all words?
- How do we select appropriate tags for individual words?

Multiwords
- Multiwords: "in spite of the firm promise", "because of technical problems"
- Merged words: "I don't know", "he couldn't come"
- Compounds: "Great Northern Nekoosa Corp.", "an Atlanta-based forest-products company"

Possible Solution: Layered Analysis

  <w pos=in>in spite of</w>
  <w pos=md+rb>shouldn't</w>
  <w pos=jj>
    <w pos=nns>hundreds</w>
    <w pos=in>of</w>
    <w pos=nns>billions</w>
    <w pos=in>of</w>
    <w pos=nns>yen</w>
  </w>
  <w pos=nn>market</w>

Issues in POS
- Define which words are considered multiwords: no multiwords, ditto tags, or layered annotation.
- Describe how mergers are treated: combined tags or splitting up.
- Describe how compounds are treated: surface-oriented or layered annotation.
- Can the solutions be implemented automatically? (A sketch follows below.)
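As a rough illustration that the multiword decision can be automated, here is a minimal Python sketch; the MULTIWORDS lexicon, the function name, and the greedy longest-match strategy are illustrative assumptions, not part of the slides:

    # Minimal sketch: greedy longest-match merging of known multiwords.
    # The MULTIWORDS lexicon and all names here are illustrative assumptions.
    MULTIWORDS = {("in", "spite", "of"), ("because", "of")}
    MAX_LEN = max(len(mw) for mw in MULTIWORDS)

    def tokenize_with_multiwords(words):
        """Merge runs of tokens that form a known multiword expression."""
        tokens, i = [], 0
        while i < len(words):
            # Try the longest possible multiword first.
            for n in range(min(MAX_LEN, len(words) - i), 1, -1):
                candidate = tuple(w.lower() for w in words[i:i + n])
                if candidate in MULTIWORDS:
                    tokens.append(" ".join(words[i:i + n]))  # one token, e.g. "in spite of"
                    i += n
                    break
            else:  # no multiword found: keep the single word
                tokens.append(words[i])
                i += 1
        return tokens

    print(tokenize_with_multiwords("he stayed in spite of the rain".split()))
    # ['he', 'stayed', 'in spite of', 'the', 'rain']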
Issues in Selecting a Tagset
- Conciseness: short labels are better than long ones (prep vs. preposition).
- Perspicuity: labels that are easily interpreted are better (prep vs. in).
- Analysability: it should be possible to decompose a label into its parts (vmfin: verb, modal, finite; pds: pronoun, demonstrative, substituting).

Representations

Horizontal format:

  I/PP will/MD then/RB maybe/RB travel/VB directly/RB on/IN to/IN Berlin/NP

Vertical format:

  I         PP
  will      MD
  then      RB
  maybe     RB
  travel    VB
  directly  RB
  on        IN
  to        IN
  Berlin    NP

Tagset Size

  English:
    TOSCA                 32
    Penn Treebank         36
    BNC (C5)              61
    Brown                 77
    LOB                  132
    London-Lund Corpus   197
    TOSCA-ICE            270
  Romanian:              614
  Hungarian:        ca. 2,100

Penn Treebank Tagset

  CC    Coordinating conjunction      RB    Adverb
  CD    Cardinal number               RBR   Adverb, comparative
  DT    Determiner                    RBS   Adverb, superlative
  EX    Existential there             RP    Particle
  FW    Foreign word                  SYM   Symbol
  IN    Prep. / subord. conjunction   TO    to
  JJ    Adjective                     UH    Interjection
  JJR   Adjective, comparative        VB    Verb, base form
  JJS   Adjective, superlative        VBD   Verb, past tense
  LS    List item marker              VBG   Verb, gerund / present part.
  MD    Modal                         VBN   Verb, past part.
  NN    Noun, singular or mass        VBP   Verb, non-3rd p. sing. pres.
  NNS   Noun, plural                  VBZ   Verb, 3rd p. sing. pres.
  NP    Proper noun, singular         WDT   Wh-determiner
  NPS   Proper noun, plural           WP    Wh-pronoun
  PDT   Predeterminer                 WP$   Possessive wh-pronoun
  POS   Possessive ending             WRB   Wh-adverb
  PRP   Personal pronoun              ,     Comma
  PRP$  Possessive pronoun            .     Sentence-final punctuation

Annotating POS Tags
There are two fundamentally different approaches (both are sketched in code below):
- Start from scratch: find characteristics in words or context (= rules) that indicate the word class, e.g. if a word ends in -ion, tag it as a noun.
- Accumulate a lexicon and disambiguate words that have more than one tag, e.g. the possible categories for "about": preposition, adverb, particle.

Assumption: local context is sufficient. Examples:
- "for the man": noun or verb?
- "we will man": noun or verb?
- "I can put": verb base form or past?
- "re-cap real quick": adjective or adverb?
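To make the two approaches concrete, here is a minimal Python sketch; the LEXICON entries, the fallback tag list for unknown words, and all names are illustrative assumptions:

    # Minimal sketch of the two approaches: a suffix rule ("-ion" -> noun)
    # and a lexicon mapping words to their ambiguity classes.
    # All rules and lexicon entries here are illustrative assumptions.
    LEXICON = {"about": ["IN", "RB", "RP"], "time": ["NN", "VB", "JJ"]}

    def guess_tags(word):
        """Return candidate POS tags for a word."""
        if word.lower() in LEXICON:          # lexicon lookup: ambiguity class
            return LEXICON[word.lower()]
        if word.endswith("ion"):             # rule: words ending in -ion are nouns
            return ["NN"]
        return ["NN", "VB", "JJ", "RB"]      # unknown word: all open-class tags

    for w in ["decision", "about", "flies"]:
        print(w, guess_tags(w))
    # decision ['NN']
    # about ['IN', 'RB', 'RP']
    # flies ['NN', 'VB', 'JJ', 'RB']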
Bigram Tagging
- Basic assumption: a POS tag depends only on the word itself and on the POS tag of the previous word.
- Use a lexicon to retrieve the ambiguity class for each word, e.g. word: beginning, ambiguity class: [JJ, NN, VBG].
- For unknown words, use heuristics, e.g. assign all open-class POS tags.
- Disambiguation: look for the most likely path through the possibilities.

Bigram Example

  time   flies   like   an   arrow

  S   NN    VBZ    IN    DT    NN    E
      VB    NNS    VB
      JJ

Each column lists the candidate tags for one word; S and E are the sentence-start and sentence-end markers.

Bigram Probabilities

  P(t_1 ... t_5) = P(t_1|S) P(w_1|t_1) P(t_2|t_1) P(w_2|t_2) ... P(t_5|t_4) P(w_5|t_5) P(E|t_5)

                 = [P(t_1, S) / P(S)] [P(w_1, t_1) / P(t_1)] [P(t_2, t_1) / P(t_1)] [P(w_2, t_2) / P(t_2)]
                   ... [P(t_5, t_4) / P(t_4)] [P(w_5, t_5) / P(t_5)] [P(E, t_5) / P(t_5)]

The P(t_i|t_{i-1}) terms are the transition probabilities; the P(w_i|t_i) terms are the lexical probabilities. A code sketch of this computation follows the table below.

Bigram Probability Table
(? marks a tag that is not recoverable from the source slides.)

  Lexical probabilities          Transition probabilities
  P(time|NN)   = 7.0727          P(NN|S)    = 0.6823     P(IN|NNS)  = 21.8302
  P(time|VB)   = 0.0005          P(VB|S)    = 0.5294     P(VB|VBZ)  =  0.7002
  P(time|JJ)   = 0               P(JJ|S)    = 0.8033     P(VB|NNS)  = 11.1406
  P(flies|VBZ) = 0.4754          P(VBZ|NN)  = 3.9005     P(?|VBZ)   = 15.0350
  P(flies|NNS) = 0.1610          P(VBZ|VB)  = 0.0566     P(?|NNS)   =  6.4721
  P(like|IN)   = 2.6512          P(VBZ|JJ)  = 2.0934     P(DT|IN)   = 31.4263
  P(like|VB)   = 2.8413          P(NNS|NN)  = 1.6076     P(DT|VB)   = 15.2649
  P(like|?)    = 0.5086          P(NNS|VB)  = 0.6566     P(DT|?)    =  5.3113
  P(an|DT)     = 1.4192          P(NNS|JJ)  = 2.4383     P(NN|DT)   = 38.0170
  P(arrow|NN)  = 0.0215          P(IN|VBZ)  = 8.5862     P(E|NN)    =  0.2069
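The product above can be computed directly in code. A minimal sketch follows, with hypothetical toy probability tables (TRANS, LEX) standing in for the slide's figures; all names are illustrative assumptions:

    # Minimal sketch: score one tag path t_1..t_n for a sentence w_1..w_n as
    # P(t_1|S) * prod P(w_i|t_i) * prod P(t_i|t_{i-1}) * P(E|t_n).
    # The toy probability tables below are illustrative assumptions.
    TRANS = {("S", "NN"): 0.4, ("NN", "VBZ"): 0.3, ("VBZ", "IN"): 0.25,
             ("IN", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "E"): 0.2}
    LEX = {("time", "NN"): 0.07, ("flies", "VBZ"): 0.005,
           ("like", "IN"): 0.026, ("an", "DT"): 0.14, ("arrow", "NN"): 0.0002}

    def path_probability(words, tags):
        """Probability of one path through the tag lattice."""
        prob = 1.0
        prev = "S"                                   # sentence-start marker
        for w, t in zip(words, tags):
            prob *= TRANS.get((prev, t), 0.0)        # transition probability
            prob *= LEX.get((w, t), 0.0)             # lexical probability
            prev = t
        return prob * TRANS.get((prev, "E"), 0.0)    # sentence-end transition

    words = "time flies like an arrow".split()
    print(path_probability(words, ["NN", "VBZ", "IN", "DT", "NN"]))

Scoring every path in the lattice this way and keeping the best one is the disambiguation step described above.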
Bigram Counter-Examples
Cases where the previous tag alone is not enough:
- the course or start before
- barely changed: "he was barely changed" or "he barely changed his contents"
- that beginning: "that beginning part", "that beginning frightened the students", or "with that beginning early, he was forced ..."

Estimating the Probabilities
The simplest way to calculate such probabilities from a corpus is maximum likelihood estimation (sketched in code below):

  P_MLE(t_n | t_{n-1}) = C(t_{n-1} t_n) / C(t_{n-1})
  P_MLE(w_n | t_n)     = C(w_n t_n) / C(t_n)

- Uses relative frequency.
- Maximizes the probability of the corpus.
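A minimal sketch of these MLE counts; the toy tagged corpus and all names are illustrative assumptions:

    from collections import Counter

    # Minimal sketch: MLE estimates from a toy tagged corpus.
    # P(t_n|t_{n-1}) = C(t_{n-1} t_n) / C(t_{n-1}),  P(w_n|t_n) = C(w_n t_n) / C(t_n)
    # The two-sentence corpus below is an illustrative assumption.
    corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
              [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

    tag_count, bigram_count, word_tag_count = Counter(), Counter(), Counter()
    for sentence in corpus:
        tags = ["S"] + [t for _, t in sentence]      # prepend start marker
        for prev, t in zip(tags, tags[1:]):
            bigram_count[(prev, t)] += 1
        for w, t in sentence:
            word_tag_count[(w, t)] += 1
            tag_count[t] += 1
        tag_count["S"] += 1

    def p_trans(t, prev):
        return bigram_count[(prev, t)] / tag_count[prev]

    def p_lex(w, t):
        return word_tag_count[(w, t)] / tag_count[t]

    print(p_trans("NN", "DT"))   # 1.0: every DT precedes NN in the toy corpus
    print(p_lex("dog", "NN"))    # 0.5: one of the two NN tokens is "dog"

Note that any word-tag pair or tag bigram that never occurs in the corpus gets probability 0 here, which motivates the smoothing discussed next.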
Smoothing
- We need a smoothing or discounting method to give minimal probabilities to unseen events.
- Simplest possibility: learn from hapax legomena (words that appear only once).

Motivating Hidden Markov Models
- Thinking back to Markov models: we are now given a sequence of words and want to find the POS tags.
- The underlying event of POS tags can be thought of as generating the words in the sentence.
- Each state in the Markov model can be a POS tag.
- We don't know the correct state sequence: a Hidden Markov Model (HMM).
- This requires an additional emission matrix linking words with POS tags (cf. P(arrow|NN)).

Example HMM
Assume DET, N, and VB as hidden states, with this transition matrix (A):

         DET    N     VB
  DET    0.01  0.89  0.10
  N      0.30  0.20  0.50
  VB     0.67  0.23  0.10

... this emission matrix (B):

        dogs  bit   the   chased  a     these  cats  ...
  DET   0.0   0.0   0.33  0.0     0.33  0.33   0.0   ...
  N     0.2   0.1   0.0   0.0     0.0   0.0    0.15  ...
  VB    0.1   0.6   0.0   0.3     0.0   0.0    0.0   ...

... and this initial probability matrix (π):

  DET  0.7
  N    0.2
  VB   0.1

Using the Example HMM
In order to generate words, we (as shown in the sketch below):
1. Choose a tag/state from π.
2. Choose an emitted word from the relevant row of B.
3. Choose a transition from the relevant row of A.
4. Repeat steps 2 and 3 until we hit a stopping point, keeping track of probabilities as we go along.

We could generate all possibilities this way and find the most probable sequence. We want a more efficient way of finding the most probable sequence.
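A minimal Python sketch of the generation procedure, using the A, B, and π values from the example HMM above. Since the slides specify no explicit stop state, a fixed sentence length is assumed here, and the B rows include only the words shown on the slide:

    import random

    # Sketch of the generation procedure, using the A, B, and pi values from
    # the example HMM above. The B rows list only the words shown on the
    # slide (the "..." columns are omitted), and the fixed sentence length
    # is an assumption standing in for the unspecified stopping point.
    PI = {"DET": 0.7, "N": 0.2, "VB": 0.1}
    A = {"DET": {"DET": 0.01, "N": 0.89, "VB": 0.10},
         "N":   {"DET": 0.30, "N": 0.20, "VB": 0.50},
         "VB":  {"DET": 0.67, "N": 0.23, "VB": 0.10}}
    B = {"DET": {"the": 0.33, "a": 0.33, "these": 0.33},
         "N":   {"dogs": 0.2, "bit": 0.1, "cats": 0.15},
         "VB":  {"dogs": 0.1, "bit": 0.6, "chased": 0.3}}

    def sample(dist):
        """Draw one outcome from a {value: probability} distribution."""
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def generate(length=4):
        """Walk the HMM, returning (words, tags, path probability)."""
        state = sample(PI)                  # step 1: initial state from pi
        prob = PI[state]
        words, tags = [], []
        for i in range(length):
            word = sample(B[state])         # step 2: emit from the row of B
            prob *= B[state][word]
            words.append(word)
            tags.append(state)
            if i < length - 1:
                nxt = sample(A[state])      # step 3: transition via row of A
                prob *= A[state][nxt]
                state = nxt                 # step 4: repeat from the new state
        return words, tags, prob

    random.seed(0)
    print(generate())

Each run returns one word sequence together with the probability of its path. Enumerating every tag path this way to find the most probable one is exponential in the sentence length, which is why a more efficient search is wanted.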