
Part of Speech Tagging - A solved problem?

Wolfgang Fischl - 0602106

December 8, 2009

Abstract

Since as early as 100 B.C., humans have been aware that language consists of several distinct parts, called parts-of-speech. Identifying these parts-of-speech plays a crucial role in many fields of linguistics. Since TAGGIT, the first large-scale part-of-speech tagger, many algorithms and methods have been developed, including rule-based, probabilistic and hybrid taggers. This paper surveys these methods, discusses problems that arise when tagging large text corpora, and finally asks whether part-of-speech tagging is a solved problem.

1 Introduction

Parts-of-speech (also known as POS, tagsets, word classes, morphological classes, or lexical tags) play an important role in nearly every human language. What are parts-of-speech? As early as 100 B.C., the Greek grammarian Dionysius Thrax identified eight parts-of-speech in the Greek language: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article. These eight classes form the basis of most part-of-speech descriptions and are also included in more recent tagsets (see Table 1).

Name            Year of appearance   No. of classes
Brown Corpus    1979                 87
Penn Treebank   1993                 45
C5 tagset       1997                 61
C7 tagset       1997                 146

Table 1: Recent part-of-speech tagsets

The POS tag of a word is influenced not only by the word itself, but also by the word's neighbors. In fact, one word can have different POS tags depending on its context. For example, the word "object" can be used either as a noun (the object) or as a verb (to object). Knowing the exact POS can be very helpful in many language processing applications. The POS of the word "object" changes its pronunciation: as a noun the first syllable is stressed, as a verb the second [1]. Other applications include text indexing and the linguistic analysis of large tagged text corpora [2].

Part-of-speech tagging is the automatic assignment of such tags to entire texts. The goal of this paper is to understand part-of-speech tags, the methods common to part-of-speech tagging algorithms, and some concrete part-of-speech taggers and their problems, and to explore whether part-of-speech tagging is an already solved computational problem.

2 Part-of-speech tagging

2.1 Tags in English

The eight part-of-speech tags identified around 100 B.C. are only a broad description of the English language. In practice there is more interest in identifying more specific POS tags. The smallest tagset in current use has 45 tags, the largest 146 (see Table 1).

What, then, are the 45 tags of the Penn Treebank tagset? For example, the nouns alone are separated into four classes: singular or mass nouns (llama, snow); plural nouns (llamas); proper nouns, singular (IBM); and proper nouns, plural (Austrians). The verbs consist of six classes: base form (eat), past tense (ate), gerund (eating), past participle (eaten), non-3sg present (eat), and 3sg present (eats). A detailed list of the Penn Treebank tags can be found in [3].

The classes can be categorized into open and closed classes. Closed classes have a fixed number of members; for example, articles are a closed class (in English the only articles are "the", "a" and "an"). Open classes do not have a fixed number of members, because new words are occasionally coined or borrowed from other languages. In the English language there are four open classes: nouns, verbs, adjectives and adverbs [1].

2.2 The tagging process

The input for every tagging algorithm is a string of words and a specified tagset. The output is a single best tag for each word. Most algorithms have three steps in common to find the best tag ([1], [2]):

1. Tokenization is part of most tagging algorithms or is done as a preprocessing step. The text is divided into tokens, which include end-of-sentence punctuation marks and word-like units.

2. Ambiguity look-up uses a lexicon and the tagset to assign each word a list of possible part-of-speech tags. If a word is not in the lexicon, a guesser can make assumptions using the word's neighborhood. The guesser only needs to consider the open-class tag types, because most lexicons contain all closed-class words. (Steps 1 and 2 are illustrated in the sketch following this list.)

3. Disambiguation is the last and also the hardest of the three steps. After step 2 every word has multiple possible tags; disambiguation tries to eliminate all but one. Two classes of methods can be used for disambiguation. Rule-based taggers eliminate tags using specific rules; one such tagger is described in section 3. Stochastic taggers use a tagged training set to compute the probability of a given word in a specific context and choose the tag with the highest probability; a stochastic tagger is described in section 4. Some methods combine both approaches; one such hybrid is described in section 5.
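The following minimal Python sketch illustrates steps 1 and 2; the toy lexicon and tag lists are hypothetical and purely illustrative, not taken from any of the cited taggers.

# A minimal sketch of steps 1 and 2 (tokenization and ambiguity look-up),
# assuming a hypothetical toy lexicon; real taggers use full-size lexicons.
import re

LEXICON = {            # word -> possible Penn Treebank tags (toy example)
    "the": ["DT"],
    "object": ["NN", "VB"],
    "is": ["VBZ"],
    "heavy": ["JJ"],
    ".": ["."],
}
OPEN_CLASS_TAGS = ["NN", "NNS", "NNP", "VB", "JJ", "RB"]  # guesser fallback

def tokenize(text):
    """Split text into word-like units and end-of-sentence punctuation."""
    return re.findall(r"\w+|[.!?]", text.lower())

def ambiguity_lookup(tokens):
    """Assign each token its list of candidate tags from the lexicon.
    Unknown words get all open-class tags, as a naive guesser would."""
    return [(t, LEXICON.get(t, OPEN_CLASS_TAGS)) for t in tokens]

if __name__ == "__main__":
    for token, tags in ambiguity_lookup(tokenize("The object is heavy.")):
        print(token, tags)

Step 3, disambiguation, then has to reduce each candidate list to a single tag; the following sections describe how.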

3 Rule-based POS taggers

One of the first POS taggers, called TAGGIT, was proposed in 1971 [4]. It was based on context-pattern rules and used a 71-item tagset and a disambiguation grammar of 3,300 rules [1]. TAGGIT tagged 77 percent of the words in the million-word Brown University corpus correctly. Because all rules need to be hand-written, such a rule-based tagger requires a lot of work and knowledge of the language. In 1992 Eric Brill proposed a new tagger that learns its rules itself [5].

The motivation behind Brill's simple rule-based part-of-speech tagger is that stochastic or probabilistic taggers (see section 4) achieved substantially better tagging results, while earlier rule-based taggers needed a wide variety of hand-written rules and knowledge of the language. He therefore created a tagger that generates rules from a set of templates and a training text. All words are automatically assigned an initial tag with a basic lexical tagger. Then patches are generated from eight patch templates (excerpted from [5]):

Change tag a to tag b when:

1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
7. The current word is (is not) capitalized.
8. The previous word is (is not) capitalized.

For each error triple <tag a, tag b, number> and each patch, the reduction in error is computed. The patch with the highest error reduction is added to the list of patches. The patch acquisition procedure continues until a specific threshold is reached (e.g. no further reduction in error). A new text is tagged by first assigning initial tags with the same lexical tagger used for the training corpus, and then applying each patch from the list of patches, hopefully decreasing the error rate. A minimal sketch of this patch-application loop follows below.

[5] showed that an accuracy of 95%-99% is possible. Moreover, only 71 patches were needed to achieve these high accuracies, which showed that rule-based taggers can compete with stochastic methods. This tagger is very easy to understand and can be used with many different text genres, simply by training it on that specific genre. This kind of tagging is sometimes also called transformation-based tagging.
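The following Python sketch shows the core of the transformation-based loop under strong simplifying assumptions: the two patches are hypothetical instances of template 1 above, whereas a real implementation would learn its patches by measuring error reduction on a training corpus.

# A minimal sketch of transformation-based tagging, assuming hypothetical
# patches of the form "change tag a to tag b when the preceding word is
# tagged z"; this is not Brill's actual implementation.
PATCHES = [
    # (from_tag, to_tag, required preceding tag)
    ("VB", "NN", "DT"),   # e.g. "the object": verb reading -> noun after a determiner
    ("NN", "VB", "TO"),   # e.g. "to object": noun reading -> verb after "to"
]

def apply_patches(tagged):
    """tagged: list of (word, tag) pairs from the initial lexical tagger.
    Each patch rewrites a tag when the preceding tag matches its condition."""
    tags = [t for _, t in tagged]
    for from_tag, to_tag, prev_tag in PATCHES:   # patches applied in learned order
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip([w for w, _ in tagged], tags))

print(apply_patches([("to", "TO"), ("object", "NN")]))
# -> [('to', 'TO'), ('object', 'VB')]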

The advantages of rule-based taggers are that rules can be hand-written and easily comprehended, and that rule-based taggers can achieve good results with just a few rules. The disadvantages are that the rules are language- and corpus-specific, and that programming a really good tagger takes a large amount of work and requires a lot of linguistic knowledge.

4 Stochastic POS taggers

Stochastic taggers make use of probabilities. The goal is to find the most probable sequence of tags for a given sequence of words. This can be modeled as a special case of Bayesian inference, using a Hidden Markov Model:

\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})    (1)

Here \hat{t}_1^n is the most probable sequence of tags for the word sequence w_1^n. Equation 1 rests on several assumptions:

1. The probability P(t_1^n \mid w_1^n) of a sequence of words w_1^n receiving the sequence of tags t_1^n can be transformed with Bayes' rule into P(w_1^n \mid t_1^n) P(t_1^n) / P(w_1^n). The denominator can be dropped, since the probability of the word sequence w_1^n does not change across tag sequences. This leaves a prior P(t_1^n) and a likelihood P(w_1^n \mid t_1^n), the probability of seeing a sequence of words given a sequence of tags.

2. The words are assumed to be independent of each other, so P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i).

3. The probability of a tag is assumed to depend only on its predecessor: P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}). Taggers making this assumption are called bigram taggers. The algorithm can be improved by using trigrams (a tag depends on its two predecessors); a combination of bigram and trigram probabilities is also possible.

Combining all assumptions yields equation 1 [1].

4.1 Calculating the probabilities

Equation 1 contains two unknown probabilities, but both are easy to train. Estimates can be computed from an annotated corpus, e.g. the one-million-word Brown corpus. To estimate the probability of a tag t_i following a tag t_{i-1}, the ratio of counts in equation 2 is calculated for every pair <t_{i-1}, t_i> in the corpus:

P(t_i \mid t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})    (2)

An estimate of the word likelihood P(w_i \mid t_i) is likewise calculated from a ratio of counts:

P(w_i \mid t_i) = C(t_i, w_i) / C(t_i)    (3)

A small sketch of this counting procedure follows.
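The following Python sketch computes the maximum likelihood estimates of equations 2 and 3 from a tiny hand-tagged corpus; the corpus and the resulting values are illustrative only.

# A minimal sketch of the maximum likelihood estimates in equations 2 and 3,
# computed from a tiny hand-tagged toy corpus.
from collections import Counter

corpus = [("the", "DT"), ("object", "NN"), ("is", "VBZ"), ("heavy", "JJ")]

tag_counts = Counter(tag for _, tag in corpus)                  # C(t_i)
bigram_counts = Counter(
    (corpus[i - 1][1], corpus[i][1]) for i in range(1, len(corpus))
)                                                               # C(t_{i-1}, t_i)
emission_counts = Counter((tag, word) for word, tag in corpus)  # C(t_i, w_i)

def transition_prob(prev_tag, tag):
    """Equation 2: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    return bigram_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def emission_prob(word, tag):
    """Equation 3: P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    return emission_counts[(tag, word)] / tag_counts[tag]

print(transition_prob("DT", "NN"))    # 1.0 in this toy corpus
print(emission_prob("object", "NN"))  # 1.0 in this toy corpus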

The computation of \hat{t}_1^n (equation 1) over all possible tags t_i and all possible words w_j is exponential if done naively. Formalized as a Hidden Markov Model, \hat{t}_1^n can be calculated in linear time.

4.2 Hidden Markov Models [1]

To calculate the most probable sequence of tags \hat{t}_1^n we use a Hidden Markov Model (HMM). It is called "hidden" because only the observations are given; the states (the tags, in our case) are not known and cannot be determined directly, because equation 1 depends on the tags of previous words. A Hidden Markov Model consists of the tuple

M = (Q, A, B, S, E)    (4)

Q is the set of states q_1, q_2, ..., q_n. Here the states are all possible tags.

A is the set of transition probabilities a_{01}, a_{02}, ..., a_{n1}, ..., a_{nn}, where a_{ij} is the probability of a transition from state q_i to state q_j. This set can be written as a transition probability matrix and corresponds to the prior P(t_i | t_{i-1}): each element of the matrix gives the probability of tag j following tag i.

B is the set of observation likelihoods b_i(o_t), the probability of an observation o_t being generated from state i. B can also be written as a matrix and corresponds to the likelihood P(w_i | t_i): each element gives the probability of word w_i having tag t_i.

S is the start state and E the end state. Further, there is a sequence of observation symbols O, which are the words w_1, ..., w_n. Given this sequence of observation symbols, the most likely tag sequence (sequence of state transitions) has to be found.

The Viterbi algorithm is the most common decoding algorithm for HMMs; it returns the most likely tag sequence given a set of states, transition probabilities and observation likelihoods.

Algorithm 1 Viterbi
    num_states <- number of tags
    create path probability matrix viterbi[num_states + 2, n + 2]
    viterbi[0, 0] <- 1.0
    for word w = 1 to n do
        for tag t = 1 to num_states do
            viterbi[t, w] <- max_{1 <= t' <= num_states} ( viterbi[t', w-1] * a_{t',t} ) * b_t(o_w)
            back_pointer[t, w] <- argmax_{1 <= t' <= num_states} ( viterbi[t', w-1] * a_{t',t} )
    backtrace from the highest-probability state in the final column of viterbi and return the path
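The following runnable Python sketch implements Algorithm 1; the toy transition and emission tables at the top are hypothetical, whereas a real tagger would estimate them from a corpus as in equations 2 and 3.

# A runnable sketch of Algorithm 1, assuming the toy probability tables
# below (hypothetical values, not estimated from a real corpus).
TAGS = ["DT", "NN", "VB"]
TRANS = {   # a[t', t]: P(t | t'), with "<s>" as the start state
    ("<s>", "DT"): 0.8, ("<s>", "NN"): 0.1, ("<s>", "VB"): 0.1,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
    ("NN", "VB"): 0.5, ("NN", "NN"): 0.5,
    ("VB", "DT"): 0.6, ("VB", "NN"): 0.4,
}
EMIT = {    # b_t(w): P(w | t)
    ("DT", "the"): 0.7, ("NN", "object"): 0.01, ("VB", "object"): 0.005,
}

def viterbi(words):
    """Return the most likely tag sequence for `words`."""
    # trellis[i] maps each tag to (best path probability, backpointer tag)
    trellis = [{t: (TRANS.get(("<s>", t), 0.0) * EMIT.get((t, words[0]), 0.0), None)
                for t in TAGS}]
    for w in words[1:]:
        column = {}
        for t in TAGS:
            # best previous tag t' maximizing viterbi[t', w-1] * a[t', t]
            best_prev = max(TAGS, key=lambda p: trellis[-1][p][0] * TRANS.get((p, t), 0.0))
            score = trellis[-1][best_prev][0] * TRANS.get((best_prev, t), 0.0)
            column[t] = (score * EMIT.get((t, w), 0.0), best_prev)
        trellis.append(column)
    # backtrace from the highest-probability state in the final column
    tag = max(TAGS, key=lambda t: trellis[-1][t][0])
    path = [tag]
    for column in reversed(trellis[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "object"]))  # -> ['DT', 'NN']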

Algorithm 1 works as follows:

1. A path probability matrix is created. A start and end tag and a start and end observation are added.

2. The cell with the start tag and the start observation (viterbi[0, 0]) is initialized with 1. All other values in the first column are 0.

3. The algorithm then moves from column to column,

4. filling every row of the current column with the maximum, over all previous tags t', of the previous probability of t' times the transition probability from t' to the current tag t. This maximum is then multiplied with the observation likelihood of the current word o given the current tag t.

5. A pointer is also remembered, to backtrace the path of maximum probabilities later.

6. Steps 3-5 are repeated until the last column is reached.

When the path is traced back from the largest value in the last column to the first column, the sequence of most probable tags can be reconstructed.

Stochastic taggers can easily be adapted to new languages and new corpora, simply by re-training the model. On the other hand, stochastic taggers can be hard to program.

5 Hybrid taggers

Both rule-based and stochastic taggers are available, but there are also taggers that combine both methods. CLAWS4 is such a hybrid tagger [6]. In a first step, CLAWS4 uses a sequence of eight tests to assign initial tags. These tests also include rules for spoken English, multipart words, tags for specific word endings, and words containing hyphens. After the first step each word has one or more tags, and in the second step an HMM eliminates all but one tag. [6] claims an accuracy of 96-97 percent.
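As a toy illustration of this two-step idea, the following Python sketch first proposes candidate tags with a few hand-written tests and then disambiguates with bigram probabilities; the tests and probabilities are hypothetical and are not CLAWS4's actual rules.

# A minimal sketch of a hybrid tagger: a rule step proposes candidates,
# a stochastic step picks the most probable one (toy data throughout).
def rule_step(word):
    """Toy initial-tagging tests, loosely in the spirit of a rule-based first pass."""
    if word[0].isupper():
        return ["NNP"]                 # capitalized -> proper noun
    if word.endswith("ing"):
        return ["VBG", "NN"]           # -ing: gerund or noun ("running", "meeting")
    if word.endswith("ly"):
        return ["RB"]                  # -ly: adverb
    return ["NN", "VB", "JJ"]          # otherwise: still ambiguous

def stochastic_step(candidates, prev_tag, trans):
    """Toy disambiguation: keep the candidate with the highest P(tag | prev_tag).
    A full tagger would instead run Viterbi over the whole sentence (section 4.2)."""
    return max(candidates, key=lambda t: trans.get((prev_tag, t), 0.0))

TRANS = {("DT", "NN"): 0.9, ("DT", "VBG"): 0.05, ("DT", "JJ"): 0.3}
print(stochastic_step(rule_step("meeting"), "DT", TRANS))  # -> 'NN'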

6 Problems in POS tagging

Unknown words. The largest problem for POS taggers are unknown words. As mentioned earlier, these can only carry open-class tags, but that is still a fair number of tags to choose from. Several approaches are available for tagging unknown words (a small guesser sketch closes this section):

- Guess the tag from the context. Depending on the previous and following tags, a tag for the word is chosen. This implies that the word is initially ambiguous among all possible tags with equal probability.

- Some algorithms use the morphology of the word to guess its tag. Morphology means that the prefix and suffix of a word are analyzed (e.g. capitalization suggests that the word is likely a proper noun). Other algorithms consider the probability of all suffixes of all words in the training corpus.

- Other algorithms use a combination of all features of a word (prefixes, suffixes, capitalization and context) together with a linear regression model, or another classifier, to predict a tag for the unknown word.

Spelling errors. So far, only text without spelling errors has been considered. But what if we start to tag web pages or magazines that contain many typing and spelling errors? Correcting the text beforehand is crucial, and often the correction is as ambiguous as assigning a tag to a word. For example, the misspelled word "acress" has at least six candidate corrections: actress, cress, caress, access, across, acres. Choosing an unintended candidate can change the tags of a whole sequence of words.

Tokenization. In section 2.2 the first step of tagging is described as tokenization. It depends on the tagging algorithm, but starting with wrong tokens can also change the sequence of tags. For example, some algorithms split contractions and the s-genitive from the word stems (e.g. "children's"), or separate proper nouns like "New York".
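The following Python sketch illustrates the morphology-based guessing strategy described above; the suffix statistics and priors are hypothetical, whereas a real tagger would estimate P(tag | suffix) from the training corpus.

# A minimal sketch of suffix-based guessing for unknown words,
# assuming hypothetical suffix statistics.
SUFFIX_TAG_PROBS = {   # P(tag | suffix), toy values
    "ness": {"NN": 0.95, "JJ": 0.05},
    "ize":  {"VB": 0.9, "NN": 0.1},
    "ous":  {"JJ": 0.98, "NN": 0.02},
}
OPEN_CLASS_PRIOR = {"NN": 0.4, "VB": 0.25, "JJ": 0.2, "RB": 0.15}

def guess_tag(word):
    """Guess an open-class tag for an unknown word from its shape."""
    if word[0].isupper():
        return "NNP"                       # capitalization suggests a proper noun
    for suffix, probs in SUFFIX_TAG_PROBS.items():
        if word.endswith(suffix):
            return max(probs, key=probs.get)
    # no informative suffix: fall back to the most frequent open-class tag
    return max(OPEN_CLASS_PRIOR, key=OPEN_CLASS_PRIOR.get)

print(guess_tag("grumpiness"))  # -> 'NN'
print(guess_tag("Fischl"))      # -> 'NNP'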

7 Conclusion

In many classification tasks one would be very happy to achieve results near the 96% reached by the tagging algorithms presented in this paper and in [1]. The question therefore arises: is part-of-speech tagging a solved problem?

Looking at the results of Marcus et al. [3], for example, who found that human annotators agreed on only 96% of the cases, the answer would be yes, since the remaining errors can be attributed to ambiguity in the language itself. Some special cases may not be clearly defined and therefore lead to different tags. Accuracy can also be limited by errors in the training data.

On the other hand, we have only looked at algorithms for the English language. Although there are taggers for other languages, and some taggers can easily be modified for a language other than English, not every language is as amenable to the methods used for English. Further research could investigate how to tag different languages with a single tagger; with such a tagger it might be possible to find structures common to all languages, or some languages might turn out not to be taggable this way. Even if an accuracy of 100% is not attainable, finding such a tagger would be a challenging task.

The computational linguistics community and research in part-of-speech tagging have produced many algorithms that are now used in many different fields (e.g. Hidden Markov Models). The accuracy of any part-of-speech tagger depends for the most part on a well-tagged training corpus. As we have seen, current algorithms perform very well at part-of-speech tagging. Taggers for other languages and other research domains can benefit from the ongoing research in part-of-speech tagging.

References

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing, ch. 5: Part-of-Speech Tagging, pp. 1-52. Prentice Hall, 2006.

[2] A. Voutilainen, The Oxford Handbook of Computational Linguistics, ch. 11: Part-of-Speech Tagging, pp. 219-232. Oxford University Press, 2005.

[3] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.

[4] B. B. Greene and G. M. Rubin, "Automatic grammatical tagging of English," tech. rep., Department of Linguistics, Brown University, 1971.

[5] E. Brill, "A simple rule-based part of speech tagger," in HLT '91: Proceedings of the Workshop on Speech and Natural Language, (Morristown, NJ, USA), pp. 112-116, Association for Computational Linguistics, 1992.

[6] R. Garside and N. Smith, "A hybrid grammatical tagger: CLAWS4," in Corpus Annotation: Linguistic Information from Computer Text Corpora, pp. 102-121. Longman, 1997.