Part of Speech Tagging - A solved problem?

Wolfgang Fischl - 0602106

December 8, 2009

Abstract

Since around 100 B.C. humans have been aware that language consists of several distinct classes of words, called parts-of-speech. Identifying these parts-of-speech plays a crucial role in many fields of linguistics. Since TAGGIT, the first large-scale part-of-speech tagger, many algorithms and methods have been developed, including rule-based, probabilistic and hybrid taggers. Tagging large text corpora raises several problems, and finally the question is asked whether part-of-speech tagging is a solved problem.

1 Introduction

Parts-of-speech (also known as: POS, word classes, morphological classes, or lexical tags) play an important role in nearly every human language. What are parts-of-speech? Already around 100 B.C. eight parts-of-speech were identified in the Greek language: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article. These eight are the basis for most part-of-speech descriptions and are also included in the more recent lists of parts-of-speech (see Table 1).

    Name            Year of appearance    No. of classes
    Brown Corpus    1979                  87
    Penn Treebank   1993                  45
    C5 tagset       1997                  61
    C7 tagset       1997                  146

    Table 1: Recent part-of-speech tagsets

The POS tag of a word is influenced not only by the word itself, but also by the word's neighbors. In fact, one word can have different POS tags depending on its context. For example, the word "object" can be used either as a noun (the object) or as a verb (to object). Knowing the exact POS can be very helpful in many language processing applications. The POS of the word "object" changes its pronunciation (as a noun the first syllable is stressed, as a verb the second) [1]. Other applications include text indexing and the linguistic analysis of large tagged text corpora [2]. Part-of-speech tagging is the automatic assignment of such tags to entire texts.
The goal of this paper is to understand part-of-speech tags, the methods shared by part-of-speech tagging algorithms, some concrete part-of-speech taggers and their problems, and to explore whether part-of-speech tagging is an already solved computational problem.
2 Part-of-speech tagging

2.1 Tags in English

The eight part-of-speech tags identified around 100 B.C. are just a broad description of the English language. In practice there is interest in identifying more specific POS tags. The smallest tagset in common use today has 45 tags, the largest 146 (see Table 1). What, then, are the 45 tags of the Penn Treebank tagset? The nouns, for example, are separated into four classes: singular or mass nouns (llama, snow); plural nouns (llamas); proper nouns, singular (IBM); and proper nouns, plural (Austrians). The verbs consist of six classes: base form (eat), past tense (ate), gerund (eating), past participle (eaten), non-3sg present (eat), and 3sg present (eats). A detailed list of the Penn Treebank tags can be found in [3].

The classes can be categorized into open and closed classes. Closed classes have a fixed number of members; for example, articles are a closed class (in English the only articles are: the, a and an). Open classes don't have a fixed number of members, because new words are occasionally coined or borrowed from other languages. In the English language there are four open classes: nouns, verbs, adjectives and adverbs [1].

2.2 The tagging process

The input for every tagging algorithm is a string of words and a specified tagset. The output is a single best tag for each word. Most algorithms have three steps in common to find the best tag ([1], [2]):

1. Tokenization is part of most tagging algorithms or is done as a preprocessing step. The text is divided into tokens, including end-of-sentence punctuation marks and word-like units.

2. Ambiguity look-up uses a lexicon and the tagset to assign each word a list of possible part-of-speech tags. If a word is not in the lexicon, a guesser can make assumptions using the word's neighborhood. The guesser only needs to consider the open-class tag types, because most lexicons contain all closed-class words.

3. Disambiguation is the last and also the hardest of the three steps.
After step 2 every word has multiple possible tags. Disambiguation now tries to eliminate all but one tag. There are two classes of methods that can be used for disambiguation. Rule-based taggers try to eliminate tags by applying specific rules; one such tagger is described in section 3. Stochastic taggers use a tagged training set to compute the probability of a given word having a given tag in a specific context; the tag with the highest probability is chosen. A stochastic tagger is described in section 4. Some methods use a combination of both; one such method is described in section 5.
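Steps 1 and 2 can be sketched in a few lines of Python. The toy lexicon below is invented for illustration (the tags DT, NN, VB, VBZ, JJ are real Penn Treebank tags, but the word list is only a stand-in for a full lexicon):

```python
import re

# Hypothetical toy lexicon mapping words to their possible Penn Treebank tags.
LEXICON = {
    "the": ["DT"],
    "object": ["NN", "VB"],   # ambiguous: noun or verb
    "is": ["VBZ"],
    "red": ["JJ"],
    ".": ["."],
}
# Unknown words can only belong to open classes, so the guesser
# falls back to the open-class tags.
OPEN_CLASS_TAGS = ["NN", "VB", "JJ", "RB"]

def tokenize(text):
    """Step 1: split the text into word-like units and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ambiguity_lookup(tokens):
    """Step 2: assign each token its list of candidate tags."""
    return [(t, LEXICON.get(t, OPEN_CLASS_TAGS)) for t in tokens]

print(ambiguity_lookup(tokenize("The object is red.")))
# "object" comes back with two candidate tags; step 3 must pick one.
```

Disambiguation (step 3) then has to reduce each candidate list to a single tag, which is what the taggers in the following sections do.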
3 Rule-based POS taggers

One of the first POS taggers, called TAGGIT, was proposed in 1971 [4]. It was based on context-pattern rules and used a 71-item tagset and a disambiguation grammar of 3,300 rules [1]. TAGGIT tagged 77 per cent of the words in the million-word Brown University corpus correctly. Because all rules need to be hand-written, such a rule-based tagger requires lots of work and knowledge of the language. In 1992 a new tagger was proposed by Brill, which learns the rules itself [5].

The motivation behind the simple rule-based part-of-speech tagger from Eric Brill is that stochastic or probabilistic taggers (see section 4) achieved substantially better results in tagging, while earlier rule-based taggers needed a wide variety of rules and much knowledge of the language. Therefore he created a tagger which generates rules from a set of templates and a training text. All words are automatically assigned an initial tag with a basic lexical tagger. Then patches are generated out of eight patch templates (excerpted from [5]):

Change tag a to tag b when:

1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
7. The current word is (is not) capitalized.
8. The previous word is (is not) capitalized.

For each error triple <tag a, tag b, number> and patch, the reduction in error is computed. The patch with the highest reduction in error is added to the list of patches. The patch acquisition procedure continues until a specific threshold is reached (e.g. no more reduction in error). A new text is tagged by first assigning initial tags with the same lexical tagger with which the training corpus has been tagged.
Then each patch from the list of patches is applied, hopefully decreasing the error rate. [5] showed that an accuracy of 95%-99% is possible. Furthermore, only 71 patches were needed to achieve these high accuracies, which showed that rule-based taggers can compete with stochastic methods. This tagger is very easy to understand and can be used with many different text genres, simply by training a tagger for each specific genre. This kind of tagging is sometimes also called transformation-based tagging.
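The two stages of transformation-based tagging, initial assignment followed by patch application, can be sketched as follows. The lexicon and the single patch rule are invented examples in the shape of template 1 above, not patches actually learned by Brill:

```python
# Toy most-frequent-tag lexicon for the initial lexical tagger (invented).
INITIAL_TAGS = {"to": "TO", "object": "NN", "i": "PRP", "the": "DT"}

# Patches in the shape of template 1: (from_tag, to_tag, condition on neighbors).
PATCHES = [
    # change NN to VB when the preceding word is tagged TO
    ("NN", "VB", lambda prev, nxt: prev == "TO"),
]

def tag(words):
    # Stage 1: assign each word its most frequent tag (NN as open-class default).
    tags = [INITIAL_TAGS.get(w, "NN") for w in words]
    # Stage 2: apply each patch, in order, over the whole sequence.
    for a, b, cond in PATCHES:
        for i, t in enumerate(tags):
            prev = tags[i - 1] if i > 0 else None
            nxt = tags[i + 1] if i + 1 < len(tags) else None
            if t == a and cond(prev, nxt):
                tags[i] = b
    return tags

print(tag(["the", "object"]))   # ['DT', 'NN']  -- patch does not fire
print(tag(["to", "object"]))    # ['TO', 'VB']  -- patch changes NN to VB
```

Each learned patch corrects errors left by the initial tagger, which is why a list of only 71 patches can already be competitive.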
The advantages of rule-based taggers are that rules can be hand-written and easily comprehended, and that rule-based taggers can achieve good results with just a few rules. The disadvantages are that the rules are language- and corpus-specific and that programming a really good tagger takes a large amount of work and lots of linguistic knowledge.

4 Stochastic POS taggers

Stochastic taggers make use of probabilities. The goal is to find the most probable sequence of tags for a given sequence of words. This can be modeled as a special case of Bayesian inference, using a Hidden Markov Model.

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})    (1)

\hat{t}_1^n in eqn. 1 is the most probable sequence of tags. It can be calculated with this formula, although several assumptions are made:

1. The probability P(t_1^n | w_1^n) of a sequence of words w_1^n receiving the sequence of tags t_1^n can be transformed with Bayes' rule to P(w_1^n | t_1^n) P(t_1^n) / P(w_1^n). The denominator can be dropped, since the probability of the word sequence w_1^n doesn't change for a different tag sequence. This leaves us with a prior P(t_1^n) and a likelihood P(w_1^n | t_1^n). The likelihood is the probability of seeing a sequence of words given a sequence of tags.

2. The words are independent of each other. Therefore P(w_1^n | t_1^n) \approx \prod_{i=1}^{n} P(w_i | t_i).

3. The probability of a tag depends only on its predecessor: P(t_1^n) \approx \prod_{i=1}^{n} P(t_i | t_{i-1}). Such taggers are called bigram taggers. The algorithm can be improved by using trigrams (a tag depends on its two predecessors); a combination of bigram and trigram probabilities is also possible.

Combining all assumptions gives us eqn. 1. [1]

4.1 Calculating the probabilities

The two probabilities in eqn. 1 are unknown, but they are easy to train. An estimate of the probabilities can be calculated from an annotated corpus, e.g. the 1-million-word Brown corpus.
To calculate an estimate of the probability of a tag t_i following a tag t_{i-1}, the ratio of counts in eqn. 2 is computed for every pair <t_{i-1}, t_i> in the corpus:

    P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}    (2)

An estimate of the word likelihoods P(w_i | t_i) can also be calculated from a ratio of counts (eqn. 3):

    P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}    (3)
The computation of \hat{t}_1^n (eqn. 1) over all possible tags t_i and all possible words w_j is exponential. Formalized as a Hidden Markov Model, \hat{t}_1^n can be calculated in linear time.

4.2 Hidden Markov Models [1]

To calculate the most probable sequence of tags \hat{t}_1^n we use a Hidden Markov Model. It is called hidden because only the observations are given; the states (the tags, in our case) are not known and can't be determined directly, because equation (1) depends on the tags of previous words. A Hidden Markov Model consists of the tuple

    M = (Q, A, B, S, E)    (4)

Q is the set of states q_1, q_2, ..., q_n. Here the states are all possible tags.

A is the set of transition probabilities a_01, a_02, ..., a_n1, ..., a_nn, where a_ij represents the transition from state q_i to state q_j. This set can be written as a transition probability matrix and corresponds to the prior P(t_i | t_{i-1}); each element in the matrix is the probability of tag j following tag i.

B is the set of observation likelihoods b_i(o_t), the probability of an observation o_t being generated from state i. The observation likelihoods B can also be written as a matrix and correspond to the likelihood P(w_i | t_i); each element in the matrix is the probability of a word w_i given tag t_i.

S is the start state and E the end state.

Further, there is a sequence of observation symbols O, which are the words w_1, ..., w_n. The most likely tag sequence has to be found given this sequence of observation symbols. The Viterbi algorithm is the most common decoding algorithm for HMMs; it gives the most likely tag sequence given a set of states, transition probabilities and observation likelihoods.
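A minimal runnable version of Viterbi decoding over such an HMM might look as follows. All tags and probabilities are invented for illustration, and a start distribution stands in for the explicit start state S:

```python
TAGS = ["DT", "NN"]
START = {"DT": 0.9, "NN": 0.1}               # P(tag at position 1)
A = {("DT", "DT"): 0.1, ("DT", "NN"): 0.9,   # transition probabilities a_ij
     ("NN", "DT"): 0.4, ("NN", "NN"): 0.6}
B = {("the", "DT"): 0.7, ("the", "NN"): 0.0, # observation likelihoods b_i(o_t)
     ("dog", "DT"): 0.0, ("dog", "NN"): 0.5}

def viterbi(words):
    # Column for the first word, initialised from the start distribution.
    v = [{t: START[t] * B[(words[0], t)] for t in TAGS}]
    back = [{}]
    for w in words[1:]:                       # move column to column
        col, ptr = {}, {}
        for t in TAGS:
            # Best previous tag t' maximising viterbi[t', w-1] * a_{t',t}.
            best = max(TAGS, key=lambda tp: v[-1][tp] * A[(tp, t)])
            col[t] = v[-1][best] * A[(best, t)] * B[(w, t)]
            ptr[t] = best                     # back pointer for the backtrace
        v.append(col)
        back.append(ptr)
    # Backtrace from the highest-probability state in the final column.
    last = max(TAGS, key=lambda t: v[-1][t])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog"]))   # ['DT', 'NN']
```

Each column is filled from the previous one in constant time per tag pair, which is why decoding is linear in the length of the sentence.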
Algorithm 1 Viterbi
 1: num_states <- #tags
 2: create path probability matrix viterbi[num_states + 2, n + 2]
 3: viterbi[0, 0] <- 1.0
 4: for word w = 1 to n do
 5:   for tag t = 1 to num_states do
 6:     viterbi[t, w] <- max_{1 <= t' <= num_states} ( viterbi[t', w-1] * a_{t',t} ) * b_t(o_w)
 7:     back_pointer[t, w] <- argmax_{1 <= t' <= num_states} ( viterbi[t', w-1] * a_{t',t} )
 8:   end for
 9: end for
10: backtrace from the highest-probability state in the final column of viterbi[] and return the path

Algorithm 1 works as follows:
1. A path probability matrix is created. A start and end tag as well as a start and end observation are added.

2. The cell with the start tag and the start observation (viterbi[0, 0]) is initialized with 1. All other values in the first column are 0.

3. The algorithm then moves from column to column.

4. Every row in the current column is filled with the maximum, over all previous tags t', of the previous probability of t' times the transition probability from t' to the current tag t. This maximum is then multiplied with the observation likelihood of the current word o given the current tag t.

5. A back pointer is also stored, to backtrace the path of maximum probabilities.

6. Steps 3-5 are repeated until the last column is reached.

When the path is backtraced from the largest value in the last column to the first column, the sequence of most probable tags can be reconstructed. Stochastic taggers can be easily adapted to new languages and new corpora simply by re-training the model. On the other hand, stochastic taggers can be hard to program.

5 Hybrid Taggers

Both rule-based and stochastic taggers are available, but there are also taggers that combine both methods. CLAWS4 is such a hybrid tagger [6]. In a first step, CLAWS4 uses a sequence of eight tests to assign initial tags. These tests also include rules for spoken English, multipart words, tags for specific word endings and words containing hyphens. After the first step each word has one or more tags, and in the second step an HMM eliminates all but one tag. [6] claims an accuracy of 96-97 percent.

6 Problems in POS Tagging

Unknown words. The largest problem for POS taggers are unknown words. As mentioned earlier, these can only appear with open-class types, but that is still a fair number of tags to choose from. Several approaches are available to tag unknown words. The simplest is to guess the tag from the context: depending on the previous and following tags, a tag for the word is chosen.
This treats the unknown word itself as ambiguous among all possible tags with equal probability, so that only the context decides. Some algorithms use the morphology of the word to guess its tag; morphology here means that the prefix and suffix of the word are analyzed (e.g. capitalization suggests that the word is likely to be a proper noun). Other algorithms consider the probabilities of all the suffixes of all words in the training corpus.
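A suffix-based guesser along these lines can be sketched as follows: collect tag frequencies per suffix from the training corpus, then back off from the longest matching suffix. The tiny training list is invented for illustration:

```python
from collections import Counter, defaultdict

# Invented toy training data of (word, tag) pairs.
training = [("running", "VBG"), ("eating", "VBG"), ("quickly", "RB"),
            ("happily", "RB"), ("dogs", "NNS")]

# Tag frequencies for every suffix of length 1..3 seen in training.
suffix_tags = defaultdict(Counter)
for word, tag in training:
    for k in range(1, 4):
        suffix_tags[word[-k:]][tag] += 1

def guess_tag(word):
    """Back off from the longest suffix with observed tag counts."""
    for k in (3, 2, 1):
        counts = suffix_tags.get(word[-k:])
        if counts:
            return counts.most_common(1)[0][0]
    return "NN"   # open-class default when no suffix matches

print(guess_tag("jogging"))   # 'VBG' via the suffix "ing"
print(guess_tag("slowly"))    # 'RB' via the suffix "ly"
```

Real suffix guessers use longer suffixes and smoothed probabilities rather than raw counts, but the back-off idea is the same.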
Other algorithms use a combination of all features of a word (prefixes, suffixes, capitalization and context) together with a regression model, or another classification method, to predict a tag for the unknown word.

Spelling errors. So far only texts without spelling errors have been considered. But what if we start to tag web pages or magazines that contain many typing and spelling errors? Correcting these texts beforehand is crucial, and often the correction is as ambiguous as assigning a tag to a word. For example, the misspelled word "acress" has at least six candidate corrections: actress, cress, caress, access, across, acres. Choosing the wrong candidate can change the tags of a whole sequence of words.

Tokenization. In section 2.2 the first step of tagging was described as tokenization. It depends on the tagging algorithm, but starting with wrong tokens can also change the sequence of tags. For example, some algorithms split contractions and the s-genitive from the word stems (e.g. children's) or treat proper nouns like New York as a single token.

7 Conclusion

In many classification tasks one would be very happy to achieve results near the 96% reached by the tagging algorithms presented in this paper and in [1]. Therefore the question arises: is part-of-speech tagging a solved problem? If one looks at the results of Marcus et al. [3], for example, who found that human annotators agreed in just 96% of the cases, the answer would be yes, since the remaining errors can stem from ambiguity in the language itself. Some special cases may not be clearly defined and therefore lead to different tags. The accuracy can also be limited by errors in the training data. On the other hand, we have only looked at algorithms for the English language.
Although there are taggers for different languages, and some taggers can easily be modified for a language other than English, not every language is as suitable for tagging with the methods used for English. Further research could investigate how to tag different languages with just one tagger; with such a tagger it might be possible to find common structures in all languages. Or some languages may turn out not to be taggable. Although achieving an accuracy of 100% might not even be possible, finding such a tagger would be a challenging task. The computational linguistics community and research in part-of-speech tagging have produced many algorithms that are now used in many different fields (e.g. Hidden Markov Models). The accuracy of any part-of-speech tagger depends for the most part on a well-tagged training corpus. As we have seen, current algorithms perform very well in part-of-speech tagging. Taggers for other languages and other research domains can benefit from the ongoing research in part-of-speech tagging.
References

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing, ch. 5: Part-of-Speech Tagging, pp. 1-52. Prentice Hall, 2006.

[2] A. Voutilainen, "Part-of-Speech Tagging," in The Oxford Handbook of Computational Linguistics, ch. 11, pp. 219-232. Oxford University Press, 2005.

[3] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.

[4] B. B. Greene and G. M. Rubin, "Automatic grammatical tagging of English," tech. rep., Department of Linguistics, Brown University, 1971.

[5] E. Brill, "A simple rule-based part of speech tagger," in HLT '91: Proceedings of the Workshop on Speech and Natural Language, Morristown, NJ, USA, pp. 112-116, Association for Computational Linguistics, 1992.

[6] R. Garside and N. Smith, "A hybrid grammatical tagger: CLAWS4," in Corpus Annotation: Linguistic Information from Computer Text Corpora, pp. 102-121, Longman, 1997.