Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger

Kristina Toutanova
Dept of Computer Science
Gates Bldg 4A, 353 Serra Mall
Stanford, CA, USA
kristina@cs.stanford.edu

Christopher D. Manning
Depts of Computer Science and Linguistics
Gates Bldg 4A, 353 Serra Mall
Stanford, CA, USA
manning@cs.stanford.edu

Abstract

This paper presents results for a maximum-entropy-based part of speech tagger, which achieves superior performance principally by enriching the information sources used for tagging. In particular, we get improved results by incorporating these features: (i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs. The best resulting accuracy for the tagger on the Penn Treebank is 96.86% overall, and 86.91% on previously unseen words.

Introduction [1]

There are now numerous systems for automatic assignment of parts of speech ("tagging"), employing many different machine learning methods. Among recent top performing methods are Hidden Markov Models (Brants 2000), maximum entropy approaches (Ratnaparkhi 1996), and transformation-based learning (Brill 1994). An overview of these and other approaches can be found in Manning and Schütze (1999, ch. 10). However, all these methods use largely the same information sources for tagging, and often almost the same features as well, and as a consequence they also offer very similar levels of performance. This stands in contrast to the (manually-built) EngCG tagger, which achieves better performance by using lexical and contextual information sources and generalizations beyond those available to such statistical taggers, as Samuelsson and Voutilainen (1997) demonstrate.

This paper explores the notion that automatically built tagger performance can be further improved by expanding the knowledge sources available to the tagger.

[1] We thank Dan Klein and Michael Saunders for useful discussions, and the anonymous reviewers for many helpful comments.
We pay special attention to unknown words, because the markedly lower accuracy on unknown word tagging means that this is an area where significant performance gains seem possible. We adopt a maximum entropy approach because it allows the inclusion of diverse sources of information without causing fragmentation and without necessarily assuming independence between the predictors. A maximum entropy approach has been applied to part-of-speech tagging before (Ratnaparkhi 1996), but the approach's ability to incorporate non-local and non-HMM-tagger-type evidence has not been fully explored. This paper describes the models that we developed and the experiments we performed to evaluate them.

1 The Baseline Maximum Entropy Model

We started with a maximum entropy based tagger that uses features very similar to the ones proposed in Ratnaparkhi (1996). The tagger learns a loglinear conditional probability model from tagged text, using a maximum entropy method. The model assigns a probability for every tag $t$ in the set $T$ of possible tags given a word and its context $h$, which is usually defined as the sequence of several words and tags preceding the word. This model can be used for estimating the probability of a tag sequence $t_1 \ldots t_n$ given a sentence $w_1 \ldots w_n$:

\[ p(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} p(t_i \mid t_1 \ldots t_{i-1}, w_1 \ldots w_n) \approx \prod_{i=1}^{n} p(t_i \mid h_i) \]

As usual, tagging is the process of assigning the maximum likelihood tag sequence to a string of words. The idea of maximum entropy modeling is to choose the probability distribution $p$ that has the highest entropy out of those distributions that satisfy a certain set of constraints. The constraints restrict the model to behave in accordance with a set of statistics collected from the training data. The statistics are expressed as the expected values of appropriate functions defined on the contexts $h$ and tags $t$. In particular, the constraints demand that the expectations of the features for the model match the empirical expectations of the features over the training data. For example, if we want to constrain the model to tag make as a verb or noun with the same frequency as the empirical model induced by the training data, we define the features:

\[ f_1(h, t) = 1 \ \text{iff}\ w_i = \textit{make} \ \text{and}\ t = \mathrm{NN} \]
\[ f_2(h, t) = 1 \ \text{iff}\ w_i = \textit{make} \ \text{and}\ t = \mathrm{VB} \]

Some commonly used statistics for part of speech tagging are: how often a certain word was tagged in a certain way; how often two tags appeared in sequence or how often three tags appeared in sequence. These look a lot like the statistics a Markov Model would use. However, in the maximum entropy framework it is possible to easily define and incorporate much more complex statistics, not restricted to n-gram sequences.

The constraints in our model are that the expectations of these features according to the joint distribution $p$ are equal to the expectations of the features in the empirical (training data) distribution $\tilde{p}$:

\[ E_{p(h,t)} f_j(h, t) = E_{\tilde{p}(h,t)} f_j(h, t) \]

Having defined a set of constraints that our model should accord with, we proceed to find the model satisfying the constraints that maximizes the conditional entropy of $p$. The intuition is that such a model assumes nothing apart from that it should satisfy the given constraints.

Following Berger et al. (1996), we approximate $p(h, t)$, the joint distribution of contexts and tags, by the product of $\tilde{p}(h)$, the empirical distribution of histories $h$, and the conditional distribution $p(t \mid h)$:

\[ p(h, t) \approx \tilde{p}(h) \cdot p(t \mid h) \]

Then for the example above, our constraints would be the following, for $j \in \{1, 2\}$:

\[ \sum_{h \in H,\, t \in T} \tilde{p}(h, t)\, f_j(h, t) = \sum_{h \in H,\, t \in T} \tilde{p}(h)\, p(t \mid h)\, f_j(h, t) \]

This approximation is used to enable efficient computation.
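As a minimal sketch of the two example features and of the empirical side of the constraint, the following Python fragment may help. The dictionary-based history representation and the four-token sample are invented for illustration and are not taken from the paper's data or implementation.

```python
def f1(history, tag):
    """Indicator feature: 1 iff the current word is 'make' and the tag is NN."""
    return 1 if history["word"] == "make" and tag == "NN" else 0


def f2(history, tag):
    """Indicator feature: 1 iff the current word is 'make' and the tag is VB."""
    return 1 if history["word"] == "make" and tag == "VB" else 0


def empirical_expectation(feature, samples):
    """Expectation of a feature under the empirical distribution p~(h, t).

    Each (history, tag) pair in the training sample has weight 1/len(samples).
    """
    return sum(feature(h, t) for h, t in samples) / len(samples)


# Invented toy sample of (history, tag) pairs, for illustration only.
TOY_SAMPLES = [
    ({"word": "make"}, "VB"),
    ({"word": "make"}, "VB"),
    ({"word": "make"}, "NN"),
    ({"word": "the"}, "DT"),
]

print(empirical_expectation(f1, TOY_SAMPLES))  # 0.25
print(empirical_expectation(f2, TOY_SAMPLES))  # 0.5
```

On this toy sample the constraints would require the model to put expected mass 0.25 on (make, NN) and 0.5 on (make, VB).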
The expectation for a feature $f$ is:

\[ E f = \sum_{h \in H,\, t \in T} \tilde{p}(h)\, p(t \mid h)\, f(h, t) \]

where $H$ is the space of possible contexts $h$ when predicting a part of speech tag $t$. Since the contexts contain sequences of words and tags and other information, the space $H$ is huge. But using this approximation, we can instead sum just over the smaller space of observed contexts $X$ in the training sample, because the empirical prior $\tilde{p}(h)$ is zero for unseen contexts $h$:

\[ E f = \sum_{h \in X,\, t \in T} \tilde{p}(h)\, p(t \mid h)\, f(h, t) \qquad (1) \]

The model that is a solution to this constrained optimization task is an exponential (or equivalently, loglinear) model with the parametric form:

\[ p(t \mid h) = \frac{\exp\left( \sum_{j=1}^{K} \lambda_j f_j(h, t) \right)}{\sum_{t' \in T} \exp\left( \sum_{j=1}^{K} \lambda_j f_j(h, t') \right)} \]

where the denominator is a normalizing term (sometimes referred to as the partition function). The parameters $\lambda_j$ correspond to weights for the features $f_j$.

We will not discuss in detail the characteristics of the model or the parameter estimation procedure used, Improved Iterative Scaling. For a more extensive discussion of maximum entropy methods, see Berger et al. (1996) and Jelinek (1997). However, we note that our parameter estimation algorithm directly uses equation (1). Ratnaparkhi (1996: 134) suggests use of an approximation summing over the training data, which does not sum over possible tags:

\[ E f_j \approx \sum_{i=1}^{n} \tilde{p}(h_i)\, p(t_i \mid h_i)\, f_j(h_i, t_i) \]

However, we believe this passage is in error: such an estimate is ineffective in the iterative scaling algorithm. Further, we note that expectations of the form (1) appear in Ratnaparkhi (1998: 12).

1.1 Features in the Baseline Model

In our baseline model, the context available when predicting the part of speech tag of a word $w_i$ in a sentence of words $\{w_1 \ldots w_n\}$ with tags $\{t_1 \ldots t_n\}$ is $\{t_{i-1}\ t_{i-2}\ w_i\ w_{i+1}\}$. The features that define the constraints on the model are obtained by instantiation of feature templates as in Ratnaparkhi (1996). Special feature templates exist for rare words in the training data, to increase the model's prediction capacity for unknown words.
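The loglinear form of p(t | h) and the model expectation of equation (1) from Section 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the dictionary histories and the toy data are hypothetical, and no parameter estimation (Improved Iterative Scaling) is shown.

```python
import math


def conditional_prob(history, tags, features, weights):
    """Loglinear p(t | h): exponentiated weighted feature sums, normalized over tags."""
    scores = {
        t: sum(w * f(history, t) for f, w in zip(features, weights))
        for t in tags
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())  # the normalizing term (partition function)
    return {t: e / z for t, e in exps.items()}


def model_expectation(feature, observed_histories, tags, features, weights):
    """Equation (1): sum over observed contexts h and ALL tags t of p~(h) p(t|h) f(h, t).

    Each training occurrence contributes weight 1/n, so a context seen
    c times receives p~(h) = c/n in total.
    """
    n = len(observed_histories)
    total = 0.0
    for h in observed_histories:
        probs = conditional_prob(h, tags, features, weights)
        total += sum(probs[t] * feature(h, t) for t in tags) / n
    return total


def is_make_nn(history, tag):
    """Hypothetical example feature: current word 'make' tagged NN."""
    return 1 if history["word"] == "make" and tag == "NN" else 0


# Invented toy setup, for illustration only.
TAGS = ["NN", "VB", "DT"]
HISTORIES = [{"word": "make"}, {"word": "make"}, {"word": "make"}, {"word": "the"}]
FEATURES = [is_make_nn]
```

With all weights at zero the conditional distribution is uniform over the three tags, so `model_expectation(is_make_nn, HISTORIES, TAGS, FEATURES, [0.0])` is 3 × (1/3) / 4 = 0.25, up to floating-point rounding. Note that the inner sum runs over all tags, which is the point on which the text differs from the approximation attributed to Ratnaparkhi (1996: 134).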

The actual feature templates for this model are shown in the next table. They are a subset of the features used in Ratnaparkhi (1996).

No.  Feature Type  Template
1.   General       w_i = X & t_i = T
2.   General       t_{i-1} = T_1 & t_i = T
3.   General       t_{i-1} = T_1 & t_{i-2} = T_2 & t_i = T
4.   General       w_{i+1} = X & t_i = T
5.   Rare          Suffix of w_i = S, |S| < 5 & t_i = T
6.   Rare          Prefix of w_i = P, 1 < |P| < 5 & t_i = T
7.   Rare          w_i contains a number & t_i = T
8.   Rare          w_i contains an uppercase character & t_i = T
9.   Rare          w_i contains a hyphen & t_i = T

Table 1 Baseline Model Features

General feature templates can be instantiated by arbitrary contexts, whereas rare feature templates are instantiated only by histories where the current word w_i is rare. Rare words are defined to be words that appear less than a certain number of times in the training data (here, the value 7 was used). In order to be able to throw out features that would give misleading statistics due to sparseness or noise in the data, we use two different cutoff values for general and rare feature templates (in this implementation, 5 and 45 respectively). As seen in Table 1 the features are conjunctions of a boolean function on the history h and a boolean function on the tag t. Features whose first conjuncts are true for more than the corresponding threshold number of histories in the training data are included in the model.

The feature templates in Ratnaparkhi (1996) that were left out were the ones that look at the previous word, the word two positions before the current, and the word two positions after the current. These features are of the same form as template 4 in Table 1, but they look at words in different positions.

Our motivation for leaving these features out was the results from some experiments on successively adding feature templates. Adding template 4 to a model that incorporated the general feature templates 1 to 3 only and the rare feature templates 5-8 significantly increased the accuracy on the development set from 96.0% to 96.52%.
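The instantiation of the Table 1 templates and the two count cutoffs can be sketched roughly as follows. This is a hedged illustration rather than the authors' code: the tuple encoding of predicates, the function names, and the decision to emit no next-word predicate at the end of a sentence are assumptions; the thresholds 7, 5 and 45 and the pseudo-tag NA are taken from the text.

```python
from collections import Counter

RARE_WORD_THRESHOLD = 7  # words seen fewer than 7 times count as rare
GENERAL_CUTOFF = 5       # a general predicate must hold for more than 5 histories
RARE_CUTOFF = 45         # a rare predicate must hold for more than 45 histories


def history_predicates(words, tags, i, word_counts):
    """Return (kind, predicate) pairs that are true for the history at position i."""
    w = words[i]
    prev1 = tags[i - 1] if i >= 1 else "NA"  # pseudo-tag at the sentence start
    prev2 = tags[i - 2] if i >= 2 else "NA"
    preds = [
        ("general", ("w", w)),
        ("general", ("t-1", prev1)),
        ("general", ("t-1,t-2", prev1, prev2)),
    ]
    if i + 1 < len(words):  # assumption: no predicate when there is no next word
        preds.append(("general", ("w+1", words[i + 1])))
    if word_counts[w] < RARE_WORD_THRESHOLD:
        for k in range(1, min(4, len(w)) + 1):  # suffixes with |S| < 5
            preds.append(("rare", ("suffix", w[-k:])))
        for k in range(2, min(4, len(w)) + 1):  # prefixes with 1 < |P| < 5
            preds.append(("rare", ("prefix", w[:k])))
        if any(c.isdigit() for c in w):
            preds.append(("rare", ("has_number",)))
        if any(c.isupper() for c in w):
            preds.append(("rare", ("has_uppercase",)))
        if "-" in w:
            preds.append(("rare", ("has_hyphen",)))
    return preds


def select_predicates(corpus):
    """Keep the history predicates whose training count exceeds the cutoff for their kind."""
    word_counts = Counter(w for words, _ in corpus for w in words)
    counts = Counter()
    for words, tags in corpus:
        for i in range(len(words)):
            counts.update(history_predicates(words, tags, i, word_counts))
    cutoff = {"general": GENERAL_CUTOFF, "rare": RARE_CUTOFF}
    return {pred for pred, c in counts.items() if c > cutoff[pred[0]]}
```

Each selected history predicate would then be conjoined with every tag it co-occurs with to give the actual features; that step is omitted here to keep the sketch short.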
The addition of a feature template that looked at the preceding word and the current tag to the resulting model slightly reduced the accuracy.

1.2 Testing and Performance

The model was trained and tested on the part-of-speech tagged WSJ section of the Penn Treebank. The data was divided into contiguous parts: sections 0-20 were used for training, sections [...] as a development test set, and sections [...] as a final test set. The data set sizes are shown below together with numbers of unknown words.

Data Set      Tokens      Unknown Words
Training      1,061,768
Development   116,[...]   [...] (2.81%)
Test          111,[...]   [...] (2.59%)

Table 2 Data Sizes

The testing procedure uses a beam search to find the tag sequence with maximal probability given a sentence. In our experiments we used a beam of size 5. Increasing the beam size did not result in improved accuracy.

The preceding tags for the word at the beginning of the sentence are regarded as having the pseudo-tag NA. In this way, the information that a word is the first word in a sentence is available to the tagger. We do not have a special end-of-sentence symbol.

We used a tag dictionary for known words in testing. This was built from tags found in the training data but augmented so as to capture a few basic systematic tag ambiguities that are found in English. Namely, for regular verbs the -ed form can be either a VBD or a VBN and similarly the stem form can be either a VBP or VB. Hence for words that had occurred with only one of these tags in the training data the other was also included as possible for assignment.

The results on the test set for the Baseline model are shown in Table 3.

Model                Overall Accuracy   Unknown Word Accuracy
Baseline             96.72%             84.5%
Ratnaparkhi (1996)   96.63%             85.56%

Table 3 Baseline model performance

This table also shows the results reported in Ratnaparkhi (1996: 142) for convenience. The accuracy figure for our model is higher overall but lower for unknown words. This may stem from the differences between the two models' feature templates, thresholds, and approximations of the expected values for the features, as discussed in the beginning of the section, or may just reflect differences in the choice of training and test sets (which are not precisely specified in Ratnaparkhi (1996)). The differences are not great enough to justify any definite statement about the different use of feature templates or other particularities of the model estimation. One conclusion that we can draw is that at present the additional word features used in Ratnaparkhi (1996), looking at words more than one position away from the current, do not appear to be helping the overall performance of the models.

1.3 Discussion of Problematic Cases

A large number of words, including many of the most common words, can have more than one syntactic category. This introduces a lot of ambiguities that the tagger has to resolve. Some of the ambiguities are easier for taggers to resolve and others are harder.

Some of the most significant confusions that the Baseline model made on the test set can be seen in Table 5. The row labels in Table 5 signify the correct tags, and the column labels signify the assigned tags. For example, the number 244 in the (NN, JJ) position is the number of words that were NNs but were incorrectly assigned the JJ category. These particular confusions, shown in the table, account for a large percentage of the total error (2652/3651 = 72.64%). Table 6 shows part of the Baseline model's confusion matrix for just unknown words.

Table 4 shows the Baseline model's overall assignment accuracies for different parts of speech. For example, the accuracy on nouns is greater than the accuracy on adjectives. The accuracy on NNPS (plural proper nouns) is a surprisingly low 41.1%.

Tag   Accuracy   Tag    Accuracy
IN    97.3%      JJ     93.0%
NN    96.5%      RB     92.2%
NNP   96.2%      VBN    90.4%
VBD   95.2%      RP     41.5%
VB    94.0%      NNPS   41.1%
VBP   93.4%

Table 4 Accuracy of assignments for different parts of speech for the Baseline model.
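The beam search decoding described in Section 1.2 can be sketched as below. The interface is hypothetical: `tags_for` stands in for the tag dictionary and `cond_prob` for the trained model, and the toy model at the end is invented purely to make the example runnable. Only the beam size of 5 and the pseudo-tag NA come from the text.

```python
import math


def beam_search(words, tags_for, cond_prob, beam_size=5):
    """Return the best tag sequence found while keeping `beam_size` hypotheses per word.

    tags_for(word)                   -> candidate tags for the word
    cond_prob(words, i, prev1, prev2) -> dict mapping tag to p(tag | history)
    """
    beam = [(0.0, [])]  # (log probability, tags assigned so far)
    for i, word in enumerate(words):
        candidates = []
        for logp, seq in beam:
            prev1 = seq[-1] if len(seq) >= 1 else "NA"  # pseudo-tag at sentence start
            prev2 = seq[-2] if len(seq) >= 2 else "NA"
            probs = cond_prob(words, i, prev1, prev2)
            for tag in tags_for(word):
                p = probs.get(tag, 0.0)
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [tag]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]  # prune to the best hypotheses
    return beam[0][1] if beam else []


# Invented toy tag dictionary and model, for illustration only.
TOY_DICT = {"to": ["TO"], "make": ["VB", "NN"]}


def toy_tags_for(word):
    return TOY_DICT[word]


def toy_cond_prob(words, i, prev1, prev2):
    if words[i] == "to":
        return {"TO": 1.0}
    if prev1 == "TO":
        return {"VB": 0.9, "NN": 0.1}
    return {"VB": 0.4, "NN": 0.6}


print(beam_search(["to", "make"], toy_tags_for, toy_cond_prob))  # ['TO', 'VB']
```

Because each hypothesis carries its own tag history, features that look at previously assigned tags can be evaluated per hypothesis; the pruning step is what makes the search approximate rather than exact.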
Tagger errors are of various types. Some are the result of inconsistency in labeling in the training data (Ratnaparkhi 1996), which usually reflects a lack of linguistic clarity or determination of the correct part of speech in context. For instance, the status of various noun premodifiers (whether chief or maximum is NN or JJ, or whether a word in -ing is acting as a JJ or VBG) is of this type. Some, such as errors between NN/NNP/NNPS/NNS, largely reflect difficulties with unknown words. But other cases, such as VBN/VBD and VB/VBP/NN, represent systematic tag ambiguity patterns in English, for which the right answer is invariably clear in context, and for which there are in general good structural contextual clues that one should be able to use to disambiguate. Finally, in another class of cases, of which the most prominent is probably the RP/IN/RB ambiguity of words like up, out, and on, the linguistic distinctions, while having a sound empirical basis (e.g., see Baker (1995)), are quite subtle, and often require semantic intuitions. There are not good syntactic cues for the correct tag (and furthermore, human taggers not infrequently make errors). Within this classification, the greatest hopes for tagging improvement appear to come from minimizing errors in the second and third classes of this classification.

In the following sections we discuss how we include additional knowledge sources to help in the assignment of tags to forms of verbs, capitalized unknown words, particle words, and in the overall accuracy of part of speech assignments.

2 Improving the Unknown Words Model

The accuracy of the baseline model is markedly lower for unknown words than for previously seen ones. This is also the case for all other taggers, and reflects the importance of lexical information to taggers: in the best accuracy figures published for corpus-based taggers, known word accuracy is around 97%, whereas unknown word accuracy is around 85%.

In following experiments, we examined ways of using additional features to improve the accuracy of tagging unknown words.
As previously discussed in Mikheev (1999), it is possible to improve the accuracy on capitalized words that might be proper nouns or the first word in a sentence, etc.

Table 5 Confusion matrix of the Baseline model showing top confusion pairs overall (rows: correct tag; columns: assigned tag; labels JJ, NN, NNP, NNPS, RB, RP, IN, VB, VBD, VBN, VBP, Total)

Table 6 Confusion matrix of the Baseline model for unknown words showing top confusion pairs (rows: correct tag; columns: assigned tag; labels JJ, NN, NNP, NNPS, NNS, VBN, Total)

                                          Baseline   Model 1          Model 2      Model 3
                                                     Capitalization   Verb forms   Particles
Accuracy Test Set                         96.72%     96.76%           96.83%       96.86%
Unknown Words Accuracy Test Set           84.50%     86.76%           86.87%       86.91%
Accuracy Development Set                  96.53%     96.55%           96.58%       96.62%
Unknown Words Accuracy Development Set    85.48%     86.03%           86.03%       86.06%

Table 7 Accuracies of all models on the test and development sets

                                     Baseline    Model 1          Model 2      Model 3
                                                 Capitalization   Verb Forms   Particles
1.  Current word                     15,832      15,832           15,837       15,[...]
2.  Previous tag                     1,424       1,424            1,424        1,[...]
3.  Previous two tags                16,124      16,124           16,124       16,[...]
4.  Next word                        80,075      80,075           80,075       80,[...]
5.  Suffixes                         3,361       3,361            3,361        3,[...]
6.  Prefixes                         5,[...]
7.  Contains uppercase character     [...]
8.  Contains number                  [...]
9.  Contains hyphen                  [...]
10. Capitalized and mid. sentence    [...]
11. All letters uppercase            [...]
12. VBP VB feature                   [...]
13. VBD VBN feature                  [...]
14. Particles, type 1                [...]
15. Particles, type 2                                                          [...],178
    Total                            122,[...]   [...]            [...]        [...],944

Table 8 Number of features of different types

For example, the error on the proper noun category (NNP) accounts for a significantly larger percent of the total error for unknown words than for known words. In the Baseline model, of the unknown word error 41.3% is due to words being NNP and assigned to some other category, or being of other category and assigned NNP. The percentage of the same type of error for known words is 16.2%.

The incorporation of the following two feature schemas greatly improved NNP accuracy:

(1) A feature that looks at whether all the letters of a word are uppercase. The feature that looked at capitalization before (cf. Table 1, feature No. 8) is activated when the word contains an uppercase character. This turns out to be a notable distinction because, for example, in titles in the WSJ data all words are in all uppercase, and the distribution of tags for these words is different from the overall distribution for words that contain an uppercase character.

(2) A feature that is activated when the word contains an uppercase character and it is not at the start of a sentence. These word tokens also have a different tag distribution from the distribution for all tokens that contain an uppercase character.

Conversely, empirically it was found that the prefix features for rare words were having a net negative effect on accuracy. We do not at present have a good explanation for this phenomenon.

The addition of the features (1) and (2) and the removal of the prefix features considerably improved the accuracy on unknown words and the overall accuracy. The results on the test set after adding these features are shown below:

Overall Accuracy   Unknown Word Accuracy
96.76%             86.76%

Table 9 Accuracy when adding capitalization features and removing prefix features.

Unknown word error is reduced by 15% as compared to the Baseline model (from 15.50% to 13.24%, a relative reduction of about 14.6%).

It is important to note that (2) is composed of information already known to the tagger in some sense.
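The history predicates behind the two capitalization features can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than facts from the paper: non-letter characters are ignored when testing for all-uppercase, and the sentence start is detected through the pseudo-tag NA of the preceding tag.

```python
def contains_uppercase(word):
    """Baseline predicate (Table 1, No. 8): the word has at least one uppercase character."""
    return any(c.isupper() for c in word)


def all_letters_uppercase(word):
    """Feature (1): every letter of the word is uppercase (non-letters are ignored)."""
    letters = [c for c in word if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def capitalized_mid_sentence(word, prev_tag):
    """Feature (2): the word contains an uppercase character and is not sentence-initial.

    The sentence start is recognized by the pseudo-tag "NA" on the preceding tag,
    so this predicate is a conjunction of two facts the baseline already sees.
    """
    return contains_uppercase(word) and prev_tag != "NA"


print(all_letters_uppercase("IBM"), all_letters_uppercase("Stanford"))  # True False
print(capitalized_mid_sentence("Stanford", "IN"))                       # True
print(capitalized_mid_sentence("Stanford", "NA"))                       # False
```

Written this way, predicate (2) is visibly the conjunction of an existing predicate with the negation of "previous tag is NA", which is the kind of interaction term a loglinear model does not supply on its own.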
Feature (2) can be viewed as the conjunction of two features, one of which is already in the baseline model, and the other of which is the negation of a feature existing in the baseline model, since for words at the beginning of a sentence, the preceding tag is the pseudo-tag NA, and there is a feature looking at the preceding tag. Even though our maximum entropy model does not require independence among the predictors, it provides for free only a simple combination of feature weights, and additional interaction terms are needed to model non-additive interactions (in log-space terms) between features.

3 Features for Disambiguating Verb Forms

Two of the most significant sources of classifier errors are the VBN/VBD ambiguity and the VBP/VB ambiguity. As seen in Table 5, VBN/VBD confusions account for 6.9% of the total word error. The VBP/VB confusions are a smaller 3.7% of the errors. In many cases it is easy for people (and for taggers) to determine the correct form. For example, if there is a to infinitive or a modal directly preceding the VB/VBP ambiguous word, the form is certainly non-finite. But often the modal can be several positions away from the current position: still obvious to a human, but out of sight for the baseline model.

To help resolve a VB/VBP ambiguity in such cases, we can add a feature that looks at the preceding several words (we have chosen 8 as a threshold), but not across another verb, and activates if there is a to there, a modal verb, or a form of do, let, make, or help (verbs that frequently take a bare infinitive complement).

Rather than having a separate feature look at each preceding position, we define one feature that looks at the chosen number of positions to the left. This both increases the scope of the available history for the tagger and provides a better statistic because it avoids fragmentation.

We added a similar feature for resolving VBD/VBN confusions. It activates if there is a have or be auxiliary form in the preceding several positions (again the value 8 is used in the implementation).
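A rough sketch of the two left-looking predicates follows. The word lists, the verb tag set and the function names are assumptions made for illustration; in particular, stopping the VBD/VBN scan at an intervening verb is an assumption carried over from the VB/VBP description, and only the window of 8 comes from the text.

```python
WINDOW = 8
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

# Hypothetical trigger lists; the paper does not enumerate them.
BARE_INFINITIVE_TRIGGERS = {
    "to", "will", "would", "can", "could", "may", "might", "shall", "should", "must",
    "do", "does", "did", "let", "lets", "make", "makes", "made",
    "help", "helps", "helped",
}
AUXILIARY_TRIGGERS = {
    "have", "has", "had", "having",
    "be", "am", "is", "are", "was", "were", "been", "being",
}


def _trigger_to_the_left(words, tags, i, triggers, window=WINDOW):
    """Scan up to `window` positions left of i; stop at the first non-trigger verb."""
    for j in range(i - 1, max(-1, i - 1 - window), -1):
        if words[j].lower() in triggers:
            return True
        if tags[j] in VERB_TAGS:  # do not look across another verb
            return False
    return False


def vb_vbp_predicate(words, tags, i):
    """True if a to, a modal, or a form of do/let/make/help precedes position i."""
    return _trigger_to_the_left(words, tags, i, BARE_INFINITIVE_TRIGGERS)


def vbd_vbn_predicate(words, tags, i):
    """True if a form of have or be precedes position i."""
    return _trigger_to_the_left(words, tags, i, AUXILIARY_TRIGGERS)
```

For instance, with `words = ["They", "will", "probably", "not", "make"]` and preceding tags `["PRP", "MD", "RB", "RB"]`, `vb_vbp_predicate(words, tags, 4)` is true, because the modal three positions to the left is reached before any other verb. A single predicate over the whole window, rather than one per position, is what keeps the statistic from fragmenting.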
The form of these two feature templates was motivated by the structural rules of English and not induced from the training data, but it should be possible to look for predictors for certain parts of speech in the preceding words in the sentence by, for example, computing association strengths.

The addition of the two feature schemas helped reduce the VB/VBP and VBD/VBN confusions. Below is the performance on the test set of the resulting model when features for disambiguating verb forms are added to the model of Section 2. The number of VB/VBP confusions was reduced by 23.1% as compared to the baseline. The number of VBD/VBN confusions was reduced by 12.3%.

Overall Accuracy   Unknown Word Accuracy
96.83%             86.87%

Table 10 Accuracy of the extended model

4 Features for Particle Disambiguation

As discussed in section 1.3 above, the task of determining RB/RP/IN tags for words like down, out, up is difficult and in particular examples, there are often no good local syntactic indicators. For instance, in (2), we find the exact same sequence of parts of speech, but (2a) is a particle use of on, while (2b) is a prepositional use. Consequently, the accuracy on the rarer RP (particles) category is as low as 41.5% for the Baseline model (cf. Table 4).

(2) a. Kim took on the monster.
    b. Kim sat on the monster.

We tried to improve the tagger's capability to resolve these ambiguities through adding information on verbs' preferences to take specific words as particles, or adverbs, or prepositions. There are verbs that take particles more than others, and particular words like out are much more likely to be used as a particle in the context of some verb than other words ambiguous between these tags.

We added two different feature templates to capture this information, consisting as usual of a predicate on the history h, and a condition on the tag t. The first predicate is true if the current word is often used as a particle, and if there is a verb at most 3 positions to the left, which is known to have a good chance of taking the current word as a particle. The verb-particle pairs that are known by the system to be very common were collected through analysis of the training data in a preprocessing stage.

The second feature template has the form: The last verb is v and the current word is w and w has been tagged as a particle and the current tag is t. The last verb is the pseudo-symbol NA if there is no verb in the previous three positions.

These features were some help in reducing the RB/IN/RP confusions.
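The two particle predicates can be sketched as below. Everything outside the window of three positions and the pseudo-symbol NA is an assumption: the function names, the verb tag set, matching verbs by surface form rather than lemma, and the small sets standing in for what the preprocessing stage would collect from the training data.

```python
PARTICLE_WINDOW = 3
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def last_verb(words, tags, i, window=PARTICLE_WINDOW):
    """Nearest verb within `window` positions to the left of i, or the pseudo-symbol NA."""
    for j in range(i - 1, max(-1, i - 1 - window), -1):
        if tags[j] in VERB_TAGS:
            return words[j].lower()
    return "NA"


def particle_type1(words, tags, i, frequent_particles, common_pairs):
    """First predicate: the word is often a particle and a nearby verb commonly takes it."""
    w = words[i].lower()
    if w not in frequent_particles:
        return False
    for j in range(i - 1, max(-1, i - 1 - PARTICLE_WINDOW), -1):
        if tags[j] in VERB_TAGS and (words[j].lower(), w) in common_pairs:
            return True
    return False


def particle_type2(words, tags, i, seen_as_particle):
    """Second template: the (last verb, current word) pair, or None if the word was never RP."""
    w = words[i].lower()
    if w not in seen_as_particle:
        return None
    return (last_verb(words, tags, i), w)


# Invented stand-ins for the statistics gathered in preprocessing.
FREQUENT_PARTICLES = {"on", "up", "out", "down"}
COMMON_PAIRS = {("took", "on")}

took = ["Kim", "took", "on", "the", "monster"]
sat = ["Kim", "sat", "on", "the", "monster"]
prev_tags = ["NNP", "VBD"]

print(particle_type1(took, prev_tags, 2, FREQUENT_PARTICLES, COMMON_PAIRS))  # True
print(particle_type1(sat, prev_tags, 2, FREQUENT_PARTICLES, COMMON_PAIRS))   # False
```

On the paper's example pair the tag sequences are identical, so only the lexical identity of the verb separates the two cases, which is exactly the information these predicates expose to the model.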
The accuracy on the RP category rose to 44.3%. Although the overall confusions in this class were reduced, some of the errors were increased; for example, the number of INs classified as RBs rose slightly. There seems to be still considerable room to improve these results, though the attainable accuracy is limited by the accuracy with which these distinctions are marked in the Penn Treebank (on a quick informal study, this accuracy seems to be around 85%). The next table shows the final performance on the test set.

Overall Accuracy   Unknown Word Accuracy
96.86%             86.91%

Table 11 Accuracy of the final model

For ease of comparison, the accuracies of all models on the test and development sets are shown in Table 7. We note that accuracy is lower on the development set. This presumably corresponds with Charniak's (2000: 136) observation that Section 23 of the Penn Treebank is easier than some others. Table 8 shows the different number of feature templates of each kind that have been instantiated for the different models as well as the total number of features each model has. It can be seen that the features which help disambiguate verb forms, which look at capitalization, and the first of the feature templates for particles are a very small number as compared to the features of the other kinds. The improvement in classification accuracy therefore comes at the price of adding very few parameters to the maximum entropy model and does not result in increased model complexity.

Conclusion

Even when the accuracy figures for corpus-based part-of-speech taggers start to look extremely similar, it is still possible to move performance levels up. The work presented in this paper explored just a few information sources in addition to the ones usually used for tagging. While progress is slow, because each new feature applies only to a limited range of cases, nevertheless the improvement in accuracy as compared to previous results is noticeable, particularly for the individual decisions on which we focused.
The potential of maximum entropy methods has not previously been fully exploited for the task of assignment of parts of speech. We incorporated into a maximum entropy-based tagger more linguistically sophisticated features, which are non-local and do not look just at particular positions in the text. We also added features that model the interactions of previously employed predictors. All of these changes led to modest increases in tagging accuracy.

This paper has thus presented some initial experiments in improving tagger accuracy through using additional information sources. In the future we hope to explore automatically discovering information sources that can be profitably incorporated into maximum entropy part-of-speech prediction.

References

Baker, C. L. 1995. English Syntax. Cambridge, MA: MIT Press, 2nd edition.

Berger, Adam L., Della Pietra, Stephen A., and Della Pietra, Vincent J. 1996. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics 22.

Brants, Thorsten. 2000. TnT: A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP 2000), Seattle, WA.

Brill, Eric. 1994. Some Advances in Transformation-Based Part of Speech Tagging. Proceedings of AAAI, Vol. 1.

Charniak, Eugene. 2000. A Maximum-Entropy-Inspired Parser. Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Jelinek, Frederick. 1997. Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.

Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Mikheev, Andrei. 1999. Periods, Capitalized Words, etc. Ms., University of Edinburgh.

Ratnaparkhi, Adwait. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania.

Ratnaparkhi, Adwait. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD Thesis, University of Pennsylvania.

Samuelsson, Christer and Atro Voutilainen. 1997. Comparing a Linguistic and a Stochastic Tagger. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics.