Annotation and Evaluation of Swedish Multiword Named Entities
Dimitrios Kokkinakis
Department of Swedish, the Swedish Language Bank
University of Gothenburg, Sweden
dimitrios.kokkinakis@svenska.gu.se
Introduction
- There is a considerable body of work in NER: a plethora of identification and classification techniques, NE taxonomies, and resources.
- Likewise, there is a wide variety of work on MWEs, a key problem for the development of large-scale, linguistically sound NLP technologies (Sag et al., 2002): typology, detection, function, applications.
- Considerably less attention has been paid to the intersection of the two: the nature, complexity, and magnitude of multiword named entities (MWE-NEs), and their evaluation.
- Here, we evaluate 2 Swedish NER systems against gold-standard data in order to provide insights into the magnitude and usage of such expressions in modern Swedish corpora.
MWE-NEs and their relation to NLP
- MWE-NEs are composed of more than one token (sometimes including combinations of characters and numerals), and, for some of them, the meaning cannot be traced back to their individual parts (Vincze et al., 2011); e.g., New York Yankees.
- It is therefore justifiable to treat such expressions as a single syntactic and/or semantic entity in, e.g., treebanks (Bejček & Straňák, 2010).
- NLP applications need to treat MWE-NEs as a single object for:
  - improving parsing accuracy (Nivre & Nilsson, 2004)
  - improving question answering (McCord et al., 2012)
  - improving machine translation (Tan & Pal, 2014) and translation quality (Hurskainen, 2008)
  - improving multilingual IR (Vechtomova, 2012)
Swedish Evaluation Corpora
- SUC3.0 (the Stockholm-Umeå Corpus, v. 3.0) is a freely available Swedish gold-standard corpus that can be used for the evaluation of MWE-NE recognition. SUC3.0 recognizes 9 types of NEs: person, work, event, product, inst[itution], place, myth, other, and animal. These 9 entity types have been manually annotated according to the TEI P3 guidelines.
- SIC (the Stockholm Internet Corpus) contains Swedish blog posts, automatically annotated with part of speech and NEs (13,562 tokens).
- Swedish Wikipedia: 28 randomly selected articles (16,069 tokens).
Swedish Evaluation Corpora
- SUC3.0: 9,884 MWE-NEs (roughly 30% of all NEs in the corpus), found in ~7,530 corpus lines (~155,000 tokens); no (MWE) time expressions*.
- SIC: only 34 MWE-NEs (and 18 MWE time expressions).
- Swedish Wikipedia articles: 223 MWE-NEs and 222 MWE time expressions. (Purpose: SUC3.0 does not contain annotated time expressions, an important category often discussed in the context of NER.)
- SIC + Swedish Wikipedia combined: 257 MWE-NEs and 240 MWE time expressions.
*Temporal expressions: absolute temporal, relative temporal, durations.
Swedish Evaluation Corpora

Corpus   MWE-NE type  2-token entities  %*             >2-token entities  %*
SUC3.0   person       5,806             92.9% (58.7%)  458                7.1% (4.6%)
SUC3.0   place        526               85.1% (5.3%)   93                 14.9% (0.9%)
SUC3.0   institution  1,117             73.4% (11.3%)  404                26.6% (4.1%)
SUC3.0   other        330               69.4% (3.3%)   145                30.6% (1.5%)
SUC3.0   work         418               40.9% (4.2%)   604                59.1% (6.1%)
SIC+SW   person       58                79.5% (11.7%)  15                 20.5% (3%)
SIC+SW   place        47                97.9% (9.4%)   1                  2.1% (0.2%)
SIC+SW   institution  57                76% (11.5%)    18                 24% (3.6%)
SIC+SW   other        16                61.5% (3.2%)   10                 38.5% (2%)
SIC+SW   work         16                45.7% (3.2%)   19                 54.3% (3.8%)
SIC+SW   time         102               42.5% (20.5%)  138                57.5% (27.8%)

Available from:
<http://demo.spraakdata.gu.se/svedk/pbl/sucannotsmwe-nes.gold150507.utf.gz>
<http://demo.spraakdata.gu.se/svedk/pbl/sic_o_wikimwe-nes.gold150507.utf.gz>
*Percentages of 2-token vs. >2-token entities within each type; figures in parentheses are percentages relative to all MWE-NEs in the respective gold-standard corpus.
SUC3.0 Pre-processing
- The NE annotation of SUC3.0 is not completely homogeneous with respect to the NEs' content.
- The 2 Swedish NER taggers are trained on a simplified version of SUC3.0, using 4 entity types: person, organization, location, and miscellaneous.
- Thus product, myth, event, animal, and other are merged into the miscellaneous category; institution was mapped to organization, and place to location.
- Moreover, since SUC3.0 does not provide annotation for date or time expressions, we manually annotated 28 randomly chosen Swedish Wikipedia articles for this part of the evaluation.
SUC3.0 Pre-processing
- For the sake of the experiment, prior filtering and harmonization of the SUC annotation was necessary before the evaluation of the entities; see the mapping sketch below.
- A number of person entities included the vocation or other features as part of the annotation, e.g. President in President George Bush (SUC3.0 file aa08c-019).
- animal (68) and myth (18) were merged into the category person.
- Because of discrepancies in the SUC3.0 annotation, the generic NE type other subsumes product (208), event (93), and other (174).
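To make the category mappings above concrete, here is a minimal sketch in Python; the dictionary and function names are hypothetical illustrations of the mappings stated on these slides, not the authors' actual pipeline.

```python
# Illustrative sketch of the SUC3.0 category mappings described above
# (names are hypothetical).

# Simplification of the SUC3.0 NE types to the 4 types used to train the taggers.
# NB: the slides do not state a mapping for "work", which is reported
# separately in the evaluation tables.
TAGGER_TRAINING_MAP = {
    "person": "person",
    "inst": "organization",
    "place": "location",
    "product": "miscellaneous",
    "myth": "miscellaneous",
    "event": "miscellaneous",
    "animal": "miscellaneous",
    "other": "miscellaneous",
}

# Harmonization of the gold-standard annotation for the MWE-NE evaluation.
GOLD_HARMONIZATION_MAP = {
    "animal": "person",  # 68 instances merged into person
    "myth": "person",    # 18 instances merged into person
    "product": "other",  # 208 instances merged into the generic type other
    "event": "other",    # 93 instances
    "other": "other",    # 174 instances
}

def harmonize(ne_type: str) -> str:
    """Map a raw SUC3.0 NE type to its harmonized evaluation type."""
    return GOLD_HARMONIZATION_MAP.get(ne_type, ne_type)
```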
Evaluation
- All annotated texts were converted to the CoNLL data format (columns separated by a single space), and the conlleval script <www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt> was then used for the evaluation of the automatic NER.
- Tokens not part of an entity are tagged O; O stands for Outside, B for Begin, and I for Inside.
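For illustration, a minimal sketch of the space-separated, BIO-labelled column format described above, together with a small span extractor; the Swedish example sentence, the two-label layout (gold and predicted columns), and the helper name bio_spans are our own assumptions, not taken from the corpora or from the conlleval script.

```python
# Minimal sketch of CoNLL-style BIO-tagged data: token, gold label,
# predicted label, separated by single spaces. Example data is illustrative.
conll_lines = """\
Hon O O
arbetar O O
vid O O
Göteborgs B-inst B-inst
universitet I-inst I-inst
. O O"""

def bio_spans(labels):
    """Collect (start, end, type) entity spans from a BIO label sequence."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last span
        if lab.startswith("B-") or lab == "O":
            if start is not None:
                spans.append((start, i - 1, etype))
                start, etype = None, None
            if lab.startswith("B-"):
                start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is None:
            start, etype = i, lab[2:]  # tolerate I- without a preceding B-
    return spans

gold = [line.split()[1] for line in conll_lines.splitlines()]
print(bio_spans(gold))  # [(3, 4, 'inst')]  -> Göteborgs universitet
```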
Comparison and Evaluation 1 (SUC3.0)

Tag       P       P Stagger  R       R Stagger  FB1*    FB1 Stagger  Gold data
person-b  95.80%  98.85%     90.28%  96.40%     92.96%  97.61%       SUC3.0, based on 6,264
person-i  93.90%  98.04%     88.46%  95.33%     91.10%  96.66%       based on 6,795
place-b   94.74%  97.36%     78.26%  89.48%     85.71%  93.28%       SUC3.0, based on 619
place-i   89.20%  96.92%     73.71%  81.39%     80.72%  88.48%       based on 741
inst-b    93.35%  97.46%     64.79%  88.23%     76.49%  92.62%       SUC3.0, based on 1,521
inst-i    90.39%  96.44%     62.73%  81.85%     74.06%  88.55%       based on 2,130
work-b    70.73%  81.31%     25.47%  60.86%     37.45%  69.61%       SUC3.0, based on 1,022
work-i    54.27%  80.39%     20.15%  48.92%     29.39%  60.83%       based on 2,513
other-b   89.68%  93.64%     62.73%  80.63%     73.82%  86.65%       SUC3.0, based on 475
other-i   80.80%  95.41%     56.29%  74.32%     66.35%  83.55%       based on 675

* FB1 = 2*P*R / (P+R)
** conlleval: <https://github.com/mvanerp/ner/blob/master/scripts/conlleval.pl> by Erik Tjong Kim Sang
*** Note: Stagger was trained on the SUC3.0 NE annotation!
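As a quick sanity check of the FB1 formula in the footnote, this two-line computation (our own illustration) reproduces the person-b cell of the table above:

```python
# FB1 is the harmonic mean of precision and recall.
def fb1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(round(fb1(95.80, 90.28), 2))  # 92.96, matching the person-b row
```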
Comparison and Evaluation 2 (SW+SIC)

Tag       P       P Stagger  R       R Stagger  FB1     FB1 Stagger  Gold data
person-b  75.78%  49.6%      89.04%  84.93%     81.76%  62.63%       SW+SIC, based on 73
person-i  75.45%  74.76%     88.30%  81.91%     81.37%  78.17%       based on 94
place-b   78.57%  47.06%     68.75%  16.67%     73.33%  24.62%       SW+SIC, based on 48
place-i   76.74%  100%       67.35%  8.16%      71.74%  15.09%       based on 49
inst-b    71.15%  50%        49.33%  21.33%     58.27%  29.91%       SW+SIC, based on 75
inst-i    67.12%  58.06%     46.67%  17.14%     55.06%  26.47%       based on 105
work-b    66.67%  12.50%     38.71%  5%         48.98%  7.14%        SW+SIC, based on 20
work-i    77.42%  11.11%     40%     2.33%      52.75%  3.85%        based on 43
other-b   64.29%  50%        30%     4.88%      40.91%  8.89%        SW+SIC, based on 41
other-i   76.47%  40%        27.66%  3.12%      40.62%  5.8%         based on 64
time-b    91.03%  –          84.58%  –          87.69%  –            SW+SIC, based on 240
time-i    98.21%  –          81.32%  –          88.97%  –            based on 471

– : no Stagger figures reported for time expressions.
Error Analysis, some observations
- The NE type work seems to be the most difficult MWE-NE to identify; there are usually no orthographic or other identifiable signs in its immediate context, and the use of common vocabulary makes things even more difficult:
  kk48-011: Vi hade tidigare spelat en komedi, <work>de båda direktörerna</work>. ("We had previously played a comedy, The Two Directors.")
- Non-consistent annotation, e.g. between work and inst; in both cases below the annotation should have been work:
  kk72-126: [...] efter artikeln i <inst>svenska Dagbladet</inst> [...] ("[...] after the article in Svenska Dagbladet [...]"), while in kl10-046 the same entity is annotated as: [...] annonsen kommer i <work>svenska Dagbladet</work> [...] ("[...] the advertisement appears in Svenska Dagbladet [...]").
Error Analysis, some observations
- The types inst and other exhibit very low recall for various reasons, e.g. systematic polysemy between an organization and a location, as in file jg05b-005: [...] mottagningen på <inst>sandvikens sjukhus</inst> ("[...] the reception at Sandviken hospital"), where the annotation produced by the NER system was place, which is probably correct;
- or other, less obvious reasons, as in file he06d-002: I <other>konsum Huddinge centrum</other> är en torgyta intill [...] ("In Konsum at Huddinge center there is a square area next to [...]"), where the annotation produced by the NER system was once again place.
Conclusions
- We presented an experiment to automatically annotate and evaluate Swedish MWE-NEs.
- The evaluation results show a large variation with respect to the type of NE concerned, with the worst results found for the categories work and other.
- During the analysis of the SUC3.0 MWE-NEs, we discovered inconsistencies and discrepancies that affect the results negatively. A newer version, with these inconsistencies resolved, could provide a much more reliable gold standard for Swedish NER (e.g., for training and/or testing).