Annotation and Evaluation of Swedish Multiword Named Entities


Annotation and Evaluation of Swedish Multiword Named Entities
DIMITRIOS KOKKINAKIS
Department of Swedish, the Swedish Language Bank, University of Gothenburg, Sweden
dimitrios.kokkinakis@svenska.gu.se

Introduction
- There is a considerable body of work in NER: a plethora of identification and classification techniques, NE taxonomies and resources.
- Likewise, there is a wide variety of work on MWEs, a key problem for the development of large-scale, linguistically sound NLP technologies (Sag et al., 2002): typology, detection, function, applications.
- There has been considerably less focus at the intersection of the two: the nature, complexity and magnitude of multiword named entities, and their evaluation.
- Here, we evaluate two Swedish NER systems on gold-standard data in order to provide insights into the magnitude and usage of such expressions in modern Swedish corpora.

MWE-NEs and their relation to NLP
- MWE-NEs are composed of more than one token (even combinations of characters/numerals) and, for some of them, the meaning cannot be traced back to their individual parts (Vincze et al., 2011); e.g., New York Yankees.
- It is therefore justifiable to treat such expressions as a single syntactic and/or semantic entity in, e.g., treebanks (Bejček & Straňák, 2010).
- NLP applications need to treat MWE-NEs as a single object for:
  - improving parsing accuracy (Nivre and Nilsson, 2004)
  - improving question answering (McCord et al., 2012)
  - improving machine translation (Tan and Pal, 2014)
  - better translation quality (Hurskainen, 2008)
  - improving multilingual IR (Vechtomova, 2012)

Swedish Evaluation Corpora
- SUC3.0 (the Stockholm-Umeå Corpus, v. 3.0) is a freely available Swedish gold-standard corpus that can be used for the evaluation of MWE-NE recognition. SUC3.0 distinguishes 9 types of NEs: person, work, event, product, inst[itution], place, myth, other and animal. These 9 entity types have been manually annotated according to the TEI P3 guidelines.
- SIC (the Stockholm Internet Corpus) contains Swedish blog posts, automatically annotated with PoS tags and NEs; 13,562 tokens.
- Swedish Wikipedia: 28 randomly selected articles; 16,069 tokens.

Swedish Evaluation Corpora (MWE-NE counts)
- SUC3.0: 9,884 MWE-NEs (roughly 30% of all NEs in the corpus), found in ~7,530 corpus lines (~155,000 tokens); no MWE time expressions*.
- SIC: only 34 MWE-NEs (and 18 MWE time expressions).
- Swedish Wikipedia articles: 223 MWE-NEs and 222 MWE time expressions; SIC and Wikipedia together thus contribute 257 MWE-NEs and 240 MWE time expressions. Purpose: SUC3.0 does not contain annotated time expressions, an important category often discussed in the context of NER.
* Temporal expressions: absolute temporal, relative temporal, durations.

Swedish Evaluation Corpora: MWE-NE length distribution

SUC3.0
  MWE-NE type    2-token entities          >2-token entities
  person         5,806  92.9%  (58.7%)       458   7.1%  (4.6%)
  place            526  85.1%  (5.3%)          93  14.9%  (0.9%)
  institution    1,117  73.4%  (11.3%)        404  26.6%  (4.1%)
  other            330  69.4%  (3.3%)         145  30.6%  (1.5%)
  work             418  40.9%  (4.2%)         604  59.1%  (6.1%)

SIC + Swedish Wikipedia
  person            58  79.5%  (11.7%)         15  20.5%  (3%)
  place             47  97.9%  (9.4%)           1   2.1%  (0.2%)
  institution       57  76%    (11.5%)         18  24%    (3.6%)
  other             16  61.5%  (3.2%)          10  38.5%  (2%)
  work              16  45.7%  (3.2%)          19  54.3%  (3.8%)
  time             102  42.5%  (20.5%)        138  57.5%  (27.8%)

* Percentages in parentheses are relative to all MWE-NEs in the respective gold-standard corpus (e.g., 2-token person NEs account for 58.7% of all SUC3.0 MWE-NEs).

Available from:
<http://demo.spraakdata.gu.se/svedk/pbl/sucannotsmwe-nes.gold150507.utf.gz>
<http://demo.spraakdata.gu.se/svedk/pbl/sic_o_wikimwe-nes.gold150507.utf.gz>

SUC3.0 Pre-processing
- The NE annotation of SUC3.0 is not completely homogeneous with respect to the content of the NEs.
- The two Swedish NER taggers are trained on a simplified version of SUC3.0, using 4 entity types, namely: person, organization, location and miscellaneous. Thus product, myth, event, animal and other are merged into the miscellaneous category; institution is mapped to organization, and place to location.
- Moreover, since SUC3.0 does not provide annotation for date or time expressions, we manually annotated 28 randomly chosen Swedish Wikipedia articles for this part of the evaluation.

SUC3.0 Pre-processing (continued)
- For the sake of the experiment, prior filtering and harmonization of SUC3.0 was necessary before the evaluation of the entities.
- A number of person annotations included the vocation or other features as part of the annotation, e.g. President in President George Bush (SUC3.0 file aa08c-019).
- animal (68) and myth (18) were merged into the category person.
- In the generic NE type other we included the annotations for product (208), event (93) and other (174), because of discrepancies in the SUC3.0 annotation.
A schematic version of this label harmonization is sketched below.
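The harmonization described above can be summarized as a simple type mapping. The dictionary and function below are an illustrative sketch only (the names SUC_TO_EVAL and harmonize are not from the paper); they merely restate the merges listed on this slide.

```python
# Sketch of the SUC3.0 NE-type harmonization applied before evaluation
# (illustrative; the actual pre-processing scripts are not published here).
SUC_TO_EVAL = {
    "person":  "person",
    "animal":  "person",   # 68 instances merged into person
    "myth":    "person",   # 18 instances merged into person
    "inst":    "inst",
    "place":   "place",
    "work":    "work",
    "product": "other",    # 208 instances
    "event":   "other",    # 93 instances
    "other":   "other",    # 174 instances
}

def harmonize(ne_type: str) -> str:
    """Map an original SUC3.0 NE type to the category used in the evaluation."""
    return SUC_TO_EVAL.get(ne_type.lower(), "other")
```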

Evaluation
- All annotated texts were converted to the CoNLL data format (columns separated by a single space), and the conlleval script <www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt> was then used for the evaluation of the automatic NER; a minimal sketch of the format is shown below.
- Tokens not part of an entity are tagged O (Outside); B stands for Begin(ning of an entity) and I stands for Inside (an entity).
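A minimal sketch (not the authors' actual conversion script) of how annotated tokens can be written out in the format conlleval expects: one token per line, columns separated by a single space, with the gold BIO tag and the predicted BIO tag as the last two columns, and a blank line between sentences. The function name and the example tags are illustrative only.

```python
def write_conll(sentences, path):
    """sentences: list of sentences, each a list of (token, gold_tag, pred_tag)
    triples, e.g. ("Dagbladet", "I-work", "I-inst")."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            for token, gold, pred in sentence:
                # one token per line: token, gold tag, predicted tag
                out.write(f"{token} {gold} {pred}\n")
            out.write("\n")  # blank line separates sentences

# Example fragment "... artikeln i Svenska Dagbladet ..." with a hypothetical
# system output that disagrees with the gold annotation on the entity type:
example = [[("artikeln", "O", "O"),
            ("i", "O", "O"),
            ("Svenska", "B-work", "B-inst"),
            ("Dagbladet", "I-work", "I-inst")]]
write_conll(example, "eval.conll")
# The Perl script is then run over the file, e.g.:  conlleval.pl < eval.conll
```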

Comparison and Evaluation 1 (SUC3.0)

  Category   P        P (Stagger)   R        R (Stagger)   FB1*     FB1 (Stagger)   Gold (based on)
  person-b   95.80%   98.85%        90.28%   96.40%        92.96%   97.61%          6,264
  person-i   93.90%   98.04%        88.46%   95.33%        91.10%   96.66%          6,795
  place-b    94.74%   97.36%        78.26%   89.48%        85.71%   93.28%          619
  place-i    89.20%   96.92%        73.71%   81.39%        80.72%   88.48%          741
  inst-b     93.35%   97.46%        64.79%   88.23%        76.49%   92.62%          1,521
  inst-i     90.39%   96.44%        62.73%   81.85%        74.06%   88.55%          2,130
  work-b     70.73%   81.31%        25.47%   60.86%        37.45%   69.61%          1,022
  work-i     54.27%   80.39%        20.15%   48.92%        29.39%   60.83%          2,513
  other-b    89.68%   93.64%        62.73%   80.63%        73.82%   86.65%          475
  other-i    80.80%   95.41%        56.29%   74.32%        66.35%   83.55%          675

* FB1 = 2*P*R / (P+R)
** Scores computed with conlleval: <https://github.com/mvanerp/ner/blob/master/scripts/conlleval.pl> by Erik Tjong Kim Sang
*** Note: Stagger was trained on the SUC3.0 NE annotation!
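As a quick sanity check of the FB1 column, the footnote formula can be applied to the first row of the table (person-b, non-Stagger figures); this is just a worked example of the harmonic mean, not part of the original evaluation code.

```python
# FB1 = 2*P*R / (P + R), checked against the person-b row above
p, r = 95.80, 90.28
fb1 = 2 * p * r / (p + r)
print(round(fb1, 2))  # 92.96, matching the reported value
```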

Comparison and Evaluation 2 (SW+SIC)

  Category   P        P (Stagger)   R        R (Stagger)   FB1      FB1 (Stagger)   Gold (based on)
  person-b   75.78%   49.6%         89.04%   84.93%        81.76%   62.63%          73
  person-i   75.45%   74.76%        88.30%   81.91%        81.37%   78.17%          94
  place-b    78.57%   47.06%        68.75%   16.67%        73.33%   24.62%          48
  place-i    76.74%   100%          67.35%   8.16%         71.74%   15.09%          49
  inst-b     71.15%   50%           49.33%   21.33%        58.27%   29.91%          75
  inst-i     67.12%   58.06%        46.67%   17.14%        55.06%   26.47%          105
  work-b     66.67%   12.50%        38.71%   5%            48.98%   7.14%           20
  work-i     77.42%   11.11%        40%      2.33%         52.75%   3.85%           43
  other-b    64.29%   50%           30%      4.88%         40.91%   8.89%           41
  other-i    76.47%   40%           27.66%   3.12%         40.62%   5.8%            64
  time-b     91.03%   -             84.58%   -             87.69%   -               240
  time-i     98.21%   -             81.32%   -             88.97%   -               471

(No Stagger figures are given for the time category.)

Error Analysis, some observations
- The NE type work seems to be the most difficult MWE-NE to identify; usually there are no orthographic or other identifiable signs in its immediate context, and the use of common vocabulary makes things even more difficult. E.g. kk48-011: Vi hade tidigare spelat en komedi, <work>de båda direktörerna</work>. ("We had previously played a comedy, The Two Directors.")
- Inconsistent annotation, e.g. between work and inst; in both cases below the annotation should have been work. kk72-126: [...] efter artikeln i <inst>svenska Dagbladet</inst> [...] ("[...] after the article in Svenska Dagbladet [...]"); while in kl10-046 the same entity is annotated as: [...] annonsen kommer i <work>svenska Dagbladet</work> [...] ("[...] the advertisement is posted in Svenska Dagbladet [...]").

Error Analysis, some observations (continued)
- The types inst and other exhibit very low recall for various reasons, e.g. systematic polysemy between an organization and a location: in file jg05b-005, [...] mottagningen på <inst>sandvikens sjukhus</inst> ("[...] the reception at Sandviken hospital"), where the annotation obtained from the NER system was place, and probably correctly so.
- Or for other, less obvious reasons, as e.g. in file he06d-002: I <other>konsum Huddinge centrum</other> är en torgyta intill [...] ("In Konsum at Huddinge center there is a square area next to [...]"), where the annotation obtained from the NER system was once again place.

Conclusions
- We presented an experiment to automatically annotate and evaluate Swedish MWE-NEs.
- The evaluation results show a large variation with respect to the type of NE concerned, with the worst results found for the categories work and other.
- During the analysis of the SUC3.0 MWE-NEs we discovered inconsistencies and discrepancies that affect the results negatively. A newer version of these resources, with the inconsistencies resolved, could contribute to a much more reliable gold standard for Swedish NER (e.g. for training and/or testing).