Named Entity Recognition Experiments on Turkish Texts




Dilek Küçük (1) and Adnan Yazıcı (2)
(1) TÜBİTAK - Uzay Institute, Ankara, Turkey, dilek.kucuk@uzay.tubitak.gov.tr
(2) Dept. of Computer Engineering, METU, Ankara, Turkey, yazici@ceng.metu.edu.tr

Outline
- Introduction
- Named Entity Recognition in Turkish
- Evaluation
  - Evaluation on News Texts
  - Evaluation on Child Stories and Historical Texts
  - Evaluation on Video Texts
- Future Work
- Conclusion

Introduction [1]
- Named entity recognition (NER) is one of the main information extraction (IE) tasks: the recognition of names of people, locations, and organizations, as well as temporal and numeric expressions, in texts (Nadeau and Sekine, 2007).
- NER is often regarded as a solved problem, especially for English, where state-of-the-art performance is above 90%.

Introduction [2]
- NER research on Turkish is rare. Previous work includes:
  - A language-independent IE system (Cucerzan and Yarowsky, 1999)
  - A statistical name tagger for Turkish (Tür et al., 2003)
  - A person name tagger for financial news texts (Bayraktar and Taşkaya-Temizel, 2008)
  - A person mention extractor and a string-matching-based coreference resolver (Küçük and Yazıcı, 2008)

Introduction [3]
- In this study, we present a rule-based system for named entity recognition from Turkish texts.
- The system is proposed for the domain of news texts.
- It is evaluated on:
  - Newswire texts
  - Child stories and historical texts
  - News video transcriptions

Named Entity Recognition in Turkish [1]
- The target domain is news texts.
- News texts from the METU Turkish corpus (Say et al., 2002) are examined.
- Capitalization and punctuation clues are not utilized, since they may be missing in automatic speech recognition (ASR) outputs and in texts obtained from the Web.

Named Entity Recognition in Turkish [2]
- A set of information sources has been compiled.

Named Entity Recognition in Turkish [3]
- The lexical resources include:
  - a dictionary of person names in Turkish comprising about 8300 entries,
  - a list of well-known political people,
  - a list of well-known locations (the names of cities and towns) in Turkey as well as in the world,
  - a list of well-known organizations in Turkey and in the world.

Named Entity Recognition in Turkish [4]
- Pattern bases are used for the extraction of location and organization names as well as numeric and temporal expressions.
- The system makes use of a simple morphological analyzer to validate candidates.
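The interplay of lexical resources, pattern bases, and morphological validation can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy lexicons, the suffix list standing in for the morphological analyzer, the single organization pattern, and all function names are assumptions made for the example.

```python
import re

# Hypothetical toy lexicons; the resources described on the slides are far larger.
PERSON_NAMES = {"ahmet", "ayşe"}
LOCATIONS = {"ankara", "istanbul"}

# Hypothetical pattern base with one entry: a token followed by an organization keyword.
ORG_PATTERN = re.compile(r"\b\w+ (?:üniversitesi|bakanlığı)\b")

# Crude stand-in for a morphological analyzer: a few case suffixes to strip.
SUFFIXES = ("dan", "den", "da", "de", "ya", "ye", "nın", "nin")


def turkish_lower(text):
    """Lowercase with Turkish dotted/dotless i handling, so case carries no signal."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def stem_candidates(token):
    """Yield the token itself, then the token with one known suffix removed."""
    yield token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            yield token[: -len(suffix)]


def recognize(text):
    """Return (entity, type) pairs found by lexicon lookup and pattern matching."""
    tokens = turkish_lower(text).split()
    entities = []
    for token in tokens:
        for stem in stem_candidates(token):
            if stem in PERSON_NAMES:
                entities.append((stem, "PERSON"))
                break
            if stem in LOCATIONS:
                entities.append((stem, "LOCATION"))
                break
    for match in ORG_PATTERN.finditer(" ".join(tokens)):
        entities.append((match.group(0), "ORGANIZATION"))
    return entities


print(recognize("Ahmet dün Ankarada Gazi Üniversitesi öğrencileriyle buluştu"))
# [('ahmet', 'PERSON'), ('ankara', 'LOCATION'), ('gazi üniversitesi', 'ORGANIZATION')]
```

Note that the input is lowercased before any lookup, mirroring the decision not to rely on capitalization, and that "Ankarada" is only recognized once the suffix is stripped.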

Evaluation [1]
- The system tags its output with Message Understanding Conference (MUC) style named entity tags: ENAMEX, TIMEX, and NUMEX.
- An annotation tool is developed to annotate the evaluation texts with the same tags to create answer sets.
- Evaluation is performed by comparing the system output with the answer set.
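To make the tag format concrete, here is a small sketch that wraps already-identified character spans in MUC-style elements. The helper names and the example sentence are invented for illustration; the slides do not show the system's actual output format beyond the three element names.

```python
def span(text, phrase, element, entity_type):
    """Locate the first occurrence of phrase and return (start, end, element, type)."""
    start = text.index(phrase)
    return (start, start + len(phrase), element, entity_type)


def tag_muc(text, spans):
    """Wrap non-overlapping character spans in MUC-style tags."""
    pieces = []
    position = 0
    for start, end, element, entity_type in sorted(spans):
        pieces.append(text[position:start])
        pieces.append(f'<{element} TYPE="{entity_type}">{text[start:end]}</{element}>')
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


# Invented example sentence, not taken from the evaluation data.
sentence = "Ahmet Yılmaz 12 Mart 2008 günü Ankara ziyaretini tamamladı"
spans = [
    span(sentence, "Ahmet Yılmaz", "ENAMEX", "PERSON"),
    span(sentence, "12 Mart 2008", "TIMEX", "DATE"),
    span(sentence, "Ankara", "ENAMEX", "LOCATION"),
]
tagged = tag_muc(sentence, spans)
print(tagged)
```

Producing the answer set and the system output in the same tagged form is what makes a direct comparison between them possible.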

Evaluation [2]
- The Annotation Tool (figure)

Evaluation [3]
- Evaluation is performed in terms of precision, recall, and f-measure.
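A minimal sketch of how these three metrics could be computed from an answer set and a system output is given below. It assumes exact-match scoring, with an entity counting as correct only if both its text span and its type agree, and a balanced f-measure; the slides do not state whether partial matches receive credit, so treat both as assumptions.

```python
def evaluate(answer_set, system_output):
    """Return (precision, recall, f_measure) under exact-match scoring."""
    answer = set(answer_set)
    system = set(system_output)
    correct = len(answer & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(answer) if answer else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented toy data: entities are (start, end, type) triples.
answer = {(0, 12, "PERSON"), (13, 25, "DATE"), (31, 37, "LOCATION"), (40, 52, "ORGANIZATION"), (60, 64, "PERCENT")}
system = {(0, 12, "PERSON"), (13, 25, "DATE"), (31, 37, "LOCATION"), (70, 75, "PERSON")}

precision, recall, f_measure = evaluate(answer, system)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

Here three of the four system entities are correct and three of the five annotated entities are found, so precision is 0.75 and recall is 0.60.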

Evaluation on News Text [1]

Evaluation on News Text [2]

Evaluation on News Text [3]
- The precision of person name recognition using only a dictionary of person names turns out to be too low, since many person names are also common words.
  - Savaş (war), barış (peace), özen (care)
- During location and organization name recognition, the system performs erroneous extractions.
  - anlatmanın yolu (the way to tell), ilk üniversitesi (first university)
- Organization name recognition also suffers from erroneous extractions in the case of compound organization names.
  - İstanbul Üniversitesi Siyasal Bilgiler Fakültesi (İstanbul University Political Science Faculty) is extracted as İstanbul Üniversitesi and Bilgiler Fakültesi.
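The error types above are easy to reproduce with a toy recognizer. The sketch below is illustrative only: the three-entry name dictionary and the "one token plus keyword" organization pattern are assumptions chosen to mimic the reported behaviour, not the system's actual rules. Input is assumed to be already lowercased, since capitalization is not used.

```python
import re

# Hypothetical dictionary: all three entries are person names and common nouns.
PERSON_NAMES = {"savaş", "barış", "özen"}

# Hypothetical pattern: exactly one preceding token plus an organization keyword.
ORG_PATTERN = re.compile(r"\b\w+ (?:üniversitesi|fakültesi)\b")


def dictionary_person_matches(text):
    """Flag every token found in the name dictionary, regardless of context."""
    return [token for token in text.split() if token in PERSON_NAMES]


def pattern_org_matches(text):
    """Return every substring matching the one-token organization pattern."""
    return [match.group(0) for match in ORG_PATTERN.finditer(text)]


# "a book on war and peace": both common nouns are flagged as person names.
print(dictionary_person_matches("savaş ve barış üzerine bir kitap"))
# ['savaş', 'barış']

# "the country's first university": a false organization name.
print(pattern_org_matches("ülkenin ilk üniversitesi"))
# ['ilk üniversitesi']

# The compound name is split into two partial organization names.
print(pattern_org_matches("istanbul üniversitesi siyasal bilgiler fakültesi"))
# ['istanbul üniversitesi', 'bilgiler fakültesi']
```

Without capitalization to separate "Savaş" from "savaş", a dictionary lookup alone cannot tell the name from the noun, which is consistent with the low precision reported.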

Evaluation on News Text [4]
- As opposed to the statistical system (Tür et al., 2003), the rule-based system considers numeric and temporal expressions in addition to person, location, and organization names.
- The statistical system has been trained on a set of news articles with 492,821 words (37,277 NEs).
- The statistical system has been tested on a news article set of about 28,000 words (2,197 NEs) and has achieved a best performance of 91.56% in f-measure.
- The rule-based system has been tested on a set of 20,131 words (1,591 NEs) and has achieved an f-measure of 78.7%.
- The statistical system performs deeper language processing than the rule-based system.

Evaluation on Child Stories and Historical Texts [1]
- The child stories set comprises two stories by the same author (Ilgaz, 2003a-b).
- The historical text includes the first three chapters of a book describing five cities mostly on their historical basis (Tanpınar, 2007).

Evaluation on Child Stories and Historical Texts [2]
- The main problem for the child stories data set is the presence of foreign person names throughout the stories.
- The performance drop for the historical text is due to the absence of historical person names and organizations (such as the names of empires) from the lexical resources.
- The results are in line with the well-known finding that rule-based systems suffer from performance degradation when ported to other domains.

Evaluation on Video Texts [1]
- An important research area which can benefit from IE techniques is automatic multimedia annotation.
- Several studies employ NER output in particular for semantic multimedia annotation:
  - A multimedia indexing system for English, German, and Dutch football videos (Saggion et al., 2004)
  - A video annotation system for Italian news videos (Basili et al., 2005)
  - An automatic annotation system for BBC radio and TV news (Dowman et al., 2005)

Evaluation on Video Texts [2]
- We have compiled a video data set of Turkish news videos
  - from the Web site of the Turkish Radio and Television Company (TRT),
  - comprising 16 videos with a total duration of two hours.
- The videos are manually transcribed, leading to a text of 9804 words, since no general-purpose automatic speech recognizer exists for Turkish.

Evaluation on Video Texts [3]
- The transcription text is annotated with named entity tags, resulting in 1090 named entities (256 person, 479 location, and 222 organization names; 70 numeric and 63 temporal expressions).
- Evaluation of the recognizer on the text resulted in a precision of 73.3%, a recall of 77.0%, and hence an f-measure of 75.1%.
- The results on video transcriptions are satisfactory for a first attempt at named entity recognition on genuine video texts.
- It is a significant step towards the employment of IE techniques for semantic annotation of videos in Turkish.
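As a quick consistency check on the figures above, assuming the f-measure is the balanced harmonic mean of precision and recall (the slides do not spell out the formula), the reported values agree with each other, and the per-type counts add up to the reported total:

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \times 73.3 \times 77.0}{73.3 + 77.0}
  = \frac{11288.2}{150.3}
  \approx 75.1
\]
\[
256 + 479 + 222 + 70 + 63 = 1090
\]
```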

Future Work
- Future work based on the current study includes:
  - Improving the system based on the error analyses.
  - Extending the system to output finer-grained named entity classes by employing a named entity ontology.
  - Employing machine learning algorithms for the NER task, so that the results can be compared with those of the rule-based recognizer.

Conclusion [1]
- Information extraction in Turkish is a rarely studied research area.
- In this study, we have presented a rule-based system for named entity recognition from Turkish texts. The system:
  - was initially engineered for news texts,
  - employs a set of lexical resources and pattern bases,
  - being rule-based, needs no training data,
  - has been evaluated on diverse text types including news texts, child stories, historical texts, and news video transcriptions.

Conclusion [2]
- The evaluation results for the news texts and news video transcriptions are satisfactory for a first attempt.
- Yet, the results for child stories and historical texts are very low.
  - This is in line with the finding that rule-based IE systems suffer from a considerable performance drop when evaluated on other domains.

References [1]
1. Roberto Basili, Marco Cammisa, and Emanuale Donati. RitroveRAI: A web application for semantic indexing and hyperlinking of multimedia news. In Proceedings of the International Semantic Web Conference, 2005.
2. Özkan Bayraktar and Tuğba Taşkaya-Temizel. Person name extraction from Turkish financial news text using local grammar based approach. In Proceedings of the International Symposium on Computer and Information Sciences, 2008.
3. Silviu Cucerzan and David Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
4. Mike Dowman, Valentin Tablan, Hamish Cunningham, and Borislav Popov. Web-assisted annotation, semantic indexing and search of television and radio news. In Proceedings of the International Conference on World Wide Web (WWW), 2005.
5. Rıfat Ilgaz. Bacaksız Kamyon Sürücüsü. Çınar Publications, 2003.
6. Rıfat Ilgaz. Bacaksız Tatil Köyünde. Çınar Publications, 2003.

References [2]
7. Dilek Küçük and Adnan Yazıcı. Identification of coreferential chains in video texts for semantic annotation of news videos. In Proceedings of the International Symposium on Computer and Information Sciences, 2008.
8. David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3-26, 2007.
9. Bilge Say, Deniz Zeyrek, Kemal Oflazer, and Umut Özge. Development of a corpus and a treebank for present-day written Turkish. In Proceedings of the 11th International Conference of Turkish Linguistics (ICTL), 2002.
10. Ahmet Hamdi Tanpınar. Beş Şehir. Dergah Publications, 2007.
11. Horacio Saggion, Hamish Cunningham, Kalina Bontcheva, Diana Maynard, Oana Hamza, and Yorick Wilks. Multimedia indexing through multi-source and multi-language information extraction: MUMIS project. Data and Knowledge Engineering, 48:247-264, 2004.
12. Gökhan Tür, Dilek Hakkani-Tür, and Kemal Oflazer. A statistical information extraction system for Turkish. Natural Language Engineering, 9(2):181-210, 2003.

Thank You