Named Entity Recognition Experiments on Turkish Texts

Dilek Küçük (1) and Adnan Yazıcı (2)
(1) TÜBİTAK - Uzay Institute, Ankara - Turkey
(2) Dept. of Computer Engineering, METU, Ankara - Turkey

Outline
Introduction
Named Entity Recognition in Turkish
Evaluation
  Evaluation on News Texts
  Evaluation on Child Stories and Historical Texts
  Evaluation on Video Texts
Future Work
Conclusion

Introduction [1]
Named entity recognition (NER) is one of the main information extraction (IE) tasks: the recognition of names of people, locations, and organizations, as well as temporal and numeric expressions, in texts (Nadeau and Sekine, 2007).
The NER task is generally considered a solved problem, especially for English, with state-of-the-art performance above 90%.

Introduction [2]
NER research on Turkish is rare. Existing work includes:
  Language-independent IE system (Cucerzan and Yarowsky, 1999)
  Statistical name tagger for Turkish (Tür et al., 2003)
  Person name tagger for financial news texts (Bayraktar and Taşkaya-Temizel, 2008)
  Person mention extractor and a string-matching-based coreference resolver (Küçük and Yazıcı, 2008)

Introduction [3]
In this study, we present a rule-based system for named entity recognition from Turkish texts.
The system is proposed for the domain of news texts.
It is evaluated on:
  Newswire texts
  Child stories and historical texts
  News video transcriptions

Named Entity Recognition in Turkish [1]
The target domain is news texts.
News texts from the METU Turkish corpus (Say et al., 2002) are examined.
Capitalization and punctuation clues are not used, since they may be missing in automatic speech recognition (ASR) outputs and in texts obtained from the Web.

Named Entity Recognition in Turkish [2]
A set of information sources has been compiled.

Named Entity Recognition in Turkish [3]
The lexical resources include:
  a dictionary of person names in Turkish comprising about 8300 entries,
  a list of well-known political people,
  a list of well-known locations (the names of cities and towns) in Turkey as well as in the world,
  a list of well-known organizations in Turkey and those in the world.

Named Entity Recognition in Turkish [4]
Pattern bases cover the extraction of location/organization names as well as numeric/temporal expressions.
The system makes use of a simple morphological analyzer to validate candidates.
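The following is a minimal Python sketch of how lexicon lookup and a pattern base might be combined without relying on capitalization. The miniature word lists, the single organization pattern, and the names `tr_lower` and `tag` are illustrative assumptions, not the system's actual resources or rules, and the morphological validation step is left out.

```python
import re

# Hypothetical miniature resources; the real lexicons are far larger.
PERSON_NAMES = {"ahmet", "ayşe"}
LOCATIONS = {"ankara", "istanbul"}
# Hypothetical pattern: one token followed by an organization head word.
ORG_PATTERN = re.compile(r"\b\w+ (?:üniversitesi|bakanlığı)\b")


def tr_lower(text):
    """Lowercase text, mapping Turkish dotted and dotless I correctly."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def tag(text):
    """Return (entity, type) pairs found by patterns and lexicon lookup."""
    lowered = tr_lower(text)  # capitalization is deliberately discarded
    entities = []
    covered = []  # character spans already claimed by a pattern match
    for match in ORG_PATTERN.finditer(lowered):
        entities.append((match.group(0), "ORGANIZATION"))
        covered.append(match.span())
    for match in re.finditer(r"\w+", lowered):
        if any(start <= match.start() < end for start, end in covered):
            continue  # token is part of an organization name
        token = match.group(0)
        if token in PERSON_NAMES:
            entities.append((token, "PERSON"))
        elif token in LOCATIONS:
            entities.append((token, "LOCATION"))
    return entities
```

Because the input is lowercased before matching, the same entities are found whether the text arrives in mixed case, all capitals, or all lowercase, which is the situation described for ASR output and Web text.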

Evaluation [1]
The system tags its output with Message Understanding Conference (MUC) style named entity tags: ENAMEX, TIMEX, and NUMEX.
An annotation tool is developed to annotate the evaluation texts with the same tags to create answer sets.
Evaluation is performed by comparing the answer set with the system output.
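As an illustration of MUC-style tagging, the Python sketch below reads entities back out of a tagged string. The sample sentence is invented for this example, and the regular expression and the function name `extract_entities` are assumptions; the system's actual output routine and the annotation tool's file format are not described in the slides.

```python
import re

# Matches <ENAMEX TYPE="...">...</ENAMEX> and the TIMEX / NUMEX equivalents.
TAG_RE = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(tagged_text):
    """Return (tag, type, surface form) triples in order of appearance."""
    return [m.groups() for m in TAG_RE.finditer(tagged_text)]


# Invented example sentence, not taken from the evaluation data.
sample = (
    '<ENAMEX TYPE="PERSON">Ayşe Yılmaz</ENAMEX> '
    '<TIMEX TYPE="DATE">5 Mayıs 2008</TIMEX> tarihinde '
    '<ENAMEX TYPE="LOCATION">Ankara</ENAMEX> şehrine gitti.'
)

for tag_name, entity_type, surface in extract_entities(sample):
    print(tag_name, entity_type, surface)
```

Applying the same extraction to both the answer set and the system output yields two entity lists that can then be compared.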

Evaluation [2]
The Annotation Tool

Evaluation [3]
Evaluation is performed in terms of precision, recall, and f-measure.
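The slide's formulas are not preserved in this transcription. As a hedged sketch, the Python below computes the three metrics under the standard exact-match definitions, treating each entity as a (start, end, type) triple; this representation and the function name `precision_recall_f` are assumptions, and the study may have used a different scoring scheme (for example, one giving credit for partial matches).

```python
def precision_recall_f(answer, system):
    """Exact-match precision, recall, and balanced f-measure.

    answer: set of (start, end, type) triples from the answer set
    system: set of (start, end, type) triples from the system output
    """
    correct = len(answer & system)  # same span and same type
    precision = correct / len(system) if system else 0.0
    recall = correct / len(answer) if answer else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data: the system finds two of four entities exactly and
# truncates the span of a third, which therefore counts as an error.
answer = {(0, 5, "PERSON"), (10, 16, "LOCATION"),
          (20, 30, "ORGANIZATION"), (35, 39, "DATE")}
system = {(0, 5, "PERSON"), (10, 16, "LOCATION"), (20, 25, "ORGANIZATION")}
p, r, f = precision_recall_f(answer, system)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```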

Evaluation on News Text [1]

Evaluation on News Text [2]

Evaluation on News Text [3]
The precision of person name recognition using only a dictionary of person names turns out to be too low, since many person names are also common words.
  Savaş ("war"), barış ("peace"), özen ("care")
The system also makes erroneous extractions during location and organization name recognition.
  anlatmanın yolu ("the way to tell"), ilk üniversitesi ("first university")
Organization name recognition also suffers from erroneous extractions in the case of compound organization names.
  İstanbul Üniversitesi Siyasal Bilgiler Fakültesi (İstanbul University Political Science Faculty) is extracted as two names: İstanbul Üniversitesi and Bilgiler Fakültesi

Evaluation on News Text [4]
Unlike the statistical system (Tür et al., 2003), the rule-based system handles numeric and temporal expressions in addition to person, location, and organization names.
The statistical system has been trained on a set of news articles with words (37277 NEs).
The statistical system has been tested on a news article set of about words (2197 NEs) and has achieved a best performance of % in f-measure.
The rule-based system has been tested on a set of words (1591 NEs) and achieved an f-measure of 78.7%.
The statistical system performs deeper language processing than the rule-based system.

Evaluation on Child Stories and Historical Texts [1]
The child stories set comprises two stories by the same author (Ilgaz, 2003a-b).
The historical text comprises the first three chapters of a book describing five cities, mostly from a historical perspective (Tanpınar, 2007).

Evaluation on Child Stories and Historical Texts [2]
The main problem for the child stories data set is the presence of foreign person names throughout the stories.
The performance drop for the historical text is due to the absence of historical person names and organizations (such as the names of empires) from the lexical resources.
The results are in line with the well-known finding that rule-based systems suffer from performance degradation when ported to other domains.

Evaluation on Video Texts [1]
Automatic multimedia annotation is an important research area that can benefit from IE techniques.
Several studies employ IE output, especially NER output, for semantic multimedia annotation:
  Multimedia indexing system for English, German and Dutch football videos (Saggion et al., 2004)
  Video annotation system for Italian news videos (Basili et al., 2005)
  Automatic annotation system for BBC radio and TV news (Dowman et al., 2005)

Evaluation on Video Texts [2]
We have compiled a video data set of Turkish news videos from the Web site of Turkish Radio and Television Company (TRT), comprising 16 videos with a total duration of two hours.
The videos are manually transcribed, leading to a text of 9804 words, since no general-purpose automatic speech recognizer exists for Turkish.

Evaluation on Video Texts [3]
The transcription text is annotated with named entity tags, resulting in 1090 named entities (256 person, 479 location, and 222 organization names, 70 numeric and 63 temporal expressions).
Evaluation of the recognizer on the text resulted in a precision of 73.3% and a recall of 77.0%, giving an f-measure of 75.1%.
The results on video transcriptions are satisfactory for a first attempt at named entity recognition on genuine video texts.
They are a significant step towards the employment of IE techniques for semantic annotation of videos in Turkish.
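The reported figures can be cross-checked with a few lines of Python. The sketch assumes the f-measure is the balanced harmonic mean of precision and recall, which the slides do not state explicitly; the numbers themselves are taken from the text above.

```python
# Reported precision and recall on the video transcription text (percent).
precision, recall = 73.3, 77.0
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 1))

# Reported per-type entity counts in the annotated transcription.
counts = {"person": 256, "location": 479, "organization": 222,
          "numeric": 70, "temporal": 63}
print(sum(counts.values()))
```

Under that assumption the computed f-measure rounds to the reported 75.1, and the per-type counts add up to the reported total of 1090 named entities.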

Future Work
Future work based on the current study includes:
  Improving the system based on the error analyses.
  Extending the system to output finer-grained named entity classes by employing a named entity ontology.
  Employing machine learning algorithms for the NER task, so that the results can be compared with those of the rule-based recognizer.

Conclusion [1]
Information extraction in Turkish is a rarely studied research area.
In this study, we have presented a rule-based system for named entity recognition from Turkish texts. The system:
  was initially engineered for news texts;
  employs a set of lexical resources and pattern bases;
  needs no training data, being a rule-based system;
  has been evaluated on diverse text types including news texts, child stories, historical texts, and news video transcriptions.

Conclusion [2]
The evaluation results for the news texts and news video transcriptions are satisfactory for a first attempt.
Yet, the results for child stories and historical texts are very low, in line with the finding that rule-based IE systems suffer from a considerable performance drop when evaluated on other domains.

References [1]
1. Roberto Basili, Marco Cammisa, and Emanuale Donati. RitroveRAI: A web application for semantic indexing and hyperlinking of multimedia news. In Proceedings of International Semantic Web Conference, 2005.
2. Özkan Bayraktar and Tuğba Taşkaya-Temizel. Person name extraction from Turkish financial news text using local grammar based approach. In Proceedings of the International Symposium on Computer and Information Sciences, 2008.
3. Silviu Cucerzan and David Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
4. Mike Dowman, Valentin Tablan, Hamish Cunningham, and Borislav Popov. Web-assisted annotation, semantic indexing and search of television and radio news. In Proceedings of the International Conference on World Wide Web (WWW), 2005.
5. Rıfat Ilgaz. Bacaksız Kamyon Sürücüsü. Çınar Publications, 2003a.
6. Rıfat Ilgaz. Bacaksız Tatil Köyünde. Çınar Publications, 2003b.

References [2]
7. Dilek Küçük and Adnan Yazıcı. Identification of coreferential chains in video texts for semantic annotation of news videos. In Proceedings of the International Symposium on Computer and Information Sciences, 2008.
8. David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, vol. 30, no. 1, 2007.
9. Bilge Say, Deniz Zeyrek, Kemal Oflazer, and Umut Özge. Development of a corpus and a treebank for present-day written Turkish. In Proceedings of the 11th International Conference of Turkish Linguistics (ICTL), 2002.
10. Ahmet Hamdi Tanpınar. Beş Şehir. Dergah Publications, 2007.
11. Horacio Saggion, Hamish Cunningham, Kalina Bontcheva, Diana Maynard, Oana Hamza, and Yorick Wilks. Multimedia indexing through multi-source and multi-language information extraction: MUMIS project. Data and Knowledge Engineering, 48, 2004.
12. Gökhan Tür, Dilek Hakkani-Tür, and Kemal Oflazer. A statistical information extraction system for Turkish. Natural Language Engineering, 9(2), 2003.

Thank You


More information

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering

More information

Annotation and Evaluation of Swedish Multiword Named Entities

Annotation and Evaluation of Swedish Multiword Named Entities Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden dimitrios.kokkinakis@svenska.gu.se Introduction

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Survey Results: Requirements and Use Cases for Linguistic Linked Data Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group

More information

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS Gürkan Şahin 1, Banu Diri 1 and Tuğba Yıldız 2 1 Faculty of Electrical-Electronic, Department of Computer Engineering

More information

SDL BeGlobal: Machine Translation for Multilingual Search and Text Analytics Applications

SDL BeGlobal: Machine Translation for Multilingual Search and Text Analytics Applications INSIGHT SDL BeGlobal: Machine Translation for Multilingual Search and Text Analytics Applications José Curto David Schubmehl IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

BANNER: AN EXECUTABLE SURVEY OF ADVANCES IN BIOMEDICAL NAMED ENTITY RECOGNITION

BANNER: AN EXECUTABLE SURVEY OF ADVANCES IN BIOMEDICAL NAMED ENTITY RECOGNITION BANNER: AN EXECUTABLE SURVEY OF ADVANCES IN BIOMEDICAL NAMED ENTITY RECOGNITION ROBERT LEAMAN Department of Computer Science and Engineering, Arizona State University GRACIELA GONZALEZ * Department of

More information

Integrating Annotation Tools into UIMA for Interoperability

Integrating Annotation Tools into UIMA for Interoperability Integrating Annotation Tools into UIMA for Interoperability Scott Piao, Sophia Ananiadou and John McNaught School of Computer Science & National Centre for Text Mining The University of Manchester UK {scott.piao;sophia.ananiadou;john.mcnaught}@manchester.ac.uk

More information

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD. Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.

More information

Accelerating Corporate Research in the Development, Application and Deployment of Human Language Technologies

Accelerating Corporate Research in the Development, Application and Deployment of Human Language Technologies Accelerating Corporate Research in the Development, Application and Deployment of Human Language Technologies David Ferrucci IBM T.J. Watson Research Center Yorktown Heights, NY 10598 ferrucci@us.ibm.com

More information

Constructing Dictionaries for Named Entity Recognition on Specific Domains from the Web

Constructing Dictionaries for Named Entity Recognition on Specific Domains from the Web Constructing Dictionaries for Named Entity Recognition on Specific Domains from the Web Keiji Shinzato 1, Satoshi Sekine 2, Naoki Yoshinaga 3, and Kentaro Torisawa 4 1 Graduate School of Informatics, Kyoto

More information

Information Extraction. Information Extraction. IE: History. IE: Definition. Definition. History. Architecture of IE systems

Information Extraction. Information Extraction. IE: History. IE: Definition. Definition. History. Architecture of IE systems Information Extraction Definition Information Extraction Katharina Kaiser http://www.ifs.tuwien.ac.at/~kaiser History Architecture of IE systems Wrapper systems Approaches Evaluation 2 IE: Definition IE:

More information

Context Grammar and POS Tagging

Context Grammar and POS Tagging Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 dick.chen@lexisnexis.com don.loritz@lexisnexis.com

More information

Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources

Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle

More information

Automatic Pronominal Anaphora Resolution in English Texts

Automatic Pronominal Anaphora Resolution in English Texts Computational Linguistics and Chinese Language Processing Vol. 9, No.1, February 2004, pp. 21-40 21 The Association for Computational Linguistics and Chinese Language Processing Automatic Pronominal Anaphora

More information

Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications

Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications Morphological Analysis and Named Entity Recognition for your Lucene / Solr Search Applications Berlin Berlin Buzzwords 2011, Dr. Christoph Goller, IntraFind AG Outline IntraFind AG Indexing Morphological

More information

Speech Processing Applications in Quaero

Speech Processing Applications in Quaero Speech Processing Applications in Quaero Sebastian Stüker www.kit.edu 04.08 Introduction! Quaero is an innovative, French program addressing multimedia content! Speech technologies are part of the Quaero

More information

Automated Annotation of Events Related to Central Venous Catheterization in Norwegian Clinical Notes

Automated Annotation of Events Related to Central Venous Catheterization in Norwegian Clinical Notes Automated Annotation of Events Related to Central Venous Catheterization in Norwegian Clinical Notes Ingrid Andås Berg Healthcare Informatics Submission date: March 2014 Supervisor: Øystein Nytrø, IDI

More information

Text Generation for Abstractive Summarization

Text Generation for Abstractive Summarization Text Generation for Abstractive Summarization Pierre-Etienne Genest, Guy Lapalme RALI-DIRO Université de Montréal P.O. Box 6128, Succ. Centre-Ville Montréal, Québec Canada, H3C 3J7 {genestpe,lapalme}@iro.umontreal.ca

More information

The role of named entities in Web People Search

The role of named entities in Web People Search The role of named entities in Web People Search Javier Artiles UNED NLP & IR group Madrid, Spain javart@bec.uned.es Enrique Amigó UNED NLP & IR group Madrid, Spain enrique@lsi.uned.es Julio Gonzalo UNED

More information

Corpus Design for a Unit Selection Database

Corpus Design for a Unit Selection Database Corpus Design for a Unit Selection Database Norbert Braunschweiler Institute for Natural Language Processing (IMS) Stuttgart 8 th 9 th October 2002 BITS Workshop, München Norbert Braunschweiler Corpus

More information

A Mutually Beneficial Integration of Data Mining and Information Extraction

A Mutually Beneficial Integration of Data Mining and Information Extraction In the Proceedings of the Seventeenth National Conference on Artificial Intelligence(AAAI-2000), pp.627-632, Austin, TX, 20001 A Mutually Beneficial Integration of Data Mining and Information Extraction

More information

viii Javier E. Díaz-Vera, Rosario Caballero

viii Javier E. Díaz-Vera, Rosario Caballero Introduction More than ten years ago, David Crystal (1997: 106) stated that most of the scientific, technological and academic information in the world is expressed in English and over 80% of all the information

More information

Automatic Pronominal Anaphora Resolution. in English Texts

Automatic Pronominal Anaphora Resolution. in English Texts Automatic Pronominal Anaphora Resolution in English Texts Tyne Liang and Dian-Song Wu Department of Computer and Information Science National Chiao Tung University Hsinchu, Taiwan Email: tliang@cis.nctu.edu.tw;

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

Statistical Analyses of Named Entity Disambiguation Benchmarks

Statistical Analyses of Named Entity Disambiguation Benchmarks Statistical Analyses of Named Entity Disambiguation Benchmarks Nadine Steinmetz, Magnus Knuth, and Harald Sack Hasso Plattner Institute for Software Systems Engineering, Potsdam, Germany, firstname.lastname@hpi.uni-potsdam.de

More information

How RAI's Hyper Media News aggregation system keeps staff on top of the news

How RAI's Hyper Media News aggregation system keeps staff on top of the news How RAI's Hyper Media News aggregation system keeps staff on top of the news 13 th Libre Software Meeting Media, Radio, Television and Professional Graphics Geneva - Switzerland, 10 th July 2012 Maurizio

More information

Giuseppe Riccardi, Marco Ronchetti. University of Trento

Giuseppe Riccardi, Marco Ronchetti. University of Trento Giuseppe Riccardi, Marco Ronchetti University of Trento 1 Outline Searching Information Next Generation Search Interfaces Needle E-learning Application Multimedia Docs Indexing, Search and Presentation

More information

Reverse Engineering a Rule-Based Finnish Named Entity Recognizer

Reverse Engineering a Rule-Based Finnish Named Entity Recognizer Reverse Engineering a Rule-Based Finnish Named Entity Recognizer Department of Modern Languages University of Helsinki June 9, 2015 Introduction How well can the behavior of a rule-based NE recognizer

More information

Multilingual XML-Based Named Entity Recognition for E-Retail Domains

Multilingual XML-Based Named Entity Recognition for E-Retail Domains Multilingual XML-Based Named Entity Recognition for E-Retail Domains Claire Grover, Scott McDonald, Donnla Nic Gearailt, Vangelis Karkaletsis Ý, Dimitra Farmakiotou Ý, Georgios Samaritakis Ý, Georgios

More information

Leveraging ASEAN Economic Community through Language Translation Services

Leveraging ASEAN Economic Community through Language Translation Services Leveraging ASEAN Economic Community through Language Translation Services Hammam Riza Center for Information and Communication Technology Agency for the Assessment and Application of Technology (BPPT)

More information