Priberam's question answering system for Portuguese




Priberam's question answering system for Portuguese
Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
Priberam Informática, Av. Defensores de Chaves, 32 3º Esq., 1000-119 Lisboa, Portugal
Tel.: +351 21 781 72 60 / Fax: +351 21 781 72 79
CLEF Workshop, Vienna, 21-23 September 2005

Summary
- Introduction
- A workbench for NLP
  - Lexical resources
  - Software tools
  - Question categorization
- System description
  - Indexing process
  - Question analysis
  - Document retrieval
  - Sentence retrieval
  - Answer extraction
- Evaluation & Results
- Conclusions

Introduction
- Goal: to build a question answering (QA) engine that finds a unique exact answer for natural language (NL) questions.
- Evaluation: QA@CLEF Portuguese monolingual task.
- Previous work by Priberam on this subject:
  - LegiX: a juridical information system;
  - SintaGest: a workbench for NLP;
  - TRUST project (Text Retrieval Using Semantics Technology): development of the Portuguese module in a cross-language environment.

Lexical resources
- Lexicon:
  - Lemmas, inflections and POS;
  - Sense definitions (*);
  - Semantic features, subcategorization and selection restrictions;
  - Ontological and terminological domains;
  - English and French equivalents (*);
  - Lexical-semantic relations (e.g. derivations).
- Thesaurus
- Ontology:
  - Multilingual (**) (English, French, Portuguese), which enables translations;
  - Designed by Synapse Développement for TRUST.
(*) Not used in the QA system.
(**) Only Portuguese information is used in the QA system.

Software tools
- Priberam's SintaGest, an NLP application that allows:
  - Building & testing a context-free grammar (CFG);
  - Building & testing contextual rules for:
    - Morphological disambiguation;
    - Named entity & fixed expressions recognition;
  - Building & testing patterns for question categorization/answer extraction;
  - Compressing & compiling all data into binary files.
- Statistical POS tagger:
  - Used together with contextual rules for morphological disambiguation;
  - HMM-based (2nd order), trained with the CETEMPublico corpus;
  - Fast & efficient performance, thanks to the Viterbi algorithm.

Question categorization (I)
- 86 question categories, flat structure: <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, ...
- Categorization is performed through rich patterns (more powerful than regular expressions).
- More than one category is allowed (avoiding hard decisions).
- Rich patterns are conditional expressions with words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases.
- Everything is built & tested through SintaGest.

Question categorization (II)
There are 3 kinds of patterns:
- Question patterns (QPs): for question categorization.
- Answer patterns (APs): for sentence categorization (during indexation).
- Question answering patterns (QAPs): for answer extraction.
Each pattern carries a heuristic score (the value after "="). In the example below, the patterns directly under Question (FUNCTION) are QPs, those in the nested Answer block are QAPs, and those under Answer (FUNCTION) are APs.

    Question (FUNCTION)
        : Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
            // e.g. Quem é Jorge Sampaio?
        : Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15
            // e.g. Que cargo desempenha Jorge Sampaio?
        Answer
            : Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20
                // e.g. Jorge Sampaio é o {Presidente da República}...
            : {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15
                // e.g. O {presidente da República}, Jorge Sampaio...
    ;
    Answer (FUNCTION)
        : QuestIdent(FUNCTION_N) = 10
        : Ergonym = 10
    ;

QA system overview
The system architecture is composed of 5 major modules: indexing, question analysis, document retrieval, sentence retrieval and answer extraction.
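To make the question patterns (QPs) above more concrete, here is a minimal Python sketch of how a pattern such as `Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15` could be matched against an analysed question. It is not SintaGest's implementation. The `Token` type, the tuple encoding of patterns, the reading of `Distance(lo, hi)` as "skip between lo and hi tokens", the POS tags other than `Nprop`, and the choice to keep the best score per category are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str  # surface form, lower-cased
    root: str  # lemma
    cat: str   # POS or entity tag (hypothetical tag set)

# Hypothetical pattern encoding: each element is one of
#   ("Word", w), ("Root", r), ("AnyCat", {tags})  -> consumes exactly one token
#   ("Distance", lo, hi)                          -> skips between lo and hi tokens
def match_from(pattern, tokens, p=0, t=0):
    """True if pattern[p:] matches a prefix of tokens[t:]."""
    if p == len(pattern):
        return True
    kind = pattern[p][0]
    if kind == "Distance":
        _, lo, hi = pattern[p]
        return any(match_from(pattern, tokens, p + 1, t + skip)
                   for skip in range(lo, hi + 1)
                   if t + skip <= len(tokens))
    if t >= len(tokens):
        return False
    tok = tokens[t]
    ok = ((kind == "Word" and tok.word == pattern[p][1]) or
          (kind == "Root" and tok.root == pattern[p][1]) or
          (kind == "AnyCat" and tok.cat in pattern[p][1]))
    return ok and match_from(pattern, tokens, p + 1, t + 1)

def categorize(tokens, rules):
    """Return {category: best score} over all rules matching anywhere in the question."""
    scores = {}
    for category, pattern, score in rules:
        if any(match_from(pattern, tokens, 0, s) for s in range(len(tokens))):
            scores[category] = max(scores.get(category, 0), score)
    return scores

# The first QP from the slide, in the hypothetical encoding above.
RULES = [
    ("FUNCTION",
     [("Word", "quem"), ("Distance", 0, 3), ("Root", "ser"), ("AnyCat", {"Nprop", "ENT"})],
     15),
]

# "Quem é Jorge Sampaio?" after (assumed) analysis, with the named entity as one token.
question = [Token("quem", "quem", "Pron"),
            Token("é", "ser", "V"),
            Token("jorge sampaio", "Jorge Sampaio", "Nprop")]
categories = categorize(question, RULES)
```

Because `categorize` returns a dictionary, a question can receive several categories at once, which mirrors the "more than one category is allowed" design on the slide.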

Indexing process
- The collection of target documents is analysed off-line and the information is stored in an index database.
- Each document first feeds the sentence analyser.
- Sentence categorization: each sentence is classified with one or more question categories through the APs.
- We build indices for:
  - Lemmas
  - Heads of derivation
  - NEs and fixed expressions
  - Question categories
  - Ontology domains (at document level)

Question analysis
- Input: an NL question (e.g. "Quem é o presidente da Albânia?" [Who is the president of Albania?])
- Steps:
  - Sentence analysis
  - Question categorization & activation of QAPs (through the QPs)
  - Extraction of pivots (words, NEs, phrases, dates, abbreviations, ...)
  - Query expansion (heads of derivation & synonyms)
- Output:
  - Pivots' lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado)
  - Question categories (e.g. <FUNCTION>, <DENOMINATION>)
  - Relevant ontological domains
  - Active QAPs

Document retrieval
- Input:
  - Pivots' lemmas ($w_i^L$), heads ($w_i^H$) & synonyms ($w_{ij}^S$)
  - Question categories ($c_k$) & ontological domains ($o_l$)
- Word weighting $\varepsilon(w)$ according to: POS; ilf (inverse lexical frequency); idf (inverse document frequency).
- Each document $d$ is given a score $\omega_d$:

    $\omega_d$ := 0
    For Each pivot $i$
        If $d$ contains lemma $w_i^L$ Then $\omega_d$ += $K_L \, \varepsilon(w_i^L)$
        Else If $d$ contains head $w_i^H$ Then $\omega_d$ += $K_H \, \varepsilon(w_i^H)$
        Else If $d$ contains any synonym $w_{ij}^S$ Then $\omega_d$ += $\max_j \left( K_S \, \vartheta(w_{ij}^S, w_i^L) \, \varepsilon(w_{ij}^S) \right)$
    If $d$ contains any question category $c_k$ Then $\omega_d$ += $K_C$
    If $d$ contains any ontology domain $o_l$ Then $\omega_d$ += $K_O$
    $\omega_d$ := RewardPivotProximity($d$, $\omega_d$)

- Output: the top 30 scored documents.

Sentence retrieval
- Input: scored documents $\{(d, \omega_d)\}$ with relevant sentences marked.
- Steps:
  - Sentence analysis
  - Sentence scoring
- Each sentence $s$ is given a score $\omega_s$ according to:
  - Number of pivots' lemmas, heads & synonyms matching $s$;
  - Number of partial matches: Fidel → Fidel Castro;
  - Order & proximity of pivots in $s$;
  - Existence of common question categories between $q$ and $s$;
  - Score $\omega_d$ of the document $d$ containing $s$.
- Output: scored sentences $\{(s, \omega_s)\}$ above a fixed threshold.
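The document scoring pseudo-code on the Document retrieval slide can be sketched in Python as follows. This is an illustration under stated assumptions, not Priberam's code: the `Pivot` and `Document` types, the numeric values of the constants `K_L`, `K_H`, `K_S`, `K_C` and `K_O`, and the constant word weight used in the example are all hypothetical, and the final `RewardPivotProximity` step is left out because the slides do not define it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

@dataclass
class Pivot:
    lemma: str
    head: Optional[str] = None                                # head of derivation, if any
    synonyms: Dict[str, float] = field(default_factory=dict)  # synonym -> similarity to lemma

@dataclass
class Document:
    lemmas: Set[str]
    heads: Set[str] = field(default_factory=set)
    categories: Set[str] = field(default_factory=set)  # question categories found by the APs
    domains: Set[str] = field(default_factory=set)     # ontology domains (document level)

def score_document(doc: Document, pivots: List[Pivot],
                   categories: Set[str], domains: Set[str],
                   weight: Callable[[str], float],
                   K_L=1.0, K_H=0.8, K_S=0.6, K_C=0.5, K_O=0.3) -> float:
    """Score one document; the constant values are placeholders, not the system's."""
    score = 0.0
    for p in pivots:
        if p.lemma in doc.lemmas:                         # exact lemma match
            score += K_L * weight(p.lemma)
        elif p.head is not None and p.head in doc.heads:  # match on head of derivation
            score += K_H * weight(p.head)
        else:                                             # best matching synonym, if any
            found = [K_S * sim * weight(s)
                     for s, sim in p.synonyms.items() if s in doc.lemmas]
            if found:
                score += max(found)
    if doc.categories & categories:                       # shared question category
        score += K_C
    if doc.domains & domains:                             # shared ontology domain
        score += K_O
    return score  # the proximity reward would be applied here

# Illustrative data based on the "presidente da Albânia" example.
pivots = [Pivot("presidente", head="presidir", synonyms={"chefe de estado": 0.9}),
          Pivot("Albânia", head="albanês")]
doc = Document(lemmas={"presidente", "grego"}, heads={"albanês"}, categories={"FUNCTION"})
score = score_document(doc, pivots, {"FUNCTION", "DENOMINATION"}, set(),
                       weight=lambda w: 1.0)
```

Ranking then amounts to sorting the documents by this score and keeping the best 30, as stated on the slide.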

Answer extraction
- Input:
  - Scored sentences $\{(s, \omega_s)\}$
  - Active QAPs (from the Question Analysis module)
- Steps:
  - Answer extraction & scoring through the QAPs, e.g. "O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque." [The President of Albania, Sali Berisha, tried to avoid the worst, stating that it has not been proven that the Greek Government is involved in the attack.]
  - Answer coherence: each answer $a$ is rescored to $\omega_a$ taking into account its coherence with the whole collection of candidate answers (e.g. "Sali Berisha", "Ramiz Alia", "Berisha").
  - Selection of the final answer.
- Output: the answer $a$ with the highest $\omega_a$, or NIL if no answer was extracted.

Results & evaluation (I)
- QA@CLEF evaluation: Portuguese monolingual task.
- 210734 target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público1994, Público1995, Folha1994, Folha1995.
- Test set of 200 questions (in Brazilian and European Portuguese).
- Results: 64.5% of right answers (R).

Results & evaluation (II)
Reasons for bad answers (W+X+U: wrong, inexact, unsupported):
- 16.5%: Extraction of candidate answers
  Question: Como se chama a Organização para a Alimentação e Agricultura das Nações Unidas? [What is the name of the Food and Agriculture Organization of the United Nations?]
  Overextraction: (...) que viria a estar na origem da FAO (a Organização para a Alimentação e a Agricultura das Nações Unidas)
- 8.0%: NIL validation
  Question: Que partido foi fundado por Andrei Brejnev? [Which party was founded by Andrei Brezhnev?]
  Should return NIL.
- 6.5%: Choice of the final answer
  Question: O que é a Sabena? [What is Sabena?]
  1st answer: No caso da Sabena, a Swissair (...) terá de pronunciar-se.
  2nd answer: (...) o acordo de união entre a companhia aérea belga Sabena
- 4.5%: Document retrieval
  Question: Diga o nome de um assassino em série americano. [Name an American serial killer.]
  The right document was missed: no match between "americano" and "EUA" in (...) John Wayne Gacy, maior assassino em série da história dos EUA (...)

Conclusions
- Priberam's QA system exhibited encouraging results:
  - State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation.
- Possible advantages over other systems:
  - Adjustable & powerful patterns for categorization & extraction (SintaGest);
  - Query expansion through heads of derivation & synonyms;
  - Use of ontology to introduce semantic knowledge.
- Some future work:
  - Confidence measure for final answer validation;
  - Handling of list-, how-, & temporally-restricted questions;
  - Semantic disambiguation & further exploitation of the ontology;
  - Syntactic parsing & anaphora resolution;
  - Refinement for Web & book searching.
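The answer coherence step on the Answer extraction slide is described only in one sentence, so the following Python sketch shows just one plausible way to realise it. The rule used here (a candidate is supported by every other candidate whose words are a subset or superset of its own), the tie-break in favour of the longer string, and the numeric scores are all assumptions for illustration and are not taken from the system.

```python
def rescore(candidates):
    """candidates: list of (answer, score) pairs with distinct answer strings.

    Returns {answer: rescored value}. A candidate gains the scores of the
    other candidates that are word-wise contained in it or contain it
    (an assumed notion of coherence).
    """
    def words(answer):
        return set(answer.lower().split())

    rescored = {}
    for a, score_a in candidates:
        wa = words(a)
        support = sum(score_b for b, score_b in candidates
                      if b != a and (words(b) <= wa or wa <= words(b)))
        rescored[a] = score_a + support
    return rescored

def final_answer(candidates):
    """Pick the best rescored answer, or "NIL" when nothing was extracted."""
    if not candidates:
        return "NIL"
    rescored = rescore(candidates)
    # Assumed tie-break: prefer the longer (more complete) answer string.
    return max(rescored, key=lambda a: (rescored[a], len(a)))

# Candidate answers from the slide, with made-up extraction scores.
candidates = [("Sali Berisha", 20), ("Ramiz Alia", 22), ("Berisha", 15)]
best = final_answer(candidates)
```

With these made-up scores, "Ramiz Alia" has the highest individual score, but "Sali Berisha" and "Berisha" support each other, so the sketch selects "Sali Berisha".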

Ontology
- Concept-based
- Tree-structured, 4 levels
- Nodes are concepts
- Leaves are senses of words
- Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech)
- There are 3387 terminal nodes (the most specific concepts)