Priberam Informática
Av. Defensores de Chaves, 32, 3º Esq., 1000-119 Lisboa, Portugal
Tel.: +351 21 781 72 60 / Fax: +351 21 781 72 79

Priberam's question answering system for Portuguese
Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
CLEF Workshop, Vienna, 21-23 September 2005

Outline
- Introduction
- A workbench for NLP: lexical resources, software tools, question categorization
- System description: indexing process, question analysis, document retrieval, sentence retrieval, answer extraction
- Evaluation & results
- Conclusions

Introduction
Goal: to build a question answering (QA) engine that finds a unique exact answer to natural-language questions.
Evaluation: QA@CLEF Portuguese monolingual task.
Previous work by Priberam on this subject:
- LegiX, a juridical information system
- SintaGest, a workbench for NLP
- TRUST project (Text Retrieval Using Semantics Technology): development of the Portuguese module in a cross-language environment

Lexical resources
Lexicon:
- lemmas, inflections and POS
- sense definitions (*)
- semantic features, subcategorization and selection restrictions
- ontological and terminological domains
- English and French equivalents (*)
- lexical-semantic relations (e.g. derivations)
(*) Not used in the QA system.
Thesaurus.
Ontology:
- multilingual (**) (English, French, Portuguese), enabling translations
- designed by Synapse Développement for TRUST
(**) Only Portuguese information is used in the QA system.
Software tools
Priberam's SintaGest, an NLP application that allows:
- building & testing a context-free grammar (CFG)
- building & testing contextual rules for morphological disambiguation and for named entity & fixed expression recognition
- building & testing patterns for question categorization/answer extraction
- compressing & compiling all data into binary files
Statistical POS tagger:
- used together with contextual rules for morphological disambiguation
- HMM-based (2nd order), trained on the CETEMPúblico corpus
- fast & efficient decoding via the Viterbi algorithm

Question categorization (I)
86 question categories in a flat structure: <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, ...
Categorization is performed through rich patterns (more powerful than regular expressions):
- more than one category is allowed (avoiding hard decisions)
- rich patterns are conditional expressions over words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases
- everything is built & tested through SintaGest

Question categorization (II)
There are 3 kinds of patterns, each carrying heuristic scores:
- question patterns (QPs): for question categorization
- answer patterns (APs): for sentence categorization (during indexing)
- question answering patterns (QAPs): for answer extraction

Example patterns for the <FUNCTION> category:

  Question (FUNCTION)
  : Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
    // e.g. "Quem é Jorge Sampaio?" ("Who is Jorge Sampaio?")
  : Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15
    // e.g. "Que cargo desempenha Jorge Sampaio?" ("What office does Jorge Sampaio hold?")
  ;
  Answer
  : Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20
    // e.g. "Jorge Sampaio é o {Presidente da República}..." ("Jorge Sampaio is the {President of the Republic}...")
  : {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15
    // e.g. "O {presidente da República}, Jorge Sampaio..." ("The {President of the Republic}, Jorge Sampaio...")
  ;
  Answer (FUNCTION)
  : QuestIdent(FUNCTION_N) = 10
  : Ergonym = 10
  ;

QA system overview
The system architecture is composed of 5 major modules (indexing, question analysis, document retrieval, sentence retrieval and answer extraction), connected through the QPs, APs and QAPs.
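As an illustration, a rich pattern such as `Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15` can be approximated by a tiny matcher over analysed tokens. This is a minimal sketch, not Priberam's implementation: the token attributes (`word`, `root`, `cat`) and the gap semantics of `Distance` are assumptions.

```python
# Minimal sketch (not Priberam's implementation) of matching one QP such as
#   Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
# Tokens are dicts with hypothetical "word", "root" and "cat" attributes.

def match_pattern(tokens, pattern):
    """Try to match `pattern` (steps of (test, max_gap)) against `tokens`;
    return the pattern's heuristic score on success, 0 otherwise."""
    def try_from(start):
        pos = start
        for test, max_gap in pattern["steps"]:
            # look for the next matching token within max_gap positions
            for j in range(pos, min(pos + max_gap + 1, len(tokens))):
                if test(tokens[j]):
                    pos = j + 1
                    break
            else:
                return False
        return True
    for start in range(len(tokens)):
        if try_from(start):
            return pattern["score"]
    return 0

# Pattern for <FUNCTION>: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
function_qp = {
    "score": 15,
    "steps": [
        (lambda t: t["word"] == "quem", 0),           # Word(quem), anchored
        (lambda t: t["root"] == "ser", 3),            # Root(ser) within 0-3 tokens
        (lambda t: t["cat"] in ("Nprop", "ENT"), 0),  # AnyCat(Nprop, ENT)
    ],
}

# "Quem é Jorge Sampaio?" tagged with the hypothetical attributes:
question = [
    {"word": "quem", "root": "quem", "cat": "Pron"},
    {"word": "é", "root": "ser", "cat": "V"},
    {"word": "Jorge Sampaio", "root": "Jorge Sampaio", "cat": "Nprop"},
]
score = match_pattern(question, function_qp)  # 15: categorized as <FUNCTION>
```

A real rich pattern engine would also handle the `{... With ...}` phrase constraints and multiple simultaneous categories; this sketch only shows the conditional-sequence idea.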
Indexing process
The collection of target documents is analysed off-line and the information is stored in an index database.
- Each document first feeds the sentence analyser.
- Sentence categorization: each sentence is classified with one or more question categories through the APs.
Indices are built for:
- lemmas
- heads of derivation
- NEs and fixed expressions
- question categories
- ontology domains (at document level)

Question analysis
Input: a NL question, e.g. "Quem é o presidente da Albânia?" ("Who is the president of Albania?").
- Sentence analysis.
- Question categorization & activation of QAPs (through the QPs).
- Extraction of pivots (words, NEs, phrases, dates, abbreviations, ...).
- Query expansion (heads of derivation & synonyms).
Output:
- pivot lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado)
- question categories (e.g. <FUNCTION>, <DENOMINATION>)
- relevant ontological domains
- active QAPs

Document retrieval
Input: pivot lemmas (w_i^L), heads (w_i^H) & synonyms (w_ij^S); question categories (c_k) & ontological domains (o_l).
Each word w receives a weight ω(w) according to its POS, its ilf (inverse lexical frequency) and its idf (inverse document frequency).
Each document d is given a score α_d:

  α_d := 0
  For each pivot i
      If d contains lemma w_i^L Then α_d += K_L · ω(w_i^L)
      Else If d contains head w_i^H Then α_d += K_H · ω(w_i^H)
      Else If d contains any synonym w_ij^S Then α_d += max_j (K_S · sim(w_ij^S, w_i^L) · ω(w_ij^S))
  If d contains any question category c_k Then α_d += K_C
  If d contains any ontology domain o_l Then α_d += K_O
  α_d := RewardPivotProximity(d, α_d)

Here K_L, K_H, K_S, K_C and K_O are heuristic constants and sim(w_ij^S, w_i^L) is a synonym-to-lemma similarity factor.
Output: the top 30 scored documents.

Sentence retrieval
Input: scored documents {(d, α_d)} with the relevant sentences marked.
- Sentence analysis.
- Sentence scoring: each sentence s is given a score α_s according to:
  - the number of pivot lemmas, heads & synonyms matching s
  - the number of partial matches (e.g. Fidel vs. Fidel Castro)
  - the order & proximity of the pivots in s
  - the existence of question categories common to the question and s
  - the score α_d of the document d containing s
Output: the scored sentences {(s, α_s)} above a fixed threshold.
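The document-scoring loop above can be rendered directly in code. This is a sketch under stated assumptions: the slide does not give the values of K_L, K_H, K_S, K_C, K_O, nor the definitions of the weight function ω and the similarity factor, so the ones below are invented for illustration, and the final RewardPivotProximity step is omitted.

```python
# Sketch of the document-scoring loop from the slides. The constants and the
# omega/sim functions are placeholders: the deck does not specify them.

K_L, K_H, K_S, K_C, K_O = 1.0, 0.7, 0.5, 0.3, 0.2  # assumed relative weights

def score_document(doc_terms, doc_categories, doc_domains,
                   pivots, categories, domains, omega, sim):
    """Score one document d against the analysed question.

    Each pivot is a dict with its lemma, head of derivation and synonyms;
    a lemma match is preferred over a head match, which is preferred over
    a synonym match (the If/Else-If cascade on the slide)."""
    score = 0.0
    for p in pivots:
        if p["lemma"] in doc_terms:
            score += K_L * omega(p["lemma"])
        elif p["head"] in doc_terms:
            score += K_H * omega(p["head"])
        else:
            syn_scores = [K_S * sim(s, p["lemma"]) * omega(s)
                          for s in p["synonyms"] if s in doc_terms]
            if syn_scores:
                score += max(syn_scores)
    if any(c in doc_categories for c in categories):
        score += K_C
    if any(o in doc_domains for o in domains):
        score += K_O
    return score  # the slide then applies RewardPivotProximity(d, score)

omega = lambda w: 1.0   # uniform word weights, for illustration only
sim = lambda a, b: 0.8  # flat synonym similarity, for illustration only
pivots = [
    {"lemma": "presidente", "head": "presidir", "synonyms": ["chefe de estado"]},
    {"lemma": "albânia", "head": "albânia", "synonyms": ["albanês"]},
]
d_score = score_document({"presidente", "albanês"}, {"FUNCTION"}, set(),
                         pivots, {"FUNCTION"}, set(), omega, sim)
# K_L*1.0 (lemma hit) + K_S*0.8*1.0 (synonym hit) + K_C = 1.0 + 0.4 + 0.3 = 1.7
```

Note how the cascade makes exact lemma matches dominate: a synonym only contributes when neither the lemma nor its head of derivation occurs in the document.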
Answer extraction
Input: scored sentences {(s, α_s)} and the active QAPs (from the question analysis module).
- Answer extraction & scoring through the QAPs.
- Answer coherence: each answer a is rescored to α_a, taking into account its coherence with the whole collection of candidate answers (e.g. Sali Berisha, Ramiz Alia, Berisha).
- Selection of the final answer.
Output: the answer a with the highest α_a, or NIL if no answer was extracted.

Results & evaluation (I)
QA@CLEF evaluation: Portuguese monolingual task.
- 210,734 target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995.
- Test set of 200 questions (in Brazilian and European Portuguese).
Results: 64.5% right answers (R), e.g. "O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque." ("The President of Albania, Sali Berisha, tried to avoid the worst, stating that it is not proven that the Greek government was involved in the attack.")

Results & evaluation (II)
Reasons for bad answers (W+X+U):
- 16.5%: extraction of candidate answers. E.g. "Como se chama a Organização para a Alimentação e Agricultura das Nações Unidas?" ("What is the United Nations Food and Agriculture Organization called?"): overextraction in "(...) que viria a estar na origem da FAO (a Organização para a Alimentação e a Agricultura das Nações Unidas)".
- 8.0%: NIL validation. E.g. "Que partido foi fundado por Andrei Brejnev?" ("Which party was founded by Andrei Brezhnev?"): should return NIL.
- 6.5%: choice of the final answer. E.g. "O que é a Sabena?" ("What is Sabena?"): 1st answer "No caso da Sabena, a Swissair (...) terá de pronunciar-se."; 2nd answer "(...) o acordo de união entre a companhia aérea belga Sabena".
- 4.5%: document retrieval. E.g. "Diga o nome de um assassino em série americano." ("Name an American serial killer."): the right document was missed, since "americano" did not match "EUA" in "(...)
John Wayne Gacy, maior assassino em série da história dos EUA (...)" ("John Wayne Gacy, the biggest serial killer in US history").

Conclusions
Priberam's QA system exhibited encouraging results: state-of-the-art accuracy (64.5%) in the QA@CLEF evaluation.
Possible advantages over other systems:
- adjustable & powerful patterns for categorization & extraction (SintaGest)
- query expansion through heads of derivation & synonyms
- use of an ontology to introduce semantic knowledge
Some future work:
- confidence measure for final answer validation
- handling of list, "how" and temporally restricted questions
- semantic disambiguation & further exploitation of the ontology
- syntactic parsing & anaphora resolution
- refinement for Web & book searching
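The answer-coherence rescoring mentioned in the answer extraction step is not spelled out on the slides. One plausible, purely illustrative sketch rewards a candidate for every other candidate whose string contains it or is contained in it, so that "Berisha" and "Sali Berisha" reinforce each other; the containment test and the `support_weight` value are assumptions, not Priberam's formula.

```python
# Illustrative sketch only: the deck says answers are rescored for coherence
# with the other candidates but gives no formula. Here a candidate gains a
# fraction of each supporting candidate's score when one answer string
# contains the other (e.g. "Berisha" supports "Sali Berisha").

def rescore_by_coherence(candidates, support_weight=0.5):
    """candidates: list of (answer_string, score); returns a rescored list."""
    rescored = []
    for i, (ans, score) in enumerate(candidates):
        bonus = 0.0
        for j, (other, other_score) in enumerate(candidates):
            if i == j:
                continue
            a, b = ans.lower(), other.lower()
            if a in b or b in a:  # one answer string contains the other
                bonus += support_weight * other_score
        rescored.append((ans, score + bonus))
    return rescored

candidates = [("Sali Berisha", 20.0), ("Ramiz Alia", 12.0), ("Berisha", 8.0)]
best = max(rescore_by_coherence(candidates), key=lambda t: t[1])
# "Sali Berisha" gains 0.5 * 8.0 = 4.0 from "Berisha" and stays on top
```

The point of such a step is that near-duplicate extractions, instead of splitting the vote, consolidate evidence onto the most complete answer string.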
Ontology
- Concept-based.
- Tree-structured, with 4 levels.
- Nodes are concepts; leaves are senses of words.
- Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech).
- There are 3387 terminal nodes (the most specific concepts).
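The tree structure above (concepts as internal nodes, word senses as leaves carrying multilingual equivalents) can be sketched as a small data structure. All concept names and translations below are invented for illustration; only the shape (a 4-level tree with translated leaves) comes from the slide.

```python
# Toy sketch of the tree-structured ontology: internal nodes are concepts,
# leaves are word senses with multilingual equivalents. Names are invented.

ontology = {
    "root": {"parent": None},
    "human-activity": {"parent": "root"},
    "profession": {"parent": "human-activity"},
    # leaf: one sense of a word, with hypothetical multilingual equivalents
    "president#1": {"parent": "profession",
                    "words": {"pt": "presidente", "en": "president",
                              "fr": "président"}},
}

def concept_path(node):
    """Return the chain of concepts from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = ontology[node]["parent"]
    return list(reversed(path))

path = concept_path("president#1")
# ['root', 'human-activity', 'profession', 'president#1']
```

Sharing a prefix of such paths is one way two word senses can be recognized as semantically related, which is how an ontology of this shape can inject semantic knowledge into retrieval.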