MIRACLE Question Answering System for Spanish at CLEF 2007
|
|
|
- Tyrone Whitehead
- 10 years ago
- Views:
Transcription
1 MIRACLE Question Answering System for Spanish at CLEF 2007 César de Pablo-Sánchez, José Luis Martínez, Ana García-Ledesma, Doaa Samy, Paloma Martínez, Antonio Moreno-Sandoval, Harith Al-Jumaily Universidad Carlos III de Madrid Universidad Autónoma de Madrid DAEDALUS, Data, Decisions and Systems S.A. Abstract This paper describes the system developed by MIRACLE group to participate in the Spanish monolingual question answering task at A basic subsystem, similar to our last year participation, was used separately for EFE and Wikipedia collection. Answers from the two subsystems are combined using temporal information from the questions and the collections. The system is also enhanced with a correference module that processes question series based on a few simple heuristics that constraint the structure of the dialogue. The analysis of the results show that the reuse of strategies for factoids is feasible but definitions would benefit from adaptation. Regarding questions series, our heuristics have good coverage but we should find alternatives to avoid error chaining from previous questions. Categories and Subject Descriptors H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries General Terms Measurement, Performance, Experimentation Keywords Question answering, Questions beyond factoids, Spanish 1 Introduction QA@CLEF 2007 has introduced two innovations over last year evaluation. The first change consists on topic-related questions, a series of ordered questions about a common topic that simulates This work has been partially supported by the Regional Government of Madrid under the Research Network MAVIR (S-0505/TIC-0267) and two projects by the Spanish Ministry of Education and Science (TIN2004/07083 and TIN C03-02.)
2 the dialogue between a user and the system to obtain information related to the topic. In this dialogue the user could introduce anaphoric expressions to refer to mentioned entities or events that appear in previous answers or questions. The second innovation is related to the inclusion of the November 2006 dump of the Wikipedia [14] as a source of answers in addition to the classic newspaper collections. MIRACLE submitted a run for the Spanish monolingual task using a system that was enhanced to answer questions from Wikipedia and EFE collections. Each subsystem results were combined in a unified ranked list. The system also included a module that handles topic related questions. It identifies the topic and use it to enhance the representation of the following questions. The basic QA system was based on the architecture of our last year submission [6] although almost all components have evolved since then. This system was based on the use of filters for semantic information and therefore is tailored for factual questions. Last year we also tried to improve it for temporally restricted questions. An additional requirement has been to develop a fast system that could be competitive in real time with a classic information retrieval system. This paper is structured as follow, the next section decribes the system architecture with special attention to the new modules. Section 3 introduces the results and a preliminary analysis of the kind of errors that the system made. Conclusions and directions of future work to solve the main problems follow in Section 4. 2 System Overview The architecture of the system used this year is presented in figure 1 and it is composed of two streams for each of the sources similar to [4, 3]. The first stream uses the EFE newswire collection as a source of answers while the second uses Wikipedia. Each stream produces a ranked list of answers that are merged and combined by the Answer Source Mixer component described below. The two QA streams share a similar basic pipeline architecture and work as an independent QA system, with different configuration parameters, collections, etc. The way we perform question analysis is common to the two streams and therefore, when the two streams are composed, this module is shared. Another new common module complements question analysis for managing context and anaphora resolution in topic-related question series. 2.1 Basic system architecture The basic system follows the classic pipeline architecture used by many other QA systems [8, 11]. The different operations are split between those that are performed online and offline Offline operations These operations are performed during the preparation of the collection to speed up online document retrieval and answer extraction. Collection indexing. Collections are indexed at the word level using Lucene [1]. They are processed to extract the text that will be indexed. In the case of Wikipedia we remove format and links and therefore we do not use this information for relevance. Documents are indexed after tokenization, stopword removal and stemming based on Snowball [10]. Collection processing. In order to speed the process of answer extraction we can enable the use of preprocessed collections. In this case, collections are analyzed using the output of language tools or services. For Spanish we use DAEDALUS STILUS[5] analyzers that provide tokenization, sentence detection and token analysis. Token analysis include detailed part of speech analysis, lemmatization and semantic information. STILUS has been improved from last year to support part of speech tagging. Regarding the semantic information, we have use STILUS Named Entities tagging, that is based on linguistic resources organized in a classification inspired by Sekine s typology[13]. The processed collection is stored on disk. Without compression the collection is about 10 times larger than the corresponding original.
3 Figure 1: MIRACLE 2007 System architecture Online operations Online operations has been described in detail in previous participations [6]. description of the three main modules here and remark only the main changes. We outline the Question analysis. Questions are analyzed using STILUS online and the module produce a custom representation of the question that includes focus, theme, relevant and query terms, question type and expected answer type (EAT). Passage retrieval. We have change this module to use Lucene as information retrieval engine. The question reresentation is used to build a query using Lucene syntax and relevant documents are retrieved. Lucene uses a vector model for document representation and cosine similarity for ranking. Documents are analyzed and only sentences that contain a number of relevant terms are selected for the next step. Answer selection. This module uses the question class and the expected answer type to select an appropiate filter to locate candidate answers. Extracted candidates are grouped if they are similar and ranked using a score that combines redundancy, sentence score and document score.
4 2.2 New modules Group topic identification in question series With the inclusion of topic related questions, the system needed a method to solve referential expressions that appear between questions and answers in the same series. The system processes the first question and generates a set of candidates with the topic, the focus and the future answer. We have implemented a few rules that we believed covered the most common cases to select the best topic for the whole group. The rules use the information available after question analysis and simplified assumptions about the syntactic structure of the questions. The rules to locate the topic for the question series are only applied to the first question: Answers of subtypes NUMEX (numbers and quantities) are ignored as topics for questions series. We believe it is improbable for a number, if not representing any other type of entity, to be a natural topic. We use the subject, the topic of this question, as the topic of the question series. Answers of subtypes TIMEX (dates, years, etc...) are also ignored as topics, but in this case it is because we believed that they fall outside the guidelines for this year. We use the same rule than for NUMEX so far. Using a temporal expresion as a topic is very natural. In fact, most temporally restricted questions and specially those that have an event as restrictions are naturally split into two questions. This strategy has already been used by [12] to answer this kind of questions. We would need to recognize referential expressions like ese año (that year), durante ese período (during that period), etc. and solve the reference correctly in a second step. The question ask for a definition like Quién es George Bush? (Who is George Bush?) will add the topic (Named Entity) and the answer (presidente de los Estados Unidos) to the group topic. We have a similar case when we have questions like Quién es el presidente de los Estados Unidos? ( Who is the president of the United States?). The question follows the pattern Qué NP *? (What NP *?) like Qué organización se fundó en 1995? (Which organization was created in 1995?). In this cases the noun group that follow the interrogative article is the focus of the question. Both the answer and the focus would be added to the group topic. We should remark that this case is different from a question beginning with a preposition like En qué lugar...? (In which place...?). For the rest of the classes we use the answer as the topic for the rest of the group. Once the topic for the group is identified, the rest of the questions use it as an additional relevant term in order to locate documents and filter relevant sentences. This is obviously a problem when the topic was the answer but the system was not able to find the right one. These rules are based on the structure of the information seeking dialogue and how we introduce new topics in conversation. Our rules select the most promising candidate using the firt question and ignoring the rest. Nevertheless, it is posible to shift the topic by introducing a longer referential expresion like the examples mentioned for TIMEX. We plan to investigate how to modify our procedure to work in two steps, generating a ranking list of cantidates and selecting the best candidate depending on constraints expressed in the referential expression Combining EFE and Wikipedia answers As already mentioned, this year there are two possible collections to find the right answer to a question, one is Wikipedia and the other is the EFE newswire collection for years 1994 and This means that there should be some automatic method to decide which of these sources should be more relevant to extract candidate answers to a given question. This is the role of the Source Mixer component, based on very simple heuristics:
5 If the verb of the question appears in present tense preference is given to answers appearing in the Wikipedia collection. If the verb is in past tense and the question makes reference to the period covered by the EFE news collection, i.e., 1994 or 1995 years appearing in the question, then preference is given to answers from the EFE news collection. The preference given to each answer is measured by two parameters, one referring to the verb tense factor and the other referring to the time period factor. In this way, no answer is really dropped from the candidates list but the list is reordered according to this clues. At the output of the Source Mixer component there is a list of candidates ordered according to their source and the information present in the question. 3 Results 3.1 Run description Using the system described above we have submitted one monolingual run for the Spanish subtask of this edition. Evaluation results and combined measures are outlined in tables 1 and 2. Table 1: Judged answers for submitted run Name Right Wrong Inexact Unsupp. MIRA071ESES Table 2: Evaluation measures for submitted run Name Acc@1(F) Acc@1(T) Acc@1(D) Acc@1(All) MIRA071ESES Results are disappointing in general, as they are lower than previous years results for almost all types, despite the inclusion of new sources like Wikipedia. Even though almost all modules have been improved, the overall accuracy is not improving. We believe that whether the question set is more difficult or the system was not tuned as much as needed. We do not show results for the ten list questions because we did not implement an specific strategy. The case of definitions is analyzed with more detail below. We believe that question series introduce additional complexity in the task and this is reflected in the results. If we ignore question series we obtain an accuracy around 18% for the rest of the questions, which supports this idea but reflect that even base results are not good. 3.2 Analysis of errors We are analysing our results in order to estimate which parts of the system need further improvement. For the moment, we are only able to present a preliminary analysis of the contents of our submission that uses one answer per question with their supporting sentence and document. The document returned is correct in 44% of the results, which is a lower bound to measure the performance of the document retrieval system. It also includes errors caused by an incorrect selection of a document collection. Given a correct document, 31% of the sentences selected with the first answer do not really contain a correct answer. Even if the sentence is correct, the error selecting a correct string is about 47%. These kind of error accumulates incorrect selection of a candidate, incorrect extraction and incorrect identification of the expected answer type. Finally, at least 8
6 questions have found an unsupported answer, which signals that in those cases where the same answer string has been extracted from several sentences, the score to select the most representative answer is not working as well as expected. A more detailed analysis isolating the contribution of the main modules will be presented in the final version of this article. Analyzing the behaviour of the system for different types we have detected that the accuracy of definition questions has dropped dramatically. We believe that the heuristics that we have defined to extract definitions in EFE do not work for Wikipedia. The system used to signal appositions, nominal phrases before Named Entities and expansion of acronyms as valid definitions of persons and organizations. In contrast, definitional sentences in Wikipedia usually are copulative, they have a longer distance between the defined object and a valid definition and they usually are placed in the beginning of the document. Unfortunately, we did not implement any special strategy for these questions. If we consider the rest of the types, the behaviour is similar to previous years, most of the factual questions with well defined Named Entities achieve a reasonable accuracy. Questions with EAT OTHER are in contrast much harder. There were 20 topic related groups of questions that made a total of 50 questions. The system correctly answered 5 questions of this kind, from three different groups. This behaviour was expected as errors usually chain, specially in the case when the answer to the first question is not correct and this happens to be the topic. We have analyze the main source of errors in order to evaluate the correference component. Rules for topic detection are the source of errors only for three of the cases, and therefore seems to work reasonably well although there is room for some simple improvements. In one of these cases, the error is due to a question not correctly identified as a NUMEX type. The rest of the errors are due to incorrect identification of the first answer and the chaining of errors. 4 Conclusions and Future Work We have presented the architecture of the MIRACLE QA system and the main modifications introduced to cope with new challenges in the CLEF@QA task. Using this system we have submitted one run for the monolingual Spanish subtask where results were lower than expected. The analysis of the performance across types have signalled that for some type of questions the style of the text is an important issue. This is specially accute in the case of definitions. We have employed the same subsystem and strategies for the EFE and the Wikipedia collections with dissapointing results. The analysis of the errors have shown that the methods for document retrieval and candidate extraction could be adapted to improve the accuracy. We plan to use type speficic approaches like the ones used by [7, 9] and collection specific approaches to improve results for these types. We have found that the module for correference resolution is effective even if it uses a limited amount of knowledge. In contrast, the greater contribution of errors is due to the low accuracy at the first answer. This is even more accute as a great deal of the questions that set the initial topic are definitional. Besides type specific approaches, we need to improve the correference method to consider more than one candidate answer and cope with uncertainty. Another question that we have not evaluated thoroughly yet is the way we combine results. We use a semantic kind of combination that exploits the different time spans of the two collections. It is still unclear if these method is appropriate or whether techniques adapted from information retrieval from heterogeneus collections as in [2] would work better. References [1] Lucene webpage. August [2] Rita M. Aceves-Perez, Manuel Montes y Gomez, and Luis Villasenor-Pineda. Fusion de respuestas en la busqueda de respuestas multilingue. SEPLN, Sociedad Espaola para el Procesamiento del Lenguaje Natural, 38:35 41, 2007.
7 [3] David Ahn, Valentin Jijkoun, Karin Mller, Maarten de Rijke, Stefan Schlobach, and Gilad Mishne. Making Stone Soup: Evaluating a Recall-Oriented Multi-stream Question Answering System for Dutch [4] Jennifer Chu-Carroll, John M. Prager, Christopher A. Welty, Krzysztof Czuba, and David A. Ferrucci. A multi-strategy and multi-source approach to question answering. In TREC, [5] DAEDALUS. Stilus website. On line July [6] Cesar de Pablo-Sanchez, Ana Gonzalez-Ledesma, Antonio Moreno-Sandoval, and Maria- Teresa Vicente-Diez. MIRACLE experiments in QA@CLEF 2006 in spanish: main task, real-time QA and exploratory QA using wikipedia (WiQA) [7] Abdessamad Echihabi, Eduard Hovy, and Michael Fleischman. Offline strategies for online question answering: [8] Dan I. Moldovan, Sanda M. Harabagiu, Marius Pasca, Rada Mihalcea, Roxana Girju, Richard Goodrum, and Vasile Rus. The structure and performance of an open-domain question answering system. In ACL, [9] Manuel Montes-y Gomez, Luis Villaseñor Pineda, Manuel Pérez-Coutiño, José Manuel Gómez-Soriano, Emilio Sanchís-Arnal, and Paolo Rosso. A full data-driven system for multiple language question answering. : Accessing Multilingual Information Repositories, pages , [10] Martin Porter. Snowball stemmers and resources website. July [11] John Prager, Eric Brown, Anni Coden, and Dragomir Radev. Question-answering by predictive annotation. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Question Answering, pages , [12] E. Saquete, J.L. Vicedo, P. Martínez-Barco, R. Muñoz, and F. Llopis. Evaluation of complex temporal questions in clef-qa. : Multilingual Information Access for Text, Speech and Images, pages , [13] Satoshi Sekine. Sekine s extended named entity hierarchy. On line August [14] November 2006 dump of wikipedia /pages-articles.xml.bz2, July 2007.
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
The University of Amsterdam s Question Answering System at QA@CLEF 2007
The University of Amsterdam s Question Answering System at QA@CLEF 2007 Valentin Jijkoun, Katja Hofmann, David Ahn, Mahboob Alam Khalid, Joris van Rantwijk, Maarten de Rijke, and Erik Tjong Kim Sang ISLA,
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
TREC 2003 Question Answering Track at CAS-ICT
TREC 2003 Question Answering Track at CAS-ICT Yi Chang, Hongbo Xu, Shuo Bai Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China [email protected] http://www.ict.ac.cn/
The University of Amsterdam at QA@CLEF 2005
The University of Amsterdam at QA@CLEF 2005 David Ahn Valentin Jijkoun Karin Müller Maarten de Rijke Erik Tjong Kim Sang Informatics Institute, University of Amsterdam Kruislaan 403, 1098 SJ Amsterdam,
LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task
LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n.
Work: Lecturer in Applied and Computational Linguistics Faculty of Arts, Cairo University Email: [email protected] Cell phone: +2 010 10 50 498
Doaa Samy 1 Doaa A. Samy Work: Lecturer in Applied and Computational Linguistics Faculty of Arts, Cairo University Email: [email protected] Cell phone: +2 010 10 50 498 EDUCATION Ph.D. in Computational
2009 Springer. This document is published in: Adaptive Multimedia Retrieval. LNCS 6535 (2009) pp. 12 23 DOI: 10.1007/978-3-642-18449-9_2
Institutional Repository This document is published in: Adaptive Multimedia Retrieval. LNCS 6535 (2009) pp. 12 23 DOI: 10.1007/978-3-642-18449-9_2 2009 Springer Some Experiments in Evaluating ASR Systems
A Text Mining Approach for Definition Question Answering
A Text Mining Approach for Definition Question Answering Claudia Denicia-Carral, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, René García Hernández Language Technologies Group, Computer Science Department,
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
Natural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services
21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Flattening Enterprise Knowledge
Flattening Enterprise Knowledge Do you Control Your Content or Does Your Content Control You? 1 Executive Summary: Enterprise Content Management (ECM) is a common buzz term and every IT manager knows it
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
The University of Lisbon at CLEF 2006 Ad-Hoc Task
The University of Lisbon at CLEF 2006 Ad-Hoc Task Nuno Cardoso, Mário J. Silva and Bruno Martins Faculty of Sciences, University of Lisbon {ncardoso,mjs,bmartins}@xldb.di.fc.ul.pt Abstract This paper reports
TweetAlert: Semantic Analytics in Social Networks for Citizen Opinion Mining in the City of the Future
TweetAlert: Semantic Analytics in Social Networks for Citizen Opinion Mining in the City of the Future Julio Villena-Román 1,2, Adrián Luna-Cobos 1,3, José Carlos González-Cristóbal 3,1 1 DAEDALUS - Data,
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.
Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.
Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED
Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED 17 19 June 2013 Monday 17 June Salón de Actos, Facultad de Psicología, UNED 15.00-16.30: Invited talk Eneko Agirre (Euskal Herriko
A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow
A Framework-based Online Question Answering System Oliver Scheuer, Dan Shen, Dietrich Klakow Outline General Structure for Online QA System Problems in General Structure Framework-based Online QA system
Learning analytics in the LMS: Using browser extensions to embed visualizations into a Learning Management System
Learning analytics in the LMS: Using browser extensions to embed visualizations into a Learning Management System Derick Leony, Abelardo Pardo, Luis de la Fuente Valentín, Iago Quiñones, and Carlos Delgado
Medical-Miner at TREC 2011 Medical Records Track
Medical-Miner at TREC 2011 Medical Records Track 1 J.M. Córdoba, 1 M.J. Maña, 1 N.P. Cruz, 1 J. Mata, 2 F. Aparicio, 2 M. Buenaga, 3 D. Glez-Peña, 3 F. Fdez-Riverola 1 Universidad de Huelva 2 Universidad
ASKWIKI : SHALLOW SEMANTIC PROCESSING TO QUERY WIKIPEDIA. Felix Burkhardt and Jianshen Zhou. Deutsche Telekom Laboratories, Berlin, Germany
20th European Signal Processing Conference (EUSIPCO 2012) Bucharest, Romania, August 27-31, 2012 ASKWIKI : SHALLOW SEMANTIC PROCESSING TO QUERY WIKIPEDIA Felix Burkhardt and Jianshen Zhou Deutsche Telekom
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
M3039 MPEG 97/ January 1998
INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND ASSOCIATED AUDIO INFORMATION ISO/IEC JTC1/SC29/WG11 M3039
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 [email protected] 2 [email protected] Abstract A vast amount of assorted
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania [email protected]
Folksonomies versus Automatic Keyword Extraction: An Empirical Study
Folksonomies versus Automatic Keyword Extraction: An Empirical Study Hend S. Al-Khalifa and Hugh C. Davis Learning Technology Research Group, ECS, University of Southampton, Southampton, SO17 1BJ, UK {hsak04r/hcd}@ecs.soton.ac.uk
Learning Translation Rules from Bilingual English Filipino Corpus
Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,
Abstract 1. INTRODUCTION
A Virtual Database Management System For The Internet Alberto Pan, Lucía Ardao, Manuel Álvarez, Juan Raposo and Ángel Viña University of A Coruña. Spain e-mail: {alberto,lucia,mad,jrs,avc}@gris.des.fi.udc.es
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization
Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Principal MDM Components and Capabilities
Principal MDM Components and Capabilities David Loshin Knowledge Integrity, Inc. 1 Agenda Introduction to master data management The MDM Component Layer Model MDM Maturity MDM Functional Services Summary
Report on the Dagstuhl Seminar Data Quality on the Web
Report on the Dagstuhl Seminar Data Quality on the Web Michael Gertz M. Tamer Özsu Gunter Saake Kai-Uwe Sattler U of California at Davis, U.S.A. U of Waterloo, Canada U of Magdeburg, Germany TU Ilmenau,
Software Architecture Document
Software Architecture Document Natural Language Processing Cell Version 1.0 Natural Language Processing Cell Software Architecture Document Version 1.0 1 1. Table of Contents 1. Table of Contents... 2
Application of ontologies for the integration of network monitoring platforms
Application of ontologies for the integration of network monitoring platforms Jorge E. López de Vergara, Javier Aracil, Jesús Martínez, Alfredo Salvador, José Alberto Hernández Networking Research Group,
The Prolog Interface to the Unstructured Information Management Architecture
The Prolog Interface to the Unstructured Information Management Architecture Paul Fodor 1, Adam Lally 2, David Ferrucci 2 1 Stony Brook University, Stony Brook, NY 11794, USA, [email protected] 2 IBM
Multi language e Discovery Three Critical Steps for Litigating in a Global Economy
Multi language e Discovery Three Critical Steps for Litigating in a Global Economy 2 3 5 6 7 Introduction e Discovery has become a pressure point in many boardrooms. Companies with international operations
Effective Mentor Suggestion System for Collaborative Learning
Effective Mentor Suggestion System for Collaborative Learning Advait Raut 1 U pasana G 2 Ramakrishna Bairi 3 Ganesh Ramakrishnan 2 (1) IBM, Bangalore, India, 560045 (2) IITB, Mumbai, India, 400076 (3)
An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials
ehealth Beyond the Horizon Get IT There S.K. Andersen et al. (Eds.) IOS Press, 2008 2008 Organizing Committee of MIE 2008. All rights reserved. 3 An Ontology Based Method to Solve Query Identifier Heterogeneity
Automatic Question Answering: Beyond the Factoid
Automatic Question Answering: Beyond the Factoid Abstract Radu Soricut Information Sciences Institute University of Southern California 4676 Admiralty Way Marina del Rey, CA 90292, USA [email protected] In
3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work
Unsupervised Paraphrase Acquisition via Relation Discovery Takaaki Hasegawa Cyberspace Laboratories Nippon Telegraph and Telephone Corporation 1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan [email protected]
Business Intelligence and Decision Support Systems
Chapter 12 Business Intelligence and Decision Support Systems Information Technology For Management 7 th Edition Turban & Volonino Based on lecture slides by L. Beaubien, Providence College John Wiley
POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition
POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics
Information Retrieval Elasticsearch
Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches
D 2.2.3 EUOSME: European Open Source Metadata Editor (revised 2010-12-20)
Project start date: 01 May 2009 Acronym: EuroGEOSS Project title: EuroGEOSS, a European Theme: FP7-ENV-2008-1: Environment (including climate change) Theme title: ENV.2008.4.1.1.1: European Environment
Projektgruppe. Categorization of text documents via classification
Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction
PiQASso: Pisa Question Answering System
PiQASso: Pisa Question Answering System Giuseppe Attardi, Antonio Cisternino, Francesco Formica, Maria Simi, Alessandro Tommasi Dipartimento di Informatica, Università di Pisa, Italy {attardi, cisterni,
Software Engineering
Software Engineering Lecture 06: Design an Overview Peter Thiemann University of Freiburg, Germany SS 2013 Peter Thiemann (Univ. Freiburg) Software Engineering SWT 1 / 35 The Design Phase Programming in
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.
Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5
Electronic Document Management Using Inverted Files System
EPJ Web of Conferences 68, 0 00 04 (2014) DOI: 10.1051/ epjconf/ 20146800004 C Owned by the authors, published by EDP Sciences, 2014 Electronic Document Management Using Inverted Files System Derwin Suhartono,
Task Description for English Slot Filling at TAC- KBP 2014
Task Description for English Slot Filling at TAC- KBP 2014 Version 1.1 of May 6 th, 2014 1. Changes 1.0 Initial release 1.1 Changed output format: added provenance for filler values (Column 6 in Table
Integration of a Multilingual Keyword Extractor in a Document Management System
Integration of a Multilingual Keyword Extractor in a Document Management System Andrea Agili *, Marco Fabbri *, Alessandro Panunzi +, Manuel Zini * * DrWolf s.r.l., + Dipartimento di Italianistica - Università
D2.4: Two trained semantic decoders for the Appointment Scheduling task
D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive
Intelligent Log Analyzer. André Restivo <[email protected]>
Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.
Stanford s Distantly-Supervised Slot-Filling System
Stanford s Distantly-Supervised Slot-Filling System Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, Christopher D. Manning Computer Science Department,
Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features
Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
NetOwl(TM) Extractor Technical Overview March 1997
NetOwl(TM) Extractor Technical Overview March 1997 1 Overview NetOwl Extractor is an automatic indexing system that finds and classifies key phrases in text, such as personal names, corporate names, place
Chapter 2 The Information Retrieval Process
Chapter 2 The Information Retrieval Process Abstract What does an information retrieval system look like from a bird s eye perspective? How can a set of documents be processed by a system to make sense
Towards a Visually Enhanced Medical Search Engine
Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The
Optimization of Internet Search based on Noun Phrases and Clustering Techniques
Optimization of Internet Search based on Noun Phrases and Clustering Techniques R. Subhashini Research Scholar, Sathyabama University, Chennai-119, India V. Jawahar Senthil Kumar Assistant Professor, Anna
Evaluation of an exercise for measuring impact in e-learning: Case study of learning a second language
Evaluation of an exercise for measuring impact in e-learning: Case study of learning a second language J.M. Sánchez-Torres Universidad Nacional de Colombia Bogotá D.C., Colombia [email protected]
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks
System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks OnurSoft Onur Tolga Şehitoğlu November 10, 2012 v1.0 Contents 1 Introduction 3 1.1 Purpose..............................
Text Processing with Hadoop and Mahout Key Concepts for Distributed NLP
Text Processing with Hadoop and Mahout Key Concepts for Distributed NLP Bridge Consulting Based in Florence, Italy Foundedin 1998 98 employees Business Areas Retail, Manufacturing and Fashion Knowledge
Engineering Process Software Qualities Software Architectural Design
Engineering Process We need to understand the steps that take us from an idea to a product. What do we do? In what order do we do it? How do we know when we re finished each step? Production process Typical
Exploiting the Web of Data for cross-domain information retrieval and recommendation
Exploiting the Web of Data for cross-domain information retrieval and recommendation Ignacio Fernández-Tobías under the supervision of Iván Cantador Grupo de Recuperación de Información Universidad Autónoma
Overview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
