Departamento de Lenguajes y Sistemas Informáticos
University of Alicante

Textual Entailment Recognition and its Applicability in NLP Tasks

Óscar Ferrández Escámez

PhD Dissertation

This work was carried out under the supervision of Dr. Rafael Muñoz Guillena, University of Alicante.

Alicante, July 2009

This work has been partially funded by the QALL-ME consortium, which is a 6th Framework Research Programme of the European Union (EU), contract number: FP6-IST-033860, and by the Spanish Government under the project CICyT number TIN2006-1526-C06-01.
Abstract

This thesis presents the major topics in textual entailment by means of examples together with thorough discussions. As a result, an end-to-end textual entailment system was developed, following the idea that textual entailment relations can be recognised at different linguistic levels. Specifically, we present three perspectives: Lexical, Syntactic and Semantic, each performing a set of useful inferences to determine entailment relations. The lexical perspective computes several measures based on lexical distances between words. The syntactic perspective obtains similarity degrees from the dependency trees derived from the texts by computing a similarity function between them. Finally, the semantic perspective implements inferences focused on WordNet similarity distances, negated terms, named entities and verb relations, and frame semantics. All these perspectives are processed individually as well as in a collaborative manner. The final entailment decision is taken by a machine learning classifier which uses the set of inferences from our perspectives as features.

Extensive evaluations over the PASCAL Recognising Textual Entailment datasets have been carried out in order to estimate the contribution of different combinations of the proposed perspectives, as well as to demonstrate that our perspectives are complementary to each other. Moreover, a study measuring the importance of the entities and verbs involved in the entailment relation was also carried out.

Furthermore, another motivation, as well as a contribution of this thesis, consisted of applying our system to other Natural Language Processing tasks: what we call an extrinsic evaluation. In this regard, our textual entailment system was successfully applied to:

- Two different Question Answering paradigms: (i) the validation of the answers returned by actual Question Answering systems; and (ii) the construction of an entailment-based Question Answering system for restricted domains.
- Automatic Text Summarization: producing a preliminary summary, for a summarization approach, made up of the non-entailed sentences from the whole document.
- The particular semantic task of linking Wikipedia categories to WordNet glosses: our system was applied to this task in order to automatically enrich the construction of a named entity repository.

Therefore, with our investigations we prove the suitability of the proposed perspectives for textual entailment recognition, and the ability of our system to resolve semantic variability. Moreover, in this thesis we also present further investigations, mainly based on enriching the knowledge of our system by discovering semantic relations derived from ontologies and/or extensive sources of information.
Contents

1 Introduction
  1.1 The Richness of Language: a Huge Problem for Computers
  1.2 The Practical Motivation
    1.2.1 Addressing the Textual Entailment Problem
  1.3 This Thesis
    1.3.1 Reader's Guide

2 Related Work and Relevant Resources and Tools
  2.1 Related Work History
    2.1.1 The Lexical Model
    2.1.2 The Syntactic Model
    2.1.3 The Semantic Model
    2.1.4 The Logic Model
    2.1.5 Models Combination
    2.1.6 Conclusions
  2.2 Relevant Resources and Tools for this Thesis
    2.2.1 The FreeLing Toolkit
    2.2.2 The MINIPAR Parser
    2.2.3 The NERUA System
    2.2.4 WordNet
    2.2.5 FrameNet
    2.2.6 The Shalmaneser Tool
    2.2.7 VerbNet
    2.2.8 VerbOcean
    2.2.9 Paraphrase corpora
  2.3 The PASCAL Recognizing Textual Entailment Challenges
  2.4 The Answer Validation Exercise

3 The Idea: A Perspective-based Textual Entailment System
  3.1 The System at a Glance
  3.2 Previous and Shared Steps
  3.3 Lexical Perspective
    3.3.1 Measuring Lexical Similarities
  3.4 Syntactic Perspective
    3.4.1 Tree Generation
    3.4.2 Tree Filtering
    3.4.3 Graph Embedding Detection
    3.4.4 Graph Node Matching
  3.5 Semantic Perspective
    3.5.1 Measuring Semantic Similarity
    3.5.2 The Negation Feature
    3.5.3 The Importance of Being a Verb
    3.5.4 The Importance of Being a Named Entity
    3.5.5 Applying Frame Semantic Analyses
  3.6 Summary

4 A Pure Entailment Evaluation: Experiments, Results and Discussion
  4.1 The Evaluation Framework
  4.2 Selecting the Best System's Features
  4.3 Experiments, Results and Discussion
  4.4 Comparative Evaluation
  4.5 Additional Experiments
    4.5.1 The 3-way RTE Classification Problem
    4.5.2 Dealing with Paraphrases
  4.6 Summary

5 Applicability in other NLP Areas
  5.1 Textual Entailment in Question Answering
    5.1.1 The Answer Validation Exercise Competition
    5.1.2 The QALL-ME Entailment-based Question Answering System
  5.2 Textual Entailment in Automatic Text-Summarization
    5.2.1 Brief Text Summarization Background
    5.2.2 The Approach
    5.2.3 Evaluation: Experiments and Discussion
  5.3 Textual Entailment Recognition for Linking and Disambiguating Wikipedia Categories to WordNet
    5.3.1 Adapting the Textual Entailment System
    5.3.2 Methods Used for Comparison
    5.3.3 The Evaluation

6 Conclusions and Future Work
  6.1 Conclusions
    6.1.1 Main Contributions
  6.2 Future Work
  6.3 Selected Scientific Output

7 Software Developments
  7.1 VerbNet Wrapper in Java
  7.2 VerbOcean Wrapper in Java
  7.3 Frame-to-Frame Similarity Demo in Java
  7.4 The FrameNet-WordNet Alignments
  7.5 Entailment-based QA System Demo (Spanish QALLME-demo)

References

A The PASCAL Recognizing Textual Entailment Challenges
  A.1 RTE Official Results

B The Answer Validation Exercise Official Results
  B.1 AVE Official Results

C Information Gain Achieved by the System Features Regarding the RTE Development Corpora
  C.1 The Information Gain Bar Graphics for All System Features

D Síntesis en Castellano
  D.1 Introducción
    D.1.1 Motivación
  D.2 Estado de la cuestión
  D.3 Sistema de reconocimiento de implicación textual basado en perspectivas
    D.3.1 Perspectiva léxica
    D.3.2 Perspectiva sintáctica
    D.3.3 Perspectiva semántica
  D.4 Evaluación
    D.4.1 Evaluación comparativa
    D.4.2 Experimentos adicionales
  D.5 Aplicabilidad en otras tareas de PLN
    D.5.1 Implicación textual en Búsqueda de Respuestas
    D.5.2 Implicación textual en generación de resúmenes
    D.5.3 Implicación textual en asociar categorías de Wikipedia y glosas de WordNet
  D.6 Conclusiones
    D.6.1 Principales contribuciones
    D.6.2 Trabajo futuro

E Bio-sketch and Research Projects Relative to this Thesis
List of Tables

2.1 WordNet 3.0 statistics
2.2 FrameNet 1.3 Frame-to-Frame relations
2.3 VerbOcean statistics
2.4 Examples of text-hypothesis pairs taken from the RTE corpora
3.1 Levenshtein distance between "Saturday" and "Sunday"
3.2 An example of the calculation of the Smith-Waterman distance
3.3 Weights assigned to the grammatical categories
3.4 Weights assigned to the grammatical relationships
3.5 Frame-to-Frame: FrameNet relation weights
3.6 FrameNet-WordNet alignment: FrameNet relation weights
3.7 FrameNet-WordNet alignment: WordNet relation weights
3.8 FrameNet-WordNet alignment: results on Tonelli's dataset
4.1 The 10-fold cross-validation accuracy values obtained by each lexical and syntactic feature
4.2 The 10-fold cross-validation accuracy values obtained by each semantic feature and all features combined
4.3 The best lexical and semantic feature sets obtained with regard to each RTE development corpus
4.4 The best feature sets (all perspectives) obtained with regard to each RTE development corpus
4.5 The set and the set of features
4.6 RTE-2 and RTE-3 results
4.7 RTE-4 results
4.8 RTE results considering the task as another feature
4.9 Oracle results for the RTE corpora
4.10 RTE results applying the verb and entity constraints
4.11 The precision values achieved by the entity and verb constraints
4.12 Comparative results for the RTE-2 2006 challenge
4.13 Comparative results for the RTE-3 2007 challenge
4.14 Comparative results for the RTE-4 2008 challenge
4.15 Examples of UNKNOWN and CONTRADICTION text-hypothesis pairs
4.16 RTE-4 3-way classification results
4.17 When two entailment relations are a paraphrase
4.18 MSRPC results
5.1 The QALL-ME project: evaluation results
5.2 DUC 2002 results for single- and multi-document tasks
5.3 System results linking and disambiguating Wikipedia categories to WordNet
A.1 Official results for the RTE-1 2005 challenge
A.2 Official results for the RTE-2 2006 challenge
A.3 Official results for the RTE-3 2007 challenge
A.4 Official results for the RTE-4 2008 challenge
B.1 English official results for the AVE 2006 track
B.2 English official results for the AVE 2007 track
B.3 English official results for the AVE 2008 track
B.4 Spanish official results for the AVE 2008 track
D.1 Resultados para RTE-2 y RTE-3
D.2 Resultados para RTE-4
D.3 Resultados RTE considerando la tarea como una característica más
D.4 Resultados obtenidos por el oráculo para los corpus del RTE
D.5 Resultados aplicando las restricciones sobre los corpus del RTE
D.6 Evaluación comparativa con los participantes del RTE-2 2006
D.7 Evaluación comparativa con los participantes del RTE-3 2007
D.8 Evaluación comparativa con los participantes del RTE-4 2008
D.9 Resultados para la tarea de tres tipos de implicación del RTE-4
D.10 Cuando dos implicaciones determinan paráfrasis
D.11 Resultados obtenidos sobre el corpus de Microsoft de paráfrasis
D.12 Resultados para la tarea de asociar categorías de Wikipedia a glosas de WordNet
List of Figures

1.1 The human brain has always been an inspiration for artificial intelligence researchers
1.2 Visual example of language variability and ambiguity
1.3 The syntactic tree for the sentence "Pianists practice scales"
2.1 Architecture of the Wang&Neumann System
2.2 The Tree Skeleton example of the sentence "A typhoon batters the Philippines"
2.3 An example of an environment in the TALP system for the sentence "Romano Prodi is the prime minister of Italy"
2.4 An example of a first-order syntactic rewrite rule
2.5 An example of a T-H pair that activates the rule shown in Figure 2.4
2.6 The architecture of the GROUNDHOG system
2.7 The FreeLing toolkit
3.1 The system at a glance
3.2 The Consecutive Subsequence Matching measure
3.3 Syntactic perspective architecture
3.4 Distance between two synsets
3.5 Frame annotation of RTE-2 test pair id=55
3.6 Frame annotation of RTE-2 test pair id=132
3.7 Causative_of FrameNet relation between the Killing and Death frames
3.8 Frame annotation of RTE-2 test pair id=423 (text)
3.9 Frame annotation of RTE-2 test pair id=423 (hypothesis)
3.10 Frame-to-Frame similarity metric: visual example
3.11 Visual example of the FrameNet-WordNet alignment
4.1 The RTE test corpora statistics applying the constraints
5.1 The QALL-ME project: an example
5.2 The QALL-ME project: general infrastructure
5.3 The QALL-ME project: the inner architecture
5.4 The QALL-ME project: entailment candidates according to the query concept constraint
C.1 Information gain of lexical features for the RTE-2 development corpus
C.2 Information gain of syntactic-semantic features for the RTE-2 development corpus
C.3 Information gain of lexical features for the RTE-3 development corpus
C.4 Information gain of syntactic-semantic features for the RTE-3 development corpus
C.5 Information gain of lexical features putting together both corpora (RTE-2 and RTE-3 development corpora)
C.6 Information gain of syntactic-semantic features putting together both corpora (RTE-2 and RTE-3 development corpora)
D.1 Arquitectura del sistema
"The attempt to build machines that can do things that would require intelligence if done by humans."
Marvin Minsky (1927-)

1 Introduction

I had already started my DEA 1 thesis using Minsky's definition of Artificial Intelligence (AI). Among the plethora of AI definitions found in the bibliography, I chose one of Minsky's most famous statements because this sentence, used to define what AI is, speaks volumes. AI was born from the ambition of human beings to model human knowledge by means of machines. For many years, AI has been considered an area of strong interest within the Computation Sciences. Many subdisciplines hang from the tree represented by the concept of AI: disciplines such as Robotics, Cognitive Sciences, Speech Recognition and Computer Vision, to name but a few.

Natural Language Processing (NLP 2) is one of these disciplines. It first appeared in the forties with pioneering attempts at automatic text translation. These first attempts did not achieve the expected success, due to the limited capabilities of the computers of the time. However, the constant improvement of computer hardware has since allowed us to reach very promising achievements.

1 DEA comes from the Spanish "Diploma de Estudios Avanzados", which is an academic award obtained after justifying two years of research on a specific field.
2 The definition of NLP has been partially extracted from Wikipedia (http://www.wikipedia.org).

Figure 1.1: The human brain has always been an inspiration for artificial intelligence researchers.

NLP deals with statistical and/or rule-based natural language modelling from computational perspectives. NLP studies the problems that occur when a machine tries to interact with human beings. Broadly speaking, NLP applications are intended to cover two main areas: automatic language generation and natural language understanding. Natural language generation systems convert information from computer databases into normal-sounding human language, whereas natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

The aim of NLP applications is to simulate human linguistic behaviour; however, there is no consensus on how closely it should be simulated. In such a complex task, the computer has to be aware of the structures used by the language as well as the global discourse and the context in which the dialogue is taking place. Therefore, apart from dealing with the knowledge provided by the different linguistic analyses that a human unwittingly makes, the system has to be able to combine these analyses in the most appropriate way (Moreno et al., 1999).
1.1 The Richness of Language: a Huge Problem for Computers

Human languages are extremely rich and ambiguous, with the result that the same information can be expressed employing different words and linguistic structures. In other words, an ambiguous text might represent several distinct meanings, and a concrete meaning might be expressed in different ways. Therefore, one of the main aims of the research community is to build systems capable of managing the ambiguity and variability of language. Figure 1.2, extracted from (Glickman, 2006; Dagan et al., 2007; Dolan, 2007), depicts a typical and widely-used example of these language phenomena.

Figure 1.2: Visual example of language variability and ambiguity.

Language variability is a problem that must be solved in order to overcome the barrier that separates human understanding from that of the computer. As previously stated in (Kouylekov, 2006), there are many phenomena that characterize language variability; however, as an overview, we can differentiate three main types of language variability:

- Lexical: the speaker or writer uses different words to express the same information. A typical lexical transformation consists of using synonyms to change the words but not the final meaning of the sentence.
- Syntactic: changing the sentence structure without changing the meaning. A common example of this kind of language variability is the correspondence between active and passive sentences.
- Semantic: the most complex sort of variability. It implies some reasoning, which at times is rather difficult for computers to manage: reasoning about temporal and spatial expressions, about entities and the role that they play in the sentence, about the logical order of the events that happen in the dialogue and, most complicated of all, reasoning about world knowledge.

Controlling language variability is something which has not yet been attained. In terms of reasoning, there are many inferences easily detected by humans but extremely difficult for computers to address. Since its conception, NLP has attempted to solve language variability, and it has been traditionally associated with the processing of the following linguistic analyses:

- Lexical Analysis: within this analysis we distinguish between part-of-speech analysis and lexico-semantic analysis. The former assigns to each word its grammatical category (e.g. noun, verb, adverb, adjective, etc.) and, in some cases, it also develops finer-grained analyses by tagging more information such as gender and person, as well as the verb tense and modal verbs. The latter carries out the disambiguation of word senses, choosing the proper sense of the words in a sentence. As a result, systems performing this kind of analysis have to manage the ambiguity between words that have the same form but different grammatical categories and/or senses. For instance, in the following sentences:

  Pianists/NNS practice/VBP scales/NNS.
  Practice/NN makes/VBZ perfect/JJ.

  the word "practice" is a verb (tag VBP) in the first sentence and a noun (tag NN) in the second one. Obviously, they have different meanings. 3

3 The morphologic tags used in the example are those used in the Penn Treebank (see the University of Pennsylvania (Penn) Treebank Tag-set http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html).
- Syntactic Analysis: this analysis builds the syntactic representation of the sentences, usually using tree representations. It illustrates the syntactic relations between the words, showing which of them are more relevant to the sentence meaning, as well as grouping the words into constituents (e.g. noun phrase, prepositional phrase, etc.). Sometimes, depending on the final application, this analysis can be performed partially, detecting specific types of constituents instead of the entire sentence structure. This partial or shallow analysis, while incomplete, obtains better accuracy on the specific aspects desired by many NLP applications. Figure 1.3 shows the syntactic tree 4 for the previous sentence "Pianists practice scales":

  Figure 1.3: The syntactic tree for the sentence "Pianists practice scales".

- Semantic Analysis: consists of recognising semantic relations between words. For instance, the verbs "marry" and "divorce" are semantically related by a happens-before relation. 5 Many semantic relations can be expressed by the roles that a word and/or a constituent play in the sentence. A role represents a semantic relation between a constituent (normally a verb argument) and a predicate (normally a verb) (Moreda, 2008). Therefore, building the semantic structure of the sentence is of paramount importance in order to identify the sentence's roles (a task called ASRL, Automatic Semantic Role Labelling). To see this more clearly, let's look at an example:

  [The mysterious fighter ASSAILANT] attacked [the guardsman VICTIM] [with a sabre WEAPON]. 6

  The agent role is played by "The mysterious fighter", in this case the Assailant role, since the verb that evokes the agent role is "attack". The patient role is for "the guardsman", who is the victim of the attack. In addition, a few less important roles could also appear in the sentence; for example, in the previous sentence "with a sabre" is tagged with the role Weapon, which identifies the entity used by the assailant to cause damage to the victim. Therefore, semantic role labelling helps computers understand the sentence's meaning and permits the construction of a semantic structure representing what the text is talking about.

- Textual-Pragmatic Analysis: this analysis creates the final interpretation of the sentence meaning. It consists of finding the contextual references that are relevant to understanding the message. At this point, resources related to anaphora resolution, topic detection, temporal and spatial reasoning, as well as reasoning about the world knowledge that wraps up the message, are of great importance in building the textual-pragmatic representation.

Although these analyses can be managed independently, the techniques used to solve them share the knowledge provided by each one. Furthermore, this shared knowledge is used by researchers in order to increase the performance of systems tackling the whole task of understanding human language (i.e. controlling language variability). Consequently, a suitable combination of these analyses is the key to achieving computers that emulate the human brain when they process language.

4 The syntactic tree was generated by the phpSyntaxTree tool, http://ironcreek.net/phpsyntaxtree/.
5 This relation has been extracted from the VerbOcean resource (Chklovski & Pantel, 2004).
6 Example extracted from FrameNet 1.3 data (Baker et al., 1998). The FrameNet resource will be explained in detail in section 2.2.
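As a small, concrete illustration of the part-of-speech tagging described in the lexical analysis above, the following minimal Java sketch tags the two "practice" sentences. It assumes the Stanford POS tagger and one of its pre-trained English models are available locally; the tagger choice and the model path are illustrative assumptions, not part of this thesis's toolchain (which relies on FreeLing, among other tools, described in chapter 2).

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class PosAmbiguityDemo {
        public static void main(String[] args) {
            // The model path is an assumption; it must point to a model
            // file distributed with the tagger.
            MaxentTagger tagger =
                    new MaxentTagger("models/english-left3words-distsim.tagger");

            // The same surface form "practice" receives a different tag
            // in each syntactic context.
            System.out.println(tagger.tagString("Pianists practice scales."));
            // expected output similar to: Pianists_NNS practice_VBP scales_NNS ._.
            System.out.println(tagger.tagString("Practice makes perfect."));
            // expected output similar to: Practice_NN makes_VBZ perfect_JJ ._.
        }
    }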
Moreover, at this point it would be useful, for the context of this thesis, to describe some of the most important NLP tasks, which are of strong interest within the NLP research community:

1. Information Retrieval (IR): consists of searching for documents, and for information within documents, that are relevant to user needs. The retrieval engine is based on a user query, which is commonly made up of terms that are likely to appear within the expected documents. For instance, if one of the user terms is "rose", a document talking about kinds of flowers could be interesting for the user.

2. Information Extraction (IE): is a type of information retrieval whose goal is to automatically extract structured information (i.e. categorized and contextually and semantically well-defined data from a certain domain) from unstructured machine-readable documents. 7

3. Question Answering (QA): the aim of QA is to return a precise and coherent answer to a specific given question. Traditionally, the answer is extracted from large text collections, and QA systems have to process these collections and apply several techniques in order to retrieve the target answer.

4. Text Summarization (SUM): automatic text summarization is the creation of a shortened version of a text by a computer program, such that the product of this procedure still contains the most important points of the original text. In recent years, due to the vast amount of information available, especially since the growth of the Internet, automatic summarization tools are required in order to help users manage all the information available.

7 Definition from Wikipedia - the free encyclopedia http://www.wikipedia.org.

1.2 The Practical Motivation

The global motivation comes from the need to automatically extract knowledge from structured and non-structured data. This need has become acute with the dramatic growth of digital information: we are witnessing an impressive expansion of the Digital Age, giving rise to an unbelievable increase in on-line information and resources. Hence, NLP is an important research field for the global research community, and any approach capable of supporting the task of selecting, classifying, assimilating, retrieving, filtering and exploiting information, in order to enrich our collective and individual knowledge and skills, will be welcomed by everyone.

More precisely, regarding the context of this thesis, the practical motivation relies on the fact that many applications in many NLP areas are highly influenced by the problem of language variability. Moreover, because the research lines of our group cover many NLP fields, we were eager to use our textual entailment system to support them. First, however, we are going to pinpoint the textual entailment problem.

1.2.1 Addressing the Textual Entailment Problem

While the problem of language variability is the general context of this work, our research focuses concretely on textual entailment. Textual entailment has been defined as a generic framework for modelling semantic variability, which appears when a concrete meaning is described in different manners, as proposed by Dagan & Glickman (2004). Hence, language variability can be addressed by defining the concept of textual entailment as a one-way meaning relation between two text snippets (Glickman, 2006). Given two coherent fragments of text, according to the definition of textual entailment, the meaning of one of them must entail the meaning of the other; should this not occur, the entailment relation does not hold. The snippet that permits the meaning inference is traditionally called T (the text) and the other, whose meaning is deduced, is named H (the hypothesis). For the sake of clarity, throughout this thesis, when we cite the two texts involved in an entailment relation we will call them "text" or T and "hypothesis" or H, following the textual entailment terminology. The next example shows a true entailment relation:

T: Yahoo acquired Overture.
H: Yahoo owns Overture.

T entails H since, when a company acquires another company, the former is the owner of the latter.

It is also important to introduce the concept of paraphrasing, which is closely related to the textual entailment phenomenon. Both consist of recognising and characterizing when two texts, although superficially distinct, overlap semantically.
However, as previously stated, while textual entailment only considers unidirectional meaning relations, paraphrases are bidirectional. Two texts can be considered paraphrases when their meanings are so close that they could be interchangeable in many contexts. Therefore, we can assume that paraphrases are linked by bidirectional entailments. Following the previous textual entailment example:

T: Yahoo acquired Overture.
H: Yahoo owns Overture.
T′: Yahoo bought Overture.

although T entails H, H does not entail T, because being the owner of a company does not imply its prior acquisition, so there is no paraphrase between T and H. Nevertheless, T′ is actually a paraphrase of T, since T entails T′ and T′ entails T; moreover, they are utterly interchangeable. Due to the fact that paraphrasing and textual entailment share a common target, there are many approaches that use and exploit paraphrasing resources in order to solve the entailment problem, and vice versa.

Therefore, solving textual entailment relations would help many NLP applications to increase their final performance by means of correct language variability disambiguation. Indeed, as previously mentioned, this is the practical motivation of this thesis. The following subsections describe how the underlying semantics of textual entailment are related to specific NLP core tasks such as QA, IR, IE and SUM, to name but a few.

Question Answering

Within the procedure of collecting the right answer, it often occurs that candidate answers are expressed in different syntactic and semantic ways, and the QA system has to decide which is the most appropriate for the given question. To achieve this, a textual entailment component could help such systems to weight the most suitable answers from the whole candidate answer set, according to a hypothesized answer, the question and the piece of text from which the answer was extracted.

For instance, the following example, extracted from (Roth, 2005), shows how a textual entailment system positively weights an answer as the final one from the set of candidate answers:

Given Question: Who acquired Overture?
Candidate Answer: Yahoo
Extracted from: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.

In this case, the textual entailment component is able to recognise the meaning relation between the sentence "Yahoo acquired Overture", which is considered as H (the hypothesis) and is formed from the question and the hypothesized answer, and the piece of text used to obtain the candidate answer, which is considered as T (the text). For this reason, this candidate answer would be the one presented to the user.

Information Retrieval

It may happen that the terms used by the user to express their needs are insufficient for retrieving relevant documents. This occurs because most users formulate their queries employing terms that they expect to appear in a relevant document, and sometimes this is not the case. For instance, for the terms "soft" and "beverages", documents about Coca-Cola may be relevant, although the words "soft" or "beverages" may be absent from those documents. To overcome this lack of useful related terms, textual entailment techniques could be used. Incidentally, these techniques would provide more sophisticated knowledge than the widely-used query expansion by synonyms, hyperonyms, etc. Obviously, the less ambiguous the terms are, the less uncertainty there will be within the retrieval process.

Information Extraction

In IE, systems need to identify the various ways in which a relation can be expressed in order to fill in the template that captures the information solicited. As with the aforementioned NLP tasks, the application of textual entailment techniques would help to properly fill in the template. However, in contrast to QA and IR, where the user query is a priori unknown and open-domain, in IE the templates are static and belong to specific domains; thus, the textual entailment engine can be specialized to increase its performance for the proposed domain.
Text Summarization

To make a coherent summary, the summarization system has to take into account several variables such as length, writing style and syntax. The detection of textual entailment phenomena would therefore help in deducing when different expressions found in the document express the same concept or idea, avoiding redundancy in the final summary. In multi-document summarization, which consists of obtaining a unique summary from several documents, the entailment relations should be sought between expressions that belong to different documents, so that just one of them is included in the summary.

1.3 This Thesis

This thesis presents the research work carried out on textual entailment under the graduate studies program 8 within the Department of Software and Computing Systems at the University of Alicante. An end-to-end textual entailment system has been developed taking into account different entailment levels. As will be detailed throughout this thesis, we distinguish between lexical, syntactic and semantic entailment inferences, and consequently our entailment system is focused on these three perspectives.

1.3.1 Reader's Guide

Chapter 2 provides the related work relevant to this thesis. Apart from detailing the most up-to-date textual entailment systems, this chapter also describes the resources used for the system's development and testing, as well as the two main competitions and/or workshops in this field.

8 Graduate Studies Program with Quality Mention by the Spanish Government for the years 2005-2006 (BOE 14/07/2005 Ref. MCD-2005 00095), 2006-2007 (BOE 30/08/2006 Ref. MCD-2005 00095), 2007-2008 (BOE 12/10/2007 Ref. MCD2005-00095) and 2008-2009 (BOE 12/11/2008 Ref. MCD2005-00095).

Chapter 3 comprises the detailed description of our system, carefully explaining each inference used by the three perspectives and how they assist in solving entailment relations.

Chapter 4 discusses the framework in which the system is evaluated, as well as analysing the results obtained.

Chapter 5 presents the applicability of our system in NLP tasks other than pure textual entailment recognition. This permits us to make an extrinsic system evaluation, assessing the gain of applying textual entailment techniques to QA and SUM, among others.

Chapter 6 gives conclusions together with some thoughts for future work.

Chapter 7 presents the software developments carried out in this thesis.

Appendix A illustrates the official results of each PASCAL Recognizing Textual Entailment challenge.

Appendix B shows the official results corresponding to the different Answer Validation Exercise competitions.

Appendix C presents the bar graphs for the information gain values of each system feature.

Appendix D provides a summary of this thesis in Spanish.

Appendix E gives a brief bio-sketch of the author and a summary of the research projects relative to this thesis.
2 Related Work and Relevant Resources and Tools

Detecting and categorizing semantic relations are long-pursued challenges that encompass issues of lexical modification, syntactic alternation, reference and discourse structure, and world-knowledge comprehension. The last few years have seen a surge of interest in modelling and designing systems aimed at measuring semantic equivalence and capable of addressing the uncertainty that underlies textual entailment relations.

As a fundamental background in this area, this chapter describes some of the most relevant techniques used to solve textual entailment. Moreover, after delving into the most relevant textual entailment approaches, we detail the two competition and workshop series that have showcased, over time, the most significant achievements in this research field: (1) the Answer Validation Exercise (AVE) track 1 within the Cross-Language Evaluation Forum (CLEF) 2; and (2) the PASCAL Recognizing Textual Entailment (RTE) Challenges 3 series.

1 http://nlp.uned.es/clef-qa/ave/
2 http://www.clef-campaign.org/
3 http://pascallin.ecs.soton.ac.uk/challenges/rte
2.1 Related Work History

To address the uncertainty that underlies textual entailment relations, researchers have proposed in recent years a wide variety of approaches which, from a technical point of view, we group into lexical, syntactic, semantic and logic models. Of course, there are also approaches implemented as a combination of these, which seems reasonable in order to manage the huge diversity of entailments. The following sections describe each model and some of the most important research carried out within it.

2.1.1 The Lexical Model

Although the lexical model is the simplest one, it is also the basis of almost all textual entailment systems. It represents the text as a bag-of-words and detects true entailment relations if each element in H can be implied lexically by one or more elements in T. These implications can be direct, using matching techniques between lemmas, or indirect, by means of some lexical transformation exploiting lexical databases (e.g. lexical derivations within WordNet 4 (Miller et al., 1990)). Surprisingly, although this model lacks the deeper semantic knowledge necessary for detecting entailment relations, it obtains very competitive results. For instance, approaches designed following this model are able to recognise entailment relations such as:

T: Accardo was the winner of the Paganini Competition in Genoa.
H: Accardo won the Paganini Competition in Genoa.
("winner" is a morphological derivation of "win")

However, it is fairly easy to find an example for which this model would make a wrong entailment decision. For instance, this model would mark the following pair of texts as a correct entailment relation:

T: Virginia was born in Bunbury (Australia).
H: Bunbury was born in Virginia.

4 The lexical database WordNet can be downloaded at http://wordnet.princeton.edu/. This resource is explained in detail in section 2.2.
Therefore, deeper analysis is of paramount importance in order to make the lexical model more robust. For this reason, the lexical model is usually supported by some kind of syntactic representation, such as noun phrases, prepositional phrases, etc. Thus, instead of trying to match the texts as bags-of-words, the matching procedure is carried out over the chunks obtained in previous analyses. Considering them, the above example would be tagged as a false entailment, since the H chunk "in Virginia", which is a prepositional phrase, does not correspond to any chunk of T.

The next paragraphs describe relevant related work based largely on the lexical model, emphasizing the influence of lexical inferences within the entailment phenomenon.
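To make the lexical model concrete, here is a minimal sketch of its core bag-of-words check (an illustration of the general idea, not the implementation of any particular system discussed in this chapter). Every token of H must be supported by a token of T, either directly or through a lexical variant; the variant lookup is stubbed out, since it would rely on an external resource such as WordNet.

    import java.util.*;

    public class LexicalOverlapEntailment {

        /** Returns true if every token of H can be matched by some token of T. */
        public static boolean entails(List<String> textTokens, List<String> hypTokens) {
            Set<String> bag = new HashSet<>(textTokens);   // T as a bag-of-words
            for (String h : hypTokens) {
                if (!bag.contains(h) && Collections.disjoint(bag, lexicalVariants(h))) {
                    return false;                          // h has no lexical support in T
                }
            }
            return true;
        }

        /** Stub: would return synonyms/derivations of w from a resource such as WordNet. */
        static Set<String> lexicalVariants(String w) {
            // e.g. lexicalVariants("winner") could contain "win"
            return Collections.emptySet();
        }

        public static void main(String[] args) {
            // Tokens are assumed to be lemmatized and lower-cased beforehand.
            List<String> t = Arrays.asList("virginia", "be", "bear", "in", "bunbury", "australia");
            List<String> h = Arrays.asList("bunbury", "be", "bear", "in", "virginia");
            // Prints true: the model wrongly accepts the pair, as discussed above.
            System.out.println(entails(t, h));
        }
    }

Run on the Virginia/Bunbury pair, the check returns true, reproducing exactly the false positive that motivates the chunk-based refinement just described.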
The Athens University of Economics and Business Textual Entailment System

The most relevant works on the lexical model are probably the ones presented in (Malakasiotis & Androutsopoulos, 2007; Galanis & Malakasiotis, 2008). The second research work, which will be discussed in detail, is an extension of the first, and both wrap up the participations of the Athens University of Economics and Business in the Recognising Textual Entailment (RTE) Challenges. 5

The authors implement a Maximum Entropy classifier trained on several string similarity measures applied to the text-hypothesis pairs. They convert the original T-H pairs into different pairs of strings: considering the original tokens; their stems; their parts-of-speech; their soundex codes 6; considering just the nouns; or considering just the verbs. Also, when T is longer than H, they build all possible substrings of T of the same length as H, and obtain the pairs of strings of these substrings as well. As a preprocessing step, they use WordNet in order to replace each word in H with a synonym of it that appears in T. Finally, they compute several string similarity measures, such as the Levenshtein distance, cosine similarity, n-gram distance, Manhattan distance, etc., over the pairs of strings obtained previously. Each measure serves as a feature for the aforementioned machine learning algorithm.

5 The RTE Challenges are a series of workshops that establish an evaluation framework for textual entailment systems, http://pascallin.ecs.soton.ac.uk/challenges/rte/. The RTE workshops are explained in section 2.3.
6 Soundex is an algorithm intended to map English names to alphanumeric codes, http://en.wikipedia.org/wiki/soundex.
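Among these measures, the Levenshtein distance reappears several times in this chapter (here as a similarity feature, later in the FBK-Irst and UNED systems). The following is a minimal sketch of its standard dynamic-programming formulation, plus a length-normalized variant that can be compared against a threshold; the normalization shown is one common choice, not necessarily the exact one used by the systems cited.

    public class Levenshtein {

        /** Minimum number of character insertions, deletions and
         *  substitutions needed to turn string a into string b. */
        public static int distance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
            for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int subst = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(
                            d[i - 1][j] + 1,           // deletion
                            d[i][j - 1] + 1),          // insertion
                            d[i - 1][j - 1] + subst);  // substitution or match
                }
            }
            return d[a.length()][b.length()];
        }

        /** Distance normalized by the longer string's length, in [0,1]. */
        public static double normalized(String a, String b) {
            int max = Math.max(a.length(), b.length());
            return max == 0 ? 0.0 : (double) distance(a, b) / max;
        }

        public static void main(String[] args) {
            System.out.println(distance("Saturday", "Sunday"));      // 3
            System.out.println(normalized("Yasser", "Yaser") < 0.2); // true
        }
    }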
To select the most meaningful features, the authors apply a procedure conducted by 10-fold cross-validation over the training corpora. They experiment with two distinct feature selection processes: (i) Forward Hill-climbing, which starts with an empty feature set, to which it adds features one at a time, preferring to add at each step the feature that leads to the highest predictive power; and (ii) Forward Beam Search, which is similar, except that the search frontier contains the k best examined feature subsets at each step (they set k equal to ten).

The results achieved by this system are quite promising, overcoming the baselines proposed in the RTE challenges, and its performance is comparable with the majority of current textual entailment systems.

Other Works on this Model

Other works also compute an antonym matching based on WordNet. In (Settembre, 2007; Montalvo-Huhn & Taylor, 2008), apart from considering the synonyms of the words in H, the authors also try to match the antonyms of the words in H with words in T. This helps to detect when T contradicts H. Although the common trend in the research community is to use a machine learning algorithm to decide the entailment, in these works the authors learn the decision threshold by maximizing the system's score on the training data.

In (Adams et al., 2007), two textual entailment approaches are presented. The first one is based primarily on the concept of lexical overlap, considering a bag-of-words similarity overlap measure to form a mapping of terms in the hypothesis to the source text. The second system is a lexico-semantic matching between the text and the hypothesis that attempts an alignment between chunks in the hypothesis and chunks in the text, and a representation of the text and hypothesis as two dependency graphs. Both approaches employ decision trees as a supervised learning algorithm.

Surprisingly, the first approach, while simpler in concept, outperforms the second in every experiment carried out. As stated in the paper, it seems puzzling that a simple approach has outperformed one that takes advantage of a deeper analysis of the text. However, the simple one treats the text naively, as a bag-of-words, and does not rely on any preprocessing application, whereas the complex approach uses other systems, such as a coreference resolver and a dependency and semantic parser, and its performance is limited by the performance of these tools.

2.1.2 The Syntactic Model

The syntactic model usually represents the snippets as dependency trees and determines the entailment by means of a similarity function between the trees. Additionally, we frequently have to consider several transformations of the trees, as well as the grammatical function performed by each branch or node of the trees. Within this model, it is common to implement some variant of the edit distance between dependency trees as a similarity function, as well as to train a machine learning algorithm in order to set the decision threshold for entailment relations.

The PhD dissertation presented in (Kouylekov, 2006) is a clear example of this model. It presents a textual entailment system using a Tree Edit Distance algorithm between the syntactic trees of the two texts. The author considers that an entailment relation is related to the ability to show that the whole content of H can be mapped into the content of T. This mapping can be described as the sequence of editing operations (i.e. insertion, deletion, substitution) needed to transform T into H, where each edit operation has a cost associated with it. An entailment relation is assigned if the overall cost of the transformation is below a certain threshold, empirically estimated from the training data. The author adapts the tree edit distance to deal with dependency trees and introduces some constraints that modify the cost of a substitution depending on whether it involves a considerable change of meaning or not. With this system they participated in the first RTE challenge (Kouylekov & Magnini, 2005), obtaining very good results.
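The decision rule just described can be stated compactly. Writing γ(e) for the cost of a single edit operation (inserting, deleting or substituting a node of the dependency tree) and E(T → H) for the set of edit sequences that transform the tree of T into the tree of H, the approach computes (our reconstruction of the criterion described above, not a formula copied from the cited work):

\[
\mathrm{ted}(T,H) = \min_{s \,\in\, E(T \rightarrow H)} \sum_{e \,\in\, s} \gamma(e),
\qquad
T \text{ entails } H \iff \mathrm{ted}(T,H) \le \tau
\]

where the threshold τ is estimated empirically on the training data.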
In addition, in the syntactic model the subject/object functions, the modifiers, the verb voice (active or passive) and the verb tense have strong importance. If the entailment relation exists, these linguistic functions in H have to find their counterparts in T. Following the previous example (T: "Virginia was born in Bunbury (Australia)" and H: "Bunbury was born in Virginia"), the syntactic model would tag this pair as a false entailment relation, since Bunbury plays the subject role in H but not in T. However, for the next two sentences:

T: The girl in red was attacked by a group of thieves.
H: A group of thieves attacked the girl in red.

although the object of T is the subject of H, the two sentences express the same idea, because T is in the passive voice and H in the active voice. Such situations have to be controlled by the syntactic model.

Next, several relevant works developed using mainly the syntactic model are presented.

The Wang&Neumann System - Saarland University & DFKI, Saarbruecken, Germany

Throughout its last participations in the AVE 7 and RTE challenges, the system presented in (Wang & Neumann, 2007b; Wang & Neumann, 2007a; Wang & Neumann, 2008b; Wang & Neumann, 2008a) has moved from a purely syntactic approach, in the sense that it only performed dependency parsing, to the development of specialized RTE-modules capable of tackling more entailment phenomena. Figure 2.1 shows the architecture of the system.

Figure 2.1: Architecture of the Wang&Neumann System.

For preprocessing, they use a PoS tagger, a dependency parser 8 and a Named Entity (NE) recognizer 9 in order to annotate the original plain texts. The Precision-Oriented (PO) modules are created to specialize the system in the RTE task. They implemented three PO-modules:

7 The AVE competition provides a framework to validate the correctness of the answers given by a Question Answering system, http://nlp.uned.es/clef-qa/ave/. The AVE workshops will be explained in section 2.4.
8 The dependency parser used is MINIPAR (Lin, 1998a).
9 They use the Stanford NE recognition system (Finkel et al., 2005).
1. The Time Anchoring Component for Textual Entailment (TACTE): deals with cases which contain temporal expressions. It extracts the corresponding events (i.e. the verbs or nouns that the temporal expressions modify) from the dependency tree and applies entailment rules between these Event-Time-Pairs. Such a pair consists of a noun or a verb denoting the event and its associated temporal expression. In order to resolve the relation between two Event-Time-Pairs, they separately resolve the relation between the events and the relation between the temporal expressions. For the former, they use lexical resources such as WordNet (Miller et al., 1990) and VerbOcean (Chklovski & Pantel, 2004) to discover the relationship between the two events (i.e. nouns or verbs); for the latter, they manually define entailment rules regarding the granularity 10 of the temporal expressions and their types. 11

10 The granularity order used to normalize the temporal expressions is the following: second < minute < hour < part of day < day of week < week number < part of month < month < part of year < year.
11 Two types: (i) specific time points (e.g. "on the 6th of May") and (ii) intervals (e.g. "from Wednesday to Friday").

2. The NE-Oriented Module: was created to deal with the other types of NEs. Similar to the TACTE module, it finds the corresponding events (i.e. the nearest parent nodes to the NEs on the dependency tree which are verbs or nouns) and applies entailment rules between these Event pairs. They consider person names, location names and organization names as entities.
3. The Tree Skeleton Module: this module extracts a new sentence representation based on the dependency parse trees. This representation is an extended version of the predicate-argument structure, since it contains not only the predicate and its arguments, but also the dependency paths in between, and it captures the essential part of the sentence. For the final step, this module uses a kernel-based machine learning method to decide the entailment. The algorithm first selects overlapping topic words (i.e. nouns) in T and H (using fuzzy matching at the substring level). Starting with these nouns, the algorithm traverses the dependency tree to identify their lowest common ancestor node (i.e. the root node). The sub-tree rooted there is defined as a Tree Skeleton. Figure 2.2 depicts an example.

Figure 2.2: The Tree Skeleton example of the sentence "A typhoon batters the Philippines".

The cases which cannot be covered by any specialized PO RTE-module are passed to the high-coverage, but probably less accurate, Backup Modules. Two backup modules are implemented:
1. The Triple Backup Module: is based on the Triple similarity function, which operates on two sets of triples (dependency structures represented in the form ⟨head, relation, modifier⟩) and determines how many triples of H are contained in T. The similarity function uses approximate matching, and the overall sum of the similarity weights is divided by the cardinality of H for normalization.

2. The Bag-of-Words Module: is based on a similarity score calculated by dividing the number of overlapping words between T and H by the total number of words in H, after a simple tokenization according to the spaces between words.

The final stage joins the results of all specialized RTE-modules and backup modules by means of a voting strategy. To do this, different confidence values are assigned to each RTE-module according to its performance on the training data. The results achieved by this system rank among the three best results of the last RTE Challenge edition (Giampiccolo et al., 2008a).
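Both backup similarity functions are simple enough to sketch. The following toy implementation (ours, for illustration; it uses exact matching where the original module uses approximate matching and weighted similarities) scores a pair by triple containment and by word overlap, each normalized by the size of H:

    import java.util.*;

    public class BackupModules {

        /** A dependency triple <head, relation, modifier>. */
        record Triple(String head, String relation, String modifier) {}

        /** Fraction of H's triples contained in T's triple set. */
        static double tripleScore(Set<Triple> textTriples, Set<Triple> hypTriples) {
            if (hypTriples.isEmpty()) return 0.0;
            long matched = hypTriples.stream().filter(textTriples::contains).count();
            return (double) matched / hypTriples.size();
        }

        /** Fraction of H's tokens that also occur in T, after whitespace tokenization. */
        static double bagOfWordsScore(String text, String hyp) {
            Set<String> t = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
            String[] h = hyp.toLowerCase().split("\\s+");
            long overlap = Arrays.stream(h).filter(t::contains).count();
            return (double) overlap / h.length;
        }

        public static void main(String[] args) {
            Set<Triple> t = Set.of(new Triple("batter", "subj", "typhoon"),
                                   new Triple("batter", "obj", "Philippines"));
            Set<Triple> h = Set.of(new Triple("batter", "obj", "Philippines"));
            System.out.println(tripleScore(t, h));  // 1.0: H's only triple is found in T
            System.out.println(bagOfWordsScore("A typhoon batters the Philippines",
                                               "The Philippines is battered"));  // 0.5
        }
    }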
FBK-Irst Textual Entailment System

Throughout its participations in the AVE and RTE Challenges (Kouylekov & Magnini, 2005; Kouylekov & Magnini, 2006; Kouylekov et al., 2006; Cabrio et al., 2008a), the FBK-Irst system has mainly tackled the entailment phenomenon from a syntactic perspective. Their latest textual entailment system, apart from determining the entailment by a tree edit distance algorithm over dependency trees, also integrates some lexical inferences and reasoning about negation terms.

The main idea is to compute a distance as the cost of the editing operations (i.e. insertion, deletion and substitution) needed to transform T into H. To achieve this, the authors develop different specialized entailment engines (called EDITS, Edit DIstance Textual entailment Suite), where the cost of the edit operations is defined according to the linguistic phenomenon they cope with. Each EDITS engine is composed of three modules: (i) a distance algorithm, which determines the best (least costly) sequence of edit operations; (ii) a cost schema, which determines the cost of the three edit operations; and (iii) a set of entailment rules, each with an associated probability representing the degree of confidence in the rule. In EDITS, the entailment rules can be at different levels (e.g. lexical, syntactic, etc.) and either generated from existing resources (e.g. WordNet) or manually defined. The system presented is made up of two EDITS engines:

1. EDITS negation: sets specific costs for edit operations concerning negation (e.g. the insertion in T of a token from H that is preceded by a negation). The underlying intuition is that high edit costs for negative polarity items should prevent the assignment of positive entailment when one snippet contradicts the other. This module deals with overt negative markers ("not" and "n't"), negative quantifiers ("no", "nothing"), strong negative adverbs ("never") and antonyms derived from WordNet.

2. EDITS lexical: sets specific costs for edit operations considering WordNet similarities among words; specifically, they use the Lesk measure implemented in the WordNet::Similarity tool (Pedersen et al., 2004).

Two distance algorithms are used by the EDITS engines: (1) the Linear distance (a.k.a. the Levenshtein distance (Levenshtein, 1966)), applied by converting the texts into sequences of words; and (2) the Tree Edit Distance over the dependency trees of both T and H, where the edit operations refer to inserting, deleting or substituting a node within the dependency tree. Finally, the sum of the distances between T and H provided by each module is divided by the sum of the "no mapping" distances, equivalent to the cost of inserting the entire text of H and deleting the entire text of T. The resulting entailment score function ranges from 0 to 1.
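In other words, if d_m(T, H) denotes the distance returned by module m, the combined score (as we reconstruct it from the description above) is:

\[
\mathit{score}(T,H) = \frac{\sum_{m} d_m(T,H)}{\gamma_{\mathrm{ins}}(H) + \gamma_{\mathrm{del}}(T)}
\]

where γ_ins(H) is the cost of inserting the entire text of H and γ_del(T) is the cost of deleting the entire text of T, which bounds the score to [0, 1].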
The performance obtained by this system is quite encouraging, and it is one of the most relevant works in textual entailment research.

Other Works on this Model

Other research on the syntactic model includes the works presented in (Blake, 2007; Marsi et al., 2007; Nielsen et al., 2008), in which the inferences used to detect entailment relations are based on the ability of the system to process the information supplied by the dependency trees of the texts. Inferences regarding the functions played by the constituents in the tree are the essential building block of these systems.

As observed in the bibliography, building a system entirely based on syntactic information is insufficient to solve the entailment problem. Researchers usually support the syntactic dependencies with some kind of textual and/or lexical similarity between plain or preprocessed texts.

2.1.3 The Semantic Model

The semantic model uses many resources to model the semantic knowledge inherent in the texts. The semantics present in a text can be extracted by many distinct NLP techniques, as well as by using or constructing semantic resources and applications for this purpose. For instance, the recognition of NEs, temporal expressions, and quantities and/or numeric expressions are very relevant tasks for obtaining semantic inferences from texts. They help the computer to give more importance to these terms than to common nouns or other particles that often appear in the text. Additionally, the categorization of these entities into a predefined set of classes and the normalization of temporal and numeric expressions allow us to make appropriate correspondences between them. Some examples of this kind of reasoning are the following:

- NATO ↔ North Atlantic Treaty Organization. The detection of acronyms permits the discovery of correspondences between entities.
- 12/31/09 ↔ December 31st of 2009 ↔ New Year's Eve 2009; 1,233.000 ↔ more than a million. A suitable normalization of temporal and numeric expressions and intervals allows us to make proper correspondences between them.
- [Madrid ORGANIZATION] won the Champions League / [Madrid LOCATION] is the capital of Spain. Correctly categorizing the entities avoids wrong correspondences.

It is also very common to extract the semantics of the texts by representing them through semantic frames. A semantic frame represents a particular type of situation, object or event evoked by a specific lexical unit, which is a pairing of a word with a meaning. One frame can be evoked by one or more lexical units, and one lexical unit can evoke one or more frames (if more than one frame is evoked, the lexical unit is polysemous).
It is also very common to extract the semantics of the texts by representing them with semantic frames. A semantic frame represents a particular type of situation, object, or event evoked by a specific lexical unit, which is a pairing of a word with a meaning. One frame can be evoked by one or more lexical units, and one lexical unit can evoke one or more frames (if more than one frame is evoked, the lexical unit is polysemous). Each semantic frame integrates the several participants and props involved therein (the semantic roles) and the world knowledge associated with them. Let's see an example of a semantic frame extracted from FrameNet^12 (Baker et al., 1998):

Frame: ARREST
Definition: Authorities charge a Suspect, who is under suspicion of having committed a crime (the Charges), and take him into custody.
Roles:
  Authorities [Auth] (the Authorities charge the Suspect with committing a crime).
  Charges [Chrg] (Charges identifies a category within the legal system; it is the crime with which the Suspect is charged).
  Suspect [Susp] (the Suspect is taken into custody, under suspicion of having committed a crime).
Lexical Units that evoke the frame: apprehend.v, apprehension.n, arrest.n, arrest.v, book.v, bust.n, bust.v, collar.v, cop.v, nab.v, nick.v
Example: [The police Auth] ARRESTED [Harry Susp] [on charges of manslaughter Chrg].

Therefore, the aim of systems based on frame semantics consists of finding correspondences between the frames evoked in H and the ones evoked in T. These correspondences also have to involve the roles of each frame, taking into account which ones are the most relevant within the semantic inference process.

Finally, we would like to mention that research on this model also uses other semantic inferences based on other resources, as well as the generation of semantic repositories intended to establish semantic relations. Resources such as VerbNet^13 (Kipper et al., 2006) and VerbOcean^14 (Chklovski & Pantel, 2004), which will be explained in detail in section 2.2, encode different types of verb relations, and systems use them to infer these relations. It is also very common to implement strategies that build or consume repositories of paraphrases in order to increase the performance of systems solving the entailment problem.

12 A freely available lexical-semantic database, http://framenet.icsi.berkeley.edu/. This resource is described in detail in section 2.2.
13 http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
14 http://demo.patrickpantel.com/content/verbocean/
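A minimal sketch of such a frame-to-frame correspondence check follows; the dictionary-based frame representation and the pluggable role_match predicate are our own simplifications, not a published algorithm.

```python
# A frame as a name plus a mapping from role labels to filler strings,
# e.g. {"name": "ARREST", "roles": {"Auth": "The police", "Susp": "Harry"}}.

def frames_correspond(frame_t, frame_h, role_match):
    """A frame evoked in H corresponds to one in T if the frame names agree
    and every role filled in H is compatibly filled in T."""
    if frame_t["name"] != frame_h["name"]:
        return False
    return all(
        role in frame_t["roles"] and role_match(frame_t["roles"][role], filler)
        for role, filler in frame_h["roles"].items()
    )

t = {"name": "ARREST", "roles": {"Auth": "The police", "Susp": "Harry",
                                 "Chrg": "manslaughter"}}
h = {"name": "ARREST", "roles": {"Susp": "Harry"}}
assert frames_correspond(t, h, role_match=lambda a, b: a == b)
```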
The following paragraphs list some works focused on semantic inferences such as the ones previously mentioned.

The UNED Textual Entailment System

The UNED system is perhaps the clearest example of using NE inferences to face the textual entailment phenomenon. Through their participations in RTE and AVE (Rodrigo et al., 2006; Rodrigo et al., 2007b; Rodrigo et al., 2007a; Rodrigo et al., 2008b), the authors have progressively improved the system, but always towards obtaining a textual entailment system based on NE recognition.

The initial version of the UNED system based the final decision on detecting entailment between NEs. The system recognised the entities^15 and detected a true entailment relation when every entity in H was entailed by some entity in T. A named entity NE1 entails a named entity NE2 if the text string of NE1 contains the text string of NE2. However, some characters change in different expressions of the same NE (e.g. Yasser, Yaser, Yasir). Therefore, to detect the entailment in these situations, when the previous process failed the entailment decision took into account the Levenshtein edit distance (Levenshtein, 1966), in such a way that if two NEs differ by less than 20%, it is assumed that an entailment relation exists between them.

15 They used the Freeling NE Recognizer (Carreras et al., 2004) to recognise numeric expressions, proper nouns and temporal expressions.
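A minimal sketch of this two-step NE entailment test, reusing the levenshtein routine from the EDITS sketch above at character level (normalising the distance by the longer string is our assumption; the papers do not state the exact normalisation):

```python
def ne_entails(ne1, ne2):
    """NE1 entails NE2 if NE2's string is contained in NE1's or, failing
    that, if the two strings differ by less than 20%."""
    if ne2 in ne1:
        return True
    dist = levenshtein(ne1, ne2)  # strings iterate as character sequences
    return dist / max(len(ne1), len(ne2)) < 0.20

assert ne_entails("Yasser Arafat", "Yaser Arafat")  # one character apart
assert not ne_entails("Madrid", "Barcelona")
```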
Although with this system configuration they achieved very competitive results, in (Rodrigo et al., 2008b) the authors, apart from considering the previous inferences, decided to take a step forward by considering more complex inferences related to NEs and processing all of them as features for a Support Vector Machine (SVM) algorithm. They adopt a traditional entity/relation/attribute model:

Entity: something that has a distinct, separate existence, though it need not be a material existence.
Attribute: a property or abstraction of a characteristic of an entity.
Relation: a triplet that connects two entities.

They build a representation in which each hypothesis is mapped into a structure made up of a set of entities with their own attributes and relations. To construct these structures they apply a dependency parser (Lin, 1998a), and to detect entailment between them they use the lexical relations encoded in WordNet. As the authors mention in the paper, the new configuration of the system is in its early stages. However, the results achieved are very promising, showing that NEs play a crucial role within entailment relations.

The TALP Textual Entailment System

In its two RTE participations (Ferrés & Rodríguez, 2007; Ageno et al., 2008), the presented approaches follow a similar strategy, performing lexical, syntactic and semantic analyses and computing a set of semantic-based distances to determine the entailment. The first approach, which is the base of the second one, has two main components:

The Linguistic Processing: consists of a pipeline of general-purpose natural language processors that perform tokenization, morphological tagging, lemmatization, NE recognition, syntactic parsing (obtaining the constituents and their relations) and semantic labelling with WordNet synsets, Magnini's domain markers and EuroWordNet Top Concept Ontology labels. They used the Spear^16 parser in order to perform full parsing and robust detection of verbal predicate arguments. As a result they obtain a language-independent representation of the sentence (called environment). The environment is a semantic network where the nodes are the semantic units and the edges are their semantic relations. These units and relations belong to an ontology of about 100 semantic classes (such as person, city, action, magnitude, etc.) and 25 relations (mostly binary) between them (e.g. time of event, actor of action, location of event, etc.). Figure 2.3 illustrates the environment for the sentence Romano Prodi is the prime minister of Italy.

16 http://www.lsi.upc.edu/~surdeanu/spear.html
Figure 2.3: An example of an environment in the TALP system for the sentence Romano Prodi is the prime minister of Italy.

The Semantic-based Distance Measures: each environment is transformed into a labelled directed graph considering only unary and binary predicates. Then, over this representation, the system obtains a rich variety of lexico-semantic proximity measures by means of two components:

The Lexical component: considers the set of tokens occurring in both sentences and measures the token-level compatibility by word-form identity, lemma identity, overlapping of WordNet synsets, approximate string matching between NEs, etc.

The Semantic component: is computed over the graphs by measuring a strict and a loose overlapping of unary and binary predicates. The former requires that two predicates match exactly, with their arguments being lexically compatible, whilst the latter allows a relaxed matching of predicates by climbing up the ontology of predicates (obviously, loose overlapping also implies a penalty on the score that depends on the length of the path between the two predicates).

The whole set of measures computed by the two components is sent to a machine learning classifier in order to take the final entailment decision. Specifically, in their last RTE participation they used the AdaBoost algorithm implemented in Weka (Witten & Frank, 2005).
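The strict/loose predicate overlap can be sketched as follows; this is a simplification under our own assumptions: predicates are tuples of a predicate name plus arguments, is_ancestor stands in for the predicate ontology, and the path-length-dependent penalty is reduced to a constant.

```python
def predicate_overlap(preds_t, preds_h, is_ancestor, loose_penalty=0.5):
    """Strict matches score 1; loose matches (same arguments, predicate
    reachable by climbing the ontology) score with a penalty."""
    t_set = set(preds_t)
    score = 0.0
    for p in preds_h:
        if p in t_set:                          # strict overlap
            score += 1.0
        elif any(is_ancestor(q[0], p[0]) and q[1:] == p[1:] for q in t_set):
            score += loose_penalty              # loose overlap
    return score / len(preds_h) if preds_h else 0.0
```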
After analysing the errors derived from their first participation in RTE, the authors carried out several system improvements to deal with them:

The lack of a coreference component within the semantic analysis. In the current version of the system they introduce a simple in-house coreference solver implementation.

Poor accuracy of the NE recognizer. They implement a resegmentation and reclassification of the entities where needed. The resegmentation consists of (i) extending the original NE limits by two tokens; (ii) merging two or more contiguous NEs; (iii) splitting the NE into several; and (iv) reducing the size of the NE by taking away prefixes or suffixes. For reclassification they used additional resources such as: GNIS^17 for USA toponyms, Geonames^18 for toponyms outside the USA, WordNet toponyms, frequencies of the NE capitalized or not in a large corpus (the BNC^19 corpus), and the categories attached to the NE in Wikipedia. As a result, about 15% of the NEs were changed, with an accuracy rate of 90%.

The system fails to recognise compatible predicates, particularly in cases of synonymy not covered by WordNet, the entailment relation in WordNet, approximate string matching of NEs, and related words having different POS. To solve these, the authors enrich the system by including relations between actions and actors (e.g. work, worker), locations and inhabitants (e.g. Spain, Spanish), locations with and without trigger words (e.g. New York City and New York) and different forms of naming people (e.g. President Bush, Bush, Mr. Bush). They use the also-see and meronymy WordNet relations (e.g. I visited Madrid entails I visited Spain but not the inverse), the similar, stronger-than and happens-before VerbOcean relations (Chklovski & Pantel, 2004), different forms of acronym expansion, and predicates for managing dates (e.g. May 15th entails May).

Apart from these improvements, another interesting feature that the authors add to the system is a hypothesis classification.

17 http://geonames.usgs.gov/domestic/index.html
18 http://www.geonames.org/
19 http://www.natcorp.ox.ac.uk/
It consists of classifying the hypothesis into a set of possible classes. The classes are made up of three characters that can be instantiated as follows:

[ a s e * ] [ o p l e * ] [ o p l e * ]

The first character shows whether the H's predicate is an e = event, a = action, s = state or * = not covered; the second refers to the subject and the third to the object, taking the following values: o = organization, p = person, l = location, e = other or * = not covered. This classification is fairly straightforward, searching the syntactic information for the heads of the predicates occurring in H and for their arguments.

Unfortunately, although the results are similar to those of most current textual entailment systems and the research developed is quite encouraging, the new features added to the baseline system did not yield a general system improvement.

The SALSA RTE System

The SALSA system is the most representative example of using Frame Semantics in the textual entailment recognition task. Although attempts to integrate this type of information into a textual entailment system did not confirm the expected gain in performance, the investigations behind this work are very useful and interesting for further research. They participated in the second and third RTE Challenges (Burchardt & Frank, 2006; Burchardt et al., 2007).

Broadly speaking, the SALSA system combines deep syntactic analysis, the structured lexical meaning descriptions encoded in FrameNet (Baker et al., 1998) and a shallow component based on word overlap. The architecture of the SALSA system is based on three main components:

Linguistic analysis component: uses the probabilistic LFG grammar for English developed at PARC (Riezler et al., 2002) and a combination of systems for frame semantic annotation: the Shalmaneser system for frame and role annotation (Erk & Pado, 2006) and the rule-based Detour system for frame assignment (Burchardt et al., 2005). The linguistic analysis combines LFG f-structures and FrameNet frames computed by the Shalmaneser and Detour systems, resulting in a projection from f-structures to a semantic layer of frames and pseudo-predicates for f-structure predicates that do not project frames.
As well as this, semantic nodes are further projected into an ontological analysis layer containing WordNet (Miller et al., 1990) senses and SUMO (Niles & Pease, 2001) classes. Other semantic phenomena not treated by FrameNet, like anaphora, negation or modality, are encoded with special operators. Finally, as a result, layered graph structures for text and hypothesis are built.

Semantic overlap and match graphs: compares the LFG f-structures with their semantic and ontological projections by determining compatibility (i.e. matching nodes and edges). The result is stored in a match graph, which contains all pairs of matched nodes and edges. Nodes match if they are labelled with identical frames or predicates, or if the nodes are semantically related on the basis of WordNet or FrameNet frame relations. Edges match if they connect matching nodes or nodes taking identical atomic values.

Statistical entailment decision: given a match graph and the graphs for T and H, this module extracts features to train a machine learning model for textual entailment. These features express lexical, syntactic and semantic characteristics. For instance: lexical features count the number of lexical items, syntactic features record the number of LFG predicate matches, and semantic features distinguish between types of semantic node matches (e.g. identical or semantically related frames, modal properties, etc.). Also, they compute the number and size of connected items in the match graph, as well as their size in relation to that of the hypothesis graph. In its last configuration, the SALSA system uses Weka's LogitBoost machine learning algorithm (Witten & Frank, 2005).

Finally, they complement the system with a shallow lexical overlap measure. It measures the relative number of words in the hypothesis that also occur in the text, using Tree-Tagger (Schmid, 1994) for lemmatization and part-of-speech tagging and taking only nouns, non-auxiliary verbs, adjectives and adverbs into account. They also evaluate the lexical overlap individually, training a decision tree with this single feature, and compare the results with the ones obtained by the SALSA system.
Although the SALSA RTE system results are similar to those of current textual entailment systems, as the authors state in their paper it is a bit surprising that shallow word overlap performs comparably to, or even better than, more informed features obtained from a relatively deep linguistic analysis, and that the combination of both types of features does not always increase the overall accuracy. One possible explanation they offer is the limited size of the training data, which seems to be too small for the machine learner to exploit the full potential of the deep features.

Other Works on this Model

We would also like to mention some other works that pursue strategies similar to the ones commented on previously throughout this chapter.

In (Castillo & i Alemany, 2008) a Support Vector Machine classifier is used considering three simple features: edit distance, distance in WordNet and Longest Common Substring between T and H. Additionally, the authors apply a filter made up of a set of hand-crafted rules based on NEs in order to detect false entailment cases. In spite of the simplicity of the approach, they obtained reasonable results, surpassing the baseline in the fourth RTE challenge (Giampiccolo et al., 2008a).

In (Agichtein et al., 2008), shallow semantic and syntactic clues are applied in order to extract features for a machine learning algorithm. Features include: simple word overlap; another overlap considering several WordNet similarities (Leacock-Chodorow similarity (Leacock & Chodorow, 1998), Wu-Palmer similarity (Wu & Palmer, 1994), Resnik similarity (Resnik, 1995), Jiang-Conrath similarity (Jiang & Conrath, 1997), and Lin similarity (Lin, 1998b)); both overlaps considering the roles certain phrases play in each sentence (using the Stanford parser (Finkel et al., 2005)); both overlaps grouping the words by their part of speech; the cosine similarity; several substring similarities; the polarity; other features about the lengths of H and T; and translation-based similarity metrics. The translation-based similarity metrics are interesting: they consist of applying the aforementioned metrics (word overlap, T and H lengths, cosine similarity, etc.) to the text and hypothesis after translating them into Russian using Google Language Tools.^20 The intuition behind these metrics was that translating the text and hypothesis into foreign languages might simplify the complexity of some of the sentences, making the metrics perform better. In their participation in RTE-4 they achieved competitive results, above 0.58 in accuracy.

20 http://www.google.com/language_tools
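Two of the simplest features above, word overlap and bag-of-words cosine similarity, can be sketched in a few lines; this is a generic illustration, not the feature extractor of any particular system.

```python
import math
from collections import Counter

def cosine(t_tokens, h_tokens):
    """Bag-of-words cosine similarity between T and H."""
    ct, ch = Counter(t_tokens), Counter(h_tokens)
    dot = sum(ct[w] * ch[w] for w in ch)
    norm = math.sqrt(sum(c * c for c in ct.values())) \
         * math.sqrt(sum(c * c for c in ch.values()))
    return dot / norm if norm else 0.0

def shallow_features(t_tokens, h_tokens):
    """A small feature vector to feed a machine learning classifier."""
    overlap = len(set(t_tokens) & set(h_tokens))
    return {
        "word_overlap": overlap / len(set(h_tokens)),
        "cosine": cosine(t_tokens, h_tokens),
        "len_t": len(t_tokens),
        "len_h": len(h_tokens),
    }
```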
2.1.4 The Logic Model

So far, the three previous models consider entailment relations as a correspondence between linguistic expressions, ranging from simple word overlap to complex syntactic transformations and semantic inferences. Within the logic model, however, the linguistic expressions have to be transformed into a logic representation, establishing a set of axioms capable of determining the entailment. At this point, we could talk about logic entailment instead of textual entailment. A logic prover is necessary to detect when the axioms extracted from the text determine positive or negative entailment relations. Next, several works on this model are detailed.

The Nutcracker Textual Entailment System

The Nutcracker system (Bos & Markert, 2005; Bos & Markert, 2006) is a representative approach of this model, obtaining good results.^21 To sum up, Nutcracker can be broken down into the following processing stages:

Deep Semantic Analysis: carries out part-of-speech tagging, chunking, NE recognition and semantic parsing. Nutcracker uses the C&C semantic parser (Clark & Curran, 2004), which implements a grammar derived from (Hockenmaier & Steedman, 2002). Afterwards, the system creates Discourse Representation Structures (DRSs) applying Boxer (Bos, 2005). The DRSs are converted into a first-order logic representation and, consequently, the system obtains a logic representation for T and H. A logic prover (such as Vampire, Otter or Bliksem) is used to check whether T → H holds; if this implication is found, then T entails H. Finally, to support the logic inferences in the entailment decision, Nutcracker makes use of some background knowledge.

21 The Nutcracker system can be downloaded at http://svn.ask.it.usyd.edu.au/trac/candc/wiki/nutcracker.
The rules that form this knowledge were manually created by studying the RTE corpora and automatically derived from WordNet. For instance, a WordNet hyponymy relation between two synsets A and B is converted into ∀x (A(x) → B(x)).
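Such background-knowledge axioms can be generated mechanically. The sketch below renders hyponymy pairs as first-order axioms and, purely for illustration, enumerates them via NLTK's WordNet interface; NLTK is our choice here, not a component of Nutcracker.

```python
from nltk.corpus import wordnet as wn  # requires NLTK and its WordNet data

def hyponymy_axiom(a, b):
    """A WordNet hyponymy pair (A, B) as the axiom: all x (A(x) -> B(x))."""
    return f"all x ({a}(x) -> {b}(x))"

def axioms_for(word):
    """Yield one axiom per (synset, direct hypernym) pair of `word`."""
    for syn in wn.synsets(word):
        for hyper in syn.hypernyms():
            yield hyponymy_axiom(syn.lemmas()[0].name(),
                                 hyper.lemmas()[0].name())

print(hyponymy_axiom("car", "vehicle"))  # all x (car(x) -> vehicle(x))
```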
The COGEX Textual Entailment System

COGEX (Moldovan et al., 2003) is a natural language prover originating from OTTER (MacCune, 1994). The COGEX system works as follows: the prover requires a list of clauses, called the set of support, which is used to initiate the search for inferences. COGEX loads the set of support with the negated form of the hypothesis H as well as the predicates that make up the logic form of the text passage (T). The system also contains a set of axioms used to generate inferences. It starts searching for proofs, assigning the largest weight to the negated hypothesis. Any produced inference is assigned an appropriate weight depending on which axiom it was derived from, and if a refutation is found, then the proof is complete. For textual entailment they consider several types of axioms:

Extended WordNet Knowledge Base axioms: these axioms capture and store the world knowledge encoded in WordNet's glosses in a knowledge base.

NLP axioms: linguistic rewriting rules that help break down complex logic structures and express syntactic equivalence. After analyzing the logic form and the parse trees of each text fragment, the system automatically generates NLP axioms to break down complex nominals and coordinating conjunctions into their constituents, so that other axioms can be applied individually to the components.

Semantic axioms: axioms manually identified and validated against large corpora. They use relations such as Part-Whole, Isa, Location, Attribute, or Agent. For example, the axiom ISA_SR(x1, x2) & ATTRIBUTE_SR(x2, x3) → ATTRIBUTE_SR(x1, x3).

Event and Temporal axioms: the TARSQI tool (Temporal Awareness and Reasoning Systems for Question Interpretation) (Verhagen et al., 2005) is used to: (i) detect, resolve and normalize time expressions; (ii) mark events and their grammatical features; (iii) identify subordination constructions introducing modality information; (iv) add temporal relations between events and temporal expressions; and (v) compute temporal closures.

COGEX has participated in the 2nd and 3rd editions of RTE (Tatu et al., 2006b; Tatu & Moldovan, 2007) and in the first AVE edition (Tatu et al., 2006a), always obtaining the first or second position in the participants' ranking.

Other Works on this Model

Another work on the logic model is the one presented in (Roth & Sammons, 2007). The system described in this paper uses a suite of resources to modify the original entailment pair by augmenting or simplifying either or both of the text and the hypothesis. Terms relating to quantification, modality and negation are detected, removed from the graphical representation of the entailment pair, and resolved with an entailment module that handles basic logic.

The system presented in (Zanzotto et al., 2007; Zanzotto et al., 2008) is another example of tackling the textual entailment problem from a logic perspective. They use several kernel-based machine learning models trained with first-order syntactic rewrite rules extracted from development examples. They define a first-order syntactic rewrite rule feature space as a space in which each feature f_p represents a syntactic first-order rewrite rule p. Figure 2.4 shows an example of a first-order syntactic rewrite rule.

Figure 2.4: An example of a first-order syntactic rewrite rule.

And Figure 2.5 depicts a T-H pair that activates the above rule.
Figure 2.5: An example of a T-H pair that activates the rule shown in Figure 2.4.

The matching information provided by the first-order rewrite rules activated by both T and H is learned by the machine learning algorithm that takes the entailment decision.

2.1.5 Models Combination

The majority of systems utilize a combination of the lexical, syntactic, semantic and sometimes logic models to deal with entailment relations. In the previous sections we have attempted to classify these systems depending on the model that prevails in each approach. However, this section will describe those systems that make use of a large variety of resources and inferences (i.e. they are too involved to encapsulate in one of the previous models).

The LCC's GROUNDHOG Textual Entailment System

The GROUNDHOG system (Hickl & Bensley, 2007; Bensley & Hickl, 2008) uses a pipeline of lightweight, largely statistical systems for commitment extraction, lexical alignment, and entailment classification in order to estimate the likelihood that T includes sufficient linguistic content to textually entail H. Figure 2.6 depicts the architecture of GROUNDHOG.

Figure 2.6: The architecture of the GROUNDHOG system.

The preprocessing module: carries out syntactic parsing using (Collins, 1999), identifies semantic dependencies using a semantic dependency parser trained on PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004), annotates NEs, resolves instances of pronominal and nominal coreference, and normalizes temporal and spatial expressions to fully-resolved instances.
The commitment extraction module: obtains the set of discourse commitments that are derivable from the textual content of a pair of texts. To achieve this, it considers: (i) sentence segmentation; (ii) syntactic decomposition by heuristics; (iii) supplemental expressions including appositives, as-clauses, parenthetical adverbs, relative clauses, epithets, etc.; (iv) relation extraction by an in-house system detecting relations such as OWNER_OF, LOCATION_OF, EMPLOYEE_OF, PART_WHOLE, RELATED_TO, and LOCATED_NEAR; and (v) commitments derived from solving pronominal and nominal coreference.

The commitment selection module: uses a word alignment technique in order to select the set of commitments extracted from T that represents the best alignments for each of the commitments extracted from H. It reduces the number of commitments considered and avoids incorrect final inferences.

The inference classification module: once the best set of commitment alignments has been identified, a decision tree classifier is used to estimate the likelihood that a commitment from T textually entails a commitment derived from H. This classifier is trained using a set of linguistic features analogous to those described in several state-of-the-art textual entailment approaches (e.g. word overlap, string features, WordNet similarities, NE matching, etc.).

In all of its participations in the RTE challenge, this system has achieved very good results, reaching the first-ranked position in the last two RTE editions.
The UAIC Textual Entailment System

The main idea of the Al. I. Cuza University system (Iftene, 2008; Iftene & Balahur-Dobrescu, 2007) is to map every word from the hypothesis to one or more words from the text. This simple statement becomes difficult when, to achieve it, the system has to transform the hypothesis making use of extensive semantic knowledge from sources like DIRT (Lin & Pantel, 2001), WordNet (Miller et al., 1990), VerbOcean (Chklovski & Pantel, 2004), Wikipedia and acronym databases. After the mapping process, the system associates a local fitness value with every word from H, which is used to calculate a global fitness value for the current fragments of T. The global fitness value is decreased in cases where a word from H cannot be mapped to one word from T, or when there are different forms of negation for mapped verbs. In the end, the system uses a threshold predefined in the training step to decide the entailment.

As a preprocessing step, the system applies Tree-Tagger (Schmid, 1994) in order to obtain the part-of-speech tags, MINIPAR (Lin, 1998a) for dependency parsing, and LingPipe^22 and GATE (Cunningham et al., 2002) for NE recognition. Additionally, a set of patterns was built with the aim of identifying numbers, percentages, dates, etc. This information is appended to the nodes in the dependency trees.

The core of the system implements the mapping between triplets of the form (node-lemma, father-lemma, edge-label) extracted from both dependency trees. These mappings can be direct (when triplets from the hypothesis tree exist in the text tree) or indirect (when triplets from the text or hypothesis trees cannot be mapped directly and need transformations using external resources). The transformations follow these patterns (a sketch of the resulting fitness computation is given after this list):

For verbs: the system tries to replace the target verb with one expressed in the DIRT paraphrases.
For NEs: the system uses the acronym database and a manually created background knowledge database to make replacements.
For nouns and adjectives: it considers WordNet and a part of the relations from extended WordNet to look up synonyms.
For numbers: a set of rules was generated to manage intervals and expressions like more than, less than, etc.

22 http://alias-i.com/lingpipe/
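The following is a minimal sketch of that fitness computation under our own assumptions: map_word stands in for the whole direct/indirect triplet mapping and returns a local fitness in [0, 1] (0 for unmapped words), and the negation penalty is reduced to a single multiplicative factor.

```python
def global_fitness(h_words, map_word, negation_mismatch, penalty=0.5):
    """Average the local fitness of every H word; penalise pairs where
    a mapped verb is negated in only one of the two snippets."""
    local = [map_word(w) for w in h_words]   # each value in [0, 1]
    score = sum(local) / len(local)
    return score * penalty if negation_mismatch else score

def entails(h_words, map_word, negation_mismatch, threshold=0.6):
    """Threshold learned on the development data (the value here is made up)."""
    return global_fitness(h_words, map_word, negation_mismatch) > threshold
```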
Moreover, in order to detect false entailments and add a penalty to the final entailment score, the system checks whether the verbs are negated along the dependency trees. Also, to deal with contradictory cases, the opposite-of relation of VerbOcean and the WordNet antonymy relation are taken into account.

The UAIC system proves that a suitable combination of semantic resources helps to solve textual entailment phenomena. Moreover, in its two participations in the last RTE editions it reached one of the top three positions.

The Concordia University Textual Entailment System

This system relies on acquiring and aligning ontologies to recognize textual entailment (Siblini & Kosseim, 2008b). It automatically acquires an ontology representing the text fragment and another one representing the hypothesis, and then aligns the created ontologies. By learning from available textual entailment data, the system can then classify textual entailment using the information collected from its ontology alignment phase.

Ontology Acquisition: this automatic process involves three steps:

1. A syntactic analysis by MINIPAR (Lin, 1998a), resulting in a set of dependency relations of the form Grammatical_relation(Governing_content_word, Modifying_content_word). Only verbs, nouns, adjectives and adverbs are considered as content words.

2. A semantic analysis that transforms the previous structures into more meaningful semantic structures (namely, the Intervaria Semantic Structure, ISR). The difference is that the governing words are restricted to verbs (Relation(Governing_verb, Content_word)). In this step a NE recognition process is also applied, and when an entity modifying another entity appears but the relating verb between them is not expressed, they use the RoDEO system (Siblini & Kosseim, 2008a), which finds verbs that characterize the semantic relation between entities.
3. An ontological analysis transforming the Relation into a Property, the Governing verb into the property's Domain, and the Content word into the property's Range.

Ontology Alignment: this phase aligns the T and H ontologies into another ontology, namely Ontology-A. The alignment is done in two steps: (1) matching the classes of Ontology-H to the classes of Ontology-T, creating equivalent classes when necessary; and (2) matching the properties of Ontology-H to the properties of Ontology-T, creating equivalent properties when necessary. As a result, an Ontology-A is created.

Entailment Decision: the information collected from the alignment phase regarding the matching of classes and properties over the development corpora is used to train a machine learning algorithm, which decides when a textual entailment relation is expressed in a T-H pair.

The Concordia University system obtained good results, above 60%, in the last RTE challenge, successfully creating an ontology-alignment-based textual entailment system.

2.1.6 Conclusions

To sum up, the main trend followed by the research community has been to integrate deeper semantic knowledge in order to take entailment decisions. However, simple approaches based on matching bags of words have achieved very promising results. This is due to the fact that semantic knowledge is very complex and difficult to deal with, and the right way to merge the different knowledge levels is something still undiscovered. Nevertheless, a pure lexical perspective seems to be a good starting point from which to progressively add more sophisticated syntactic-semantic textual entailment inferences.

Moreover, the majority of the approaches rely on machine learning algorithms to detect entailment relations. Although they have the handicap of needing large corpora collections to train the learning models, the benefits of using them overcome these drawbacks.
2.2 Relevant Resources and Tools for this Thesis

This section presents the most relevant resources widely used by the research community to solve textual entailment. Furthermore, they have been an inspiration for, as well as being used as external resources by, the textual entailment system presented in this PhD thesis.

2.2.1 The FreeLing Toolkit

The FreeLing package (Atserias et al., 2006) is an open source suite consisting of a library that provides language analysis services such as tokenizing, sentence splitting, morphological analysis, NE detection and classification, recognition of dates/numbers/physical magnitudes/currency/ratios, part-of-speech tagging, shallow parsing, dependency parsing, and WordNet-based sense annotation. The distributed version includes morphological dictionaries for the covered languages: English, Spanish, Catalan, Galician, and Italian.

As technical features, for part-of-speech tagging FreeLing provides two algorithms: (i) a Hidden Markov Model trigram tagger; and (ii) a relaxation labelling model which enables the use of hand-written rules together with the statistical models. For the NE classification task, FreeLing uses a machine learning technique, namely the AdaBoost algorithm, which requires representing the sentence to be annotated as a feature vector, achieved via a general feature extraction module. Figure 2.7 shows a snapshot of the FreeLing on-line demo (http://garraf.epsevg.upc.es/freeling/demo.php).

Figure 2.7: The FreeLing toolkit.

2.2.2 The MINIPAR Parser

MINIPAR (Lin, 1998a) is a broad-coverage parser for the English language. MINIPAR represents the grammar as a network, where the nodes represent grammatical categories and the links represent types of syntactic (dependency) relationships. The lexicon in MINIPAR is derived from WordNet (Miller et al., 1990). With additional proper names, the lexicon contains about 130K entries (in base forms).
The lexical entry of a word lists all its possible parts of speech and its subcategorization frames (if any). Lexical ambiguities are handled by the parser instead of a tagger. Like chart parsers, MINIPAR constructs all possible parses of an input sentence; however, it outputs the single parse tree with the highest ranking. Although the grammar is manually constructed, the selection of the best parse tree is guided by statistical information.

MINIPAR was evaluated with the SUSANNE corpus (Sampson, 1995), which contains parse trees for 64 of the 500 texts in the Brown Corpus of American English. In this evaluation MINIPAR achieved about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is available for non-commercial purposes.^23

23 http://www.cs.ualberta.ca/~lindek/minipar.htm
2.2.3 The NERUA System

NERUA is an open-domain NE recognizer developed by our NLP group at the University of Alicante (Ferrández, 2006; Kozareva et al., 2007). It was developed by combining three machine learning classifiers by means of a voting strategy. The system carries out the recognition of entities in two phases: entity detection and classification of the detected entities. The three classifiers integrated in NERUA use the following algorithms: Hidden Markov Models (HMM) (Schröder, 2002), Maximum Entropy (ME) (Suárez & Palomar, 2002) and Memory-Based Learning (TiMBL) (Daelemans et al., 2003). The outputs of the classifiers are combined using a weighted voting strategy (sketched at the end of this subsection), which consists of assigning each model a different weight depending on the class it determines.

The features used by the classifiers can be divided into several groups:

Orthographic: related to the orthography of the word to be classified, for instance features about capitalization, digits, punctuation marks, hyphenated words, suffixes and prefixes, etc.

Contextual: a three-word window was used to create and analyse the context of the target words.

Morphological: represents morphological characteristics such as lemma, stem, part-of-speech tag, etc.

Handcrafted lists: features testing whether or not the word is contained in some handcrafted lists of general entities obtained from several web pages.

NERUA classifies the entities into four classes: (i) PERson, entities denoting person names; (ii) LOCation, entities regarding specific locations; (iii) ORGanization, names of organizations; and (iv) MISCellaneous, the remaining entities, which do not belong to any of the previous categories. To train the classifiers we used the corpora provided by CoNLL 2002 (Sang, 2002). Initially, NERUA was designed to recognise entities within Spanish texts; however, since this conference also supplies annotated corpora for English,^24 we were able to adjust the system to recognise English entities.

24 The English corpora belong to the CoNLL 2003 edition (Tjong Kim Sang & De Meulder, 2003).
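A minimal sketch of such a weighted vote follows; the per-model weights here are invented for illustration, and NERUA's actual weights are tuned per model and per class on training data.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Each classifier votes for its predicted class with its own weight;
    the class with the heaviest total wins."""
    scores = defaultdict(float)
    for model, label in predictions.items():
        scores[label] += weights[model]
    return max(scores, key=scores.get)

label = weighted_vote({"HMM": "PER", "ME": "PER", "TiMBL": "LOC"},
                      {"HMM": 0.30, "ME": 0.36, "TiMBL": 0.34})
print(label)  # PER
```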
2.2.4 WordNet

WordNet (Miller et al., 1990), an electronic lexical database, is considered to be the most important resource available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. English nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each representing one underlying lexicalized concept. Synsets are interlinked by means of conceptual-semantic and lexical relations, resulting in an interlinked lexico-semantic concept network. Its latest version (WordNet 3.0) contains 155,287 words organized in 117,659 synsets. Table 2.1 shows some statistics for WordNet 3.0.

PoS         Unique strings   Synsets   Total word-sense pairs
Noun        117,798          82,115    146,312
Verb        11,529           13,767    25,047
Adjective   21,479           18,156    30,002
Adverb      4,481            3,621     5,580
Totals      155,287          117,659   206,941

Table 2.1: WordNet 3.0 statistics.

Some of the most important semantic relations established in WordNet to connect synsets are the following:

Synonymy is WordNet's basic relation, because WordNet uses sets of synonyms (synsets) to represent word senses. Two words belonging to the same synset are synonyms.

Antonymy is a symmetric semantic relation between word forms, especially important in organizing the meanings of adjectives and adverbs.

Hyponymy (sub-name) and its inverse, hypernymy (super-name), are transitive relations between synsets. These semantic relations organize the meanings of words into a hierarchical structure.

Meronymy (part-name) and its inverse, holonymy (whole-name), are complex semantic relations. WordNet distinguishes between component parts, substantive parts, and member parts.

Entailment relations between verbs are also coded in WordNet.
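These relations can be explored programmatically; the snippet below uses NLTK's WordNet interface purely as an illustration (NLTK is not one of the tools discussed in this chapter, and the commented outputs are indicative).

```python
from nltk.corpus import wordnet as wn  # requires NLTK and its WordNet data

dog = wn.synsets("dog", pos=wn.NOUN)[0]
print(dog.hypernyms())        # super-names, e.g. canine.n.02
print(dog.hyponyms()[:3])     # sub-names, e.g. puppy.n.01
print(dog.member_holonyms())  # whole-names, e.g. pack.n.06

buy = wn.synsets("buy", pos=wn.VERB)[0]
print(buy.entailments())      # verb entailment: buying entails paying
```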
WordNet is freely and publicly available for download.^25 Its structure makes it a useful tool for computational linguistics and natural language processing.

2.2.5 FrameNet

The Berkeley FrameNet project (Baker et al., 1998) is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses. To achieve this, they use attested examples taken from naturalistic corpora, mainly from the British National Corpus (BNC),^26 rather than examples constructed by a linguist or lexicographer.

To understand the semantic network that FrameNet represents, it is necessary to describe its main concepts. A lexical unit (LU) is the pairing of a word with a meaning. Typically, each sense of a polysemous word belongs to a different semantic frame, which is a script-like conceptual structure that describes a particular type of situation, object, or event and the participants and props involved therein. For example, the Apply_heat frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. The roles are called frame elements (FEs) and the frame-evoking words are lexical units.

Moreover, FrameNet also provides Frame-to-Frame relations. Each relation represents an asymmetric link between two frames, where the less dependent or more abstract frame can be called the Super frame and the more dependent or less abstract one the Sub frame.

25 http://wordnet.princeton.edu/obtain
26 http://www.natcorp.ox.ac.uk/
Furthermore, each Frame-to-Frame relation also links the Frame Elements that participate in that particular relation. Table 2.2 summarizes the entire set of FrameNet relations.^27

Relation        Sub              Super
Inheritance     Child            Parent
Perspective on  Perspectivized   Neutral
Subframe        Component        Complex
Precedes        Later            Earlier
Inchoative of   Inchoative       State
Causative of    Causative        Inchoative/State
Using           Child            Parent
See also        Referring Entry  Main Entry

Table 2.2: FrameNet 1.3 Frame-to-Frame relations.

The major product of this work, the FrameNet lexical database,^28 currently contains more than 10,000 lexical units (more than 6,100 of which are fully annotated) in more than 825 semantic frames, exemplified in more than 135,000 annotated sentences. It has gone through three releases and is now in use by hundreds of researchers, teachers, and students around the world. Active research projects are now seeking to produce comparable frame-semantic lexicons for other languages (e.g. Spanish FrameNet, http://gemini.uab.es/sfn) and to devise a means of automatically labelling running text with semantic frame information.

Besides the FrameNet project, there are other projects that annotate corpora with semantic roles. Perhaps the most representative is PropBank (Palmer et al., 2005). The PropBank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank.^29

27 For more information about the Frame-to-Frame relations, the reader is kindly redirected to the FrameNet Book (http://framenet.icsi.berkeley.edu/index.php).
28 Available upon request at http://framenet.icsi.berkeley.edu/index.php.
29 http://www.cis.upenn.edu/~treebank/
The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and several other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated. PropBank differs from FrameNet, the resource to which it is most frequently compared, in two major ways. The first is that it commits to annotating all verbs in its data. The second is that all arguments to a verb must be syntactic constituents.

The FATE corpus

FATE (FrameNet-Annotated Textual Entailment) is a manually crafted, fully reliable frame-annotated textual entailment corpus (Burchardt & Pennacchiotti, 2008). FATE consists of the 800 T-H entailment pairs from the RTE-2 Challenge (Bar-Haim et al., 2006) test set, annotated with frame and semantic role labels derived from FrameNet. The main goal of FATE is to give practical help in disentangling the problem of applying FrameNet to textual entailment tasks. Indeed, FATE: (i) shows that FrameNet coverage over the RTE corpora is sufficient to allow inference at the predicate-argument level; (ii) provides a gold standard for testing the performance of existing shallow semantic parsers on realistic data; (iii) offers a basis that enables researchers to develop clearer ideas on how to effectively integrate frame knowledge in semantic inference tasks like recognising textual entailment; and (iv) supplies a noise-free frame-annotated corpus for RTE systems to experiment on. The FATE corpus is available upon request.^30

2.2.6 The Shalmaneser Tool

Shalmaneser (Erk & Pado, 2006) is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. Shalmaneser was developed for the Frame Semantics encoded in FrameNet; thus it uses FrameNet terminology such as frames and frame elements.

30 http://www.coli.uni-saarland.de/projects/salsa/fate/
The distributed version of Shalmaneser provides a simple end-user mode which applies pre-trained classifiers for English (using FrameNet annotation and the Collins parser) and German (using SALSA frame annotation and the Sleepy parser). From a technical point of view, Shalmaneser contains three modules: a preprocessor that parses plain-text input into the interchange format, a module for the sense disambiguation of predicates, and one for the assignment of semantic roles. This modularity also allows easy integration with other NLP tools; the independent modules communicate through a common XML format. Moreover, Shalmaneser output can be inspected graphically.

2.2.7 VerbNet

VerbNet^31 is the largest on-line verb lexicon currently available for English (Kipper et al., 2006). It is a hierarchical, domain-independent, broad-coverage verb lexicon organized into verb classes. VerbNet uses an extension of the Levin classes (Levin, 1993), refining them and adding subclasses to achieve syntactic and semantic coherence among the members of a class. Each verb class in VerbNet is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function. The VerbNet project also provides several mappings to other lexical resources such as WordNet, PropBank and FrameNet, which allows researchers to use these resources in a collaborative way.

2.2.8 VerbOcean

VerbOcean is a broad-coverage semantic network of verbs (Chklovski & Pantel, 2004). VerbOcean implements a semi-automatic method for extracting fine-grained semantic relations between verbs. It detects similarity, strength, antonymy, enablement, and temporal happens-before relations between pairs of strongly associated verbs using lexico-syntactic patterns over the Web. Table 2.3 illustrates some statistics about the VerbOcean relations, showing properties such as transitivity and symmetry.

31 http://verbs.colorado.edu/~mpalmer/projects/verbnet/downloads.html
Relation        Example          Transitive  Symmetric  #VerbOcean
similarity      produce::create  Yes         Yes        11,515
strength        wound::kill      Yes         No         4,220
antonymy        open::close      No          Yes        1,973
enablement      fight::win       No          No         393
happens-before  buy::own         Yes         No         4,205

Table 2.3: VerbOcean statistics.

2.2.9 Paraphrase corpora

The acquisition of a collection of paraphrases is an important task for modelling language variability, as well as a critical step for natural language interpretation. Consequently, the knowledge provided by these collections is of great assistance in solving textual entailment problems. Next, we list relevant works dealing with this issue.

The TEASE Knowledge Collection

The TEASE algorithm (Szpektor et al., 2004) implements Web-based acquisition of entailment relations, an extended model of paraphrases. Basically, given a lexical-syntactic input template (i.e. a parse sub-tree with variable slots), the algorithm automatically learns other templates that are candidates for an entailment relation with the input template. The direction of the entailment relation is not learned, so the resulting relation can be that the input entails the candidate, that the candidate entails the input, or that both entail each other (paraphrases). The structure of the candidates is also learned as part of the acquisition. The current TEASE knowledge collection covers 136 different templates that were given as input. The resource also provides a description of each input template together with descriptions of all the learned templates. The TEASE collection is available for research purposes.^32

32 http://www.cs.biu.ac.il/~szpekti/tease_collection.zip
The DIRT Paraphrase Collection

DIRT (Discovery of Inference Rules from Text) is both an algorithm and a resulting knowledge collection (Lin & Pantel, 2001). The algorithm automatically learns paraphrase expressions from text using the Distributional Hypothesis over paths in dependency trees. A path, extracted from a parse tree, is an expression that represents a binary relationship between two nouns. In short, if two paths tend to link the same sets of words, DIRT hypothesizes that the meanings of the corresponding patterns are similar.

The DIRT knowledge collection is the output of the DIRT algorithm over a 1GB set of newspaper texts (San Jose Mercury, Wall Street Journal and AP Newswire from the TREC-9 collection). It extracted 7 million paths from the parse trees (231,000 unique), from which paraphrases were generated. For example, the top-10 paraphrases of X solves Y generated by DIRT are the following: Y is solved by X, X resolves Y, X finds a solution to Y, X tries to solve Y, X deals with Y, Y is resolved by X, X addresses Y, X seeks a solution to Y, X does something about Y, X solution to Y. The DIRT knowledge collection is available from its authors for research purposes.

The Microsoft Research Paraphrase Corpus

The Microsoft Research Paraphrase Corpus (MSRPC) is distilled from a database of 13,127,938 sentence pairs, extracted from 9,516,684 sentences in 32,408 news clusters collected from the World Wide Web over a two-year period. The methods and assumptions used in building this initial data set are discussed in (Dolan et al., 2004). Unsupervised techniques were used to acquire monolingual sentence-level paraphrases, employing two strategies: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. The downloadable version contains 5,801 pairs of sentences extracted from the aforementioned sources, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.

The Mutaphrase Resource

The Mutaphrase resource (Ellsworth & Janin, 2007) generates paraphrases of semantically labeled input sentences using the semantics and syntax encoded in FrameNet. The algorithm generates a large number of paraphrases with a wide range of syntactic and semantic distances from the input. For example, given the input I like eating cheese, the system produces Eating cheese is liked by me as a paraphrase. The wide range of generated paraphrases makes the algorithm ideal for a range of statistical machine learning problems, such as machine translation, as well as for language modelling.
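Entailment systems typically consume such collections as rewrite rules. The toy sketch below applies a DIRT-style binary rule as a surface pattern, which is a deliberate simplification: real DIRT rules are paths in dependency trees, not regular expressions.

```python
import re

def apply_paraphrase(sentence, pattern, template):
    """Rewrite `sentence` with a binary paraphrase rule if it matches."""
    m = re.fullmatch(pattern, sentence)
    return template.format(X=m.group("X"), Y=m.group("Y")) if m else None

print(apply_paraphrase(
    "the committee solves the dispute",
    r"(?P<X>.+) solves (?P<Y>.+)",
    "{X} finds a solution to {Y}",
))  # the committee finds a solution to the dispute
```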
2.3 The PASCAL Recognizing Textual Entailment Challenges

The PASCAL Recognizing Textual Entailment (RTE) Challenges constitute a series of workshops aimed at tackling the recognition of textual entailment as an isolated task (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Giampiccolo et al., 2008b). The goal of these challenges has been to create a benchmark for textual entailment engines, providing several sets of pairs of snippets and evaluating the judgements of the systems in deciding whether there is an entailment relation between them or not. Moreover, the four RTE editions establish the best reference point for researchers working in this area.

Throughout their four editions, these challenges have attempted to capture major semantic inference needs across applications. The task requires recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other. More concretely, textual entailment is here defined as a directional relationship between pairs of text expressions, denoted by T (the entailing Text) and H (the entailed Hypothesis). This definition assumes common human understanding of language as well as common background knowledge.

Traditionally, the RTE organizers provide the participants with both development and test corpora, and they build these corpora with the aim of reflecting textual entailment inferences that could appear in well-known natural language processing applications, such as QA, IR, IE, and SUM. These datasets are made up of text-hypothesis pairs collected by human annotators. They consist of several subsets that correspond to typical success and failure settings in different applications, such as the ones previously mentioned, and represent different levels of entailment reasoning, such as lexical, syntactic, morphological and logical.
The datasets are totally balanced (i.e. 50% true and 50% false entailments), and each pair was judged by several annotators. Table 2.4 shows some examples of true and false entailments.

Text: Sheriff's officials said a robot could be put to use in Ventura County, where the bomb squad has responded to more than 40 calls this year.
Hypothesis: Police use robots for bomb-handling. (Task: SUM; Judgment: YES)

Text: The drugs that slow down or halt Alzheimer's disease work better the earlier you administer them.
Hypothesis: Alzheimer's disease is treated using drugs. (Task: IR; Judgment: YES)

Text: The available scientific reports do not show that any health problems are associated with the use of wireless phones.
Hypothesis: Cell phones pose health risks. (Task: IR; Judgment: NO)

Text: The flights begin at San Diego's Lindbergh Field in April, 2002 and follow the Lone Eagle's 1927 flight plan to St. Louis, New York, and Paris.
Hypothesis: Lindbergh began his flight from Paris to New York in 2002. (Task: QA; Judgment: NO)

Text: Spencer Dryden, the drummer of the legendary American rock band Jefferson Airplane, passed away on Tuesday, Jan. 11. He was 66. Dryden suffered from stomach cancer and heart disease.
Hypothesis: Spencer Dryden died at 66. (Task: IE; Judgment: YES)

Table 2.4: Examples of text-hypothesis pairs taken from the RTE corpora.

Within the RTE challenges the participating systems have been evaluated by means of two different measures: (i) the overall accuracy of the system, which is the fraction of correct answers (i.e. the number of correct answers returned by the system divided by the total number of pairs in the corpus); and (ii) the average precision, which evaluates the ability of systems to rank the pairs according to their entailment confidence, in decreasing order from the most certain entailment to the least certain.
Average precision is a common evaluation measure for system ranking. More formally, it can be written as follows:

\[
\text{Average Precision} = \frac{1}{R} \sum_{i=1}^{n} E(i)\,\frac{\#\,\text{correct up to pair } i}{i} \qquad (2.1)
\]

where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs, ordered by their ranking.
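Equation (2.1) can be computed directly; a small sketch follows (the variable names are ours).

```python
def average_precision(gold, ranking):
    """gold[p] is 1 if pair p is a positive entailment; `ranking` lists the
    pair indices from most to least confident (Equation 2.1)."""
    R = sum(gold)
    correct, total = 0, 0.0
    for i, pair in enumerate(ranking, start=1):
        if gold[pair]:          # E(i) = 1
            correct += 1        # number correct up to pair i
            total += correct / i
    return total / R

gold = [1, 0, 1, 0]
print(average_precision(gold, ranking=[0, 2, 1, 3]))  # perfect ranking -> 1.0
```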
The Cross-Language Evaluation Forum (Peters, 2008): the objective of the Cross-Language Evaluation Forum is to promote research in the field of multilingual system development. This is done through the organization of annual evaluation campaigns in which a series of tracks designed to test different aspects of mono- and cross-language information retrieval are offered. The intention is to encourage experimentation with all kinds of multilingual information access, from the development of systems for monolingual retrieval operating on many languages to the implementation of complete multilingual multimedia search services. This has been achieved by offering an increasingly complex and varied set of evaluation tasks over the years. The aim is not only to meet but also to anticipate the emerging needs of the R&D community.

The main objectives of the Answer Validation Exercise are to improve the overall performance of QA systems, to help humans in the assessment of QA systems' output and to develop better criteria for collaborative systems. Systems must emulate human assessment of QA responses and decide whether an answer is correct or not according to a given snippet. The AVE task can be reformulated as a textual entailment problem, where the hypotheses are built by turning the questions plus the answers into declarative sentences. Participant systems receive a set of triplets (Question, Answer, Supporting Text) and must return a judgement for each triplet indicating whether or not the answer is supported by the text. Each response must be tagged as:

VALIDATED: indicating that the answer is correct and supported.

SELECTED: when the answer is VALIDATED and it is the one chosen as the output of a hypothetical QA system (one of the VALIDATED answers per question should be marked as SELECTED).

REJECTED: when the answer is incorrect or there is not enough evidence of its correctness.

These triplets are extracted from the responses to the main QA track of the QA-CLEF task; the participant systems can therefore be considered an external component helping QA systems in the decision-making process. However, the real outputs of QA systems do not yield balanced corpora
(i.e. within the AVE corpora there are many false entailments and few true entailments). To manage this unbalanced nature, instead of using overall accuracy as the evaluation measure, the AVE organizers propose using precision, recall and the F-measure (their harmonic mean) over the pairs with true entailment relations (a sketch of these measures is given at the end of this section). In other words, they propose quantifying the systems' ability to detect the pairs with positive entailment, that is, to detect whether there is enough evidence to accept an answer. Moreover, apart from these traditional evaluation measures, the organizers also supply results for comparing AVE results with QA results, in order to obtain some evidence about the benefit of incorporating more sophisticated validation systems into QA architectures. For instance, the QA-accuracy measure computes the proportion of correct answers marked as SELECTED. Since answers were grouped by question and participant systems were requested to mark one or none of them as SELECTED, the resulting behaviour is comparable to that of a QA system. Therefore, the QA-accuracy is comparable to the accuracy used in the QA CLEF Main Track, and it serves to compare AVE systems to QA systems over the questions involved in the AVE collections.

An average of 10 teams participated in each of the three AVE editions, and although the AVE organizers provided development and test corpora for German, English, Spanish, French, Italian, Romanian, Bulgarian, Dutch and Portuguese, the first three were those most demanded by the participants. Appendix B (p. 191) shows the AVE official results for the three AVE editions. The results show the precision, recall and F-measure over the positive pairs achieved by the participant systems, and they are reported for the language or languages in which our system participated in each edition. The runs carried out by our system are shown in bold.
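To make this unbalanced evaluation concrete, the following is a minimal sketch in Java, not the official AVE scorer, of precision, recall and the F-measure computed over the positive pairs only; the class name and the toy gold/predicted arrays are illustrative assumptions (true stands for a correct, i.e. VALIDATED, pair; false for a REJECTED one).

    // Minimal sketch of the AVE evaluation measures over positive pairs only.
    // Hypothetical input: gold[i] is the human judgement of pair i and
    // predicted[i] the system decision (true = VALIDATED, false = REJECTED).
    public final class AveMetrics {

        public static double[] precisionRecallF1(boolean[] gold, boolean[] predicted) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < gold.length; i++) {
                if (predicted[i] && gold[i]) tp++;        // correctly VALIDATED
                else if (predicted[i] && !gold[i]) fp++;  // wrongly VALIDATED
                else if (!predicted[i] && gold[i]) fn++;  // wrongly REJECTED
            }
            double p = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
            double r = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
            double f = p + r == 0 ? 0.0 : 2 * p * r / (p + r); // harmonic mean
            return new double[] { p, r, f };
        }

        public static void main(String[] args) {
            boolean[] gold      = { true, false, false, true, false, false };
            boolean[] predicted = { true, true,  false, true, false, false };
            double[] m = precisionRecallF1(gold, predicted);
            System.out.printf("P = %.2f, R = %.2f, F = %.2f%n", m[0], m[1], m[2]);
        }
    }

Note how, on such an unbalanced toy set, the measures focus exclusively on how well the positive pairs are recovered, which is precisely the organizers' motivation for preferring them over overall accuracy.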
3 The Idea: A Perspective-based Textual Entailment System

The idea behind the system proposed in this thesis was to develop a perspective-based approach to solving entailment relations. Its principal goal is to tackle the entailment task from different angles; these perspectives enable us to properly combine the distinct knowledge supplied by each of them. This chapter illustrates the different perspectives, detailing the inferences considered in each one. In broad outline, we have established three perspectives: Lexical, Syntactic and Semantic.

3.1 The System at a Glance

Figure 3.1 depicts an overview of our textual entailment system. The system work-flow starts by taking the text-hypothesis pair as input and, following the textual entailment methodology, judges whether there is enough evidence to determine an entailment relation between them.
Figure 3.1: The system at a glance.

Each perspective is responsible for extracting a set of features that will be passed to a machine learning algorithm. This algorithm determines the final decision after having concluded a previous learning stage. At this point, different configurations of the system are possible. For instance, on the one hand, we set a configuration supported by only one of the three perspectives, basing the entailment decision on lexical, syntactic or semantic inferences respectively (the blue, green and red broken arrows in Figure 3.1). On the other hand, we create another configuration computing all features extracted by every perspective, consequently putting together the knowledge supplied by all perspectives (the ML Algorithm-1 path, black arrows). Moreover, additional configurations were also implemented, such as a voting strategy between the three perspectives, as well as the definition of several entailment constraints applied to candidate pairs prior to the computation of the perspectives.

The modular architecture of our system allows the use of any machine learning algorithm. For our purposes, and after experimentally checking several algorithms implemented in Weka (Witten & Frank, 2005), we opted to use Weka's Support Vector Machine (SVM) implementation. It has been stated in previous works (including ours) that SVM classifiers achieve very good performance in the task of recognising entailment relations (Agichtein et al., 2008; Castillo & i Alemany, 2008; Rodrigo et al., 2008b; Balahur et al.,
2008).

Regarding the entailment constraints, two different constraints were created. Although they will be detailed in sections 3.5.3 and 3.5.4, briefly, the constraints are based on (i) the importance of discovering semantic relations between the verbs appearing in H and T; and (ii) finding correspondences between the hypothesis NEs and the NEs appearing within the text. With regard to the voting strategy, a simple majority vote was implemented: given the three outputs (belonging to the lexical, syntactic and semantic perspectives), the vote chooses the decision returned by at least two perspectives. All the system's configurations will be thoroughly evaluated in chapter 4.

3.2 Previous and Shared Steps

Prior to any perspective-based analysis of the existence of an entailment relation, various procedures are performed that are shared by all perspectives. For instance, the texts are tokenised and lemmatised. A morphological analysis is carried out, as well as stemming using a Porter stemmer implementation. Once these steps are completed, several data structures containing the tokens, stems, lemmas and parts-of-speech are created. Although these structures are shared by all perspectives, each perspective also makes use of other resources that will be described in the corresponding sections.

To attain the morphological analysis, the FreeLing toolkit (Atserias et al., 2006) is used. FreeLing also allows us to create different sets of open-class words depending on their parts-of-speech. These sets permit the system to apply a specific procedure to each morphological group. The FreeLing toolkit was explained in section 2.2.

3.3 Lexical Perspective

This perspective can be considered the base of the system. It computes several measures over the two snippets (the text and the hypothesis), producing a score that reflects their degree of similarity. Such measures are
fundamentally based on word co-occurrences as well as the context in which the words appear. This perspective was presented in our paper (Ferrández et al., 2007); however, in the context of this thesis a more extensive group of lexical measures has been considered.

To compute these lexical measures, the texts have been managed as bags-of-words, removing all stop-words from them. We created different bag-of-words structures for each text, containing the tokens as they originally appear, the lemmata and the stems. Each lexical inference is calculated taking into account these structures derived from the text-hypothesis pair.

It has been proven that such a knowledge-poor technique obtains very promising results, comparable with other more sophisticated approaches (Giampiccolo et al., 2008a; Giampiccolo et al., 2007). The effectiveness of this sort of technique relies on common patterns in human communication: in many cases, different people use the same or similar expressions to convey the same idea, and these situations are captured by the lexical perspective. However, a slight modification in a sentence's wording can produce a fatal error in the entailment decision. For instance, negation terms can utterly change the meaning of a sentence while having little influence on the final lexical score. The researcher's ability to weight these terms and their influence on the entailment decision is therefore of paramount importance for correctly detecting entailment phenomena.

The next subsection lists the entire set of lexical measures considered in this research work, together with a detailed explanation of each of them.

3.3.1 Measuring Lexical Similarities

The following list shows the set of lexical measures[1] computed by the system, together with a description of each measure.

[1] For some measures we have used implementations provided by the SimMetrics library: http://www.dcs.shef.ac.uk/~sam/simmetrics.html.

Binary matching: word overlap between text and hypothesis items. The weight is initialized to zero. If an item in the hypothesis also appears in the text, an increment of one unit is added; otherwise, no increment is produced. Finally, this measure is normalized by dividing it by the length of the hypothesis, measured in items. Equation 3.1 defines the measure.
\[
sp_{match} = \frac{\sum_{i=1}^{|H|} match(i)}{|H|} \tag{3.1}
\]

where H is the set of hypothesis items and match(i) is computed as follows:

\[
match(i) = \begin{cases} 1 & \text{if } \exists j \in T : i = j, \\ 0 & \text{otherwise,} \end{cases} \tag{3.2}
\]

where T is the set of items belonging to the text.

Levenshtein distance: similar to the binary matching, but in this case we calculate the matching function as follows:

\[
match(H_i) = \begin{cases} 1 & \text{if } \exists\, T_j : Lv(H_i, T_j) = 0, \\ 0.9 & \text{if } \exists\, T_j : Lv(H_i, T_j) = 1, \\ \max_{T_j} \left( \dfrac{1}{Lv(H_i, T_j)} \right) & \text{otherwise,} \end{cases} \tag{3.3}
\]

where Lv(H_i, T_j) represents the Levenshtein distance between H_i and T_j. The cost of an insertion, deletion or substitution is equal to one, and the weight assigned to match(H_i) when Lv(H_i, T_j) = 1 was obtained empirically. Table 3.1 shows the Levenshtein distance matrix for the strings Saturday and Sunday; the bottom-right cell gives the final distance, and the cells on the minimal edit path mark the essential steps.

         S  A  T  U  R  D  A  Y
      0  1  2  3  4  5  6  7  8
   S  1  0  1  2  3  4  5  6  7
   U  2  1  1  2  2  3  4  5  6
   N  3  2  2  2  3  3  4  5  6
   D  4  3  3  3  3  4  3  4  5
   A  5  4  3  4  4  4  4  3  4
   Y  6  5  4  4  5  5  5  4  3

Table 3.1: Levenshtein distance between Saturday and Sunday.

Therefore, the matching procedure is applied to the items belonging to the hypothesis and the text, computing the Levenshtein distance between each pair; a sketch of this matching function is given below.
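As a minimal illustration, the following Java sketch implements the matching function of Equation 3.3; the thesis does not publish its implementation, so the class and method names here are assumptions.

    // Sketch of the Levenshtein-based matching function of Equation 3.3.
    public final class LevenshteinMatch {

        // Classic dynamic-programming edit distance
        // (insertion, deletion and substitution all cost 1).
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            return d[a.length()][b.length()];
        }

        // match(H_i): 1 for an exact match, the empirical 0.9 at distance one,
        // and the maximum of 1/Lv(H_i, T_j) over all text items otherwise.
        static double match(String hItem, String[] textItems) {
            double best = 0.0;
            for (String tItem : textItems) {
                int lv = levenshtein(hItem, tItem);
                double score = lv == 0 ? 1.0 : lv == 1 ? 0.9 : 1.0 / lv;
                best = Math.max(best, score);
            }
            return best;
        }

        public static void main(String[] args) {
            System.out.println(levenshtein("SUNDAY", "SATURDAY")); // 3, as in Table 3.1
            System.out.println(match("Sunday", new String[] { "Saturday", "Monday" }));
        }
    }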
Needleman-Wunsch algorithm: this algorithm, proposed by Saul Needleman and Christian Wunsch (Needleman & Wunsch, 1970) and initially used to find similarities in protein sequences, is similar to the basic edit distance (Levenshtein distance) but adds an adjustable cost for insertions and deletions. Therefore, the Levenshtein distance can simply be seen as the Needleman-Wunsch distance with a gap cost equal to 1. Some experiments were carried out in order to adjust the cost of a gap, and a penalty of 3 proved to be the best value.

Smith-Waterman algorithm: the Smith-Waterman algorithm is a well-known dynamic programming algorithm for performing local sequence alignment and determining similar regions between sequences. The algorithm was first proposed in (Smith & Waterman, 1981) and consists of two steps: (i) calculate the similarity matrix score; and (ii) according to the dynamic programming method, trace back through the similarity matrix to search for the optimal alignment. For two sequences SQ_1 and SQ_2, the optimal alignment score of the two sub-sequences SQ_1[1]...SQ_1[i] and SQ_2[1]...SQ_2[j] is given by D(i, j), defined as:

\[
D(i, j) = \max \begin{cases} 0 & \text{start over,} \\ D(i-1, j-1) - f(SQ_1[i], SQ_2[j]) & \text{substitution or copy,} \\ D(i-1, j) - GAP & \text{insertion,} \\ D(i, j-1) - GAP & \text{deletion.} \end{cases} \tag{3.4}
\]
The algorithm permits two adjustable parameters regarding substitutions and copies for an alphabet mapping (the f function), and also allows costs to be attributed to a GAP for insertions or deletions. Table 3.2 shows the Smith-Waterman distance matrix for the terms "AAAA MNOP ZZZZ" and "BBBB MNOP YYYY", with the gap cost set to 0.5 and the copy and substitution costs equal to -1 and 2 respectively.[2] Space characters are shown as "_".

         A    A    A    A    _    M    N    O    P    _    Z    Z    Z    Z
    B    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    B    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    B    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    B    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    _    0    0    0    0    1    0.5  0    0    0    1    0.5  0    0    0
    M    0    0    0    0    0.5  2    1.5  1    0.5  0.5  0    0    0    0
    N    0    0    0    0    0    1.5  3    2.5  2    1.5  1    0.5  0    0
    O    0    0    0    0    0    1    2.5  4    3.5  3    2.5  2    1.5  1
    P    0    0    0    0    0    0.5  2    3.5  5    4.5  4    3.5  3    2.5
    _    0    0    0    0    1    0.5  1.5  3    4.5  6    5.5  5    4.5  4
    Y    0    0    0    0    0.5  0    1    2.5  4    5.5  5    4.5  4    3.5
    Y    0    0    0    0    0    0    0.5  2    3.5  5    4.5  4    3.5  3
    Y    0    0    0    0    0    0    0    1.5  3    4.5  4    3.5  3    2.5
    Y    0    0    0    0    0    0    0    1    2.5  4    3.5  3    2.5  2

Table 3.2: An example of the calculation of the Smith-Waterman distance.

[2] This example has been extracted from http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#smith

In the example, the Smith-Waterman distance is given by the highest value among all cells, i.e. 6. This score indicates that the longest approximate matching string terminates in the cell with the highest value, so the sequence " MNOP " matches in both strings. In our experiments for recognising textual entailment phenomena, we empirically set the values 0.3, -1 and 2 for a gap, copy and substitution respectively.

Matching of consecutive subsequences (Consecutive Subsequence Matching, CSM): this measure assigns the highest relevance
to the appearance of consecutive subsequences. In order to perform this, we generate all possible sets of consecutive subsequences of items, from length two up to the hypothesis length in words, from both the text and the hypothesis. The sets of length two extracted from the hypothesis are compared to the sets of the same length from the text. If the same element is present in both the text and the hypothesis set, then a unit is added to the accumulated weight. This procedure is applied to all sets of different lengths extracted from the hypothesis. Finally, the weight obtained from each set of a specific length is normalized by the number of subsequences of that length, and the final accumulated weight is also normalized by the length of the hypothesis in words minus one. The following equations detail this measure:

\[
CSM = \frac{\sum_{i=2}^{|H|} f(SH_i)}{|H| - 1} \tag{3.5}
\]

where SH_i contains the hypothesis subsequences of length i, and f(SH_i) is defined as follows:

\[
f(SH_i) = \frac{\sum_{j \in SH_i} match(j)}{|H| - i + 1} \tag{3.6}
\]

where match(j) equals one if there exists an element k belonging to the set that contains the text's subsequences of length i such that k = j.

To clarify this measure, Figure 3.2 illustrates a visual example (a code sketch follows below). The red braces show the different subsequences, and the numbers associated with them are their weights (in the case where the subsequence finds its counterpart in the text). The green braces group the normalized weight for each set of subsequences depending on their length, and finally the total weight is obtained as the sum of these weights normalized by the hypothesis length minus one.

Figure 3.2: The Consecutive Subsequence Matching measure.
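Complementing the figure, the following Java sketch computes the CSM measure of Equations 3.5 and 3.6 over already tokenised, stop-word-free items; the class name and the toy sentences are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of the Consecutive Subsequence Matching measure (Equations 3.5-3.6).
    public final class Csm {

        // All consecutive subsequences of the given length, in sentence order.
        static List<String> subsequences(String[] items, int length) {
            List<String> result = new ArrayList<>();
            for (int start = 0; start + length <= items.length; start++)
                result.add(String.join(" ", Arrays.copyOfRange(items, start, start + length)));
            return result;
        }

        static double csm(String[] text, String[] hypothesis) {
            if (hypothesis.length < 2) return 0.0; // measure undefined for a single item
            double total = 0.0;
            for (int len = 2; len <= hypothesis.length; len++) {
                Set<String> textSubs = new HashSet<>(subsequences(text, len));
                int matches = 0;
                for (String sub : subsequences(hypothesis, len))
                    if (textSubs.contains(sub)) matches++;                 // match(j) = 1
                total += (double) matches / (hypothesis.length - len + 1); // f(SH_i)
            }
            return total / (hypothesis.length - 1); // normalised by |H| - 1
        }

        public static void main(String[] args) {
            String[] t = { "police", "use", "robots", "for", "bomb", "handling" };
            String[] h = { "police", "use", "robots" };
            System.out.printf("CSM = %.3f%n", csm(t, h)); // 1.0: H appears verbatim in T
        }
    }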
We would like to point out that this measure does not consider non-consecutive subsequences. In addition, it assigns the same relevance to all consecutive subsequences of the same length. Furthermore, the longer a subsequence is, the more relevant it is considered.

ROUGE measures: these measures have already been tested for the automatic evaluation of summaries and machine translation (Lin & Och, 2004; Lin, 2004). For this reason, and considering the impact of n-gram overlap metrics in textual entailment, we considered it interesting to integrate these measures into our system. We have implemented them as defined in (Lin, 2004). A brief explanation of each follows, highlighting how they were adapted for our aims.

ROUGE-N: determines an n-gram recall between a candidate hypothesis and the reference text. It is computed as follows:

\[
\text{ROUGE-}N = \frac{\sum_{gram_n \in H} Count_{match}(gram_n)}{\sum_{gram_n \in H} Count(gram_n)} \tag{3.7}
\]

where n indicates the length of the n-gram (gram_n), Count_match(gram_n) is the maximum number of n-grams that appear in both
the hypothesis and the text, and Count(gram_n) is the number of n-grams within the hypothesis. In our approach, the n-grams are created from the items extracted from the text and the hypothesis, and a set of preliminary experiments determined that the most suitable values for n are two and three.

ROUGE-L: prior to calculating this measure, we obtain the longest common subsequence (LCS) between the hypothesis and the text, denoted LCS(T, H). We then use an LCS-based F-measure to estimate the degree of similarity as follows:

\[
R_{LCS} = \frac{LCS(T, H)}{|T|} \tag{3.8}
\]

\[
P_{LCS} = \frac{LCS(T, H)}{|H|} \tag{3.9}
\]

\[
F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS}\, P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}} \tag{3.10}
\]

where β = 1, and |T| and |H| are the lengths of T (the text) and H (the hypothesis), measured in items (i.e. tokens, lemmata or stems, depending on the data structure used).

ROUGE-W: is quite similar to the ROUGE-L measure. The difference lies in an extension of the basic LCS: ROUGE-W uses a weighted LCS between the text and the hypothesis, WLCS(T, H). This modification of LCS memorizes the length of the consecutive matches encountered, considering them a better choice than longer non-consecutive matches. For instance, in the following example:

X: [ABCDEFG]
Y1: [ABCDHIK]
Y2: [AHBKCID]

Y1 is a better choice than Y2 because Y1 has consecutive matches. This measure remembers the length of the consecutive matches found so far with a regular two-dimensional dynamic programming table.
We computed an F-measure based on WLCS as follows:

\[
R_{LCS} = f^{-1}\!\left( \frac{WLCS(T, H)}{f(|T|)} \right) \tag{3.11}
\]

\[
P_{LCS} = f^{-1}\!\left( \frac{WLCS(T, H)}{f(|H|)} \right) \tag{3.12}
\]

\[
F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS}\, P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}} \tag{3.13}
\]

where f^{-1} is the inverse function of f. One property that f must have is that f(x + y) > f(x) + f(y) for any positive integers x and y.[3] In our experiments we use f(k) = k^2, f^{-1}(k) = k^{1/2} and β = 1.

[3] This property ensures that consecutive matches receive higher scores than non-consecutive matches.

ROUGE-S: this measure is based on skip-ngrams. A skip-ngram is any combination of n words in their sentence order, allowing arbitrary gaps. ROUGE-S measures the overlap of skip-ngrams between the hypothesis and the text, SKIP_n(T, H). As with the aforementioned ROUGE measures, we compute the ROUGE-S-based F-measure as follows:

\[
R_{LCS} = \frac{SKIP_n(T, H)}{C(|T|, n)} \tag{3.14}
\]

\[
P_{LCS} = \frac{SKIP_n(T, H)}{C(|H|, n)} \tag{3.15}
\]

\[
F_{LCS} = \frac{(1 + \beta^2)\, R_{LCS}\, P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}} \tag{3.16}
\]

where β = 1, C is the combination function and n is the length of the selected skip-ngram. For our experiments we used skip-bigrams and skip-trigrams (n = 2 and n = 3), since higher values of n produced meaningless skip-ngrams. A sketch of the skip-bigram case is given below.
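The following Java sketch illustrates the skip-bigram case (n = 2) of Equations 3.14-3.16 with β = 1; for simplicity it collapses duplicate skip-bigrams into a set, and the class name and example sentences are assumptions.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the skip-bigram based ROUGE-S measure (Equations 3.14-3.16, beta = 1).
    public final class RougeS {

        // All ordered word pairs of a sentence, allowing arbitrary gaps.
        static Set<String> skipBigrams(String[] items) {
            Set<String> pairs = new HashSet<>();
            for (int i = 0; i < items.length; i++)
                for (int j = i + 1; j < items.length; j++)
                    pairs.add(items[i] + " " + items[j]);
            return pairs;
        }

        // Assumes both inputs contain at least two items.
        static double rougeS(String[] text, String[] hypothesis) {
            Set<String> common = skipBigrams(text);
            common.retainAll(skipBigrams(hypothesis));                      // SKIP_2(T, H)
            double r = (double) common.size() / choose2(text.length);       // R
            double p = (double) common.size() / choose2(hypothesis.length); // P
            return r + p == 0 ? 0.0 : 2 * r * p / (r + p);                  // F, beta = 1
        }

        static long choose2(int n) { return (long) n * (n - 1) / 2; }       // C(n, 2)

        public static void main(String[] args) {
            String[] t = { "police", "killed", "the", "gunman" };
            String[] h = { "police", "kill", "the", "gunman" };
            System.out.printf("ROUGE-S = %.3f%n", rougeS(t, h)); // 0.500
        }
    }

For the toy pair above, three of the six skip-bigrams coincide, so recall, precision and the F-measure all equal 0.5.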
Jaro distance: this metric comes from the work presented in (Jaro, 1989; Jaro, 1995) and measures the similarity between two strings taking into account spelling deviations. The following equation describes the way it obtains the similarity:

\[
d_j(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) \tag{3.17}
\]

where s_1 and s_2 are the strings to be compared, |s_1| and |s_2| their respective lengths, m the number of matching characters (considering only those that are no farther apart than ⌊max(|s_1|, |s_2|)/2⌋ − 1), and t the number of transpositions, computed as the number of matching (but differently ordered) characters divided by two. For instance, the Jaro distance between MARTHA and MARHTA is computed as follows:

s_1 = MARTHA; |s_1| = 6
s_2 = MARHTA; |s_2| = 6
m = 6
Two mismatched characters, T/H and H/T, so t = 2/2 = 1

Jaro distance: d_j(s_1, s_2) = (1/3) · (6/6 + 6/6 + (6 − 1)/6) = 0.944

Jaro-Winkler distance: is a variant of the aforementioned Jaro distance metric. The Jaro-Winkler distance metric (Winkler, 1999) is designed for, and best suited to, short strings such as person names, due to the fact that it emphasizes prefix similarity. It uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a common prefix of length l. Given two strings s_1 and s_2, their Jaro-Winkler distance d_w is:

\[
d_w(s_1, s_2) = d_j(s_1, s_2) + l \cdot p \cdot (1 - d_j(s_1, s_2)) \tag{3.18}
\]

where d_j is the Jaro distance for s_1 and s_2, l is the length of the longest common prefix of the two strings, and p is a constant scaling factor for
how much the score is adjusted upwards for having common prefixes. After experiments we capped l at a maximum value of 4 and set p = 0.1, which is the standard value for this constant in Winkler's work. Returning to the previous example with the two strings MARTHA and MARHTA, whose longest common prefix is MAR, the Jaro-Winkler distance is calculated as follows:

l = 3; p = 0.1
d_w(s_1, s_2) = 0.944 + (3 · 0.1 · (1 − 0.944)) ≈ 0.961

Euclidean distance: the traditional definition measures the distance between two points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n) in Euclidean n-space as:

\[
\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{3.19}
\]

With the aim of dealing with strings, we set n as the number of distinct items (characters) that occur in either string, and p_i, q_i as the number of times each of them appears in each string respectively. For instance, let us take the two strings DIXON and DICKSON, so n = 8 and, for each character:

D: p_1 = 1, q_1 = 1;  I: p_2 = 1, q_2 = 1;  X: p_3 = 1, q_3 = 0
O: p_4 = 1, q_4 = 1;  N: p_5 = 1, q_5 = 1;  C: p_6 = 0, q_6 = 1
K: p_7 = 0, q_7 = 1;  S: p_8 = 0, q_8 = 1

giving the Euclidean distance the value √((1−1)² + (1−1)² + ⋯ + (0−1)² + (0−1)²) = √4 = 2.

Cosine similarity: is a common vector-based similarity. The input strings are transformed into vector space as in the previous measure, and the similarity is computed as follows:

\[
\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\, \|\vec{y}\|} \tag{3.20}
\]
The dot product of two vectors x = [x_1, x_2, ..., x_n] and y = [y_1, y_2, ..., y_n] is defined as:

\[
\vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \tag{3.21}
\]

where n is the dimension of the vectors, and the magnitude of a vector is expressed as the square root of the dot product of the vector with itself:

\[
\|\vec{x}\| := \sqrt{x_1^2 + \cdots + x_n^2} \tag{3.22}
\]

Commonly, the cosine similarity as well as Dice's coefficient (which will be explained next) have been used in Information Retrieval tasks (Frakes & Baeza-Yates, 1992).

Jaccard similarity coefficient: is a statistical coefficient for comparing the similarity and diversity of sample sets. The Jaccard metric was first introduced and detailed in (Jaccard, 1912). It is defined as the size of the intersection divided by the size of the union of the sample sets:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{3.23}
\]

In our case, we compute this coefficient representing the strings as Jaccard vectors containing the sets of unique items (characters) of each string.

Dice's coefficient: is a term-based similarity measure related to the Jaccard metric. For the sets X and Y extracted from the two strings to be processed, the coefficient is defined as:

\[
D = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{3.24}
\]

The X and Y sets representing the strings are built as in the Jaccard measure; a sketch of both coefficients is given below.
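The following Java sketch computes both coefficients over the sets of unique characters of two strings, reusing the DIXON/DICKSON example given above; the class name is an assumption.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the Jaccard (Equation 3.23) and Dice (Equation 3.24) coefficients
    // over the sets of unique characters of two strings.
    public final class SetOverlap {

        static Set<Character> chars(String s) {
            Set<Character> set = new HashSet<>();
            for (char c : s.toCharArray()) set.add(c);
            return set;
        }

        static double jaccard(String s1, String s2) {
            Set<Character> intersection = chars(s1);
            intersection.retainAll(chars(s2));   // A ∩ B
            Set<Character> union = chars(s1);
            union.addAll(chars(s2));             // A ∪ B
            return (double) intersection.size() / union.size();
        }

        static double dice(String s1, String s2) {
            Set<Character> intersection = chars(s1);
            intersection.retainAll(chars(s2));   // X ∩ Y
            return 2.0 * intersection.size() / (chars(s1).size() + chars(s2).size());
        }

        public static void main(String[] args) {
            // Intersection {D, I, O, N} of size 4; union of size 8.
            System.out.printf("Jaccard = %.3f%n", jaccard("DIXON", "DICKSON")); // 0.500
            System.out.printf("Dice    = %.3f%n", dice("DIXON", "DICKSON"));    // 0.667
        }
    }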
Soundex distance: Soundex is a coarse phonetic indexing scheme, widely used in genealogy. Soundex allows phonetic misspellings to be evaluated easily; for instance, the names John, Johne and Jon often refer, genealogically, to the same person. This is a term-based evaluation where each term is given a Soundex code; each Soundex code consists of a letter and three numbers between 0 and 6 (e.g. Chapman is C155), where the letter is always the first letter of the name and the numbers encode the rest of it. This approach is very promising for the disambiguation of transliterated or misspelt names. To apply this metric, we compute the Soundex code of each item (word, lemma and stem) and we try to match each hypothesis Soundex code with the text's ones. The final matching coefficient is returned.

Q-gram matching: Q-grams are typically used in approximate string matching by sliding a window of length q over the characters of a string to create a number of q-length grams. Q-gram matching is then rated as the number of q-gram matches within the second string over the number of possible q-grams. The intuition behind the use of q-grams as a foundation for approximate string processing is that when two strings s_1 and s_2 are within a small edit distance of each other, they share a large number of q-grams in common. This method is frequently employed as a fuzzy search in databases to allow non-exact matching of queries. For our purposes, we obtain the sum of the q-gram ratios of each item of the hypothesis with regard to the text's items. After several experiments we set q to 3, since that value obtained the best performance in the development stages.

IDF specificity: another way to weight the words is by means of their Inverse Document Frequency (IDF) specificity. We determine the specificity of a word using the IDF concept introduced in (Sparck-Jones, 1972), which is defined as the total number of documents in a collection divided by the total number of documents that include that word. In our experiments, we derive the document frequencies from the document collections used for the tracks reported within the
Cross-Language Evaluation Forum (CLEF) (Peters, 2007), specifically the LA Times 94 and Glasgow Herald 95 collections, which contain a total of 169,477 documents. The IDF measure helps the system to evaluate each word according to its specificity, whereby words with higher IDF values are more relevant to the entailment decision. Processing this metric as in Equation 3.25, we can consider this factor as a feature to determine entailment relations:

\[
ENT_{idf}(T, H) = \frac{\sum_{w \in H \cap T} idf(w)}{\sum_{w \in H} idf(w)} \tag{3.25}
\]

In a nutshell, the application of the lexical perspective is broken down into four steps:

(i) The tokenization, stemming and lemmatization of the texts. These analyses are shared by all perspectives (see section 3.2).

(ii) The construction of several data structures, each containing the set of tokens, stems or lemmata of each text.

(iii) The computation of the lexical measures over the data structures corresponding to the target text-hypothesis pair, storing the maximum value achieved among all data structures. If the lexical measure is calculated over the characters of the token, stem or lemma (as most of them are), the measure is computed for each hypothesis item against all text items, returning the maximum value achieved. Afterwards, to obtain a normalized lexical similarity score, these values are accumulated and divided by the number of hypothesis items.

(iv) Finally, the values achieved by the lexical measures are used as features for a machine learning algorithm, which decides whether the entailment relation exists or not.

This set of lexical measures was selected because many investigations in textual entailment are partly supported
by some of these measures (see the RTE challenges (Bar-Haim et al., 2006; Giampiccolo et al., 2007; Giampiccolo et al., 2008a)). Besides, such well-known measures are fairly easy to implement, and open source implementations exist for most of them.[4]

[4] For some measures we have used implementations provided by the SimMetrics library: http://www.dcs.shef.ac.uk/~sam/simmetrics.html.

Obviously, many of our lexical measures overlap with each other. Although this could be considered a problem, the idiosyncrasies of each measure result in slight differences in the lexical similarity measurement process, which can help entailment recognition. Naturally, this is achieved by selecting the most meaningful measures for recognising textual entailment relations; such a selection process is presented in section 4.2. However, as mentioned previously, the robustness of the lexical perspective can be compromised by slight modifications within the wording (e.g. negation terms). This leads us to take into consideration other perspectives that help the system in the entailment inference procedure.

3.4 Syntactic Perspective

The syntactic perspective was presented in our paper (Micol et al., 2007). It attempts to deduce whether there is an entailment relation based on the information provided by the syntactic dependency trees of the phrases. The syntactic perspective is composed of four modules that behave collaboratively: tree construction, filtering, embedded subtree detection and graph node matching. A schematic representation of the architecture of this perspective is shown in Figure 3.3.

As mentioned at the beginning of this chapter, each perspective can support the entailment decision in an isolated manner as well as together with the other perspectives. In the case of the syntactic perspective, a syntactic similarity score is returned, and it can be used to determine the entailment by itself or as another system feature together with the features derived from the lexical and semantic perspectives.

In the following sections the modules that make up the syntactic perspective are described; they are numbered sequentially according to the execution
order of the modules.

Figure 3.3: Syntactic perspective architecture.

3.4.1 Tree Generation

The first module constructs the corresponding syntactic dependency trees of the input text and hypothesis. For this purpose we use MINIPAR (Lin, 1998a), a parser that provides the dependency information of the words within a phrase, as well as their grammatical categories and relationships. Later on, we use all this information to determine whether there is an entailment relation between text and hypothesis. Once the output of MINIPAR is generated, we construct in-memory trees according to the dependency relationships that this parser provides.

3.4.2 Tree Filtering

Some of the words of the input phrases may not be relevant for our system, such as stop-words. We cannot remove them before constructing the dependency trees, otherwise the parser would fail to extract the dependency relations. Therefore, we must remove them once the corresponding trees have been constructed. For this purpose we have generated a list of relevant grammatical categories,[5] so that we keep all those words whose categories belong to said list and discard the rest, which belong to the ignored

[5] Verbs, verbs with one argument, verbs with two arguments, verbs taking a clause as complement, the verb Have, the verb Be, nouns, numbers, adjectives, adverbs, noun-noun modifiers.
grammatical categories.[6] We have performed tests both taking into account and discarding each grammatical category, which has allowed us to generate the two lists of relevant and ignored grammatical categories.

[6] Determiners, pre-determiners, post-determiners, clauses, inflectional phrases, prepositions and preposition phrases, specifiers of preposition phrases, auxiliary verbs, complementizers.

By filtering the syntactic dependency trees we reduce our system's noise, since all words that do not provide useful information are discarded. In addition, the resulting trees are smaller than the original ones; thus, our system's execution time is also reduced.

3.4.3 Graph Embedding Detection

The first entailment detection stage is the graph embedding detection, which consists of determining whether the hypothesis tree is embedded into the text's. Let us first define the concept of an embedded tree as in (Katrenko & Adriaans, 2006).

Definition 1: Embedded tree. A tree T_1 = (V_1, E_1) is embedded into another T_2 = (V_2, E_2) iff:

1. V_1 ⊆ V_2, and
2. E_1 ⊆ E_2,

where V_1 and V_2 represent the vertices, and E_1 and E_2 the edges. In other words, a tree T_1 is embedded into another T_2 if all nodes and branches of T_1 are present in T_2.

We believe that it makes sense to reduce the strictness of such a definition to allow the appearance of intermediate nodes in the text's branches that are not present in the corresponding hypothesis branch, which means that we allow partial matching. Therefore, a match between two branches is produced if all nodes of the first one are present in the second and their respective order is preserved, allowing the possibility of intermediate nodes that are not present in both branches. This is also described in (Katrenko & Adriaans, 2006).

To determine whether the hypothesis tree is embedded into the text's, we perform a top-down matching process. For this purpose we first compare
the roots of both trees. If they coincide, we then proceed to compare their respective child nodes, which are the tokens that have some sort of dependency on their respective root token. If they do not, we keep the hypothesis node fixed and loop through the text's nodes to try to find a match. When we locate it, we move to the next node of the hypothesis tree and attempt to find the corresponding one in the subtree of the text whose root is the one that matched the root of the hypothesis. We perform this step repeatedly until we have found a match for all nodes in the hypothesis. If we are able to do so, the hypothesis syntactic dependency tree is embedded into the text's, we consider there to be a high probability of an entailment relation, and as a result the syntactic perspective returns the highest score (i.e. 1). If not, we cannot assure that such an implication is produced, and we proceed to execute the next module.

In order to add more flexibility to this procedure, and due to the fact that it is very difficult to find strict embeddings between syntactic trees, we do not require the pair of words being compared to be exactly the same, but rather set a threshold that represents the minimum similarity value between them. This is the difference between our approach and the one described in (Katrenko & Adriaans, 2006). Such a similarity is calculated by using the WordNet::Similarity tool (Pedersen et al., 2004), which is based on WordNet (Miller et al., 1990), and, concretely, the Wu&Palmer measure (Wu & Palmer, 1994), as defined in Equation 3.26:

\[
Sim(C_1, C_2) = \frac{2 N_3}{N_1 + N_2 + 2 N_3} \tag{3.26}
\]

where C_1 and C_2 are the synsets whose similarity we want to calculate, C_3 is their least common superconcept, N_1 is the number of nodes on the path from C_1 to C_3, N_2 is the number of nodes on the path from C_2 to C_3, and N_3 is the number of nodes on the path from C_3 to the root. All these synsets and distances can be observed in Figure 3.4.

We chose the Wu&Palmer measure because it explores the tree that connects two concepts within a taxonomy (in our specific case, WordNet), which, from the point of view of the syntactic perspective, is more coherent than using other similarity measures, such as those supported by corpus evidence.

Therefore, if the similarity rate is greater than or equal to the established threshold, which we have set empirically to 80%, we consider the
corresponding hypothesis word as suitable to have the same meaning as the text's one, and proceed to compare its child nodes in the hypothesis tree. On the other hand, if the similarity value is less than the threshold, we proceed to compare the children of that text tree node with the hypothesis node currently being analyzed. In the event that this module is able to assert that the hypothesis tree is embedded into the text's tree, the maximum similarity value is returned (i.e. 1).

Figure 3.4: Distance between two synsets.

3.4.4 Graph Node Matching

If the previous graph embedding detection module within the syntactic perspective has not been able to find a match between text and hypothesis, the system runs the graph node matching module. It consists of finding pairs of tokens in both trees whose lemmas are identical, irrespective of whether they are in the same position within the tree. This process is also known as alignment, and we would like to point out that in this step we do not use the WordNet::Similarity tool. Some authors have already designed similar matching techniques, such as the ones described in (MacCartney et al., 2006) and (Snow et al., 2006); however, these include semantic constraints that we have decided not to consider. The reason for not including the WordNet::Similarity tool in the Graph Node Matching module is that we wanted the syntactic perspective to tackle textual entailment recognition from an
exclusively syntactic point of view. Therefore, we did not want this module to include any kind of semantic knowledge.

The similarity rate between text and hypothesis is calculated based on the relevance of the words that appear in both trees. This is represented in Equation 3.27:

\[
\psi(\tau, \lambda) = \sum_{\nu \in \xi} \varphi(\nu) \tag{3.27}
\]

where τ and λ represent the text's and hypothesis' syntactic dependency trees, respectively. The set ξ is defined as the one that contains all words present in both trees, i.e. ξ = τ ∩ λ. Finally, the function φ(ν) provides the relevance of the word represented as ν.

The main goal is to define how to calculate the aforementioned function so that it assigns each word a relevance value corresponding to its contribution to the entailment relation. The relevance of a word depends on its depth in the hypothesis tree and on its grammatical information. The first of these factors is based on an empirically-calculated weight that assigns less importance to a node the deeper it is located in the tree. The reason for this is that the most relevant words within a tree generally occupy the highest positions in the syntactic dependency tree, while the less relevant ones are in the deeper positions. As an example, the main verb of a phrase will be the root of its syntactic dependency tree, while, for the rest of the verbs, the less relevant they are, the deeper they will be located in the tree. The second factor gives different relevance depending on the grammatical category and relationship. For instance, a verb will have the highest weight, while an adverb or an adjective will have less relevance. The values assigned to each grammatical category and relationship are also empirically calculated and are shown in Tables 3.3 and 3.4, respectively.

To mathematically define the relevance function, we assume we have found a word, namely β, present in both τ and λ. Now let γ be the weight assigned to β's grammatical category (defined in Table 3.3), σ the weight of β's grammatical relationship (defined in Table 3.4), µ an empirically-calculated value that represents the weight difference between tree levels, and δ_β the depth of the node that contains the word β in λ. We define the function φ(β) as represented in Equation 3.28.
Grammatical category | Weight
Verbs, verbs with one argument, verbs with two arguments, verbs taking a clause as complement | 1.0
Nouns, numbers | 0.75
Be used as a linking verb | 0.7
Adjectives, adverbs, noun-noun modifiers | 0.5
Verbs Have and Be | 0.3

Table 3.3: Weights assigned to the grammatical categories.

Grammatical relationship | Weight
Subject of verbs, surface subject, object of verbs, second object of ditransitive verbs | 1.0
The rest | 0.5

Table 3.4: Weights assigned to the grammatical relationships.

\[
\varphi(\beta) = \frac{\gamma \cdot \sigma}{\mu^{\delta_\beta}} \tag{3.28}
\]

The value obtained by calculating the expression in Equation 3.28 represents the relevance of a word in our system. The experiments performed reveal that the optimal value for µ is 1.1. One should note that a requirement of our system's similarity measure is to be independent of the hypothesis length. Therefore, we must define the normalized similarity rate, as shown in Equation 3.29:

\[
\overline{\psi}(\tau, \lambda) = \frac{\psi(\tau, \lambda)}{\sum_{\beta \in \lambda} \varphi(\beta)} = \frac{\sum_{\nu \in \xi} \varphi(\nu)}{\sum_{\beta \in \lambda} \varphi(\beta)} \tag{3.29}
\]

Once the similarity value ψ(τ, λ) has been calculated, it provides
a similarity factor corresponding to a specific text-hypothesis pair. This factor, or the one returned by the Graph Embedding Detection module, is integrated as a feature into the machine learning classifier responsible for taking the entailment decision, as described in section 3.1.

3.5 Semantic Perspective

Although the lexical and syntactic perspectives can achieve promising results detecting entailment relations, in many cases such a semantic problem cannot be solved without using semantic knowledge. It is widely known that semantic inferences are the most complex ones to implement as well as to integrate into a textual entailment system. In our understanding, one of the main reasons for this is the limited coverage of the semantic resources and, consequently, their limited effectiveness. Lexical resources (e.g. lemmatizers, stemmers) and parsers achieve high performance in the tasks that they carry out because the boundaries of these tasks are to a certain extent delimited; modelling the knowledge provided by semantics, however, is a very tedious task even to obtain minimum coverage. Therefore, it is sometimes more appropriate to develop shallow semantic inferences than sophisticated (and probably less efficient) semantic analyses.

For this reason, we have carried out several studies on how to integrate semantic knowledge into the system; in this fashion, the system can recognise false and true entailment pairs that could not previously be detected. The next sections detail the different semantic inferences integrated into our system; as with the other perspectives, they are used as features for a machine learning algorithm.

3.5.1 Measuring Semantic Similarity

This analysis aims to automatically derive a similarity score indicating the degree of similarity between two snippets or fragments of text, and focuses on semantic relations between words encoded in WordNet. Relations such as synonymy, hypernymy and holonymy can be a great help in recognising hitherto unknown entailment relations.

Similar approaches have already been developed, such as the one presented in (Corley & Mihalcea, 2005). However, in our case, apart from determining
the similarity using WordNet-based semantic metrics, we also consider the words that are not found in WordNet for the final similarity score.

We exploit the WordNet relations of synonymy, hypernymy, hyponymy, antonymy, meronymy, holonymy and so on, in order to find semantic paths that connect two concepts within the WordNet taxonomy (we use the lemmata of the text and hypothesis for these inferences). Therefore, rather than considering them as lexical transformations, we use WordNet to find the existing semantic similarity and relatedness between concepts.

Indeed, there are several implementations of similarity and relatedness measures between words based on WordNet, for instance the WordNet::Similarity tool (Pedersen et al., 2004), written in Perl. In our experiments, we have used the Java WordNet Similarity Library (JWSL[7] (Pirrò & Seco, 2008)). As JWSL is based on a Lucene index that encompasses information about the whole WordNet structure, the computation of similarity between words can be sped up. Besides, it implements some of the most common semantic similarity measures, such as the following:

[7] http://grid.deis.unical.it/similarity/

Resnik (Resnik, 1995): is a measure of semantic similarity in an IS-A taxonomy based on the notion of information content. Information Content (IC) can be considered a measure that quantifies the amount of information a concept expresses. The IC values are obtained by statistically analyzing corpora, associating probabilities to each concept in the taxonomy based on word occurrences in a given corpus. These probabilities are cumulative as we go up the taxonomy from specific concepts to more abstract ones; consequently, every word occurrence in the corpus is also counted as an occurrence of each taxonomic class containing it. The IC value is obtained as the negative log likelihood:

\[
IC(c) = -\log p(c) \tag{3.30}
\]

where c is a concept in a taxonomy (e.g. WordNet) and p(c) is the probability of encountering c in a given corpus. It should be noted that this method ensures that p is monotonic as one moves up the
taxonomy: if c_1 IS-A c_2, then p(c_1) ≤ p(c_2). By default, the IC of concepts is derived from the sense-tagged corpus SemCor.[8]

[8] The SemCor corpus is a subset of the English Brown corpus containing almost 700,000 running words. In SemCor all the words are tagged by part-of-speech, and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet 1.6 (http://multisemcor.itc.it/semcor.php).

According to Resnik, once the IC values are defined, the similarity between two concepts depends on the amount of information they have in common. This shared information is indicated by the information content of the concepts that subsume them in the taxonomy. Formally:

\[
sim_{res}(c_1, c_2) = \max_{c \in S(c_1, c_2)} IC(c) \tag{3.31}
\]

where S(c_1, c_2) is the set of concepts that subsume c_1 and c_2. The Resnik metric suffers from the problem that, when computing the similarity between identical concepts, the output yields the maximum IC value of their common subsumer and not the value corresponding to maximum similarity.

Lin (Lin, 1998b): with his work, Lin presented a similarity measure derived from a set of assumptions that capture three intuitions about similarity:

The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.

The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.

The maximum similarity between A and B is reached when A and B are identical.

Lin also employs the previous IC formula to measure the IC of concepts. Therefore, as Lin stated, using his semantic similarity metric in a taxonomy such as WordNet, the similarity between c_1 and c_2 is defined as the ratio between the amount of information needed to state the
commonality of c_1 and c_2 and the information needed to fully describe what c_1 and c_2 are. The following equation describes the measure:

\[
sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(c_0)}{IC(c_1) + IC(c_2)} \tag{3.32}
\]

where c_0 is the most specific concept within a taxonomy that subsumes both c_1 and c_2.

Jiang & Conrath (Jiang & Conrath, 1997): also build on the information-theoretic definition, suggesting a measure of semantic distance derived from the edge-based notion by adding the IC as a decision factor. The edge-based approach is a more natural and direct way of evaluating semantic similarity in a taxonomy: it estimates the distance (e.g. edge length) between the nodes that correspond to the concepts/classes being compared. Given the multidimensional concept space, the conceptual distance can be conveniently measured by the geometric distance between the nodes representing the concepts. Obviously, the shorter the path from one node to the other, the more similar they are. The next equation defines the Jiang & Conrath semantic distance:

\[
Dist_{Jiang}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(LSuper(c_1, c_2)) \tag{3.33}
\]

where LSuper(c_1, c_2) denotes the lowest super-ordinate of c_1 and c_2 within a taxonomy. Note that both Lin's and Jiang's formulations correct the problem existing in Resnik's similarity metric, yielding sim_Lin(c_1, c_1) = 1 and Dist_Jiang(c_1, c_1) = 0.

Pirrò & Seco (Nuno, 2005; Pirrò & Seco, 2008): this measure is conceptually similar to the previous ones, but it is founded on the features-based theory of similarity posed by Tversky (1977). Tversky proposed an abstract model of similarity that considers the features that are common to two concepts as well as the differentiating features peculiar
to each. Assuming Ψ(c) is the function that obtains the set of features relevant to c:

\[
sim_{Tversky}(c_1, c_2) = \alpha F(\Psi(c_1) \cap \Psi(c_2)) - \beta F(\Psi(c_1) \setminus \Psi(c_2)) - \gamma F(\Psi(c_2) \setminus \Psi(c_1)) \tag{3.34}
\]

where F is a function that reflects the salience of a set of features, and α, β and γ are parameters that provide different weights focusing on the different components. Although the above definition is not based on information theory, the authors establish a parallel that leads them towards a new similarity based on IC. The common subsumer of two concepts reflects the information these concepts share, which is exactly the intersection of features, and F can be considered its quantification in information-theoretic form. Therefore:

\[
IC(\mathit{cs}(c_1, c_2)) \approx F(\Psi(c_1) \cap \Psi(c_2)) \tag{3.35}
\]

where cs(c_1, c_2) denotes the common subsumer of c_1 and c_2. Please note that, at this point, Resnik's metric can be formulated as Tversky's similarity with β and γ equal to 0. Finally, the authors proposed an information-theoretic counterpart of Tversky's similarity:

\[
\begin{aligned}
sim_{Tversky}(c_1, c_2) &= IC(cs(c_1, c_2)) - \big(IC(c_1) - IC(cs(c_1, c_2))\big) - \big(IC(c_2) - IC(cs(c_1, c_2))\big) \\
&= 3 \cdot IC(cs(c_1, c_2)) - IC(c_1) - IC(c_2)
\end{aligned} \tag{3.36}
\]
Moreover, in order not to fall into the same problem as the Resnik metric, the authors assign the value 1 if the two concepts are the same. They formalize their metric as follows:

\[
sim_{P\&S}(c_1, c_2) = \begin{cases} sim_{Tversky}(c_1, c_2) & \text{if } c_1 \neq c_2, \\ 1 & \text{if } c_1 = c_2. \end{cases} \tag{3.37}
\]

In the context of our experiments, the JWSL toolkit implements an intrinsic IC. The conventional calculation of the IC consists of combining the knowledge provided by a hierarchical structure (such as WordNet in our specific case) with statistics derived from a large corpus. The authors of the JWSL toolkit propose computing the IC values using WordNet itself as the statistical resource, with no need for external ones. Their intrinsic measure of IC relies on the assumption that the taxonomic structure of WordNet is organized in a meaningful and structured way, where concepts with many hyponyms convey less information than concepts that are leaves; the more hyponyms a concept has, the less information it expresses. Hence, they define the IC for a concept c as:

\[
IC(c) = 1 - \frac{\log(hypo(c) + 1)}{\log(max_{wn})} \tag{3.38}
\]

where hypo returns the number of hyponyms of a given concept c,[9] and max_wn is a constant that indicates the total number of concepts in the WordNet taxonomy. In (Nuno, 2005; Pirrò & Seco, 2008) the authors successfully compared and evaluated the intrinsic IC against the traditional IC, reporting the benefits of this new method for calculating the IC values on a taxonomy.

Therefore, to capture the semantic similarity behind these WordNet-based measures, we have developed a procedure that automatically derives a score measuring the similarity degree between two texts. This procedure is computed as shown in Program 1.[10]

[9] Note that concepts that represent leaves in the taxonomy will have an IC of one, since they do not have hyponyms; this value of 1 states that a concept is maximally expressed and cannot be further differentiated.

[10] We consider all senses of each word, obtaining the best similarity factor over all of them. No Word Sense Disambiguation algorithm was used for this issue.
Program 1 Measuring semantic similarity based on WordNet measures.

    TotalWeight = 0
    for i = 0 ... size(H)-1 do
        maxSemanticWeight = 0
        for j = 0 ... size(T)-1 do
            if SemanticSimilarity(H(i), T(j)) > maxSemanticWeight then
                maxSemanticWeight = SemanticSimilarity(H(i), T(j))
            endif
        endfor
        TotalWeight += maxSemanticWeight
    endfor
    return TotalWeight / size(H)

The function SemanticSimilarity(H(i), T(j)) returns the maximum similarity obtained by the four measures cited above. If a hypothesis word is not found in WordNet, we apply the Smith-Waterman algorithm between it and the words belonging to the text, adding the highest value found to the accumulated weight. This permits us to take into account entities that, while not appearing in WordNet, are very relevant for detecting entailment relations (e.g. in pair id = 1 of the RTE-3 development corpus, the entity Rosneft, an oil company, appears; it is not included in WordNet, but its consideration is crucial for the entailment decision).

Furthermore, apart from computing our WordNet-based semantic similarity using all lemmata belonging to the hypothesis, the preprocessing step described previously (see section 3.2) has already associated each item with its part-of-speech, so we are also able to establish WordNet-based semantic relations between nouns, verbs, adjectives and adverbs as follows:

The sets of words that comprise the text and the hypothesis are split into different groups, each one representing a grammatical category.

The previous procedure is applied to these new sets, obtaining distinct similarity factors for each grammatical group. Specifically, three sets were created containing: (1) the nouns; (2) the verbs; and (3) the adjectives and adverbs.

The decision to consider different similarities for these three sets was
made so that the most relevant grammatical categories, such as nouns and verbs, would not be influenced by other categories.

Unfortunately, although the authors of the JWSL toolkit plan to enlarge the coverage of this resource, it currently only works with nouns. So, to deal with the remaining grammatical categories, we used the WordNet::Similarity tool (Pedersen et al., 2004). For the sake of coherence, the same similarity metrics were computed, but on this occasion using the WordNet::Similarity tool.[11]

[11] Note that the WordNet::Similarity tool implements neither the Pirrò & Seco measure nor the intrinsic IC calculation; it uses the traditional IC definition instead.

Furthermore, another way to give more relevance to the semantic connections found in WordNet is to weight the similarity score according to the Inverse Document Frequency (as described in section 3.3). The procedure is then the same, except for the maxSemanticWeight value:

maxSemanticWeight = SemanticSimilarity(H(i), T(j)) × idf(H(i))

The inverse document frequencies for all lemmata in the hypothesis are computed as described in section 3.3. Once all WordNet-based semantic similarities are computed, they are passed as features to a machine learning algorithm, as we did for the lexical and syntactic inferences.

3.5.2 The Negation Feature

Regarding negation, there are several inferences that we have implemented in order to support the final entailment decision. For instance:

The antonymy relations of WordNet and VerbOcean. We created a feature that indicates whether the text contains verbs having an antonymy relation with any hypothesis verb. This feature requires that the hypothesis verb is not negated (i.e. it is not associated with any negative term).[12]

[12] For antonymy relations we took the Most Frequent Sense (MFS) (i.e. the first one); no Word Sense Disambiguation (WSD) was used. MFS obtains very good disambiguation results, outperforming many current WSD algorithms.
We elaborated a list of negative terms (such as not, never, nothing, none, etc.) extracted from several web pages as well as from the training corpora, and we use it to extract the polarity of each text. This polarity can be deduced in several ways:

General polarity, considering all the negative terms within the sentence(s) of each text (text & hypothesis). A general polarity equal to one means that the hypothesis and the text contain the same number of negative terms; otherwise, the general polarity value is zero.

Main verbs polarity, considering only the negative terms that affect the main verbs of the sentence(s), detected by MINIPAR (Lin, 1998a), both for the text and for the hypothesis.

Conditional polarity on hypothesis verbs: if a hypothesis verb is negated, we look for the same verb or a synonym, also negated, within the text; if it is not found, we look for an antonym.

These polarity inferences will serve as features for deducing entailments. In addition, we also created a list of modality markers (Sauri & Pustejovsky, 2007) that express a particular modal degree (e.g. must denotes certainty, likely denotes probability and might denotes possibility), and a feature was added in order to represent the modal degree of each text.
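As an illustration, a minimal Java sketch of the general polarity feature follows; the list of negative terms shown is a small assumed sample of the full hand-crafted list.

import java.util.List;
import java.util.Set;

// A minimal sketch of the general-polarity feature over tokenized texts.
public class GeneralPolarity {

    static final Set<String> NEGATIVE_TERMS =
            Set.of("not", "never", "nothing", "none", "no", "n't");

    static long countNegatives(List<String> tokens) {
        return tokens.stream().filter(NEGATIVE_TERMS::contains).count();
    }

    // 1 when text and hypothesis contain the same number of negative terms, else 0.
    static int generalPolarity(List<String> text, List<String> hypothesis) {
        return countNegatives(text) == countNegatives(hypothesis) ? 1 : 0;
    }

    public static void main(String[] args) {
        List<String> t = List.of("the", "train", "was", "not", "delayed");
        List<String> h = List.of("the", "train", "was", "delayed");
        System.out.println(generalPolarity(t, h));  // 0: polarities differ
    }
}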
3.5.3 The Importance of Being a Verb

It is well known that semantic knowledge is of paramount importance in most inference-based NLP tasks. Moreover, the construction of semantic resources (such as FrameNet, VerbNet and WordNet) capable of extracting the latent semantics of texts has always attracted strong interest in the research community. Within the semantics of the texts, verbs play an important role, since they are strongly relevant to the sentence's final meaning. Therefore, with this analysis we want to measure how the hypothesis verbs are related to the text's verbs. To achieve this, we exploit the VerbNet lexicon and the VerbOcean and WordNet relationships:[13]

First, we created two wrappers in Java for the VerbNet and VerbOcean resources in order to build in-memory structures that represent the semantic information provided by these resources. Afterwards, we tried to find correlations between the verbs expressed in the hypothesis and those in the text. For these correlations, auxiliary verbs were discarded, and we used the MINIPAR tool (Lin, 1998a) (also used in the syntactic perspective) to detect them. Finally, we established a correspondence between two verbs depending on the occurrence of at least one of the following situations: (i) the two verbs have the same lemma or they are synonyms according to the WordNet synonymy relation; (ii) they belong to the same VerbNet class or to a subclass of their classes; or (iii) there is a relation in VerbOcean that connects them.

The underlying intuition about verb correspondences is that verbs wrapped in the same VerbNet class, or in one of its subclasses, have a strong semantic relation, since they share the same thematic roles and restrictions, as well as syntactic and semantic frames. Additionally, the relations encoded in VerbOcean are good indicators of semantic relations among verbs. In the specific case we are dealing with (i.e. entailment deductions), the VerbOcean relations considered are similarity, strength and happens-before. At this point we do not take into account the antonymy and enablement relations: the enablement relation has poor coverage and involves neither transitivity nor symmetry, so it does not provide enough knowledge to justify the entailment, and the antonymy relation was already considered together with the negation features (see section 3.5.2). Obviously, the asymmetric VerbOcean relations are computed from the hypothesis's verbs to the text's ones. The sketch below illustrates this verb correspondence test.

[13] These resources are described in section 2.2.
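The following minimal Java sketch illustrates the correspondence test, assuming the VerbNet, VerbOcean and WordNet lookups have already been loaded into in-memory structures (via the Java wrappers described above); the toy entries are illustrative only, and subclass reasoning over VerbNet is omitted.

import java.util.Map;
import java.util.Set;

public class VerbCorrespondence {

    // lemma -> WordNet synonyms (toy data)
    static final Map<String, Set<String>> SYNONYMS =
            Map.of("die", Set.of("perish", "decease"));
    // lemma -> VerbNet class identifiers (toy data)
    static final Map<String, Set<String>> VERBNET_CLASSES =
            Map.of("kill", Set.of("murder-42.1"), "murder", Set.of("murder-42.1"));
    // "h-verb->t-verb" pairs linked by similarity, strength or happens-before (toy data)
    static final Set<String> VERBOCEAN = Set.of("die->kill");

    static boolean related(String hVerb, String tVerb) {
        if (hVerb.equals(tVerb)) return true;                          // same lemma
        if (SYNONYMS.getOrDefault(hVerb, Set.of()).contains(tVerb)
                || SYNONYMS.getOrDefault(tVerb, Set.of()).contains(hVerb))
            return true;                                               // WordNet synonymy
        for (String c : VERBNET_CLASSES.getOrDefault(hVerb, Set.of()))
            if (VERBNET_CLASSES.getOrDefault(tVerb, Set.of()).contains(c))
                return true;                                           // shared VerbNet class
        return VERBOCEAN.contains(hVerb + "->" + tVerb);               // VerbOcean relation
    }

    public static void main(String[] args) {
        System.out.println(related("die", "kill"));    // true via VerbOcean
        System.out.println(related("kill", "murder")); // true via VerbNet class
    }
}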
Once we have established when two verbs are related, we study several ways to integrate this knowledge into the system:

Adding a binary feature that shows whether all hypothesis verbs have at least one correspondence with a text verb (with regard to the aforementioned verb inferences).

Considering the previous feature as a value normalized over all hypothesis verbs (i.e. this feature ranges between 0 and 1, where 1 means all of H's verbs are related to T's verbs and 0 means that no correspondence was found for any of H's verbs).

Setting the verb inference as a prior constraint for determining entailment. In this case, each entailment candidate pair has to fulfil the requirement that every H verb is related to a T verb; otherwise the candidate is tagged as a false entailment relation.

In chapter 4, the different ways of considering these inferences will be evaluated.

3.5.4 The Importance of Being a Named Entity

This inference is based on the detection, presence and absence of NEs, and it consists of measuring the importance of the presence or absence of an entity (e.g. when there is an entity in the hypothesis but the same entity is not present in the text). This idea comes from the work presented in (Rodrigo et al., 2006; Rodrigo et al., 2007b), where the authors successfully built their system mainly using the knowledge supplied by the recognition of NEs. In our case, rather than building the system upon NE inferences, we study the addition of this knowledge as system features as well as an entailment constraint (similar to the previous inference about verb relations).

In the context of finding correspondences between entities, a partial entity matching was considered. Therefore, entities such as George Bush, George Walker Bush, G. Bush and Bush are considered the same entity. Besides, our entity reasoning module also takes acronym correspondences into account: if an entity is made up of uppercase letters only, and another entity is composed of several words whose initial letters match the acronym, then these two entities are considered similar entities; for instance, IBM and International Business Machines. A sketch of both matching procedures is shown below.
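The following minimal Java sketch illustrates both matching procedures. It is a simplification of the actual module, which operates on the output of the NE recognizer; the heuristics shown (token containment and initial matching) are assumptions made for this sketch rather than the exact implementation.

public class EntityMatcher {

    // Partial match: every token of the shorter mention occurs in the longer one
    // (so "Bush" and "G. Bush" match "George Walker Bush"); initials such as
    // "G." are matched against the first letter of a token.
    static boolean partialMatch(String a, String b) {
        String shorter = a.length() <= b.length() ? a : b;
        String longer  = a.length() <= b.length() ? b : a;
        for (String tok : shorter.split("\\s+")) {
            boolean found = false;
            for (String cand : longer.split("\\s+")) {
                if (cand.equalsIgnoreCase(tok)
                        || (tok.endsWith(".")
                            && cand.toUpperCase().startsWith(tok.substring(0, 1).toUpperCase()))) {
                    found = true;
                    break;
                }
            }
            if (!found) return false;
        }
        return true;
    }

    // Acronym match: "IBM" vs "International Business Machines".
    static boolean acronymMatch(String acronym, String expansion) {
        if (!acronym.equals(acronym.toUpperCase())) return false;  // uppercase-only mentions
        String[] words = expansion.split("\\s+");
        if (words.length != acronym.length()) return false;
        for (int i = 0; i < words.length; i++)
            if (Character.toUpperCase(words[i].charAt(0)) != acronym.charAt(i)) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(partialMatch("G. Bush", "George Walker Bush"));           // true
        System.out.println(acronymMatch("IBM", "International Business Machines"));  // true
    }
}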
As for the verb inferences, we consider the entity correspondences as: (i) a binary feature showing whether all hypothesis entities found their counterparts within the text; (ii) a value between 0 and 1 depending on the number of hypothesis entities with correspondences; and (iii) a prior constraint forcing the hypothesis entities of an entailment candidate to have found their counterparts in the text. For this purpose, we use our in-house NE recognizer, called NERUA, previously described in section 2.2.

3.5.5 Applying Frame Semantic Analyses

Frame semantic-based analyses are all the more interesting in the task of recognising textual entailment as they offer a robust yet relatively precise measure of semantic overlap. For our inferences based on Frame Semantics we used FrameNet (Baker et al., 1998), where the meanings of predicates and their arguments are modelled in terms of frames and frame elements (roles). A frame describes a prototypical situation, and the roles identify the participants involved in this situation. Frames thus provide a normalization over variations in argument structure realizations. Let's study some cases extracted from the RTE-2 test corpus.

The simplest one is when the same Frame and frame elements are present in both text and hypothesis. For instance:

Pair id=55 (see Figure 3.5)
T: Canadian Nation Defense has been using virtual reality to train pilots and ground soldiers.
H: Soldiers have been trained using virtual reality.

The Frame Education_teaching appears in the text as well as in the hypothesis. In addition, the same frame elements Student and Material are also instantiated with the same or similar entities. In these kinds of situations there is a high probability of a true entailment relation. Besides, discovering matchings between frame element instantiations is more robust than the previous semantics-free inferences (e.g. lexical matchings).
Figure 3.5: Frame annotation of RTE-2 test pair id=55 ((a) text; (b) hypothesis).
However, other cases are less straightforward:

Pair id=132 (see Figure 3.6)
T: An avalanche has struck a popular skiing resort in Austria, killing at least 11 people.
H: Humans died in an avalanche.

Figure 3.6: Frame annotation of RTE-2 test pair id=132 ((a) text; (b) hypothesis).

In this case there is no direct matching between the hypothesis frame Death and any of the text's frames. However, if we exploit the frame-to-frame relations encoded in FrameNet, we find that the text's frame Killing is linked to the Death frame by a causative_of relation, as shown in Figure 3.7.
Figure 3.7: Causative_of FrameNet relation between the Killing and Death frames.

Furthermore, the inner relations between the frame elements of Killing and Death connect the Killing CAUSE frame element with the Death CAUSE frame element, and the Killing VICTIM frame element with the Death PROTAGONIST one. Therefore, by exploiting the Frame-to-Frame and frame element-to-frame element relations, we could discover an entailment relation with a certain degree of probability. Obviously, this degree will depend on the relations that connect the target frames.

Besides, FrameNet can be very useful in detecting false positives. For instance, FrameNet is a good indicator of situations in which the lexical perspective as well as the syntactic one achieve high similarity levels because of the large overlap between the lexical items and the syntactic structures, and yet there is no entailment relation. Let's consider:

Pair id=423 (see Figures 3.8 and 3.9)
T: X-rays and radioactivity had been discovered just a decade earlier, and some years before that Hertz had discovered radio waves.
H: Hertz discovered X-rays.

Figure 3.8: Frame annotation of RTE-2 test pair id=423 (text; parts one and two).

Within the text, the frame Becoming_aware appears twice, and it is also present in the hypothesis. However, in both appearances the frame element instantiations differ from those of the hypothesis. For the first Becoming_aware frame in the text, COGNIZER has no instantiation,
Figure 3.9: Frame annotation of RTE-2 test pair id=423 (hypothesis).

whilst for the second one PHENOMENON is instantiated by an entity different from the hypothesis's PHENOMENON instantiation. Consequently, the appearance of the same frame in both the text and the hypothesis, but with different entities instantiating their frame elements, is a strong indicator of a false entailment relation. Other similar cases are pairs id=52, 54 and 193.

In order to take advantage of the knowledge provided by the Frames within the FrameNet hierarchy, a procedure was developed to find similarities at Frame-to-Frame level.

Frame-to-Frame Similarity Metric

This metric obtains similarity scores between two FrameNet Frames. It is useful for weighting the inherent semantics of FrameNet, finding semantic relations through the FrameNet hierarchy, and quantifying how alike two concepts are based on the information contained in FrameNet. The similarity factor obtained by this metric is based mainly on the relations used in FrameNet to connect the Frames, as well as on the information content of the Frames. Figure 3.10 depicts a visual example of the connection path between two Frames.
Figure 3.10: Frame-to-Frame similarity metric: visual example.

Regarding the example, to obtain the similarity factor between two Frames, F and F', the algorithm exploits all relations from F and its descendants until reaching F'. For the sake of computational efficiency, the maximum path depth for finding the desired Frame was set to 5; moreover, in our experiments longer connection paths contribute insignificant semantic values. There are several ways to obtain the final similarity coefficient; equation 3.39 is the simplest one for calculating this similarity weight:

Sim_v1 = W_R1 · W_R2 · ... · W_Rn    (3.39)

R1, R2, ..., Rn are the relations that form the connection path, and W_R1, W_R2, ..., W_Rn their weights, as shown in Table 3.5. These weights were established heuristically, considering the significance that each relation has in the FrameNet hierarchy. The sketch below illustrates the computation of equation 3.39.

Relation          Parent (FROM)   Child (BY)
Membership              1
Inheritance            0.8            0.7
Perspective on         0.8            0.7
SubFrame               0.6            0.5
Precedes               0.7            0.7
Causative              0.5            0.4
Inchoative             0.5            0.4
Using                  0.7            0.6
See also               0.5            0.4

Table 3.5: Frame-to-Frame: FrameNet relation weights.
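To make equation 3.39 concrete, the following minimal Java sketch computes the similarity of a given connection path as the product of the direction-dependent weights of Table 3.5, capping the depth at 5; the path search over the FrameNet hierarchy is assumed to be performed elsewhere, and the class and record names are assumptions of this sketch.

import java.util.List;

public class FrameToFrameSimilarity {

    static final int MAX_DEPTH = 5;

    // One relation step along the path, with its direction-dependent weight.
    record Relation(String name, double weight) {}

    static double simV1(List<Relation> path) {
        if (path.size() > MAX_DEPTH) return 0.0;   // longer paths carry no useful semantics
        double sim = 1.0;
        for (Relation r : path) sim *= r.weight;
        return sim;
    }

    public static void main(String[] args) {
        // Killing --(Causative, parent direction: 0.5)--> Death
        System.out.println(simV1(List.of(new Relation("Causative", 0.5))));   // 0.5
        // A two-step path: Inheritance (0.8) followed by Using (0.7)
        System.out.println(simV1(List.of(new Relation("Inheritance", 0.8),
                                         new Relation("Using", 0.7))));       // 0.56
    }
}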
After studying the connection paths between Frames, some improvements considering several other aspects relevant to the similarity coefficient were proposed:

Relations order within the path. We add a factor that measures how important two consecutive relations are within the path:

Sim_v2 = W_R1 · RW(R1, R2) · W_R2 · ... · RW(Rn-1, Rn) · W_Rn

RW(R1, R2) is the weight applied when R1 precedes R2 in the path. To establish these weights, the idea was that when the same or a similar relation follows in the path (e.g. from Inherits from to Inherits from) the weight is equal to one; when the path goes down (e.g. from Inherits from to Inherited by) the weight is the same as for the parent relation (see Table 3.5); and when it goes up (e.g. from Inherited by to Inherits from) the weight is equal to the parent relation weight divided by 2. The underlying intuition behind this decision is that from child nodes to parent nodes the loss of semantic information is higher than from parent nodes to child ones. For cross-relation references, Inheritance, Perspective on and Using are considered similar relations when establishing these weights, as are Causative and Inchoative. The remaining cross-relation reference values, although they rarely appear, are set to one, leaving the similarity decision to the other weights.

The Frames along the path. Measuring how relevant the Frames involved in the connection path are is also important for the final similarity score. Thus, we transform the similarity equation as follows:

Sim_v3 = FW(F) · W_R1 · FW(f1) · RW(R1, R2) · W_R2 · ... · RW(Rn-1, Rn) · W_Rn · FW(F')

FW(F) is the weight for the Frame F. It is calculated depending upon the generality of the Frame with regard to the relations associated with it in the path. We measure this generality as the number of children that the Frame has with regard to the relations that connect it.
For instance, following the previous example:

FW(F) = 1 / #Children(F, R1)
FW(f1) = 1 / (#Children(f1, R1) + #Children(f1, R2))
FW(F') = 1 / #Children(F', Rn)

The Frame Elements (FEs). In this case, we measure the importance of the FE-to-FE relations, distinguishing between core and non-core FEs. The core FEs are those that uniquely define a Frame (e.g. Speaker, Addressee, Message), whereas non-core FEs describe more general aspects of events (e.g. Time, Place). Therefore, the equation is the same as the previous one, but now the relation weights (see Table 3.5) are calculated as follows (Sim_v4):

W_Ri = ( OriginalWeight · ( (2/3) · RelatedFECore/FECore + (1/3) · RelatedFEnonCore/FEnonCore ) + OriginalWeight ) / 2

where OriginalWeight is the weight as in Table 3.5, RelatedFECore is the number of core FEs of the parent Frame related to FEs of the child Frame, FECore is the total number of core FEs of the parent Frame, and, finally, RelatedFEnonCore and FEnonCore correspond analogously to the non-core FEs. As the equation shows, more importance is given to core FEs than to non-core ones.

The Frame Elements (FEs) (II). Similar to the previous one, but instead of considering both core and non-core FEs, only core FEs are considered (Sim_v5). This is because many non-core FEs have no correspondence between Frames, and we would like to test whether the similarity score improves when solely the core FEs are considered.

Aside from the application of this metric as another inference to support the recognition of textual entailment relations, we evaluate its informativeness in the following way:

1. We created an evaluation framework made up of the common monosemous lemma.pos entries shared by the whole set of FrameNet 1.3 Lexical Units and WordNet 3.0 Synsets.
2. Over these monosemous entries, we apply the Frame-to-Frame similarity metric together with other metrics based on WordNet (e.g. the Wu & Palmer (1994) measure[14]).

3. We compare the results obtained by both metrics. This comparison shows how similar the two hierarchies are, and it denotes the potential usefulness of the Frame-to-Frame metric.

4. Preliminary results show that a correspondence of around 80% is reached when the two measures are higher than 0.75. However, there are some cases where a very high coefficient is obtained by the Frame-to-Frame metric while low scores are computed by the WordNet-based measures. This is due to the fact that the meanings of specific lemma.pos entries, while monosemous, differ between FrameNet and WordNet.

While using FrameNet, we became aware of its limited coverage; indeed, it is well known that many approaches using FrameNet obtain poor results owing to it. We therefore also developed an algorithm that aligns FrameNet Lexical Units with WordNet Synsets. This alignment also allows us to associate the synonyms and hyponyms of a synset with its aligned Lexical Unit and Frame, consequently increasing the FrameNet coverage.

FrameNet-WordNet Alignment Measure

The proposed alignment measure obtains correlation scores per lemma and part-of-speech tag between WordNet Synsets and FrameNet Lexical Units. The main idea is to exploit the relations that appear in both hierarchies in order to obtain scores showing the degree of correlation between each Lexical Unit and Synset, which in principle only share the same lemma and part-of-speech tag. Both the WordNet and the FrameNet hierarchies are incomplete and they use different sets of relations; this is the reason why we depend on the lemma-PoS combination as our starting point for comparisons between the two lexical resources.

[14] We chose the Wu & Palmer measure because it is mainly based on the WordNet hierarchy and the least common superconcept between two concepts, which is similar to the way Frame-to-Frame similarities are obtained over the FrameNet hierarchy.
Broadly speaking, what the proposed algorithm does is to define the neighbourhood of each LU and WordNet synset that are candidates for alignment, and to calculate how similar those neighbourhoods are for each relation. The algorithm starts with a particular lemma.pos, which can occur in several WordNet synsets and also in several FrameNet lexical units, and looks up all of the senses associated with that lemma.pos in WordNet and FrameNet, traversing each of the relations in the resources in turn to construct one WordNet neighbourhood and one FrameNet neighbourhood with the starting word sense at the centre. The similarity between these neighbourhoods is calculated by summing similarity scores for the other word senses within the neighbourhoods, and then summing across these aggregate scores for each of the relation types, as seen in Equation 3.40. The score contributed by one word in the neighbourhood is calculated by, first, finding the distance from the word to the centre of each neighbourhood travelling along the current relations; then taking the difference between the two distances; and, finally, inverting the distance to produce a similarity score. For this to work, the distances must be weighted differently in the two resources, since WordNet has a finer-grained hierarchy, meaning that the average number of nodes between two related senses is larger. We then judge whether the neighbourhoods are sufficiently correlated to align the senses. A variety of methods are possible, but the most obvious is perhaps best-first, so we use it in our experiments.

C(LU, S) = Σ_{R_FN} Σ_{R_WN} W(R_FN, R_WN) · (1/|λ|) · Σ_λ 1 / ( |d_{R_FN}(LU, λ) - d_{R_WN}(S, λ)| + α(R_FN, R_WN) )    (3.40)

Equation 3.40 encapsulates the inner steps of the algorithm and represents the sense-to-sense correlation. C is the correlation between a FrameNet Lexical Unit (LU) and a WordNet sense S; R_FN is a FrameNet frame-to-frame or lemma-to-frame relation type; R_WN is a WordNet synset-to-synset or word-to-synset relation type; W is a function which weights the expected informativeness of each pair of WN/FN relation types; λ is a word in the vicinity of LU (along relation R_FN) and/or S (along relation R_WN); |λ| is the number of words in the WordNet and FrameNet neighbourhoods along the relevant relations; d_{R_FN} is the normalized LU-to-LU distance function travelling the FrameNet hierarchy along relation R_FN; d_{R_WN} is the normalized sense-to-sense distance function travelling the WordNet hierarchy along relation R_WN; and α is a small constant (say 1) that prevents division by zero as well as complete swamping by good individual correlations. Program 2 presents the algorithm step by step.
Program 2 The alignment algorithm step by step.
INPUT: lemma+pos
  FN LUs related to this lemma+pos: LUs = {LU1, ..., LUn}
  WN Synsets related to this lemma+pos: Synsets = {S1, ..., Sn}
for each LU and WN-sense of the lemma+pos
  for each FN-relation
    - obtain the FN neighbourhood for lemma+pos traversing the given relations
  for each WN-relation
    - obtain the WN neighbourhood for lemma+pos traversing the given relations
  for each neighbour in the neighbourhood of the LU or WN-sense
    - calculate and normalize the distance between the neighbour and the starting LU
    - calculate and normalize the distance between the neighbour and the starting WN-sense
      (if there is no such neighbour, use a default maximum)
    - subtract these distances (when the distances are similar, we get a small number)
    - take the inverse to produce a score (now good correlation gives a big number)
    - aggregate the score for the neighbourhood (by summing)
  - normalize by dividing by the number of neighbourhood lemmas
  - multiply by a weight for the current WN/FN relation pair
    (e.g. WN-hyponymy to downward FN-Inheritance is almost 1,
    WN-antonymy to Causative_of is almost 0, i.e. uninformative)
  - aggregate the score per relation (by summing)
- judge whether the correlation for the mapping is good enough (in a best-first
  way: the best-scoring WN sense-FN LU pair will be matched)
  - if so, join the LU and WN-sense in the joint hierarchy
  - if not, move on
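The following minimal Java sketch illustrates the inner term of Equation 3.40 for a single pair of relation types; it assumes the neighbourhood distances have already been computed and normalized, and uses a default maximum distance for missing neighbours.

import java.util.Map;

public class NeighbourhoodCorrelation {

    static final double DEFAULT_MAX_DISTANCE = 1.0;
    static final double ALPHA = 1.0;  // small constant preventing division by zero

    // distFN / distWN map each neighbour lemma to its normalized distance from
    // the starting LU / WordNet sense along the current relation pair.
    static double correlation(Map<String, Double> distFN, Map<String, Double> distWN,
                              double relationPairWeight) {
        double sum = 0.0;
        int n = 0;
        for (Map.Entry<String, Double> e : distFN.entrySet()) {
            double dFN = e.getValue();
            double dWN = distWN.getOrDefault(e.getKey(), DEFAULT_MAX_DISTANCE);
            sum += 1.0 / (Math.abs(dFN - dWN) + ALPHA);  // similar distances -> high score
            n++;
        }
        if (n == 0) return 0.0;
        return relationPairWeight * sum / n;             // normalize and weight
    }

    public static void main(String[] args) {
        Map<String, Double> fn = Map.of("perish", 0.7, "assassinate", 0.7);
        Map<String, Double> wn = Map.of("perish", 0.8);
        System.out.println(correlation(fn, wn, 1.0));    // approximately 0.84
    }
}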
Regarding R_FN for a specific FrameNet relation, we have taken into account the weights shown in Table 3.6. They were obtained heuristically, by considering the meaningfulness that each relation has within the FrameNet hierarchy (as in the Frame-to-Frame similarity). These weights are multiplied by the depth of the neighbour in the neighbourhood, measured from the starting LU. In the current version of the algorithm, the neighbourhood is built from all neighbours with depth equal to one; subsequent work will expand the depth of the neighbourhood in order to obtain similar neighbourhood sizes for both resources. Similarly, R_WN is obtained from the weights illustrated in Table 3.7. Although these were empirically set, the work presented in (Moldovan & Novischi, 2002) about lexical chains through the WordNet hierarchy also served as inspiration for establishing them.

Relation          Parent (FROM)    Child (BY)
Membership        0.8 * #members
Inheritance            0.8             0.7
Perspective on         0.8             0.7
SubFrame               0.6             0.5
Precedes               0.7             0.7
Causative              0.5             0.4
Inchoative             0.5             0.4
Using                  0.7             0.6
See also               0.5             0.4

Table 3.6: FrameNet-WordNet alignment: FrameNet relation weights.

Relation            Weight
Synonymy             0.9
Hypernymy            0.8
Hyponymy             0.7
Antonymy             0.1
Entailment           0.7
Cause to             0.5
Derived forms        0.7
Holonymy             0.5
Meronymy             0.5
Attributes           0.5
Coordinate terms     0.5

Table 3.7: FrameNet-WordNet alignment: WordNet relation weights.

With regard to the weights assigned to each FrameNet-WordNet relation pair (i.e. W(R_FN, R_WN)), in the current state of this research all W-values are set to one, leaving the final alignment decision to the remaining values within the formula.
Figure 3.11 depicts a visual example of how the FrameNet and WordNet neighbourhoods are compared in order to obtain the alignment between a specific FrameNet Lexical Unit and a WordNet Synset that share the lemma and part-of-speech tag.

Figure 3.11: Visual example of the FrameNet-WordNet alignment.

In order to assess the accuracy of our alignment procedure, we used the evaluation framework provided by Tonelli & Pianta (2009), consisting of a gold-standard set of manually annotated mappings between LUs and WordNet 1.6 synsets. To use this gold standard we had to map each WordNet 1.6 synset to its corresponding WordNet 3.0 synset and, to maintain consistency with our framework and algorithm, we also had to discard those mappings that: (i) have a different PoS in WordNet and FrameNet (e.g. in the gold standard the WordNet adjective born is associated with the FrameNet verb born);
or (ii) have different lemmata (e.g. it associates WordNet account for.v with FrameNet account.v).

              Nouns    Verbs    Adjs    Advs    Overall
Accuracy (%)  68.55    64.25    77.19    100     67.73

Table 3.8: FrameNet-WordNet alignment: results on Tonelli's dataset.

Therefore, after discarding the aforementioned cases, the evaluation was carried out over 375 manual mappings, broken down into 124 nouns, 193 verbs, 57 adjectives and 1 adverb.[15] Table 3.8 shows the accuracy (i.e. the proportion of correct alignments) obtained by our algorithm on Tonelli's gold-standard dataset; overall, it seems to be roughly comparable to Tonelli et al.'s reported F-score of 66%.

[15] One should note that the accuracy obtained for adverbs is not significant, since there is just one adverb sample within the gold-standard dataset.

Adding the Frame Semantic Analyses to the Textual Entailment System

To integrate these Frame semantics analyses into our system, the Shalmaneser tool (Erk & Pado, 2006), explained in section 2.2.6, was used to annotate the Frames and Frame Elements in plain text. After that, we compute the previously mentioned inferences, deriving the corresponding system features:

Frames matching: a simple overlap between the frames detected in H and T (i.e. a normalized weight that measures how many of H's frames also appear in T).

Frame Elements matching: shows how many frame elements from the frames detected in both text and hypothesis share similar or lexically related instantiations. To compare frame element instantiations we used the Levenshtein distance (with a similarity higher than or equal to 80%) as well as the WordNet synonymy and hyponymy relations from T's frame elements to H's frame elements (a sketch of this comparison is shown after this list).
Therefore, two frame element instantiations are similar if they have the same lemma, their Levenshtein similarity is higher than or equal to 80%, or the T instantiation is a synonym or hyponym of the H instantiation.

Frame-to-Frame similarity: computes a score by exploiting the frame-to-frame relations, as mentioned previously. This accumulated score is obtained by summing the maximum frame-to-frame similarity values for each H frame with regard to all T frames. In this inference, the frame-to-frame relations are computed from T's Frames to H's Frames.

FrameNet-WordNet alignment matching: similar to the Frames matching but considering the FrameNet-WordNet alignment. So, if there are WordNet synonyms and hyponyms of a synset that have been aligned to a specific Lexical Unit and Shalmaneser has not annotated them, these WordNet synonyms and hyponyms evoke the Frame corresponding to the alignment. After that, the Frames matching considering these new frames is calculated. Note that no Frame Elements are detected for these new frames, so the matching is carried out solely over the Frames.
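As an illustration of the frame element comparison above, the following minimal Java sketch tests whether two instantiations match by lemma identity or by a normalized Levenshtein similarity of at least 80%; the WordNet synonym/hyponym lookup of the full system is omitted here.

public class FrameElementMatch {

    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Similarity = 1 - edit distance / length of the longer string.
    static boolean similarInstantiations(String t, String h) {
        if (t.equalsIgnoreCase(h)) return true;
        int longest = Math.max(t.length(), h.length());
        double similarity =
                1.0 - (double) levenshtein(t.toLowerCase(), h.toLowerCase()) / longest;
        return similarity >= 0.8;
    }

    public static void main(String[] args) {
        System.out.println(similarInstantiations("ground soldiers", "soldiers")); // false (~53%)
        System.out.println(similarInstantiations("soldiers", "Soldiers"));        // true
    }
}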
3.6 Summary

This chapter presented the main idea behind the construction of our textual entailment system: to tackle the entailment phenomenon from relevant but conceptually different points of view, namely the Lexical, Syntactic and Semantic perspectives. Consequently, three perspectives were developed in order to extract meaningful inferences from these three different points of view. We also presented detailed descriptions of all of them, placing special emphasis on the benefits of applying these inferences to the entailment phenomenon. Furthermore, this chapter described a visual overview of the system's workflow, showing how it operates and determines the final entailment decision. To sum up, several system configurations were proposed: some with the entailment decision relying on each perspective individually, others combining the knowledge provided by all perspectives, and additionally some configurations implementing several potentially useful entailment constraints as well as a voting schema. All of them will be evaluated throughout the next chapter.
4 A Pure Entailment Evaluation: Experiments, Results and Discussion

We distinguish between two ways of evaluating our textual entailment system: intrinsically and extrinsically. The former consists of evaluating the system over pure textual entailment environments,[1] whereas the latter assesses the applicability of the system to other NLP tasks and how it helps their global performance. This chapter describes the pure entailment evaluation framework used to test the capability of our system for detecting entailment relations in an isolated manner. It starts with the selection procedure of the most meaningful inferences from those previously exposed in chapter 3. Broadly speaking, this selection is based on the information gain of each inference and its 10-fold cross-validation accuracy over the development corpora.

[1] By pure textual entailment environments we denote those that are intended for evaluating the textual entailment task alone, as it was defined in the introductory section.
Later on, a battery of experiments is presented, considering the different perspectives and system configurations introduced in the previous chapter. Finally, two additional experiments on closely related tasks are also detailed. Regarding the extrinsic evaluation, chapter 5 will show the system's utility in supporting other NLP tasks.

4.1 The Evaluation Framework

As previously mentioned in section 2.3, the RTE Challenges provide the most appropriate evaluation environment for determining the performance of textual entailment systems. Moreover, these challenges establish a good reference point for comparing the system with the most relevant works in this area. In their four editions, the RTE organizers have supplied both development and test corpora (except for the last edition, for which only the test corpus was provided). Therefore, throughout this chapter we will show the results obtained by our system for each RTE challenge we participated in (i.e. RTE-2, RTE-3 and RTE-4) as well as the different experiments we carried out.[2]

The corpora provided by the organizers of these RTE editions contain 800 pairs manually annotated for logical entailment (except for the fourth edition, whose test corpus is made up of 1,000 pairs). These corpora are composed of four subsets, each of them corresponding to typical true and false entailments in different NLP tasks, such as IE, IR, QA and SUM. For each task, the annotators selected the same number of true and false entailments (a 50%-50% split). However, as the organizers of RTE-3 explained in their overview (Giampiccolo et al., 2007), although the datasets were supposed to be perfectly balanced, the number of positive examples was unintentionally slightly higher in both the development and test sets. Finally, the judgements returned by the system are compared to those manually assigned by the human annotators. The percentage of matching judgements provides the accuracy of the system, i.e. the percentage of correct responses.

[2] One should note that some results can differ from the official RTE results (shown in appendix A), since some experiments were developed after the RTE Workshop deadlines.
4.2 Selecting the Best System's Features

It has been noted from related systems that a proper combination of trained features in a machine learning algorithm can lead to an overall improvement in system performance, in particular if features from a more informed component are combined with shallow ones (Bos & Markert, 2006). Conversely, larger feature sets do not necessarily lead to improved classification performance: despite seeming useful, some features may in fact be too noisy, irrelevant or redundant, increasing the risk of overfitting. In order to discover the set of the most meaningful features, we processed all of them, obtaining the Information Gain of each one with regard to the RTE development corpora. Information Gain is the reduction of entropy (uncertainty) about the classification of a target class based on observations of a particular variable; in other words, it is the amount of information gained towards obtaining pure classes by making a decision on that variable. Information Gain is mostly used for deciding which variables to use first in a classification problem: the higher the Information Gain of a variable, the higher the chances of obtaining pure classes when splitting on it. A sketch of its computation for a binary feature is shown below.
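As an illustration, the following minimal Java sketch computes the Information Gain of a binary feature with respect to the binary (YES/NO) entailment class from simple counts; the counts in the example are invented for illustration.

public class InformationGain {

    static double entropy(double... counts) {
        double total = 0;
        for (double c : counts) total += c;
        double h = 0;
        for (double c : counts) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));   // entropy in bits
        }
        return h;
    }

    // IG = H(class) - weighted average of H(class | feature value).
    static double infoGain(double yes0, double no0, double yes1, double no1) {
        double total = yes0 + no0 + yes1 + no1;
        double prior = entropy(yes0 + yes1, no0 + no1);
        double cond = ((yes0 + no0) / total) * entropy(yes0, no0)
                    + ((yes1 + no1) / total) * entropy(yes1, no1);
        return prior - cond;
    }

    public static void main(String[] args) {
        // Feature value 0: 100 YES / 300 NO pairs; feature value 1: 300 YES / 100 NO pairs.
        System.out.println(infoGain(100, 300, 300, 100));  // approximately 0.19 bits
    }
}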
Figures C.1, C.2, C.3, C.4, C.5 and C.6 in appendix C (p. 195) illustrate, by means of bar graphs, the Information Gain achieved by each feature considering the RTE-2 development corpus (Figures C.1 and C.2), the RTE-3 development corpus (Figures C.3 and C.4) and both corpora put together (Figures C.5 and C.6). By analysing the graphs, we observe that the features obtaining the best Information Gain are those related to the lexical perspective. However, within this feature group the Rouge measures do not achieve values as high as the rest of the lexical features. The syntactic feature (consisting of the tree embedding and tree matching procedures explained in section 3.4) obtains acceptable Information Gain values and, more importantly, it seems to be a potentially good indicator for recognising entailments left undetected by the lexical perspective.

With regard to the semantic features, the worst ones, or at least those whose values are so low that they seem too poor to discriminate between true and false entailment, are the polarity features. This is because there are not many pairs expressing negation between T and H verbs; additionally, the WordNet antonymy relation does not often appear in the texts, resulting in low Information Gain values. The features based on WordNet measures reach the best values among all semantic features, the most significant being the ones considering all grammatical categories and only the nouns. Although the NE correspondence and verb relation features do not obtain high values, the feature considering weighted NE correspondences between H's and T's entities, as well as the one returning a weighted value showing how many of H's verbs are related to at least one of T's verbs, could be useful within the entailment decision procedure. Finally, the Frame semantics inferences also seem to be relevant for solving some entailment cases, since they approach the entailment phenomenon from a distinct point of view.

In addition to the Information Gain values, we also computed the 10-fold cross-validation accuracy obtained by each feature using the RTE-2 or the RTE-3 development corpora. Tables 4.1 and 4.2 show these 10-fold cross-validation values. As previously commented on in section 3.1, in our experiments we used the Support Vector Machine (SVM) classifier implemented in Weka (Witten & Frank, 2005). We then built several feature sets according to the highest Information Gain and 10-fold cross-validation values of each perspective's feature group, as well as the intuition that features can be complementary for entailment recognition when they are derived from different perspectives and/or resources. Tables 4.3 and 4.4 show the best feature sets obtained with regard to each RTE development corpus (i.e. the RTE-2 and RTE-3 corpora). Since the syntactic perspective consists of only one feature, no selection procedure was needed for it; however, the syntactic feature was taken into account when the best feature sets considering all features were generated. To deduce the set of the most informative features, we iteratively removed from the whole set of features (the lexical, semantic or all-features set) the one with the lowest Information Gain value; if the 10-fold cross-validation accuracy of this preliminary feature set stayed equal or increased, the feature was discarded from the final feature set. This strategy is similar to a top-down feature selection procedure, but in our case influenced by the Information Gain of each feature, as sketched below.
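The following minimal Java sketch illustrates this selection strategy; the crossValidate function, which returns the 10-fold cross-validation accuracy of a feature set, is assumed to wrap the Weka SVM classifier, and the exact tie-breaking behaviour is an assumption of this sketch.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class FeatureSelection {

    record Feature(String name, double infoGain) {}

    static List<Feature> select(List<Feature> all,
                                ToDoubleFunction<List<Feature>> crossValidate) {
        List<Feature> current = new ArrayList<>(all);
        current.sort(Comparator.comparingDouble(Feature::infoGain));  // lowest IG first
        List<Feature> kept = new ArrayList<>(current);
        for (Feature candidate : current) {
            List<Feature> without = new ArrayList<>(kept);
            without.remove(candidate);
            // Discard the feature if accuracy without it is at least as good.
            if (crossValidate.applyAsDouble(without) >= crossValidate.applyAsDouble(kept)) {
                kept = without;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Feature> all = List.of(new Feature("Cosine", 0.05),
                                    new Feature("Rouge-S3", 0.001));
        // Toy evaluator: pretends accuracy equals the summed Information Gain.
        List<Feature> best = select(all, fs -> fs.stream()
                                                 .mapToDouble(Feature::infoGain).sum());
        System.out.println(best);
    }
}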
Lexical Features                            RTE-2 dev.   RTE-3 dev.
Binary Matching                              60.625%      70.125%
Levenshtein distance                         61.625%      69.875%
Needleman-Wunsch algorithm                   62.375%      69.375%
Smith-Waterman algorithm                     60.625%      70.125%
Consecutive Subsequence Matching (CSM)       59.875%      66.25%
Rouge-N2                                     59.875%      63.25%
Rouge-N3                                     54.75%       55.25%
Rouge-L                                      59.625%      62.625%
Rouge-W                                      59.375%      60.75%
Rouge-S2                                     55.75%       55%
Rouge-S3                                     53.625%      51.5%
Jaro distance                                61.25%       70%
Jaro-Winkler distance                        61.875%      70.875%
Euclidean distance                           59.875%      69.75%
Jaccard similarity coefficient               59.875%      69.75%
Dice's coefficient                           59.875%      69.75%
Cosine similarity                            59.875%      69.75%
Soundex distance                             58.625%      63.875%
Q-gram matching                              60.875%      63.875%
IDF specificity                              58.875%      66%
ALL LEXICAL FEATURES                         63.125%      70.5%

Syntactic Feature
Tree embedding & matching                    61.25%       64%

Table 4.1: The 10-fold cross-validation accuracy values obtained by each lexical and syntactic feature.
Semantic Features                                RTE-2 dev.   RTE-3 dev.
WN-measures all items                            61.5578%     66.8342%
WN-measures nouns                                60.0503%     64.0704%
WN-measures verbs                                53.8945%     54.6482%
WN-measures others                               53.8945%     58.5427%
WN-measures idf                                  49.4975%     56.6583%
Antonymy                                         50.1256%     51.3819%
General Polarity                                 53.2663%     51.005%
Main Verbs Polarity                              51.6332%     50.6281%
Conditional Polarity                             49.7487%     51.3819%
Modal Degree                                     49.8744%     51.3819%
Verbs Correspondences (binary feature)           55.1508%     56.5327%
Verbs Correspondences (normalized feature)       52.8894%     53.8945%
Entities Correspondences (binary feature)        45.2261%     50.6281%
Entities Correspondences (normalized feature)    56.0302%     55.402%
Frames Matching                                  56.2814%     54.5226%
Frames Elements Matching                         53.6432%     52.7632%
Frame-to-Frame similarity                        48.4925%     51.3819%
FN-WN Alignment Matching                         54.0201%     51.8844%
ALL SEMANTIC FEATURES                            60.9296%     66.7085%

ALL Features (Lexical-Syntactic-Semantic)        63.875%      70.4774%

Table 4.2: The 10-fold cross-validation accuracy values obtained by each semantic feature and by all features combined.

As shown in Tables 4.3 and 4.4, the best feature sets are strongly dependent on the idiosyncrasies of each corpus. Although there are many features that appear in all the best feature sets (such as CSM, Cosine, the WN-measures, General Polarity, Frames Matching, etc.), there are also some features that only support the entailment decision for specific corpora (e.g. the Frame-to-Frame similarity for RTE-2 and the Q-gram matching for the RTE-3 corpus, when all features are considered). Nevertheless, even though these statements depend on the corpora, the RTE corpora are nowadays the most reliable for evaluating textual entailment systems. Nonetheless, for the sake of consolidating a set of features that supports the classification of entailment pairs beyond those presented in the RTE corpora, two new sets were created: (i) by the union of the features of the best feature sets (considering all perspectives) for the RTE-2 and RTE-3 development corpora, namely the ∪ set; and (ii) by their intersection, the ∩ set. Table 4.5 illustrates these two new sets of features together with their 10-fold cross-validation over the RTE corpora. The system behaviour using these sets is quite similar and, although the results slightly decrease when computing them over the RTE corpora, we believe that considering these sets for pairs not belonging to the RTE challenges may overcome the feature dependency on the RTE corpora idiosyncrasies.
The Best Lexical Features Set for the RTE-2 dev. corpus             10-fold cross val.
Needleman-Wunsch, CSM, Rouge-L, Rouge-W, Rouge-S3, Jaro,
Euclidean, Cosine, Soundex                                               64.5%

The Best Lexical Features Set for the RTE-3 dev. corpus
Smith-Waterman, CSM, Rouge-L, Rouge-W, Jaro-Winkler,
Euclidean, Cosine, Soundex                                               71.125%

The Best Semantic Features Set for the RTE-2 dev. corpus
WN-measures all-items, WN-measures nouns, General Polarity,
Entities Correspondences (normalized), Frames Matching,
Frames Elements Matching, Frame-to-Frame similarity,
FrameNet-WordNet Alignment Matching                                      62.9397%

The Best Semantic Features Set for the RTE-3 dev. corpus
WN-measures all-items, WN-measures nouns, General Polarity,
Main Verbs Polarity, Conditional Polarity, Verbs
Correspondences (binary), Verbs Correspondences (normalized),
Entities Correspondences (normalized), Frames Matching,
Frames Elements Matching, Frame-to-Frame similarity,
FrameNet-WordNet Alignment Matching                                      68.8442%

Table 4.3: The best lexical and semantic feature sets obtained with regard to each RTE development corpus.
The Best Features Set (all features) for the RTE-2 dev. corpus      10-fold cross val.
Levenshtein, Needleman-Wunsch, Smith-Waterman, CSM, Rouge-L,
Rouge-S2, Jaro, Jaro-Winkler, Euclidean, Cosine, Soundex, IDF,
Tree embedding & matching, WN-measures all-items, WN-measures
nouns, WN-measures verbs, WN-measures idf, General Polarity,
Verbs Correspondences (normalized), Entities Correspondences
(normalized), Frames Matching, Frames Elements Matching,
Frame-to-Frame similarity, FrameNet-WordNet Alignment Matching           65.125%

The Best Features Set (all features) for the RTE-3 dev. corpus
Needleman-Wunsch, Smith-Waterman, CSM, Rouge-L, Rouge-W, Jaro,
Jaro-Winkler, Euclidean, Cosine, Soundex, Q-gram, IDF, Tree
embedding & matching, WN-measures all-items, WN-measures nouns,
WN-measures verbs, WN-measures idf, General Polarity, Verbs
Correspondences (normalized), Entities Correspondences
(normalized), Frames Matching, FrameNet-WordNet Alignment Matching       71.7337%

Table 4.4: The best feature sets (all perspectives) obtained with regard to each RTE development corpus.

The ∩ set of Features                                      RTE-2 dev.   RTE-3 dev.
Needleman-Wunsch, Smith-Waterman, CSM, Rouge-L, Jaro,
Jaro-Winkler, Euclidean, Cosine, Soundex, IDF, Tree
embedding & matching, WN-measures all-items, WN-measures
nouns, WN-measures verbs, WN-measures idf, General
Polarity, Verbs Correspondences (normalized), Entities
Correspondences (normalized), Frames Matching,
FrameNet-WordNet Alignment Matching                         63.125%      71.2312%

The ∪ set of Features
∩ set + Levenshtein, Rouge-W, Rouge-S2, Q-gram, Frames
Elements Matching, Frame-to-Frame similarity                64.125%      71.3568%

Table 4.5: The ∪ set and the ∩ set of features.
4.3 Experiments, Results and Discussion

In order to assess the performance of the system, a baseline setting all the pairs as positive entailment relations was established (BASE_yes). Since the RTE corpora are practically balanced, this baseline obtains values close to 50%. Tables 4.6 and 4.7 illustrate the results achieved over the RTE-2/RTE-3 and RTE-4 corpora, respectively. They present the individual results achieved by each perspective as well as by all of them considered together, showing the accuracy obtained by 10-fold cross-validation over the development corpus, the overall accuracy using the test corpus (as a blind corpus) and the accuracy achieved depending on the task the entailment pair was derived from. Also shown are the precision, recall and f-measure for positive and negative pairs. Note that, as no development corpus was provided for RTE-4, several experiments were carried out using the development corpora of RTE-2, RTE-3 and the merge of these two corpora.

As shown in the tables, using the best features from the three perspectives (lexical, syntactic and semantic) achieves better overall results than each perspective individually. However, when splitting the corpus into the four tasks it is composed of, the results vary depending on both the target task and the perspective used. Regarding positive/true/YES and negative/false/NO entailment pairs, the system behaviour is somewhat similar for all RTE corpora. Although the precision on NO-pairs is slightly higher than on YES-pairs, their recall is quite low, resulting in a low f-measure as well. This contrasts with the values achieved for the YES-pairs, whose precision is more than acceptable and whose recall reaches very good values. The underlying reason behind these percentages is that, when the system has doubts about the correct decision (YES or NO), it opts for tagging the pair as a true entailment, due to the fact that in the training phase the system was able to better represent the NO-pair examples than the YES ones (i.e. there is more diversity among the YES-pairs during the training phase).
RTE-2                 Dev. corpus       Test corpus
                      10-f cross val.   Overall   IE       IR       QA       SUM
BASE_yes                  0.5000        0.5000   0.5000   0.5000   0.5000   0.5000
Best Lexical set          0.6450        0.5875   0.5150   0.6150   0.5250   0.6950
    YES pairs: Prec. 0.57   Recall 0.713  F 0.633   NO pairs: Prec. 0.617  Recall 0.463  F 0.529
Syntactic feature         0.6125        0.5613   0.4950   0.5850   0.5250   0.6400
    YES pairs: Prec. 0.542  Recall 0.793  F 0.644   NO pairs: Prec. 0.614  Recall 0.33   F 0.429
Best Semantic set         0.6093        0.5962   0.5050   0.6900   0.5300   0.6600
    YES pairs: Prec. 0.573  Recall 0.758  F 0.652   NO pairs: Prec. 0.642  Recall 0.435  F 0.519
Best set (ALL)            0.6512        0.5975   0.5100   0.6600   0.5450   0.6750
    YES pairs: Prec. 0.576  Recall 0.74   F 0.648   NO pairs: Prec. 0.636  Recall 0.455  F 0.531

RTE-3
BASE_yes                  0.5150        0.5125   0.5250   0.4350   0.5300   0.5600
Best Lexical set          0.7112        0.6700   0.5100   0.7450   0.8600   0.5650
    YES pairs: Prec. 0.638  Recall 0.822  F 0.719   NO pairs: Prec. 0.732  Recall 0.51   F 0.601
Syntactic feature         0.6400        0.5938   0.5050   0.6500   0.6450   0.5750
    YES pairs: Prec. 0.583  Recall 0.724  F 0.646   NO pairs: Prec. 0.612  Recall 0.456  F 0.523
Best Semantic set         0.6884        0.6450   0.5300   0.7050   0.7650   0.5800
    YES pairs: Prec. 0.614  Recall 0.829  F 0.705   NO pairs: Prec. 0.715  Recall 0.451  F 0.553
Best set (ALL)            0.7173        0.6775   0.4950   0.7450   0.8700   0.6000
    YES pairs: Prec. 0.646  Recall 0.82   F 0.723   NO pairs: Prec. 0.736  Recall 0.528  F 0.615

Table 4.6: RTE-2 and RTE-3 results.
RTE-4                 Dev. corpus    Test corpus
                                     Overall   IE       IR       QA       SUM
BASE_yes                   -         0.5000   0.5000   0.5000   0.5000   0.5000
Best Lexical set        RTE-2        0.5980   0.5300   0.6967   0.5100   0.6400
    YES pairs: Prec. 0.581  Recall 0.702  F 0.636   NO pairs: Prec. 0.624  Recall 0.494  F 0.551
                        RTE-3        0.5800   0.4967   0.6667   0.4800   0.6750
    YES pairs: Prec. 0.556  Recall 0.796  F 0.655   NO pairs: Prec. 0.641  Recall 0.364  F 0.464
                        RTE-2&3      0.5880   0.5100   0.6867   0.4850   0.6600
    YES pairs: Prec. 0.565  Recall 0.762  F 0.649   NO pairs: Prec. 0.635  Recall 0.414  F 0.501
Syntactic feature       RTE-2        0.5520   0.5067   0.6000   0.5150   0.5850
    YES pairs: Prec. 0.538  Recall 0.73   F 0.62    NO pairs: Prec. 0.581  Recall 0.374  F 0.455
                        RTE-3        0.5520   0.5067   0.6000   0.5150   0.5850
    YES pairs: Prec. 0.539  Recall 0.726  F 0.618   NO pairs: Prec. 0.58   Recall 0.378  F 0.458
                        RTE-2&3      0.5520   0.5067   0.6000   0.5150   0.5850
    YES pairs: Prec. 0.539  Recall 0.726  F 0.618   NO pairs: Prec. 0.58   Recall 0.378  F 0.458
Best Semantic set       RTE-2        0.6170   0.5367   0.7100   0.5550   0.6600
    YES pairs: Prec. 0.594  Recall 0.74   F 0.659   NO pairs: Prec. 0.655  Recall 0.494  F 0.563
                        RTE-3        0.5920   0.5167   0.6833   0.5350   0.6250
    YES pairs: Prec. 0.566  Recall 0.792  F 0.66    NO pairs: Prec. 0.653  Recall 0.392  F 0.49
                        RTE-2&3      0.5970   0.5133   0.7100   0.5300   0.6200
    YES pairs: Prec. 0.573  Recall 0.762  F 0.654   NO pairs: Prec. 0.645  Recall 0.432  F 0.517
Best set (ALL)          RTE-2        0.6240   0.5433   0.7267   0.5450   0.6700
    YES pairs: Prec. 0.61   Recall 0.688  F 0.647   NO pairs: Prec. 0.642  Recall 0.56   F 0.598
                        RTE-3        0.6080   0.5467   0.6967   0.4900   0.6850
    YES pairs: Prec. 0.579  Recall 0.794  F 0.669   NO pairs: Prec. 0.672  Recall 0.422  F 0.518
                        RTE-2&3      0.6090   0.5200   0.7100   0.5300   0.6700
    YES pairs: Prec. 0.581  Recall 0.778  F 0.666   NO pairs: Prec. 0.665  Recall 0.44   F 0.529

Table 4.7: RTE-4 results.
Regarding numbers, the highest task-dependent values are obtained as follows:

For RTE-2: the lexical perspective obtained the best results for the SUM pairs, achieving an accuracy rate of 69.5%.

For RTE-3: the combination of all features achieved 87% over the QA pairs subset.

For RTE-4: the highest accuracy rate (72.67%) was reached by the combination of all perspectives using the RTE-2 development corpus and computing the IR pairs.

Thoroughly examining the results, we observed that for every RTE corpus the perspective achieving the lowest results was the Syntactic one. We expected this, since it hardly makes use of semantic knowledge (solely to smooth the strictness of the tree embedding module) and finding matchings between syntactic trees is somewhat difficult (more so than finding them lexically between bags of words). A peculiar case appears when processing the syntactic perspective for RTE-4: in this specific case, the syntactic inferences obtained the same accuracy rates independently of the development corpus used. This is because the syntactic inferences applied to RTE-4 are quite discriminative, in the sense that there are no pairs on the borderline between true and false entailment (obviously, these statements depend on the RTE development corpora used). The high results reached by the lexical perspective denote that both strict and lexically derived word overlap are very important for detecting entailment relations; we honestly believe this is because humans tend to express the same ideas in similar manners. Finally, we would like to point out the increase in accuracy when all perspectives are computed together. Although the resulting gain is quite low, it demonstrates that the perspectives are complementary, and it also opens further research lines on how to merge them.

As briefly mentioned previously, we noticed that different perspectives perform better depending on the task from which the pairs were extracted. Therefore, we decided to carry out another experiment including the task as a new feature for our machine learning algorithm. For this experiment we used the feature sets which had obtained the best results to date (i.e. best-set-all-RTE2dev for RTE-2, best-set-all-RTE3dev for RTE-3 and best-set-all-RTE2dev for RTE-4). Table 4.8 shows the results achieved when the task is added to both the training and testing phases of the system. As anticipated, the addition of the target task each pair belongs to increased the final system performance, except for the RTE-4 challenge.
                                        Test corpus
                                        Overall   IE       IR       QA       SUM
RTE-2  Best set (ALL) + task feature    0.6100   0.5100   0.6900   0.5400   0.7000
    YES pairs: Prec. 0.584  Recall 0.763  F 0.662   NO pairs: Prec. 0.658  Recall 0.458  F 0.54
RTE-3  Best set (ALL) + task feature    0.6887   0.5300   0.7600   0.8600   0.6050
    YES pairs: Prec. 0.655  Recall 0.829  F 0.732   NO pairs: Prec. 0.751  Recall 0.541  F 0.629
RTE-4  Best set (ALL) + task feature    0.6180   0.5367   0.7133   0.5450   0.6700
    YES pairs: Prec. 0.603  Recall 0.692  F 0.644   NO pairs: Prec. 0.638  Recall 0.544  F 0.587

Table 4.8: RTE results considering the task as another feature.

Although the improvement is slight, it proves that some features or perspectives are more appropriate when they represent pairs derived from a specific task (IE, IR, QA or SUM). For instance, looking at the results it seems that the lexical perspective is appropriate for the SUM pairs and the semantic one for the IR pairs.

To further justify the fact that addressing the problem from different perspectives can drive better entailment resolution, we carried out another experiment to demonstrate that our perspectives are complementary. Although this has already been shown by the increase in the results when all perspectives were considered, we would like to go a step further and assess it by implementing an oracle. The oracle provides an optimal combination of our three perspectives, considering each one as an independent system. It chooses, for each pair, the entailment decision from whichever of the three perspectives agrees with the correct entailment annotated in the gold-standard corpus. If all perspectives return a wrong entailment value, then the oracle takes the output of one of them, also producing a wrong entailment decision. Thus, the oracle represents the optimal upper bound that would be achieved by the most suitable combination of our perspectives (a sketch is shown below).
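A minimal Java sketch of the oracle computation follows; the perspective decisions and gold labels are assumed to be given as boolean values (true = entailment).

import java.util.List;

public class PerspectiveOracle {

    // decisions.get(i) holds the {lexical, syntactic, semantic} outputs for pair i.
    static double oracleAccuracy(List<boolean[]> decisions, List<Boolean> gold) {
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            for (boolean d : decisions.get(i)) {
                if (d == gold.get(i)) {   // some perspective got it right
                    correct++;
                    break;
                }
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        List<boolean[]> dec = List.of(new boolean[]{true, false, false},
                                      new boolean[]{false, false, false});
        List<Boolean> gold = List.of(true, true);
        System.out.println(oracleAccuracy(dec, gold));  // 0.5
    }
}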
Oracles have been used by other authors, such as (Agirre et al., 2009b), in order to be aware of systems' upper bounds. Table 4.9 shows the performance obtained when the oracle is applied using the best feature set of each perspective.

ORACLE        Test corpus
              Overall   IE       IR       QA       SUM
RTE-2 test    0.6963   0.5750   0.7500   0.6400   0.8200
RTE-3 test    0.7438   0.6050   0.8350   0.8950   0.6400
RTE-4 test    0.7210   0.6833   0.7933   0.6650   0.7250

Table 4.9: Oracle results for the RTE corpora.

Apart from the oracle, we also checked whether the final results increase when using a simple voting strategy between the three perspectives. Unfortunately, with a simple voting strategy the results were similar to those obtained by the combination of all features by means of our machine learning classifier (i.e. an overall accuracy of 59.2% for RTE-2, 67.1% for RTE-3 and 59.9% for RTE-4). We expected this slight decrease in the results, since the machine learning algorithm takes advantage of the knowledge provided by each perspective in a suitable (statistical) way.

Furthermore, as previously commented on in chapter 3 (sections 3.5.3 and 3.5.4), another option for making use of the knowledge supplied by the correspondences between the hypothesis and text entities and verbs consists of setting two prior constraints that every entailment candidate pair must fulfil:

1. NEs Constraint: based on the detection, presence and absence of NEs, in such a way that only pairs whose hypothesis entities also appear in the text are considered entailment candidates. The entity correspondences as well as the NE recognizer used were explained in section 3.5.4.

2. Verbs Constraint: to satisfy this constraint, every verb in the hypothesis (auxiliary verbs are not considered) must be related to one or more verbs in the text. The situations in which two verbs are related, and the semantic resources used to discover verb relations, were detailed in section 3.5.3.

A sketch of this pre-filtering step is shown below.
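As an illustration, the following minimal Java sketch applies both constraints as a pre-filter, reusing the matching predicates sketched in chapter 3 (the EntityMatcher and VerbCorrespondence classes of the earlier sketches); the exact matching logic of the real system may differ.

import java.util.List;
import java.util.function.BiPredicate;

public class EntailmentPreFilter {

    // True when every hypothesis item is matched by at least one text item.
    static <T> boolean allCovered(List<T> hItems, List<T> tItems, BiPredicate<T, T> match) {
        for (T h : hItems) {
            boolean covered = false;
            for (T t : tItems) {
                if (match.test(h, t)) { covered = true; break; }
            }
            if (!covered) return false;  // one uncovered hypothesis item fails the constraint
        }
        return true;
    }

    // A pair is an entailment candidate only if it satisfies both constraints;
    // otherwise it is tagged directly as a false entailment.
    static boolean isCandidate(List<String> hEntities, List<String> tEntities,
                               List<String> hVerbs, List<String> tVerbs) {
        return allCovered(hEntities, tEntities, EntityMatcher::partialMatch)
            && allCovered(hVerbs, tVerbs, VerbCorrespondence::related);
    }
}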
Establishing these two constraints prior to processing all possible entailment pairs, the system obtained the results shown in Table 4.10. The system configuration used for this experiment was the one considering the task as a feature for the machine learning algorithm (see Table 4.8).

                              Test corpus
                              Overall   IE       IR       QA       SUM
RTE-2
  W/ Entities Constraint      0.6162   0.5100   0.7100   0.5450   0.7000
  W/ Verbs Constraint         0.5900   0.4600   0.6600   0.5600   0.6800
  W/ Both Constraints         0.5988   0.4700   0.6800   0.5650   0.6800
RTE-3
  W/ Entities Constraint      0.6913   0.5350   0.7400   0.8850   0.6050
  W/ Verbs Constraint         0.6438   0.4700   0.7100   0.8250   0.5700
  W/ Both Constraints         0.6450   0.4750   0.6900   0.8400   0.5750
RTE-4
  W/ Entities Constraint      0.6200   0.5467   0.7267   0.5400   0.6500
  W/ Verbs Constraint         0.6170   0.5567   0.7233   0.5400   0.6250
  W/ Both Constraints         0.6130   0.5600   0.7200   0.5350   0.6100

Table 4.10: RTE results applying the verbs and entities constraints.

In general, the results shown in Table 4.10 are slightly lower than those that do not consider these two strict constraints. However, in some cases, for instance the experiments processing the entities constraint alone, an improvement, although not very high, is reached. More important is the reduction of the corpora processed by the system and, consequently, of the system's processing time. Figure 4.1 shows the percentages of corpora processed after the application of each constraint over each RTE corpus. The constraint regarding entities did not reduce the processed corpora much, since the appearance of the same entities in the hypothesis and the text is very common even when no entailment relation holds between them. However, because the verb constraint is more restrictive and difficult to fulfil, in its case the corpora were dramatically reduced. As a result, with the computation of both constraints the processed corpora were cut by 36% on average over all RTE corpora (19% applying only the entities constraint and 21% applying only the verbs one).[3]

[3] Note that these percentages differ from those presented in our paper (Balahur et al., 2008) within the RTE-4 challenge, since some improvements to the constraints were carried out and several deficiencies were solved.
Figure 4.1: The RTE test corpora statistics applying the constraints: (a) the RTE-2 test corpus processing percentages applying the entities and verbs constraints; (b) the RTE-3 test corpus processing percentages; (c) the RTE-4 test corpus processing percentages.
Furthermore, apart from reducing the corpus processed and the processing time, it is also important to measure the capability of our constraints in tagging NO-entailment pairs. Table 4.11 shows these values.

                       RTE-2     RTE-3     RTE-4
Entities Constraint
  Prec.                0.6615    0.7767    0.7248
  Recall               0.2150    0.4175    0.2160
  F                    0.3245    0.5431    0.3328
Verbs Constraint
  Prec.                0.5172    0.4912    0.5959
  Recall               0.2250    0.2100    0.2360
  F                    0.3136    0.2942    0.3381
Both Constraints
  Prec.                0.5730    0.6291    0.6348
  Recall               0.3925    0.5300    0.4000
  F                    0.4658    0.5753    0.4908

Table 4.11: The precision, recall and F-measure values achieved by the entity and verb constraints when tagging NO-entailment pairs.

From Table 4.11, we can state that very good precision values are achieved by both constraints and for all RTE corpora; they are especially high for the entity constraint (from 66% to 77% in precision). The recall values are not so high, because many NO-pairs still satisfy the constraints; however, in order to assess the correct applicability of our constraints in the task, the precision factor is the most significant one.
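The scores in Table 4.11 can be computed by treating each constraint as a NO-entailment tagger; a minimal sketch, assuming each pair is reduced to two booleans, could look as follows:

```python
from typing import Iterable, Tuple

def no_tagging_scores(pairs: Iterable[Tuple[bool, bool]]):
    # Each pair is (rejected_by_constraint, annotated_NO_in_corpus).
    pairs = list(pairs)
    flagged = [gold_no for rejected, gold_no in pairs if rejected]
    gold = sum(gold_no for _, gold_no in pairs)   # all truly NO pairs
    tp = sum(flagged)                             # rejected and truly NO
    prec = tp / len(flagged) if flagged else 0.0
    rec = tp / gold if gold else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```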
4.4 Comparative Evaluation

In order to assess the improvements of our system with regard to the official participants in the last RTE challenges, as well as to point out the results that our system would have reached had it participated, Tables 4.12, 4.13 and 4.14 show the different configurations of our system (exposed throughout this chapter) together with the rest of the systems that took part in these challenges. Within the tables, the approaches regarding the constraints (i.e. entities constraint, verbs constraint and both constraints), as previously mentioned, consist of applying the corresponding constraint(s) to the configuration of the system which considers all features plus the task of the target pair.

1st Author - Team - Approach                          accuracy
Hickl (LCC)                                           0.7538
Tatu (LCC)                                            0.7375
Zanzotto (Rome & Milan)                               0.6388
Adams (Dallas)                                        0.6262
Entities constraint                                   0.6162
Bos (Rome & Leeds)                                    0.6162
All perspectives + task                               0.6100
de Marneffe (Stanford)                                0.6050
Kouylekov (Irst (FBK) & Trento)                       0.6050
Marsi (Tilburg & Twente)                              0.6050
Vanderwende (Microsoft & Stanford)                    0.6025
Both constraints                                      0.5988
All perspectives                                      0.5975
Herrera (UNED)                                        0.5975
Semantic perspective                                  0.5962
Nielsen (Colorado)                                    0.5962
Verbs constraint                                      0.5900
Burchardt (Saarland / SALSA)                          0.5900
Katrenko (Amsterdam)                                  0.5900
Rus (Memphis)                                         0.5900
Lexical perspective                                   0.5875
Inkpen (Ottawa)                                       0.5825
Litkowski (CL Research)                               0.5813
Syntactic perspective                                 0.5613
Ferrández (our RTE-2 official participating system)   0.5563
Schilder (Thomson & Minnesota)                        0.5550
Kozareva (Alicante)                                   0.5500
Clarke (Sussex)                                       0.5475
Delmonte (Venice)                                     0.5475
Newman (Dublin)                                       0.5437
Nicholson (Melbourne)                                 0.5288

Table 4.12: Comparative results for the RTE-2 2006 challenge.
1st Author - Team - Approach                          accuracy
Hickl (LCC)                                           0.8000
Tatu (LCC)                                            0.7225
Entities constraint                                   0.6913
Iftene (UAIC)                                         0.6913
All perspectives + task                               0.6887
All perspectives                                      0.6775
Lexical perspective                                   0.6700
Adams (Dallas)                                        0.6700
Wang (DFKI)                                           0.6687
Zanzotto (Rome & Milan)                               0.6675
Blake (North Carolina)                                0.6585
Ferrández (our RTE-3 official participating system)   0.6563
Li (Atlanta)                                          0.6488
Both constraints                                      0.6450
Semantic perspective                                  0.6450
Verbs constraint                                      0.6438
Chambers (Stanford)                                   0.6362
Rodrigo (UNED)                                        0.6312
Burchardt (Saarland / SALSA)                          0.6262
Roth (CCG)                                            0.6262
Settembre (Buffalo)                                   0.6262
Malakasiotis (Athens)                                 0.6175
Ferrés (TALP)                                         0.6150
Litkowski (CL Research)                               0.6125
Bar-Haim (Bar-Ilan & Tel Aviv)                        0.6112
Montejo-Raez (UJA)                                    0.6038
Syntactic perspective                                 0.5938
Marsi (Tilburg & Twente)                              0.5913
Delmonte (Venice)                                     0.5875
Harmeling (Edinburgh)                                 0.5775
Burek (Open Univ.)                                    0.5500
Bobrow (Palo Alto)                                    0.5150
Clark (Seattle, Marina del Rey & Princeton)           0.5088
Baral                                                 0.4963

Table 4.13: Comparative results for the RTE-3 2007 challenge.
1st Author - Team - Approach                          2-way task acc.
Bensley (LCC)                                         0.746
Iftene (UAIC)                                         0.721
Wang (Saarland & DFKI)                                0.706
Siblini (Concordia)                                   0.688
Li (Tsinghua)                                         0.659
All perspectives                                      0.6240
Entities constraint                                   0.6200
Mohammad (Maryland)                                   0.619
All perspectives + task                               0.6180
Verbs constraint                                      0.6170
Semantic perspective                                  0.6170
Pado (Stanford)                                       0.614
Both constraints                                      0.6130
Ferrández (our RTE-4 official participating system)   0.608
Nielsen (Colorado)                                    0.606
Lexical perspective                                   0.5980
Zanzotto (Rome, Saarland & Trento)                    0.59
Agichtein (Emory)                                     0.588
Bar-Haim (Bar-Ilan & Tel Aviv)                        0.584
Shen (Edinburgh)                                      0.582
Galanis (Athens)                                      0.578
Castillo (Cordoba)                                    0.571
wlvuk                                                 0.571
Cabrio (FBK-Irst)                                     0.57
Ageno (TALP)                                          0.563
Syntactic perspective                                 0.5520
Rodrigo (UNED)                                        0.549
Clark (Seattle)                                       0.547
Krestel (Hannover & Concordia)                        0.54
Varma (IIIT Hyderabad)                                0.531
Glinos (SAIC)                                         0.526
Montalvo-Huhn (Fitchburg)                             0.526
Yatbaz (Koc)                                          0.519
Bergmair (Cambridge)                                  0.516

Table 4.14: Comparative results for the RTE-4 2008 challenge.
These tables show that all of our system configurations outperform the baseline that tags all pairs as positive entailments (these baselines are circa 50% in accuracy). Moreover, the system achieved very good places in the rankings of participating systems: fifth, third and sixth positions for RTE-2, RTE-3 and RTE-4 respectively (the fifth and third positions for RTE-2 and RTE-3 are shared with two other participating systems).

4.5 Additional Experiments

In order to enlarge the evaluation of our system, as well as to evaluate it on other corpora that, although related to the traditional RTE task, differ from it in some aspects, we propose two different experiments, described in the following subsections.

4.5.1 The 3-way RTE Classification Problem

Previously, in chapter 2, we briefly introduced the new direction presented in the fourth RTE challenge. It consists of considering the task as a three-class decision problem, which means that systems have to distinguish the case where the entailment relation is actually contradicted by the two involved snippets (CONTRADICTION pair) from the case where there is simply not enough evidence to support it (UNKNOWN pair). Table 4.15 shows an example of a CONTRADICTION pair and an UNKNOWN pair. Therefore, the participating systems had to decide whether: (i) T entailed H; (ii) T contradicted H, in which case the pair is marked as CONTRADICTION; or (iii) the truth of H could not be determined on the basis of T, in which case the pair is marked as UNKNOWN. The dataset consisted of 1,000 T-H pairs, and its distribution according to the 3-way annotation, both in the individual settings (IE, IR, QA, SUM) and in the overall test set, was as follows:

50% ENTAILMENT
35% UNKNOWN
15% CONTRADICTION
T: Four people were killed and at least 20 injured when a tornado tore through an Iowa boy scout camp on Wednesday, where dozens of scouts were gathered for a summer retreat, state officials said.
H: Four boy scouts were killed by a tornado.
Task: QA    Judg.: UNKNOWN

T: United Kingdom flag carrier British Airways (BA) has entered into merger talks with Spanish airline Iberia Líneas Aéreas de España SA. BA is already Europe's third-largest airline.
H: The Spanish airline Iberia Líneas Aéreas de España SA is Europe's third-largest airline.
Task: SUM   Judg.: CONTRADICTION

Table 4.15: Examples of UNKNOWN and CONTRADICTION text-hypothesis pairs.

The RTE-4 organizers did not provide any development set, but the RTE-3 development set had been annotated with 3-way classification tags, since this task was also proposed in RTE-3 as a pilot task. Therefore, we used the RTE-3 3-way development set as a training corpus, which led us to use the best feature sets with regard to the RTE-3 corpus. Consequently, the results shown in Table 4.16 use the best feature set of each perspective, as well as all of them combined, according to the RTE-3 development corpus.

                        3-way Dev.   RTE-4 3-way test corpus
                                     Overall   IE       IR       QA       SUM
Lexical perspective     0.6737       0.5570    0.4767   0.6233   0.5150   0.6200
Syntactic perspective   0.6150       0.5300    0.4700   0.6000   0.5150   0.5300
Semantic perspective    0.6608       0.5560    0.4833   0.6600   0.4800   0.5850
All perspectives        0.6834       0.5610    0.5067   0.6333   0.4750   0.6200

Table 4.16: RTE-4 3-way classification results.

The results point out that, although the system was not designed to deal with the 3-way classification of entailments, its behaviour when faced with this new task
was somewhat promising: had we participated in the three-way classification entailment problem within RTE-4, we would have achieved fourth position in the participant ranking (see appendix A for details about the other systems' accuracy). Therefore, with this evaluation we prove that: (1) our system features derived from the three perspectives are also appropriate for recognising UNKNOWN entailments; and (2) the combination of all perspectives yet again obtains better results than each perspective alone.

4.5.2 Dealing with Paraphrases

It is well known that paraphrase and entailment are very close concepts; indeed, a paraphrase can be considered as a bidirectional entailment relation. Hence, we found the idea of testing our system on paraphrase corpora very attractive. In chapter 2, section 2.2, when we talked about paraphrase, we presented the Microsoft Research Paraphrase Corpus (MSRPC) (Dolan et al., 2004), which is commonly used by researchers for evaluating and supporting paraphrase as well as entailment systems (Corley & Mihalcea, 2005; Ferrés & Rodríguez, 2007). For our experiments with paraphrases, we have used MSRPC as well. The downloadable distribution of MSRPC comes with the corpus split into two sets: the training set and the test set. The training set is made up of 4,076 pairs of paraphrases distributed in 2,753 positive and 1,323 negative examples, whereas the test set contains 1,725 pairs, of which 1,147 are positive and 578 are negative examples. We have used these sets for training and testing our system respectively.

For the sake of addressing paraphrase detection from a textual entailment point of view, we consider that paraphrases are bidirectional entailments. Therefore, a paraphrase pair (P1 ⇔ P2) from MSRPC represents two entailment relations (P1 → P2 and P2 → P1). In order to apply our system to this task, we opted for two different configurations and/or approaches: (i) adjust the system; or (ii) customize the corpus.

(i) Adjust the system: consists of obtaining similarities for both directions, P1 → P2 and P2 → P1. Afterwards, the scores are combined into a bidirectional similarity using a simple average function:
$$ sim(P_1 \leftrightarrow P_2) = \frac{sim(P_1 \rightarrow P_2) + sim(P_2 \rightarrow P_1)}{2} \qquad (4.1) $$

so, when the bidirectional similarity score is high, the pair is very likely to be a paraphrase.

(ii) Customize the corpus: we transformed the MSRPC pairs into textual entailment pairs as follows:

For positive paraphrase pairs: if P1 ⇔ P2, we derived from it two positive entailment pairs, P1 → P2 and P2 → P1. In the first pair P1 is the text (T) and P2 is the hypothesis (H), whilst for the second one P2 is T and P1 is H.

For negative paraphrase pairs: if, within MSRPC, the paraphrase pair is false or negative (i.e. P1 ⇎ P2), we cannot know whether the paraphrase fails because there is no entailment from P1 to P2 (i.e. P1 ↛ P2), from P2 to P1 (i.e. P2 ↛ P1), or both. However, for the sake of maintaining the original proportion of positive and negative pairs of MSRPC, and in order to perform this corpus transformation automatically, we assumed that a negated paraphrase evokes two negative entailment relations (i.e. P1 ⇎ P2 implies P1 ↛ P2 and P2 ↛ P1).

As a result, we obtained a corpus made up of textual entailment pairs derived from paraphrase relations. Note that this corpus has double the original amount of positive and negative pairs.

Regarding the evaluation: for the approach computing the average of unidirectional similarities it is rather straightforward, since this approach returns true/false paraphrases according to the training observations. However, for the approach that transforms the corpus, we had to implement an evaluation framework that takes into account the options shown in the next table.
P1 → P2    P2 → P1    P1 ⇔ P2
YES        YES        YES
YES        NO         NO
NO         YES        NO
NO         NO         NO

Table 4.17: When two entailment relations are a paraphrase.

Therefore, if at least one of the entailment relations involved in the paraphrase is false, the paraphrase is also false. The idea of splitting the paraphrases into two entailment relations and analysing them individually comes from the fact that, within a paraphrase, one statement can be strongly related to the other while the opposite relation is very weak; in that case the average may indicate a paraphrase even though it does not exist. Both decision procedures are sketched below.
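A minimal sketch of the two configurations, assuming the directional similarity scores and a learned decision threshold are supplied by the entailment system:

```python
def bidir_similarity(sim_12: float, sim_21: float) -> float:
    # Configuration (i), Eq. 4.1: average of the two directional scores.
    return (sim_12 + sim_21) / 2.0

def paraphrase(entails_12: bool, entails_21: bool) -> bool:
    # Configuration (ii), Table 4.17: a paraphrase holds only when both
    # directional entailment decisions are positive.
    return entails_12 and entails_21

def decide(sim_12: float, sim_21: float, threshold: float):
    # Returns the decision under approach (i) and under approach (ii);
    # `threshold` is a hypothetical learned cut-off.
    avg_decision = bidir_similarity(sim_12, sim_21) >= threshold
    strict_decision = paraphrase(sim_12 >= threshold, sim_21 >= threshold)
    return avg_decision, strict_decision
```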
Table 4.18 illustrates the results obtained by computing both approaches. In particular, two baselines were added to the table for assessing the accuracy of our system in the task of recognising paraphrases: BASE_NO consists of marking all paraphrase pairs as NO, whereas BASE_YES tags all pairs as YES. Note that the accuracy of these baselines is highly dependent on the corpus balance in terms of positive and negative examples.

MS Paraphrase Corpus                         Development corpus         Test corpus
                                             (10-fold cross validation)
BASE_NO                                      -                          0.3351
BASE_YES                                     -                          0.6649
Bidirectional similarity average approach
  1st set of features                        0.7218                     0.7119
  2nd set of features                        0.7407                     0.7310
Corpus transformation approach
  1st set of features                        0.7229                     0.7165
  2nd set of features                        0.7489                     0.7362

Table 4.18: MSRPC results.

The results, although similar, report a slight improvement when processing the corpus transformation approach. It is our understanding that there are two reasons for this: (1) as mentioned before, the corpus transformation approach resolves the problem of averaging two entailment similarities, one very strong and the other very weak; and (2) the assumption that splits a false paraphrase into two false entailments was in most cases appropriate. Furthermore, with this evaluation we also demonstrate the suitable generation of the two sets of features (see section 4.2), which were created to manage other sorts of snippets than the RTE ones.

4.6 Summary

This chapter presented the evaluation of our textual entailment system. It started by describing the evaluation framework used, i.e. the one provided by the RTE Challenge series, which to the best of our knowledge is the most reliable one nowadays. Later on, we detailed the selection of the best system features from all those exposed in the previous chapter, and finally we delved into the experiments carried out, together with a discussion of them.

Among this mess of results and percentages, we conclude that the combination of the knowledge provided by our three perspectives (lexical, syntactic and semantic) is a correct way of solving entailment relations, since it takes into account more entailment candidates than each perspective alone. Indeed, the improvement in accuracy achieved by the combination is statistically significant with regard to the performance of every individual perspective; to measure this significance we used the Paired T-tester of the Experiment Environment of Weka (Witten & Frank, 2005) with a significance level of 0.05. Besides, the use of entailment (semantic) constraints leads the system towards more precise entailment recognition (although with a slight decrease in recall) and, more importantly, towards faster responses. Even so, we believe that there is still a lot of research to be done on discovering ways to merge the different knowledge sources; in our opinion this is necessary partly because of the limited (but utterly understandable) coverage of linguistic resources.

To end the chapter, and in order to enrich our evaluation, some additional experiments were presented. We evaluated the system over the new three-way entailment classification task as well as on detecting paraphrases. In these environments the system's behaviour was more than acceptable, obtaining successful results.
5 Applicability in other NLP Areas

This chapter summarizes the applications of the textual entailment system presented in this thesis to other NLP areas. As already mentioned in the introductory chapter, textual entailment has strong applications to QA, SUM, IR and many other NLP tasks. The following sections show in detail how the system was adapted for some of these NLP tasks, together with a substantial evaluation for each application.

5.1 Textual Entailment in Question Answering

As briefly mentioned in the introductory section, QA arose with the aim of retrieving the information required by users' natural language queries. The purpose of a QA system is to find the correct answers to arbitrary user questions in both non-structured and structured collections of digital data. Thus, the need to automatically extract knowledge from data has become acute with the dramatic growth of digital information.
Moreover, due to the QA background of our research group and some QA research projects our group is involved in, we found very appealing the idea of achieving improvements in QA tasks by using textual entailment techniques. The next subsections detail the two research lines followed in order to apply our textual entailment system to QA environments. Firstly, we will describe our participation in the Answer Validation Exercise competition series, explaining how the system was used for this purpose; then, we will present the entailment-based QA system developed within the framework of the QALL-ME project. With the goal of adapting the system for each specific task, several adjustments were made with regard to the pure entailment approach; however, the system's core was the same, as was the methodology followed.

5.1.1 The Answer Validation Exercise Competition

This competition has already been described in section 2.4; however, as a brief reminder for the reader, the following lines sum up the main aim of the Answer Validation Exercise (AVE). AVE is a track within the Cross-Language Evaluation Forum (CLEF) (Peters, 2008; http://www.clef-campaign.org). In its three editions, the organizers have provided an evaluation framework to appropriately assess those answers belonging to QA system runs that are supported by the question and the passage from which they were supposedly extracted. This kind of inference will help QA systems to increase their performance, and will also help humans in the assessment of QA system outputs. Therefore, systems must emulate the human assessment of QA responses and decide whether an answer to a question is correct or not according to a given text. This shows that the AVE task is very close to the recognition of textual entailment relations, since it can be considered a type of such relations. Besides, this problem has traditionally been tackled by textual entailment recognition techniques, as shown in the different AVE overviews (Peñas et al., 2006; Peñas et al., 2007; Rodrigo et al., 2008a).
Our participation in AVE

We have participated in the three AVE editions (Ferrández et al., 2006; Ferrández et al., 2007; Ferrández et al., 2008). For each participation, improvements as well as different experiments were made, in the same way we did for the RTE challenges (see the Evaluation section). In appendix B, we show the official ranking for each AVE edition, emphasizing the positions that we obtained. The results obtained by our system have always been similar to those of the majority of the participants. Regarding English, our system evolved from purely lexical and WordNet-based inferences towards dealing with entities and verb relations. The system achieved second place within the last AVE ranking, reaching its best position. With regard to Spanish, we only participated in the last edition. Inferences similar to the ones used for English were successfully processed (e.g. inferences about lexical transformations and derivations, NEs, etc.), reaching first place in the AVE official ranking.

5.1.2 The QALL-ME Entailment-based Question Answering System

QALL-ME, Question Answering Learning technologies in a multilingual and Multimodal Environment (http://qallme.itc.it/), is a European Union project (6th Framework Research Programme of the European Union, contract number FP6-IST-033860) which involves several academic partners and companies: FBK-irst (Italy), DFKI (Germany), the University of Wolverhampton (UK) and the University of Alicante (Spain) as academic institutions, and Comdata (Italy), Ubiest (Italy) and Waycom (Italy) as the companies involved in the project. The QALL-ME project is focused on answering questions such as "Where can I eat paella this evening?", which have become a concrete business opportunity, with a large array of services ranging from traditional customer care to more and more articulated web-based assistance services being offered. Nowadays, voice portals (i.e. services providing speech-enabled access to web-based information) provide users with a broad variety of information (timetables, traffic circulation, weather forecasts, cultural events, etc.), and are experiencing an unsurpassed increase in popularity.
Open-domain QA is the core technology behind the final application. QA takes a question in natural language and returns an answer from a collection of information sources (e.g. documents, databases). Questions are formulated as free natural language input, as opposed to a keyword query, and are not limited to fixed templates, as in Information Extraction. As a technology, QA is now mature enough to move from addressing isolated, factoid questions to more natural and knowledge-intensive interactions. As for the applicative perspective, QA is these days recognized as one of the killer applications for the Semantic Web, as both language technologies and knowledge- and reasoning-intensive processing are greatly desired.

The general objective of the QALL-ME project is to establish a shared infrastructure for multilingual and multimodal open-domain QA for mobile phones. The scientific and technological objectives pursue three crucial directions: multilingual open-domain QA, user-driven and context-aware QA, and machine learning technologies for QA. The specific research objectives of the project include state-of-the-art advances in the complexity of the questions handled by the system (e.g. "how" questions); the development of a web-based architecture for cross-language QA (i.e. question in one language, answer in a different language); the realization of real-time QA systems for concrete applications; the integration of the temporal and spatial context both for question interpretation and for answer extraction; the development of a robust framework for applying minimally supervised machine learning algorithms to QA tasks; and the integration of mature technologies for automatic speech recognition within the open-domain question answering framework.

After two years, the QALL-ME prototype is working within the tourist domain (Sacaleanu et al., 2008), specifically on the Cinema and Accommodation domains; a demo is available at http://qallme.itc.it/server/demo/. An ontology in OWL (the Web Ontology Language, designed to be used by applications that need to process the content of information instead of just presenting information to humans; http://www.w3.org/tr/owl-features/) was created to represent the domain, together with a database in RDF (the Resource Description Framework, a set of specifications originally designed as a metadata model and considered a general method of modelling information; http://www.w3.org/rdf) containing the data instances. Regarding the Spanish side of the QALL-ME project, an on-line demo is available at http://sqm1.dlsi.ua.es/general/index.jsp showing the distinct web services implemented.
Figure 5.1 depicts an example of how the QALL-ME service satisfies users' needs for a typical question within the tourist domain.

Figure 5.1: The QALL-ME project: an example.

In this example, the QALL-ME service detects the language of the query and the location of the user, and with this information it is able to run different modules belonging to different partners, addressing the multilingual problem. Moreover, the multimodality of the project is shown when the user asks for an address and the service provides an interactive map. Behind this example is the QALL-ME infrastructure, organized as in Figure 5.2. Principally, there is a QA planner that organizes the communication between the different modules of each partner and is also in charge of accessing the common resources. Focusing QALL-ME on the target of this thesis, Figure 5.3 shows the inner architecture of each partner's side; specifically, we are going to focus on the workflow of the Spanish side.
Figure 5.2: The QALL-ME project: general infrastructure.

Figure 5.3: The QALL-ME project: the inner architecture.
A set of learned patterns was obtained from users' experiences in order to create a representative sample of the different manners of posing queries about the domain (Cabrio et al., 2008b). Once this process is completed, the entailment engine is responsible for inferring the meaning of a new input query with regard to the learned patterns, and finally returns the DB-query used to extract the answers. Therefore, the problem of finding the most appropriate pattern for a new query is considered an entailment detection problem.

Obviously, although the entailment system's core is the same and the lexical perspective (see section 3.3) has also been applied, in the case of the QALL-ME project we had to orientate the recognition of entailment relations towards the new QA paradigm that we are dealing with (i.e. entailment relations between queries and query patterns). Moreover, we should note that the approach is going to be released as a web application, so in the course of the system's construction we have endeavoured to use as few external resources as possible. Further details about how the Spanish side of the QALL-ME project works, regarding the construction of the pattern database, the entailment inferences used, experimental results and so on, can be found in our Information Processing and Management paper (Ferrández et al., 2009). However, we would like to point out some new inferences added specifically to increase the system's performance when it is applied to the QALL-ME framework.

Wh-terms: interrogative terms (e.g. when, what, where, etc.) play an important role within the queries' meanings. Although the lexical perspective takes into account every token, including Wh-terms, its measures do not reflect the importance of the presence or absence of these terms in the two queries involved in the entailment process. For instance, a query asking for a place (Wh-term: where) semantically differs from a query asking for a period of time (Wh-term: when), and we strongly believe that such a situation is useful for determining the entailment. Hence, we added to the system a feature that captures whether two queries have the same Wh-term (a sketch follows). It helps the system to rank the candidates in the entailment decision process, but it cannot determine the entailment by itself.
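A minimal sketch of this feature; the interrogative-term inventory shown is illustrative, not the system's actual list:

```python
from typing import Optional

WH_TERMS = {"qué", "quién", "cuándo", "dónde", "cómo", "cuánto", "cuál",
            "what", "who", "when", "where", "how", "which"}

def wh_term(query: str) -> Optional[str]:
    # Return the first interrogative term found in the query, if any.
    for token in query.lower().replace("¿", " ").replace("?", " ").split():
        if token in WH_TERMS:
            return token
    return None

def same_wh_term(query: str, pattern: str) -> bool:
    # Binary feature: both queries ask with the same Wh-term. It helps
    # rank candidate patterns but never decides the entailment alone.
    w1, w2 = wh_term(query), wh_term(pattern)
    return w1 is not None and w1 == w2
```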
Query concept constraint: during the creation of the set of learned patterns, all entity instances were tagged with their corresponding ontology concept. For this tagging process, we developed a NE annotator that applies fuzzy (non-overlapping) matching techniques between the queries and the ontology lexicon (the lexicon comprises the whole set of ontology instances along with their respective ontology classes or property tags). Having this information, a constraint to be fulfilled by every entailment pair is that both queries must embed the same entities in number and type. Figure 5.4 shows an annotated input query together with the candidate patterns (in light grey) and non-candidate patterns (in dark grey) that could produce an entailment inference according to this constraint.

Figure 5.4: The QALL-ME project: entailment candidates according to the query concept constraint.

This constraint discards those patterns that do not contain the same number and entity types detected in the input query.

Attribute-based inference: taking advantage of the knowledge provided by the user queries stored in the pattern database, we developed a semantic resource called the ontology attributes characterization. This characterization captures the different ways in which users ask for a specific ontology attribute or concept. For instance, considering queries asking for the telephone number of a specific cinema, we found three ways of requesting the telephone number attribute:

número de teléfono (telephone number)
número de contacto (contact number)
número telefónico (telephonic number)
We extracted and stored the different ways of mentioning each ontology attribute. The attribute-based inference consists of detecting the presence of ontology attributes in the query (normally these attributes are the information solicited) and positively weighting those patterns that contain attributes equivalent to the ones within the input query. Two attributes are equivalent if they are expressed in the same manner or if one is a paraphrase of the other among the paraphrases stored in the ontology attributes characterization. The final weight obtained by this inference is defined as follows:

$$ AbI_{weight} = \sum_{a_i \in Q,\; a_j \in P_Q} Eql(a_i, a_j) \qquad (5.1) $$

where Q and P_Q contain the attributes that appear in the input query and in each database pattern respectively, and Eql(a_i, a_j) takes the following value:

$$ Eql(a_i, a_j) = \begin{cases} 1 & \text{if } a_i = a_j \text{ or } a_i \text{ is a paraphrase of } a_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (5.2) $$

Therefore, if two or more attributes are found in the query, they are considered of equal importance. However, for patterns without the requested attribute (or any of its paraphrases) this weight will be zero. A sketch of this computation follows.
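A minimal sketch of Eqs. 5.1 and 5.2, where the dictionary standing in for the ontology attributes characterization is a hypothetical example:

```python
from typing import Dict, List, Set

def eql(a_i: str, a_j: str, paraphrases: Dict[str, Set[str]]) -> int:
    # Eq. 5.2: 1 if the mentions match literally or one paraphrases
    # the other, 0 otherwise.
    if a_i == a_j:
        return 1
    return int(a_j in paraphrases.get(a_i, set())
               or a_i in paraphrases.get(a_j, set()))

def abi_weight(query_attrs: List[str], pattern_attrs: List[str],
               paraphrases: Dict[str, Set[str]]) -> int:
    # Eq. 5.1: sum Eql over all attribute pairs from query and pattern;
    # patterns lacking the requested attribute score zero.
    return sum(eql(a_i, a_j, paraphrases)
               for a_i in query_attrs for a_j in pattern_attrs)

# The telephone-number characterization from the running example:
tel = {"número de teléfono": {"número de contacto", "número telefónico"}}
assert abi_weight(["número de teléfono"], ["número de contacto"], tel) == 1
assert abi_weight(["número de teléfono"], ["dirección"], tel) == 0
```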
The QALL-ME Entailment-based QA System Evaluation

Although detailed results and experiments are described in (Ferrández et al., 2009), we would like to show an evaluation aimed at measuring our system's behaviour when users who are unacquainted with the structure and content of the system pose queries. For this evaluation, 10 new users were requested to formulate 10 spontaneous and independently generated queries about the cinema domain. These new users were recruited from non-research environments (e.g. high school students and administrative assistants, amongst others), and they did not know anything about our system and research, nor about the system's ontology structure or the data instances it contains. No specific instructions were given to these new users; they were only requested to pose a query asking for any information about movies or cinemas in Spanish. This evaluation shows how well the ontology fulfils the users' needs in the cinema domain, as well as the semantic recall of the system when faced with totally unacquainted users. Table 5.1 shows the results obtained in this evaluation.

Best system configuration (all inferences)
Prec.: 80%     Rec.: 89%     F: 84.26%
Correct: 80    Incorrect: 9  Uncertain: 11

Table 5.1: The QALL-ME project: evaluation results.

From the whole set of 100 queries, 80 queries were well answered (i.e. the users considered the answers correct, since they provided the information solicited), satisfying the users' requirements with a precision rate of 80%. For the rest of the queries the system either returned wrong answers or was not able to give an answer (9 incorrect and 11 uncertain queries respectively). Regarding the errors, the uncertain ones were due to the fact that the system did not return a pattern because no pattern obtained a sufficiently confident score to be selected. For the incorrect ones, three different situations occurred: (1) the query asks for properties that, at the present stage of the system, we do not take into account (e.g. "¿Cuánto cuestan las palomitas en el cine Colci?", "How much does popcorn cost at the Colci cinema?"); (2) the query is a verification query (e.g. "¿Están dando Casino Royale en el cine Colci?", "Is Casino Royale being shown at the Colci cinema?") and the system is unable to deal with this kind of query; and (3) the entailment engine produces a wrong entailment inference.

In conclusion, with the application of our entailment system to the QALL-ME project we have successfully tackled the task of detecting entailment relations between queries and, more importantly, we have developed a methodology based on entailment engines to tackle QA phenomena.
5.2 Textual Entailment in Automatic Text Summarization

Text summarization has become a very popular NLP task in recent years. Given the vast amount of information available, especially since the growth of the Internet, automatic summarization has been developed and improved in order to assist with managing it. In this section, we demonstrate how textual entailment techniques influence the final performance of summarization systems. Furthermore, we show several experiments carried out using our textual entailment system and a summarization approach also developed in our research group.

5.2.1 Brief Text Summarization Background

The text summarization task consists of obtaining a summary that encapsulates the main ideas and concepts of a single document (i.e. single-document summarization) or of a collection of documents (i.e. multi-document summarization). A summary can be defined as a text that is produced from one or more texts, which contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s) (Hovy, 2005).

Summarization systems can be characterized according to many features. Following (Jones, 1999), there are three classes of context factors that influence summaries: input, purpose and output factors. This allows summaries to be characterized by a wide range of properties. For instance, summarization has traditionally focused on text, but the input to the summarization process can also be multimedia information, such as images, video or audio, as well as on-line information or hypertexts. Regarding the output, a summary may be an extract, i.e. a selection of significant sentences from a document, or an abstract, i.e. a text that can serve as a substitute for the original document, or even a headline or title. It is also possible to distinguish between generic summaries and user-focused summaries. The former can serve as surrogates of the original text, as they may try to represent all relevant features of a source text, whereas the latter rely on a specification of a user's information need. Also concerning the style of the output, a broad distinction is normally made between two types of summaries: indicative and informative. Indicative summaries are used to
indicate which topics are addressed in the source text; as a result, they can give a brief idea of what the original text is about. Informative summaries, on the other hand, are intended to cover the topics in the source text (Mani & Maybury, 1999; Alemany et al., 2003).

Regarding the state-of-the-art use of textual entailment techniques to assist with text summarization problems, there have been some attempts to study the influence of textual entailment on summarization. They have focused on the evaluation of summaries (Harabagiu et al., 2007), determining which candidate summary from a selection best represents the content of the original document depending on whether the summary entails it or not. However, very little effort has been made to consider both fields together to produce extracts. Approaches combining summarization and textual entailment can only be found in (Tatar et al., 2008), where a summary is generated either directly from the entailment relations that appear in a text, or by extracting the highest-scored sentences of a document, the score of each sentence being the number of sentences of the text that it entails.

5.2.2 The Approach

Our approach was presented in (Lloret et al., 2008b; Lloret et al., 2008a); in contrast to previous work, our idea is to integrate our textual entailment system into a summarization system as a preprocessing tool that extracts the set of the most meaningful sentences, allowing the final summary construction to be more accurate. Therefore, a preliminary summary is generated from the sentences of the text that do not hold an entailment relation. Let's assume that a document consists of a list of sentences:

S1 S2 S3 S4 S5 S6

and we run the entailment component as follows:

SUM = {S1}
SUM entails S2? NO
SUM = {S1, S2}
SUM entails S3? NO
SUM = {S1, S2, S3}
SUM entails S4? YES
SUM = {S1, S2, S3}
SUM entails S5? YES
SUM = {S1, S2, S3}
SUM entails S6? NO
SUM = {S1, S2, S3, S6}

The summary obtained by these entailment inferences comprises the sentences that are not entailed by the accumulated summary of the previous non-entailed sentences (i.e. S1, S2, S3 and S6 in the above example). This preliminary summary is the input for a text summarization technique; in our case, a word-frequency approach was developed. This technique assumes that the more times a word appears in a document, the more relevant the sentences containing that word become; therefore, the highest-scored sentences are extracted to produce the final summary (more details of this text summarization approach can be found in (Lloret et al., 2008b; Lloret et al., 2008a)). The approach was used to solve single- and multi-document summarization tasks. In order to deal with multiple documents, we opted to join all documents belonging to the same cluster from which the summary has to be extracted into one single document; after that, we ran the system as we did for the single-document summarization task. A sketch of the whole procedure follows.
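A minimal sketch of the preliminary-summary construction and the word-frequency scoring, where `entails` is a hypothetical stand-in for the full entailment system:

```python
from collections import Counter
from typing import Callable, List

def preliminary_summary(sentences: List[str],
                        entails: Callable[[str, str], bool]) -> List[str]:
    # Keep a sentence only if it is NOT entailed by the summary
    # accumulated so far (the S1..S6 walk-through above).
    summary: List[str] = []
    for s in sentences:
        if not summary or not entails(" ".join(summary), s):
            summary.append(s)
    return summary

def extract(sentences: List[str], k: int) -> List[str]:
    # Word-frequency scoring: a sentence scores the summed corpus
    # frequency of its words; the k best are kept in document order.
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: sum(freq[w.lower()]
                                      for w in sentences[i].split()))
    return [sentences[i] for i in sorted(ranked[:k])]
```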
5.2.3 Evaluation: Experiments and Discussion

In order to assess both the entailment engine on summarization tasks and how the recognition of entailment relations can positively influence the overall performance of a summarization system, we propose two different evaluations: (i) on the one hand, we evaluate the summary directly obtained from the word-frequency approach; and (ii) on the other hand, we evaluate a final summary built from the highest-scored sentences belonging to the preliminary entailment summary, according to the word-frequency calculus.

As a test data set, we took the DUC 2002 test documents and their human-generated summaries for the single- and multi-document tasks (http://www-nlpir.nist.gov/projects/duc/data.html). That year was the last in which single-document summarization evaluation of informative summaries was performed. Within DUC 2002 there were two tasks: the first proposed to evaluate the participating systems in a single-document environment, and the second evaluated the systems' capabilities when several documents are used to make the summary. The former evaluated the participants by means of 100-word-length summaries, whereas the latter used several summary sizes (10, 50, 100 and 200 words) in its evaluation.

Taking into account the description of the DUC 2002 tasks and following the methodology proposed, four experiments were performed; all of them consisted of generating 100-word-length extracts. Two were developed for the single-document task: one considering only the text summarization approach based on word frequencies (experiment TSwf) and the other using the textual entailment tool in the preliminary preparation of the summary (experiment TSwf+TE). The remaining experiments followed the same philosophy but applied our approach to the multi-document task. Moreover, we implemented a baseline that built the summary from the first 100 words of the documents for both the single- and multi-document tasks; for multi-document, the summary was made up of the first 100 words of the most recent document within the cluster of documents.

Table 5.2 shows the results obtained for each experiment. All were evaluated using the ROUGE tool (Lin, 2004), version 1.5.5, run with the same parameters as in (Steinberger et al., 2007) (ROUGE-1.5.5.pl -n 2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -l 100 -d). We computed ROUGE-1 and ROUGE-2 values as well as ROUGE-L and ROUGE-W, and we obtained recall, precision and F-measure on average for the systems' performance.

Single-doc   ROUGE-1            ROUGE-2            ROUGE-L
Baseline     41.132             21.075             37.535
TSwf         43.210             17.072             39.188
TSwf+TE      44.759 (+3.58%)    18.840 (+10.36%)   40.606 (+3.62%)

Multi-doc    ROUGE-1            ROUGE-2            ROUGE-W
Baseline     28.684             5.283              9.525
TSwf         29.620             5.200              9.266
TSwf+TE      31.333 (+5.78%)    5.780 (+11.15%)    9.588 (+3.48%)

Table 5.2: DUC 2002 results for the single- and multi-document tasks.
As can be seen from Table 5.2, our approaches obtain better results than the DUC 2002 baselines on every ROUGE measure, except for the ROUGE-2 value in single-document. Within the table, the improvement percentage obtained by applying textual entailment in the summarization tasks is shown in brackets; these improvements have been tested by the sign test with a significance level of 5%, showing a significant difference for each improvement. From these values, we can observe that an average improvement of 5.85% is reached for single-document and 6.80% for multi-document. Furthermore, we verified that the use of the entailment engine as a preprocessing tool in the composition of the summary discarded 71.57% of the total set of sentences within the DUC 2002 corpora from the final processing.

5.3 Textual Entailment Recognition for Linking and Disambiguating Wikipedia Categories to WordNet

In this context, our textual entailment system is applied within a methodology devoted to the automatic construction of a NE repository (Toral et al., 2008). This method exploits the knowledge available in existing language resources to support procedures of lexico-semantic acquisition from Web 2.0 collaborative semi-structured resources. Our test case for English focuses on WordNet (Fellbaum, 1998) as the language resource and Wikipedia as the Web 2.0 resource. The first step consists of establishing links between entries of both resources; the instantiable common nouns found in WordNet are mapped to Wikipedia categories. Obviously, these mappings are ambiguous for polysemous nouns. Another piece of research, YAGO (Suchanek et al., 2008), also addresses linking WordNet to Wikipedia; however, its authors do not deal with the ambiguity that arises from linking both resources (ambiguous mappings are simply disambiguated manually). The solution we proposed consists of applying semantic similarity between the language resource definitions (in our specific case the WordNet glosses) and the mapped Wikipedia abstracts. This research was presented in our paper (Toral et al., 2009).

Therefore, we consider our problem a real-world testbed in which different methods dealing with semantic similarity between contexts can be
applied. Although comparative results will be exposed with regard to other semantic similarity methods, for consistency with the context of this thesis special emphasis will be placed on the applicability of textual entailment to this task.

5.3.1 Adapting the Textual Entailment System

For the final application of the system to the target task of this work, we adapted it to manage bidirectional meaning relations. Linking WordNet glosses to Wikipedia categories is not a clear-cut entailment phenomenon: it can occur that the gloss is implied by the category, that the category is inferred from the gloss, or that the entailment holds in both directions. Therefore, to handle these situations we opted for computing the average of the two system outputs, one for each unidirectional relation (we already used this strategy for dealing with paraphrases, see section 4.5.2):

$$ BiSim(Gloss_i, Catg_j) = \frac{sim(Gloss_i \rightarrow Catg_j) + sim(Catg_j \rightarrow Gloss_i)}{2} \qquad (5.3) $$

5.3.2 Methods Used for Comparison

Two methods were used for comparison with our textual entailment system in the task of linking WordNet glosses to Wikipedia categories by semantic similarity.

Personalized PageRank over WordNet

Given a pair of texts and a graph-based representation of WordNet, this method basically implements two steps: (1) it computes the Personalized PageRank over WordNet separately for each of the texts, producing a probability distribution over WordNet synsets; and (2) it then compares how similar these two discrete probability distributions are by encoding them as vectors and computing the cosine between the vectors. The method represents WordNet as a graph G = (V, E) as follows:

Graph nodes represent WordNet concepts (synsets) and dictionary words.

Relations between synsets are represented by undirected edges.
Dictionary words are linked to the synsets associated with them by directed edges.

Regarding PageRank, the damping value is 0.85, the calculation finishes after 30 iterations, and all WordNet 3.0 relations are used (including the disambiguated glosses, http://wordnet.princeton.edu/glosstag). This similarity method has already been used for word similarity (Agirre et al., 2009a), reporting very good results on word similarity datasets; a sketch of the procedure follows.
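A minimal sketch of this comparison method (a hand-rolled personalized PageRank rather than the authors' implementation, ignoring details such as dangling nodes, and assuming `adj` contains every node as a key):

```python
import math
from typing import Dict, List

def personalized_pagerank(adj: Dict[str, List[str]], seed_words: List[str],
                          damping: float = 0.85, iters: int = 30) -> Dict[str, float]:
    # Restart distribution: uniform over the text's dictionary words.
    seeds = [w for w in seed_words if w in adj]
    if not seeds:
        return {}
    restart = {w: 1.0 / len(seeds) for w in seeds}
    rank = dict(restart)
    for _ in range(iters):                  # 30 iterations, as stated above
        nxt = {n: (1.0 - damping) * restart.get(n, 0.0) for n in adj}
        for n, r in rank.items():
            neighbours = adj.get(n, [])
            if neighbours:
                share = damping * r / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
        rank = nxt
    return rank

def cosine(p: Dict[str, float], q: Dict[str, float]) -> float:
    # Cosine between the two sparse distributions over synsets.
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```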
Semantic Vectors

Semantic Vectors (Widdows & Ferraro, 2008) (http://code.google.com/p/semanticvectors) is an open source software package that creates WORDSPACE models from plain text. Its aim is to provide an easy-to-use and efficient tool which can suit both research and production users. It relies on Apache Lucene (http://lucene.apache.org) for tokenization and indexing in order to create a term-document matrix. Once the reference corpus has been tokenized and indexed, Semantic Vectors creates a WORDSPACE model from the resulting matrix by applying random projection. For the current task, a corpus made up of WordNet glosses and Wikipedia abstracts was gathered. It contains the glosses of all the synsets present in WordNet 2.1 (i.e. 117,598 glosses), and the abstracts of all the entries present in a Wikipedia dump obtained in January 2008 (i.e. 2,179,275 abstracts). The final corpus has 1,292,447 terms. In order to compute the similarity, the Semantic Vectors class CompareTerms was used; it calculates the similarity between two terms (which can be words or texts).

5.3.3 The Evaluation

The evaluation data consist of a set of polysemous nouns from WordNet 2.1 which are mapped to Wikipedia categories (207 mappings in total). Additional information is provided for both nouns and categories: for the former their glosses, and for the latter their abstracts. The disambiguation task should then identify, for each noun, which of its senses, if any, corresponds to the mapped category or categories. The corpus file follows this format:

<word id={id}>
<sense number={num}>{sense gloss}</sense>
[...]
<sense number={num}>{sense gloss}</sense>
<category id ={id}>{category abstract}</category>
[...]
<category id ={id}>{category abstract}</category>
</word>

while the key file is made up of lines in the format of the Senseval-3 scorer (http://www.senseval.org/senseval3/scoring), showing the correct associations between specific Wikipedia categories and WordNet synsets:

word category sense_number+

Furthermore, two baselines were also established:

First-sense: is based on sense predominance and always chooses the first sense of WordNet as being the one corresponding to the mapped Wikipedia category.

Word-overlap: calculates the similarity between two texts by counting the number of overlapping words. For this we used the software package Text::Similarity (http://text-similarity.sourceforge.net). Note that, after some experiments, better accuracy was achieved when stop-words were discarded from the word-overlap procedure; a sketch of this baseline follows.
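A minimal sketch of this baseline (with a toy stop-word list rather than the one actually used):

```python
STOP = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that"}

def content_words(text: str):
    # Tokenize crudely, lowercase, and discard stop-words.
    return {w.strip(".,;:()\"'").lower() for w in text.split()} - STOP

def word_overlap(gloss: str, abstract: str) -> int:
    # Similarity = number of content words shared by the two texts.
    return len(content_words(gloss) & content_words(abstract))

def best_sense(sense_glosses: dict, category_abstract: str) -> str:
    # Pick the WordNet sense whose gloss overlaps most with the
    # mapped Wikipedia category's abstract.
    return max(sense_glosses,
               key=lambda s: word_overlap(sense_glosses[s], category_abstract))
```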
Regarding the application of our textual entailment system, three different experiments were carried out, each one with a specific setting:

TE (trained AVE 07-08 + RTE-3): for this experiment the system was trained with the corpora provided in the AVE competitions (editions 2007 and 2008) and the RTE-3 Challenge. This configuration uses a BayesNet algorithm, and it shows the system's capability for solving the task when specific textual entailment corpora are used for training.

No training phase: in order to assess whether the training corpora are appropriate to the final decision with regard to the task tackled in this work, we also decided to run an experiment without a training phase. Therefore, the highest entailment coefficient returned by the system among all sense-category pairs for each word is tagged as the correct link. These coefficients are obtained by computing the set of entailment measures integrated into the system.

Supervised (10-fold cross-validation): a BayesNet algorithm was trained with the evaluation corpus. We evaluated this experiment by 10-fold cross-validation, using each textual entailment inference as a feature for the machine learning algorithm. This experiment shows the system's behaviour when it is trained with a corpus specific to our task.

Table 5.3 presents the accuracy rates obtained by the different systems together with the baselines.

Method                                 Accuracy
Baseline 1st-sense                     64.7%
Baseline Word overlap                  62.7%
Semantic Vectors                       54.1%
Semantic Vectors (supervised)          70.27%
Personalised PageRank                  64.3%
Personalised PageRank (supervised)     73.26%
TE (trained AVE 07-08 + RTE-3)         52.8%
TE (no training)                       64.7%
TE (supervised)                        77.74%

Table 5.3: System results for linking and disambiguating Wikipedia categories to WordNet.

The results of applying our textual entailment system point out that neither the AVE nor the RTE corpora are appropriate for this task (acc. 52.8%). This is due to the fact that the idiosyncrasies of each corpus are somewhat different, resulting in a poor training stage. Nevertheless, computing the entailment coefficient returned by the system without training (acc. 64.7%), a considerable improvement in accuracy is achieved, proving that our textual entailment inferences are suitable to support this research. Finally, as expected, the best textual entailment results are obtained when the dataset created for the evaluation is also processed as training data and evaluated by 10-fold cross-validation (acc. 77.74%).
With regard to the comparative results of the other methods and baselines used, our textual entailment system surpassed these other methods in both the supervised and unsupervised experiments. In fact, it was the only method whose unsupervised version reached the accuracy achieved by the first-sense baseline.
6 Conclusions and Future Work

This chapter exposes the conclusions derived from this thesis, as well as its main contributions and subsequent future investigations in this field.

6.1 Conclusions

Throughout this thesis the major topics in textual entailment have been exposed, exemplified and discussed. After describing them, the aim was to design, implement and evaluate a modular and flexible textual entailment system capable of tackling the entailment phenomenon from three different perspectives: lexical, syntactic and semantic. By building a modular system, the distinct linguistic levels can be properly combined; by implementing a flexible system, we can use each level separately and test the knowledge provided by different sources. Consequently, in the current state of the system there are different modules computing different inferences (e.g. lexical measures, WordNet-based measures,
verb relations, entity correspondences, etc.), and each module supplies features for a machine learning algorithm, which takes the entailment decision based on training corpora. This allowed us to test the significance of each module and its relevance to the entailment problem. Moreover, this modular architecture also allows other modules, using different resources and computing new inferences, to be easily added to the system.

Behind the scenes, the idea was that, by managing the entailment phenomenon from distinct perspectives, the system would be able to take advantage of recognising different entailment levels. For instance, in some cases the lexical perspective is unable to decide the entailment because the lexical derivations and overlaps are insufficient, but it is complemented by the other two perspectives, which support the decision by exhibiting syntactic and semantic relations between the texts. However, these ideal situations do not always happen. As stated in the PASCAL RTE Challenges, lexical overlap and, in general, lexical inferences carry much weight in the final entailment decision, but a semantic problem such as textual entailment cannot be resolved by relying on lexical deductions alone. The most difficult entailment recognitions come when the lexical inferences indicate a high likelihood of true entailment but the entailment relation does not hold. This is the point where the syntactic and semantic perspectives have to take a relevant place within the final decision. Let's look at the following example, extracted from the RTE-3 test corpus:

ID=551  Entailment=NO  Task=QA
<t>However, there are some who believe that the real disaster in the North Atlantic on that cold April 1998 morning was not that the Titanic sank with the loss of 1,500 lives.</t>
<h>The Titanic sank in 1912.</h>

In this example the lexical inferences obtained high values, since good matchings and lexical transformations were found for both subsequences and bag-of-words; indeed, the computation of this perspective alone tagged this pair as a true entailment. Fortunately, however, this wrong annotation was solved by the inferences corresponding to the syntactic and semantic perspectives. The syntactic perspective was unable to find a tree that properly matched the hypothesis tree; as seen in the example, the most similar tree was "the Titanic
sank with the loss of 1,500 lives" vs. "The Titanic sank in 1912". Besides, the semantic perspective also reported a false entailment, because many of its inferences returned low scores, perhaps the most representative being the absence of the date 1912 within T. As a result, by processing all perspectives the pair is finally annotated as a false entailment.

Furthermore, this thesis also provides an extensive evaluation framework for the RTE task, evaluating the system on the RTE task itself (i.e. intrinsic or direct evaluation) as well as in application scenarios such as text summarization and question answering (i.e. extrinsic or indirect evaluation). Regarding these evaluations, the results point out that the textual entailment phenomenon is well addressed by merging shallow features, such as the lexical ones, with more sophisticated features, such as those derived from WordNet, VerbNet, VerbOcean and FrameNet. However, the system's accuracy is highly dependent on the performance of each resource used, and although these resources are the most reliable ones nowadays, their limited coverage is still a drawback for automatic systems.

6.1.1 Main Contributions

With the ideas, reasoning and experiments exhibited in this thesis, we demonstrate that the combination of lexical, syntactic and semantic knowledge is the correct way to tackle entailment recognition. As our main contributions, we would like to highlight the following:

We have measured the impact of trivial lexical and syntactic inferences within the task of detecting entailment relations, concluding that these deductions play a crucial role in the final entailment decision. Moreover, we have proposed new configurations of these lexico-syntactic measures to perform better within the entailment recognition task (such as the Consecutive Subsequence Matching measure).

Dealing with complex analyses (i.e. the semantic perspective), we have evaluated the benefits of using linguistic resources in order to recognise entailments. For instance, WordNet, FrameNet, VerbNet and VerbOcean allowed: (i) the use of semantic inferences based on synonyms, antonyms, etc.; (ii) more abstract semantic deductions using frame analysis; and (iii) measuring the importance of finding correspondences between verbs and entities. As a result, we report that:
discarding from the processing corpus those pairs with mismatching verb and entity associations between T and H results in faster system responses with similar accuracy rates;

each inference is able to recognise distinct entailment levels, and combining them makes the recognition even more accurate.

Furthermore, we have implemented some new linguistic resources based on FrameNet: the Frame-to-Frame similarity metric and the FrameNet-WordNet alignment measure. Although in this thesis they have been used in order to discover entailments, they could also be useful in other NLP tasks and/or applications.

Other contributions are the software developments carried out in this thesis. Chapter 7 describes how they can be downloaded, used and tested on-line.

Regarding the applicability of our textual entailment system to other NLP tasks, it was successfully applied to SUM, QA and the task of linking Wikipedia categories to WordNet glosses (see Chapter 5):

In summarization tasks, the system supported the generation of a preliminary summary, reducing the amount of data processed and increasing the quality of the final summary. To the best of our knowledge, this is the first time that a textual entailment system has been used to build a preliminary summary for a Text Summarization approach.

In QA, two different approaches were addressed using our textual entailment system:

The validation of the answers returned by a QA system (Ferrández et al., 2008). This task was proposed as the Answer Validation Exercise (Rodrigo et al., 2008a) within the Cross-Language Evaluation Forum conferences (Peters, 2008).

The construction of a closed-domain QA system based on textual entailment inferences between a new query and a set of predefined patterns (Ferrández et al., 2009). This was carried out within the framework of the European project QALL-ME.
For the task of linking Wikipedia categories to WordNet glosses to enrich the automatic construction of a NE repository, we used the system to obtain enough semantic evidence to support the mapping of a specific WordNet synset to a Wikipedia category. This novel application once again proved the suitability of our system for computing semantic similarities.

For every application the system performed extremely well, achieving competitive results in the AVE editions (it ranked first among the participants in AVE 08 for Spanish), promisingly solving the QA paradigm presented in the QALL-ME project, as well as successfully supporting the semantic links between Wikipedia and WordNet. Moreover, with our participation in the Spanish track of AVE 08 and the QALL-ME project we had to adapt the system to textual entailment recognition in Spanish. Obviously, several inferences were not applied since they are language dependent; however, the core of the system remained the same, which proved its portability to other languages.

In a nutshell, the investigations presented throughout this thesis reveal the importance of combining linguistic features derived from distinct perspectives, analyses and/or resources. This results in the implementation of a textual entailment system capable of making use of a variety of features extracted from lexical, syntactic and semantic analyses.

6.2 Future Work

In the future, we plan to continue improving our perspective-based textual entailment system by integrating new inferences as well as investigating new ways to combine the entire set of them:

Regarding NE inferences, we plan to extend the partial and acronym correspondences by considering deeper analyses. For example, reasoning about date expansion, metonymy and location/demonym relations has not yet been integrated into the system. Subsequent work on this area will be characterized by the addition of this sort of reasoning.
Exploiting the knowledge encoded in Wikipedia. It is well known that Wikipedia is a huge and continually growing source of information, so if we were able to take advantage of this knowledge we could harvest a great deal of information to support the entailment decision. Of particular interest are the works presented in (Zesch & Gurevych, 2007; Zesch et al., 2008), in which the authors exploit Wikipedia and Wiktionary in order to build a representation of semantic relations. We would therefore also like to pursue this line, using Wikipedia to obtain semantically related inferences. Moreover, since Wikipedia is an enormous repository of NEs, it opens further research in this area, for instance finding terms related to a specific entity (or entity category) as well as relations between them.

Following this line, another future idea is to model a concept net based on ontologies (e.g. general ontologies such as SUMO, http://www.ontologyportal.org/, as well as specific domain ontologies). This would allow us to extract deeper semantic inferences between concepts.

Finally, with the goal of continuing to evaluate our system extrinsically, we plan to apply textual entailment techniques to the task of Plagiarism Detection. By and large, given a set of suspicious documents and a set of source documents, the Plagiarism Detection task consists of finding all text passages in the suspicious documents which have been plagiarized and the corresponding text passages in the source documents. Therefore, the detection of entailment relations between the suspicious texts and the source ones may help to detect plagiarism as well.

6.3 Selected Scientific Output

Although throughout this thesis we have already cited our most salient work, for quick reference the following list presents our own specific selection of publications.

Regarding the RTE PASCAL Challenges:
Alexandra Balahur, Elena Lloret, Óscar Ferrández, Andrés Montoyo, Manuel Palomar and Rafael Muñoz. The DLSIUAES Team's Participation in the TAC 2008 Tracks. In Notebook Papers of the Text Analysis Conference, TAC 2008 Workshop. Gaithersburg, Maryland, USA. November 2008.

Óscar Ferrández, Daniel Micol, Rafael Muñoz and Manuel Palomar. A Perspective-Based Approach for Solving Textual Entailment Recognition. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 66-71, Prague, June 2007.

Óscar Ferrández, Rafael M. Terol, Rafael Muñoz, Patricio Martínez-Barco and Manuel Palomar. An Approach Based on Logic Forms and WordNet Relationships to Textual Entailment Performance. In PASCAL Recognising Textual Entailment Challenge. Venice, Italy, April 2006.

Regarding the AVE Challenges:

Óscar Ferrández, Rafael Muñoz and Manuel Palomar. Studying the Influence of Semantic Constraints in AVE. In CLEF 2008, Lecture Notes in Computer Science, to appear.

Óscar Ferrández, Daniel Micol, Rafael Muñoz and Manuel Palomar. On the Application of Lexical-Syntactic Knowledge to the Answer Validation Exercise. In CLEF 2007, Lecture Notes in Computer Science LNCS 5152, pp. 377-380, 2007.

Óscar Ferrández, Rafael M. Terol, Rafael Muñoz, Patricio Martínez-Barco and Manuel Palomar. A Knowledge-Based Textual Entailment Approach Applied to the AVE Task. In CLEF 2006, Lecture Notes in Computer Science LNCS 4730, pp. 490-493, 2006.

Related to the QALL-ME project:

Óscar Ferrández, Rubén Izquierdo, Sergio Ferrández and José Luis Vicedo. Addressing ontology-based question answering with collections of user queries. In Information Processing and Management Journal, Volume 45, Issue 2, March 2009, Pages 175-188.
Bogdan Sacaleanu, Constantin Orasan, Christian Spurk, Shiyan Ou, Óscar Ferrández, Milen Kouylekov and Matteo Negri. Entailment-based Question Answering for Structured Data. In Coling 2008: Companion volume: Demonstrations, pp. 173-176, Manchester, UK, August 2008.

Óscar Ferrández, Rubén Izquierdo, Sergio Ferrández and José Luis Vicedo. Un sistema de búsqueda de respuestas basado en ontologías, implicación textual y entornos reales. In Procesamiento del Lenguaje Natural, vol. 41, pp. 47-54, 2008.

Related to textual entailment and Text Summarization works:

Elena Lloret, Óscar Ferrández, Rafael Muñoz and Manuel Palomar. A Text Summarization Approach under the Influence of Textual Entailment. In Natural Language Processing and Cognitive Science, Proceedings of the 5th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2008, in conjunction with ICEIS 2008, Barcelona, Spain, June 2008.

Elena Lloret, Óscar Ferrández, Rafael Muñoz and Manuel Palomar. Integración del reconocimiento de la implicación textual en tareas automáticas de resúmenes de textos. In Procesamiento del Lenguaje Natural, vol. 41, pp. 183-190, 2008.

Related to the task of linking and disambiguating Wikipedia categories to WordNet glosses:

Antonio Toral, Óscar Ferrández, Eneko Agirre and Rafael Muñoz. A Study on Linking and Disambiguating Wikipedia Categories to WordNet Using Text Similarity. In Proceedings of Recent Advances in Natural Language Processing (RANLP 09). To appear.

Other relevant publications:

Óscar Ferrández, Michael Ellsworth, Rafael Muñoz and Collin F. Baker. A Graph-based Measure of FrameNet-WordNet Alignment. In preparation (at the time of writing).
Óscar Ferrández, Daniel Micol, Rafael Muñoz and Manuel Palomar. DLSITE-1: Lexical Analysis for Solving Textual Entailment Recognition. In Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems. Paris, France. June 2007.

Daniel Micol, Óscar Ferrández, Rafael Muñoz and Manuel Palomar. DLSITE-2: Semantic Similarity Based on Syntactic Dependency Trees Applied to Textual Entailment. In Proceedings of the TextGraphs-2 Workshop, pp. 73-80, The North American Chapter of the Association for Computational Linguistics, Rochester, New York. April 2007.

Daniel Micol, Óscar Ferrández, Rafael Muñoz and Manuel Palomar. A Semantic-less Approach for the Textual Entailment Recognition Task. In Recent Advances in Natural Language Processing (RANLP). Borovets, Bulgaria, 2007.

Collaborative works in other fields:

Sergio Ferrández, Antonio Toral, Óscar Ferrández, Antonio Ferrández and Rafael Muñoz. Exploiting Wikipedia and EuroWordNet to Solve Cross-Lingual Question Answering. Preliminarily accepted by the Information Sciences Journal.

Estela Saquete, Óscar Ferrández, Sergio Ferrández, Patricio Martínez-Barco and Rafael Muñoz. Combining automatic acquisition of knowledge with machine learning approaches for multilingual temporal recognition and normalization. In Information Sciences Journal, vol. 178 (17), 2008.

Zornitsa Kozareva, Óscar Ferrández, Andrés Montoyo and Rafael Muñoz. Combining data-driven systems for improving Named Entity Recognition. In Data and Knowledge Engineering Journal, vol. 61 (3), pp. 449-466, 2007.
7 Software Developments

7.1 VerbNet Wrapper in Java

A package of Java classes was created in order to manage the VerbNet XML files. Although it was originally developed for our own purposes, it can be easily extended, as it is written in pure Java. As a prerequisite, the VerbNet resource has to be downloaded first, and the path where the VerbNet files live has to be specified as well. The wrapper uses the jdom.jar library (http://www.jdom.org/), and it can be obtained from http://www.dlsi.ua.es/~ofe/verbnetwrapper.tar.
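As an informal illustration of the kind of processing the wrapper performs, the following minimal sketch loads one VerbNet class file with JDOM and lists its member verbs. It assumes the standard VerbNet XML layout (a VNCLASS root element containing a MEMBERS element with MEMBER children); the class name and command-line usage are illustrative and do not reproduce the wrapper's actual API.

```java
import java.io.File;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

// Minimal sketch: list the member verbs of one VerbNet class file.
public class VerbNetClassReader {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a VerbNet class file, e.g. "give-13.1.xml" (illustrative).
        Document doc = new SAXBuilder().build(new File(args[0]));
        Element root = doc.getRootElement();          // <VNCLASS ID="...">
        System.out.println("Class: " + root.getAttributeValue("ID"));
        Element members = root.getChild("MEMBERS");
        for (Object o : members.getChildren("MEMBER")) {
            Element m = (Element) o;
            // Each <MEMBER> carries the verb lemma and its WordNet sense keys.
            System.out.println(m.getAttributeValue("name") + " (wn: "
                    + m.getAttributeValue("wn") + ")");
        }
    }
}
```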
7.2 VerbOcean Wrapper in Java

Another Java package was developed to deal with the VerbOcean resource. It models the entire set of relations encoded in VerbOcean (http://demo.patrickpantel.com/content/verbocean/), which in our research were used within textual entailment environments. It requires the VerbOcean file containing the full list of relations, and it can be downloaded from http://www.dlsi.ua.es/~ofe/verboceanwrapper.tar.
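The sketch below shows how such a wrapper might load the resource into memory. It assumes the plain-text format of the VerbOcean distribution files, verb1 [relation] verb2 :: score, with '#' introducing comment lines; the class name and the in-memory representation are illustrative, not the wrapper's actual API.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: load VerbOcean relations into memory.
// Assumed line format (check the header of the file you download):
//   verb1 [relation] verb2 :: score
public class VerbOceanLoader {
    public static void main(String[] args) throws Exception {
        Map<String, Double> relations = new HashMap<String, Double>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("#") || line.trim().length() == 0) continue;
            String[] parts = line.split("\\s+");
            if (parts.length < 5) continue;   // verb1 [rel] verb2 :: score
            String key = parts[0] + " " + parts[1] + " " + parts[2];
            relations.put(key, Double.parseDouble(parts[4]));
        }
        in.close();
        System.out.println(relations.size() + " relations loaded");
        // e.g. relations.get("win [similar] gain") -> confidence score, if present
    }
}
```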
7.3 Frame-to-Frame Similarity Demo in Java

At http://sqm1.dlsi.ua.es/jfnslonlinedemo/ our demo of the Frame-to-Frame similarity can be tested. It obtains a similarity score for the two frames evoked by two specific lemma part-of-speech pairs, as described in Section 3.5.5. It works on the FrameNet 1.3 release and, obviously, to obtain a similarity score the FrameNet data has to be loaded and each lemma part-of-speech pair has to correspond to an actual FrameNet 1.3 Lexical Unit. This tool uses the FrameNet API developed by Nils Reiter (http://www.coli.uni-saarland.de/~reiter/framenetapi/doc/index.html).

7.4 The FrameNet-WordNet Alignments

The alignments of FrameNet 1.3 Lexical Units and WordNet 3.0 synsets obtained by the alignment algorithm proposed in Section 3.5.5 can be downloaded at http://www.dlsi.ua.es/~ofe/berkeley/best-alignments-FN13-WN30.zip and http://www.dlsi.ua.es/~ofe/berkeley/all-alignments-FN13-WN30.zip. The first file reports the best alignments according to the best score achieved between a specific Lexical Unit and synset, and the second reports all possible alignments. Both are distributed in a machine-readable XML format.

7.5 Entailment-based QA System Demo (Spanish QALLME-demo)

The Spanish entailment-based QA system developed within the framework of the European QALL-ME project can be tested on-line at http://sqm1.dlsi.ua.es/general/index.jsp. As previously mentioned, it makes use of the textual entailment system presented in this thesis, but in this case finding entailment relations between queries and a set of predefined query patterns.
References

Adams, Rod, Nicolae, Gabriel, Nicolae, Cristina, & Harabagiu, Sanda. 2007. Textual entailment through extended lexical overlap and lexico-semantic matching. Pages 119-124 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics.

Ageno, Alicia, Farwell, David, Ferrés, Daniel, Cruz, Fermin, Rodriguez, Horacio, & Turmo, Jordi. 2008. TALP at TAC 2008: a semantic approach to recognizing textual entailment. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology.

Agichtein, Eugene, Askew, Walt, & Liu, Yandong. 2008. Combining semantic and lexical evidence for recognizing textual entailment (RTE4) task. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology.

Agirre, E., Soroa, A., Alfonseca, E., Hall, K., Kravalova, J., & Pasca, M. 2009a. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In: Proceedings of naacl.

Agirre, Eneko, Alfonseca, Enrique, Hall, Keith, Kravalova, Jana, Pasca, Marius, & Soroa, Aitor. 2009b. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In: Proceedings of the 12th conference of the european chapter of the association for computational linguistics (eacl-09), to appear.

Alemany, Laura Alonso, Castellón, Irene, Climent, Salvador, Fort, María Fuentes, Padró, Lluís, & Rodríguez, Horacio.
References 2003. Approaches to text summarization: Questions and answers. Inteligencia artificial, revista iberoamericana de inteligencia artificial, 22, 79 102. Atserias, Jordi, Casas, Bernardino, Comelles, Elisabet, González, Meritxell, Padró, Lluis, & Padró, Muntsa. 2006 (May). Freeling 1.3: Syntactic and semantic services in an open-source nlp library. In: Proceedings of the fifth international conference on language resources and evaluation (lrec 2006), elra. Baker, Collin F., Fillmore, Charles J., & Lowe, John B. 1998. The Berkeley FrameNet project. Pages 86 90 of: Boitet, Christian, & Whitelock, Pete (eds), Proceedings of the thirty-sixth annual meeting of the Association for Computational Linguistics and seventeenth international conference on computational linguistics. San Francisco, California: Morgan Kaufmann Publishers. Balahur, Alexandra, Lloret, Elena, Óscar Ferrández, Andrés Montoyo, Manuel Palomar, & Muñoz, Rafael. 2008. The DLSIUAES Team s Participation in the TAC 2008 Tracks. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Bar-Haim, Roy, Dagan, Ido, Dolan, Bill, Ferro, Lisa, Giampiccolo, Danilo, Magnini, Bernardo, & Szpektor, Idan. 2006. The second pascal recognising textual entailment challenge. Pages 1 9 of: Proceedings of the second pascal challenges workshop on recognising textual entailment. Bensley, Jeremy, & Hickl, Andrew. 2008. Workshop: Application of LCC s GROUNDHOG system for RTE-4. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Blake, Catherine. 2007. The role of sentence structure in recognizing textual entailment. Pages 101 106 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. 168
References Bos, Johan. 2005. Towards wide-coverage semantic interpretation. Pages 42 53 of: Proceedings of the sixth international workshop on computational semantics (iwcs-6). Bos, Johan, & Markert, Katja. 2005. Recognising textual entailment with robust logical inference. Pages 404 426 of: Machine learning challenges, evaluating predictive uncertainty, visual object classification and recognizing textual entailment, first pascal machine learning challenges workshop, mlcw 2005, southampton, uk, april 11-13, 2005, revised selected papers, lncs 3944. Bos, Johan, & Markert, Katja. 2006. When logical inference helps determining textual entailment (and when it doesn t). Pages 98 103 of: Proceedings of the second pascal challenges workshop on recognising textual entailment. Burchardt, Aljoscha, & Frank, Anette. 2006. Approaching textual entailment with lfg and framenet frames. Pages 92 97 of: Proceedings of the second pascal challenges workshop on recognising textual entailment. Burchardt, Aljoscha, & Pennacchiotti, Marco. 2008 (may). FATE: a FrameNet-Annotated Corpus for Textual Entailment. In: (ELRA), European Language Resources Association (ed), Proceedings of the sixth international language resources and evaluation (lrec 08). Burchardt, Aljoscha, Erk, Katrin, & Frank, Anette. 2005. A wordnet detour to framenet. In: Fisseni, Bernhard, Schmitz, Hans-Christian, & Schroeder, Bernhard (eds), Sprachtechnologie, mobile kommunikation und linguistische resourcen. Computer Studies in Language and Speech, vol. 8. Peter Lang, Frankfurt am Main. Burchardt, Aljoscha, Reiter, Nils, Thater, Stefan, & Frank, Anette. 2007. A semantic approach to textual entailment: System evaluation and task analysis. Pages 10 15 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Cabrio, Elena, Kouylekov, Milen, & Magnini, Bernardo. 2008a. Combining Specialized Entailment Engines for RTE-4. In: Notebook 169
References papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Cabrio, Elena, Kouylekov, Milen, Magnini, Bernardo, Negri, Matteo, Hasler, Laura, Orasan, Constantin, Tomas, David, Vicedo, Jose Luis, Neumann, Guenter, & Weber, Corinna. 2008b (may). The qall-me benchmark: a multilingual resource of annotated spoken requests for question answering. In: (ELRA), European Language Resources Association (ed), Proceedings of the sixth international language resources and evaluation (lrec 08). Carreras, Xavier, Chao, Isaac, Padró, Lluís, & Padró, Muntsa. 2004. Freeling: An open-source suite of language analyzers. In: Proceedings of the 4th international conference on language resources and evaluation (lrec 04). Castillo, Julio Javier, & i Alemany, Laura Alonso. 2008. An approach using named entities for recognizing textual entailment. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Chklovski, Timothy, & Pantel, Patrick. 2004. Verbocean: Mining the web for fine-grained semantic verb relations. In: Proceedings of conference on empirical methods in natural language processing (emnlp- 04). Clark, Stephen, & Curran, James R. 2004 (July). Parsing the wsj using ccg and log-linear models. Pages 103 110 of: Proceedings of the 42nd meeting of the association for computational linguistics (acl 04), main volume. Collins, Michael. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania. Corley, Courtney, & Mihalcea, Rada. 2005. Measuring the semantic similarity of texts. Pages 13 18 of: Proceedings of the acl workshop on empirical modeling of semantic equivalence and entailment. Ann Arbor, Michigan: Association for Computational Linguistics. Cunningham, Hamish, Maynard, Diana, Bontcheva, Kalina, & Tablan, Valentin. 2002. Gate: an architecture for development of 170
References robust hlt applications. Pages 168 175 of: Proceedings of 40th annual meeting of the association for computational linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. Daelemans, Walter, Zavrel, Jakub, van der Sloot, Ko, & van den Bosch, Antal. 2003 (November). TiMBL: Tilburg Memory- Based Learner. Tech. rept. ILK 03-10. Tilburg University. Dagan, Ido, & Glickman, Oren. 2004. Probabilistic textual entailment: Generic appied modelling of language variability. In: Proceedings of the pascal workshop on learning methods for text understanding and mining. Dagan, Ido, Glickman, Oren, & Magnini, Bernardo. 2006. The pascal recognising textual entailment challenge. Pages 170 190 of: et al., Quionero-Candela (ed), Mlcw 2005, lnai, vol. 3395. Springer-Verlag. Dagan, Ido, Roth, Dan, & Zanzotto, Fabio Massimo. 2007 (June). Textual entailment. Tutorial in 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic. Dolan, Bill. 2007 (May). Paraphrase and textual entailment. Microsoft Research India Summer School on NLP, Bangalore, India. Dolan, W.B., Quirk, C., & Brockett, C. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In: The 20th international conference on computational linguistics. Ellsworth, Michael, & Janin, Adam. 2007. Mutaphrase: Paraphrasing with framenet. Pages 143 150 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Erk, Katrin, & Pado, Sebastian. 2006. Shalmaneser - a flexible toolbox for semantic role assignment. In: Proceedings of lrec 2006: the 5 th international conference on language resources and evaluation. Fellbaum, Christian (ed). 1998. Wordnet: An electronic lexical database (isbn: 0-262-06197-x). First edn. MIT Press. 171
References Ferrández, Oscar. 2006. Nerua: Sistema de detección y clasificación de entidades para el español basado en aprendizaje automático. Diploma de estudios avanzados (dea). Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante. Ferrández, Óscar, Terol, Rafael M., Muñoz, Rafael, Martínez- Barco, Patricio, & Palomar, Manuel. 2006. A knowledge-based textual entailment approach applied to the ave task. Pages 490 493 of: Clef 2006, lecture notes in computer science lncs 4730. Ferrández, Óscar, Micol, Daniel, Muñoz, Rafael, & Palomar, Manuel. 2007. DLSITE-1: Lexical analysis for solving textual entailment recognition. In: Proceedings of the 12th international conference on applications of natural language to information systems. Paris, France: Springer. Ferrández, Óscar, Micol, Daniel, Muñoz, Rafael, & Palomar, Manuel. 2007 (September). On the application of lexical-syntactic knowledge to the answer validation exercise. Pages 377 380 of: et al., C. Peters (ed), Clef 2007, lecture notes in computer science lncs 5152. Ferrández, Óscar, Muñoz, Rafael, & Palomar, Manuel. 2008 (September). Studying the influence of semantic constraints in ave. In: Clef 2008, lecture notes in computer science, to appear. Ferrández, Óscar, Izquierdo, Rubén, Ferrández, Sergio, & Vicedo, José Luis. 2009. Addressing ontology-based question answering with collections of user queries. Information processing and management, 45(2), 175 188. Ferrés, Daniel, & Rodríguez, Horacio. 2007. Machine learning with semantic-based distances between sentences for textual entailment. Pages 60 65 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Finkel, Jenny Rose, Grenager, Trond, & Manning, Christopher. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. Pages 363 370 of: Proceedings of the 43rd 172
References annual meeting of the association for computational linguistics (acl 05). Ann Arbor, Michigan: Association for Computational Linguistics. Frakes, Bill, & Baeza-Yates, Ricardo. 1992. Information retrieval, data structure and algorithms. Prentice Hall. Galanis, Dimitios, & Malakasiotis, Prodromos. 2008. AUEB at TAC 2008. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Giampiccolo, D., Dang, H.T., Magnini, B., Dagan, I., & Dolan, B. 2008a. The fourth pascal recognizing textual entailment challenge. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Giampiccolo, Danilo, Magnini, Bernardo, Dagan, Ido, & Dolan, Bill. 2007. The third pascal recognizing textual entailment challenge. Pages 1 9 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Giampiccolo, Danilo, Dang, H.T., Magnini, Bernardo, Dagan, Ido, & Dolan, Bill. 2008b (November). The fourth pascal recognizing textual entailment challenge. In: Proceedings of the tac 2008 workshop. Glickman, Oren. 2006. Applied textual entailment. Ph.D. thesis, Bar Ilan University. Harabagiu, Sanda M., Hickl, Andrew, & Lacatusu, V. Finley. 2007. Satisfying information needs with multi-document summaries. Inf. process. manage., 43(6), 1619 1642. Hickl, Andrew, & Bensley, Jeremy. 2007. A discourse commitmentbased framework for recognizing textual entailment. Pages 171 176 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Hockenmaier, Julia, & Steedman, Mark. 2002 (May). Acquiring compact lexicalized grammars from a cleaner tree-bank. Pages 1974 1981 of: 173
References Proceedings of the third international conference on language resources and evaluation (lrec). Hovy, E. H. 2005. Automated text summarization. The Oxford Handbook of Computational Linguistics. Oxford University Press. Iftene, Adrian. 2008. UAIC participation at RTE4. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Iftene, Adrian, & Balahur-Dobrescu, Alexandra. 2007. Hypothesis transformation and semantic variability rules used in recognizing textual entailment. Pages 125 130 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Jaccard, Paul. 1912. The distribution of the flora in the alpine zone. New phytologist, 11(2), 37 50. Jaro, Matthew A. 1989. Advances in record linking methodology as applied to the 1985 census of tampa florida. Journal of the american statistical society, 84(406), 414 420. Jaro, Matthew A. 1995. Probabilistic linkage of large public health data file. Statistics in medicine, 14, 491 498. Jiang, Jay J., & Conrath, David W. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of international conference on research in computational linguistics. Jones, Karen Sparck. 1999. Automatic summarising: factors and directions. Pages 1 12 of: Advances in automatic text summarization. MIT Press. Katrenko, Sophia, & Adriaans, Pieter. 2006. Using Maximal Embedded Syntactic Subtrees for Textual Entailment Recognition. Pages 33 37 of: Proceedings of the second pascal challenges workshop on recognising textual entailment. Kipper, Karin, Korhonen, Anna, Ryant, Neville, & Palmer, Martha. 2006 (June). Extending verbnet with novel verb classes. In: 174
References Fifth international conference on language resources and evaluation (lrec 2006). Kouylekov, Milen. 2006 (December). Recognizing textual entailment with tree edit distance: Application to question answering and information extraction. Ph.D. thesis, International Graduate School in Information and Communication Technologies, Faculty of Science, University of Trento. Advisor: Bernardo Magnini. Kouylekov, Milen, & Magnini, Bernardo. 2005 (April). Recognizing textual entailment tree edit distance algorithms. In: Proceedings of pascal workshop on recognizing textual entailment. Kouylekov, Milen, & Magnini, Bernardo. 2006. Tree edit distance for recognizing textual entailment: Estimating the cost of insertion. Pages 68 73 of: Proceedings of the second pascal challenges workshop on recognising textual entailment. Kouylekov, Milen, Negri, Matteo, Magnini, Bernardo, & Coppola, Bonaventura. 2006. Towards entailment-based question answering: Itc-irst at clef 2006. Pages 526 536 of: Clef 2006, lecture notes in computer science lncs 4730. Kozareva, Z., Ferrández, O., Montoyo, A., & Muñoz, R. 2007. Combining data-driven systems for improving named entity recognition. Data and knowledge engineering, 61(3), 449 466. Leacock, Claudia, & Chodorow, Martin. 1998. Combining local context and wordnet similarity for word sense identification. An electronic lexical database, 265 283. Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady, 10(8), 707 710. Levin, Beth. 1993. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press. Lin, Chin-Yew. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. Pages 74 81 of: Marie-Francine Moens, Stan Szpakowicz (ed), Text summarization branches out: Proceedings of the 175
acl-04 workshop. Barcelona, Spain: Association for Computational Linguistics.

Lin, Chin-Yew, & Och, Franz Josef. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. Pages 605-612 of: Acl.

Lin, Dekang. 1998a. Dependency-based Evaluation of MINIPAR. In: Workshop on the evaluation of parsing systems.

Lin, Dekang. 1998b. An Information-Theoretic Definition of Similarity. Pages 296-304 of: Icml 98: Proceedings of the fifteenth international conference on machine learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Lin, Dekang, & Pantel, Patrick. 2001. DIRT - discovery of inference rules from text. Pages 323-328 of: Kdd 01: Proceedings of the seventh acm sigkdd international conference on knowledge discovery and data mining.

Lloret, Elena, Ferrández, Óscar, Muñoz, Rafael, & Palomar, Manuel. 2008a. Integración del reconocimiento de la implicación textual en tareas automáticas de resúmenes de textos. Pages 183-190 of: Procesamiento del lenguaje natural, no 41.

Lloret, Elena, Ferrández, Óscar, Muñoz, Rafael, & Palomar, Manuel. 2008b. A text summarization approach under the influence of textual entailment. Pages 22-31 of: Sharp, Bernadette, & Zock, Michael (eds), Nlpcs. INSTICC Press.

MacCartney, Bill, Grenager, Trond, de Marneffe, Marie-Catherine, Cer, Daniel, & Manning, Christopher D. 2006. Learning to recognize features of valid textual entailments. Pages 41-48 of: Proceedings of the north american association of computational linguistics.

MacCune, William W. 1994. Otter 3.0 reference manual and guide. Technical report ANL-94/6, Argonne Natl. Laboratory.

Malakasiotis, Prodromos, & Androutsopoulos, Ion. 2007. Learning textual entailment using svms and string similarity measures. Pages
References 42 47 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Mani, I., & Maybury, M. T. 1999. Advances in automatic text summarization. The MIT Press. Marsi, Erwin, Krahmer, Emiel, & Bosma, Wauter. 2007. Dependency-based paraphrasing for recognizing textual entailment. Pages 83 88 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Meyers, A., Reeves, R., Macleod, C., Szekely, R., Zielinska, V., Young, B., & Grishman, R. 2004. The nombank project: An interim report. Pages 24 31 of: Meyers, A. (ed), Hlt-naacl 2004 workshop: Frontiers in corpus annotation. Boston, Massachusetts, USA: Association for Computational Linguistics. Micol, Daniel, Ferrández, Óscar, Muñoz, Rafael, & Palomar, Manuel. 2007. DLSITE-2: Semantic similarity based on syntactic dependency trees applied to textual entailment. Pages 73 80 of: Proceedings of the textgraphs-2 workshop. Rochester, New York, United States of America: The North American Chapter of the Association for Computational Linguistics. Miller, George A., Beckwith, Richard, Fellbaum, Christiane, Gross, Derek, & Miller., Katherine J. 1990. Introduction to WordNet: An On-line Lexical Database. International journal of lexicography, 3(4), 235 244. Moldovan, Dan, & Novischi, Adrian. 2002. Lexical chains for question answering. In: Proceedings of coling 2002. Moldovan, Dan I., Clark, Christine, Harabagiu, Sanda M., & Maiorano, Steven J. 2003. Cogex: A logic prover for question answering. In: Hlt-naacl. Montalvo-Huhn, Orlando, & Taylor, Stephen. 2008. Textual Entailment - Fitchburg State College. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. 177
Moreda, Paloma. 2008 (Mayo). Los roles semánticos en la tecnología del lenguaje humano: Anotación y aplicación. Ph.D. thesis, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante. Advisor: Manuel Palomar.

Moreno, Lidia, Palomar, Manuel, Molina, Antonio, & Ferrández, Antonio. 1999. Introducción al Procesamiento del Lenguaje Natural. Universidad de Alicante.

Needleman, Saul, & Wunsch, Christian. 1970. A general method applicable to the search for similarities in amino acid sequence of two proteins. Journal of molecular biology, 48(3), 443-453.

Nielsen, Rodney D., Becker, Lee, & Ward, Wayne. 2008. TAC 2008 CLEAR system report: Facet-based entailment. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology.

Niles, I., & Pease, A. 2001. Towards a standard upper ontology. Pages 2-9 of: Proceedings of the 2nd international conference on formal ontology in information systems (fois-2001). ACM Press.

Nuno, Seco. 2005. Computational models of similarity in lexical ontologies. Ph.D. thesis, University College Dublin.

Palmer, Martha, Gildea, Daniel, & Kingsbury, Paul. 2005. The proposition bank: An annotated corpus of semantic roles. Computational linguistics, 31(1), 71-106.

Pedersen, Ted, Patwardhan, Siddharth, & Michelizzi, Jason. 2004. WordNet::Similarity - Measuring the Relatedness of Concepts. Pages 38-41 of: Proceedings of the north american chapter of the association for computational linguistics.

Peñas, Anselmo, Rodrigo, Álvaro, & Verdejo, Felisa. 2006 (September). Overview of the answer validation exercise 2006. In: et al., C. Peters (ed), Clef 2006, lecture notes in computer science lncs 4730.

Peñas, Anselmo, Rodrigo, Álvaro, & Verdejo, Felisa. 2007 (September). Overview of the answer validation exercise 2007. In:
References et al., C. Peters (ed), Clef 2007, lecture notes in computer science lncs 5152. Peters, Carol. 2007 (September). What happened in clef 2007? introduction to the working notes. In: Working notes for the 8th workshop of the cross-language evaluation forum, clef. Peters, Carol. 2008 (September). What happened in clef 2008 introduction to the working notes. In: Working notes for the 9th workshop of the cross-language evaluation forum, clef. Pirrò, Giuseppe, & Seco, Nuno. 2008. Design, implementation and evaluation of a new semantic similarity metric combining features and intrinsic information content. Pages 1271 1288 of: On the move to meaningful internet systems: Otm 2008, otm 2008 confederated international conferences, coopis, doa, gada, is, and odbase 2008, monterrey, mexico, november 9-14, 2008, proceedings, part ii. Lecture Notes in Computer Science, vol. 5332. Springer. Resnik, Philip. 1995. Using information content to evaluate semantic similarity in a taxonomy. Pages 448 453 of: Ijcai. Riezler, Stefan, King, Tracy H., Kaplan, Ronald M., Crouch, Richard, Maxwell, John T. III, & Johnson, Mark. 2002. Parsing the wall street journal using a lexical-functional grammar and discriminative estimation techniques. Pages 271 278 of: Proceedings of 40th annual meeting of the association for computational linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. Rodrigo, Álvaro, Peñas, Anselmo, Herrera, Jesús, & Verdejo, Felisa. 2006. The effect of entity recognition on answer validation. Pages 483 489 of: Clef 2006, lecture notes in computer science lncs 4730. Rodrigo, Álvaro, Peñas, Anselmo, Herrera, Jesús, & Verdejo, Felisa. 2007a. Experiments of uned at the third recognising textual entailment challenge. Pages 89 94 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. 179
References Rodrigo, Álvaro, Peñas, Anselmo, & Verdejo, Felisa. 2007b. Uned at answer validation exercise 2007. Pages 404 409 of: Clef 2007, lecture notes in computer science lncs 5152. Rodrigo, Álvaro, Peñas, Anselmo, & Verdejo, Felisa. 2008a (September). Overview of the answer validation exercise 2008. In: et al., C. Peters (ed), Clef 2008, lecture notes in computer science, to appear. Rodrigo, Álvaro, Peñas, Anselmo, & Verdejo, Felisa. 2008b. Towards an entity-based recognition of textual entailment. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Roth, Dan. 2005 (June). Knowledge representation and inference models for textual entailment. An Invited talk in Empirical Modeling of Semantic Equivalence and Entailment (Workshop co-located with ACL-2005), Ann Arbor, Michigan. Roth, Dan, & Sammons, Mark. 2007. Semantic and logical inference model for textual entailment. Pages 107 112 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Sacaleanu, Bogdan, Orasan, Constantin, Spurk, Christian, Ou, Shiyan, Ferrández, Óscar, Kouylekov, Milen, & Negri, Matteo. 2008. Entailment-based Question Answering for Structured Data. Pages 173 176 of: Coling 2008: Companion volume: Demonstrations. Manchester, UK: Coling 2008 Organizing Committee. Sampson, G. 1995. English for the Computer. In: Oxford university press. Sang, Tijong Kim. 2002. Introduction to the CoNLL-2002 Shared Task: Language Independent Named Entity Recognition. Pages 155 158 of: Proceedings of conll-2002. Sauri, Roser, & Pustejovsky, James. 2007. Determining modality and factuality for text entailment. Pages 509 516 of: Proceedings of the first ieee international conference on semantic computing (icsc 2007). 180
References Schmid, Helmut. 1994 (September). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of international conference on new methods in language processing. Schröder, Ingo. 2002. A Case Study in Part-of-Speech tagging Using the ICOPOST Toolkit. Tech. rept. FBI-HH-M-314/02. Department of Computer Science, University of Hamburg. Settembre, Scott. 2007. Textual entailment using univariate density model and maximizing discriminant function. Pages 95 100 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Siblini, R., & Kosseim, L. 2008a (June). Rodeo: Reasoning over dependencies extracted online. In: Proceedings of the 4th web as corpus workshop of language resources and evaluation (lrec-2008). Siblini, Reda, & Kosseim, Leila. 2008b. Using ontology alignment for the TAC RTE challenge. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Smith, T. F., & Waterman, M. S. 1981. Identification of common molecular subsequences. Journal of molecular biology, 147, 195 197. Snow, Rion, Vanderwende, Lucy, & Menezes, Arul. 2006. Effectively using syntax for recognizing false entailment. Pages 33 40 of: Proceedings of the north american association of computational linguistics. Sparck-Jones, Karen. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1), 11 21. Steinberger, Josef, Poesio, Massimo, Kabadjov, Mijail A., & Jeek, Karel. 2007. Two uses of anaphora resolution in summarization. Inf. process. manage., 43(6), 1663 1680. Suárez, Armando, & Palomar, Manuel. 2002 (August). A Maximum Entropy-based Word Sense Disambiguation System. Pages 960 966 of: Chen, Hsin-Hsi, & Lin, Chin-Yew (eds), Proceedings of the 19th international conference on computational linguistics, coling 2002. 181
References Suchanek, Fabian, Kasneci, Gjergji, & Weikum, Gerhard. 2008. Yago - a large ontology from wikipedia and wordnet. Elsevier journal of web semantics, 6(3), 203 217. Szpektor, I., Tanev, H., Dagan, I., & Coppola, B. 2004. Scaling web-based acquisition of entailment relations. In: Proceedings of the 2004 conference on empirical methods in natural language processing. Tatar, Doina, Tamaianu-Morita, Emma, Mihis, Andreea, & Lupsa, Dana. 2008. Summarization by logic segmentation and text entailment. Pages 15 26 of: 9th international conference cicling 2008, research in computing science vol 33. Alexander Gelbukh. Tatu, Marta, & Moldovan, Dan. 2007. Cogex at rte 3. Pages 22 27 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Tatu, Marta, Iles, Brandon, & Moldovan, Dan I. 2006a. Automatic answer validation using cogex. Pages 494 501 of: Clef 2006, lecture notes in computer science lncs 4730. Tatu, Marta, Iles, Brandon, Slavick, John, Novischi, Adrian, & Moldovan, Dan. 2006b. Cogex at the second recognizing textual entailment challenge. Pages 104 109 of: Proceedings of the second pascal challenges workshop on recognising textual entailment. Tjong Kim Sang, Erik F., & De Meulder, Fien. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Pages 142 147 of: Daelemans, Walter, & Osborne, Miles (eds), Proceedings of conll-2003. Edmonton, Canada. Tonelli, Sara, & Pianta, Emanuele. 2009. A novel approach to mapping FrameNet lexical units to WordNet synsets. In: Proceedings of iwcs-8. Toral, Antonio, Muñoz, Rafael, & Monachini, Monica. 2008 (may). Named entity wordnet. In: (ELRA), European Language Resources Association (ed), Proceedings of the sixth international language resources and evaluation (lrec 08). 182
Toral, Antonio, Ferrández, Óscar, Agirre, Eneko, & Muñoz, Rafael. 2009. A study on linking and disambiguating wikipedia categories to wordnet using text similarity. In: Proceedings of the recent advances in natural language processing (ranlp 09). to appear.

Tversky, Amos. 1977. Features of similarity. Pages 327-352 of: Psychological review, vol. 84.

Verhagen, Marc, Mani, Inderjeet, Sauri, Roser, Littman, Jessica, Knippen, Robert, Jang, Seok B., Rumshisky, Anna, Phillips, John, & Pustejovsky, James. 2005. Automating temporal annotation with TARSQI. Pages 81-84 of: Proceedings of the acl interactive poster and demonstration sessions. Ann Arbor, Michigan: Association for Computational Linguistics.

Wang, Rui, & Neumann, Guenter. 2008a. An Accuracy-Oriented Divide-and-Conquer Strategy for Recognizing Textual Entailment. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology.

Wang, Rui, & Neumann, Günter. 2007a. Recognizing textual entailment using sentence similarity based on dependency tree skeletons. Pages 36-41 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics.

Wang, Rui, & Neumann, Günter. 2007b. Using recognizing textual entailment as a core engine for answer validation. Pages 387-390 of: Clef 2007, lecture notes in computer science lncs 5152.

Wang, Rui, & Neumann, Günter. 2008b (September). Information synthesis for answer validation. In: Clef 2008, lecture notes in computer science, to appear.

Widdows, Dominic, & Ferraro, Kathleen. 2008 (may). Semantic vectors: a scalable open source package and online technology management application. In: Proceedings of the sixth international language resources and evaluation (lrec 08).
Winkler, William E. 1999. The State of Record Linkage and Current Research Problems. Tech. rept. Statistical Research Division, U.S. Census Bureau. Witten, Ian H., & Frank, Eibe. 2005. Data Mining: Practical machine learning tools and techniques. 2nd Edition, Morgan Kaufmann, San Francisco, United States of America. Wu, Zhibiao, & Palmer, Martha. 1994. Verb Semantics and Lexical Selection. Pages 133 138 of: Proceedings of the 32nd annual meeting of the associations for computational linguistics. Zanzotto, Fabio Massimo, Pennacchiotti, Marco, & Moschitti, Alessandro. 2007. Shallow semantic in fast textual entailment rule learners. Pages 72 77 of: Proceedings of the acl-pascal workshop on textual entailment and paraphrasing. Prague: Association for Computational Linguistics. Zanzotto, Fabio Massimo, Pennacchiotti, Marco, & Moschitti, Alessandro. 2008. PeMoZa submission to TAC 2008. In: Notebook papers of the text analysis conference, tac 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology. Zesch, Torsten, & Gurevych, Iryna. 2007 (April). Analysis of the wikipedia category graph for nlp applications. Pages 1 8 of: Proceedings of the textgraphs-2 workshop (naacl-hlt 2007). Zesch, Torsten, Müller, Christof, & Gurevych, Iryna. 2008 (July). Using wiktionary for computing semantic relatedness. Pages 861 867 of: Proceedings of the 23rd aaai conference on artificial intelligence. 184
A The PASCAL Recognizing Textual Entailment Challenges A.1 RTE Official Results This appendix shows the official results reported on all RTE challenges. 185
1st Author (Team)                  Accuracy
Akhmatova (Macquarie)              0.519
Andreevskaia (Concordia)           0.519
Bayer (MITRE)                      0.586
Bos (Edinburgh & Leeds)            0.563
Delmonte (Venice & Irst (FBK))     0.606 (62% partial coverage)
Fowler (LCC)                       0.551
Glickman (Bar Ilan)                0.586
Herrera (UNED)                     0.566
Jijkoun (Amsterdam)                0.552
Kouylekov (Irst (FBK))             0.559
Newman (Dublin)                    0.565
Perez (Madrid)                     0.495; 0.7 (19% partial coverage)
Punyakanok (UIUC)                  0.561
Raina (Stanford)                   0.563
Wu (HKUST)                         0.512
Zanzotto (Rome-Milan)              0.524

Table A.1: Official results for the RTE-1 2005 challenge.
1st Author (Team)                     Accuracy   Avg. precision (if provided)
Adams (Dallas)                        0.6262     0.6282
Bos (Rome & Leeds)                    0.6162     0.6689
Burchardt (Saarland / SALSA)          0.5900
Clarke (Sussex)                       0.5475     0.5260
de Marneffe (Stanford)                0.6050     0.5800
Delmonte (Venice)                     0.5475     0.5495
Ferrández (Alicante)                  0.5563     0.6089
Herrera (UNED)                        0.5975     0.5663
Hickl (LCC)                           0.7538     0.8082
Inkpen (Ottawa)                       0.5825     0.5816
Katrenko (Amsterdam)                  0.5900
Kouylekov (Irst (FBK) & Trento)       0.6050     0.5046
Kozareva (Alicante)                   0.5500     0.5485
Litkowski (CL Research)               0.5813
Marsi (Tilburg & Twente)              0.6050
Newman (Dublin)                       0.5437     0.5103
Nicholson (Melbourne)                 0.5288     0.5464
Nielsen (Colorado)                    0.5962     0.6464
Rus (Memphis)                         0.5900     0.6047
Schilder (Thomson & Minnesota)        0.5550
Tatu (LCC)                            0.7375     0.7133
Vanderwende (Microsoft & Stanford)    0.6025     0.6181
Zanzotto (Rome & Milan)               0.6388     0.6441

Table A.2: Official results for the RTE-2 2006 challenge.
1st Author (Team)                              Accuracy   Avg. precision (if provided)
Adams (Dallas)                                 0.6700
Bar-Haim (Bar-Ilan & Tel Aviv)                 0.6112     0.6118
Baral                                          0.4963     0.5364
Blake (North Carolina)                         0.6585     0.6096
Bobrow (Palo Alto)                             0.5150     0.5807
Burchardt (Saarland / SALSA)                   0.6262
Burek (Open Univ.)                             0.5500     0.5514
Chambers (Stanford)                            0.6362     0.6527
Clark (Seattle, Marina del Rey & Princeton)    0.5088     0.4961
Delmonte (Venice)                              0.5875     0.5830
Ferrández (Alicante)                           0.6563
Ferres (TALP)                                  0.6150
Harmeling (Edinburgh)                          0.5775     0.5952
Hickl (LCC)                                    0.8000     0.8875
Iftene (UAIC)                                  0.6913
Li (Atlanta)                                   0.6488
Litkowski (CL Research)                        0.6125
Malakasiotis (Athens)                          0.6175     0.6808
Marsi (Tilburg & Twente)                       0.5913
Montejo-Raez (UJA)                             0.6038
Rodrigo (UNED)                                 0.6312
Roth (CCG)                                     0.6262
Settembre (Buffalo)                            0.6262     0.6274
Tatu (LCC)                                     0.7225     0.6942
Wang (DFKI)                                    0.6687
Zanzotto (Rome & Milan)                        0.6675     0.6675

Table A.3: Official results for the RTE-3 2007 challenge.
1st Author (Team)                     3-way task: Acc., derived 2-way Acc., AvgP.; 2-way task: Acc., AvgP.
Galanis (Athens)                      0.554  0.584  0.522  0.578  0.563
Bar-Haim (Bar-Ilan & Tel Aviv)        0.584
Clark (Seattle)                       0.481  0.547
Bergmair (Cambridge)                  0.516  0.5257
Glinos (SAIC)                         0.416  0.526  0.521
Nielsen (Colorado)                    0.606  0.6254
Wang (Saarland & DFKI)                0.614  0.687  0.706
Ferrández (Alicante)                  0.608
Agichtein (Emory)                     0.547  0.583  0.5954  0.588  0.5998
Cabrio (FBK-Irst)                     0.57   0.553
Montalvo-Huhn (Fitchburg)             0.466  0.526  0.526
Yatbaz (Koc)                          0.519
Varma (IIIT Hyderabad)                0.309  0.531
Krestel (Hannover & Concordia)        0.432  0.54
Bensley (LCC)                         0.746  0.7419
Zanzotto (Rome, Saarland & Trento)    0.59   0.6287
Li (Tsinghua)                         0.588  0.633  0.6332  0.659  0.6225
Siblini (Concordia)                   0.616  0.688  0.5811
Castillo (Cordoba)                    0.546  0.571
Pado (Stanford)                       0.553  0.614  0.4416
Iftene (UAIC)                         0.685  0.72   0.721
Rodrigo (UNED)                        0.549
Shen (Edinburgh)                      0.582
Ageno (TALP)                          0.563
Mohammad (Maryland)                   0.556  0.619  0.4427
wlvuk                                 0.571

Table A.4: Official results for the RTE-4 2008 challenge.
B The Answer Validation Exercise Official Results B.1 AVE Official Results This appendix shows the official results reported on all AVE competitions. 191
Team                            F-measure   Prec. YES   Rec. YES
LCC                             0.4559      0.3261      0.7576
U. Rome (run2)                  0.4106      0.2838      0.7424
ITC-irst                        0.3919      0.3090      0.5354
U. Rome (run1)                  0.3780      0.2707      0.6263
U.Alicante (Kozareva - run2)    0.3720      0.2487      0.7374
U.Alicante (Ferrández - run2)   0.3177      0.2040      0.7172
U.Alicante (Kozareva - run1)    0.3174      0.2114      0.6364
U.Alicante (Ferrández - run1)   0.3070      0.2144      0.5404
U.Twente (run1)                 0.3022      0.3313      0.2778
U.Twente (run2)                 0.2759      0.2692      0.2828
Baseline (100% YES)             0.2742      0.1589      1
Baseline (50% YES)              0.2412      0.1589      0.5
U.P. Valencia                   0.075       0.2143      0.0455

Table B.1: English official results for the AVE 2006 track.

Team                            F-measure   Prec. YES   Rec. YES
DFKI (run2)                     0.55        0.44        0.71
DFKI (run1)                     0.46        0.37        0.62
U.Alicante (run1)               0.39        0.25        0.81
Text-Mess (run1)                0.36        0.25        0.62
Iasi                            0.34        0.21        0.81
UNED                            0.34        0.22        0.71
Text-Mess (run2)                0.34        0.25        0.52
U.Alicante (run2)               0.29        0.18        0.81
Baseline (100% YES)             0.19        0.11        1
Baseline (50% YES)              0.18        0.11        0.5

Table B.2: English official results for the AVE 2007 track.
Team                            F-measure   Prec. YES   Rec. YES
DFKI                            0.64        0.54        0.78
U.Alicante                      0.49        0.35        0.86
UNC (run2)                      0.21        0.13        0.56
Iasi (run2)                     0.19        0.11        0.85
UNC (run1)                      0.17        0.09        0.94
Iasi (run1)                     0.17        0.09        0.76
Baseline (100% YES)             0.14        0.08        1
Baseline (50% YES)              0.13        0.08        0.5
UJA (run2)                      0.02        0.17        0.01

Table B.3: English official results for the AVE 2008 track.

Team                            F-measure   Prec. YES   Rec. YES
U.Alicante (run2)               0.44        0.32        0.67
INAOE (run2)                    0.39        0.30        0.59
U.Alicante (run1)               0.38        0.26        0.76
INAOE (run1)                    0.23        0.13        0.86
Baseline (100% YES)             0.18        0.10        1
Baseline (50% YES)              0.17        0.10        0.5
UJA (run1)                      0.06        0.15        0.04
UJA (run2)                      0.05        0.22        0.03

Table B.4: Spanish official results for the AVE 2008 track.
C Information Gain Achieved by the System Features Regarding the RTE Development Corpora

C.1 The Information Gain Bar Charts for All System Features

This appendix shows the information gain bar charts for all system features with regard to each RTE development corpus.
Figure C.1: Information gain of lexical features for RTE-2 development corpus. 196
Figure C.2: Information gain of syntactic-semantic features for RTE-2 development corpus. 197
Figure C.3: Information gain of lexical features for RTE-3 development corpus. 198
Figure C.4: Information gain of syntactic-semantic features for RTE-3 development corpus. 199
Figure C.5: Information gain of lexical features putting together both corpora (RTE-2 and RTE-3 development corpora). 200
Figure C.6: Information gain of syntactic-semantic features putting together both corpora (RTE-2 and RTE-3 development corpora). 201
D Síntesis en Castellano (Summary in Spanish)

D.1 Introduction

The enormous growth of digital information in recent years has triggered a great surge of interest within the scientific community in automatically processing the available information, and makes it necessary to have tools that facilitate this automatic treatment. In this respect, Natural Language Processing (NLP) plays a very relevant role. However, the great richness of language is a major obstacle to its processing by computers. It results in a large linguistic variability when expressing ideas and/or concepts, which has to be controlled in order to ease human-machine interaction. This thesis focuses on dealing with such variability and, specifically, on solving the problem of textual entailment. Textual entailment defines a model of the semantic variability that appears when a specific meaning is described in different ways. Concretely, the concept of textual entailment establishes unidirectional relations between the meanings of two texts. Traditionally, the text that allows the inference of meanings is called T, or text, and the text whose meaning is deduced
is called H, or hypothesis (Glickman, 2006). Broadly speaking, our proposal consists of solving this entailment by taking different levels into account, so that a wider range of entailments can be detected. We propose three levels: lexical, syntactic and semantic. For each level, different inferences are extracted and treated as features for a machine learning algorithm (specifically, a Support Vector Machine classifier). In addition, with the aim of removing noise and improving the classification, a feature selection process for this algorithm was also carried out.

D.1.1 Motivation

The main motivation is to achieve the automatic extraction of knowledge from the information expressed in texts. More concretely, and with regard to the context of this thesis, the practical motivation rests on the fact that many applications belonging to different NLP areas are affected by the problem of language variability. Therefore, the use of textual entailment techniques or systems within those applications could improve their final performance.

D.2 State of the Art

Detecting and classifying semantic relations are challenges that involve lexical modifications, syntactic alterations, discourse structures and references, and world knowledge. To tackle textual entailment recognition, researchers have proposed a wide variety of techniques in recent years which, from a technical point of view, can be grouped into several models: lexical, syntactic, semantic and logical.

The lexical model is the simplest of all, and usually serves as the basis of more sophisticated textual entailment systems. Nevertheless, the good results it obtains in this task are quite surprising. This is evident in the different editions of the RTE (Recognising Textual Entailment) challenge, which is one of the most relevant meeting points for research on textual entailment (Bar-Haim et al., 2006; Giampiccolo et al., 2008a). This model represents the texts as sets of words, trying to establish lexical implications between all the words of H and those of T. For example, lexical derivations such as ganador (winner) and ganar (to win) would be detected by this model.
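As a toy illustration of this model (not the exact measures used by any of the systems discussed), the following sketch computes the proportion of hypothesis tokens covered by the text and predicts entailment above an arbitrary threshold:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy lexical baseline: proportion of hypothesis tokens covered by the text.
public class LexicalOverlap {
    public static double overlap(String text, String hypothesis) {
        Set<String> t = new HashSet<String>(
                Arrays.asList(text.toLowerCase().split("\\W+")));
        String[] h = hypothesis.toLowerCase().split("\\W+");
        int covered = 0;
        for (String w : h) if (t.contains(w)) covered++;
        return covered / (double) h.length;
    }

    public static void main(String[] args) {
        // Entailment is predicted above a tuned threshold (0.75 here is arbitrary).
        double score = overlap("The Titanic sank in the North Atlantic in 1912.",
                               "The Titanic sank in 1912.");
        System.out.println(score >= 0.75 ? "ENTAILMENT" : "NO ENTAILMENT");
    }
}
```

Real lexical systems refine this kind of baseline with, for instance, lemmatisation, lexical derivations and token weighting.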
The system presented by the University of Athens (Malakasiotis & Androutsopoulos, 2007; Galanis & Malakasiotis, 2008) is perhaps the clearest example of textual entailment systems based on the lexical model. This system implements a Maximum Entropy classifier trained with different similarity measures between lexical strings. The system achieved very promising results in the RTE challenges, outperforming the baselines proposed by the organisers. Other representative examples of this model are (Montalvo-Huhn & Taylor, 2008; Adams et al., 2007; Settembre, 2007), which also achieved quite encouraging results considering that they mainly make use of lexical inferences.

The syntactic model usually represents the texts by means of dependency trees and determines the entailment through a similarity function between those trees. Kouylekov (2006), Kouylekov & Magnini (2006) and Cabrio et al. (2008a) present the system developed by FBK-Irst. The main characteristic of this system is to obtain a value that measures the distance computed as the cost of the edit operations needed to transform T into H. The system implements two algorithms: (i) the Levenshtein distance (Levenshtein, 1966); and (ii) the edit distance between dependency trees with respect to insertion, deletion and substitution operations over the nodes of those trees. The entailment decision is taken according to these two distances and the cost that would have been obtained if the whole of H had been inserted and T had been deleted.
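For reference, the string-level component mentioned above can be illustrated with the standard dynamic-programming formulation of the Levenshtein distance (a generic sketch, not FBK-Irst's implementation, which additionally operates over dependency trees):

```java
// Standard dynamic-programming Levenshtein distance between two strings.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + subst);
            }
        return d[a.length()][b.length()];
    }
}
```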
tipo kernel-based para decidir la implicación. El modelo semántico utiliza numerosos recursos para modelar el conocimiento semántico. En este modelo es común hacer uso, o incluso crear recursos semánticos que favorezcan la extracción de la semántica existente en los textos. Reconocedores de entidades, razonadores que sean capaces de normalizar expresiones temporales y numéricas, inferencias verbales, etiquetado de roles, marcos semánticos, eventos, etc, formarían parte de sistemas representados por el modelo semántico. El sistema presentado por la UNED (Rodrigo et al., 2007a; Rodrigo et al., 2008b) es el más claro ejemplo del uso de entidades para el reconocimiento de implicación textual. En sus inicios el sistema basaba su decision en encontrar correspondencias entre todas las entidades de la hipótesis con las entidades del texto, utilizando para ello la distancia de Levenshtein y emparejamiento de subcadenas. Sin embargo, en su ultima versión el sistema implementa el modelo tradicional de entidad-relación-atributo, intentando llevar a cabo emparejamientos entre H y T. Esta última configuración, aunque en sus primeros pasos, consiguió resultados muy esperanzadores. El sistema TALP (Ferrés & Rodríguez, 2007; Ageno et al., 2008) se basa en la obtención de un conjunto de distancias semánticas que determinarán la implicación. El primer paso del sistema es un procesamiento lingüístico que implica tokenizado, lematizado, reconocimiento de entidades, análisis morfológico y sintáctico, y etiquetado semántico con WordNet, dominios de Magnini y EuroWordNet Top Concept Ontology. Con este conocimiento, se construye una representación independiente del lenguaje denominada environment mediante grafos dirigidos considerando predicados unarios y binarios. Por último, el sistema obtiene una gran variedad de medidas léxicosemánticas de proximidad entre dichos grafos, procesándolas como características para un algoritmo de aprendizaje automático (AdaBoost). En su última versión, los autores enriquecieron el sistema mediante la detección de correferencia, mejoras en el reconocimiento de entidades, ampliando el uso de las relaciones de WordNet y otros recursos como VerbOcean, y realizando una clasificación previa de la hipótesis. Desafortunadamente, las mejoras introducidas, aunque esperanzadoras en cuanto a investigación, no consiguieron mejorar el sistema base. El sistema SALSA (Burchardt & Frank, 2006; Burchardt et al., 2007) ejemplifica el uso de marcos semánticos en el reconocimiento de implicación textual. Este sistema combina un análisis sintáctico profundo con la infor- 206
mación de marcos semánticos de FrameNet (Baker et al., 1998) y un componente basado en solapamiento de palabras. Sin embargo, como afirman los autores en su artículo, resulta sorprendente que la aproximación basada en simple solapamiento obtenga resultados similares e incluso mejores que inferencias muchos más complejas. En el modelo lógico las expresiones lingüísticas son transformadas en representaciones lógicas junto con un conjunto de axiomas que sean capaces de reconocer relaciones de implicación. Este modelo suele apoyarse en un logic prover encargado de determinar si produce o no la implicación de acuerdo a los axiomas extraídos de los textos. El sistema Nutcracker (Bos & Markert, 2005; Bos & Markert, 2006) representa un claro ejemplo de este modelo. Este sistema lleva a cabo la detección de implicación en tres pasos: (1) análisis semántico profundo mediante un análisis morfológico, reconocimiento de entidades y análisis sintáctico, creando como resultado estructuras de representación del discurso; (2) dichas estructuras son convertidas a predicados de lógica de primer; y (3) utiliza un logic prover para validar que T H. El sistema COGEX (Moldovan et al., 2003) es otro ejemplo de sistemas basados en lógica utilizados para el reconocimiento de implicación textual (Tatu et al., 2006b; Tatu & Moldovan, 2007). COGEX requiere de un conjunto de cláusulas para iniciar la búsqueda de inferencias, y partiendo de la hipótesis negada ( H) intenta encontrar una refutación a partir de T. Si el sistema consigue encontrarla se afirmará que existe relación de implicación. Hasta ahora los sistemas descritos pueden ser encapsulados en alguno de los modelos anteriores, ya que las inferencias que utilizan son en su mayoría léxicas, sintácticas, semánticas o basadas en lógica. Sin embargo, hay un gran número de sistemas que resulta muy difícil de etiquetarlos dentro de uno de los modelos, son sistemas que utilizan una combinación de modelos. El sistema GROUNDHOG (Hickl & Bensley, 2007; Bensley & Hickl, 2008) usa una extensa batería de recursos y sistemas estadísticos para la obtención de los hechos o compromisos derivados de T. Con esto realiza un alineamiento léxico y clasificación que estimará la probabilidad de que haya implicación. El sistema de la UAIC (Iftene & Balahur-Dobrescu, 2007; Iftene, 2008) tiene como principal idea alinear cada palabra de H con al menos una de T. Para ello, el sistema realiza diferentes transformaciones sobre H usando recursos como WordNet (Miller et al., 1990), VerbOcean (Chklovski & Pan- 207
tel, 2004), Wikipedia, repositorios de paráfrasis (DIRT (Lin & Pantel, 2001)) y bases de datos de acrónimos. El sistema de la UAIC demuestra que una combinación adecuada de recursos semánticos es de gran ayuda para resolver la implicación textual. En resumen, la tendencia principal es integrar conocimiento semántico complejo a la hora de tomar decisiones de implicación. No obstante, en la mayoría de los casos este conocimiento no logra los resultados esperados. Por consiguiente, la implicación textual es aún un campo que requiere futuras investigaciones que mejoren los resultados finales. D.3 Sistema de reconocimiento de implicación textual basado en perspectivas La hipótesis que proponemos es seguir nuestra idea sobre la resolución de la implicación textual basándonos en perspectivas. Nuestro principal objetivo es cubrir un mayor rango de implicaciones afrontando la tarea desde diferentes puntos de vista. En concreto, proponemos tres perspectivas: Léxica, Sintáctica y Semántica. La Figura D.1 muestra visualmente la arquitectura de nuestro sistema. Figure D.1: Arquitectura del sistema. 208
Each perspective is responsible for obtaining a set of inferences that serve as features for a machine learning system, which is in charge of deciding whether or not the entailment holds. Our proposal implements different configurations:

Three configurations, each one basing the entailment decision on a learning algorithm fed with the features derived from one of the proposed perspectives individually (lexical, syntactic and semantic).

One configuration that decides the entailment taking into account all the features derived from all the perspectives.

One configuration that implements a simple voting strategy over the outputs of the three configurations of the first point.

In addition to these configurations, we also created a module that implements two restrictions which must be satisfied for a text pair to be considered an entailment candidate. This module is optional, prior to and independent of the previous configurations. These two restrictions were based on: (i) the importance of having all the verbs of H related to the verbs of T; and (ii) the importance of establishing correspondences between all the entities of H and the entities of T. They are explained within the semantic perspective.

The modular architecture of our system allows the use of any machine learning algorithm and, after several preliminary tests, in our experiments we decided to use the Support Vector Machine implementation of Weka (Witten & Frank, 2005). Several previous works (including our own) have shown that this kind of learning algorithm obtains good results for the textual entailment task (Agichtein et al., 2008; Castillo & i Alemany, 2008; Rodrigo et al., 2008b; Balahur et al., 2008).

Prior to the processing of each perspective, the system carries out a series of processes shared by all of them. Specifically, tokenisation, lemmatisation, stemming and morphological analysis are performed. The results of these processes are stored in different data structures for each text-hypothesis pair.
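By way of illustration only, the following sketch shows how these configurations might be wired together. The thesis uses Weka's SVM implementation; here scikit-learn's SVC stands in for it, and the feature matrices are random placeholders for the inference values each perspective would actually produce.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

# Placeholder data: one row of inference values per text-hypothesis pair
# and perspective, plus gold YES/NO labels (1/0). Real features would come
# from the lexical, syntactic and semantic modules described below.
rng = np.random.default_rng(0)
X = {p: rng.random((100, 5)) for p in ("lexical", "syntactic", "semantic")}
y = rng.integers(0, 2, 100)

# Individual configurations: one SVM per perspective.
clfs = {p: SVC().fit(X[p], y) for p in X}

# Joint configuration: a single SVM over all the features concatenated.
combined = SVC().fit(np.hstack([X[p] for p in X]), y)

def vote(i):
    """Voting configuration: majority vote over the three individual decisions."""
    decisions = [clfs[p].predict(X[p][i:i + 1])[0] for p in clfs]
    return Counter(decisions).most_common(1)[0][0]
```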
D.3.1 Lexical perspective

This perspective is based on the extraction of a wide variety of lexical similarity measures from the text-hypothesis pair. It was presented in our paper (Ferrández et al., 2007), although for this thesis a larger number of measures have been evaluated. Most of the measures focus basically on the morphology of the words and the context in which they appear. These measures were applied to the previously obtained data structures, each one producing its corresponding similarity value. It has been shown in the latest RTE challenges (Giampiccolo et al., 2007; Giampiccolo et al., 2008a) that similar techniques, although knowledge-poor, obtain very promising results, even comparable to those of much more sophisticated systems. Our view in this respect is that such techniques obtain such high results because different people usually communicate the same ideas or concepts through the same, or very similar, linguistic expressions.

The following points briefly describe the set of lexical measures used (two of them are sketched in code after the list):

Binary matching: implements a binary matching (1 if the match is achieved, 0 if it is not) between the elements of the hypothesis and the text. The final weight is normalised by dividing it by the total number of elements of the hypothesis.

Levenshtein distance (Levenshtein, 1966): similar to binary matching, but the weight of each element depends on the maximum Levenshtein distance obtained for each element of H with respect to those of T.

The Needleman-Wunsch algorithm (Needleman & Wunsch, 1970): originally used to find similarities in protein sequences. It is also similar to the Levenshtein distance, but it allows a variable cost adjustment for insertions and deletions. After several experiments, we set this cost to 3.

The Smith-Waterman algorithm (Smith & Waterman, 1981): a dynamic programming algorithm that performs local alignments and determines similar regions between sequences. Several parameters have to be set beforehand; we empirically used costs of 0.3, -1 and 2 for the gap in insertions and deletions, the copy, and the substitution, respectively.

Matching of consecutive subsequences: a procedure that assigns weights to the occurrence of consecutive subsequences in both texts (T and H). All possible subsequences from length 2 up to the total length in elements are generated for each text-hypothesis pair. Once this is done, each subsequence of H is matched against some subsequence of T of the same size. The longer the subsequence, the greater the weight assigned to the match.

The ROUGE measures: we implemented the different ROUGE measures as detailed in (Lin & Och, 2004), but computed over the text-hypothesis pair. The measures implemented were ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S, with n-gram and skip-n-gram values of 2 and 3 for both.

The Jaro distance (Jaro, 1995): this distance is specially designed to handle spelling errors. It considers as correct alignments (although penalisable) those separated by less than the maximum of the string lengths divided by two, minus one.

The Jaro-Winkler distance (Winkler, 1999): a variation of the Jaro metric that adapts very well to short strings such as person names. Its main characteristic is that it emphasises the similarity between prefixes of a given length. In our experiments this length was empirically set to a value of 4.

The Euclidean distance: we use the traditional definition of the Euclidean distance but, in order to apply it to strings, the n-space is set to the number of distinct characters appearing in each string, and the vector values to the number of times each of them occurs in each string.

Cosine similarity: we use this measure, widely known and used in Information Retrieval tasks (Frakes & Baeza-Yates, 1992), building the string vectors in the same way as for the previous measure.

Jaccard similarity coefficient (Jaccard, 1912): we use this coefficient to compare the similarity and diversity between two strings, which are previously transformed into vectors containing the sets of characters specific to each string.

Dice coefficient: a similarity measure based on the terms appearing in the sequences. It is closely related to the Jaccard metric. The sets used to represent the strings were encoded in the same way as for the Jaccard coefficient.

The Soundex distance: Soundex is a phonetic indexing scheme widely used in genealogy. We obtained this distance by encoding each string into its Soundex code and trying to match each Soundex code of H against those of T.

Q-gram matching: this matching is obtained by means of a window of length q that extracts all possible substrings of that length, and then prorates the number of matches obtained with respect to the other string over all the possible ones. The value of q that best suited our goals was 3.

IDF specificity: a process that obtains a weight derived from the sum of the IDF (Inverse Document Frequency) values of the words shared by the text and the hypothesis, divided by the sum of the IDF values of all the words of the hypothesis. We used the IDF definition introduced in (Sparck-Jones, 1972), computed over the CLEF (Cross-Language Evaluation Forum) corpora (http://www.clef-campaign.org/).

Summing up, all the measures presented are computed over the previously created data structures. The maximum similarity value obtained for each measure over all the data structures is stored to serve as a feature for our machine learning algorithm.
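As a minimal illustration, two of the simpler measures could be sketched as follows; the token- and character-level representations are simplifications of the data structures described above, and the function names are ours.

```python
def binary_match(t_tokens, h_tokens):
    """Binary matching: 1/0 match of each hypothesis element against the
    text, normalised by the number of hypothesis elements."""
    t = set(t_tokens)
    return sum(1 for w in h_tokens if w in t) / len(h_tokens)

def jaccard(s1, s2):
    """Jaccard coefficient over the character sets of the two strings."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

# e.g. binary_match("John won the race".split(), "John is the winner".split())
# returns 0.5: two of the four hypothesis tokens appear in the text.
```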
D.3.2 Syntactic perspective

The syntactic perspective we propose was presented in our paper (Micol et al., 2007), and is composed of four interacting modules:

Tree generation module: the module in charge of loading into memory the structures needed to represent the texts by means of dependency trees. For this purpose we use the MINIPAR tool (Lin, 1998a).

Tree filtering module: this module rebuilds the dependency trees, removing the nodes that are not relevant for the analysis (for instance, stop words).

Embedded graph detection module: following the approach of (Katrenko & Adriaans, 2006), this module detects whether the hypothesis tree is embedded in the text tree. To relax this approach, we allow intermediate nodes to appear in the text, but always preserving the order of the hypothesis nodes. Moreover, departing from the approach of (Katrenko & Adriaans, 2006), for node matching we use the similarity measure of (Wu & Palmer, 1994) over WordNet with a similarity threshold of 80%. If this module deduces that the hypothesis tree is embedded in the text tree, the highest similarity value is returned.

Tree matching module: this module tries to match all the nodes of the hypothesis tree against those of the text. This matching is based on lemmas, without performing any kind of semantic association. In addition, the matching process takes into account the part of speech of each node, the grammatical relation it represents, and its depth in the tree. As a result, a similarity value normalised by the length of the hypothesis is obtained.

The similarity factor returned by the embedded graph module or by the tree matching module is provided as yet another feature for deciding the entailment.
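A rough sketch of the embedded-tree test might look as follows, assuming each dependency tree is a dict mapping a node to its children. Node matching is reduced to plain equality here, whereas the module above relaxes it with Wu & Palmer similarity over WordNet; sibling-order preservation is also omitted for brevity.

```python
def embedded(h_node, t_node, h_tree, t_tree):
    """True if the hypothesis subtree rooted at h_node can be mapped into
    the text subtree rooted at t_node, allowing intermediate text nodes."""
    if h_node != t_node:
        # t_node may be an intermediate node: try to anchor h_node deeper.
        return any(embedded(h_node, c, h_tree, t_tree)
                   for c in t_tree.get(t_node, []))
    # Every hypothesis child must be embeddable under some text child.
    return all(any(embedded(hc, tc, h_tree, t_tree)
                   for tc in t_tree.get(t_node, []))
               for hc in h_tree.get(h_node, []))

t_tree = {"bought": ["John", "car"], "car": ["a", "red"]}
h_tree = {"bought": ["John", "car"]}
assert embedded("bought", "bought", h_tree, t_tree)
```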
D.3.3 Semantic perspective

By its very definition, textual entailment requires the use of semantic knowledge to ease its detection. Nevertheless, adding semantic knowledge is perhaps the most difficult task for any NLP system. Our view in this respect is that the use of this knowledge turns out to be so complex, and sometimes inefficient, because of the limited coverage that semantic resources usually have. Several procedures were developed, supported by a previous study on how to apply semantic knowledge to the system:

WordNet-based semantic similarity: a module was developed that obtains semantic similarities over the text-hypothesis pair. For this purpose we use a set of WordNet-based similarity measures, specifically those of Resnik (Resnik, 1995), Lin (Lin, 1998b), Jiang & Conrath (Jiang & Conrath, 1997) and Pirrò & Seco (Nuno, 2005; Pirrò & Seco, 2008). As a result, a similarity factor is obtained from the sum of the maximum similarity values obtained between each lemma of H and all those of T; this factor is normalised by dividing it by the length of H in processed lemmas. If a lemma of H is not found in WordNet, the maximum similarity value returned is computed by means of the Smith-Waterman algorithm over all the lemmas of T. By considering in this way the lemmas that do not appear in WordNet, we also take into account similarities between entities not covered by WordNet. The final similarity factor serves as a feature for our learning algorithm.

Negation: negative particles can radically change the meaning of texts. Several procedures regarding antonymy relations, text polarity and the detection of modal markers were developed, each one interpreted as a learning feature.

The importance of verbs: this module detects all the verbs (discarding auxiliaries) of H and T, and tries to establish relations between them. To do so, it uses the VerbNet classes and the VerbOcean relations, obtaining two values: (i) a binary value showing whether all the verbs of H are related to some verb of T; and (ii) a value obtained by dividing the number of verbs of H related to T by the total number of verbs in H. Furthermore, apart from considering these two values as learning features, we established an optional restriction, prior to the processing of our perspectives, that discards those pairs in which some verb of H has no relation to any verb of T.

The importance of entities: based on the detection, presence and absence of entities in the text-hypothesis pair, this module tries to establish correspondences between the entities of H and T. For the detection we use NERUA (Kozareva et al., 2007), and to establish correspondences we consider partial matchings and acronym association. As with the verbs, two values (one binary and one weighted) were obtained, as well as a restriction requiring all the entities of H to have their correspondence in T.

Applying semantic frame analysis: we use the knowledge of FrameNet (Baker et al., 1998) to detect the frames and frame elements (roles) appearing in the texts and to establish inferences between them. In addition to frame detection, two procedures were created to enrich our inferences:

Frame similarity: this metric obtains a similarity factor between two frames based on the path connecting them in the hierarchy and on the relations encoded in FrameNet.

FrameNet-WordNet alignment algorithm: to extend the coverage of FrameNet, we implemented an alignment algorithm between FrameNet Lexical Units and WordNet Synsets. The alignments achieved thus allow us to associate the synonyms and hyponyms of a Synset with the frame evoked by the Lexical Unit it was aligned with.

Regarding the application of frame-based inferences, we obtained the following learning features: (i) a simple matching between the frames detected in H and T; (ii) a matching that indicates how many frame elements of H and T are instantiated with the same (or lexically similar) values; (iii) a factor obtained from the sum of the maximum similarity values returned by the frame similarity metric, computed over the frames of T and H; and (iv) a matching of the frames of H and T, but taking into account the FrameNet-WordNet alignment.
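A hedged sketch of the WordNet-based factor described above: for every hypothesis lemma, take its best similarity against all the text lemmas and normalise the sum by the hypothesis length. The helpers wn_sim (standing in for Resnik, Lin, Jiang & Conrath or Pirrò & Seco), in_wordnet and smith_waterman are assumptions, not thesis code.

```python
def semantic_factor(h_lemmas, t_lemmas, wn_sim, in_wordnet, smith_waterman):
    total = 0.0
    for h in h_lemmas:
        if in_wordnet(h):
            # Best WordNet similarity of this hypothesis lemma against T.
            total += max(wn_sim(h, t) for t in t_lemmas)
        else:
            # String-level fallback for lemmas (e.g. entities) missing
            # from WordNet.
            total += max(smith_waterman(h, t) for t in t_lemmas)
    return total / len(h_lemmas)  # normalised by the hypothesis length
```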
D.4 Evaluation

For the evaluation of our system we distinguish between two types of evaluation: an intrinsic one, which consists of evaluating the system in environments specific to textual entailment, and an extrinsic one, which assesses how the system helps in other kinds of NLP tasks (the latter is shown in Section D.5).

As an evaluation framework we used the training and test corpora of the RTE-2 (http://pascallin.ecs.soton.ac.uk/challenges/rte2/), RTE-3 (http://pascallin.ecs.soton.ac.uk/challenges/rte3/) and RTE-4 (http://www.nist.gov/tac/tracks/2008/rte/) challenges, which we believe are the most suitable ones available today for evaluating textual entailment systems.

Prior to running the test, we used the training corpus to carry out a selection of the most representative features of each perspective, as well as of all the perspectives together. This strategy consisted of selecting the best features according to their information gain and the individual results they obtained in a cross-validation over the training corpus. Thus, we start from the complete feature set and discard features according to their information gain, provided that the cross-validation values do not decrease when each one is removed. This strategy can be considered a top-down strategy influenced by the information gain values of each feature.
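The selection strategy just described could be sketched as follows: starting from the full set, features are visited from lowest information gain upwards and dropped whenever their removal does not decrease cross-validated accuracy. The helpers info_gain and cv_accuracy are assumed, not thesis code.

```python
def select_features(features, info_gain, cv_accuracy):
    """Top-down feature elimination guided by information gain."""
    kept = list(features)
    baseline = cv_accuracy(kept)
    for f in sorted(features, key=info_gain):  # least informative first
        trial = [g for g in kept if g != f]
        if trial and cv_accuracy(trial) >= baseline:  # no accuracy drop
            kept = trial
            baseline = cv_accuracy(kept)
    return kept
```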
Tables D.1 and D.2 show the results obtained with each perspective individually and with the information of all of them together. The results are also broken down into positive and negative entailments.

Examining the results, we can observe that the syntactic perspective is the one that obtains the lowest results. This was to be expected, since it is much harder to find purely syntactic similarities that model the entailments. A peculiar case occurs for RTE-4 with this perspective, as it obtains the same accuracy values regardless of the training corpus used (RTE-2, RTE-3 or both, RTE-2&3). This is due to the idiosyncrasies of the RTE-4 corpus, which leave no fuzzy pairs: according to the training phase, pairs are either clear enough to be labelled as positive, or clearly negative. On the other hand, the perspective that obtains the best individual results is the lexical one, which, as we said before, is not surprising either, given that this kind of inference performs very well for entailment recognition.

RTE-2                    Training         Test
                         10-f cross val.  Global   IE       IR       QA       SUM
BASE yes                 0.5000           0.5000   0.5000   0.5000   0.5000   0.5000
Best lexical set         0.6450           0.5875   0.5150   0.6150   0.5250   0.6950
  YES pairs: Prec. 0.570  Recall 0.713  F 0.633 | NO pairs: Prec. 0.617  Recall 0.463  F 0.529
Syntactic feature        0.6125           0.5613   0.4950   0.5850   0.5250   0.6400
  YES pairs: Prec. 0.542  Recall 0.793  F 0.644 | NO pairs: Prec. 0.614  Recall 0.330  F 0.429
Best semantic set        0.6093           0.5962   0.5050   0.6900   0.5300   0.6600
  YES pairs: Prec. 0.573  Recall 0.758  F 0.652 | NO pairs: Prec. 0.642  Recall 0.435  F 0.519
Best set, all features   0.6512           0.5975   0.5100   0.6600   0.5450   0.6750
  YES pairs: Prec. 0.576  Recall 0.740  F 0.648 | NO pairs: Prec. 0.636  Recall 0.455  F 0.531

RTE-3
BASE yes                 0.5150           0.5125   0.5250   0.4350   0.5300   0.5600
Best lexical set         0.7112           0.6700   0.5100   0.7450   0.8600   0.5650
  YES pairs: Prec. 0.638  Recall 0.822  F 0.719 | NO pairs: Prec. 0.732  Recall 0.510  F 0.601
Syntactic feature        0.6400           0.5938   0.5050   0.6500   0.6450   0.5750
  YES pairs: Prec. 0.583  Recall 0.724  F 0.646 | NO pairs: Prec. 0.612  Recall 0.456  F 0.523
Best semantic set        0.6884           0.6450   0.5300   0.7050   0.7650   0.5800
  YES pairs: Prec. 0.614  Recall 0.829  F 0.705 | NO pairs: Prec. 0.715  Recall 0.451  F 0.553
Best set, all features   0.7173           0.6775   0.4950   0.7450   0.8700   0.6000
  YES pairs: Prec. 0.646  Recall 0.820  F 0.723 | NO pairs: Prec. 0.736  Recall 0.528  F 0.615

Table D.1: Results for RTE-2 and RTE-3.

Finally, it is worth highlighting that the system configuration combining the best features derived from all the perspectives improved the individual results of each of them. This motivates and justifies our hypothesis of achieving textual entailment recognition by making use of different perspectives, since different entailment cases can be handled from different points of view.

RTE-4                    Training   Test
                                    Global   IE       IR       QA       SUM
BASE yes                            0.5000   0.5000   0.5000   0.5000   0.5000
Best lexical set         RTE-2      0.5980   0.5300   0.6967   0.5100   0.6400
  YES pairs: Prec. 0.581  Recall 0.702  F 0.636 | NO pairs: Prec. 0.624  Recall 0.494  F 0.551
                         RTE-3      0.5800   0.4967   0.6667   0.4800   0.6750
  YES pairs: Prec. 0.556  Recall 0.796  F 0.655 | NO pairs: Prec. 0.641  Recall 0.364  F 0.464
                         RTE-2&3    0.5880   0.5100   0.6867   0.4850   0.6600
  YES pairs: Prec. 0.565  Recall 0.762  F 0.649 | NO pairs: Prec. 0.635  Recall 0.414  F 0.501
Syntactic feature        RTE-2      0.5520   0.5067   0.6000   0.5150   0.5850
  YES pairs: Prec. 0.538  Recall 0.730  F 0.620 | NO pairs: Prec. 0.581  Recall 0.374  F 0.455
                         RTE-3      0.5520   0.5067   0.6000   0.5150   0.5850
  YES pairs: Prec. 0.539  Recall 0.726  F 0.618 | NO pairs: Prec. 0.580  Recall 0.378  F 0.458
                         RTE-2&3    0.5520   0.5067   0.6000   0.5150   0.5850
  YES pairs: Prec. 0.539  Recall 0.726  F 0.618 | NO pairs: Prec. 0.580  Recall 0.378  F 0.458
Best semantic set        RTE-2      0.6170   0.5367   0.7100   0.5550   0.6600
  YES pairs: Prec. 0.594  Recall 0.740  F 0.659 | NO pairs: Prec. 0.655  Recall 0.494  F 0.563
                         RTE-3      0.5920   0.5167   0.6833   0.5350   0.6250
  YES pairs: Prec. 0.566  Recall 0.792  F 0.660 | NO pairs: Prec. 0.653  Recall 0.392  F 0.490
                         RTE-2&3    0.5970   0.5133   0.7100   0.5300   0.6200
  YES pairs: Prec. 0.573  Recall 0.762  F 0.654 | NO pairs: Prec. 0.645  Recall 0.432  F 0.517
Best set, all features   RTE-2      0.6240   0.5433   0.7267   0.5450   0.6700
  YES pairs: Prec. 0.610  Recall 0.688  F 0.647 | NO pairs: Prec. 0.642  Recall 0.560  F 0.598
                         RTE-3      0.6080   0.5467   0.6967   0.4900   0.6850
  YES pairs: Prec. 0.579  Recall 0.794  F 0.669 | NO pairs: Prec. 0.672  Recall 0.422  F 0.518
                         RTE-2&3    0.6090   0.5200   0.7100   0.5300   0.6700
  YES pairs: Prec. 0.581  Recall 0.778  F 0.666 | NO pairs: Prec. 0.665  Recall 0.440  F 0.529

Table D.2: Results for RTE-4.

Moreover, during our experimentation we noticed that each perspective usually behaves better or worse depending on the NLP task from which the text-hypothesis pair under consideration originates (IE, SUM, QA, IR). We therefore decided to run an experiment that included this task as one more feature of the system. Table D.3 shows the results obtained in this experiment: as can be observed, a slight improvement is achieved for RTE-2 and RTE-3, but not for RTE-4.
                                  Test
                                  Global   IE       IR       QA       SUM
RTE-2  Best set, all + task       0.6100   0.5100   0.6900   0.5400   0.7000
  YES pairs: Prec. 0.584  Recall 0.763  F 0.662 | NO pairs: Prec. 0.658  Recall 0.458  F 0.540
RTE-3  Best set, all + task       0.6887   0.5300   0.7600   0.8600   0.6050
  YES pairs: Prec. 0.655  Recall 0.829  F 0.732 | NO pairs: Prec. 0.751  Recall 0.541  F 0.629
RTE-4  Best set, all + task       0.6180   0.5367   0.7133   0.5450   0.6700
  YES pairs: Prec. 0.603  Recall 0.692  F 0.644 | NO pairs: Prec. 0.638  Recall 0.544  F 0.587

Table D.3: RTE results considering the task as one more feature.

With the aim of further justifying our hypothesis and of evaluating whether our perspectives are complementary, we built an oracle that takes the correct output whenever it is returned by any of our perspectives. This oracle gives us an idea of the potential value our system could reach with an ideal combination of the proposed perspectives. Table D.4 shows the oracle results, indicating that our perspectives could indeed complement each other appropriately and reach quite high results.

ORACLE         Test
               Global   IE       IR       QA       SUM
RTE-2 test     0.6963   0.5750   0.7500   0.6400   0.8200
RTE-3 test     0.7438   0.6050   0.8350   0.8950   0.6400
RTE-4 test     0.7210   0.6833   0.7933   0.6650   0.7250

Table D.4: Results obtained by the oracle over the RTE corpora.
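The oracle reduces to one line of logic: a pair counts as solved if any of the three perspectives classified it correctly, which gives an upper bound for an ideal combination. A sketch, assuming parallel lists of gold labels and per-perspective outputs:

```python
def oracle_accuracy(gold, lexical, syntactic, semantic):
    """Upper bound: a pair is correct if any perspective got it right."""
    hits = sum(g in (a, b, c)
               for g, a, b, c in zip(gold, lexical, syntactic, semantic))
    return hits / len(gold)
```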
Finally, within the RTE framework, we also wanted to evaluate whether the use of the two restrictions we proposed (based on verbs and entities) is beneficial for the entailment task. To this end, the restrictions were applied beforehand, discarding those pairs that did not satisfy them. We then applied the last configuration of our system, i.e. the one that takes the task into account as a feature. Table D.5 shows the results obtained.

                              Test
                              Global   IE       IR       QA       SUM
RTE-2  Entity restriction     0.6162   0.5100   0.7100   0.5450   0.7000
       Verb restriction       0.5900   0.4600   0.6600   0.5600   0.6800
       Both restrictions      0.5988   0.4700   0.6800   0.5650   0.6800
RTE-3  Entity restriction     0.6913   0.5350   0.7400   0.8850   0.6050
       Verb restriction       0.6438   0.4700   0.7100   0.8250   0.5700
       Both restrictions      0.6450   0.4750   0.6900   0.8400   0.5750
RTE-4  Entity restriction     0.6200   0.5467   0.7267   0.5400   0.6500
       Verb restriction       0.6170   0.5567   0.7233   0.5400   0.6250
       Both restrictions      0.6130   0.5600   0.7200   0.5350   0.6100

Table D.5: Results of applying the restrictions over the RTE corpora.

In terms of accuracy, applying the restrictions does not manage to improve the results (except when only the entity restriction is applied). However, what is most interesting is the amount of corpus discarded by these restrictions. For instance, the entity restriction discarded an average of 19% of the corpus, the verb restriction 21%, and both together 36%. This resulted in much faster processing by the system and, moreover, the results are very similar to those obtained without the restrictions.

D.4.1 Comparative evaluation

With the aim of assessing our system with respect to the official participants of the RTE challenges, Tables D.6, D.7 and D.8 show the positions that the different configurations of our system would have achieved.
First author - Team - Approach                      Accuracy
Hickl (LCC)                                         0.7538
Tatu (LCC)                                          0.7375
Zanzotto (Rome & Milan)                             0.6388
Adams (Dallas)                                      0.6262
Entity restriction                                  0.6162
Bos (Rome & Leeds)                                  0.6162
All perspectives + task                             0.6100
de Marneffe (Stanford)                              0.6050
Kouylekov (Irst (FBK) & Trento)                     0.6050
Marsi (Tilburg & Twente)                            0.6050
Vanderwende (Microsoft & Stanford)                  0.6025
Both restrictions                                   0.5988
All perspectives                                    0.5975
Herrera (UNED)                                      0.5975
Semantic perspective                                0.5962
Nielsen (Colorado)                                  0.5962
Verb restriction                                    0.5900
Burchardt (Saarland / SALSA)                        0.5900
Katrenko (Amsterdam)                                0.5900
Rus (Memphis)                                       0.5900
Lexical perspective                                 0.5875
Inkpen (Ottawa)                                     0.5825
Litkowski (CL Research)                             0.5813
Syntactic perspective                               0.5613
Ferrández (our official RTE-2 system)               0.5563
Schilder (Thomson & Minnesota)                      0.5550
Kozareva (Alicante)                                 0.5500
Clarke (Sussex)                                     0.5475
Delmonte (Venice)                                   0.5475
Newman (Dublin)                                     0.5437
Nicholson (Melbourne)                               0.5288

Table D.6: Comparative evaluation with the RTE-2 2006 participants.

First author - Team - Approach                      Accuracy
Hickl (LCC)                                         0.8000
Tatu (LCC)                                          0.7225
Entity restriction                                  0.6913
Iftene (UAIC)                                       0.6913
All perspectives + task                             0.6887
All perspectives                                    0.6775
Lexical perspective                                 0.6700
Adams (Dallas)                                      0.6700
Wang (DFKI)                                         0.6687
Zanzotto (Rome & Milan)                             0.6675
Blake (North Carolina)                              0.6585
Ferrández (our official RTE-3 system)               0.6563
Li (Atlanta)                                        0.6488
Both restrictions                                   0.6450
Semantic perspective                                0.6450
Verb restriction                                    0.6438
Chambers (Stanford)                                 0.6362
Rodrigo (UNED)                                      0.6312
Burchardt (Saarland / SALSA)                        0.6262
Roth (CCG)                                          0.6262
Settembre (Buffalo)                                 0.6262
Malakasiotis (Athens)                               0.6175
Ferres (TALP)                                       0.6150
Litkowski (CL Research)                             0.6125
Bar-Haim (Bar-Ilan & Tel Aviv)                      0.6112
Montejo-Raez (UJA)                                  0.6038
Syntactic perspective                               0.5938
Marsi (Tilburg & Twente)                            0.5913
Delmonte (Venice)                                   0.5875
Harmeling (Edinburgh)                               0.5775
Burek (Open Univ.)                                  0.5500
Bobrow (Palo Alto)                                  0.5150
Clark (Seattle, Marina del Rey & Princeton)         0.5088
Baral                                               0.4963

Table D.7: Comparative evaluation with the RTE-3 2007 participants.

First author - Team - Approach                      2-way classification
Bensley (LCC)                                       0.746
Iftene (UAIC)                                       0.721
Wang (Saarland & DFKI)                              0.706
Siblini (Concordia)                                 0.688
Li (Tsinghua)                                       0.659
All perspectives                                    0.6240
Entity restriction                                  0.6200
Mohammad (Maryland)                                 0.619
All perspectives + task                             0.6180
Verb restriction                                    0.6170
Semantic perspective                                0.6170
Pado (Stanford)                                     0.614
Both restrictions                                   0.6130
Ferrández (our official RTE-4 system)               0.608
Nielsen (Colorado)                                  0.606
Lexical perspective                                 0.5980
Zanzotto (Rome, Saarland & Trento)                  0.59
Agichtein (Emory)                                   0.588
Bar-Haim (Bar-Ilan & Tel Aviv)                      0.584
Shen (Edinburgh)                                    0.582
Galanis (Athens)                                    0.578
Castillo (Cordoba)                                  0.571
wlvuk                                               0.571
Cabrio (FBK-Irst)                                   0.57
Ageno (TALP)                                        0.563
Syntactic perspective                               0.5520
Rodrigo (UNED)                                      0.549
Clark (Seattle)                                     0.547
Krestel (Hannover & Concordia)                      0.54
Varma (IIIT Hyderabad)                              0.531
Glinos (SAIC)                                       0.526
Montalvo-Huhn (Fitchburg)                           0.526
Yatbaz (Koc)                                        0.519
Bergmair (Cambridge)                                0.516

Table D.8: Comparative evaluation with the RTE-4 2008 participants.
D.4.2 Additional experiments

To extend the evaluation of our system, two further experiments were carried out: the first on the three-way entailment classification task introduced in the latest RTE, and the second on the Microsoft paraphrase corpus.

Three-way entailment classification

This task was introduced as a pilot task in RTE-3 and as an official task in RTE-4. It consists of detecting three types of entailment judgement: (1) CONTRADICTION, when the entailment relation is not supported at all by the text pair; (2) UNKNOWN, when there is not enough information to determine whether or not the entailment holds; and (3) ENTAILMENT, when the entailment relation holds. Consequently, participating systems must label each pair with one of the three previous types. As training corpus, the RTE organisers provided the RTE-3 corpus annotated with the three entailment types. Our system was therefore trained with the feature selection we performed for RTE-3 and run to carry out three-way detection. Table D.9 shows the results of our system. These results reveal that, although our system was not designed to handle three types of entailment, its behaviour in this task is quite encouraging. In fact, with this configuration our system would have obtained the fourth position among all the participants in this task.

RTE-4 3-way              Training   Test
                                    Global   IE       IR       QA       SUM
Lexical perspective      0.6737     0.5570   0.4767   0.6233   0.5150   0.6200
Syntactic perspective    0.6150     0.5300   0.4700   0.6000   0.5150   0.5300
Semantic perspective     0.6608     0.5560   0.4833   0.6600   0.4800   0.5850
All perspectives         0.6834     0.5610   0.5067   0.6333   0.4750   0.6200

Table D.9: Results for the RTE-4 three-way entailment task.
Detecting paraphrases

Paraphrasing and textual entailment are two very close concepts; in fact, a paraphrase can be considered a bidirectional entailment relation. We therefore considered it relevant to test our system on paraphrase detection. To do so, we used the Microsoft paraphrase corpus (Dolan et al., 2004), considering each paraphrase (P1 ↔ P2) as two entailment relations (P1 → P2 and P2 → P1), and applied our system in two ways.

(i) Adjusting the system for paraphrase detection by taking the mean of the two unidirectional entailment relations:

    sim(P1 ↔ P2) = (sim(P1 → P2) + sim(P2 → P1)) / 2        (D.1)

(ii) Adapting the corpus as follows:

For positive paraphrase pairs (P1 ↔ P2), we derive two positive entailment pairs (P1 → P2 and P2 → P1). In the first one P1 is T and P2 is H, whereas in the second one P2 is T and P1 is H.

For negative paraphrase pairs (where P1 ↔ P2 does not hold), we cannot deduce whether the paraphrase fails because P1 → P2 does not hold, because P2 → P1 does not hold, or both. However, in order to keep the original proportion of positive and negative examples in the corpus, and to make this transformation automatic, we assume that a negated paraphrase evokes two negative entailment pairs (neither P1 → P2 nor P2 → P1 holds).

As a result we obtained a corpus composed of twice as many textual entailment pairs, derived from the paraphrase pairs. Regarding the evaluation, the positive or negative value of each paraphrase is derived from the values of the entailments composing it. Table D.10 shows when two textual entailments determine a paraphrase.

P1 → P2   P2 → P1   P1 ↔ P2
YES       YES       YES
YES       NO        NO
NO        YES       NO
NO        NO        NO

Table D.10: When two entailments determine a paraphrase.

Finally, to evaluate the system we used the training and test configurations of the Microsoft corpus that come with the corpus itself.
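Both ways of using the system reduce to a few lines. A sketch, where entails stands in for our entailment system (returning a similarity score for approach (i) and a YES/NO decision for the rule of Table D.10):

```python
def paraphrase_score(p1, p2, entails):
    """Approach (i): mean of the two unidirectional entailment scores (D.1)."""
    return (entails(p1, p2) + entails(p2, p1)) / 2

def is_paraphrase(p1, p2, entails):
    """Decision rule of Table D.10: both directions must hold."""
    return entails(p1, p2) and entails(p2, p1)
```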
Regarding the features of our system, we created two sets by taking the intersection and the union of the best features derived from processing all the perspectives over the RTE-2 and RTE-3 corpora. In addition, we established two baselines, with all paraphrase values set to positive or negative, respectively. Table D.11 shows the results obtained. As the baselines indicate, the paraphrase corpus is not a balanced corpus, as the RTE ones were.

MS paraphrase corpus              Training             Test
                                  10-fold cross val.   Accuracy
BASE NO                                                0.3351
BASE YES                                               0.6649
Approach (i), adjusting the system
  Intersection features           0.7218               0.7119
  Union features                  0.7407               0.7310
Approach (ii), adapting the corpus
  Intersection features           0.7229               0.7165
  Union features                  0.7489               0.7362

Table D.11: Results obtained over the Microsoft paraphrase corpus.

D.5 Applicability to other NLP tasks

Besides evaluating our system on tasks specific to the detection of entailment relations, we wanted to assess whether our system could be of help in other NLP tasks. Specifically, and because the research lines of our group are closely related to these tasks, we applied the system in Question Answering settings, in automatic summarization, and in solving the particular task of semantically linking Wikipedia categories to WordNet glosses.
D.5.1 Textual entailment in Question Answering

We tested the applicability of our system to Question Answering tasks in two ways: (1) in the Answer Validation Exercise (AVE) task of CLEF (http://nlp.uned.es/clef-qa/ave/); and (2) within the framework of the European project QALL-ME (http://qallme.fbk.eu/).

The AVE task

This task consists of validating the answers provided by a Question Answering system for a given question. This validation has to be carried out with knowledge of the question, the answer and the passage or document snippet from which the answer was extracted. The organisers of this CLEF task provided the corpora needed for training and testing the systems. In our participations, our strategy has been to compose the hypothesis from the question and the answer, forming a declarative sentence (using regular expressions for this purpose), and to treat the passage as the text from which the entailment is deduced. Therefore, once the corpora had been transformed, we trained the system and ran the test to check how it behaved in this new task. The official results of our participation in the latest AVE reached the second and first positions among the participating teams for the English and Spanish tasks, respectively. Our paper (Ferrández et al., 2008) details the particulars of applying our system to this task.

The QALL-ME project

QALL-ME, Question Answering Learning technologies in a multilingual and Multimodal Environment (http://qallme.itc.it/), is a European project (reference FP6-IST-033860) involving several academic institutions and companies. (The academic institutions in QALL-ME are FBK-irst (Italy), DFKI (Germany), the University of Wolverhampton (UK) and the University of Alicante (Spain); the companies involved are Comdata (Italy), Ubiest (Italy) and Waycom (Italy).)
The QALL-ME project focuses on satisfying user needs in the tourism domain, for instance questions such as Where can I see the movie Casino Royale tonight?. Solving this kind of question shows the enormous business potential of this project. The research strand that links our textual entailment system to the QALL-ME project consists of inferring the requirements expressed by new questions, which users pose to the system, from a predefined set of user question patterns. To do so, the first step was to model the domain by means of an ontology and to populate it with instances provided by tourism suppliers or extracted from the web. With this ontology, a set of scenarios was created, each one linked to a SPARQL (http://www.w3.org/tr/rdf-sparql-query/) statement capable of retrieving the necessary data. These scenarios were shown to users so that they would pose possible questions about the data in our ontology. It was from these user questions that we created the patterns, making generic instantiations of the ontological concepts and/or entities appearing in them. Therefore, when a new question is processed, the textual entailment system is in charge of deducing its meaning (what is being asked about) by considering the previously generated set of user question patterns.

The QALL-ME prototype (Sacaleanu et al., 2008) is currently configured to solve questions related to the Cinema and Accommodation areas of the tourism domain; a demo of the system is available at http://qallme.itc.it/server/demo/. Our paper (Ferrández et al., 2009), recently published in the Information Processing & Management journal, specifies the details of the application of our system and its evaluation. In short, new inferences, such as weighting the importance of interrogative terms, were added in order to better handle entailments between questions and, as for the evaluation, the system obtained accuracy values of around 80%.
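As an illustrative sketch only (the names, data shapes and acceptance threshold are ours, not the project's): the pattern-matching step scores the incoming question against every stored pattern with the entailment system and runs the SPARQL query attached to the best-scoring one.

```python
def answer(question, patterns, entails, run_sparql, threshold=0.5):
    """patterns: list of {'text': pattern question, 'sparql': query} dicts."""
    best = max(patterns, key=lambda p: entails(question, p["text"]))
    if entails(question, best["text"]) >= threshold:
        return run_sparql(best["sparql"])
    return None  # no pattern was entailed with enough confidence
```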
D.5.2 Textual entailment in summarization

The idea of applying our system to summarization tasks is grounded on the prior generation of a summary containing the most significant sentences of the text, regardless of the summary length or the position of the sentences in the document (in principle, all sentences are equally relevant). This preliminary summary serves as input to summarization techniques, and our goal is to assess whether the prior application of our system improves the final creation of the summary. To the best of our knowledge, several summarization approaches have used textual entailment techniques and/or systems (Harabagiu et al., 2007; Tatar et al., 2008), but none as a preliminary generation step for the summary.

Our textual entailment system therefore generates a preliminary summary in the following way, where S1 S2 S3 S4 S5 S6 are all the sentences of the document and SUM is the preliminary summary obtained by our system:

SUM = {S1}
Does SUM entail S2?  NO   →  SUM = {S1, S2}
Does SUM entail S3?  NO   →  SUM = {S1, S2, S3}
Does SUM entail S4?  YES  →  SUM = {S1, S2, S3}
Does SUM entail S5?  YES  →  SUM = {S1, S2, S3}
Does SUM entail S6?  NO   →  SUM = {S1, S2, S3, S6}

This preliminary summary was processed by an automatic summarization approach developed in our group, based on word frequency. The use of entailment brought an average improvement of 6% over the individual processing of the word frequency technique. Furthermore, it is worth noting that, using the preliminary summary, the number of sentences to process was reduced by around 71%.
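The walkthrough above transcribes almost directly into code; entails again stands in for the entailment system's binary decision:

```python
def preliminary_summary(sentences, entails):
    """Keep a sentence only if the summary built so far does not entail it."""
    summary = [sentences[0]]
    for s in sentences[1:]:
        if not entails(" ".join(summary), s):
            summary.append(s)
    return summary
```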
Our papers (Lloret et al., 2008b; Lloret et al., 2008a) present and detail the applicability of our system to summary generation.

D.5.3 Textual entailment in linking Wikipedia categories to WordNet glosses

The goal of associating Wikipedia categories with WordNet glosses is motivated by the automatic construction of a named entity repository (Toral et al., 2008), a resource developed by our research group. We therefore wanted to test whether, by applying semantic similarity techniques and, in particular, our textual entailment system, we could associate the descriptions in WordNet glosses with the abstracts of Wikipedia categories. As for our system, since we did not know a priori in which direction the entailment could hold, we obtained the entailment factor from the mean of the unidirectional entailments between the WordNet gloss and the Wikipedia category (just as we did for paraphrase detection), and to assess how the system behaved in this task we compared it with other semantic similarity methods: Personalized PageRank (Agirre et al., 2009a) and Semantic Vectors (Widdows & Ferraro, 2008). For the evaluation, a manually annotated corpus composed of 207 matchings was created, together with two baselines: one taking the first WordNet sense, and another computing a simple word matching. Table D.12 shows the results obtained. Several experiments were carried out: (i) training the system with the AVE and RTE corpora; (ii) returning the similarity factor without a training phase and taking the highest one as correct; and (iii) supervised versions of the algorithms, considering the same corpus for training and testing (by means of cross-validation). As can be seen in the table, although the behaviour of the system when trained with the AVE and RTE corpora was not as expected, for the remaining configurations our system obtained the best values, outperforming the other methods considered.
Method                                                  Accuracy
First sense                                             64.7%
Simple matching                                         62.7%
Semantic Vectors                                        54.1%
Semantic Vectors (supervised)                           70.27%
Personalised PageRank                                   64.3%
Personalised PageRank (supervised)                      73.26%
Textual Entailment (trained on AVE 07-08 + RTE-3)       52.8%
Textual Entailment (no training)                        64.7%
Textual Entailment (supervised)                         77.74%

Table D.12: Results for the task of linking Wikipedia categories to WordNet glosses.

D.6 Conclusions

The main goals of this thesis have been to expose, exemplify and discuss the most significant characteristics of textual entailment, in order to ground the construction of a modular and flexible system that tackles the task of resolving textual entailments. By building a modular system, we managed to appropriately combine the different linguistic levels that can take part in the entailment. This is supported by our idea of resolving entailments from different perspectives, in our case three: Lexical, Syntactic and Semantic. In addition, an extensive evaluation of the system was carried out, designed both for the entailment resolution task itself and to assess the applicability of the system to other NLP tasks and/or applications.

D.6.1 Main contributions

With the experiments and results obtained in this thesis, we have shown that combining the knowledge provided by the features derived from our three perspectives is appropriate for textual entailment recognition. In fact, the use of these perspectives extends the recognition range, covering a wide variety of entailments and improving the final results of the system.
Apart from this, we highlight the following contributions of our work:

We have measured the impact of lexical and syntactic inferences which, although trivial, are of great help in resolving entailments. In addition, new configurations of these inferences were proposed with the aim of better facing the textual entailment task.

Regarding more complex linguistic analyses, we have evaluated the benefits of incorporating resources such as WordNet, FrameNet, VerbNet and VerbOcean. Furthermore, we studied the influence of the verbs and entities involved in the entailment.

We have implemented two new FrameNet-based resources which, although used in this thesis for textual entailment tasks, could be applied in other settings. These resources are: the frame similarity measure and the FrameNet-WordNet alignment. The aforementioned resources allow us to incorporate complex semantic information for the resolution of textual entailment.

Regarding the applicability of our system to other NLP tasks, and therefore its extrinsic evaluation:

In summarization tasks, the entailment system was used to generate a preliminary summary serving as input to the automatic summarization system, thus improving the individual results of the summarization system. To the best of our knowledge, we are the first to use textual entailment in this way within summarization techniques.

In Question Answering, we successfully faced two different tasks: the validation of answers in the AVE task of CLEF (Ferrández et al., 2008; Rodrigo et al., 2008a), and the development of a restricted-domain system based on textual entailments between new questions and a predefined set of user question patterns (Ferrández et al., 2009), carried out within the European project QALL-ME.
In the task of associating Wikipedia categories with WordNet glosses to automatically enrich a named entity repository, the suitability of our system for processing semantic similarities was demonstrated.

In summary, the research presented throughout this thesis reveals the importance of combining features derived from different linguistic perspectives. As a final result, a textual entailment system was implemented that makes use of a variety of features extracted from lexical, syntactic and semantic analyses.

D.6.2 Future work

As future work, we plan to keep improving our perspective-based textual entailment system by adding new features and by refining the way they are combined:

Regarding the inferences based on named entities, we plan to extend their reasoning with respect to date expansion, metonymy, etc.

Exploiting the knowledge encoded in Wikipedia. There are interesting works (Zesch & Gurevych, 2007; Zesch et al., 2008) that use Wikipedia and Wiktionary to represent semantic relations. Following this line, we want to take advantage of Wikipedia to obtain semantically related terms.

Besides Wikipedia, ontologies could be used for the generation of semantic conceptual networks capable of providing deep semantic knowledge. The SUMO ontology (http://www.ontologyportal.org/), as well as ontologies restricted to specific domains, could be used for this purpose.

Finally, and following our idea of performing extrinsic evaluations, we want to apply our system to plagiarism detection tasks. The system could be trained with the aim of detecting how semantically similar two texts have to be in order to be considered plagiarism, or which fragments of a text are likely to have been plagiarised.
E Bio-sketch and Research Projects Relative to this Thesis

Óscar Ferrández, the author of this PhD dissertation, has belonged to the University of Alicante GPLSI research group (http://gplsi.dlsi.ua.es) since September 2004. After finishing his bachelor's degree in Computer Science, he became a member of the GPLSI group. During this time, he has been involved in various fellowships and research projects. Next, we show a brief list of them in chronological order:

Dec. 03 to Nov. 06: R2D2 project (Recovery of Answers in Digital Documents), funded by the Spanish Government (Ministerio de Ciencia y Tecnología, MEC), reference number: TIC2003-07158-C04-01.

Jan. 04 to Jan. 06: Autonomous project: The Development of a Text Classifier for the Administrative Domain, funded by the Generalitat Valenciana, reference number: GV04B-276.

Sept. 04 to Dec. 04: Fellowship grant within an autonomous project subsidized by the Generalitat Valenciana (Spain).
Jan. 05 to Oct. 06: FPTAI grant by the Generalitat Valenciana (Spain). This grant is intended for training researchers.

Jan. 06 to Dec. 09: TEXT-MESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies), subsidized by the Spanish Government (Ministerio de Educación y Ciencia, MEC), reference number: TIN2006-15265-C06-01.

Oct. 06 to Nov. 09: QALL-ME project (Question answering learning technologies in a multilingual and multimodal environment), a European project which is a 6th Framework Research Programme of the European Union (EU), contract number: FP6-IST-033860 (http://qallme.itc.it/).

Nov. 06 to present: Three-year contract under the European project QALL-ME.

Sept. 08 to Dec. 08: Visiting Fellowship as part of the 2007 Visiting Senior Scientist program at the International Computer Science Institute (ICSI), Berkeley, California, US.