Work Package 4 Deliverable 4.1. Textual Entailment Open Platform, cycle I


This document is part of the EXCITEMENT project, funded by the 7th Framework Programme of the European Commission through grant agreement no.:

EXploring Customer Interaction via Textual EntailMENT

Work Package 4
Deliverable 4.1. Textual Entailment Open Platform, cycle I

Authors: Bernardo Magnini, Roberto Zanoli, Vivi Nastase, Rui Wang, Meni Adler, Asher Stern, Tae-Gil Noh
Dissemination Level: Public
Date: December 31st, 2012

Grant agreement no.:
Project acronym: EXCITEMENT
Project full title: EXploring Customer Interaction via Textual EntailMENT
Funding scheme: STREP
Coordinator: Moshe Wasserblat (NICE)
Start date, duration: 1 January 2012, 36 months
Distribution: Public
Contractual date of delivery: 31/12/2012
Actual date of delivery: 31/12/2012
Deliverable number: D4.1
Deliverable title: Textual Entailment Open Platform, Cycle I
Type: Report
Status and version: Final version 1.1
Number of pages: 35
Contributing partners: FBK, DFKI, BIU, UHEI
WP leader: FBK
Task leader: FBK
Authors: Bernardo Magnini, Roberto Zanoli, Vivi Nastase, Rui Wang, Meni Adler, Asher Stern, Tae-Gil Noh
EC project officer: Carola Carstens

The partners in EXCITEMENT are:
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
NICE Systems, Israel
Fondazione Bruno Kessler (FBK), Italy
Bar-Ilan University, Israel
Heidelberg University, Germany
OMQ, Germany
ALMAWAVE, Italy

For copies of reports, updates on project activities and other EXCITEMENT-related information, contact:
NICE Systems EXCITEMENT
Moshe Wasserblat, [email protected]
Hapnina 8, Ra'anana, Israel
Phone: +972 (9)
Fax: +972 (9)

Copies of reports and other material can also be accessed via ,

The Individual Authors. No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Table of Contents

1 Introduction
1.1 EOP Cycle I: Main Characteristics
1.2 Architecture Specifications
1.3 EOP I Cycle development plan
2 Task 4.1: Implementation of the platform architecture
2.1 Development environment
    Topology of Git repositories
    Storage
    Code contribution
    Project dependencies
3 Task 4.2: Linguistic Analysis Pipeline
3.1 Preprocessing for Italian
3.2 Preprocessing for German and English
4 Task 4.3: Entailment Algorithms
4.1 Lexical Edit Distance: EDA and component
4.2 Classification-based EDA
5 Task 4.4: Entailment Resources
6 Task 4.5: Combination and interoperability of Entailment Components
6.1 Linguistic annotation interoperability
6.2 Resource interoperability
6.3 EDA interoperability
6.4 Component interoperability
7 Development plans for the second cycle
Appendix A: UIMA-CAS example

1 Introduction

According to the project's DoW, work package 4 ("Textual Entailment Open Platform") will develop the component-based approach proposed in EXCITEMENT, realizing the Open Platform for Textual Entailment Inference. This deliverable describes the progress in the development of the Excitement Open Platform (EOP) carried out under WP4. We address five tasks within WP4, specifically:

T4.1 Implementation of the platform architecture;
T4.2 Linguistic analysis pipeline (pre-processing);
T4.3 Entailment algorithms;
T4.4 Entailment resources;
T4.5 Combination of entailment components.

For each task we show the progress after one year of activity (i.e. the first development cycle of the project) and highlight the next steps.

1.1 EOP Cycle I: Main Characteristics

The first prototype of the Excitement Open Platform (EOP-v1.0) has the following main characteristics:

Based on three existing systems. The EOP entailment engines implement some of the main functionalities of the three existing entailment engines (BIUTEE, EDITS, TIE) from which the platform originates.

Multilinguality. Given input Text-Hypothesis pairs in one of the three languages of the project (i.e. English, German, Italian), the EOP produces corresponding entailment judgments (i.e. Entailment, Not-Entailment).

Conformance to the component-based approach. Existing functionalities of the entailment engines are adapted to the blocks of the EOP's architecture (defined in WP3), which includes: pipeline, entailment decision algorithms (EDAs), knowledge components, distance components, and resources.

Learning capacity. The EOP is able to build models on training data, which are then evaluated on corresponding test data. For this purpose we use the RTE-3 dataset in the three languages.

Focus on lexical entailment. We start from entailment issues that are based on lexical phenomena in the three languages, for which all the academic partners have experience.
For each language we identify lexical resources contributing to entailment (e.g. WordNet, Wikipedia), and build corresponding knowledge components, distance components, and EDAs. Typical phenomena responsible for lexical entailment include: synonymy, hypernymy, antonymy, morphological derivations, named entity variations, acronyms, abbreviations, world knowledge about people and locations, etc. Although the focus for EOP-v1 is lexical entailment, we plan to move rapidly to syntactic phenomena thanks to the components and EDAs developed in BIUTEE and TIE.

Interoperability. We intend to demonstrate the advantages of the component-based approach in terms of interoperating components at several levels of the EOP architecture, including:
- the pipeline level: a shared UIMA-CAS for the three languages;
- the EDA level: the same EDAs used by all languages (e.g. word overlap, edit distance);
- the component and resource level: shared APIs of knowledge components, e.g. the same high-level APIs for the three WordNets.

Open source distribution. We aim to show that EOP-v1 is already synchronized with the goal of open source diffusion, and for each resource/software included in the prototype we provide the information required for its distribution. This activity has been synchronized with the activity in WP8.

Relation with the project's application scenarios. We intend to show that the first prototype already serves at least some of the requirements of the two industrial applications foreseen in the project: entailment-based graph exploration and entailment-based retrieval. This activity has been synchronized with the activity in WP6 on the Transduction Layer.

Demonstration. In order to make the first prototype as understandable as possible to potential users, we have been developing an EOP-v1 demonstrator based on a minimal web interface that: (i) allows the user to choose a configuration among a library of already experimented configurations; (ii) allows the user to insert a T-H pair; (iii) provides a library of built-in examples; (iv) provides a basic explanation of the system's behavior.
1.2 Architecture Specifications

This document assumes the architecture specifications described in Deliverable D3.1 (specifically version D, December 2012), as well as the terminology used in that document.

1.3 EOP I Cycle development plan

The following incremental releases of the platform were planned and then realized during the first year of the project.

EOP v0.1, October 15, 2012 ("proof of concept")

This release (see Figure 1) is mainly intended as a proof of concept of the architectural design defined in Deliverable 3.1a. EOP-v0.1 features:

Linguistic annotation: the minimum requirement for each language is tokenization; additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
EDA: a single EDA based on Edit Distance on tokens. The entailment decision is based on an edit distance threshold between T and H, estimated on a training set.
Components: a single, language-independent component based on Edit Distance on tokens (i.e. the one implemented in EDITS).

Figure 1: EOP-v0.1 (architecture diagram: the Italian, German and English LAPs, each providing DKPro tokenization, feed UIMA-CAS to the Edit Distance EDA from EDITS, which uses the Edit Distance on Tokens distance component with fixed costs of operations and is tested on single T/H pairs, producing an entailment/not entailment judgment).
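As a rough illustration of the v0.1 decision rule (not the actual EDITS code; class name, normalization and threshold value are hypothetical), the following sketch computes a token-level edit distance, normalizes it to [0, 1], and compares it against a threshold that the real platform would estimate on training data:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the EOP-v0.1 decision procedure: token-level
// edit distance between T and H, normalized, compared to a threshold.
// In the real platform the threshold is estimated on a training set.
public class TokenEditDistanceEDA {

    // Standard Levenshtein distance over token sequences
    // (insertion, deletion and substitution all cost 1).
    static int editDistance(List<String> t, List<String> h) {
        int[][] d = new int[t.size() + 1][h.size() + 1];
        for (int i = 0; i <= t.size(); i++) d[i][0] = i;
        for (int j = 0; j <= h.size(); j++) d[0][j] = j;
        for (int i = 1; i <= t.size(); i++) {
            for (int j = 1; j <= h.size(); j++) {
                int sub = t.get(i - 1).equals(h.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[t.size()][h.size()];
    }

    // Normalized distance: raw edit distance divided by the length of
    // the longer token sequence, so the score falls in [0, 1].
    static double normalizedDistance(List<String> t, List<String> h) {
        int max = Math.max(t.size(), h.size());
        return max == 0 ? 0.0 : (double) editDistance(t, h) / max;
    }

    // Entailment iff the distance is below the (hypothetical) threshold.
    static boolean entails(List<String> t, List<String> h, double threshold) {
        return normalizedDistance(t, h) < threshold;
    }

    public static void main(String[] args) {
        List<String> t = Arrays.asList("Hubble", "è", "un", "telescopio");
        List<String> h = Arrays.asList("Hubble", "non", "è", "un", "telescopio");
        System.out.println(normalizedDistance(t, h)); // one insertion over 5 tokens = 0.2
        System.out.println(entails(t, h, 0.1));       // false with a strict threshold
    }
}
```

Note that the negation example above illustrates a known weakness of purely lexical distance: T and H differ by a single token yet are contradictory, which is why the threshold must be tuned on annotated pairs.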

EOP v0.2, November 6, 2012 (basic lexical resources; Rome meeting)

This release (see Figure 2) aims at integrating into the platform a number of selected functionalities currently implemented for the three languages in the available systems (BIUTEE, EDITS, TIE) at the level of lexical entailment. The minimum expected capacity is lexical entailment (e.g. synonymy, hypernyms, morphological derivations, etc.) based on lexical resources.

Linguistic annotation: the minimum requirements for each language are tokenization, POS tagging, lemmatization and morphological analysis. Additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
EDA: to be defined for each language on the basis of the existing systems. No interoperability is required for this release.
Components: separate components for each of the three languages. The minimum requirement is the use of lexical knowledge in order to address typical phenomena of lexical variability.

Figure 2: EOP-v0.2 (architecture diagram: the Italian, German and English LAPs now produce tokens, lemmas and POS tags; through UIMA-CAS, the Edit Distance EDA from EDITS, tested on RTE-3 with fixed costs of operations, can use the Edit Distance on Tokens distance component from EDITS and the entailment-rule lexical component from BIUTEE, backed by the Italian, German and English WordNets and the Italian and English Wikipedias).

EOP v1, December 2012

This release (see Figure 3) aims at introducing some level of interoperability among the EDAs and components already used in version 0.2. This version is also the official deliverable of WP4 at month 12.

Linguistic annotation: same as for version 0.2; additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
EDA: the minimum requirement is at least one EDA (parameterized on languages) which can work on the three languages (e.g. Dan Roth's lexical similarity algorithm).
Components: the minimum requirement is at least one component (e.g. WordNets) which shares APIs for the three languages.
Training and test: the minimum requirement is training and test on RTE-3 in the three languages (or part of it if not available). Additional training and test data is welcome when available, particularly for English.
Configuration: a configuration file describes in a declarative way the parameters of the components and EDA used in a certain run of the platform.

Figure 3: EOP-v1.0 (architecture diagram: a Configurator drives the Italian, German and English LAPs (tokens, lemmas, POS); through UIMA-CAS, the Edit Distance EDA from EDITS and the classification-based EDA from TIE, both trained and tested on RTE-3, use the BoW similarity scoring component (TIE), the Edit Distance on Tokens distance component (EDITS), a German distributional similarity resource, and the entailment-rule lexical component (BIUTEE), backed by the Italian, German and English WordNets and the Italian and English Wikipedias).
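To make the declarative configuration concrete, here is a purely illustrative sketch of what such a file might look like. The element and attribute names below are invented for illustration; the actual EOP configuration schema is defined in the architecture specifications of Deliverable D3.1.

```xml
<!-- Hypothetical sketch only: element and attribute names are invented;
     the real EOP configuration format is specified in Deliverable D3.1. -->
<configuration language="IT">
  <lap class="TextProLAP"/>
  <eda class="EditDistanceEDA">
    <parameter name="threshold" value="0.4"/>
  </eda>
  <components>
    <component class="EditDistanceOnTokens"/>
    <component class="WordNetLexicalResource"/>
  </components>
  <dataset train="rte3_it_train.xml" test="rte3_it_test.xml"/>
</configuration>
```

The point of such a file is that a run of the platform is fully described by data rather than code: swapping the EDA or a component means editing the configuration, not recompiling.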

EOP v1.1, February 2013 (planned)

This release will be shown at the project's first review meeting (February 2013). All functionalities of version 1.0 will be carefully tested, and a minimal graphical interface will be realized for demo purposes.

Linguistic annotation: same as version 1.0.
EDA: same as version 1.0.
Components: same as version 1.0.
Training and test: same as version 1.0.
Graphical interface: a minimal interface for demo purposes, allowing the user to run the platform for training and testing on different configurations.

2 Task 4.1: Implementation of the platform architecture

This task focuses on defining the policies for software development adopted by the Excitement consortium, as well as on providing the main tools for managing the software and data used by EOP-v1.

2.1 Development environment

Given the complexity of the EOP platform, both in terms of the different groups contributing to its development and because of the different nature of the software involved, the definition and setting up of the development environment has required several decisions. The EOP platform, due to its characteristics, requires managing different kinds of code and data, with a significant level of complexity also with respect to other academic experiences (e.g. Moses in the Machine Translation field). We have analysed the software and the data used by the platform along two main dimensions: (1) whether the software/data has been developed as foreground of the Excitement project or by third parties; (2) whether the distribution licence of the software/data is free or restricted by specific constraints. On the basis of these characteristics, we have adopted a structured schema for software development. Tables 1 and 2 show, respectively, the schema of the software and data localization in the Excitement development environment: specifically, we distinguish a version control repository, for which we use Git, and an internal repository, for which we use Maven.

Table 1: EOP software management
- Java code, 3rd party software: internal Maven repository if under LGPL or an equivalent license; otherwise (e.g. GPL), up to the EOP user to install.
- Java code, Excitement code: version control repository.
- Non-Java code (e.g. C++), 3rd party software: version control repository if under LGPL or an equivalent license; otherwise, up to the EOP user to install.

Table 2: EOP data management
- 3rd party linguistic and knowledge resources: internal Maven repository if under a Creative Commons, research or equivalent license (e.g. Italian MultiWordNet); commercial or proprietary resources (e.g. GermaNet) are up to the EOP user to get.
- 3rd party RTE data sets: version control repository.
- Excitement linguistic and knowledge resources (e.g. WordNet rules, Italian Wikipedia): internal Maven repository.
- Excitement RTE data sets (e.g. Italian RTE3): version control repository.

The rest of this section provides details of the software and data repositories.

Topology of Git repositories

Git is a distributed revision control and source code management (SCM) system with an emphasis on speed. Git was initially designed and developed by Linus Torvalds for Linux kernel development and has since been adopted by many other projects. Every Git working directory is a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server. Git is free software distributed under the terms of the GNU General Public License version 2.

In order to handle the contribution of multiple groups to the public repository, we adopt a multi-layer Git repository topology, which also supports groups that wish to develop their code in a separate private repository. Group-only code is developed in two tiers, a developer tier and a group tier (sharing between group developers and backup), and is entirely private. The Excitement code is developed in three tiers: a private developer tier, a group tier (sharing between group developers and backup), and a public Excitement tier (sharing with all parties and backup), where code is pushed from lower tiers to upper tiers.

Figure 4: EOP schema for software development

The main Excitement code repository (1) includes all code that is shared as part of the Excitement project between all groups, and will eventually become public.

Each of the groups may maintain private code which is not part of the Excitement code (2), and Excitement code which is under development and not yet ready for sharing with other groups (3). The group repository will depend on the Excitement repository, but not the other way around. Each developer in each group can work with both repository instances at the same time (4 and 5 for developer 1, 6 and 7 for developer 2).

Storage

We identify three types of data which are not source code:
- Java third parties, e.g., jars that are referred to by the Excitement code but are not available in the Maven repository.
- Non-Java third parties, e.g., the EasyFirst parser for English.
- Data files, e.g., knowledge resources.

These types of data should not be part of the Git repository, and should be stored in an online shared repository.

Internal Maven repository

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information. We use Maven's mechanism for downloading from an additional repository to handle the download of the required files (e.g. jars). The purpose of the Maven repository is to work as an internal private repository for all software libraries and data files, such as lexical and knowledge resources. Storing Maven artefacts (e.g. jars) in a dedicated Maven repository is preferable to storing them in a version control system like GitHub for the following reasons: (i) libraries (jars) are binary files and do not belong in version control systems, which are better at handling text files that are frequently edited; (ii) it keeps the version control repository small; (iii) checkouts, updates and other actions on the version control system are quicker.

The Excitement Maven repository consists of three sub-repositories:
- Private internal repository: contains artefacts which are used only within the EOP project. These are manually uploaded by the developers. It is not synchronized with the Maven central repository, as its artefacts are private to the organization.
- 3rd party repository: contains artefacts which are publicly available but not in the Maven central repository, for example the latest versions of libraries which are not yet available there. This repository is not synchronized with the Maven central repository, given that the latter does not have these libraries or data files.
- Repo1-cache: synchronized with the Maven central repository; it is a cache of the artefacts from it.

Code contribution

Currently (i.e. for the first EOP prototype), groups work individually on their own modules. At later stages, when code contributors add their own code to different modules, the standard development procedure will be based on branch and merge. Each group of users (currently the four academic partners of the project) has identified a person responsible for merging. FBK, as work package leader, has the overall responsibility of merging the contributions from the different groups.

Project dependencies

The m2e (Maven to Eclipse) plugin allows easy generation of Maven projects and definition of dependencies. Ultimately, a pom.xml file is generated, defining the repositories and dependencies of the current project. It also defines the project's ID and version, which can then be used as a dependency in other Maven projects.

3 Task 4.2: Linguistic Analysis Pipeline

For the pre-processing of text-hypothesis (T-H) pairs, according to the specifications described in D3.1.3, we have adopted the UIMA-CAS format to ensure interoperability among the linguistic analyses for the different languages. For EOP-v1 the focus has been on the following linguistic modules: tokenization, lemmatization and part-of-speech tagging, as they are the most relevant for lexical entailment.

3.1 Preprocessing for Italian

For Italian pre-processing we use TextPro (Pianta et al. 2008), a suite of modular Natural Language Processing tools developed at FBK. The current version of the tool provides functions ranging from tokenization to part-of-speech tagging and Named Entity Recognition. The system's architecture is organized as a pipeline of processors, wherein each stage accepts data from the initial input or from the output of a previous stage, executes a specific task, and sends the resulting data to the next stage or to the output of the pipeline. To process Italian RTE data, a sample of which is shown in Example 1, we have built a wrapper around TextPro. The annotation produced for the text portion of the given pair is shown in Example 2.

<pair id="xxx" entailment="nonentailment" task="xxx">
  <t>Hubble è un telescopio.</t>
  <h>Hubble non è un telescopio.</h>
</pair>

Example 1: Italian Text/Hypothesis pair

TextPro annotations:

# FILE: ./esempio
# FIELDS: token tokenstart sentence pos lemma
Hubble      0   -      SPN  Hubble
è           7   -      VI   essere
un          9   -      RS   indet
telescopio  12  -      SS   telescopio
.           22  <eos>  XPS  full_stop

Example 2: TextPro output (tabular format)
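The tabular output above must be mapped to annotation objects before conversion to UIMA-CAS. The following sketch is hypothetical (the actual EOP wrapper code differs; class and field names are invented); it parses whitespace-separated TextPro rows into simple token records carrying form, offsets, POS tag and lemma:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one step of a TextPro wrapper: parse the
// tabular output (token, tokenstart, sentence, pos, lemma) into token
// records that could later be turned into UIMA-CAS annotations.
// Illustrative code, not the actual EOP wrapper.
public class TextProParser {

    public static class Token {
        public final String form, pos, lemma;
        public final int begin, end;
        public Token(String form, int begin, String pos, String lemma) {
            this.form = form;
            this.begin = begin;
            this.end = begin + form.length(); // offsets as in the CAS example
            this.pos = pos;
            this.lemma = lemma;
        }
    }

    // Parses rows like "Hubble  0  -  SPN  Hubble"; comment lines
    // starting with '#' and blank lines are skipped.
    public static List<Token> parse(String tabular) {
        List<Token> tokens = new ArrayList<>();
        for (String line : tabular.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] f = line.split("\\s+");
            if (f.length < 5) continue; // skip malformed rows
            tokens.add(new Token(f[0], Integer.parseInt(f[1]), f[3], f[4]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "# FILE: ./esempio\n"
                + "Hubble\t0\t-\tSPN\tHubble\n"
                + "è\t7\t-\tVI\tessere\n"
                + "un\t9\t-\tRS\tindet\n"
                + "telescopio\t12\t-\tSS\ttelescopio\n";
        List<Token> toks = parse(sample);
        System.out.println(toks.size() + " tokens; lemma of '"
                + toks.get(1).form + "' = " + toks.get(1).lemma);
    }
}
```

Records of this kind correspond directly to the Token, Lemma and POS annotations shown in the UIMA-CAS dump of Appendix A.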

Example 3: Italian pipeline output (UIMA-CAS)

3.2 Preprocessing for German and English

For pre-processing German data we use DKPro Core, a collection of software components for natural language processing based on the Apache UIMA framework. Many state-of-the-art NLP components are already freely available in the NLP research community. DKPro Core provides wrappers for such third-party tools as well as original NLP components. DKPro Core builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines. DKPro Core ASL contains those components of DKPro Core that are licensed under the Apache Software License (ASL) version 2. Additional components are available in DKPro Core GPL. Here is a brief (partial) list of the components included in DKPro Core:

tokenization/segmentation (BreakIterator, OpenNLP, LanguageTool, Stanford CoreNLP)
compound splitting (Banana Split, JWordSplitter)
stemming (Snowball)
lemmatization (TreeTagger)

part-of-speech tagging (TreeTagger, OpenNLP, Mecab)
syntactic parsing (OpenNLP, Stanford CoreNLP, Berkeley Parser)
dependency parsing (MaltParser, MstParser, Stanford CoreNLP)
coreference resolution (Stanford CoreNLP)
language identification (TextCat)
spelling correction (Jazzy)

Concerning the languages considered in EXCITEMENT (English, German, Italian), the components listed above are usually available for English and German. During EXCITEMENT we will work to provide wrappers for tools for Italian and for additional tools for English and German.

4 Task 4.3: Entailment Algorithms

During the first development cycle, this task has provided the basic entailment decision algorithms (EDAs) and the basic scoring and knowledge components, i.e. the algorithms in charge of estimating the distance between T and H. In the following we briefly describe the EDAs and distance components of EOP-v1.

4.1 Lexical Edit Distance: EDA and component

EOP-v1 includes both an EDA and a distance component derived from the EDITS system, developed by FBK, mostly in the context of the QALL-ME EU project. EDITS is a software package for recognizing entailment relations between two portions of text. It is based on edit distance algorithms, and computes the T-H distance as the cost of the edit operations that are necessary to transform T into H. EDITS is open source, available under the GNU Lesser General Public License (LGPL). The tool is implemented in Java; it runs on Unix-based operating systems, and has been tested on Mac OS X, Linux, and Sun Solaris. The latest release of the package (version XX, date) can be downloaded at http://edits.fbk.eu. EDITS implements a distance-based framework, which assumes that the probability of an entailment relation between a given T-H pair is inversely proportional to the distance between T and H (i.e. the higher the distance, the lower the probability of entailment).
Within this framework the system implements and combines different approaches to distance computation, providing both edit distance algorithms and similarity algorithms. Each algorithm returns a normalized distance score (a number between 0 and 1). At the training stage, distance scores calculated over annotated T-H pairs are used to estimate a threshold that best separates positive (entailment) from negative (non-entailment) examples. The threshold, which is stored in a model, is used at the test stage to assign an entailment judgment and a confidence score to each test pair.

4.2 Classification-based EDA

EOP-v1 includes a classification-based EDA derived from TIE, a textual entailment system developed by DFKI. TIE has three EDAs in its current implementation. Two EDAs are for special cases and only determine Contradiction (NER-based EDA) and Entailment/Paraphrase (DIRT-based EDA), while the last (the main EDA) can process any given H-T pair to produce the decision Paraphrase, Entailment, Contradiction, or Unknown. The classification-based EDA is the most general of the three EDAs in the TIE engine. It relies on the output of three types of components as features: a lexical level component, a syntactic level (dependency tree) component, and a semantic level (graph of semantic roles) component. The outputs consist of a set of scores: the lexical and syntactic components return 2 scores each, and the semantic level component returns 4 scores. Thus, for the main EDA, every H-T pair is represented as a set of 8 numbers. As for EOP-v1, only the lexical component (described below) has been migrated from TIE into the EXCITEMENT platform.

Step 1: Deciding the three basic relationships "relatedness", "inconsistency", "inequality"

The EDA first classifies an H-T pair with three independent classifiers: "Relatedness", "Inconsistency", and "Inequality". The EDA decomposes the four TE relations into combinations of these three independent relations. Thus, for example, if H and T are related (+ relatedness) and unequal (+ inequality), then the pair must be ENTAILMENT. If H and T are related, but also inconsistent, it is CONTRADICTION, etc.
Step 2: Deciding the Entailment Relationship

When a given H-T pair has been processed up to this point, it is represented as three binary classification results (+/- on relatedness, inconsistency and inequality). Instead of using binary labels (1/0 or +/-), the EDA uses the confidence values obtained from the binary classifiers, so the three classification results are represented as numbers between 0 and 1 (for this reason, it currently uses TADM, which outputs a normalized confidence score).

Finally, another round of classification is done with this three-number representation. This second classifier, which has been trained on the final TE labels, assigns a new H-T pair to one of the four TE relationship labels. The system currently uses the TADM multi-class classifier.

Bag-of-Words scoring component

This component regards an H-T pair as two bags of words. It compares the two bags and returns two scores between 0 and 1, which can be regarded as relatedness/similarity scores between the two bags of words. The component uses three knowledge resources to compare the given H and T bags: VerbOcean, WordNet and Google Normalized Distance (GND). The resources are used in two ways. One is expansion: VerbOcean and WordNet are used to expand the bags with related terms, so that the expanded sets H' and T' are used to calculate overlap scores. The other is associative scoring of terms (thus even terms not included in WordNet/VerbOcean can result in non-zero scores). The return value consists of two scores, one normalized by the size of H and the other by the size of T. TIE regards this information (a sort of coverage balance/imbalance on H and T) as a possibly indicative feature, and keeps both scores for almost all features. In the current implementation, since we aim at a unified API for accessing lexical resources, all the resources provided in the EXCITEMENT platform can be used by this scoring component. If the words are related in the knowledge base, they contribute to the final score. All the scores (from the different lexical resources) are kept as features for the classifier (or other usage) defined in the EDAs.

5 Task 4.4: Entailment Resources

The aims of this task are collecting and integrating into the EOP platform the linguistic resources that are used by the textual entailment algorithms. Some of the resources used existed prior to the start of the current project, in particular the WordNet versions for English, Italian and German, and VerbOcean and FrameNet for English. Other resources were created specifically for this project: entailment rules extracted from Wikipedia for English and Italian, and in the future for German as well. These resources are (and some will be) integrated in the platform through the lexical component interface migrated from BIUTEE. A few resources are currently under construction, e.g. Corpus Pattern Analysis (CPA) for Italian, and will be completed and integrated in the next development cycle. These resources are described in detail in deliverable 5.1 of WP5.

Apart from resources that provide entailment rules and features, we have also developed training and testing data. We have translated the English RTE-3 dataset, consisting of 1600 T-H pairs (800 for training, 800 for testing), into German and Italian. This will allow comparable training and testing for the three languages of the project. The data was manually translated, and the new versions are aligned with the original. The two new datasets are distributed under a Creative Commons licence.

6 Task 4.5: Combination and interoperability of Entailment Components

This task focuses on investigating the potential of the component-based approach developed in Excitement, particularly from two perspectives: (i) the interoperability of the platform; (ii) the combination of different components carried out by an EDA. During the first cycle of development we have started to address the first issue.

6.1 Linguistic annotation interoperability

Linguistic annotations for different languages can be used by the same EDA. This interoperability is achieved through the use of the same format (UIMA-CAS) and the same set of semantic types (derived from DKPro) by the linguistic pipelines for the three languages.

6.2 Resource interoperability

Different linguistic resources can be used by the same EDA. This interoperability is achieved through the lexical component interface, which basically assumes that knowledge extracted from different resources is represented as entailment rules. As an example, although the Italian WordNet and the Italian Wikipedia entailment resources are stored in completely different formats, their relevant content is managed by the lexical component interface as entailment rules, allowing a single EDA (e.g. the edit distance EDA) to use both in a completely transparent way.
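The uniform entailment-rule view of heterogeneous resources can be illustrated schematically. In the following hedged sketch the interface and method names are invented for illustration (the actual EOP lexical component API is specified in Deliverable D3.1): two toy resources with different internal storage expose the same rule-lookup interface, and a caller consults both transparently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of resource interoperability: interface and method
// names are invented for this example; the real EOP lexical component
// API is specified in Deliverable D3.1.
public class ResourceInteropDemo {

    // A lexical entailment rule: lhs entails rhs (e.g. "cane" -> "animale").
    public static class Rule {
        public final String lhs, rhs;
        public Rule(String lhs, String rhs) { this.lhs = lhs; this.rhs = rhs; }
    }

    // The common view every lexical resource exposes, regardless of how
    // the underlying data (WordNet, Wikipedia, ...) is actually stored.
    public interface LexicalResource {
        List<Rule> rulesForLeft(String lemma);
    }

    // Toy stand-in for a WordNet-backed resource (map-based storage).
    public static class ToyWordNet implements LexicalResource {
        private final Map<String, List<Rule>> rules = new HashMap<>();
        public ToyWordNet() {
            rules.put("cane", Arrays.asList(new Rule("cane", "animale")));
        }
        public List<Rule> rulesForLeft(String lemma) {
            return rules.getOrDefault(lemma, Collections.emptyList());
        }
    }

    // Toy stand-in for a Wikipedia-derived resource (computed on demand).
    public static class ToyWikipedia implements LexicalResource {
        public List<Rule> rulesForLeft(String lemma) {
            return lemma.equals("Hubble")
                    ? Arrays.asList(new Rule("Hubble", "telescopio"))
                    : Collections.emptyList();
        }
    }

    // An EDA can consult any number of resources through the same interface.
    public static boolean lexicallyEntails(String tWord, String hWord,
                                           List<LexicalResource> resources) {
        for (LexicalResource r : resources)
            for (Rule rule : r.rulesForLeft(tWord))
                if (rule.rhs.equals(hWord)) return true;
        return tWord.equals(hWord);
    }

    public static void main(String[] args) {
        List<LexicalResource> resources =
                Arrays.asList(new ToyWordNet(), new ToyWikipedia());
        System.out.println(lexicallyEntails("cane", "animale", resources));
        System.out.println(lexicallyEntails("Hubble", "telescopio", resources));
        System.out.println(lexicallyEntails("cane", "telescopio", resources));
    }
}
```

The design point is that the EDA depends only on the rule-lookup interface, so adding a new resource (or a new language's WordNet) requires no change to the EDA itself.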

6.3 EDA interoperability

Different EDAs can use the same distance component. This interoperability is achieved through a strict separation of the algorithm taking the entailment decision (i.e. the EDA) from the algorithm that calculates the distance between T and H in a pair (i.e. the distance component). As an example, both the edit distance EDA (from EDITS) and the classification-based EDA (from TIE) can take advantage of the result of the distance component (from EDITS).

6.4 Component interoperability

Different components can be used by the same EDA. As in the previous case, this interoperability is achieved through a strict separation of the algorithm taking the entailment decision from the algorithm that calculates the distance between T and H in a pair. As an example, the classification-based EDA (from TIE) can use, both separately and in combination, the results of the distance component (from EDITS) and the results of the BoW component (from TIE).

7 Development plans for the second cycle

We have the following basic goals for the second development cycle of the EOP:
(i) concluding the migration of the BIUTEE system into the EOP;
(ii) addressing textual entailment phenomena of higher complexity, including those that are based on syntax; this requires that the LAP for the three languages is extended with parsing;
(iii) further enriching the resources that provide entailment rules for the three languages;
(iv) extensively testing the proposed structure and environment for software development;
(v) investigating and progressing on the interoperability of single components of the linguistic pipelines.

Appendix A: UIMA-CAS example

This appendix provides an example of the UIMA-CAS output of the Italian LAP for the sentence "Hubble è un telescopio" ("Hubble is a telescope").

TextPro annotations:

# FILE: ./esempio
# FIELDS: token tokenstart sentence pos lemma
Hubble        0   -      SPN  Hubble
è             7   -      VI   essere
un            9   -      RS   indet
telescopio    12  -      SS   telescopio
.             22  <eos>  XPS  full_stop

Adding annotations for text: Hubble è un telescopio.

uima.tcas.DocumentAnnotation "Hubble è un telescopio."
    begin = 0, end = 23, language = "it"

eu.excitement.type.entailment.Text "Hubble è un telescopio."
    begin = 0, end = 23

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence "Hubble è un telescopio."
    begin = 0, end = 23

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
    begin = 0, end = 6, PosValue = "SPN"

"Hubble"
    begin = 0, end = 6, value = "Hubble"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "Hubble"
    begin = 0, end = 6, parent = null
    lemma = "Hubble" (begin = 0, end = 6, value = "Hubble")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
          (begin = 0, end = 6, PosValue = "SPN")

"è"
    begin = 7, end = 8, value = "essere"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
    begin = 7, end = 8, PosValue = "VI"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "è"
    begin = 7, end = 8, parent = null
    lemma = "è" (begin = 7, end = 8, value = "essere")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
          (begin = 7, end = 8, PosValue = "VI")

"un"
    begin = 9, end = 11, value = "indet"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "un"
    begin = 9, end = 11, parent = null
    lemma = "un" (begin = 9, end = 11, value = "indet")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
          (begin = 9, end = 11, PosValue = "RS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
    begin = 9, end = 11, PosValue = "RS"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
    begin = 12, end = 22, PosValue = "SS"

"telescopio"
    begin = 12, end = 22, value = "telescopio"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "telescopio"
    begin = 12, end = 22, parent = null
    lemma = "telescopio" (begin = 12, end = 22, value = "telescopio")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
          (begin = 12, end = 22, PosValue = "SS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
    begin = 22, end = 23, PosValue = "XPS"

"."
    begin = 22, end = 23, value = "full_stop"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "."
    begin = 22, end = 23, parent = null
    lemma = "." (begin = 22, end = 23, value = "full_stop")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
          (begin = 22, end = 23, PosValue = "XPS")

TextPro annotations:

# FILE: ./esempio
# FIELDS: token tokenstart sentence pos lemma
Hubble        0   -      SPN  Hubble
non           7   -      B    non
è             11  -      VI   essere
un            13  -      RS   indet
telescopio    16  -      SS   telescopio
.             26  <eos>  XPS  full_stop

Adding annotations for text: Hubble non è un telescopio.

uima.tcas.DocumentAnnotation "Hubble non è un telescopio."
    begin = 0, end = 27, language = "it"

eu.excitement.type.entailment.Hypothesis "Hubble non è un telescopio."
    begin = 0, end = 27

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence "Hubble non è un telescopio."
    begin = 0, end = 27

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
    begin = 0, end = 6, PosValue = "SPN"

"Hubble"
    begin = 0, end = 6, value = "Hubble"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "Hubble"
    begin = 0, end = 6, parent = null
    lemma = "Hubble" (begin = 0, end = 6, value = "Hubble")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
          (begin = 0, end = 6, PosValue = "SPN")

"non"
    begin = 7, end = 10, value = "non"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ADV "non"
    begin = 7, end = 10, PosValue = "B"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "non"
    begin = 7, end = 10, parent = null
    lemma = "non" (begin = 7, end = 10, value = "non")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ADV "non"
          (begin = 7, end = 10, PosValue = "B")

"è"
    begin = 11, end = 12, value = "essere"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
    begin = 11, end = 12, PosValue = "VI"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "è"
    begin = 11, end = 12, parent = null
    lemma = "è" (begin = 11, end = 12, value = "essere")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
          (begin = 11, end = 12, PosValue = "VI")

"un"
    begin = 13, end = 15, value = "indet"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "un"
    begin = 13, end = 15, parent = null
    lemma = "un" (begin = 13, end = 15, value = "indet")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
          (begin = 13, end = 15, PosValue = "RS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
    begin = 13, end = 15, PosValue = "RS"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
    begin = 16, end = 26, PosValue = "SS"

"telescopio"
    begin = 16, end = 26, value = "telescopio"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "telescopio"
    begin = 16, end = 26, parent = null
    lemma = "telescopio" (begin = 16, end = 26, value = "telescopio")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
          (begin = 16, end = 26, PosValue = "SS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
    begin = 26, end = 27, PosValue = "XPS"

"."
    begin = 26, end = 27, value = "full_stop"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "."
    begin = 26, end = 27, parent = null
    lemma = "." (begin = 26, end = 27, value = "full_stop")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
          (begin = 26, end = 27, PosValue = "XPS")

Pair 450 written as ./450.xmi
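The begin/end values in the CAS dump above are character offsets into the annotated text. As a quick sanity check (plain Java, independent of the UIMA API), the substring covered by each annotation's span recovers its surface form:

```java
// Verifies the character offsets shown in the CAS dump: for an annotation
// with offsets (begin, end), text.substring(begin, end) yields its
// covered text.
public class OffsetCheck {
    static String span(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "Hubble è un telescopio.";
        String hypo = "Hubble non è un telescopio.";
        System.out.println(span(text, 0, 6));    // Hubble
        System.out.println(span(text, 12, 22));  // telescopio
        System.out.println(span(hypo, 7, 10));   // non
        System.out.println(span(hypo, 16, 26));  // telescopio
    }
}
```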


More information

31 Case Studies: Java Natural Language Tools Available on the Web

31 Case Studies: Java Natural Language Tools Available on the Web 31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software

More information

IBM Watson Ecosystem. Getting Started Guide

IBM Watson Ecosystem. Getting Started Guide IBM Watson Ecosystem Getting Started Guide Version 1.1 July 2014 1 Table of Contents: I. Prefix Overview II. Getting Started A. Prerequisite Learning III. Watson Experience Manager A. Assign User Roles

More information

File S1: Supplementary Information of CloudDOE

File S1: Supplementary Information of CloudDOE File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.

More information

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Stamatina Thomaidou 1,2, Konstantinos Leymonis 1,2, Michalis Vazirgiannis 1,2,3 Presented by: Fragkiskos Malliaros 2 1 : Athens

More information