Work Package 4 Deliverable 4.1. Textual Entailment Open Platform, cycle I



This document is part of the EXCITEMENT project, funded by the 7th Framework Programme of the European Commission through grant agreement no. 287923.

EXploring Customer Interaction via Textual EntailMENT

Work Package 4
Deliverable 4.1: Textual Entailment Open Platform, Cycle I

Authors: Bernardo Magnini, Roberto Zanoli, Vivi Nastase, Rui Wang, Meni Adler, Asher Stern, Tae-Gil Noh
Dissemination Level: Public
Date: December 31, 2012

Grant agreement no.: 287923
Project acronym: EXCITEMENT
Project full title: EXploring Customer Interaction via Textual EntailMENT
Funding scheme: STREP
Coordinator: Moshe Wasserblat (NICE)
Start date, duration: 1 January 2012, 36 months
Distribution: Public
Contractual date of delivery: 31/12/2012
Actual date of delivery: 31/12/2012
Deliverable number: D4.1
Deliverable title: Textual Entailment Open Platform, Cycle I
Type: Report
Status and version: Final, version 1.1
Number of pages: 35
Contributing partners: FBK, DFKI, BIU, UHEI
WP leader: FBK
Task leader: FBK
Authors: Bernardo Magnini, Roberto Zanoli, Vivi Nastase, Rui Wang, Meni Adler, Asher Stern, Tae-Gil Noh
EC project officer: Carola Carstens

The partners in EXCITEMENT are:
- Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
- NICE Systems, Israel
- Fondazione Bruno Kessler (FBK), Italy
- Bar-Ilan University, Israel
- Heidelberg University, Germany
- OMQ, Germany
- ALMAWAVE, Italy

For copies of reports, updates on project activities and other EXCITEMENT-related information, contact:
NICE Systems, EXCITEMENT
Moshe Wasserblat, moshew@nice.com
Hapnina 8, Ra'anana, Israel
Phone: +972 (9) 775-3702
Fax: +972 (9) 775-3702

Copies of reports and other material can also be accessed via http://www.excitement-project.eu

© 2012, The Individual Authors. No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Table of Contents

1 Introduction
1.1 EOP Cycle I: Main Characteristics
1.2 Architecture Specifications
1.3 EOP Cycle I development plan
2 Task 4.1: Implementation of the platform architecture
2.1 Development environment
    Topology of Git repositories
    Storage
    Code contribution
    Project dependencies
3 Task 4.2: Linguistic Analysis Pipeline
3.1 Preprocessing for Italian
3.2 Preprocessing for German and English
4 Task 4.3: Entailment Algorithms
4.1 Lexical Edit Distance: EDA and component
4.2 Classification-based EDA
5 Task 4.4: Entailment Resources
6 Task 4.5: Combination and interoperability of Entailment Components
6.1 Linguistic annotation interoperability
6.2 Resource interoperability
6.3 EDA interoperability
6.4 Component interoperability
7 Development plans for the second cycle
Appendix A: UIMA-CAS example

1 Introduction

According to the project's DoW, work package 4 ("Textual Entailment Open Platform") will develop the component-based approach proposed in EXCITEMENT, realizing the Open Platform for Textual Entailment Inference. This deliverable describes the progress in the development of the Excitement Open Platform (EOP) carried out under WP4. We address five tasks within WP4, specifically:

- T4.1 Implementation of the platform architecture;
- T4.2 Linguistic analysis pipeline (pre-processing);
- T4.3 Entailment algorithms;
- T4.4 Entailment resources;
- T4.5 Combination of entailment components.

For each task we show the progress after one year of activity (i.e. the first development cycle of the project) and highlight the next steps.

1.1 EOP Cycle I: Main Characteristics

The first prototype of the Excitement Open Platform (EOP-v1.0) has the following main characteristics:

- Based on three existing systems. The EOP entailment engines implement some of the main functionalities of the three existing entailment engines (BIUTEE, EDITS, TIE) from which the platform originates.

- Multilinguality. Given input Text-Hypothesis pairs in one of the three languages of the project (i.e. English, German, Italian), the EOP produces the corresponding entailment judgments (i.e. "Entailment", "Not-Entailment").

- Conformance to the component-based approach. Existing functionalities of the entailment engines are adapted to the blocks of the EOP architecture (defined in WP3), which include: pipeline, entailment decision algorithms (EDAs), knowledge components, distance components, resources.

- Learning capacity. The EOP can build models on training data, which are then evaluated on corresponding test data. For this purpose we use the RTE-3 dataset in the three languages.

- Focus on lexical entailment. We start from entailment issues that are based on lexical phenomena in the three languages, for which all the academic partners have experience. For each language we identify lexical resources contributing to entailment (e.g. WordNet, Wikipedia), and build the corresponding knowledge components, distance components, and EDAs. Typical phenomena responsible for lexical entailment include synonymy, hypernymy, antonymy, morphological derivations, named entity variations, acronyms, abbreviations, world knowledge about people and locations, etc. Although the focus for EOP-v1 is lexical entailment, we plan to move rapidly to syntactic phenomena thanks to the components and EDAs developed in BIUTEE and TIE.

- Interoperability. We intend to demonstrate the advantages of the component-based approach in terms of interoperating components at several levels of the EOP architecture, including:
  - the pipeline level: a shared UIMA-CAS representation for the three languages;
  - the EDA level: the same EDAs used for all languages (e.g. word overlap, edit distance);
  - the component and resource level: shared APIs for knowledge components, e.g. the same high-level API for the three WordNets.

- Open source distribution. We aim to show that EOP-v1 is already synchronized with the goal of open source diffusion; for each resource and software package included in the prototype we provide the information required for its distribution. This activity has been synchronized with the activity in WP8.

- Relation to the project's application scenarios. We intend to show that the first prototype already serves at least some of the requirements of the two industrial applications foreseen in the project: entailment-based graph exploration and entailment-based retrieval. This activity has been synchronized with the activity in WP6 on the Transduction Layer.

- Demonstration. In order to make the first prototype as understandable as possible to potential users, we have been developing an EOP-v1 demonstrator based on a minimal web interface that: (i) allows the user to choose a configuration from a library of already experimented configurations; (ii) allows the user to insert a T-H pair; (iii) provides a library of built-in examples; (iv) provides a basic explanation of the system's behavior.

1.2 Architecture Specifications

This document assumes the architecture specifications described in Deliverable D3.1 (specifically version D3.1.3, December 2012), as well as the terminology used in that document.

1.3 EOP Cycle I development plan

The following incremental releases of the platform were planned and then realized during the first year of the project.

EOP v0.1, October 15, 2012 ("proof of concept")

This release (see Figure 1) is mainly intended as a proof of concept of the architectural design defined in Deliverable 3.1a. EOP-v0.1 features:

- Linguistic annotation: the minimum requirement for each language is tokenization; additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
- EDA: a single EDA based on edit distance over tokens. The entailment decision is based on an edit distance threshold between T and H estimated on a training set.
- Components: a single, language-independent distance component based on edit distance over tokens (i.e. the one implemented in EDITS).

[Figure 1: EOP-v0.1. The Italian, German and English LAPs (tokenization via DKPro) produce a shared UIMA-CAS, consumed by the Edit Distance EDA (EDITS; fixed costs of operations, tested on single T/H pairs) with its token-level edit distance component, yielding an entailment/not-entailment judgment.]

EOP v0.2, November 6, 2012 (basic lexical resources; Rome meeting)

This release (see Figure 2) aims at integrating into the platform a number of selected functionalities currently implemented for the three languages in the available systems (BIUTEE, EDITS, TIE) at the level of lexical entailment. The minimum expected capacity is lexical entailment (e.g. synonymy, hypernyms, morphological derivations, etc.) based on lexical resources.

- Linguistic annotation: the minimum requirements for each language are tokenization, POS tagging, lemmatization, and morphological analysis. Additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
- EDA: to be defined for each language on the basis of the existing systems. No interoperability is required for this release.
- Components: separate components for each of the three languages. The minimum requirement is the use of lexical knowledge in order to address typical phenomena of lexical variability.

[Figure 2: EOP-v0.2. The three LAPs (tokenization, lemma, POS) produce a shared UIMA-CAS consumed by the Edit Distance EDA (EDITS; fixed costs of operations, tested on RTE-3), supported by a token-level edit distance component (EDITS) and a lexical component with entailment rules (BIUTEE) drawing on the Italian, German and English WordNets and the Italian and English Wikipedias.]

EOP v1, December 31, 2012

This release (see Figure 3) aims at integrating some level of interoperability among the EDAs and components already used in version 0.2. This version is also the official deliverable of WP4 at month 12.

- Linguistic annotation: same as for version 0.2; additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
- EDA: the minimum requirement is at least one EDA (parameterized on languages) which can work on the three languages (e.g. Dan Roth's lexical similarity algorithm).
- Components: the minimum requirement is at least one component (e.g. the WordNets) which shares its API across the three languages.
- Training and test: the minimum requirement is training and testing on RTE-3 in the three languages (or part of it, where not available). Additional training and test data are welcome when available, particularly for English.
- Configuration: a configuration file describes, in a declarative way, the parameters of the components and the EDA used in a given run of the platform.

[Figure 3: EOP-v1.0. A Configurator drives the three LAPs (tokenization, lemma, POS), which produce a shared UIMA-CAS consumed by two EDAs, the Edit Distance EDA (EDITS) and the classification-based EDA (TIE), both trained and tested on RTE-3 and yielding an entailment/not-entailment judgment. Supporting components: a BoW similarity scoring component (TIE), a token-level edit distance component (EDITS), a distributional similarity component for German, and a lexical component with entailment rules (BIUTEE) drawing on the Italian, German and English WordNets and the Italian and English Wikipedias.]

EOP v1.1, February 14, 2013 (planned)

This release will be shown at the project's first review meeting (February 2013). All functionalities of version 1.0 will be carefully tested, and a minimal graphical interface will be realized for demo purposes.

- Linguistic annotation: same as version 1.0.
- EDA: same as version 1.0.
- Components: same as version 1.0.
- Training and test: same as version 1.0.
- Graphical interface: a minimal interface for demo purposes, allowing the user to run the platform for training and testing under different configurations.

2 Task 4.1: Implementation of the platform architecture

This task focuses on defining the policies for software development adopted by the Excitement consortium, as well as providing the main tools for managing the software and data used by EOP-v1.0.

2.1 Development environment

Given the complexity of the EOP platform, both in terms of the different groups contributing to its development and because of the different nature of the software involved, defining and setting up the development environment has required several decisions. The EOP platform, due to its characteristics, requires that different kinds of code and data be managed, with a significant level of complexity also with respect to other academic experiences (e.g. Moses in the machine translation field). We have analysed the software and data used by the platform along two main dimensions: (1) whether the software/data has been developed as foreground of the Excitement project or by third parties; (2) whether the distribution licence of the software/data is free or restricted by specific constraints. On the basis of these characteristics, we have adopted the following schema for software development. Tables 1 and 2 show, respectively, the localization of software and data in the Excitement development environment: specifically, we distinguish a version control repository, for which we use Git (http://git-scm.com), and an internal repository, for which we use Maven (http://maven.apache.org).

Table 1: EOP software management

- Excitement code (Java): version control repository.
- 3rd-party Java software under LGPL or an equivalent licence: internal Maven repository.
- 3rd-party non-Java software (e.g. C++) under LGPL or an equivalent licence: version control repository.
- 3rd-party software under other licences (e.g. GPL): up to the EOP user to install.

Table 2: EOP data management

- 3rd-party linguistic and knowledge resources under Creative Commons, equivalent, or research licences: internal Maven repository (e.g. Italian MultiWordNet).
- 3rd-party RTE data sets under such licences: version control repository.
- Excitement linguistic and knowledge resources: internal Maven repository (e.g. WordNet rules, Italian Wikipedia).
- Excitement RTE data sets: version control repository (e.g. Italian RTE-3).
- Commercial or proprietary 3rd-party data: up to the EOP user to get (e.g. GermaNet).

In the remainder of this section we provide details of the software and data repositories.

Topology of Git repositories

Git is a distributed revision control and source code management (SCM) system with an emphasis on speed. Git was initially designed and developed by Linus Torvalds for Linux kernel development; it has since been adopted by many other projects. Every Git working directory is a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server. Git is free software distributed under the terms of the GNU General Public License version 2.

In order to handle the contribution of multiple groups to the public repository, we adopt a multi-layer Git repository topology, which also accommodates groups that wish to develop their code in a separate private repository. Group-only code is developed in two tiers, a developer tier and a group tier (for sharing between group developers and for backup), and is entirely private. Excitement code is developed in three tiers: a private developer tier, a group tier (sharing between group developers and backup), and a public Excitement tier (sharing with all parties and backup), where code is pushed from lower tiers to upper tiers.

[Figure 4: EOP schema for software development]

The main Excitement code repository (1) includes all code that is shared as part of the Excitement project between all groups, and that will eventually become public.

Each of the groups may maintain private code which is not part of the Excitement code (2), as well as Excitement code which is under development and not yet ready for sharing with the other groups (3). The group repository may depend on the Excitement repository, but not the other way around. Each developer in each group can work with both repository instances at the same time (4 and 5 for developer 1, 6 and 7 for developer 2).

Storage

We identify three types of data that are not source code:

- Java third-party artefacts, e.g. jars that are referred to by the Excitement code but are not present in the Maven central repository.
- Non-Java third-party software, e.g. the EasyFirst parser for English.
- Data files, e.g. knowledge resources.

These types of data should not be part of the Git repository; they are stored in an online shared repository.

Internal Maven repository

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information. We use Maven's mechanism for downloading from an additional repository to handle the download of the required files (e.g. jars). The purpose of the Maven repository is to work as an internal private repository for all software libraries and data files, such as lexical and knowledge resources. Storing Maven artefacts (e.g. jars) in a dedicated Maven repository is preferable to storing them in a version control system like GitHub for the following reasons: (i) libraries (jars) are binary files and do not belong in version control systems, which are better at handling text files that are frequently edited; (ii) it keeps the version control repository small; (iii) checkouts, updates and other actions on the version control system are quicker.

The Excitement Maven repository consists of three sub-repositories:

- Private internal repository: contains artefacts which are used only within the EOP project. These are manually uploaded by the developers. This repository does not synchronize with the Maven central repository (http://repo1.maven.org/), as its artefacts are private to the organization.
- 3rd-party repository: contains artefacts which are publicly available but not in the Maven central repository, for example the latest versions of libraries that are not yet available there. This repository is not synchronized with the Maven central repository, since the central repository does not have these libraries or data files.
- Repo1 cache: synchronized with the Maven central repository; it is a cache of the artefacts from it.

Code contribution

Currently (i.e. for the first EOP prototype), groups work individually on their own modules. At later stages, when code contributors add their own code to different modules, the standard development procedure will be based on branch and merge. Each group of users (currently the four academic partners of the project) has identified a person responsible for merging. FBK, as work package leader, has the overall responsibility for merging the contributions from the different groups.

Project dependencies

The m2e (Maven to Eclipse) plugin allows easy generation of Maven projects and definition of dependencies. Ultimately, a pom.xml file is generated, defining the repositories and dependencies of the current project. It also defines the project's ID and version, which can then be used as a dependency in other Maven projects. A minimal sketch of such a file is shown below.
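The following is a minimal, illustrative pom.xml along these lines. It is a sketch only: the group and artifact identifiers and the internal repository URL are invented placeholders, not the actual EOP coordinates.

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>

      <!-- Project ID and version; other Maven projects can declare this
           artifact as a dependency using these coordinates.
           (Placeholder coordinates, for illustration only.) -->
      <groupId>eu.excitement</groupId>
      <artifactId>eop-example-component</artifactId>
      <version>1.0.0</version>

      <!-- Additional repository: the internal EOP Maven repository
           (URL is a placeholder). -->
      <repositories>
        <repository>
          <id>eop-internal</id>
          <url>http://example.org/eop-maven-repository</url>
        </repository>
      </repositories>

      <!-- Dependencies are resolved from the Maven central repository
           or from the additional repository declared above. -->
      <dependencies>
        <dependency>
          <groupId>org.apache.uima</groupId>
          <artifactId>uimaj-core</artifactId>
          <version>2.4.0</version>
        </dependency>
      </dependencies>
    </project>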

3 Task 4.2: Linguistic Analysis Pipeline

For the pre-processing of Text-Hypothesis (T-H) pairs, according to the specifications described in D3.1.3, we have adopted the UIMA-CAS format to ensure interoperability among the linguistic analyses for the different languages. For EOP-v1 the focus has been on the following linguistic modules: tokenization, lemmatization and part-of-speech tagging, as they are the most relevant for lexical entailment.

3.1 Preprocessing for Italian

For Italian pre-processing we use TextPro (Pianta et al. 2008), a suite of modular natural language processing tools developed at FBK. The current version of the tool provides functions ranging from tokenization to part-of-speech tagging and named entity recognition. The system's architecture is organized as a pipeline of processors, wherein each stage accepts data from the initial input or from the output of a previous stage, executes a specific task, and sends the resulting data to the next stage or to the output of the pipeline.

To process the Italian RTE data, a sample of which is shown in Example 1, we have built a wrapper around TextPro. The annotation produced for the text portion of the given pair is shown in Example 2.

    <pair id="xxx" entailment="NONENTAILMENT" task="xxx">
      <t>Hubble è un telescopio.</t>
      <h>Hubble non è un telescopio.</h>
    </pair>

Example 1: Italian Text/Hypothesis pair

TextPro annotations:

    # FILE: ./esempio
    # FIELDS: token tokenstart sentence pos lemma
    Hubble       0   -      SPN  Hubble
    è            7   -      VI   essere
    un           9   -      RS   indet
    telescopio   12  -      SS   telescopio
    .            22  <eos>  XPS  full_stop

Example 2: TextPro output (tabular format)

Example 3: Italian pipeline output (UIMA-CAS); see Appendix A for the full output.

3.2 Preprocessing for German and English

For pre-processing German data we use DKPro Core, a collection of software components for natural language processing based on the Apache UIMA framework. Many state-of-the-art NLP components are already freely available in the NLP research community. DKPro Core provides wrappers for such third-party tools as well as original NLP components. DKPro Core builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines. DKPro Core ASL (http://code.google.com/p/dkpro-core-asl/) contains those components of DKPro Core that are licensed under the Apache Software License (ASL) version 2. Additional components are available in DKPro Core GPL (http://code.google.com/p/dkpro-core-gpl). Here is a brief (partial) list of the components included in DKPro Core:

- tokenization/segmentation (BreakIterator, OpenNLP, LanguageTool, Stanford CoreNLP)
- compound splitting (Banana Split, JWordSplitter)
- stemming (Snowball)
- lemmatization (TreeTagger)
- part-of-speech tagging (TreeTagger, OpenNLP, Mecab)
- syntactic parsing (OpenNLP, Stanford CoreNLP, Berkeley Parser)
- dependency parsing (MaltParser, MSTParser, Stanford CoreNLP)
- coreference resolution (Stanford CoreNLP)
- language identification (TextCat)
- spelling correction (Jazzy)

Concerning the languages considered in EXCITEMENT (English, German, Italian), the components listed above are usually available for English and German. During EXCITEMENT we will work to provide wrappers for tools for Italian and for additional tools for English and German. A sketch of how such components are assembled into a pipeline is given below.
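As an illustration only, the following minimal sketch shows the general pattern by which DKPro Core components are chained into a uimaFIT pipeline. It assumes uimaFIT 1.x and DKPro Core class names of the period (BreakIteratorSegmenter, TreeTaggerPosLemmaTT4J); exact class and package names may differ across versions.

    import static org.uimafit.factory.AnalysisEngineFactory.createPrimitiveDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.jcas.JCas;
    import org.uimafit.factory.JCasFactory;
    import org.uimafit.pipeline.SimplePipeline;

    import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;
    import de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerPosLemmaTT4J;

    public class GermanLapSketch {
        public static void main(String[] args) throws Exception {
            // Create a CAS holding the German text to be pre-processed.
            JCas jcas = JCasFactory.createJCas();
            jcas.setDocumentLanguage("de");
            jcas.setDocumentText("Hubble ist ein Teleskop.");

            // Tokenization/segmentation followed by POS tagging and
            // lemmatization, the modules required for EOP-v1.
            AnalysisEngineDescription segmenter =
                    createPrimitiveDescription(BreakIteratorSegmenter.class);
            AnalysisEngineDescription tagger =
                    createPrimitiveDescription(TreeTaggerPosLemmaTT4J.class);

            // Run the components in sequence; annotations (Token, POS,
            // Lemma) accumulate in the shared UIMA-CAS.
            SimplePipeline.runPipeline(jcas, segmenter, tagger);
        }
    }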

4 Task 4.3: Entailment Algorithms

During the first development cycle, this task has provided the basic entailment decision algorithms (EDAs) and the basic scoring and knowledge components, i.e. the algorithms in charge of estimating the distance between T and H. In the following we briefly describe the EDAs and distance components of EOP-v1.

4.1 Lexical Edit Distance: EDA and component

EOP-v1 includes both an EDA and a distance component derived from the EDITS system, developed by FBK, mostly in the context of the QALL-ME EU project. EDITS is a software package for recognizing entailment relations between two portions of text. It is based on edit distance algorithms, and computes the T-H distance as the cost of the edit operations that are necessary to transform T into H. EDITS is open source, available under the GNU Lesser General Public License (LGPL). The tool is implemented in Java, runs on Unix-based operating systems, and has been tested on Mac OS X, Linux, and Sun Solaris. The latest release of the package (version XX, date) can be downloaded at http://edits.fbk.eu.

EDITS implements a distance-based framework, which assumes that the probability of an entailment relation between a given T-H pair is inversely proportional to the distance between T and H (i.e. the higher the distance, the lower the probability of entailment). Within this framework the system implements and combines different approaches to distance computation, providing both edit distance algorithms and similarity algorithms. Each algorithm returns a normalized distance score (a number between 0 and 1). At the training stage, distance scores calculated over annotated T-H pairs are used to estimate the threshold that best separates positive (entailment) from negative (non-entailment) examples. The threshold, which is stored in a Model, is used at the test stage to assign an entailment judgment and a confidence score to each test pair. The sketch below illustrates this decision scheme.
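As a minimal sketch (not the actual EDITS code), the following illustrates the distance-threshold decision scheme described above, using a token-level Levenshtein distance normalized to [0, 1]; class and method names are invented for illustration.

    import java.util.List;

    /** Illustrative distance-threshold entailment decision (not the EDITS implementation). */
    public class EditDistanceEdaSketch {

        /** Threshold estimated on training data and stored in the model. */
        private final double threshold;

        public EditDistanceEdaSketch(double threshold) {
            this.threshold = threshold;
        }

        /** Entailment iff the normalized T-H distance does not exceed the threshold. */
        public boolean entails(List<String> text, List<String> hypothesis) {
            return normalizedEditDistance(text, hypothesis) <= threshold;
        }

        /** Token-level Levenshtein distance, normalized by the longer sequence. */
        static double normalizedEditDistance(List<String> t, List<String> h) {
            int[][] d = new int[t.size() + 1][h.size() + 1];
            for (int i = 0; i <= t.size(); i++) d[i][0] = i;       // deletions
            for (int j = 0; j <= h.size(); j++) d[0][j] = j;       // insertions
            for (int i = 1; i <= t.size(); i++) {
                for (int j = 1; j <= h.size(); j++) {
                    int subst = t.get(i - 1).equals(h.get(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(d[i - 1][j - 1] + subst,    // substitution/match
                              Math.min(d[i - 1][j] + 1,            // deletion
                                       d[i][j - 1] + 1));          // insertion
                }
            }
            int longer = Math.max(t.size(), Math.max(h.size(), 1));
            return d[t.size()][h.size()] / (double) longer;        // score in [0, 1]
        }
    }

At training time, the threshold would be chosen (e.g. by a sweep over candidate values) so as to best separate the positive from the negative examples on the annotated pairs.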

4.2 Classification-based EDA

EOP-v1 includes a classification-based EDA derived from TIE, a textual entailment system developed by DFKI. TIE has three EDAs in its current implementation. Two EDAs cover special cases, determining only Contradiction (the NER-based EDA) or Entailment/Paraphrase (the DIRT-based EDA), while the third, the main EDA, can process any given H-T pair to produce a decision of Paraphrase, Entailment, Contradiction, or Unknown.

The classification-based EDA is the most general of the three EDAs of the TIE engine. It relies on the output of three types of components as features: a lexical-level component, a syntactic-level (dependency tree) component, and a semantic-level (graph of semantic roles) component. The outputs consist of a set of scores: the lexical and syntactic components return two scores each, and the semantic-level component returns four scores. Thus, for the main EDA, every H-T pair is represented as a set of eight numbers. For EOP-v1, only the lexical component (described below) has been migrated from TIE into the EXCITEMENT platform.

Step 1: Decide the three basic relationships "relatedness", "inconsistency", "inequality"

The EDA first classifies an H-T pair with three independent classifiers: "Relatedness", "Inconsistency", and "Inequality". The EDA decomposes the four TE relations into combinations of these three independent relations. Thus, for example, if H and T are related (+relatedness) and unequal (+inequality), the pair must be an ENTAILMENT; if H and T are related but also inconsistent, it is a CONTRADICTION, etc.

Step 2: Deciding the entailment relationship

When a given H-T pair has been processed up to this point, it is represented as three binary classification results (+/- on relatedness, inconsistency and inequality). Instead of using binary labels (1/0 or +/-), the EDA uses the confidence values obtained from the binary classifiers, so the three classification results are represented as numbers between 0 and 1 (for this reason, it currently uses TADM, which outputs a normalized confidence score). Finally, another round of classification is done on this three-number representation. This second classifier, which has been trained on the final TE labels, assigns a new H-T pair to one of the four TE relationship labels. The system currently uses the TADM multi-class classifier.

Bag-of-Words scoring component

This component regards an H-T pair as two bags of words. It compares the two bags and returns two scores between 0 and 1, which can be regarded as relatedness/similarity scores between the two bags. The component uses three knowledge resources to compare the given H-T bags: VerbOcean, WordNet and Google Normalized Distance (GND). The resources are used in two ways. One is expansion: VerbOcean and WordNet are used to expand the bags with related terms, and the expanded sets H' and T' are then used to calculate overlap scores. The other is associative scoring of terms (so even terms not included in WordNet/VerbOcean can receive non-zero scores). The output consists of two scores, one normalized by the size of H and the other by the size of T. TIE regards this information (a sort of coverage balance/imbalance between H and T) as a potentially indicative feature, and keeps both scores for almost all features. In the current implementation, since we aim at a unified API for accessing lexical resources, all the resources provided in the EXCITEMENT platform can be used by this scoring component. If two words are related in the knowledge base, they contribute to the final score. All the scores (from the different lexical resources) are kept as features for the classifier (or other usage) defined in the EDAs. A simplified sketch of the overlap scoring is given below.
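The following is a simplified sketch (not the TIE code) of the two-score bag-of-words overlap computation described above, with a pluggable relatedness test standing in for the lexical resources; all names are illustrative.

    import java.util.List;
    import java.util.function.BiPredicate;

    /** Illustrative bag-of-words scoring (not the TIE implementation). */
    public class BagOfWordsScorerSketch {

        /**
         * Returns two overlap scores in [0, 1]: the fraction of H tokens covered
         * by T, and the fraction of T tokens covered by H. Two tokens "match" if
         * they are equal or related according to the supplied lexical resource.
         */
        static double[] overlapScores(List<String> t, List<String> h,
                                      BiPredicate<String, String> related) {
            return new double[] { coverage(h, t, related),   // normalized by |H|
                                  coverage(t, h, related) }; // normalized by |T|
        }

        /** Fraction of tokens in 'target' that have a match in 'other'. */
        private static double coverage(List<String> target, List<String> other,
                                       BiPredicate<String, String> related) {
            if (target.isEmpty()) return 0.0;
            long covered = target.stream()
                    .filter(w -> other.stream()
                            .anyMatch(v -> w.equals(v) || related.test(w, v)))
                    .count();
            return covered / (double) target.size();
        }
    }

In the platform, the relatedness test would be backed by the shared lexical resource API discussed in Section 6.2.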

5 Task 4.4: Entailment Resources

The aims of this task are collecting and integrating into the EOP platform the linguistic resources used by the textual entailment algorithms. Some of the resources existed prior to the start of the project, in particular the WordNet versions for English, Italian and German, and VerbOcean and FrameNet for English. Other resources were created specifically for this project: entailment rules extracted from Wikipedia for English and Italian, and in the future for German as well. These resources are (and some will be) integrated in the platform through the lexical component interface migrated from BIUTEE. A few resources are currently under construction, e.g. Corpus Pattern Analysis (CPA) for Italian, and will be completed and integrated in the next development cycle. These resources are described in detail in Deliverable 5.1 of WP5.

Apart from resources that provide entailment rules and features, we have also developed training and testing data. We have translated the English RTE-3 dataset, consisting of 1600 T-H pairs (800 for training, 800 for testing), into German and Italian. This allows comparable training and testing for the three languages of the project. The data was manually translated, and the new versions are aligned with the original. The two new datasets are distributed under a Creative Commons licence.

6 Task 4.5: Combination and interoperability of Entailment Components

This task focuses on investigating the potential of the component-based approach developed in Excitement, particularly from two perspectives: (i) the interoperability of the platform; (ii) the combination of different components carried out by an EDA. During the first cycle of development we have started to address the first issue.

6.1 Linguistic annotation interoperability

Linguistic annotations for different languages can be used by the same EDA. This interoperability is achieved through the use of the same format (UIMA-CAS) and the same set of semantic types (derived from DKPro) by the linguistic pipelines for the three languages.

6.2 Resource interoperability

Different linguistic resources can be used by the same EDA. This interoperability is achieved through the lexical component interface, which basically assumes that knowledge extracted from different resources is represented as entailment rules. As an example, although the Italian WordNet and the Italian Wikipedia entailment resources are stored in completely different formats, their relevant content is exposed by the lexical component interface as entailment rules, allowing a single EDA (e.g. the edit distance EDA) to use both in a completely transparent way. A sketch of what such an interface might look like is given below.
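As an illustration only (the actual interface is defined by the platform architecture and is not reproduced here), a lexical component interface of this kind could look roughly as follows; all names are invented for the sketch.

    import java.util.List;

    /** Illustrative sketch of a shared lexical-resource interface (invented names). */
    public interface LexicalResourceSketch {

        /** An entailment rule between two lexical items, with a confidence score. */
        final class EntailmentRule {
            public final String leftLemma;   // entailing term
            public final String rightLemma;  // entailed term
            public final double confidence;  // rule confidence in [0, 1]

            public EntailmentRule(String leftLemma, String rightLemma, double confidence) {
                this.leftLemma = leftLemma;
                this.rightLemma = rightLemma;
                this.confidence = confidence;
            }
        }

        /**
         * All rules in which the given lemma appears on the entailing side.
         * A WordNet-backed implementation might return synonyms and hypernyms;
         * a Wikipedia-backed one, rules mined from article text.
         */
        List<EntailmentRule> getRulesForLeft(String lemma);

        /** Rules connecting the two lemmas, if any (empty list otherwise). */
        List<EntailmentRule> getRules(String leftLemma, String rightLemma);
    }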

6.3 EDA interoperability

Different EDAs can use the same distance component. This interoperability is achieved through a strict separation between the algorithm taking the entailment decision (i.e. the EDA) and the algorithm that calculates the distance between T and H in a pair (i.e. the distance component). As an example, both the edit distance EDA (from EDITS) and the classification-based EDA (from TIE) can take advantage of the results of the distance component (from EDITS).

6.4 Component interoperability

Different components can be used by the same EDA. As in the previous case, this interoperability is achieved through the strict separation between the algorithm taking the entailment decision and the algorithm that calculates the distance between T and H in a pair. As an example, the classification-based EDA (from TIE) can use, both separately and in combination, the results of the distance component (from EDITS) and the results of the BoW component (from TIE).

7 Development plans for the second cycle

We have the following basic goals for the second development cycle of the EOP:

(i) concluding the migration of the BIUTEE system into the EOP;
(ii) addressing textual entailment phenomena of higher complexity, including those based on syntax, which requires that the LAPs for the three languages be extended with parsing;
(iii) further enriching the resources that provide entailment rules for the three languages;
(iv) extensively testing the proposed structure and environment for software development;
(v) investigating and progressing on the interoperability of the single components of the linguistic pipelines.

Appendix A: UIMA-CAS example

This appendix provides an example of the UIMA-CAS output of the Italian LAP for the sentence "Hubble è un telescopio" ("Hubble is a telescope").

TextPro annotations:

    # FILE: ./esempio
    # FIELDS: token tokenstart sentence pos lemma
    Hubble       0   -      SPN  Hubble
    è            7   -      VI   essere
    un           9   -      RS   indet
    telescopio   12  -      SS   telescopio
    .            22  <eos>  XPS  full_stop

Adding annotations for text: Hubble è un telescopio.

    uima.tcas.DocumentAnnotation "Hubble è un telescopio." [begin=0, end=23, language="it"]
    eu.excitement.type.entailment.Text "Hubble è un telescopio." [begin=0, end=23]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence "Hubble è un telescopio." [begin=0, end=23]

    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble" [begin=0, end=6, PosValue="SPN"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "Hubble" [begin=0, end=6, value="Hubble"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "Hubble" [begin=0, end=6, parent=null, stem=null, lemma=Lemma("Hubble"), pos=NP("SPN")]

    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "è" [begin=7, end=8, value="essere"]
    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è" [begin=7, end=8, PosValue="VI"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "è" [begin=7, end=8, parent=null, stem=null, lemma=Lemma("essere"), pos=V("VI")]

    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "un" [begin=9, end=11, value="indet"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "un" [begin=9, end=11, parent=null, stem=null, lemma=Lemma("indet"), pos=ART("RS")]
    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un" [begin=9, end=11, PosValue="RS"]

    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio" [begin=12, end=22, PosValue="SS"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "telescopio" [begin=12, end=22, value="telescopio"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "telescopio" [begin=12, end=22, parent=null, stem=null, lemma=Lemma("telescopio"), pos=NN("SS")]

    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "." [begin=22, end=23, PosValue="XPS"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "." [begin=22, end=23, value="full_stop"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "." [begin=22, end=23, parent=null, stem=null, lemma=Lemma("full_stop"), pos=PUNC("XPS")]

TextPro annotations:

    # FILE: ./esempio
    # FIELDS: token tokenstart sentence pos lemma
    Hubble       0   -      SPN  Hubble
    non          7   -      B    non
    è            11  -      VI   essere
    un           13  -      RS   indet
    telescopio   16  -      SS   telescopio
    .            26  <eos>  XPS  full_stop

Adding annotations for text: Hubble non è un telescopio.

    uima.tcas.DocumentAnnotation "Hubble non è un telescopio." [begin=0, end=27, language="it"]
    eu.excitement.type.entailment.Hypothesis "Hubble non è un telescopio." [begin=0, end=27]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence "Hubble non è un telescopio." [begin=0, end=27]

    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble" [begin=0, end=6, PosValue="SPN"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "Hubble" [begin=0, end=6, value="Hubble"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "Hubble" [begin=0, end=6, parent=null, stem=null, lemma=Lemma("Hubble"), pos=NP("SPN")]

    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "non" [begin=7, end=10, value="non"]
    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ADV "non" [begin=7, end=10, PosValue="B"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "non" [begin=7, end=10, parent=null, stem=null, lemma=Lemma("non"), pos=ADV("B")]

    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "è" [begin=11, end=12, value="essere"]
    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è" [begin=11, end=12, PosValue="VI"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "è" [begin=11, end=12, parent=null, stem=null, lemma=Lemma("essere"), pos=V("VI")]

    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "un" [begin=13, end=15, value="indet"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "un" [begin=13, end=15, parent=null, stem=null, lemma=Lemma("indet"), pos=ART("RS")]
    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un" [begin=13, end=15, PosValue="RS"]

    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio" [begin=16, end=26, PosValue="SS"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "telescopio" [begin=16, end=26, value="telescopio"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "telescopio" [begin=16, end=26, parent=null, stem=null, lemma=Lemma("telescopio"), pos=NN("SS")]

    de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "." [begin=26, end=27, PosValue="XPS"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma "." [begin=26, end=27, value="full_stop"]
    de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "." [begin=26, end=27, parent=null, stem=null, lemma=Lemma("full_stop"), pos=PUNC("XPS")]

Pair 450 written as ./450.xmi