Work Package 4 Deliverable 4.1. Textual Entailment Open Platform, cycle I


This document is part of the EXCITEMENT project, funded by the 7th Framework Programme of the European Commission through grant agreement no.:

EXploring Customer Interaction via Textual EntailMENT

Work Package 4
Deliverable 4.1. Textual Entailment Open Platform, cycle I

Authors: Bernardo Magnini, Roberto Zanoli, Vivi Nastase, Rui Wang, Meni Adler, Asher Stern, Tae-Gil Noh
Dissemination Level: Public
Date: December 31st, 2012

Grant agreement no.:
Project acronym: EXCITEMENT
Project full title: EXploring Customer Interaction via Textual EntailMENT
Funding scheme: STREP
Coordinator: Moshe Wasserblat (NICE)
Start date, duration: 1 January 2012, 36 months
Distribution: Public
Contractual date of delivery: 31/12/2012
Actual date of delivery: 31/12/2012
Deliverable number: D4.1
Deliverable title: Textual Entailment Open Platform, Cycle I
Type: Report
Status and version: Final version 1.1
Number of pages: 35
Contributing partners: FBK, DFKI, BIU, UHEI
WP leader: FBK
Task leader: FBK
Authors: Bernardo Magnini, Roberto Zanoli, Vivi Nastase, Rui Wang, Meni Adler, Asher Stern, Tae-Gil Noh
EC project officer: Carola Carstens

The partners in EXCITEMENT are:
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany
NICE Systems, Israel
Fondazione Bruno Kessler (FBK), Italy
Bar-Ilan University, Israel
Heidelberg University, Germany
OMQ, Germany
ALMAWAVE, Italy

For copies of reports, updates on project activities and other EXCITEMENT-related information, contact:
NICE Systems EXCITEMENT
Moshe Wasserblat, [email protected]
Hapnina 8, Ra'anana, Israel
Phone: +972 (9)
Fax: +972 (9)

Copies of reports and other material can also be accessed via ,

The Individual Authors. No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Table of Contents

1 Introduction
1.1 EOP Cycle I: Main Characteristics
1.2 Architecture Specifications
1.3 EOP I Cycle development plan
2 Task 4.1: Implementation of the platform architecture
2.1 Development environment
    Topology of Git repositories
    Storage
    Code contribution
    Project dependencies
3 Task 4.2: Linguistic Analysis Pipeline
3.1 Preprocessing for Italian
3.2 Preprocessing for German and English
4 Task 4.3: Entailment Algorithms
4.1 Lexical Edit Distance: EDA and component
4.2 Classification-based EDA
5 Task 4.4: Entailment Resources
6 Task 4.5: Combination and interoperability of Entailment Components
6.1 Linguistic annotation interoperability
6.2 Resource interoperability
6.3 EDA interoperability
6.4 Component interoperability
7 Development plans for the second cycle
Appendix A: UIMA-CAS example

1 Introduction

According to the project's DoW, work package 4 ("Textual Entailment Open Platform") will develop the component-based approach proposed in EXCITEMENT, realizing the Open Platform for Textual Entailment Inference. This deliverable describes the progress in the development of the Excitement Open Platform (EOP) carried out under WP4. We address five tasks within WP4, specifically:

T4.1 Implementation of the platform architecture;
T4.2 Linguistic analysis pipeline (pre-processing);
T4.3 Entailment algorithms;
T4.4 Entailment resources;
T4.5 Combination of entailment components.

For each task we show the progress after one year of activity (i.e. the first development cycle of the project) and highlight the next steps.

1.1 EOP Cycle I: Main Characteristics

The first prototype of the Excitement Open Platform (EOP-v1.0) has the following main characteristics:

Based on three existing systems. The EOP entailment engines implement some of the main functionalities of the three existing entailment engines (BIUTEE, EDITS, TIE) from which the platform originates.

Multilinguality. Given input Text-Hypothesis pairs in one of the three languages of the project (i.e. English, German, Italian), the EOP produces corresponding entailment judgments (i.e. Entailment, Not-Entailment).

Conformance to the component-based approach. Existing functionalities of the entailment engines are adapted to the blocks of the EOP's architecture (defined in WP3), which includes: pipeline, entailment decision algorithms (EDAs), knowledge components, distance components, and resources.

Learning capacity. The EOP is able to build models on training data, which are then evaluated on corresponding test data. For this purpose we use the RTE-3 dataset in the three languages.

Focus on lexical entailment. We start from entailment issues that are based on lexical phenomena in the three languages, for which all the academic partners have experience.
For each language we identify lexical resources contributing to entailment (e.g. WordNet, Wikipedia), and build corresponding knowledge components, distance components, and EDAs. Typical phenomena responsible for lexical entailment include: synonymy, hypernymy, antonymy, morphological derivations, named entity variations, acronyms, abbreviations, world knowledge about people and locations, etc. Although the focus for EOP-v1 is lexical entailment, we plan to move rapidly to syntactic phenomena thanks to the components and EDAs developed in BIUTEE and TIE.

Interoperability. We intend to demonstrate the advantages of the component-based approach in terms of interoperating components at several levels of the EOP architecture, including:
- the pipeline level: a shared UIMA-CAS for the three languages;
- the EDA level: the same EDAs used by all languages (e.g. word overlap, edit distance);
- the component and resource level: shared APIs of knowledge components, e.g. the same high-level APIs for the three WordNets.

Open source distribution. We aim to show that EOP-v1 is already synchronized with the goal of open source diffusion, and for each resource/software included in the prototype we provide the information required for its distribution. This activity has been synchronized with the activity in WP8.

Relation with the project's application scenarios. We intend to show that the first prototype already serves at least some of the requirements of the two industrial applications foreseen in the project: entailment-based graph exploration and entailment-based retrieval. This activity has been synchronized with the activity in WP6 on the Transduction Layer.

Demonstration. In order to make the first prototype as understandable as possible to potential users, we have been developing an EOP-v1 demonstrator based on a minimal web interface that: (i) allows the user to choose a configuration among a library of already experimented configurations; (ii) allows the user to insert a T-H pair; (iii) provides a library of built-in examples; (iv) provides a basic explanation of the system's behavior.
1.2 Architecture Specifications

This document assumes the architecture specifications described in Deliverable D3.1 (specifically version D, December 2012), as well as the terminology used in that document.

1.3 EOP I Cycle development plan

The following incremental releases of the platform were planned and then realized during the first year of the project.

EOP v0.1, October 15, 2012 ("proof of concept")

This release (see Figure 1) is mainly intended as a proof of concept of the architectural design defined in Deliverable 3.1a. EOP-v0.1 features:

Linguistic annotation: the minimum requirement for each language is tokenization; additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
EDA: a single EDA based on Edit Distance on tokens. The entailment decision is based on an edit distance threshold between T and H, estimated on a training set.
Components: a single, language-independent component based on Edit Distance on tokens (i.e. the one implemented in EDITS).

Figure 1: EOP-v0.1 (architecture diagram: the Italian, German and English LAPs, each providing DKPro tokenization, feed UIMA-CAS to the Edit Distance EDA from EDITS, which uses the Edit Distance on Tokens distance component with fixed costs of operations and is tested on single T/H pairs, producing an entailment/not entailment judgment).
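As a rough illustration of the v0.1 decision rule (not the actual EDITS code; class name, normalization and threshold value are hypothetical), the following sketch computes a token-level edit distance, normalizes it to [0, 1], and compares it against a threshold that the real platform would estimate on training data:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the EOP-v0.1 decision procedure: token-level
// edit distance between T and H, normalized, compared to a threshold.
// In the real platform the threshold is estimated on a training set.
public class TokenEditDistanceEDA {

    // Standard Levenshtein distance over token sequences
    // (insertion, deletion and substitution all cost 1).
    static int editDistance(List<String> t, List<String> h) {
        int[][] d = new int[t.size() + 1][h.size() + 1];
        for (int i = 0; i <= t.size(); i++) d[i][0] = i;
        for (int j = 0; j <= h.size(); j++) d[0][j] = j;
        for (int i = 1; i <= t.size(); i++) {
            for (int j = 1; j <= h.size(); j++) {
                int sub = t.get(i - 1).equals(h.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[t.size()][h.size()];
    }

    // Normalized distance: raw edit distance divided by the length of
    // the longer token sequence, so the score falls in [0, 1].
    static double normalizedDistance(List<String> t, List<String> h) {
        int max = Math.max(t.size(), h.size());
        return max == 0 ? 0.0 : (double) editDistance(t, h) / max;
    }

    // Entailment iff the distance is below the (hypothetical) threshold.
    static boolean entails(List<String> t, List<String> h, double threshold) {
        return normalizedDistance(t, h) < threshold;
    }

    public static void main(String[] args) {
        List<String> t = Arrays.asList("Hubble", "è", "un", "telescopio");
        List<String> h = Arrays.asList("Hubble", "non", "è", "un", "telescopio");
        System.out.println(normalizedDistance(t, h)); // one insertion over 5 tokens = 0.2
        System.out.println(entails(t, h, 0.1));       // false with a strict threshold
    }
}
```

Note that the negation example above illustrates a known weakness of purely lexical distance: T and H differ by a single token yet are contradictory, which is why the threshold must be tuned on annotated pairs.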

EOP v0.2, November 6, 2012 (basic lexical resources; Rome meeting)

This release (see Figure 2) aims at integrating into the platform a number of selected functionalities currently implemented for the three languages in the available systems (BIUTEE, EDITS, TIE) at the level of lexical entailment. The minimum expected capacity is lexical entailment (e.g. synonymy, hypernyms, morphological derivations, etc.) based on lexical resources.

Linguistic annotation: the minimum requirements for each language are tokenization, POS tagging, lemmatization and morphological analysis. Additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
EDA: to be defined for each language on the basis of the existing systems. No interoperability is required for this release.
Components: separate components for each of the three languages. The minimum requirement is the use of lexical knowledge in order to address typical phenomena of lexical variability.

Figure 2: EOP-v0.2 (architecture diagram: the Italian, German and English LAPs now produce tokens, lemmas and POS tags; through UIMA-CAS, the Edit Distance EDA from EDITS, tested on RTE-3 with fixed costs of operations, can use the Edit Distance on Tokens distance component from EDITS and the entailment-rule lexical component from BIUTEE, backed by the Italian, German and English WordNets and the Italian and English Wikipedias).

EOP v1, December 2012

This release (see Figure 3) aims at introducing some level of interoperability among the EDAs and components already used in version 0.2. This version is also the official deliverable of WP4 at month 12.

Linguistic annotation: same as for version 0.2; additional annotations are welcome, provided that they are compatible with the adopted UIMA-CAS.
EDA: the minimum requirement is at least one EDA (parameterized on languages) which can work on the three languages (e.g. Dan Roth's lexical similarity algorithm).
Components: the minimum requirement is at least one component (e.g. WordNets) which shares APIs for the three languages.
Training and test: the minimum requirement is training and test on RTE-3 in the three languages (or part of it if not available). Additional training and test data is welcome when available, particularly for English.
Configuration: a configuration file describes in a declarative way the parameters of the components and EDA used in a certain run of the platform.

Figure 3: EOP-v1.0 (architecture diagram: a Configurator drives the Italian, German and English LAPs (tokens, lemmas, POS); through UIMA-CAS, the Edit Distance EDA from EDITS and the classification-based EDA from TIE, both trained and tested on RTE-3, use the BoW similarity scoring component (TIE), the Edit Distance on Tokens distance component (EDITS), a German distributional similarity resource, and the entailment-rule lexical component (BIUTEE), backed by the Italian, German and English WordNets and the Italian and English Wikipedias).
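To make the declarative configuration concrete, here is a purely illustrative sketch of what such a file might look like. The element and attribute names below are invented for illustration; the actual EOP configuration schema is defined in the architecture specifications of Deliverable D3.1.

```xml
<!-- Hypothetical sketch only: element and attribute names are invented;
     the real EOP configuration format is specified in Deliverable D3.1. -->
<configuration language="IT">
  <lap class="TextProLAP"/>
  <eda class="EditDistanceEDA">
    <parameter name="threshold" value="0.4"/>
  </eda>
  <components>
    <component class="EditDistanceOnTokens"/>
    <component class="WordNetLexicalResource"/>
  </components>
  <dataset train="rte3_it_train.xml" test="rte3_it_test.xml"/>
</configuration>
```

The point of such a file is that a run of the platform is fully described by data rather than code: swapping the EDA or a component means editing the configuration, not recompiling.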

EOP v1.1, February 2013 (planned)

This release will be shown at the project's first review meeting (February 2013). All functionalities of version 1.0 will be carefully tested, and a minimal graphical interface will be realized for demo purposes.

Linguistic annotation: same as version 1.0.
EDA: same as version 1.0.
Components: same as version 1.0.
Training and test: same as version 1.0.
Graphical interface: a minimal interface for demo purposes, allowing the user to run the platform for training and testing on different configurations.

2 Task 4.1: Implementation of the platform architecture

This task focuses on defining the policies for software development adopted by the Excitement consortium, as well as on providing the main tools for managing the software and data used by EOP-v1.

2.1 Development environment

Given the complexity of the EOP platform, both in terms of the different groups contributing to its development and because of the different nature of the software involved, the definition and setting up of the development environment has required several decisions. The EOP platform, due to its characteristics, requires managing different kinds of code and data, with a significant level of complexity also with respect to other academic experiences (e.g. Moses in the Machine Translation field). We have analysed the software and the data used by the platform along two main dimensions: (1) whether the software/data has been developed as foreground of the Excitement project or by third parties; (2) whether the distribution licence of the software/data is free or restricted by specific constraints. On the basis of these characteristics, we have adopted a structured schema for software development. Tables 1 and 2 show, respectively, the schema of the software and data localization in the Excitement development environment: specifically, we distinguish a version control repository, for which we use Git, and an internal repository, for which we use Maven.

Table 1: EOP software management
- Java code, 3rd party software: internal Maven repository if under LGPL or an equivalent license; otherwise (e.g. GPL), up to the EOP user to install.
- Java code, Excitement code: version control repository.
- Non-Java code (e.g. C++), 3rd party software: version control repository if under LGPL or an equivalent license; otherwise, up to the EOP user to install.

Table 2: EOP data management
- 3rd party linguistic and knowledge resources: internal Maven repository if under a Creative Commons, research or equivalent license (e.g. Italian MultiWordNet); commercial or proprietary resources (e.g. GermaNet) are up to the EOP user to get.
- 3rd party RTE data sets: version control repository.
- Excitement linguistic and knowledge resources (e.g. WordNet rules, Italian Wikipedia): internal Maven repository.
- Excitement RTE data sets (e.g. Italian RTE3): version control repository.

The rest of this section provides details of the software and data repositories.

Topology of Git repositories

Git is a distributed revision control and source code management (SCM) system with an emphasis on speed. Git was initially designed and developed by Linus Torvalds for Linux kernel development and has since been adopted by many other projects. Every Git working directory is a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server. Git is free software distributed under the terms of the GNU General Public License version 2.

In order to handle the contribution of multiple groups to the public repository, we adopt a multi-layer Git repository topology, which also supports groups that wish to develop their code in a separate private repository. Group-only code is developed in two tiers, a developer tier and a group tier (sharing between group developers and backup), and is entirely private. The Excitement code is developed in three tiers: a private developer tier, a group tier (sharing between group developers and backup), and a public Excitement tier (sharing with all parties and backup), where code is pushed from lower tiers to upper tiers.

Figure 4: EOP schema for software development

The main Excitement code repository (1) includes all code that is shared as part of the Excitement project between all groups, and will eventually become public.

Each of the groups may maintain private code which is not part of the Excitement code (2), and Excitement code which is under development and not yet ready for sharing with other groups (3). The group repository will depend on the Excitement repository, but not the other way around. Each developer in each group can work with both repository instances at the same time (4 and 5 for developer 1, 6 and 7 for developer 2).

Storage

We identify three types of data which are not source code:
- Java third parties, e.g., jars that are referred to by the Excitement code but are not available in the Maven repository.
- Non-Java third parties, e.g., the EasyFirst parser for English.
- Data files, e.g., knowledge resources.

These types of data should not be part of the Git repository, and should be stored in an online shared repository.

Internal Maven repository

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information. We use Maven's mechanism for downloading from an additional repository to handle the download of the required files (e.g. jars). The purpose of the Maven repository is to work as an internal private repository for all software libraries and data files, such as lexical and knowledge resources. Storing Maven artefacts (e.g. jars) in a dedicated Maven repository is preferable to storing them in a version control system like GitHub for the following reasons: (i) libraries (jars) are binary files and do not belong in version control systems, which are better at handling text files that are frequently edited; (ii) it keeps the version control repository small; (iii) checkouts, updates and other actions on the version control system are quicker.

The Excitement Maven repository consists of three sub-repositories:
- Private internal repository: contains artefacts which are used only within the EOP project. These are manually uploaded by the developers. It is not synchronized with the Maven central repository, as its artefacts are private to the organization.
- 3rd party repository: contains artefacts which are publicly available but not in the Maven central repository, for example the latest versions of libraries which are not yet available there. This repository is not synchronized with the Maven central repository, given that the latter does not have these libraries or data files.
- Repo1-cache: synchronized with the Maven central repository; it is a cache of the artefacts from it.

Code contribution

Currently (i.e. for the first EOP prototype), groups work individually on their own modules. At later stages, when code contributors add their own code to different modules, the standard development procedure will be based on branch and merge. Each group of users (currently the four academic partners of the project) has identified a person responsible for merging. FBK, as work package leader, has the overall responsibility of merging the contributions from the different groups.

Project dependencies

The m2e (Maven to Eclipse) plugin allows easy generation of Maven projects and definition of dependencies. Ultimately, a pom.xml file is generated, defining the repositories and dependencies of the current project. It also defines the project's ID and version, which can then be used as a dependency in other Maven projects.

3 Task 4.2: Linguistic Analysis Pipeline

For the pre-processing of text-hypothesis (T-H) pairs, according to the specifications described in D3.1.3, we have adopted the UIMA-CAS format to ensure interoperability among the linguistic analyses for the different languages. For EOP-v1 the focus has been on the following linguistic modules: tokenization, lemmatization and part-of-speech tagging, as they are the most relevant for lexical entailment.

3.1 Preprocessing for Italian

For Italian pre-processing we use TextPro (Pianta et al. 2008), a suite of modular Natural Language Processing tools developed at FBK. The current version of the tool provides functions ranging from tokenization to part-of-speech tagging and Named Entity Recognition. The system's architecture is organized as a pipeline of processors, wherein each stage accepts data from the initial input or from the output of a previous stage, executes a specific task, and sends the resulting data to the next stage or to the output of the pipeline. To process Italian RTE data, a sample of which is shown in Example 1, we have built a wrapper around TextPro. The annotation produced for the text portion of the given pair is shown in Example 2.

<pair id="xxx" entailment="nonentailment" task="xxx">
  <t>Hubble è un telescopio.</t>
  <h>Hubble non è un telescopio.</h>
</pair>

Example 1: Italian Text/Hypothesis pair

TextPro annotations:

# FILE: ./esempio
# FIELDS: token tokenstart sentence pos lemma
Hubble      0   -      SPN  Hubble
è           7   -      VI   essere
un          9   -      RS   indet
telescopio  12  -      SS   telescopio
.           22  <eos>  XPS  full_stop

Example 2: TextPro output (tabular format)
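The tabular output above must be mapped to annotation objects before conversion to UIMA-CAS. The following sketch is hypothetical (the actual EOP wrapper code differs; class and field names are invented); it parses whitespace-separated TextPro rows into simple token records carrying form, offsets, POS tag and lemma:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one step of a TextPro wrapper: parse the
// tabular output (token, tokenstart, sentence, pos, lemma) into token
// records that could later be turned into UIMA-CAS annotations.
// Illustrative code, not the actual EOP wrapper.
public class TextProParser {

    public static class Token {
        public final String form, pos, lemma;
        public final int begin, end;
        public Token(String form, int begin, String pos, String lemma) {
            this.form = form;
            this.begin = begin;
            this.end = begin + form.length(); // offsets as in the CAS example
            this.pos = pos;
            this.lemma = lemma;
        }
    }

    // Parses rows like "Hubble  0  -  SPN  Hubble"; comment lines
    // starting with '#' and blank lines are skipped.
    public static List<Token> parse(String tabular) {
        List<Token> tokens = new ArrayList<>();
        for (String line : tabular.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] f = line.split("\\s+");
            if (f.length < 5) continue; // skip malformed rows
            tokens.add(new Token(f[0], Integer.parseInt(f[1]), f[3], f[4]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "# FILE: ./esempio\n"
                + "Hubble\t0\t-\tSPN\tHubble\n"
                + "è\t7\t-\tVI\tessere\n"
                + "un\t9\t-\tRS\tindet\n"
                + "telescopio\t12\t-\tSS\ttelescopio\n";
        List<Token> toks = parse(sample);
        System.out.println(toks.size() + " tokens; lemma of '"
                + toks.get(1).form + "' = " + toks.get(1).lemma);
    }
}
```

Records of this kind correspond directly to the Token, Lemma and POS annotations shown in the UIMA-CAS dump of Appendix A.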

Example 3: Italian pipeline output (UIMA-CAS)

3.2 Preprocessing for German and English

For pre-processing German data we use DKPro Core, a collection of software components for natural language processing based on the Apache UIMA framework. Many state-of-the-art NLP components are already freely available in the NLP research community. DKPro Core provides wrappers for such third-party tools as well as original NLP components. DKPro Core builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines. DKPro Core ASL contains those components of DKPro Core that are licensed under the Apache Software License (ASL) version 2. Additional components are available in DKPro Core GPL. Here is a brief (partial) list of the components included in DKPro Core:

tokenization/segmentation (BreakIterator, OpenNLP, LanguageTool, Stanford CoreNLP)
compound splitting (Banana Split, JWordSplitter)
stemming (Snowball)
lemmatization (TreeTagger)

part-of-speech tagging (TreeTagger, OpenNLP, Mecab)
syntactic parsing (OpenNLP, Stanford CoreNLP, Berkeley Parser)
dependency parsing (MaltParser, MstParser, Stanford CoreNLP)
coreference resolution (Stanford CoreNLP)
language identification (TextCat)
spelling correction (Jazzy)

Concerning the languages considered in EXCITEMENT (English, German, Italian), the components listed above are usually available for English and German. During EXCITEMENT we will work to provide wrappers for tools for Italian and for additional tools for English and German.

4 Task 4.3: Entailment Algorithms

During the first development cycle, this task has provided the basic entailment decision algorithms (EDAs) and the basic scoring and knowledge components, i.e. the algorithms in charge of estimating the distance between T and H. In the following we briefly describe the EDAs and distance components of EOP-v1.

4.1 Lexical Edit Distance: EDA and component

EOP-v1 includes both an EDA and a distance component derived from the EDITS system, developed by FBK, mostly in the context of the QALL-ME EU project. EDITS is a software package for recognizing entailment relations between two portions of text. It is based on edit distance algorithms, and computes the T-H distance as the cost of the edit operations that are necessary to transform T into H. EDITS is open source, available under the GNU Lesser General Public License (LGPL). The tool is implemented in Java; it runs on Unix-based operating systems, and has been tested on Mac OS X, Linux, and Sun Solaris. The latest release of the package (version XX, date) can be downloaded at http://edits.fbk.eu. EDITS implements a distance-based framework, which assumes that the probability of an entailment relation between a given T-H pair is inversely proportional to the distance between T and H (i.e. the higher the distance, the lower the probability of entailment).
Within this framework the system implements and combines different approaches to distance computation, providing both edit distance algorithms and similarity algorithms. Each algorithm returns a normalized distance score (a number between 0 and 1). At the training stage, distance scores calculated over annotated T-H pairs are used to estimate a threshold that best separates positive (entailment) from negative (non-entailment) examples. The threshold, which is stored in a model, is used at the test stage to assign an entailment judgment and a confidence score to each test pair.

4.2 Classification-based EDA

EOP-v1 includes a classification-based EDA derived from TIE, a textual entailment system developed by DFKI. TIE has three EDAs in its current implementation. Two EDAs are for special cases and only determine Contradiction (NER-based EDA) and Entailment/Paraphrase (DIRT-based EDA), while the last (the main EDA) can process any given H-T pair to produce the decision Paraphrase, Entailment, Contradiction, or Unknown. The classification-based EDA is the most general of the three EDAs in the TIE engine. It relies on the output of three types of components as features: a lexical level component, a syntactic level (dependency tree) component, and a semantic level (graph of semantic roles) component. The outputs consist of a set of scores: the lexical and syntactic components return 2 scores each, and the semantic level component returns 4 scores. Thus, for the main EDA, every H-T pair is represented as a set of 8 numbers. As for EOP-v1, only the lexical component (described below) has been migrated from TIE into the EXCITEMENT platform.

Step 1: Deciding the three basic relationships "relatedness", "inconsistency", "inequality"

The EDA first classifies an H-T pair with three independent classifiers: "Relatedness", "Inconsistency", and "Inequality". The EDA decomposes the four TE relations into combinations of these three independent relations. Thus, for example, if H and T are related (+ relatedness) and unequal (+ inequality), then the pair must be ENTAILMENT. If H and T are related, but also inconsistent, it is CONTRADICTION, etc.
Step 2: Deciding the Entailment Relationship

When a given H-T pair has been processed up to this point, it is represented as three binary classification results (+/- on relatedness, inconsistency and inequality). Instead of using binary labels (1/0 or +/-), the EDA uses the confidence values obtained from the binary classifiers, so the three classification results are represented as numbers between 0 and 1 (for this reason, it currently uses TADM, which outputs a normalized confidence score).

Finally, another round of classification is done with this three-number representation. This second classifier, which has been trained on the final TE labels, assigns a new H-T pair to one of the four TE relationship labels. The system currently uses the TADM multi-class classifier.

Bag-of-Words scoring component

This component regards an H-T pair as two bags of words. It compares the two bags and returns two scores between 0 and 1, which can be regarded as relatedness/similarity scores between the two bags of words. The component uses three knowledge resources to compare the given H and T bags: VerbOcean, WordNet and Google Normalized Distance (GND). The resources are used in two ways. One is expansion: VerbOcean and WordNet are used to expand the bags with related terms, so that the expanded sets H' and T' are used to calculate overlap scores. The other is associative scoring of terms (thus even terms not included in WordNet/VerbOcean can result in non-zero scores). The return value consists of two scores, one normalized by the size of H and the other by the size of T. TIE regards this information (a sort of coverage balance/imbalance on H and T) as a possibly indicative feature, and keeps both scores for almost all features. In the current implementation, since we aim at a unified API for accessing lexical resources, all the resources provided in the EXCITEMENT platform can be used by this scoring component. If the words are related in the knowledge base, they contribute to the final score. All the scores (from the different lexical resources) are kept as features for the classifier (or other usage) defined in the EDAs.

5 Task 4.4: Entailment Resources

The aims of this task are collecting and integrating into the EOP platform the linguistic resources that are used by the textual entailment algorithms. Some of the resources used existed prior to the start of the current project, in particular the WordNet versions for English, Italian and German, and VerbOcean and FrameNet for English. Other resources were created specifically for this project: entailment rules extracted from Wikipedia for English and Italian, and in the future for German as well. These resources are (and some will be) integrated in the platform through the lexical component interface migrated from BIUTEE. A few resources are currently under construction, e.g. Corpus Pattern Analysis (CPA) for Italian, and will be completed and integrated in the next development cycle. These resources are described in detail in deliverable 5.1 of WP5.

Apart from resources that provide entailment rules and features, we have also developed training and testing data. We have translated the English RTE-3 dataset, consisting of 1600 T-H pairs (800 for training, 800 for testing), into German and Italian. This will allow comparable training and testing for the three languages of the project. The data was manually translated, and the new versions are aligned with the original. The two new datasets are distributed under a Creative Commons licence.

6 Task 4.5: Combination and interoperability of Entailment Components

This task focuses on investigating the potential of the component-based approach developed in Excitement, particularly from two perspectives: (i) the interoperability of the platform; (ii) the combination of different components carried out by an EDA. During the first cycle of development we have started to address the first issue.

6.1 Linguistic annotation interoperability

Linguistic annotations for different languages can be used by the same EDA. This interoperability is achieved through the use of the same format (UIMA-CAS) and the same set of semantic types (derived from DKPro) by the linguistic pipelines for the three languages.

6.2 Resource interoperability

Different linguistic resources can be used by the same EDA. This interoperability is achieved through the lexical component interface, which basically assumes that knowledge extracted from different resources is represented as entailment rules. As an example, although the Italian WordNet and the Italian Wikipedia entailment resources are stored in completely different formats, their relevant content is managed by the lexical component interface as entailment rules, allowing a single EDA (e.g. the edit distance EDA) to use both in a completely transparent way.
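The uniform entailment-rule view of heterogeneous resources can be illustrated schematically. In the following hedged sketch the interface and method names are invented for illustration (the actual EOP lexical component API is specified in Deliverable D3.1): two toy resources with different internal storage expose the same rule-lookup interface, and a caller consults both transparently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of resource interoperability: interface and method
// names are invented for this example; the real EOP lexical component
// API is specified in Deliverable D3.1.
public class ResourceInteropDemo {

    // A lexical entailment rule: lhs entails rhs (e.g. "cane" -> "animale").
    public static class Rule {
        public final String lhs, rhs;
        public Rule(String lhs, String rhs) { this.lhs = lhs; this.rhs = rhs; }
    }

    // The common view every lexical resource exposes, regardless of how
    // the underlying data (WordNet, Wikipedia, ...) is actually stored.
    public interface LexicalResource {
        List<Rule> rulesForLeft(String lemma);
    }

    // Toy stand-in for a WordNet-backed resource (map-based storage).
    public static class ToyWordNet implements LexicalResource {
        private final Map<String, List<Rule>> rules = new HashMap<>();
        public ToyWordNet() {
            rules.put("cane", Arrays.asList(new Rule("cane", "animale")));
        }
        public List<Rule> rulesForLeft(String lemma) {
            return rules.getOrDefault(lemma, Collections.emptyList());
        }
    }

    // Toy stand-in for a Wikipedia-derived resource (computed on demand).
    public static class ToyWikipedia implements LexicalResource {
        public List<Rule> rulesForLeft(String lemma) {
            return lemma.equals("Hubble")
                    ? Arrays.asList(new Rule("Hubble", "telescopio"))
                    : Collections.emptyList();
        }
    }

    // An EDA can consult any number of resources through the same interface.
    public static boolean lexicallyEntails(String tWord, String hWord,
                                           List<LexicalResource> resources) {
        for (LexicalResource r : resources)
            for (Rule rule : r.rulesForLeft(tWord))
                if (rule.rhs.equals(hWord)) return true;
        return tWord.equals(hWord);
    }

    public static void main(String[] args) {
        List<LexicalResource> resources =
                Arrays.asList(new ToyWordNet(), new ToyWikipedia());
        System.out.println(lexicallyEntails("cane", "animale", resources));
        System.out.println(lexicallyEntails("Hubble", "telescopio", resources));
        System.out.println(lexicallyEntails("cane", "telescopio", resources));
    }
}
```

The design point is that the EDA depends only on the rule-lookup interface, so adding a new resource (or a new language's WordNet) requires no change to the EDA itself.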

6.3 EDA interoperability

Different EDAs can use the same distance component. This interoperability is achieved through a strict separation of the algorithm taking the entailment decision (i.e. the EDA) from the algorithm that calculates the distance between T and H in a pair (i.e. the distance component). As an example, both the edit distance EDA (from EDITS) and the classification-based EDA (from TIE) can take advantage of the result of the distance component (from EDITS).

6.4 Component interoperability

Different components can be used by the same EDA. As in the previous case, this interoperability is achieved through a strict separation of the algorithm taking the entailment decision from the algorithm that calculates the distance between T and H in a pair. As an example, the classification-based EDA (from TIE) can use, both separately and in combination, the results of the distance component (from EDITS) and the results of the BoW component (from TIE).

7 Development plans for the second cycle

We have the following basic goals for the second development cycle of the EOP:
(i) concluding the migration of the BIUTEE system into the EOP;
(ii) addressing textual entailment phenomena of higher complexity, including those that are based on syntax; this requires that the LAP for the three languages is extended with parsing;
(iii) further enriching the resources that provide entailment rules for the three languages;
(iv) extensively testing the proposed structure and environment for software development;
(v) investigating and progressing on the interoperability of single components of the linguistic pipelines.

Appendix A: UIMA-CAS example

This appendix provides an example of the UIMA-CAS output of the Italian LAP for the sentence "Hubble è un telescopio" ("Hubble is a telescope").

TextPro annotations:

# FILE: ./esempio
# FIELDS: token tokenstart sentence pos lemma
Hubble        0   -      SPN  Hubble
è             7   -      VI   essere
un            9   -      RS   indet
telescopio    12  -      SS   telescopio
.             22  <eos>  XPS  full_stop

Adding annotations for text: Hubble è un telescopio.

uima.tcas.DocumentAnnotation "Hubble è un telescopio."
    begin = 0, end = 23, language = "it"

eu.excitement.type.entailment.Text "Hubble è un telescopio."
    begin = 0, end = 23

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence "Hubble è un telescopio."
    begin = 0, end = 23

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
    begin = 0, end = 6, PosValue = "SPN"

"Hubble"
    begin = 0, end = 6, value = "Hubble"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "Hubble"
    begin = 0, end = 6, parent = null
    lemma = "Hubble" (begin = 0, end = 6, value = "Hubble")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
          (begin = 0, end = 6, PosValue = "SPN")

"è"
    begin = 7, end = 8, value = "essere"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
    begin = 7, end = 8, PosValue = "VI"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "è"
    begin = 7, end = 8, parent = null
    lemma = "è" (begin = 7, end = 8, value = "essere")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
          (begin = 7, end = 8, PosValue = "VI")

"un"
    begin = 9, end = 11, value = "indet"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "un"
    begin = 9, end = 11, parent = null
    lemma = "un" (begin = 9, end = 11, value = "indet")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
          (begin = 9, end = 11, PosValue = "RS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
    begin = 9, end = 11, PosValue = "RS"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
    begin = 12, end = 22, PosValue = "SS"

"telescopio"
    begin = 12, end = 22, value = "telescopio"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "telescopio"
    begin = 12, end = 22, parent = null
    lemma = "telescopio" (begin = 12, end = 22, value = "telescopio")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
          (begin = 12, end = 22, PosValue = "SS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
    begin = 22, end = 23, PosValue = "XPS"

"."
    begin = 22, end = 23, value = "full_stop"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "."
    begin = 22, end = 23, parent = null
    lemma = "." (begin = 22, end = 23, value = "full_stop")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
          (begin = 22, end = 23, PosValue = "XPS")

TextPro annotations:

# FILE: ./esempio
# FIELDS: token tokenstart sentence pos lemma
Hubble        0   -      SPN  Hubble
non           7   -      B    non
è             11  -      VI   essere
un            13  -      RS   indet
telescopio    16  -      SS   telescopio
.             26  <eos>  XPS  full_stop

Adding annotations for text: Hubble non è un telescopio.

uima.tcas.DocumentAnnotation "Hubble non è un telescopio."
    begin = 0, end = 27, language = "it"

eu.excitement.type.entailment.Hypothesis "Hubble non è un telescopio."
    begin = 0, end = 27

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence "Hubble non è un telescopio."
    begin = 0, end = 27

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
    begin = 0, end = 6, PosValue = "SPN"

"Hubble"
    begin = 0, end = 6, value = "Hubble"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "Hubble"
    begin = 0, end = 6, parent = null
    lemma = "Hubble" (begin = 0, end = 6, value = "Hubble")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NP "Hubble"
          (begin = 0, end = 6, PosValue = "SPN")

"non"
    begin = 7, end = 10, value = "non"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ADV "non"
    begin = 7, end = 10, PosValue = "B"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "non"
    begin = 7, end = 10, parent = null
    lemma = "non" (begin = 7, end = 10, value = "non")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ADV "non"
          (begin = 7, end = 10, PosValue = "B")

"è"
    begin = 11, end = 12, value = "essere"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
    begin = 11, end = 12, PosValue = "VI"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "è"
    begin = 11, end = 12, parent = null
    lemma = "è" (begin = 11, end = 12, value = "essere")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.V "è"
          (begin = 11, end = 12, PosValue = "VI")

"un"
    begin = 13, end = 15, value = "indet"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "un"
    begin = 13, end = 15, parent = null
    lemma = "un" (begin = 13, end = 15, value = "indet")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
          (begin = 13, end = 15, PosValue = "RS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.ART "un"
    begin = 13, end = 15, PosValue = "RS"

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
    begin = 16, end = 26, PosValue = "SS"

"telescopio"
    begin = 16, end = 26, value = "telescopio"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "telescopio"
    begin = 16, end = 26, parent = null
    lemma = "telescopio" (begin = 16, end = 26, value = "telescopio")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.NN "telescopio"
          (begin = 16, end = 26, PosValue = "SS")

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
    begin = 26, end = 27, PosValue = "XPS"

"."
    begin = 26, end = 27, value = "full_stop"

de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token "."
    begin = 26, end = 27, parent = null
    lemma = "." (begin = 26, end = 27, value = "full_stop")
    stem = null
    pos = de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.PUNC "."
          (begin = 26, end = 27, PosValue = "XPS")

Pair 450 written as ./450.xmi
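The begin/end values in the CAS dump above are character offsets into the annotated text. As a quick sanity check (plain Java, independent of the UIMA API), the substring covered by each annotation's span recovers its surface form:

```java
// Verifies the character offsets shown in the CAS dump: for an annotation
// with offsets (begin, end), text.substring(begin, end) yields its
// covered text.
public class OffsetCheck {
    static String span(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "Hubble è un telescopio.";
        String hypo = "Hubble non è un telescopio.";
        System.out.println(span(text, 0, 6));    // Hubble
        System.out.println(span(text, 12, 22));  // telescopio
        System.out.println(span(hypo, 7, 10));   // non
        System.out.println(span(hypo, 16, 26));  // telescopio
    }
}
```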


More information

31 Case Studies: Java Natural Language Tools Available on the Web

31 Case Studies: Java Natural Language Tools Available on the Web 31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software

More information

IBM Watson Ecosystem. Getting Started Guide

IBM Watson Ecosystem. Getting Started Guide IBM Watson Ecosystem Getting Started Guide Version 1.1 July 2014 1 Table of Contents: I. Prefix Overview II. Getting Started A. Prerequisite Learning III. Watson Experience Manager A. Assign User Roles

More information

File S1: Supplementary Information of CloudDOE

File S1: Supplementary Information of CloudDOE File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.

More information

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns

GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns GrammAds: Keyword and Ad Creative Generator for Online Advertising Campaigns Stamatina Thomaidou 1,2, Konstantinos Leymonis 1,2, Michalis Vazirgiannis 1,2,3 Presented by: Fragkiskos Malliaros 2 1 : Athens

More information