The Prolog Interface to the Unstructured Information Management Architecture



Similar documents
Shallow Parsing with Apache UIMA

PMML and UIMA Based Frameworks for Deploying Analytic Applications and Services

» A Hardware & Software Overview. Eli M. Dow <emdow@us.ibm.com:>

IEEE International Conference on Computing, Analytics and Security Trends CAST-2016 (19 21 December, 2016) Call for Paper

UIMA: Unstructured Information Management Architecture for Data Mining Applications and developing an Annotator Component for Sentiment Analysis

Natural Language to Relational Query by Using Parsing Compiler

Natural Language Database Interface for the Community Based Monitoring System *

Natural Language Processing in the EHR Lifecycle

Auto-Classification for Document Archiving and Records Declaration

A GENERALIZED APPROACH TO CONTENT CREATION USING KNOWLEDGE BASE SYSTEMS

Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources

UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment

Rules and Business Rules

How To Make Sense Of Data With Altilia

Integrating Annotation Tools into UIMA for Interoperability

Using the MetaMap Java API

Unstructured Information Management Architecture (UIMA) Version 1.0

Combining SAWSDL, OWL DL and UDDI for Semantically Enhanced Web Service Discovery

Managing large sound databases using Mpeg7

Artificial Intelligence. Class: 3 rd

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Andre Standback. IT 103, Sec /21/12. IBM s Watson. GMU Honor Code on I am fully aware of the

Survey Results: Requirements and Use Cases for Linguistic Linked Data

A Case Study of Question Answering in Automatic Tourism Service Packaging

Software Architecture Document

Chapter 1. Dr. Chris Irwin Davis Phone: (972) Office: ECSS CS-4337 Organization of Programming Languages

Special Topics in Computer Science

1/20/2016 INTRODUCTION

Artificial Intelligence

The Knowledge Sharing Infrastructure KSI. Steven Krauwer

Abstracting the types away from a UIMA type system

Semantic Search in Portals using Ontologies

LINKING DOCUMENTS IN REPOSITORIES TO STRUCTURED DATA IN DATABASE

Watson. An analytical computing system that specializes in natural human language and provides specific answers to complex questions at rapid speeds

ONTOLOGY-BASED MULTIMEDIA AUTHORING AND INTERFACING TOOLS 3 rd Hellenic Conference on Artificial Intelligence, Samos, Greece, 5-8 May 2004

Simplifying e Business Collaboration by providing a Semantic Mapping Platform

Professional Organization Checklist for the Computer Science Curriculum Updates. Association of Computing Machinery Computing Curricula 2008

Software Engineering EMR Project Report

HOW TO DO A SMART DATA PROJECT

Pattern based approach for Natural Language Interface to Database

Building a Question Classifier for a TREC-Style Question Answering System

Business Proposition. Digital Asset Management. Media Intelligent

Lecture 9. Semantic Analysis Scoping and Symbol Table

Data Quality in Information Integration and Business Intelligence

Language Evaluation Criteria. Evaluation Criteria: Readability. Evaluation Criteria: Writability. ICOM 4036 Programming Languages

A Repository of Semantic Open EHR Archetypes

BUSINESS VALUE OF SEMANTIC TECHNOLOGY

Cognitive z. Mathew Thoennes IBM Research System z Research June 13, 2016

Pragmatic Web 4.0. Towards an active and interactive Semantic Media Web. Fachtagung Semantische Technologien September 2013 HU Berlin

BIG DATA ANALYTICS For REAL TIME SYSTEM

Language Interface for an XML. Constructing a Generic Natural. Database. Rohit Paravastu

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Search and Data Mining: Techniques. Introduction Anna Yarygina Boris Novikov

A Logic-Based Framework for Digital Signal Processing

Industry 4.0 and Big Data

Semantic Stored Procedures Programming Environment and performance analysis

TECHNIQUES FOR OPTIMIZING THE RELATIONSHIP BETWEEN DATA STORAGE SPACE AND DATA RETRIEVAL TIME FOR LARGE DATABASES

Introducing Formal Methods. Software Engineering and Formal Methods

Typical programme structures for MSc programmes in the School of Computing Science

Information Services for Smart Grids

Semantic annotation of requirements for automatic UML class diagram generation

WIRIS quizzes web services Getting started with PHP and Java

Flattening Enterprise Knowledge

Folksonomies versus Automatic Keyword Extraction: An Empirical Study

UIMA Overview & SDK Setup

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak

DATA SCIENCE ADVISING NOTES David Wild - updated May 2015

MANDARAX + ORYX An Open-Source Rule Platform

WordNet Website Development And Deployment Using Content Management Approach

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Performance Evaluation Techniques for an Automatic Question Answering System

Interactive Dynamic Information Extraction

A HUMAN RESOURCE ONTOLOGY FOR RECRUITMENT PROCESS

Linked Data Interface, Semantics and a T-Box Triple Store for Microsoft SharePoint

CSCI 3136 Principles of Programming Languages

A Teaching Strategies Engine Using Translation from SWRL to Jess

Winter 2016 Course Timetable. Legend: TIME: M = Monday T = Tuesday W = Wednesday R = Thursday F = Friday BREATH: M = Methodology: RA = Research Area

Perspectives of Semantic Web in E- Commerce

Transcription:

The Prolog Interface to the Unstructured Information Management Architecture Paul Fodor 1, Adam Lally 2, David Ferrucci 2 1 Stony Brook University, Stony Brook, NY 11794, USA, pfodor@cs.sunysb.edu 2 IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, {alally, ferrucci}@us.ibm.com Abstract. In this paper we describe the design and implementation of the Prolog interface to the Unstructured Information Management Architecture (UIMA) and some of its applications in natural language processing. The UIMA Prolog interface translates unstructured data and the UIMA Common Analysis Structure (CAS) into a Prolog knowledge base, over which, the developers write rules and use resolution theorem proving to search and generate new annotations over the unstructured data. These rules can explore all the previous UIMA annotations (such as, the syntactic structure, parsing statistics) and external Prolog knowledge bases (such as, Prolog WordNet and Extended WordNet) to implement a variety of tasks for the natural language analysis. We also describe applications of this logic programming interface in question analysis (such as, focus detection, answer-type and other constraints detection), shallow parsing (such as, relations in the syntactic structure), and answer selection. Keywords: Prolog interface, natural language processing, Unstructured Information Management Architecture 1 Introduction The Unstructured Information Management Architecture (UIMA) is a framework for analyzing large amounts of unstructured content, such as text, audio and video [1-4], used for text and multi-modal analytics, question answering, bioinformatics, machine translation systems, knowledge integration, semantic search, etc. The innermost part of UIMA is the Common Analysis Structure (CAS), a dynamic data structure which contains: unstructured data (i.e., data whose intended meaning is still to be inferred), structured annotations over this data and various user views over these annotations. Applications based on the UIMA framework are composed of workflows of Analysis Engines, distributed software components with powerful search capabilities, which analyze the unstructured data and pre-annotations and assert new annotations in the CAS. The UIMA Prolog interface is a generic Analysis Engine which translates the UIMA Common Analysis Structure (CAS) into a Prolog format, over which, the

developers write rules and use top-down inference to search for new annotations over the unstructured data. In the following sections, we describe the encoding of the CAS into a Prolog Knowledge base and exemplify rules that exploit all the previous UIMA annotations (such as, the syntactic structure, semantic patterns) and external knowledge bases (such as, Prolog WordNet) to find new annotations in question analysis: focus detection, answer-type and other constraints detection, and shallow parsing. The QParse system is a question analysis application of the UIMA CAS Prolog interface. The QParse rules analyze questions in order to detect the focus (i.e., the segment of the question replaced by the answer) and the lexical answer-type (i.e., the minimum text span that identifies the semantic type of the correct answer, preserving singular vs. plural) using the syntactic structure (UIMA pre-annotations), lexical chains, the WordNet 3.0 external database [5]. An evaluation of QParse is compared with the previous version of the system (implemented without Prolog). This document is structured as follows. The interface from the UIMA CAS to Prolog is introduced first, followed by the format of the data-structure and the retrieval of the results from Prolog as annotations in the UIMA CAS. Implementation details (such as: the API, some coding guidelines for the Prolog-based annotators and some data structure design decisions) are not included, but the used can easily access them on the UIMA framework Web site [2]. In the rest of the paper, we describe a set of use cases: shallow semantics detection of relations and the QParse component for question analysis operations (i.e., the focus and answer-type detection. We also evaluate these components' results. 2 The UIMA Prolog Generic Interface The generic Prolog interface to UIMA is a generic analysis engine which can be specialized for particular tasks (such as, shallow parser, question analysis, and answer selection). This interface (see figure 1) translates the UIMA CAS into a knowledge base (see figure 2.a) dynamically asserted into the Prolog system, queries the unstructured data and the UIMA annotations found by analysis engines former to the current one in the workflow (see figure 2.b), and external Prolog knowledge bases (i.e., Prolog WordNet [5]) using a set of Horn clauses, and generates/inserts new annotations into the CAS. Once an annotation is found and added to the annotations in the CAS, it can used to generate new annotations in the current analysis engine (e.g., in the shallow semantic parser) or analysis engines following the current one in the workflow. The transformed CAS is asserted dynamically for each specialized annotator and each data source, multiple Prolog annotators being concomitantly present in the workflow independent from each other. Currently, we have multiple Prolog analysis engines in our question answering system, for instance, shallow semantic parsing annotator (semantic relation finder), co-reference finder, question analysis annotator (focus and lexical aswer-type), etc. In the following section, we will exemplify a specific Prolog interface annotator for question analysis.

Fig. 1. The UIMA Prolog Analysis Engine Architecture 2.a. Translation of the UIMA CAS into Prolog facts dynamically asserted into the Prolog engine 2.b. A relation "castof" detection rule for the shallow semantic parser Fig. 2. Example of UIMA Prolog interface for shallow semantic parsing 3 The QParse Analysis Engine The QParse analysis engine is a specialization of the generic UIMA Prolog interface used for natural language question-analysis. Its rules examine questions to detect the focus and the lexical answer type. The focus segment is the text span in a question that

is replaced by the correct answer, while the lexical answer-type is the minimum text span that identifies the semantic type of the correct answer and preserves the number of the answer. It is used by many downstream analysis engines in a UIMA workflow for question-answering: proved based answer-selection, answer scoring components that identify the syntactic/semantic relations that must hold between the answer and other terms in the question, etc. Qparse analysis engine has multiple inputs: the unstructured data input, the syntactic structure of the question (i.e., parsing pre-annotations), lexicalization preannotations (entity-name classifications), and uses external knowledge based (e.g., the Prolog WordNet 3.0). These inputs are transformed into Prolog facts and inserted into the Prolog engine (see figure 3). The Prolog engine applies a set of Horn clauses on this knowledge base and asserts new annotations in the UIMA architectures. Fig. 3. The QParse analysis engine architecture in the question-answering system QParse queries for annotations in the Prolog exploiting the syntactic structure in the questions. For instance, one pattern in the focus rule set searched for templates of "WHAT is the X " (see figure 4.a), one pattern in the answer-type detection set, uses a lookup table of verb-objecttype tuples (see figure 4.b second example). 3.a. Examples of QParse focus detection rules 3.b. Examples of QParse answer-type detection rules Fig. 4. Examples of QParse rules for question analysis

We evaluated the results of QParse on 412 manually annotated questions from the Text REtrieval Conference (TREC) and we got 370 correct matches (89.5%) and 17 super-types of the correct types (a better specialization was not found), while the rest of the annotations were wrong answers (either word sense disambiguation problems (for instance, we got the answer-type "types.facility" instead of "types.sportsteam" for the test "What Liverpool club spawned the Beatles?", the reason being that "types.facility" meaning has higher probability than the "types.sportsteam" meaning for the word "club" in the WordNet ontology) or parsing problems). 5 Conclusion The UIMA generic Prolog annotator allowed us to develop faster and easier pattern matching rules for natural language analysis in a language familiar to our developers and users (i.e., the Prolog language), the Prolog engine being transparent to the UIMA pipeline (i.e., completely integrated in the pipeline), while having access to state-ofthe-art semantics and proving effective on question analysis (i.e., time and results). We implemented interfaces for various rule systems: the UIMA-Sicstus Prolog interface (using the PrologBeans library) [6], the UIMA-SWI Prolog interface (using the JPL library) [7] and the UIMA-InterProlog translator for used by XSB [8] and Yap Prolog [9] systems (using the Interprolog library [10]). Our applications of this annotator include: complex rules for question analysis, shallow semantic parsing, and tools for development and testing UIMA analytics. References 1. Ferrucci, David, Lally, Adam, UIMA: an architectural approach to unstructured information processing in the corporate research environment, Natural Language Engineering 10 (3/4): 327--348 (2004). 2. The Apache UIMA framework, http://incubator.apache.org/uima. 3. Ferrucci, David, Lally, Adam, Building an example application with the Unstructured Information Management Architecture. IBM Systems Journal 43(3): 455--475 (2004). 4. Goetz, T., Suhre, O., Design and implementation of the UIMA Common Analysis System, IBM Systems Journal 43, No. 3, 490-515 (2004). 5. Fellbaum, Christiane, WordNet: An Electronic Lexical Database, The MIT Press (1998). 6. Sicstus Prolog, http://www.sics.se/isl/sicstuswww/site 7. SWI Prolog, http://www.swi-prolog.org 8. XSB Prolog, http://xsb.sourceforge.net 9. Yap Prolog, http://www.dcc.fc.up.pt/~vsc/yap 10. Calejo, Miguel, InterProlog: Towards a Declarative Embedding of Logic Programming in Java, Logics in Artificial Intelligence European Conference: 714-717 (2004).