Combining Data Integration and Information Extraction Techniques


Dean Williams (dean@dcs.bbk.ac.uk)
School of Computer Science and Information Systems, Birkbeck College, University of London

Abstract

We describe a class of applications which are built using databases comprising some structured data and some free text. Conventional database management systems have proved ineffective for these applications, and such applications are rarely suited to current text and data mining techniques. We argue that combining Information Extraction and Data Integration techniques is a promising direction for research, and we outline how our ESTEST system demonstrates this approach.

1. Introduction

A class of applications exists which can be characterised by the way in which they combine data conforming to a schema with some related free text. We describe this application class in Section 2. Our approach is to combine Data Integration (DI) and Information Extraction (IE) techniques to better exploit the text data; Section 3 summarises related areas of research and shows how our method relates to them. Section 4 details why we believe text is used in these applications and, as a result, why we believe combining DI and IE techniques will benefit them. Details of our system, Experimental Software to Extract Structure from Text (ESTEST), are given in Section 5, which shows how we plan to realise our goals. Finally, we give our conclusions and plans for future work in Section 6.

2. Partially Structured Data

In [1] King and Poulovassilis define a distinct category of data: partially structured data (PSD). Many database applications rely on storing significant amounts of data in the form of free text. Recent developments in database technology have improved the facilities available for storing large amounts of text; however, the provision for making use of this text data still relies largely on searching it for keywords.

A class of applications exists where the information to be stored consists partly of structured data conforming to a schema, with the remainder left as free text. We consider this data to be partially structured. This idea of PSD is distinct from semistructured data, which is generally taken to mean data that is self-describing: in semistructured data there may be no schema defined, but the data itself contains some structural information, e.g. XML tags.
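To make the notion of PSD concrete, the following minimal sketch shows what a partially structured record might look like in the Road Traffic Accident domain discussed below. The field names and values are illustrative assumptions only; they are not taken from the actual STATS19/STATS20 format.

```python
from dataclasses import dataclass

@dataclass
class AccidentReport:
    """A partially structured record: fixed schema fields plus free text.

    Field names are illustrative; the real STATS19/20 forms define
    their own structured attributes.
    """
    report_id: int          # structured: conforms to the schema
    date: str               # structured
    road_class: str         # structured
    vehicles_involved: int  # structured
    narrative: str          # unstructured: free-text account

report = AccidentReport(
    report_id=1042,
    date="2004-06-12",
    road_class="A-road",
    vehicles_involved=2,
    narrative="Vehicle 1 pulled out of the side road and collided with "
              "the cyclist, who was thrown onto the verge.",
)

# Conventional SQL queries reach the four structured fields easily, but
# the entities in `narrative` (the side road, the cyclist, the verge)
# remain invisible to the schema until something like IE extracts them.
```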

An example of an application based on the use of PSD is operational intelligence gathering, as used in serious crime investigations. The data collected in this application area takes the form of a report that contains some structured data, such as the name of the police officer making the report, the time and location of the incident, and details of the subjects and locations mentioned in the report. This is combined with the actual report of the sighting or information received, which is captured as text. A number of other text-based applications exist in crime, e.g. for witness statements and scene of crime reports. Other application domains we are familiar with that have partially structured data include Road Traffic Accident reports, where standard-format statistics are combined with free-text accounts written in a formalised subset of English. In bioinformatics, structured databases such as SWISS-PROT [2] include comment fields that contain related unstructured information.

A common theme of many of these applications, including crime and SWISS-PROT, is a requirement for expert users to annotate the text, trying to use standard terms in order to assist with queries, reduce duplication and highlight important facts. This is often a time-consuming, demanding task with results less effective than desired, and applications to assist with this work are being developed both as academic research projects, e.g. [3], and as commercial software, e.g. [4].

3. Related Areas

A number of active areas of research deal with text in databases, and we use the following definitions to establish how our approach relates to them:

Data Integration: providing a single schema over a collection of data sources that facilitates queries across the sources [5].

Information Extraction: finding pre-defined entities in text and using the extracted data to fill slots in a template, using shallow NLP techniques [6].

Data Mining / Knowledge Discovery in Databases: finding patterns in structured data, discovering new deep knowledge embedded in the data.

Text Mining: the application of data mining to text (often some NLP process creates a structured dataset from the text, which is then used for data mining [7]).

Graph-Based Data Models: current industry-standard databases are essentially record-based (e.g. the relational model or some form of object data model), and the schema must be determined in advance of populating the database. Graph-based data models offer finer semantic granularity and greater flexibility [8].

We are not proposing a text mining technique which finds patterns in very large collections of text, as in Nahm and Mooney [9], who combine IE with text mining. For many of the PSD applications we have described this is unlikely to be effective, since there are no very large static datasets to be mined (although there are exceptions, e.g. SWISS-PROT); rather, new query requirements arise over time, and extensions to the schema are required.

We propose an evolutionary system in which the user iterates through the following steps as new information sources and new query requirements arise. First, an initial integrated schema is built from a variety of sources, including structured data schemas, domain ontologies and natural language ontologies. Then information extraction rules are semi-automatically generated from this schema to be used as input to the IE processor. The data extracted from the text is added to the integrated schema and becomes available to answer queries. The schema may then be extended as new data sources are added or new schema elements are identified, and the process repeats. Figure 1 shows how the user will use the ESTEST system in this evolutionary manner. Because of the evolutionary approach, we suggest that a graphical workbench will be required for end-user use of ESTEST, and we intend to consider the requirements of such a workbench.

[Figure 1 (diagram): the evolutionary ESTEST cycle - Integrate Datasources, Create Data to assist the IE process, Information Extraction (IE), Integrate Results of IE, Enhance Schema and Query Global Schema, operating over a Global Schema and Extracted Data, with control flow and data flow indicated.]

Fig. 1. Evolutionary Use of the ESTEST System
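To make the control flow of Fig. 1 concrete, here is a minimal, self-contained sketch of the evolutionary cycle. Every name in it (the functions, the dictionary-based schema) is an invented placeholder for illustration; the real ESTEST and AutoMed interfaces are not shown here.

```python
# Toy sketch of the evolutionary cycle in Fig. 1.
# All names are invented placeholders, not the real ESTEST/AutoMed API.

def integrate(sources):
    """Step 1: build an integrated schema from schemas and ontologies."""
    return {"concepts": set().union(*sources), "extent": []}

def generate_ie_rules(schema):
    """Step 2: semi-automatically derive IE templates from the schema."""
    return list(schema["concepts"])  # one template per concept

def run_ie(templates, texts):
    """Step 3: fill the schema-derived templates from the free text."""
    return [(t, text) for t in templates for text in texts if t in text]

def merge(schema, extracted):
    """Step 4: add extracted data to the schema, ready for querying."""
    schema["extent"].extend(extracted)
    return schema

schema = integrate([{"officer", "location"}, {"vehicle"}])
for new_concepts in [set(), {"witness"}]:   # each pass: new requirements
    schema["concepts"] |= new_concepts      # step 5: extend the schema
    rules = generate_ie_rules(schema)
    schema = merge(schema, run_ie(rules, ["a witness saw the vehicle"]))

print(schema["extent"])  # 'vehicle' matches twice, 'witness' once
```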

4. Combining Data Integration and Information Extraction

We believe that the data collected in the form of free text is important to PSD applications and is not stored as text because it is of secondary value; rather, there are two main reasons for storing data as text in PSD applications:

- It is not possible to know in advance all of the queries that will be required in the future. The text captured represents an intuitive attempt by the user to provide all information that could possibly be relevant. The Road Traffic Accident reports are a good example of this: the schema of the structured part of the data covers all currently known requirements, in a format known as STATS20 [10], and the text part is used when new reporting requirements arise.

- Data is captured as text because of the difficulty of dynamically extending a schema in a conventional DBMS, where simply adding a column to an existing table can be a major task in a production system. For example, in systems storing witness statements in crime reports, as entities and relationships are mentioned for the first time it is not possible to dynamically expand the underlying schema, and so the new information is stored only in its text form. Furthermore, the real-world entities and relationships described in the text are related to the entities in the structured part of the data.

An application combining IE and Data Integration will provide advantages in these applications for a number of reasons. Information Extraction is based on the idea of filling pre-defined templates, and Data Integration can provide a global schema to be used as a template: combining the schema of the structured data with ontologies and other metadata sources can create the global schema / template. Metadata from the data sources can be used to assist the IE process by semi-automatically creating the required input to the IE modules. Data Integration systems which use a low-level graph-based common data model (e.g. AutoMed [11]) are able to extend the schema as new entities become known, without the overhead associated with conventional DBMS, because they are not based on record structures such as the tables of relational databases. Finally, the templates filled by the IE process provide a new data source to be added to the global schema, supporting new queries which could not previously be answered.
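To see why a graph-based model sidesteps the schema-evolution overhead just described, consider this minimal sketch of a node-and-edge store in the spirit of, but far simpler than, the HDM described in Section 5. The class and its API are invented for illustration and are not AutoMed's.

```python
# A toy graph store in the spirit of a low-level graph data model such
# as the HDM; this API is invented for illustration, not AutoMed's.

class GraphStore:
    def __init__(self):
        self.nodes = set()   # schema concepts and instance values alike
        self.edges = set()   # (label, source, target) triples

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, label, source, target):
        self.nodes.update((source, target))
        self.edges.add((label, source, target))

store = GraphStore()
store.add_node("person")

# A new entity type turns up in the text for the first time. In a
# relational schema this would mean ALTER TABLE or new tables; here it
# is just new nodes and edges, added while the database is live.
store.add_node("vehicle")
store.add_edge("drives", "person", "vehicle")
store.add_edge("instance_of", "p1:John Smith", "person")
store.add_edge("instance_of", "v1:red Ford", "vehicle")
store.add_edge("drives", "p1:John Smith", "v1:red Ford")
```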

5. The ESTEST System

Our ESTEST system makes use of the AutoMed heterogeneous data integration system being developed at Birkbeck and Imperial Colleges [12]. In data integration systems, several data sources, each with an associated local schema, are integrated to form a single virtual database with an associated global schema. If the data sources conform to different data models, these need to be transformed into a common data model as part of the integration process. The AutoMed system uses a low-level graph-based data model, the HDM, as its common data model; this is suitable for incremental growth of a global schema as new requirements arise. We have developed an AutoMed HDM data store [13] to hold instance data and intermediate results for ESTEST. AutoMed implements bi-directional schema transformation pathways to transform and integrate heterogeneous schemas [11], a flexible approach amenable to including new domain knowledge dynamically.

In summary, the ESTEST system works as follows. The data sources are first identified and integrated into a single global schema. In AutoMed, each data model that can be integrated is defined in terms of the HDM, and each construct in an external data model has an associated set of HDM nodes and edges. In ESTEST, some features of the data models must be preserved across all the integrated data sources: an IS-A concept hierarchy, support for attributes, the identification of text data to be mined, and the ability to attach word forms to concepts. To facilitate the automatic creation of the global schema, all the data sources used by ESTEST are transformed into an ESTEST data model, so each construct in an external model also has a set of transformations mapping it onto the ESTEST data model. Once all the data sources have been transformed into this standard representation and mappings between schema elements have been obtained, it is possible to integrate the schemas.

ESTEST then takes the metadata in the global schema and uses it to suggest input to the IE process. The user confirms, corrects and extends this configuration data, and the IE process is run. We make use of the GATE [14] IE architecture to build the ESTEST IE processor. As well as reusing standard IE components such as named entity gazetteers, sentence splitters and pattern-matching grammars (with configuration inputs semi-automatically created by ESTEST), a number of new IE components are being developed:

TemplateFromSchema: takes an ESTEST global schema, creates the templates to be filled by the IE engine, and creates input for the standard IE components.

NE-DB: named entity recognition in IE is typically driven by flat-file lists; the NE-DB component will instead associate a query on the global schema with an annotation type. A list of word forms will be materialised in the HDM store for use while the IE process is running (GATE NE gazetteers generate finite state machines for the possible transitions of tokens).

WordForm: for a given concept, obtains relevant word forms from the WordNet natural language ontology. More words can be generated by increasing the number of traversals allowed through the WordNet hierarchy, ordered by an approximation of semantic distance.

The templates filled by the IE process are then used to add to the extent of the corresponding concepts in the global schema: extracted annotations which match objects in the global schema are placed in the HDM store. The global query facilities of AutoMed then become available to the user, who can query the global schema using the IQL query language [15, 16]. For more detailed information on the design of the ESTEST system we refer the reader to [17], and for an example of its operation in the Road Traffic Accident domain to [18].

Recent work within the Tristarp group [19] has resulted in advanced visualisation tools for graph-based databases becoming available [20], which may be of assistance in the proposed user workbench. This research interest is also reflected in recent industry products: the Sentences [21] DBMS from Lazysoft is based on a quadruple store and sets out to challenge the dominance of the relational model.
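As an illustration of the kind of WordNet traversal the WordForm component describes, here is a small sketch using NLTK's WordNet interface. The breadth-first traversal depth is our own crude stand-in for semantic distance; this is an illustrative approximation, not the actual ESTEST code.

```python
# Sketch of WordNet-based word-form expansion for a concept, in the
# spirit of the WordForm component above. Requires the NLTK WordNet
# corpus (pip install nltk; then nltk.download('wordnet')).
from collections import deque
from nltk.corpus import wordnet as wn

def word_forms(concept, max_depth=1):
    """Collect lemmas for `concept` plus those of hyponym/hypernym
    synsets up to `max_depth` traversals away, ordered by traversal
    depth as an approximation of semantic distance."""
    forms, seen = [], set()
    queue = deque((s, 0) for s in wn.synsets(concept))
    while queue:
        synset, depth = queue.popleft()
        if synset in seen or depth > max_depth:
            continue
        seen.add(synset)
        for lemma in synset.lemma_names():
            word = lemma.replace("_", " ")
            if word not in forms:
                forms.append(word)
        # Neighbouring synsets are one traversal further away.
        for neighbour in synset.hyponyms() + synset.hypernyms():
            queue.append((neighbour, depth + 1))
    return forms

# e.g. word_forms("vehicle", max_depth=1) yields "vehicle" plus nearby
# terms such as "craft" or "rocket"; raising max_depth admits more words.
```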

6. Conclusions and Future Work

We have discussed how a class of applications based on partially structured data is not adequately supported by current database and data mining techniques, and we have set out why we believe that combining Information Extraction and Data Integration techniques is a promising direction for research. We are currently completing an initial implementation of the ESTEST system, which we will test in the Road Traffic Accident reporting and Crime Investigation domains. ESTEST extends the facilities offered by data integration systems by moving towards handling text, and extends IE systems by attempting to use schema information to semi-automatically configure the IE process.

References

1. P.J.H. King and A. Poulovassilis. Enhancing database technology to better manage and exploit partially structured data. Technical report, Birkbeck College, University of London, 2000.
2. A. Bairoch, B. Boeckmann, S. Ferro, and E. Gasteiger. Swiss-Prot: juggling between evolution and stability. Brief. Bioinform., 5:39-55, 2000.
3. SOCIS: Scene of Crime Information System. http://www.computing.surrey.ac.uk/ai/socis/.
4. QUENZA. http://www.xanalys.com/quenza.html.
5. A.Y. Halevy. Data integration: a status report. In G. Weikum, H. Schöning, and E. Rahm, editors, BTW, volume 26 of LNI, pages 24-29. GI, 2003.
6. D. Appelt. An introduction to information extraction. Artificial Intelligence Communications, 1999.
7. A.H. Tan. Text mining: the state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pages 65-70, 1999.
8. W. Kent. Limitations of record-based information models. ACM Transactions on Database Systems, 4(1):107-131, March 1979.
9. U.Y. Nahm and R. Mooney. Using information extraction to aid the discovery of prediction rules from text. In Proceedings of the KDD-2000 Workshop on Text Mining, pages 51-58, 2000.
10. UK Government Department for Transport. Instructions for the completion of road accident report form STATS19. http://www.dft.gov.uk/stellent/groups/dft_transstats/documents/page/dft_transstats_505596.pdf.
11. P.J. McBrien and A. Poulovassilis. Data integration by bi-directional schema transformation rules. In Proc. ICDE'03, 2003.
12. AutoMed Project. http://www.doc.ic.ac.uk/automed/.
13. D. Williams. The AutoMed HDM data store. Technical report, AutoMed Project, 2003.
14. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. Experience with a language engineering architecture: three years of GATE. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), 2002.
15. A. Poulovassilis. The AutoMed Intermediate Query Language. Technical report, AutoMed Project, 2001.
16. E. Jasper. Global query processing in the AutoMed heterogeneous database environment. In Proc. BNCOD'02, LNCS 2405, pages 46-49, 2002.
17. D. Williams and A. Poulovassilis. Combining data integration with natural language technology for the semantic web. In Proc. Workshop on Human Language Technology for the Semantic Web and Web Services, at ISWC'03, page TBC, 2003.
18. D. Williams and A. Poulovassilis. An example of the ESTEST approach to combining unstructured text and structured data. In DEXA Workshops, pages 191-195. IEEE Computer Society, 2004.
19. Tristarp Project. http://www.dcs.bbk.ac.uk/tristarp.
20. M.N. Smith and P.J.H. King. Database support for exploring criminal networks. In Intelligence and Security Informatics: First NSF/NIJ Symposium, 2003.
21. Lazysoft (maker of Sentences). http://www.lazysoft.com/index.html.