Data Integration Hub for a Hybrid Paper Search



Similar documents
Web Service Robust GridFTP

A Survey Study on Monitoring Service for Grid

Classic Grid Architecture

Event-based middleware services

A Web Services Framework for Collaboration and Audio/Videoconferencing

Managing XML Data to optimize Performance into Object-Relational Databases

RESEARCH ISSUES IN PEER-TO-PEER DATA MANAGEMENT

TOP-DOWN APPROACH PROCESS BUILT ON CONCEPTUAL DESIGN TO PHYSICAL DESIGN USING LIS, GCS SCHEMA

Databases in Organizations

Service Description: NIH GovTrip - NBS Web Service

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

XIII. Service Oriented Computing. Laurea Triennale in Informatica Corso di Ingegneria del Software I A.A. 2006/2007 Andrea Polini

DocuSign Connect Guide

ENABLING SEMANTIC SEARCH IN STRUCTURED P2P NETWORKS VIA DISTRIBUTED DATABASES AND WEB SERVICES

No Trade Secrets. Microsoft does not claim any trade secret rights in this documentation.

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Lightweight Data Integration using the WebComposition Data Grid Service

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Data Grids. Lidan Wang April 5, 2007

Modernize your NonStop COBOL Applications with XML Thunder September 29, 2009 Mike Bonham, TIC Software John Russell, Canam Software

Tier Architectures. Kathleen Durant CS 3200

Remote Sensitive Image Stations and Grid Services

STRATEGIES ON SOFTWARE INTEGRATION

Oracle Java CAPS Message Library for EDIFACT User's Guide

A collaborative platform for knowledge management

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

EDG Project: Database Management Services

How To Improve Performance In A Database

An Oracle White Paper October Oracle XML DB: Choosing the Best XMLType Storage Option for Your Use Case

Big Data With Hadoop

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

+ <xs:element name="productsubtype" type="xs:string" minoccurs="0"/>

keyon Luna SA Monitor Service Administration Guide 1 P a g e Version Autor Date Comment

A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS

The Direct Project. Implementation Guide for Direct Project Trust Bundle Distribution. Version March 2013

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Unified XML/relational storage March The IBM approach to unified XML/relational databases

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Module 1: Getting Started with Databases and Transact-SQL in SQL Server 2008

COM_2006_023_02.xsd <?xml version="1.0" encoding="utf-8"?> <xs:schema xmlns:xs=" elementformdefault="qualified">

Writing Queries Using Microsoft SQL Server 2008 Transact-SQL

A Semantic Marketplace of Peers Hosting Negotiating Intelligent Agents

An Oracle White Paper February Managing Unstructured Data with Oracle Database 11g

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

BUSINESSOBJECTS DATA INTEGRATOR

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Geodatabase Programming with SQL

The basic data mining algorithms introduced may be enhanced in a number of ways.

CA DLP. Stored Data Integration Guide. Release rd Edition

Advanced Information Management

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Universal Event Monitor for SOA Reference Guide

Foundations of Business Intelligence: Databases and Information Management

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609.

How To Create A Privacy Preserving And Dynamic Load Balancing System In A Distributed System

A P2P SERVICE DISCOVERY STRATEGY BASED ON CONTENT

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

DBMS / Business Intelligence, SQL Server

Gplus Adapter 8.0. for Siebel CRM. Developer s Guide

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS

A Peer-to-Peer Approach to Content Dissemination and Search in Collaborative Networks

Writing Queries Using Microsoft SQL Server 2008 Transact-SQL

Modern Databases. Database Systems Lecture 18 Natasha Alechina

Napster and Gnutella: a Comparison of two Popular Peer-to-Peer Protocols. Anthony J. Howe Supervisor: Dr. Mantis Cheng University of Victoria

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

NS DISCOVER 4.0 ADMINISTRATOR S GUIDE. July, Version 4.0

14 Databases. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:

The Role and uses of Peer-to-Peer in file-sharing. Computer Communication & Distributed Systems EDA 390

Bridge from Entity Relationship modeling to creating SQL databases, tables, & relations

P2P: centralized directory (Napster s Approach)

A Multidatabase System as 4-Tiered Client-Server Distributed Heterogeneous Database System

Principles of Distributed Database Systems

UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Distributed Databases in a Nutshell

PEER TO PEER FILE SHARING USING NETWORK CODING

Reference Architecture, Requirements, Gaps, Roles

A Pluggable Security Framework for Message Oriented Middleware

MySQL for Beginners Ed 3

Middleware support for the Internet of Things

Generating XML from Relational Tables using ORACLE. by Selim Mimaroglu Supervisor: Betty O NeilO

Multicast vs. P2P for content distribution

City Data Pipeline. A System for Making Open Data Useful for Cities. stefan.bischof@tuwien.ac.at

Combining Unstructured, Fully Structured and Semi-Structured Information in Semantic Wikis

Base One's Rich Client Architecture

Transcription:

Data Integration Hub for a Hybrid Paper Search Jungkee Kim 1,2, Geoffrey Fox 2, and Seong-Joon Yoo 3 1 Department of Computer Science, Florida State University, Tallahassee FL 32306, U.S.A., jungkkim@cs.fsu.edu, 2 Community Grids Laboratory, Indiana University, Bloomington IN 47404, U.S.A. 3 School of Computer Engineering, Sejong University, 98 Gunja-Dong, Gwangjin-gu, Seoul, 143-747, Korea Abstract. In this paper we describe the design of a hybrid search that combines simple metadata search with a traditional keyword search over unstructured context data. This paradigm provides the inquirer additional options to narrow the search with some semantic aspects through the XML metadata query. We demonstrate a paper search for a case study of the hybrid search, and describe a data integration hub to integrate those data dispersed on the Net. 1 Introduction To discover and share heterogeneous resources on the Net has been a long term challenge since computer communication networks were popularized. There are two traditional approaches to organizing the data to be searched one is structured data and the other is unstructured data. A Web search engine is a typical example of search on the Internet. Its technologies are rooted in information retrieval that represents search over the unstructured data. Web search engines provide clues for resource location, but they have no semantic schema and often produce meaningless keyword search results. The Semantic Web is a ambitious extension of the Web. It also includes multiple relation links with directed labeled graphs by which machines like Web crawlers can interpret the relationship between resources. Meanwhile the ordinary Web has a single relationship and a machine cannot infer further meaning. To represent the relations of the object on the Web, the object terms should be defined under a specific description-an ontology. Domain experts are usually needed to design an ontology due to the sophisticated definition required. Currently, many Web pages included no such semantic content, and no unified definition of general semantic agreement exists. Our hybrid keyword search aims to give an intermediate search paradigm on the Internet providing semantic value through XML metadata that are simpler than those of the Semantic Web. In this paper we describe our design of hybrid search systems. In earlier experiments, we had suffered from performance problems in a local level, and we proposed scalable hybrid search on distributed environments [4, 5]. In the architecture, a group of independent search providers share

their information through their own search systems. But some data providers, who possess small amounts of data, may join such group. They may not want to develop their own search services. Otherwise, participation of many nodes possessing small data in a group will increase the communication traffic and drop a chance to reach the target information under Time-to-Live (TTL) strategy. Partial integration may be one possible method to increase the data portion queried in the search group. In this paper we also present our architecture for data integration hub, which is an application communicating through a message broker with centralized control. This hub can act as a partial integrator on distributed databases or peer-to-peer systems. This paper is organized as follows. In the next section we describe one of our hybrid keyword search architectures. We present a data integration hub to integrate those papers in Section 3. In Section 4, we summarize and conclude. 2 Hybrid Keyword Search Our hybrid keyword search combines metadata search with a traditional keyword search over unstructured context data. Each chunk of unstructured data, usually represented by a file, has an assigned metadata. We use XML the de facto standard format for information exchange between machines as a metalanguage for the metadata. To demonstrate the practicality of the hybrid keyword search, we design and evaluate hybrid search systems based on a native XML database and a file system based text search library, as well as a market-leading relational database management system that integrates XML and text management. We have already introduced a relational database based implementation [3, 4]. It utilized an XML-enabled relational Database Management System (DBMS) with nested subqueries to implement the combination of query results against unstructured documents and semistructured metadata. The other implementation is based on a native XML database and a text search library. We use Apache Xindice [2] for XML instance repositories and XML query processing. Jakarta Lucene [1] is used to manage context query over unstructured data in our hybrid search. The query processing architecture is shown in figure 1. In the Xindice database, we associate an XML instance with an unstructured document by assigning the file name for the document as the key of the XML instance. For example, an XML instance in a file named pt1.xml with unstructured data in the file apaper.pdf by inserting the XML instance into the data collection as follows: xindice ad -c /a collection path -f pt1.xml -n apaper.pdf The assigned key the file name of unstructured document in the example as an attribute value on the root element of the XML created by executing an XPath query. For example the query result may start: <bibliography src:col="/a collection path " src:key="apaper.pdf " xmlns:src="http://xml.apache.org/xindice/query"> The key and result XML tuples are stored in a hash table. Another keyword search against unstructured documents returns names of documents which in-

Metadata in XML format Unstructured Data Document Name Unstructured Data Document Binary to Text Converter Xindice Storage Metadata Query Processing Query Hash Table (Key, XML) Text Document Lucene Indexing Text Query Processing List of matched documents Final Result Set (Key, Metadata) Fig. 1. Query Processing Architecture clude the given keyword in the text. The returned names are mapped in the hash table and the combined results are collected in a Java hash table. For efficient text search, Jakarta Lucene provides an index class along with a stop word filtering analyzer class. The analyzer filters out stop words articles and other words that are meaningless to the search. Some binary format files should be converted to pure text files before indexing. We pass the binary file name as well as the text file name as parameters to the index generating class in order to indicate the original document format. 2.1 Case Study: Hybrid Paper Search The initial demonstration of the hybrid search is in a simplified model a paper search. The paper search is a content search across various types of documents. Each document has metadata presented as an XML instance. An example XML instance is shown in figure 2. In this demonstration we use a relational DBMS instead of a native XML database. Two relational database tables are used for metadata and documents. For the XML instances representing the metadata, the XMLType of Oracle 9i [6] is used. A column with BFILE large object type is used for the external document table. Those large object rows are indexed using Oracle Text a text management system integrated into Oracle DBMS. Through an Oracle Text index, we can search the target content. The document table has a special attribute for a document type BINARY or TEXT. This attribute is necessary for the filtering option of Oracle Text. Oracle Text filters binary files to pure text instances before making an index. A one-to-one relationship set is used for relation between the paper document and metadata, but a one-to-many or many-to-many

<bibliography> <authors> <author>j. Kim</author> <author>o. Balsoy</author> <author>m. Pierce</author> <author>g. Fox</author> </authors> <title>design of a Hybrid Search in the OKC</title> <source>proceedings of the International Conference on IKS</source> <year>2002</year> </bibliography> Fig. 2. An Example XML Instance relationship set could be used for other applications. The relationship table is not necessary in one-to-one relationships, but it is essential to decompose relations to avoid anomalies in one-to-many or many-to-many relationship. With two data tables and a relationship table we can query keywords in the content, which can be associated with particular metadata through nested subqueries. For example, we can find documents with a keyword XML and published in 2002. Figure 3 shows the relational schema of our hybrid paper search. papers(papernd: string, descriptions: XMLType) paperfiles(filename: string, doctype: string, contents: BFILE) filelocator(papernd: string, filename: string) Fig. 3. Relational Schema of Hybrid Paper Search 3 Data Integration Hub for Hybrid Search In our earlier work [5] we assumed query clients only read resource in the other nodes. In this paper we take into account that clients may desire to be a data provider without providing an independent search service. Several or many data providers can share their information through a centralized database. They may upload, query, and download unstructured data with attached metadata via a central DBMS. Another aspect for the integration hub, which was not introduced in the local hybrid paper search, is the metadata validation against XML schema. The storage for the local hybrid paper search is managed by database administrators, but ordinary users can upload their own data to the database of the integration hub. We utilize an Oracle DB operation to check the validity of metadata presented in XML instances against a registered XML schema. The XML schema for our hybrid paper search is shown in figure 4.

<?xml version="1.0" encoding="utf-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/xmlschema" elementformdefault="qualified" attributeformdefault="unqualified"> <xs:element name="bibliography"> <xs:complextype> <xs:sequence> <xs:element name="authors"> <xs:complextype> <xs:sequence> <xs:element name="author" type="xs:string" maxoccurs="unbounded"/> </xs:sequence> </xs:complextype> </xs:element> <xs:element name="title" type="xs:string"/> <xs:element name="source" type="xs:string"/> <xs:element name="year" type="xs:decimal"/> </xs:sequence> </xs:complextype> </xs:element> </xs:schema> Fig. 4. An XML Schema Example for Hybrid Paper Search We demonstrate a data integration hub on the central DBMS using messageoriented middleware. This integration hub is an integrated version of the hybrid paper search introduced in section 2.1. The general architecture of a data integration hub is summarized in figure 5. Those clients can communicate with an integrant through a message broker NaradaBrokering [7]. This broker includes JMS compliant topic-based communication a publish/subscribe model. As in the read-only query case, clients are publishers and the integrant is a subscriber. Clients and the integrant use the same topic. The integrant manages uploading of the metadata and unstructured data, query wrapping, and file transfer. Requests are classified by the job property, which is attached to the message sent from client to integrant. A temporary topic is also delivered to the integrant. This temporary topic provides unique identification for the requester client, and the integrant can return back to the client by publishing the message on a dynamic virtual channel a temporary topic whose session object was attached to the message. We use different message types for each service request, as follows: Upload request: initially the client should request to upload to the integrant, before the unstructured data files are sent. The message for this request includes a file name of unstructured data, metadata, and unique client user name. They are string contents in a MapMessage type message. The integrant checks the validity of the metadata against XML schema as clients may provide no well-formed or valid data. The central database should avoid

Message Broker Integrant Client... Client... Client Database or File System Fig. 5. An Integration Hub Architecture redundancy for the unstructured data stored in a file system. The name and directory of the unstructured data, which is generated by combining the user name and the file name, is used for a primary key. File upload: after an upload request is granted by the integrant, the client can send a file for the unstructured data. For the file uploading, we use a ByteMessage type message, which is appropriate for sending a stream of uninterpreted bytes. The client sends a header message first, and the file is published buffer by buffer. A job property, a file name, a user name, and a temporary topic are included in the header message. A JMS message has a unique ID when it is created, and the ID is attached to each message. The integrant can classify those file upload messages by extracting the message ID. This mechanism is necessary because several file uploads can occur simultaneously. Query request: clients are publishers for a query topic, and the integrant subscribe on the same topic. A MapMessage type message, which includes query parameters and properties, is published to the integrant. The query results are delivered back to the inquiry client subscribed to a temporary topic. Download request: the target unstructured data in a file can be obtained from the query request. The client subsequently requests a file download with a MapMessage type message that includes the file name and a temporary topic. The listener on the client then captures the ByteMessage type message published from the integrant, and the target file is written message by message on the client machine. Each message includes the file name property. The message broker is responsible for preserving order of message transfer. The database schema of our data integration hub are similar to those in the hybrid paper search in figure 3, but there is an additional table for the file uploading for the unstructured data a temporary file locator table. Our system allows the file upload on a temporary directory only, and moves the

files to the designated directory later. When a user request a file upload and incidentally the same file name already exists, a new file name is assigned for the final destination by the integrant. The original file name is stored in the table for metadata, but the actual file name is stored in the unstructured data table. We assume that a user does not assign the same file name to two or more different unstructured data. A naming and directory for each row in the paper metadata table is generated from combining the unique user name and the file name, and it makes the naming and directory to a potential primary key. Each data integration hub has a message broker and an integrant. A group of data integration hubs may provide a global search over a distributed information system by using the cooperative network features built in to NaradaBrokering, or by using an additional network layer a peer-to-peer overlay network. 4 Conclusion In this paper we described our approaches to hybrid search at a local level with a case study of a hybrid paper search. Our demonstration showed the possibility of the hybrid search paradigm for a practical semantic integration. We had another case study of a data integration hub, which is an application communicating through the message broker with a centralized control. The integration hub can be a search service node in a distributed database, or a peer in a peer-topeer overlay network under more scalable environment [4, 5]. This scalable generalization may have a practical bridging role for information search providing semantic value through metadata whose implementation are simpler than those of the Semantic Web. References 1. Apache Software Foundation. Jakarta Lucene. World Wide Web. http://jakarta.apache.org/lucene/. 2. Apache Software Foundation. Xindice. World Wide Web. http://xml.apache.org/xindice/. 3. J. Kim, O. Balsoy, M. Pierce, and G. Fox. Design of a Hybrid Search in the Online Knowledge Center. In Proceedings of the IASTED International Conference on Information and Knowledge Sharing, November 2002. 4. J. Kim and G. Fox. A Hybrid Keyword Search across Peer-to-Peer Federated Databases. In Proceedings of East-European Conference on Advances in Databases and Information Systems (ADBIS), September 2004. 5. J. Kim and G. Fox. Scalable Hybrid Search on Distributed Databases. In Proceedings of International Workshop of Autonomic Distributed Data and Storage Systems Management, volume 3516 of Lecture Notes in Computer Science. Springer, May 2005. 6. Oracle Corporation. Oracle9i Application Developers Guide XML, June 2001. 7. S. Pallickara and G. C. Fox. NaradaBrokering: A Distributed Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids. In Proceedings of International Middleware Conference, June 2003.