Structured storage and retrieval of SGML documents using Grove




Information Processing and Management 36 (2000) 643-657
www.elsevier.com/locate/infoproman

Structured storage and retrieval of SGML documents using Grove

Hak-Gyoon Kim, Sung-Bae Cho*
Department of Computer Science, Yonsei University, 134 Shinchon-dong, Sudaemoon-ku, Seoul, 120-749, South Korea

Received 21 July 1999; accepted 6 December 1999

* Corresponding author. Tel.: +82-2-361-2720; fax: +82-2-365-2579. E-mail address: sbcho@csai.yonsei.ac.kr (S.B. Cho).
0306-4573/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.
PII: S0306-4573(99)00075-8

Abstract

SGML, standardized in ISO 8879 (International Organization for Standardization, 1986), has spread widely because it can provide various styles and transform documents across different platforms. An SGML document carries logical structure information in addition to its contents. As SGML documents become widely used, there is an increasing demand for a storage and retrieval system that uses the logical structure of documents efficiently. However, traditional retrieval systems based on document indexes cannot exploit the logical structure appropriately. In this paper, we have developed a document storage and retrieval system based on structure information, where the SGML document is transformed into Grove, which is the document model for DSSSL and HyTime, and stored at an element level by an object-oriented DBMS, Object Store. It supports structured documents and provides a query interface to retrieve information contained in the structures. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Storage and retrieval of structured documents; SGML; Grove; OODBMS; Object Store

1. Introduction

Traditional document retrieval systems operate by assigning index terms to the document; they view a document as a set of words and a collection as an unstructured set of documents. The ``documentary unit'', the appropriate target for indexing and retrieval, has long been an issue in indexing, although there is little literature on it. All A&I services make decisions as to the appropriate unit of indexing or analysis. The need for isolating particular documentary units may have been mitigated by full-text searching, where particular paragraphs or parts can be retrieved directly when they match a query well enough. Now we have the opportunity to identify and focus on structured parts, as structured documents in various styles, such as memos, electronic mail, manuals and official documents, become available. There is an increasing need to study the impact of this capability and to develop a storage and retrieval system that takes advantage of structural knowledge. Integrating structural and textual information can yield higher-quality retrieval results.

Standard Generalized Markup Language (SGML) provides a very powerful tool for describing document structure (SGML, 1986), and what seems to be required is a technique for representing such structures within a text retrieval system. Since SGML is a meta-language, it can define various markup languages, in which markup tags showing the structure of the document are inserted into its text. Based on SGML, HyperText Markup Language (HTML), Hypermedia/Time-based Structuring Language (HyTime), and Text Encoding Initiative (TEI) have been proposed as encoding standards.

SGML documents have information about document structure, as well as the contents. This additional markup information plays an important role in decomposing a document into logical units and designating a function for each logical unit. Because SGML documents are hierarchical models whose elements are connected in complex ways, several data models have been proposed; a model for SGML documents should satisfy the following requirements:

- A data model must support full addressability and manipulation of the SGML document.
- Differences in the data models between the SGML document and the database should be minimized.
- A data model should support the generic abstract data model of the SGML document.
- The system should be scalable with large data.

To satisfy these requirements, we have used the modified Graph Representation Of property ValuEs (Grove), the document model for Document Style Semantics and Specification Language (DSSSL) and HyTime. Because the generic classes that manage structural information in Grove can maintain the structural information and store the documents at an element level, it is possible to retrieve the structural information in element units and to support various document styles. Also, we have used an Object Oriented DataBase Management System (OODBMS) to store this data model without any loss of structural information.

We have separated the document definition model and the document instance model. The document instance is stored hierarchically according to the rules defined in the document definition model, which lets us represent the document instance compactly and improve the retrieval performance. The key to applying information retrieval techniques to SGML documents lies in using the document definition model to guide the retrieval process. We also provide a user interface that supports queries on structural information, as well as structure-based queries. This system is developed under Windows 98 with Visual C++ 5.0, and Object Store OODBMS is used for low level storage and retrieval.

2. Background

2.1. SGML

SGML, defined in ISO 8879:1986, is a standard for defining the structure of markup languages. Markup in documents identifies the individual components of a logical document, i.e., elements. SGML documents are largely composed of a declaration, a document-type definition, and a document instance (SGML, 1986).

2.1.1. Declaration

The declaration has information about character sets, the description of special characters, and tag omission. As the declaration is placed at the head of documents, it can preserve compatibility among different platforms.

2.1.2. Document-type definition

The allowable set of document structures for a particular collection of documents is defined by a document-type definition (DTD). It is essentially a grammar specifying the logical structure of documents of a certain type. It defines the element, entity and attribute as three major components for the structure of a document. An element represents the logical unit of the document, and marks up the document with tags. Each element contains subelements or model groups, which are collections of elements. An entity is a set of characters referenced by name. Special characters are coded with unique names, and can be applied to documents with those names. An attribute attached to an element holds information specific to that element. In addition to the three components, there are the processing instruction (PI), which represents the processing method of an element or entity, the notation, and the short reference map, which eases the markup. Fig. 1 shows an example of DTD, and the corresponding hierarchical structure of DTD is shown in Fig. 2.

2.1.3. Document Instance

Document Instance (DI) is the part containing the contents according to DTD.
Since each element contains corresponding contents, a document can be represented as a hierarchical tree.

2.2. Data models for SGML documents

It is difficult to store SGML documents into a target database according to DTD and to make appropriate index structures for fast processing (Christophides, Abiteboul, Cluet & Scholl, 1994; Sengupta & Dillon, 1997a). In this section, we investigate the database models of SGML in three categories: document-based; element-based; and object-based models.
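As a concrete reference point for the models compared in the following subsections, the hierarchical-tree view of a document instance mentioned in Section 2.1.3 can be sketched as below. This is only an illustration: the `Element` class, its methods and the sample memo are hypothetical and are not taken from the system described in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """One logical unit of a document instance (hypothetical class)."""
    name: str
    # Children are either nested elements or plain character data.
    children: List[Union["Element", str]] = field(default_factory=list)

    def text(self) -> str:
        """Concatenate all character data below this element, in document order."""
        parts = []
        for child in self.children:
            parts.append(child if isinstance(child, str) else child.text())
        return "".join(parts)

    def find_all(self, name: str) -> List["Element"]:
        """Return this element and every descendant whose name matches."""
        found = [self] if self.name == name else []
        for child in self.children:
            if isinstance(child, Element):
                found.extend(child.find_all(name))
        return found


# A small, made-up memo with a nested section.
doc = Element("memo", [
    Element("title", ["Grove storage"]),
    Element("section", [
        Element("title", ["Background"]),
        Element("para", ["SGML separates structure from content."]),
    ]),
])

# Structure lets a search be limited to <title> elements only.
titles = [t.text() for t in doc.find_all("title")]
```

A document-based model would see only the flat word list returned by `doc.text()`, whereas element-level storage keeps the information needed for `find_all("title")`.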

2.2.1. Document-based model

Traditional information retrieval systems take the simplest approach, in which documents are treated as lists of words. The retrieval systems extract index words from documents, and present documents according to the similarity computed between document indexes and user queries (Salton & McGill, 1983). However, this model ignores the structures in the document.

2.2.2. Element-based model

The element-based model divides a document into element units, and applies queries to them. It requires many tables and tuples to store element units based on document structure in a relational DBMS. This results in complex object models or extended relational models which support nesting (Desai, Goyal & Sadri, 1986) and reference (Macleod, 1990; Anick, Flynn & Hanssen, 1991; Davis, Kent, Ramamohanarao, Thom & Zobel, 1995b).

A relational approach defines an architecture that includes a text retrieval DBMS and a relational DBMS. This explicitly provides interoperability between the RDBMS and the text retrieval DBMS with access to the structure of the text. The database administrator should define a schema for each combination of relational data and SGML document types (Blake et al., 1994). Another approach is to map complex elements onto tables in a relational database with text indexing. Each document is converted into an internal tree representation. Since it requires a table for each element, there is an explosion in the number of tables and tuples required to capture sufficient structure information (Macleod, 1990).

Fig. 1. An example of document-type definition.

2.2.3. Object-based model

The VERSO project exploits the object-based system, O2, which requires some extensions to handle the aspects of text queries. Each definition of an element in DTD is interpreted as a class having a type, some constraints and a default behavior (Christophides et al., 1994). However, this approach either requires complex class definitions or sacrifices the facility of constraint checking for DTDs.

In the work of the Integrated Publication and Information System Institute (IPISI), DTDs are considered as document instances and thus can be rewritten as instances of a particular DTD, the so-called super-DTD. Therefore, they store element objects independent of particular DTDs. When a DTD is inserted into the database, it must be specified whether each element type is flat or not. Individual database objects do not represent flat element types (Aberer, Bohm & Huser, 1994; Bohm, Aberer & Klas, 1997).

2.2.4. Grove

Grove, the document model of Document Style Semantics and Specification Language (DSSSL) and HyTime, represents an abstract data structure of SGML. Because Grove can represent the complex data structures of SGML effectively and can be transformed back into SGML, it can be used to format information for DSSSL and to represent semantic information of HyTime (DSSSL, 1996; HyTime, 1997). Grove describes SGML with a corresponding property set, which categorizes classes by module and contains classes that represent whole SGML document information. Each node in the SGML property set is an instance of a pre-defined class and is composed of property names and values. Available data types of a property include Boolean, character string, and a nodal type that connects the nodes.

Fig. 2. Document-type definition represented by tree.

3. Storage for structured documents

To utilize the power of grammar in a database schema and minimize data segmentation, we have designed a storage system based on Grove. The procedure to store structured documents is as follows: the system parses SGML documents and creates the Grove structure, which is transformed into persistent data by a database creation module. This procedure is shown in Fig. 3.

3.1. Data model

We have used the SP parser developed by James Clark to parse SGML documents (Clark, 1999). Fig. 4 shows the parsing mechanism of our work. The SP parser has DTD and DI information in each Event class. This information can be obtained by the EventHandler class. To create the Grove structure, DTDGroveEventHandler and DIGroveEventHandler are inherited from EventHandler. The inherited EventHandler classes create a data model according to the Grove structure. Fig. 5 shows the SGMLDOC class, which is the highest class. The document type has the information about DTD, and the document element has the information about DI.

3.1.1. Document Type Definition

DTD plays a role in representing the documents without any loss of the structural information and determining the structure of DI.
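As a rough illustration of how a DTD determines which DI structures are admissible, the sketch below checks a small element tree against a table of permitted children. The `DTD` table, the tuple encoding of nodes and the function name `violations` are all hypothetical and far simpler than the classes used in this work; in particular, the sketch ignores the ordering and occurrence rules that real SGML content models express.

```python
from typing import Dict, List, Set, Tuple, Union

# A node is either character data (str) or a pair (element name, children).
Node = Union[str, Tuple[str, list]]

# Hypothetical, heavily simplified DTD: element type -> allowed children.
# "#PCDATA" marks element types that may contain character data.
DTD: Dict[str, Set[str]] = {
    "memo": {"title", "section"},
    "section": {"title", "para"},
    "title": {"#PCDATA"},
    "para": {"#PCDATA"},
}


def violations(node: Node, dtd: Dict[str, Set[str]], path: str = "") -> List[str]:
    """Return one message for every child that the DTD does not allow."""
    if isinstance(node, str):
        return []  # character data has no children to check
    name, children = node
    here = f"{path}/{name}"
    if name not in dtd:
        return [f"{here}: undeclared element type"]
    problems = []
    for child in children:
        kind = "#PCDATA" if isinstance(child, str) else child[0]
        if kind not in dtd[name]:
            problems.append(f"{here}: {kind} not allowed here")
        problems.extend(violations(child, dtd, here))
    return problems


good = ("memo", [("title", ["Grove"]), ("section", [("para", ["text"])])])
bad = ("memo", [("para", ["stray paragraph"])])
```

Calling `violations(good, DTD)` yields an empty list, while `violations(bad, DTD)` reports that a `para` directly under `memo` is not permitted.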
The document-type class manages the DTD and is composed of the Element type, Model group, Element token, PCDATA token, and Attribute definition classes. The Element-type class has the element information starting with ``<!ELEMENT'', which includes the content model, attribute definition, inclusion and exclusion. The Attribute definition class represents the attribute information of the element, and the Model group class, composed of the sub Model group classes and the set of terminal element token classes and PCDATA token classes, has the information of the connector between the model groups and the occurrence rules. The type in Fig. 6 means that the relationship between classes is dynamically bound at run time.

Fig. 3. Storage system for structured documents.

Fig. 4. Parsing SGML documents.

3.1.2. Document Instance

According to the DTD model, DI is stored hierarchically. It is composed of the attribute assign class, text class, and subelements. Fig. 7 shows the DI model. Each class in the class hierarchy has the attribute of the highest class node and the parent class node. Fig. 8 shows the relationship of the classes, where the relationship of the classes is dynamically bound.

Fig. 5. The hierarchy of the SGML document class.

Fig. 6. The hierarchy of Document-Type Definition classes.

Fig. 7. The hierarchy of Document Instance classes.

Fig. 8. The relationship of SGML document classes.

3.2. Storage of document structure

Whereas a traditional relational DBMS has difficulty in translating the documents into a table structure, an object-oriented DBMS can utilize the class structure by unifying the host programming language and the DBMS. A major advantage is that such systems do not require a separate Data Definition Language (DDL) or Data Manipulation Language (DML), such as SQL. In this paper, the data model made in the application program is applied to an object-oriented DBMS, Object Store, by using OODBMS characteristics. Since Object Store unifies the host programming language and database, the data model in the application program can be converted into a persistent model directly. Fig. 9 shows the process and the corresponding code for Object Store. To keep the persistence of the data model, the following process is required:

1. Transform the object into a persistent variable.
2. Transform the relationship between classes into reference.
3. Register the class types.
4. Transform the new operator into a persistent new operator.

Fig. 9. Development interface of Object Store.

Object Store provides the entry point to access class objects, but does not provide Extents that manage class objects. Therefore, we have to manage Extents in the application program.
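The idea of an application-managed Extent, as described at the end of Section 3.2, can be sketched as a registry that records every object created for each class. The sketch is an in-memory illustration only: the `Extent`, `ElementNode` and `TextNode` names are hypothetical, nothing here is persistent, and none of it is an Object Store API.

```python
from collections import defaultdict
from typing import Dict, List


class Extent:
    """Application-level registry of every object created for each class.

    Hypothetical helper standing in for the Extents that the application
    has to manage itself; it is not part of any database product.
    """

    def __init__(self) -> None:
        self._members: Dict[type, List[object]] = defaultdict(list)

    def register(self, obj: object) -> None:
        self._members[type(obj)].append(obj)

    def of(self, cls: type) -> List[object]:
        """Return all registered instances of exactly this class."""
        return list(self._members[cls])


EXTENT = Extent()


class ElementNode:
    def __init__(self, name: str, parent: "ElementNode | None" = None) -> None:
        self.name = name
        self.parent = parent
        EXTENT.register(self)  # every new element joins its class extent


class TextNode:
    def __init__(self, data: str, parent: ElementNode) -> None:
        self.data = data
        self.parent = parent
        EXTENT.register(self)


section = ElementNode("section")
title = ElementNode("title", parent=section)
TextNode("Grove", parent=title)
TextNode("unrelated", parent=ElementNode("para", parent=section))

# Scan the TextNode extent directly instead of walking every document tree.
hits = [t.parent.name for t in EXTENT.of(TextNode) if "Grove" in t.data]
```

Keeping one collection per class gives the application a single place to iterate over, for example, all text objects, which is the kind of access a relational table scan would otherwise provide.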

4. Search for structured documents

With the storage system developed, a user can give queries based on the structure using a user interface, which allows structure-based queries. These queries are submitted to the database, and the results are displayed to the user. Fig. 10 summarizes the entire process of the retrieval system.

Fig. 10. Retrieval system.

4.1. Query language for SGML

The query language in Object Store is based on the relationship between objects. Several query forms for structured documents are added to those of traditional retrieval systems (Davis, Moore & Zobel, 1995a; Sengupta & Dillon, 1997b).

4.1.1. Boolean query

This class of query, used by traditional search engines, identifies the documents that contain the query terms, and requires pattern matching.

Q1. Search the documents that have ``SGML'' and ``OODBMS''
1. Check all the text objects which have ``SGML'' and ``OODBMS''.

4.1.2. Query for structure information

Not only the content of the document, but also the structure information, is stored in the database. This makes additional types of query possible, based purely on structural characteristics of the documents.

Q2. Search parent element type of <section>

1. Search <section> among element type objects.
2. Retrieve parent element-type of applied object.

4.1.3. Structure-based query on restricted scope

Since the document is stored hierarchically, it is desirable to limit the scope of a query to arbitrary elements within a document. Queries in this class retrieve the whole documents that satisfy the constraints of the Boolean query.

Q3. Search the documents that have ``Grove'' in <title> and <section>
1. Search the element objects whose name is <title> and whose parent attribute is <section>.
2. Check whether the model group objects and text objects dependent on a selected element contain ``Grove'' or not.

4.1.4. Attribute-based query

This is the query type that searches attributes of an element, which are stored in the attribute assign class objects.

Q4. Search documents that contain ``sgml.gif'' in the attribute of <picture>
1. Retrieve the <picture> among element objects.
2. Search attribute assign objects that are dependent on the applied element object and have a file attribute.
3. Retrieve the value of the attribute assign objects that have ``sgml.gif''.

4.2. Query interface

The database contains the structure and the content information of SGML documents. There might be several methods to search the documents based on the structure. The following sections illustrate these search methods.

4.2.1. Query on document structure

A structure-based retrieval system should navigate the element structure from the highest to the lowest elements. Because the SGML documents contain the hierarchical structure information, they can be represented by a tree structure. Each node in the tree is an element-type object, and it has child nodes. Fig. 11 shows a screen shot of the structure information. The query on the document structure is processed by clicking the particular element in the tree. This figure shows the situation when the user views the <section> element structure.

Fig. 11. The interface for retrieving document structure.

4.2.2. Query on document content based on structure

Retrieving the content based on structure is different from traditional retrieval in that it restricts the search area to some parts of the documents. Therefore, the user provides queries on a particular element using the user interface. Fig. 12 shows an example. When the user submits particular keywords on the <title> element, the retrieval system presents the documents that have the pattern in Fig. 12. Fig. 13 shows an example of the search results. Users can query one particular element by double-clicking the corresponding item, which prompts a dialog box for query input. Then, the resulting document list is presented to the user. The result documents are produced by recombining logical units in the database.

Fig. 12. Search pattern.

Fig. 13. Retrieval based on the document structure.

4.2.3. Query on attribute

Elements can have attributes according to the pre-defined DTD. Searching for an element by attribute works in the same way as the structure-based content search. When the dialog box appears on a particular element, it presents a text box that requests attribute information. The search process uses the attribute assign class objects.

5. Concluding remarks

The proposed system decomposes the SGML document into logical units, which have been built according to the Grove structure that is the document model for DSSSL and HyTime, and stores and retrieves the logical units by using an OODBMS, Object Store. As the SGML document model with an RDBMS is based on the flat table structure, it requires many tables and tuples, as well as the overhead of transformation to represent the SGML document model. In this paper, we have used the OODBMS to overcome this limitation of the RDBMS. The system for storage and retrieval of SGML documents has the following characteristics:

- It presents a data model that is independent of any specific DTD. Data models can be divided into two kinds according to generality. A data model for an individual DTD can store structure information effectively, but it has the deficiency that a new data model must be applied whenever a DTD is inserted. On the other hand, a data model independent of DTD (Moore, Fuller, Lowe, Thom & Wilkinson, 1995) has the advantage that it can parse and analyze the DTD automatically and create the data model, but it is not easy to manage data effectively. This paper has proposed a general data model that uses the Grove structure to manage the data effectively.
- To propose an effective data model that maintains the structural information, we have separated the document definition model and the document instance model. A data model that combines the document definition model with the document instance model imposes excessive regulation on inserting documents, so the performance of storage and retrieval deteriorates. In this paper, the document instance is represented by a fully connected tree. Since this tree structure is similar to the document structure tree of the retrieval interface, it can improve the retrieval performance.
- As the documents must be stored according to DTD, an effective user interface is required. In this paper, users give queries with the structure information through a graphical user interface. Because the user interface presents DTD in the tree structure, users can provide queries at a particular region of the tree.
- It supports effective retrieval by using Extents in the classes frequently retrieved. Since it is difficult for an OODBMS to view tables, we have supported Extents to enable this feature.

Even though a prototype of the storage and retrieval system for structured documents has been developed in this paper, support for effective indexing methods for fast indexing, as in an RDBMS, remains to be added. Furthermore, the impact such retrieval might have on users should be investigated.

References

Aberer, K., Bohm, K., & Huser, C. (1994). The prospects of publishing using advanced database concepts. In Conference on Electronic Publishing.
Anick, P., Flynn, R., & Hanssen, D. (1991). Addressing the requirements of a dynamic corporate textual information base. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (pp. 163-172).
Blake, G. E., Consens, M. P., Kilpelainen, P., Larson, P. A., Snider, T., & Tompa, F. W. (1994). Text/relational data management systems: harmonizing SQL and SGML. In Proceedings of the Applications of Databases (pp. 267-280).
Bohm, K., Aberer, K., & Klas, W. (1997). Building a hybrid database application for structured documents. Multimedia Tools and Applications, pp. 275-300.
Christophides, V., Abiteboul, S., Cluet, S., & Scholl, M. (1994). From structured documents to novel query facilities. In Special Interest Group on Management of Data (SIGMOD) (pp. 313-324).
Clark, J. (1999). A Free, Object-oriented Toolkit for SGML Parsing and Entity Management, URL: http://www.jclark.com/sp.
Davis, R. S., Moore, T. A., & Zobel, J. (1995a). Database systems for structured documents. IEICE Transactions on Information and Systems, pp. 1335-1342.
Davis, R. S., Kent, A., Ramamohanarao, K., Thom, J., & Zobel, J. (1995b). Atlas: a nested relational database system for text application. IEEE Transactions on Knowledge and Data Engineering, 7(3), 454-470.
Desai, B., Goyal, P., & Sadri, S. (1986). A data model for use with formatted and textual data. Journal of the American Society for Information Science, 37(3), 158-165.
International Organization for Standardization (1986). Information processing - text and office systems - Standard Generalized Markup Language (SGML). ISO/IEC 8879.
International Organization for Standardization (1996). Information processing - Document Style Semantics and Specification Language (DSSSL). ISO/IEC 10179.
International Organization for Standardization (1997). Hypermedia/Time-based Structuring Language (HyTime). ISO/IEC 10744.
Macleod, I. (1990). Storage and retrieval of structured documents. Information Processing and Management, 26(2), 197-208.
Moore, T. A., Fuller, M., Lowe, B., Thom, J., & Wilkinson, R. (1995). The ELF data model and SGQL query language for structured document databases. In Proceedings of the Australasian Database Conference (pp. 17-26).
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. McGraw-Hill.
Sengupta, A., & Dillon, A. (1997a). Extending SGML to accommodate database functions: a methodological overview. Journal of the American Society for Information Science, pp. 629-637.
Sengupta, A., & Dillon, A. (1997b). Query by templates: a generalized approach for visual query formulation for text dominated databases. In Conference on Advanced Digital Libraries (pp. 36-47).