A Workbench for Prototyping XML Data Exchange (extended abstract)

Renzo Orsini and Augusto Celentano
Università Ca' Foscari di Venezia, Dipartimento di Informatica
via Torino 155, 30172 Mestre (VE), Italy
{orsini,auce}@dsi.unive.it

1 Introduction

This paper describes a software prototype that is the outcome of research carried out at the Ca' Foscari University of Venice within the Data-X project (see footnote 1). The software is a workbench for data engineers. It integrates several tools that assist the user in all the tasks of integrating and exchanging data in a standard format, for instance for application integration, generation of content-rich web portals, and building of virtual information systems.

The system described here is a first step towards the construction of a data hub. By this term we mean a comprehensive tool which, like a network hub device, connects several information sources and consumers through standard ports, allows the designer to interconnect these ports dynamically in different ways, and performs the appropriate translations. Such a tool would be of great help for the tasks mentioned above, and its existence is made possible by the wide acceptance that XML now enjoys.

Data exchange on the WWW has received a lot of attention owing to the rapid spread of XML, proposed by the W3 Consortium as a standard for describing information. The flexibility and generality of the XML markup mechanism, together with its platform independence, allow its use in many contexts as a language for describing data of any kind, not only documents to be published on the WWW. What is missing is either a tool or a linguistic level to denote the meaning and type of data, since the XML markup mechanism only denotes the logical structure of data. Currently, XML allows the description of neither the semantics nor the internal representation of data with respect to applications.
The DTD (Document Type Definition) is of little help in addressing semantic issues, because it only describes the structural scheme by which the parts of a document are composed. The internal representation of data cannot be exchanged directly, since XML is text-based and platform-independent; data must therefore be translated into a standard alphanumeric encoding by other means. When data are exchanged between different data sources, or between data sources and applications, through an XML-based mechanism, this omission may limit the possibility of verifying data coherency. The limitation concerns both the formal aspects (type) and the semantic aspects (meaning) of data, and it is a major obstacle in many application areas.

(1) The details of the project Data-X: Management, Transformation and Exchange of Data in a Web Environment can be read at the URL http://www.difa.unibas.it/datax/

2 A workbench for rapid prototyping

We propose a development environment for a data engineer, which helps her/him to:

1. map relational database schemas into XML DTDs, and vice versa, trying to match different schemas as closely as possible, even when some elements are missing;
2. transfer data in both directions between a relational DB on one side and an XML document on the other, preserving the data structure as far as possible according to the designer's intent;
3. generate programs and DTDs for executing and validating data exchange;
4. combine and integrate such programs to build complex systems and Web-based data-oriented applications.

The system should have a visual interface for defining the mapping between the different data producers and consumers, so that it stays simple even for users who are not skilled in the programming techniques normally needed for format and data conversion. A visual environment can remove the need to learn the languages used for queries and data restructuring.

Figure 1 shows the overall architecture of the current workbench prototype. As a workbench, it is a collection of tools which, while being coordinated, retain their individual functionality and style of use. An effort has been made to minimize the differences in the interaction with each tool by building a common interface. The workbench generalizes and integrates some functions of the different tools, such as window management, file management and database connections. However, the tools have different goals and come from different experiences, and tools developed outside the project are also candidates for integration. Moreover, since the workbench is oriented towards data exchange, the tools are mainly focused on data-centric XML documents.
The functions of such a workbench should allow a user, through a visual interaction style, to:

1. match relational schemas and XML DTDs against libraries of schemas and templates (via the DTDMatch tool);
2. generate XML data from complex queries on relational databases (via the DBtoXML tool);
3. store data of XML documents into relational databases (via the XMLtoDB tool);
4. query information about database schema and metadata (via the InfoDB tool);
5. generate Java code to perform the data exchange at application level (via the DBtoXML tool);
6. transform XML documents using languages like XSLT or graphical tools for mapping structures and contents.

[Figure 1. The architecture of the workbench prototype. Labels in the diagram: VisualSQL-X query, Java program, DBtoXML, Relational DB, DTDMatch, XML documents, InfoDB, XMLtoDB, mapping.]

Not all the functions are currently integrated into the workbench: some are external programs, and a few are still under development. Nevertheless, the workbench provides a useful environment for prototyping data exchange applications, both between data sources and destinations and between data sources and applications. The prototype is tuned to the InterBase 6 DBMS [1], but its JDBC interface allows fast porting to other systems.

3 DTD analysis and schema matching

The DTDMatch tool addresses the problem of matching a set of XML DTDs against a view schema drawn from a relational database. Its goal is to evaluate their similarity as a preliminary step in translating relational tables into XML documents. Using a set of similarity parameters, it automatically computes the correspondence between a relational schema and a set of DTDs taken from a library, and returns a ranked list of DTDs from which the user can select the most appropriate one.

This approach is intended to support the design of data integration with predefined (e.g., standard) document schemas. We expect that, unless relational views and DTDs come from a coordinated design, only partial matches between the two structures can be obtained. Some data belonging to the DB view will not be considered by the XML document schema, and the XML schema could require (or accept as an option) data which are not part of the relational DB view.
Like the other tools of the workbench, DTDMatch is oriented to prototyping and reuse; a system for evaluating the degree of correspondence between database data and document schemas can help to develop prototypes at low cost and with great flexibility. Details of the matching operations and of the underlying data model are in [6].

DTDMatch is based on the tree-structured model of XML documents and DTDs described in [3]. An XML document is modeled as a loto (labeled ordered tree object). Nodes correspond to XML elements, and their labels provide the type names of the elements. A loto type definition (ltd) models DTDs in a similar way. The problem of matching a relational view schema to a set of DTDs is translated into the problem of building ltds from relational views and from DTDs, and comparing them by computing a similarity measure. The similarity is defined not only by a numeric value denoting the degree of structural correspondence, but also by the list of the corresponding nodes in the two ltds.

Two ltds are similar to the degree that they represent equivalent information, from both the structural and the conceptual viewpoint. From the structural viewpoint, the correspondence between nodes and subtrees of the compared ltds must cover as much as possible of the two structures. From the conceptual viewpoint, types and labels in the nodes must correspond (to some extent). A thesaurus defines weighted synonymy among names, while type compatibility is defined a priori.

Figure 2 shows one of the panels of the tool, displaying the results of a match.

[Figure 2. A panel showing results of DTD matching.]

DTDMatch allows a user to:

- define and execute SQL queries and build the ltd which represents the schema of the result;
- build a library of XML DTDs, imported or converted from SQL query results, and edit them;
- build and maintain a thesaurus storing synonymous names;
- edit the coefficients which bias the similarity computation;
- compute the similarity between the ltd corresponding to the input SQL query and the ltds selected from a library;
- select one DTD and build an XML document with the data returned by the SQL query.

Our experiments show that the ranking proposed by the matching algorithm is plausible as long as the DTDs do not differ seriously from the query schema. The structural similarity between the ltds is based on the relative position of nodes and leaves with respect to their ancestors and siblings. In this way the tool models the requirement that a rich structure, organized along different aggregation levels, should be preserved in the XML translation, and conversely that a simple structure should not grow artificially.
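
The idea of ranking library schemas by a similarity that mixes label correspondence (through a weighted thesaurus) with structural coverage can be illustrated with a minimal sketch. This is not the DTDMatch algorithm, which is described in [6]: the `Node` class, the greedy pairing of children, the thesaurus entries and the single coefficient `w_label` are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A node of a simplified schema tree (a stand-in for an ltd node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical thesaurus: weighted synonymy between names, in [0, 1].
THESAURUS = {
    frozenset({"client", "customer"}): 0.9,
    frozenset({"name", "title"}): 0.5,
}


def label_sim(a: str, b: str) -> float:
    """1.0 for equal names, the thesaurus weight for synonyms, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    return THESAURUS.get(frozenset({a, b}), 0.0)


def tree_sim(x: Node, y: Node, w_label: float = 0.5) -> float:
    """Mix label similarity with the coverage of the children (assumed formula)."""
    own = label_sim(x.label, y.label)
    if not x.children and not y.children:
        return own
    remaining = list(y.children)
    total = 0.0
    for cx in x.children:
        if not remaining:
            break
        # Greedy choice: pair each child of x with its best unused child of y.
        best = max(remaining, key=lambda cy: tree_sim(cx, cy, w_label))
        score = tree_sim(cx, best, w_label)
        if score > 0:
            total += score
            remaining.remove(best)
    # Unmatched children on either side lower the coverage.
    coverage = total / max(len(x.children), len(y.children))
    return w_label * own + (1 - w_label) * coverage


def rank(view: Node, library: Dict[str, Node]) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs, most similar schema first."""
    scored = [(name, tree_sim(view, ltd)) for name, ltd in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative data: one view schema and a two-entry library.
view = Node("client", [Node("name"), Node("country")])
library = {
    "books": Node("book", [Node("title"), Node("isbn")]),
    "customers": Node("customer", [Node("name"), Node("country")]),
}
ranking = rank(view, library)
```

In this sketch a partial match still receives a non-zero score, which mirrors the expectation stated above that views and DTDs not designed together will only match partially.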

4 DB to XML translation and data exchange

The DBtoXML tool has been built on top of Visual SQL-X [5, 7], a visual system which assists the user in querying a relational database to produce XML documents of arbitrary complexity. Unlike the DTDMatch tool, it is not based on the automatic selection or generation of the document schema. Rather, it is a fully interactive tool that allows a user to specify step by step the structure of the query and of the corresponding XML document.

The queries are expressed in SQL-X [4], an extension of SQL which, in a style reminiscent of report generation languages, allows the extraction of trees of data from a relational database as XML documents. The tool hides the language syntax from the user by providing a graphical interface in which a query is constructed as a tree that reflects the structure of the expected result. An example of a query is shown in Figure 3.

The query tree models the ltd of the expected result, and its nodes can be of one of the following kinds:

- <Root> is the tree root and represents the whole document, containing a set of elements corresponding either to tuples or to groups of tuples extracted from the database.
- <Rel> represents a database relation (obtained in general through an SQL query), whose tuples are converted into elements of its immediate container.
- <Att> (child of <Rel>) represents a column, a value of which is used as an element of its container.
- <Nest> represents the tuples of a relation which are associated, through a join operation, with a tuple of its container, and which will become roots of subtrees.
- <Group> represents the grouping of the tuples of a child node by some expression: each group is an element containing the tuples of the group as elements.

For each node, the user can specify all the parameters of the corresponding operation.
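
The node kinds listed above can be pictured as a small tree data structure. The sketch below is only an illustration, not the internal representation used by Visual SQL-X: the `QueryNode` class, its `name` and `expr` fields, and the table and column names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# The five node kinds named in the text.
KINDS = {"Root", "Rel", "Att", "Nest", "Group"}


@dataclass
class QueryNode:
    kind: str        # one of KINDS
    name: str = ""   # element name in the output document (assumed field)
    expr: str = ""   # relation, column, join condition or grouping expression
    children: List["QueryNode"] = field(default_factory=list)


def outline(node: QueryNode, depth: int = 0) -> List[str]:
    """Return an indented textual outline of the query tree."""
    if node.kind not in KINDS:
        raise ValueError(f"unknown node kind: {node.kind}")
    text = f"<{node.kind}> {node.name}".rstrip()
    if node.expr:
        text += f" [{node.expr}]"
    lines = ["  " * depth + text]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical query: clients grouped by country, each with its orders nested.
query = QueryNode("Root", "clients", children=[
    QueryNode("Group", "country", "Clients.Country", [
        QueryNode("Rel", "client", "Clients", [
            QueryNode("Att", "name", "Clients.Name"),
            QueryNode("Nest", "order", "Clients.Code = Orders.ClientCode", [
                QueryNode("Att", "product", "Orders.ProductCode"),
                QueryNode("Att", "date", "Orders.Date"),
            ]),
        ]),
    ]),
])

lines = outline(query)
```

Calling `print("\n".join(lines))` shows the tree one node per line, with the nesting depth given by the indentation.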
Besides these kinds of nodes, the user can specify whether tuple fields are converted to attributes instead of elements, the ordering of sequences, and other details of the conversion process.

The main panel of the tool is the query editor (Figure 3), which allows the construction of the query tree by selecting a node and then applying an operator. In this case, the tree represents a query which returns a set of clients. They are grouped by country, and each client contains an element with its name and another one with the sequence of the product codes and dates of its orders.

[Figure 3. The query editor in DBtoXML.]

When a node is selected, the right panel shows the associated information, which depends on the kind of node: for <Root>, the ordering of its elements; for <Rel>, the fields which become the element's attributes; for <Nest>, the join condition, the ordering of its elements, and possibly other conditions on the tuples corresponding to the subelements; for <Group>, the grouping condition (e.g. the field Country of Clients), an ordering for its subelements, and the attributes of the group element. For instance, the definition of the node corresponding to the set of orders for each client is shown in Figure 4a. The system helps the user by providing a set of panels for composing conditions and other expressions (e.g. with aggregation functions) during the construction of the tree.

In the prototype, the user can follow, through a set of panels, all the phases of the evaluation of the query: the conversion into SQL and the resulting relation, the tree which is the result of the data extraction, and the final document. For instance, the query of the example is translated into the following SQL query:

    SELECT Clients.Name, Orders.ProductCode, Orders.Date
    FROM Clients, Orders
    WHERE (Clients.Code = Orders.ClientCode)

while Figure 4b shows the loto of the resulting XML document (only a few elements are expanded). In any phase of the evaluation, the user can go back and change the query definition, for instance to experiment with different grouping and nesting strategies. For lack of space, the final document and its DTD are not shown here.

The approach taken in translating the query into SQL is to collect all the necessary data into a single relation, which is then read only once to produce the resulting XML document.
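
The single-pass strategy can be sketched as follows: a flat joined relation, already sorted on the grouping and nesting keys, is read once and folded into a nested XML tree. This is only an illustration and not the code generated by the workbench. The sample rows, the extra Country column, the requirement that rows arrive sorted, and the element names are all assumptions.

```python
import xml.etree.ElementTree as ET
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of the single joined relation, already sorted by
# (country, client name). The Country column is an assumption added here
# so that the grouping step has something to group on.
ROWS = [
    ("France", "Dupont", "P7", "2001-03-02"),
    ("Italy", "Bianchi", "P1", "2001-01-15"),
    ("Italy", "Bianchi", "P3", "2001-02-20"),
    ("Italy", "Rossi", "P1", "2001-02-01"),
]


def rows_to_xml(rows):
    """Read the sorted relation once and build the nested document tree."""
    root = ET.Element("clients")
    # Grouping step: one <country> element per distinct country value.
    for country, by_country in groupby(rows, key=itemgetter(0)):
        group = ET.SubElement(root, "country", name=country)
        # One <client> element per client inside the group.
        for name, by_client in groupby(by_country, key=itemgetter(1)):
            client = ET.SubElement(group, "client")
            ET.SubElement(client, "name").text = name
            orders = ET.SubElement(client, "orders")
            # Nesting step: the joined order tuples become subelements.
            for _, _, product, date in by_client:
                order = ET.SubElement(orders, "order")
                ET.SubElement(order, "product").text = product
                ET.SubElement(order, "date").text = date
    return root


doc = rows_to_xml(ROWS)
xml_text = ET.tostring(doc, encoding="unicode")
```

Because `itertools.groupby` only merges adjacent rows, the sketch depends on the relation being sorted on the same keys used for grouping and nesting; an `ORDER BY` clause in the generated SQL would be one way to guarantee this.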
The Clients/Orders query shown earlier is in effect a very simple one: all the work is done in the final phase, which traverses the query definition tree and generates an intermediate tree containing all the data (the loto tree), which can be directly mapped to the DOM representation of the document.

5 XML to DB translation

The XMLtoDB tool has been developed to produce, through a graphical interface, a mapping from data-centric XML documents to relational databases. The basic ideas which govern the mapping are very simple:

- a text-only element (leaf) can map to a column;
- a non-leaf element can map to a relation;
- if an element B is nested inside an element A (its immediate ancestor), then B can be mapped to a column of the table mapped to A, if B is a leaf, or it can be mapped to rows of the database which are associated with the rows of A through a foreign key.

The mapping is established by visiting in preorder the DTD of the document which contains the data. We assume that the DTD is not recursive, since we are dealing with data-centric documents, which are strictly hierarchical. For each node, one of the following three possibilities is established:

1. ignore the node and all of its descendants: no mapping is performed and the corresponding document subtree is ignored;
2. pass through: ignore the node, but continue the mapping over its descendants;
3. map the node to a database element, subject to a set of constraints.

[Figure 4. (a) Editing a node in DBtoXML. (b) The resulting XML document.]

The constraints which govern the mapping process ensure that the data transferred into the database are correct with respect to the relational model. They are the following:

1. different DTD elements cannot be mapped to the same DB element;
2. non-leaf nodes must be associated only with tables in the DB;
3. leaf nodes must be associated only with columns;
4. if a node N whose only descendants are leaves is associated with a table T, then the descendants of N can be associated only with columns of T.

Figure 5 shows the main panel of the tool. The black lines show some illegal mappings; the labels refer to the list above. The user selects the DTD nodes in order in the upper left panel and applies one of the operators ignore, pass through or map. For the latter, a corresponding DB element is selected. The current associations are listed in the lower panel. At the end of this phase, the global consistency of the operation is checked; in particular, all the NOT NULL columns of a mapped table must have received some value. The values of foreign keys in a row are automatically taken from the primary key of the row inserted for the corresponding parent element. The system then produces a file which can be processed by an ad hoc modification of the XMLDBMS package of Bourret [2].

[Figure 5. The XMLtoDB tool.]

6 Other tools

There are other tools currently integrated in the workbench:

1. the InfoDB tool, which extracts both metadata and excerpts of data from a database and presents them in a graphical panel for the user to browse;
2. facilities to save and restore partial results, like Visual SQL-X queries in the DB to XML tool, and map files in the XML to DB tool;
3. a tool to compile Visual SQL-X queries and DB mappings into Java programs, which can be saved for later use. The generation is based on a set of predefined templates that can be adapted by the user to particular tasks.

The workbench prototype integrates, under a single coherent interface, several independent tools for the exchange of data in XML format. So far the DBtoXML and XMLtoDB tools are completely integrated, while the DTDMatch tool is still external. New tools are being developed to extend the functionality of the system, such as a DTD to DTD mapper, which translates XML documents into other documents with different DTDs, as well as new facilities for working with complex projects.

7 Acknowledgments

The work has been supported by MURST, the Italian Ministry of University and Research, in the framework of the project Data-X: Management, Transformation and Exchange of Data in a Web Environment. The workbench has been developed with the co-operation of Massimo Pagotto, Matteo De Franceschi and Marica Bamberghi.

8 References

[1] Borland, InterBase, http://www.borland.com/interbase/.
[2] R. Bourret, XML and Databases, http://www.rpbourret.com/xmlanddatabases.htm.
[3] B. Ludaescher, Y. Papakonstantinou, P. Velikhov, V. Vianu, View Definition and DTD Inference for XML, http://www.sdsc.edu/~ludaesch/paper/icdt-ws99.html.
[4] R. Orsini, A preliminary proposal for SQL-X: A Language to Extract XML Documents from Relational Databases, SEBD 2000, L'Aquila, June 2000.
[5] R. Orsini, M. Pagotto, Visual SQL-X: A graphical tool for producing XML documents from Relational Databases, WWW10, Poster Proceedings of the 10th International World Wide Web Conference, Hong Kong, 2001.
[6] M. Pagotto, A. Celentano, Matching XML DTD To Relational Database Views, SEBD 2000, L'Aquila, June 2000.
[7] M. Pagotto, R. Orsini, Visual SQL-X: Uno strumento grafico per l'estrazione di documenti XML da basi di dati relazionali, SEBD 2001, Venezia, June 2001.