On Building XML Data Warehouses




Laura Irina Rusu (1), Wenny Rahayu (1), David Taniar (2)

(1) LaTrobe University, Department of Computer Science and Computer Engineering, Bundoora, VIC 3086, Australia
{lrusu,wenny}@cs.latrobe.edu.au
(2) Monash University, School of Business Systems, Clayton, VIC 3800, Australia
David.Taniar@infotech.monash.edu.au

Abstract. Developing a data warehouse for XML documents involves two major processes: creating it, by processing raw XML documents into a specified data warehouse repository; and querying it, by applying techniques to better answer users' queries. This paper focuses on the first part, that is, identifying a systematic approach for building a data warehouse of XML documents, specifically for transferring data from an underlying XML database into a defined XML data warehouse. The proposed methodology for building XML data warehouses covers processes such as data cleaning and integration, summarization, creating intermediate XML documents, updating/linking existing documents and creating fact tables. We utilise XQuery technology in all of the above processes. We also present a case study on how to put this methodology into practice.

1. Introduction

In the last few years, building a data warehouse for XML documents has become a very important issue, given the continual growth in the representation of different kinds of data as XML documents [1,11]. This is one of the reasons why researchers have become interested in studying ways to optimise the processing of XML documents and to obtain a better data warehouse in which to store optimised information for future reference.

How to build a data warehouse and how to query it efficiently are two strongly related issues. In the case of an XML document repository, we cannot obtain good answers to our queries, in terms of speed, quality and up-to-date information, if the data warehouse is not properly designed to contain all the necessary information; on the other hand, we cannot properly design a data warehouse if a reasonable number of the possible future questions of users are unknown and unanalysed.

Many papers have analysed how to design a better data warehouse for XML data from different points of view (e.g. [1, 2, 5, 6]), and many others have focused on querying XML data warehouses or XML documents (e.g. [7, 8]). However, almost all of them considered only the design and representation issues of XML data warehouses, or how to query them; very few considered the optimisation of data quality.

In this paper, we propose a practical methodology for building a data warehouse of XML documents. We also aim for a data warehouse in which occurrences of dirty data, errors, duplications or inconsistencies are minimised as much as possible and in which a good summarisation exists. The steps cover data cleaning and integration, summarization, creating intermediate XML documents, updating/linking existing XML documents and creating fact tables. We use XQuery in all of the above processes.

The main purpose of this paper is to show systematic steps to build an XML data warehouse, as opposed to developing a model for designing a data warehouse. However, it is important to note that our proposed steps for building an XML data warehouse are generic enough to be applied to different XML data warehouse models.

The rest of this paper is organised as follows. Section 2 gives a brief introduction to XML technology, including XML data warehouses. Section 3 discusses related work. Section 4 presents our proposed methodology for building XML data warehouses. Section 5 describes a case study, and finally Section 6 gives the conclusions.

2. Background

As the background to our work, we briefly discuss XML technology, covering XML, XML Schema and XQuery. We also briefly introduce XML data warehouses, including their schemas.

2.1. XML, XML Schema, and XQuery

XML (eXtensible Markup Language) has increasingly been used for storing different kinds of semi-structured data [11]. It has become a new standard for data representation and exchange on the Internet, from web server and web application data to business data that were previously stored in relational databases. One novelty introduced by XML was the significance of tags: rather than prescribing how a web page will look (as presentation tags do in HTML), a tag specifies the context or meaning of the element it contains. The names of elements and attributes can easily be bound to their content in XML, and therefore it is self-describing.

A hierarchy can be chosen to accommodate grouping or categorisation, and elements and attributes can be added to the structure at any time because, as the name suggests, the language is extensible. Because new elements can be defined so freely, the structure of an XML document can change many times; in the case of very large documents, track of the changes can be lost, and even the content of elements or the values of attributes can differ, by mistake, from what was intended. To overcome this problem, XML Schema is used to specify a valid structure for each XML document, in which elements and attributes, hierarchy, data types, possible values (restrictions), order indicators, occurrence indicators etc. are clearly defined [11]. Strictly following an XML schema reduces the probability of inconsistencies in the document, so that querying the document, using XQuery for example, gives better results.

XQuery is a language developed by the W3C (World Wide Web Consortium) [9], still under development but functional at the same time, designed to provide query capabilities for the same range of data that XML stores [11]. It can be used for querying XML data that have no schema at all, or that are governed by an XML schema or by a DTD (Document Type Definition). The XQuery data model is based on the concept that each document is represented as a tree of nodes (where each node can be a child of another node), as opposed to the traditional relational model, which does not utilise hierarchies. One of the most powerful expressions in XQuery is FLWOR, which stands for for-let-where-order by-return. It is quite similar to select-from-where in SQL for relational databases: a FLWOR expression binds variables to values in the for and let clauses, eliminates the tuples that do not satisfy a specific condition in the where clause, orders the remaining tuples in the order by clause, and uses them to create a result in the return clause.

2.2. XML Data Warehouses

Query processing can be facilitated by the existence of a data warehouse, that is, a repository where the necessary data are stored after they have been extracted, transformed and integrated from the initial documents [6]. Because these days disk space is no longer an issue, whereas query speed is very significant, one of the main purposes of a data warehouse is to optimise query processing speed, and this can be achieved by using multiple levels of summarisation during processing. As the volume of data contained in XML documents is continually growing and changing, an XML data warehouse is intended to solve the issue of providing users with up-to-date, correct and comprehensive information.

A star-schema approach to a data warehouse contains facts, dimensions, attributes and attribute hierarchies. Facts contain cleaned and correct specific values representing the business activity (for example, sales). Dimensions are characteristics that slice our data into specific subsets, for example sales by region or sales by year. The details contained in dimension documents are attributes, and they can have a specific hierarchy, for example country => region => state => city.

3. Related Work

There is a large amount of work in the data warehouse field. Many researchers have studied how to construct a data warehouse, first for relational databases [2,3,4,10] and, in recent years, for XML documents [5,6], considering how widely this kind of document is now used in a vast range of activities.

A concrete methodology for constructing an XML data warehouse by analysing frequent patterns in historical user queries is provided in [6]. The authors start by determining which data sources are more frequently accessed by the users, transform those queries into Query Path Transactions and, after applying a rule mining technique, calculate the Frequent Query Paths on which the data warehouse schema is built. It was also mentioned that the final step in building a data warehouse would be to acquire clean and consistent data to feed into the data warehouse. However, there is not enough detail on how to ensure this. Although it seems to be a simple part of the whole process, this is the place where corrupted or inconsistent data can slip into the data warehouse.

Another approach is proposed in [5], where an XML data warehouse is designed from XML schemas through a semi-automated process. After preprocessing the XML schema and creating and transforming the schema graph, the authors choose facts for the data warehouse and, for each fact, follow a few steps in order to obtain the star schema: building the dependency graph from the schema graph, rearranging the dependency graph, defining dimensions and measures, and creating the logical schema. In their approach, XQuery is used to query XML documents in three different situations: examination of convergence and shared hierarchies, searching for many-to-many relationships between the descendants of the fact in the schema graph, and searching for to-many relationships towards the ancestors of the fact in the schema graph. The authors specify that, in the presence of many-to-many relationships, one of the logical design solutions proposed in [10] is to be adopted.

The authors of [3] consider the aspect of data correctness and propose a solution in which a data-cleaning application is modelled as a directed acyclic flow of transformations applied to the source data. This framework consists of a platform offering three services: data transformation, multi-table matching and duplicate elimination, each service being supported by a shared kernel of four macro-operators consisting of high-level SQL-like commands. The framework proposed by these authors for data cleaning automation addresses three main problems: object identity, data entry errors and data inconsistencies across overlapping autonomous databases. However, the method covers data cleaning in relational databases only.

The authors of [4] survey various summarisation procedures for databases and provide a categorisation of the mechanisms by which information can be summarised. They consider information capacity, vertical reduction by attribute projection, horizontal reduction by tuple selection, and horizontal/vertical reduction by concept ascension to be among the very good methods of summarisation. However, as they analysed implementations in database projects only, further work is needed to implement specific summarisation techniques for XML documents.

Our proposed method focuses on the practical aspects of building XML data warehouses through several practical steps, including data cleaning and integration, summarization, creating intermediate XML documents, updating/linking existing documents and creating fact tables. We have developed generic methods, so that the proposed method can be applied to any collection of XML documents to be stored in an XML data warehouse. We believe that a step-by-step creation of an XML data warehouse from existing XML documents is essential and, moreover, that by employing these steps the resulting XML data warehouse will offer better performance in processing future queries.

4. Proposed Method on Building XML Data Warehouses

Our paper proposes a systematic approach to feeding data from an underlying XML database into an XML data warehouse in such a way that optimisation of queries is made possible. A methodology for building a data warehouse from an initial XML document is also provided, developing the necessary fact and dimensions.

4.1. Data cleaning and integration

The existence of an XML schema is very important, especially for analysing data correctness, because its purpose is to define a valid way to specify the content of the XML documents that are considered for analysis. If it is correctly established and all documents comply with it, there is a lower probability of having dirty data. We define four rules for performing data cleaning and integration.

Rule 1. If a schema exists, we should verify the correctness of all schema stipulations:

- verifying that the names of elements and attributes are used correctly throughout the entire document; since XML is case-sensitive, <name>, <Name> and <NAME> are three different element names;

- observing whether data type and natural logic are respected (for example, a telephone number will probably have the xsd:string data type, but it can contain only digits, not letters etc.);

- verifying that all elements and attributes are entered in their schema-specified hierarchy, for example:

    correct:
      <name>john</name>
      <address>
        <street>15, New St.</street>
        <city>melbourne</city>
        <state>vic</state>
      </address>

    incorrect (the <address> level is missing):
      <name>john</name>
      <street>15, New St.</street>
      <city>melbourne</city>
      <state>vic</state>

- verifying that order indicators are respected. They can be all, choice or sequence. The all indicator specifies that the child elements declared in it can appear in any order and that each child element must occur only once. The choice indicator specifies that either one element or another can occur. If the sequence indicator exists, the elements inside it must appear only in the specified order;

- identifying Null values. In relational database theory, Null values have two main meanings [4]: (1) the value is unknown or (2) the value is inapplicable. In XML Schema, the number of occurrences of an element can be specified; for example, minOccurs="1" and maxOccurs="20" mean that the element must occur at least once and no more than 20 times;

- any other schema-related specification or restriction.
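Some of the Rule 1 checks can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it uses only the standard xml.etree.ElementTree module, and the sample document, the telephone element and the function name check_rule1 are assumptions introduced for the example. It covers three of the checks above: element names that differ only by case, telephone values that contain non-digits, and address parts that appear outside an address element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; the <telephone> element is an assumption.
SAMPLE = """<customers>
  <customer>
    <name>john</name>
    <telephone>5550 0100</telephone>
    <address><street>15, New St.</street><city>melbourne</city><state>vic</state></address>
  </customer>
  <customer>
    <Name>mary</Name>
    <telephone>55A00101</telephone>
    <street>2, Old Rd.</street>
  </customer>
</customers>"""

ADDRESS_PARTS = {"street", "city", "state"}


def check_rule1(xml_text):
    """Return a list of human-readable Rule 1 violations found in xml_text."""
    root = ET.fromstring(xml_text.strip())
    problems = []

    # (a) Element names that differ only by letter case.
    spellings = defaultdict(set)
    for elem in root.iter():
        spellings[elem.tag.lower()].add(elem.tag)
    for variants in spellings.values():
        if len(variants) > 1:
            problems.append("inconsistent element name: " + ", ".join(sorted(variants)))

    # (b) Natural logic: a telephone number may contain only digits (spaces ignored).
    for tel in root.iter("telephone"):
        text = tel.text or ""
        if not text.replace(" ", "").isdigit():
            problems.append("non-numeric telephone: " + text)

    # (c) Hierarchy: street/city/state must be children of <address>.
    for parent in root.iter():
        for child in parent:
            if child.tag in ADDRESS_PARTS and parent.tag != "address":
                problems.append("<%s> outside <address> in <%s>" % (child.tag, parent.tag))

    return problems


problems = check_rule1(SAMPLE)
for line in problems:
    print(line)
```

A full implementation would normally delegate these checks to a validating XML Schema processor; the sketch only shows that the checks of Rule 1 are mechanical enough to automate.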

Rule 2. Eliminate duplicate records, for example where a customer name was entered two or more times by different departments in a store, in a different manner or order (surname and first name, first name and surname, surname followed by the father's initial and first name, etc.).

Rule 3. Eliminate inconsistencies from the values of elements and attributes, for example the existence of two different dates of birth for the same person [3]. As there cannot be two, only the right one should be kept.

Rule 4. Eliminate errors in data caused by the entry process, for example mistyping.

Some of the data-cleaning processes have to be done manually, because they require user intervention and occasionally a domain expert's understanding of the area. However, Rule 1 above can be automated to reduce the need for user intervention in this process.

4.2. Data summarisation

This is a very important step, as multiple issues must be taken into consideration:

- Because not the entire volume of existing data in the underlying database will be absorbed into the XML data warehouse, we should consider the process of data summarisation very carefully. We must extract only useful and valuable information. At the same time, the information should be diversified, as in the future it will have to answer multiple queries.

- Depending on how affordable disk space is, how often specific queries appear and whether historical data will be necessary, we can have multiple levels of summarisation, applying different mechanisms.

- On the other hand, multiple levels of summarisation, where they exist, can significantly improve the speed of querying. In most data warehouse projects, query speed matters more than disk space, so we should consider detailing as many levels of summarisation as necessary.

Creating dimensions using summarization

Depending on how many levels of summarisation we have for a specific dimension, we create and populate new documents, which contain either values extracted from the initial data or specially constructed values. For example (see Figure 1):

a) if we need to create a part-of-the-day dimension (query: "What are the dynamics of sales during one day: morning, afternoon, evening?"), we create and populate it as an entirely new document;

b) if we need country or region as a level of summarisation (query: "What are the sales of product X by country/region?"), we can find this information by querying the primary document directly and searching for the distinct values of the country and/or region element.
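The two cases can be illustrated with a short Python sketch using the standard xml.etree.ElementTree module. It is only an illustration under stated assumptions: the sample sales document, the row and key element names, the helper names distinct_dimension and part_of_the_day_dimension, and the hour boundaries of each part of the day are all hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical primary document storing sales details.
SALES = """<sales>
  <sale><country>Australia</country><region>Oceania</region></sale>
  <sale><country>Japan</country><region>Asia</region></sale>
  <sale><country>Australia</country><region>Oceania</region></sale>
</sales>"""


def distinct_dimension(root, tag, key_tag):
    """Case b): one row per distinct value of `tag`, each with a generated key."""
    values = []
    for elem in root.iter(tag):
        text = (elem.text or "").strip()
        if text and text not in values:
            values.append(text)
    dim = ET.Element(tag + "_dimension")
    for key, text in enumerate(values, start=1):
        row = ET.SubElement(dim, tag + "_row")
        ET.SubElement(row, key_tag).text = str(key)
        ET.SubElement(row, tag).text = text
    return dim


def part_of_the_day_dimension():
    """Case a): a dimension populated with our own values (hour ranges are assumed)."""
    periods = [("morning", 6, 12), ("afternoon", 12, 18), ("evening", 18, 24)]
    dim = ET.Element("part_of_the_day_dimension")
    for key, (label, start, end) in enumerate(periods, start=1):
        row = ET.SubElement(dim, "part_of_the_day")
        ET.SubElement(row, "time_id").text = str(key)
        ET.SubElement(row, "label").text = label
        ET.SubElement(row, "start_hour").text = str(start)
        ET.SubElement(row, "end_hour").text = str(end)
    return dim


root = ET.fromstring(SALES)
countries = distinct_dimension(root, "country", "country_id")
regions = distinct_dimension(root, "region", "region_id")
day_parts = part_of_the_day_dimension()
print(ET.tostring(countries, encoding="unicode"))
```

The generated keys (country_id, region_id, time_id) play the role of the links to the fact document that are discussed in the rest of this section.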

[Fig. 1: Dimensions created and populated as new XML documents. The distinct values of country and region are picked from the XML documents storing sales details to form the Countries and Regions documents (Dimension A), while the Part-of-the-day document (time_id; morning, afternoon, evening) is created and populated as a new XML document (Dimension B).]

Techniques to create a time dimension: The levels of summarisation used in time dimensions can be, for example, month, quarter, semester and year. For month and year, specific XQuery functions [12] can be applied to obtain them. Furthermore, we are interested only in those values related to our data, so only distinct values should be extracted. At the same time, a key should be constructed to act as a link to the fact document. The following shows how a time dimension is created using XQuery.

    document {
      for $b at $a in distinct-values(doc("doc_name.xml")//date_node)
      return
        <time_node>
          <timeid>{$a}</timeid>
          <date>{$b}</date>
          <month>{month-from-date(xs:date($b))}</month>
          <year>{year-from-date(xs:date($b))}</year>
        </time_node>
    }

For a more specific case, for example to create a semester dimension, we should create the dimension and populate it with our own data.

    document {
      <semester>
        <semid>1</semid>
        <start_month>1</start_month>
        <end_month>6</end_month>
      </semester>,
      <semester>
        <semid>2</semid>
        <start_month>7</start_month>
        <end_month>12</end_month>
      </semester>
    }
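To make the role of the semester document concrete, the following Python sketch builds a time dimension from a list of dates and assigns each date to the semester whose month range contains it. It is a hedged illustration only: the function names semester_of and build_time_dimension, the root element name and the sample dates are assumptions, and the dates are sorted simply to make the generated keys deterministic.

```python
import xml.etree.ElementTree as ET
from datetime import date

# (semid, start_month, end_month), mirroring the semester document above.
SEMESTERS = [(1, 1, 6), (2, 7, 12)]


def semester_of(d):
    """Return the semid of the semester whose month range contains date d."""
    for semid, start_month, end_month in SEMESTERS:
        if start_month <= d.month <= end_month:
            return semid
    raise ValueError("no semester covers month %d" % d.month)


def build_time_dimension(date_strings):
    """One <time_node> per distinct ISO date, with a generated key."""
    dim = ET.Element("time_dimension")
    for key, text in enumerate(sorted(set(date_strings)), start=1):
        d = date.fromisoformat(text)
        node = ET.SubElement(dim, "time_node")
        ET.SubElement(node, "timeid").text = str(key)
        ET.SubElement(node, "date").text = text
        ET.SubElement(node, "month").text = str(d.month)
        ET.SubElement(node, "year").text = str(d.year)
        ET.SubElement(node, "semid").text = str(semester_of(d))
    return dim


# Hypothetical dates, as they might be extracted from the initial document.
time_dim = build_time_dimension(["2005-09-02", "2005-03-14", "2005-03-14"])
print(ET.tostring(time_dim, encoding="unicode"))
```

Storing the semester key next to each date is one possible design; the lookup could equally be performed later, when the dimension is linked to the fact data.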

When we link this dimension to the fact data, we only need to determine which semester each date corresponds to. Functions to extract date and time components are available in XQuery [12], so a wide range of time-level summarisations can be analysed and created.

Techniques to create other kinds of dimensions: After determining the dimensions, we should check whether values of each specific dimension exist in our document. If they do not, they should be created, as in the semester example above. If they exist, we are interested in the distinct values of the element involved.

    document {
      for $t at $a in distinct-values(doc("doc_name.xml")//element)
      return
        <new_element>
          <elementid>{$a}</elementid>
          <element_name>{$t}</element_name>
        </new_element>
    }

Here <new_element> is a tag representing a newly created element in the dimension. It contains a key (<elementid>, which takes generated values), the actual value that is of interest to us (that is, <element_name>, e.g. the values of country) and any other elements that can be helpful in the dimension. If we have more than one level of summarisation for the same dimension (e.g. country and region in a location dimension), we create a new document for each of those levels and connect them to each other using keys.

4.3. Creating intermediate XML documents

In the process of creating a data warehouse from a collection of documents, creating intermediate documents is a common way to extract valuable and necessary information (refer to Figure 2).

[Fig. 2: Extracting important information as a new XML document. The XML documents storing sales details (invoice_id, invoicedate, price, quantity, customer_id, customer_name, customer_address, carrier, transport_fee, ...) are reduced to an intermediate Orders document (invoice_id, invoicedate, quantity, price, customer_id).]

Which information in the initial documents is most important and necessary, and should therefore be kept in the data warehouse, is a very good question. Researchers have attempted to answer it by devising different complex techniques (detecting patterns in historical user queries [6] or detecting shared hierarchies and convergence of dependencies [7]). Still, for general users, analysing the possible queries in the domain remains a common way to do it.

During this step, we are interested only in the data representing activity, which include the data involved in queries, calculations etc., from our initial document. At the same time, we bring into the intermediate document those elements of the initial document that are keys to dimensions, if they already exist (e.g. customer_id in Fig. 2, which references the customer dimension). The actual fact document in the data warehouse will be this intermediate document, linked to the dimensions.

    document {
      for $t in doc("doc_name.xml")
      return
        <temp_fact>
          <elem1_name>{data($t//elem1_content)}</elem1_name>
          <elem2_name>{data($t//elem2_content)}</elem2_name>
        </temp_fact>
    }

Here <temp_fact> is a tag representing a new element in the intermediate document, containing <elem1_name> (the name of the element) and elem1_content (the value of the element that is valuable for our fact document), and so on. The actual correct path in an expression such as {data($t//elem1_content)} depends on the schema graph of the initial document and on where the element exists in that schema.

4.4. Updating/Linking existing documents; Creating fact table

At this step, all the intermediate XML documents created before should be linked in such a way that relationships between keys are established (see Fig. 3).
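The extraction of Section 4.3 and the key linking introduced here can be sketched together in Python. This is a minimal sketch under assumptions, not the paper's method: the sample documents, the wrapper element names (orders, order, orders_fact, fact) and the helper names make_intermediate and make_fact are hypothetical. The element names of the intermediate and fact documents follow Figures 2 and 3.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical initial document and time dimension.
SALES = """<sales>
  <sale>
    <invoice_id>1001</invoice_id><invoicedate>2005-03-14</invoicedate>
    <price>19.90</price><quantity>2</quantity><customer_id>7</customer_id>
    <customer_name>john</customer_name><carrier>post</carrier>
    <transport_fee>5.00</transport_fee>
  </sale>
</sales>"""

TIME_DIM = """<time_dimension>
  <time_node><timeid>1</timeid><date>2005-03-14</date></time_node>
</time_dimension>"""

# Activity data and dimension keys kept in the intermediate document.
KEPT = ["invoice_id", "invoicedate", "quantity", "price", "customer_id"]


def make_intermediate(sales_root):
    """Section 4.3: copy only the elements needed for the fact document."""
    orders = ET.Element("orders")
    for sale in sales_root.iter("sale"):
        order = ET.SubElement(orders, "order")
        for tag in KEPT:
            ET.SubElement(order, tag).text = sale.findtext(tag, default="")
    return orders


def make_fact(orders, time_dim):
    """Section 4.4: replace the date by the time key and add a calculated value."""
    time_ids = {n.findtext("date"): n.findtext("timeid")
                for n in time_dim.iter("time_node")}
    facts = ET.Element("orders_fact")
    for order in orders.iter("order"):
        quantity = int(order.findtext("quantity"))
        price = Decimal(order.findtext("price"))
        fact = ET.SubElement(facts, "fact")
        ET.SubElement(fact, "invoice_id").text = order.findtext("invoice_id")
        ET.SubElement(fact, "time_id").text = time_ids[order.findtext("invoicedate")]
        ET.SubElement(fact, "quantity").text = str(quantity)
        ET.SubElement(fact, "price").text = str(price)
        ET.SubElement(fact, "value").text = str(quantity * price)
        ET.SubElement(fact, "customer_id").text = order.findtext("customer_id")
    return facts


orders = make_intermediate(ET.fromstring(SALES))
facts = make_fact(orders, ET.fromstring(TIME_DIM))
print(ET.tostring(facts, encoding="unicode"))
```

Building the lookup table of time keys once, before iterating over the orders, reflects the same concern as the text: each document is traversed as few times as possible.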

[Fig. 3: Linking documents and creating the star schema of the data warehouse. The intermediate Orders document (invoice_id, invoicedate, quantity, price, customer_id) becomes the Orders Fact document (invoice_id, time_id, quantity, price, value, customer_id), which is linked to the Part-of-the-day dimension (time_id; morning, afternoon, evening), the Countries dimension (country_id, country) and other dimensions, e.g. customers.]

Techniques to link dimensions and to create the data warehouse star schema: If linking the dimensions to the intermediate document and obtaining the fact are processed together, the number of iterations through our initial document is lower, which subsequently reduces the processing time. A general way to do it can be:

    let $a := doc("dimension1.xml")  (: e.g. time dimension :)
    let $b := doc("dimension2.xml")  (: e.g. customer dimension :)
    return document {
      for $t in doc("intermediate.xml")/node
      return
        <fact>
          <dim1_key>{for $p in $a/* where $p//element = $t//element
                     return data($p//dim1_key)}</dim1_key>
          <dim2_key>{for $p in $b/* where $p//element = $t//element
                     return data($p//dim2_key)}</dim2_key>
          (: ... and so on for all dimensions ... :)
          <elem1>{data($t//elem1_name)}</elem1>
          <elem2>{data($t//elem2_name)}</elem2>
          <elem3>{$t//elem1_name * $t//elem2_name}</elem3>
          (: ... and so on for all extracted and calculated elements ... :)
        </fact>
    }

In the example above, we obtain the fact, where <dim1_key>, <dim2_key> etc. represent the newly created key elements that link the fact to the dimensions, and <elem1>, <elem2>, <elem3> etc. are elements of the fact, extracted from the intermediate document. As can be seen in the <elem3> declaration, a wide range of operators can be applied in order to obtain the desired values for analysis (e.g. price * quantity = income).

5. A Case Study

Because the main purpose of this paper is to show systematic steps to build an XML data warehouse, as opposed to developing a model for designing a data warehouse, purely for the purpose of a case study we have adopted an existing model and schema mapping as proposed in [5]. However, it is important to note that our proposed steps for building an XML data warehouse are generic enough to be applied to different XML data warehouse models.

In this section, we show how the star schema of the data warehouse can be obtained from the structure of the initial XML documents, following the proposal and steps presented in the previous section. An example of the visual representation of an XML document (based on [5]) is shown in Figure 4.

[Fig. 4: Example of XML document schema graph. Nodes: purchaseorder, billto, shipto, item, orderdate, product, name, brand, street, country, quantity, price, productcode, size, city.]

The following code shows the mapping from the visual representation in Figure 4 to its implementation in XML Schema.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="purchaseorder">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="shipto">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="street" type="xs:string"/>
                  <xs:element name="city" type="xs:string"/>
                  <xs:element name="country" type="xs:string"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
            <xs:element name="billto">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="street" type="xs:string"/>
                  <xs:element name="city" type="xs:string"/>
                  <xs:element name="country" type="xs:string"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
            <xs:element name="item" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="quantity" type="xs:positiveInteger"/>
                  <xs:element name="price" type="xs:decimal"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
            <xs:element name="product">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="name" type="xs:string"/>
                  <xs:element name="productcode" type="xs:string"/>
                  <xs:element name="size" type="xs:string"/>
                  <xs:element name="brand" type="xs:string"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

Step 1. Data cleaning and integration

The first step is to apply Rules 1 to 4 as described in Section 4.1, which include verifying the correctness of all schema stipulations and eliminating duplicate records, inconsistencies and data entry errors. These rules are applied to the data that will be transferred to our data warehouse. This step is normally performed by a user who has a good understanding of the domain, through manual techniques or otherwise. Once all the above rules have been applied to the data, Step 2 is performed.

Step 2. Data summarisation

By analysing the possible decision-making queries to the data warehouse, we find, for example, that we need to summarise our purchase details at month level, so one of the attributes in our time dimension will be month (the same can be done for year, semester, quarter etc.). At the same time, summarisations by customer and by product are necessary, and almost all queries ask for an income calculation, that is, price multiplied by quantity. For easy reference, recall the schema graph of Figure 4.

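To make the two derived values of Step 2 concrete, the following Python sketch computes the month of an order date and the income (price multiplied by quantity) for each item. The paper itself performs these operations in XQuery; the sample document and the function name summarise are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

# Hypothetical purchase order with two items.
SAMPLE = """
<purchaseorder>
  <orderdate>2003-09-04</orderdate>
  <item><name>Lamp</name><quantity>2</quantity><price>19.50</price></item>
  <item><name>Desk</name><quantity>1</quantity><price>120.00</price></item>
</purchaseorder>
"""


def summarise(order):
    """Return one row per item with the order month and the item income."""
    # Month level of the time dimension, derived from the order date.
    month = date.fromisoformat(order.findtext("orderdate")).month
    rows = []
    for item in order.findall("item"):
        quantity = int(item.findtext("quantity"))
        price = Decimal(item.findtext("price"))
        # Income is the measure requested by almost all queries.
        rows.append({
            "month": month,
            "name": item.findtext("name"),
            "income": price * quantity,
        })
    return rows


rows = summarise(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Decimal is used instead of float so that monetary values are not affected by binary rounding.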
a) Creating the time dimension. Because we need summarisation at month level, each date in our data must be linked to its month value. We therefore extract only the distinct values of orderdate from the initial document and apply a get-month-from-date function (called month-from-date in the final XQuery 1.0 function library) to find the month each date corresponds to. We name this new document timedim.xml. Each new document is created with the document constructor of XQuery.

    document {
      (: a let clause cannot increment a counter across iterations,
         so the positional variable $b supplies the surrogate key :)
      for $t at $b in distinct-values(doc("purchaseorder.xml")//orderdate)
      return
        <ordtime>
          <timekey>{$b}</timekey>
          <orderdate>{$t}</orderdate>
          <month>{month-from-date(xs:date($t))}</month>
        </ordtime>
    }

b) Creating the customer dimension. A purchase order may have different customers for its shipto and billto destinations, or they may be the same. By creating a customer dimension we create a single document to which shipto and billto will refer, without unnecessary duplication of data in the principal document. To do this, we extract only the distinct (unique) customers from our initial document and name the result customerdim.xml.

    let $c := doc("purchaseorder.xml")//purchaseorder
    return document {
      (: only shipto and billto names are customers; item and product names are excluded :)
      for $n at $b in distinct-values(($c/shipto | $c/billto)/name)
      (: distinct-values returns atomic values, so the address is looked up
         from the first customer element carrying that name :)
      let $cust := (($c/shipto | $c/billto)[name = $n])[1]
      return
        <customer>
          <customerkey>{$b}</customerkey>
          <name>{$n}</name>
          <street>{data($cust/street)}</street>
          <city>{data($cust/city)}</city>
          <country>{data($cust/country)}</country>
        </customer>
    }

c) Creating the product dimension. As for the customer dimension, we create a product dimension that contains only the distinct products extracted from our initial document. We use the productcode element as the key, because it uniquely identifies a product. We name the document productdim.xml.

    let $p := doc("purchaseorder.xml")//purchaseorder
    return document {
      for $code in distinct-values($p//productcode)
      (: first product element carrying this code :)
      let $prod := ($p//product[productcode = $code])[1]
      return
        <product>
          <productkey>{$code}</productkey>
          <productname>{data($prod/name)}</productname>
          <size>{data($prod/size)}</size>
          <brand>{data($prod/brand)}</brand>
        </product>
    }

Step 3. Creating intermediate XML documents

As our possible queries refer to month, customer and product as levels of summarisation, we need to extract from the initial purchaseorder example all records with their order date, customer name (from shipto and billto) and product code, together with price and quantity as the principal figures for analysing this sales activity. This is all we need for now, and we name the result PurchaseTemp.xml.

    document {
      for $t in doc("purchaseorder.xml")//purchaseorder
      return
        <purchase>
          <shiptoname>{data($t/shipto/name)}</shiptoname>
          <billtoname>{data($t/billto/name)}</billtoname>
          <orderdate>{data($t/orderdate)}</orderdate>
          <productcode>{data($t//productcode)}</productcode>
          <price>{data($t//item/price)}</price>
          <quantity>{data($t//item/quantity)}</quantity>
        </purchase>
    }

Step 4. Linking the newly created documents; creating the data warehouse

We now link orderdate from our intermediate document with the time dimension, the customer names with the customer dimension and productcode with the product dimension. The final document, which will be the fact of the data warehouse, contains the keys that link to all three dimensions, price and quantity as the raw activity data, and an income element as a commonly required calculation.

    let $p := doc("productdim.xml")
    let $c := doc("customerdim.xml")
    let $t := doc("timedim.xml")
    return document {
      for $a in doc("PurchaseTemp.xml")/purchase
      return
        (: the wrapper element is an assumption, named after the fact in Fig. 5;
           the original listing emits the key elements without a wrapper :)
        <purchase_order>
          <shiptocustomerkey>{
            for $b in $c/customer
            where $b/name = $a/shiptoname
            return data($b/customerkey)
          }</shiptocustomerkey>
          <billtocustomerkey>{
            for $b in $c/customer
            where $b/name = $a/billtoname
            return data($b/customerkey)
          }</billtocustomerkey>
          <orderdatekey>{
            for $b in $t/ordtime
            where $b/orderdate = $a/orderdate
            return data($b/timekey)
          }</orderdatekey>
          <productkey>{
            for $b in $p/product
            where $b/productkey = $a/productcode
            return data($b/productkey)
          }</productkey>
          <USPrice>{data($a/price)}</USPrice>
          <quantity>{data($a/quantity)}</quantity>
          <income>{$a/price * $a/quantity}</income>
        </purchase_order>
    }

Following Steps 1 to 4, we obtain the star-schema below for the purchaseorder XML data warehouse, as given in [5].

    TIME: timekey, orderdate, month
    PRODUCT: productkey, productname, size, brand
    CUSTOMER: customerkey, name, street, city, country
    PURCHASE_ORDER (fact): shiptocustomerkey, billtocustomerkey, orderdatekey, productkey, USPrice, quantity, income

Fig. 5 Star-schema of purchaseorder XML data warehouse [5]

6. Discussions

As mentioned in the introduction of this paper, the data contained in a data warehouse should ideally be timely, of high quality, up to date and easy to query. Our method helps to meet these requirements.

If the fact and the dimensions of the data warehouse are built correctly following the steps, and if data cleaning and integration are performed every time new data is to be included, a high-quality data warehouse will be achieved, which in turn supports more efficient querying.

Using the appropriate summarisation when developing the dimensions can also reduce the volume of data added to the data warehouse over time. The methodology therefore brings optimisation in the long run.

Another strength is that the steps are presented in a clear and simple manner, so that people without much knowledge of XQuery or XML Schema can repeat them, with suitable modifications, in order to obtain a data warehouse corresponding to their needs.

Queries will be easy to ask and process, as long as the dimensions correctly reflect all levels of summarisation suitable for each specific case. For example, the query "Give all purchase orders which were billed to customers from UK" only needs to look in the customer dimension for the keys of customers with country = "UK" and then, for each of these keys, take the order date, price, quantity and income from the fact document.

7. Conclusions and Future work

The aim of this paper is to establish a framework for building an XML data warehouse, in terms of the quality of the data and the procedures to be followed by the working group. The framework has been built following a few important steps but, depending on the specific XML documents involved in processing, a more detailed analysis of data cleaning, summarisation and integration can be done.

Because at this stage there are no automatic methods covering data cleaning, our aim is to study this concept and to identify techniques that can be applied to the processing of XML documents. Another interesting aspect to study is the transformation of the XML schema when the data to be included in the data warehouse come from documents with different structures but equivalent data content, and how this affects data integration and the creation of dimensions and facts in the data warehouse.

References

1. Widom J., Data Management for XML: Research Directions, IEEE Data Engineering Bulletin, 22(3):44-52, Sept. 1999
2. Golfarelli M., Maio D., Rizzi S., Conceptual design of data warehouses from E/R schemes, Proc. HICSS-31, Kona, Hawaii, 1998
3. Galhardas H., Florescu D., Shasha D., Simon E., An Extensible framework for data cleaning, Proc. of the International Conference on Data Engineering, San Diego, CA, 2000
4. Roddick J.F., Mohania M.K., Madria S.K., Methods and Interpretation of Database Summarisation, Database and Expert Systems Application, Florence, Italy, Lecture Notes in Computer Science, Vol. 1677, pp. 604-615, Springer-Verlag, 1999
5. Vrdoljak B., Banek M. and Rizzi S., Designing Web Warehouses from XML Schema, Data Warehousing and Knowledge Discovery, 5th International Conference DaWaK 2003, Prague, Czech Republic, Sept. 3-5, 2003
6. Zhang J., Ling T.W., Bruckner R.M. and Tjoa A.M., Building XML Data Warehouse Based on Frequent Patterns in User Queries, Data Warehousing and Knowledge Discovery, 5th International Conference DaWaK 2003, Prague, Czech Republic, Sept. 3-5, 2003
7. Fernandez M., Simeon J. and Wadler P., XML Query Languages: Experiences and Exemplars, Draft manuscript, September 1999, http://homepages.inf.ed.ac.uk/wadler/topics/xml.html
8. Deutsch A., Fernandez M., Florescu D., Levy A. and Suciu D., A Query Language for XML, Computer Networks, vol. 31, pp. 1155-1169, Amsterdam, Netherlands, 1999

9. Robie J., XQuery: A Guided Tour (ISBN 0-321-18060-7), copyright 2004, chapter 1, posted with permission from Addison-Wesley, http://www.datadirect.com/news/whatsnew/xquerybook/index.ssp
10. Song I.Y., Rowen W., Medsker C. and Ewen E., An analysis of many-to-many relationships between fact and dimension tables in dimensional modelling, Proc. DMDW, Interlaken, Switzerland, pp. 6.1-6.13, 2001
11. World Wide Web Consortium (W3C), XML Schema Part 0: Primer, http://www.w3.org/tr/xmlschema-0/#emptycontent
12. http://www.w3.org/tr/xpath-functions