A Tool for Data Cube Construction from Structurally Heterogeneous XML Documents


This is a preprint of an article accepted for publication in JASIST. © 2007 Wiley Periodicals, Inc.

A Tool for Data Cube Construction from Structurally Heterogeneous XML Documents

Turkka Näppilä,a Kalervo Järvelin,b and Timo Niemic
Department of Computer Sciences;a,c Department of Information Studies,b University of Tampere, Finland
E-mail: {Turkka.Nappila; Kalervo.Jarvelin; Timo.Niemi}@uta.fi

Abstract

Data cubes for OLAP (Online Analytical Processing) often need to be constructed from data located in several distributed and autonomous information sources. Such a data integration process is challenging due to semantic, syntactic, and structural heterogeneity among the data. While XML (Extensible Markup Language) is the de facto standard for data exchange, the three types of heterogeneity remain. Moreover, popular path-oriented XML query languages, such as XQuery, require the user to know in much detail the structure of the documents to be processed and are, thus, effectively impractical in many real-world data integration tasks. Several Lowest Common Ancestor (LCA) based XML query evaluation strategies have recently been introduced to provide a more structure-independent way to access XML

documents. We shall, however, show that this approach leads, in the context of certain not uncommon types of XML documents, to undesirable results. This paper introduces a novel high-level data extraction primitive that utilizes the purpose-built Smallest Possible Context (SPC) query evaluation strategy. We demonstrate, through a system prototype for OLAP data cube construction and a sample application in informetrics, that our approach has real advantages in data integration.

INTRODUCTION

Background

For proper operations, advanced organizations need to analyze both their internal data and data produced by external organizations, such as competitors. This requires data extraction from several autonomous information sources. OLAP (Online Analytical Processing; Codd, Codd & Salley, 1993) is a popular means for analyzing multidimensionally organized summary data for ad hoc information needs. In multidimensional analysis, the underlying summary data are viewed simultaneously along multiple dimensions. Data from operational information systems are aggregated in OLAP into data cubes, which consist of dimension and measure attributes. The former provide the factors (e.g., time) along which the values of the latter (e.g., performance, productivity) are analyzed. Individual dimensions may be organized by hierarchical levels of granularity, like day–month–year. OLAP operations, such as roll-up and drill-down, in association with aggregation functions (e.g., sum, average), afford the opportunity to analyze measure attributes at different levels of granularity to identify interesting changes or prevailing trends, or to compare objects of analysis in the data with each other.
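The cube mechanics sketched above can be made concrete with a few lines of code. The following is a minimal illustration, not part of the paper's system: a toy fact table whose rows carry hypothetical dimension attributes (author, year) and a measure attribute (pubs), rolled up by dropping a dimension and applying an aggregation function.

```python
from collections import defaultdict

# Toy fact table; all names and values are illustrative, not taken
# from the paper's sample collection.
facts = [
    {"author": "smith", "year": 2004, "pubs": 3},
    {"author": "smith", "year": 2005, "pubs": 2},
    {"author": "jones", "year": 2004, "pubs": 1},
    {"author": "jones", "year": 2005, "pubs": 4},
]

def roll_up(facts, dims, measure, agg=sum):
    """Aggregate the measure over every combination of the kept dimensions."""
    groups = defaultdict(list)
    for row in facts:
        key = tuple(row[d] for d in dims)
        groups[key].append(row[measure])
    return {key: agg(vals) for key, vals in groups.items()}

# Full granularity: one cell per (author, year) combination.
print(roll_up(facts, ["author", "year"], "pubs"))
# Rolled up along the author dimension: totals per year.
print(roll_up(facts, ["year"], "pubs"))  # {(2004,): 4, (2005,): 6}
```

Replacing `sum` with `len`, `min`, or `max` yields the other common aggregation functions mentioned later in the paper.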

XML (Extensible Markup Language; Bray et al., 2004) has become the de facto standard for data representation by removing one of the traditional obstacles of large-scale data exchange: syntactic data incompatibility. Thus, XML offers a natural starting point for data extraction from autonomous and heterogeneous data sources, especially in the Web environment. In XML documents, the logical structures are represented as elements. An element is an entity delimited by its start and end tags, which carry its name. A start tag may also contain attributes describing properties or characteristics of an element or carrying an independent data value. An element can consist of nested elements or contain only data values. The user is able to freely name the elements and attributes in his/her XML documents. Their names should be selected carefully, since their semantic interpretation is based only on these names. Due to the semistructured nature of XML data, description (schema) and content components are mixed in a document. Because of the element nesting, XML data are modeled as ordered, labeled trees. This is also the starting point in the path-oriented XML query languages developed by the World Wide Web Consortium, the XML standardization organization. These query languages, most notably XQuery, are based on the path expressions of the XPath standard (Clark & DeRose, 1999). XPath offers mechanisms both for navigating XML structures and for extracting content. An XPath expression specifies the path from the root node of a document to the node of interest. In XPath, elements and attributes are treated differently. XQuery (Boag et al., 2007) has become the most popular XML query language. Besides being a query language, XQuery is also a full-scale, Turing-complete programming language. XQuery has inherited several features from the XPath and XML Schema (Fallside & Walmsley, 2004; Thompson et al., 2004; Biron & Malhotra, 2004) standards. Information extraction is done in XQuery by specifying iterative structures and functions. Furthermore, XQuery follows its own, non-XML syntax.

In XML, it is possible to represent semantically similar information in multiple different ways within one document and between documents. This leads to data heterogeneity. XML documents suffer from three types of heterogeneity. In semantic heterogeneity, semantically similar information is represented by different names or dissimilar information by the same names. In syntactic content heterogeneity, semantically the same content is expressed in different languages (French, English) or units of measurement ($, €, £; °F, °C). Finally, in structural heterogeneity the same or similar data are organized in structurally different ways, e.g., at different levels of hierarchy. In addition to these, a specific piece of information can be represented in XML documents as the name of an element, as the name of an attribute, or as their values. These types of heterogeneity are independent of each other, and all combinations of them may appear simultaneously. The heterogeneity between information sources must be harmonized before meaningful data cubes can be constructed based on XML documents. In this paper, we focus on the construction of data cubes in structurally heterogeneous XML environments. Thus, we do not consider multidimensional analysis or OLAP per se.

Contribution and Outline

Before the actual OLAP data cube construction, it is necessary to carry out laborious integration among heterogeneous XML data. Popular XML query languages, such as XQuery, are path-oriented and require the user to know the structure of the documents to be processed in much detail. We demonstrate in the present paper that such XML query languages are indeed laborious and troublesome in data integration because the data integrator must explicitly account for structural variety in the documents while (s)he may not be aware of its scope. To relieve the data integrator from this burden, we propose a high-level XML data extraction primitive, which avoids explicit path specification when extracting structurally heterogeneous data from XML-based information sources. We also provide a novel data cube construction operation, which allows high-level declarative expressions for specifying the dimension and measure attributes with their extraction conditions. In XML data extraction for data cubes, it is necessary to identify which XML components in heterogeneous structures are meaningfully related to each other. The Lowest Common Ancestor (LCA) semantics has been proposed to solve this problem in XML query languages (Liu, Yu & Jagadish, 2004; Xu & Papakonstantinou, 2005; Hristidis et al., 2006). We will point out potential shortcomings of the Smallest Lowest Common Ancestor (SLCA) semantics (Xu & Papakonstantinou, 2005), as well as of the similar Meaningful Lowest Common Ancestor (Liu, Yu & Jagadish, 2004) and Minimum Connected Tree (Hristidis et al., 2006) approaches, which introduce a useful restriction to the LCA semantics, and contribute our Smallest Possible Context (SPC) evaluation strategy to extract meaningfully related data. By using informetrics (see, e.g., Egghe, 2006) as a sample application, the paper aims to demonstrate that the proposed novel operations and the SPC query evaluation strategy provide a real improvement over path-oriented and LCA-based XML query languages for data integration from structurally heterogeneous XML documents. We assume here that the XML element and attribute names have unambiguous semantics which the

data integrator knows, while the data may be organized in heterogeneous structures often largely unknown to him/her.

The rest of the paper is organized as follows. The next section reviews related work on XML-based data cube construction. In the third section, we discuss the requirements for data integration from heterogeneous XML documents, give the goals of our contribution, and introduce our heterogeneous XML document collection for the example in informetrics. Thereafter, we analyze the integration of XML data, present our data extraction primitive is_component_of, and discuss the SPC query evaluation strategy. The following section presents query processing and demonstrates, through sample queries, the benefits of the proposed approach. The paper ends with a discussion section and conclusions.

RELATED WORK

Jensen et al. (2001) present an architecture for integrating XML and relational resources. The architecture contains a component that transforms XML queries into SQL. The mapping between XML and SQL is based on a specific UML diagram, called the UML snowflake diagram. Unlike in our approach, the construction of a UML snowflake diagram presupposes that the designer has detailed knowledge about the domain. In addition, they focus on finding the multidimensional structures directly in XML data that are distributed based on, e.g., legal issues or the nature of the data. Niemi et al. (2002), by contrast, assume that the XML data at hand are intended to be used in multidimensional analysis and are distributed mainly for technical reasons. They present a system where XML is used for data collection to resolve the possible syntactic heterogeneity in data sources. Contrary to our goals, Niemi et al. concentrate on the technical side of the distribution architecture, and they offer no explicit mechanisms for manipulating XML-based information.

Beyer et al. (2005) recognize the limited capabilities for data grouping in XQuery. However, queries requiring grouping of the data are essential in business analytics. Therefore, they propose a new construct that simplifies query result grouping and provides OLAP-style functionality (aggregation and roll-ups) in XQuery. Further, Wiwatwattana et al. (2007) extend XQuery with an X^3 operation that allows the manipulation of XML structures which differ slightly from each other. In our approach, which is not based on XQuery, the SPC query evaluation strategy is utilized to resolve complex heterogeneity among XML structures without the need for user control. Park et al. (2005) propose a framework for constructing XML data cubes from well-organized XML documents, whereas in our approach the XML documents are assumed to be very heterogeneous. In their approach, an XML data cube is constructed and queried using the XML-MDX language developed for this purpose.

GOALS AND SAMPLE ENVIRONMENT

Requirements for Data Integration with XML

In spite of being self-describing, XML does not specify (a) whether a specific piece of information should be represented as an element, an attribute, or a value; (b) how to label elements and/or attributes; (c) how to structurally arrange the relationships between elements and/or attributes; and (d) how to syntactically represent data values. Feature (b) is an example of semantic heterogeneity and feature (d) of syntactic heterogeneity. Features (a) and (c) are related to structural heterogeneity, which is our focus.

Due to the above types of heterogeneity, data integration for data cubes is a very demanding task. Therefore, a full-scale system for data cube construction from heterogeneous XML documents should meet the following requirements:

1. Semantic heterogeneity: the system should support the identification of semantically equivalent elements and attributes even if their names are inconsistent. Analogously, it should support the automatic integration of such semantically equivalent but inconsistent XML structures.

2. Syntactic heterogeneity: the system should support the identification of syntactically equivalent element and attribute values even if their representations (value domains) differ. It should also support automatic harmonization of XML structures.

3. Structural heterogeneity: the system should
3.1 not require its user to master the structural diversity of XML structures in detail or to know which kinds of components (elements or attributes) are used to represent the information;
3.2 not require its user to specify explicitly the navigation in XML structures;
3.3 relieve its user from writing complex structural data integration specifications; and
3.4 therefore, execute structural data integration automatically on the basis of compact, declarative, and high-level specifications.

In this paper we shall focus on requirements 3.1 to 3.4.

Study Goals

The goal of our study is to support complex data integration tasks for the construction of data cubes under the requirements given above. The specific goals are to:

1. demonstrate that popular XPath-based XML query languages, being path-oriented, require the data integrator to know the structure of the documents in too much detail;
2. demonstrate that such XML query languages are too laborious to use in data integration due to the possibility of great structural variation among the data of interest;
3. demonstrate that the Lowest Common Ancestor (LCA) semantics, which relieves the requirements on structural knowledge in XML query languages, is insufficient;
4. propose a high-level data extraction primitive is_component_of for structural data integration, which avoids explicit path specification and incorporates the novel Smallest Possible Context (SPC) evaluation strategy to extract meaningfully related data from XML documents;
5. introduce a high-level CREATE CUBE operation for data cube construction based on the is_component_of primitive and the SPC query evaluation strategy; and
6. present a sample application in informetrics and sample queries that demonstrate the applicability and salient features of our approach.

Niemi, Hirvonen and Järvelin (2003) introduced an advanced tool for multidimensional analysis based on data cubes. Here we introduce a system through which the user is able to flexibly construct data cubes from available XML documents for such analysis. In our Prolog-based implementation the user constructs a data cube by specifying the CREATE CUBE operation. The operation and the underlying system are described later on. Now, we introduce our structurally heterogeneous sample XML document collection, from which sample data cubes for informetrics are later constructed.

A Heterogeneous Sample XML Document Collection

Our application area is informetrics. Informetrics studies various statistical phenomena of literature, which are typically based on bibliographic information provided by online databases. These statistical phenomena may concern productivity issues (of authors, countries, journals; Almind & Ingwersen, 1997), generalized impact factors (of journals, authors; Hjortgaard Christensen, Ingwersen & Wormell, 1997), activity profiles (of authors, organizations), citation networks (bibliographic coupling, author co-citations), and literature growth and aging, to mention a few (for more, see, e.g., Egghe, 2006). Järvelin, Ingwersen and Niemi (2000) presented an easy-to-use interface for generalized informetric analysis. Later, Niemi, Hirvonen and Järvelin (2003) demonstrated that OLAP is a promising approach for advanced analysis in informetrics. However, the prerequisite for OLAP is that there is some underlying data cube on which the actual analysis is based. We focus on constructing OLAP data cubes based on information extracted from heterogeneous XML documents.

Figure 1. Examples of structural organizations in sample documents of type publications

The sample document collection consists of XML documents of two types, publications and grants. Figures 1 and 2 describe several structural alternatives in sample documents based on the same content. The sample documents of the type publications contain information on publications. Possible element and attribute names used in the documents of this type, as well as their possible structures, are depicted in Figure 1. We assume that the semantics of these documents are self-explanatory. The information within these documents can be grouped by individual publications (articles or brochures) or according to the publisher, author, year, domain, or type of an article. Likewise, the sample documents of the type grants contain information on grants. Possible element and attribute names and their structures are given in Figure 2. In this case, information can be grouped by individual grants or by the institution, year, grantee, domain, or amount of a grant.

Figure 2. Examples of structural organizations in sample documents of type grants

As Figures 1 and 2 show, the same information within the sample documents can be represented as a value of an element or of an attribute. Moreover, information about the domain of an article or grant (db or ir) and the type of an article (refereed or non_refereed) can also be expressed by the name of an element. Due to space limitations, the contents of all the sample documents (brochures omitted) are summarized as tables in Appendix A.

INTEGRATION OF XML DATA

The Data Extraction Primitive

We motivate developing a new data extraction primitive by requiring it to be able (1) to handle the elements and attributes of an XML document equally and (2) to utilize both schema- and instance-level information in a query. First, we argue that in data integration there is no semantic difference between elements and attributes in an XML document

other than that only elements may have substructures. In contrast, popular XML query languages separate elements and attributes. Second, we note that both XML and relational databases share structural heterogeneity problems (see, e.g., Lakshmanan, Sadri & Subramanian, 2001), but unfortunately most relational and XML query languages offer only limited capabilities to utilize schema-level information in queries. In XML-based data integration this is, however, often a necessity. Next we take a look at the is_component_of data extraction primitive.

The novel is_component_of primitive is developed for querying the components of an XML document, which are its elements and attributes. Generally, a component in an XML document has a name (i.e., the name of an element or attribute) and a value (i.e., the textual linearization of element content or an attribute value). The components of an XML document are specified in the is_component_of primitive with component expressions. The primitive has the following basic form:

<CompExpr2> is_component_of <CompExpr1> in <DocName>

where <CompExpr1> and <CompExpr2> are component expressions and <DocName> is any valid XML document name or a variable referring to an unknown XML document in a document collection. A component expression has the form <Name>(<Value>), where <Name> is an atom referring to the name of a known component or a variable referring to an unknown component name. If <Value> is an atom, it expresses explicitly the value of the component <Name>, whereas as a variable it refers to an unknown value of the component <Name>. Variables, as in deductive databases (see, e.g., Liu, 1999), begin with an upper-case letter. Strings are written between quotes. The notion of a shared variable, typical of logic programming (see, e.g., Sterling & Shapiro, 1994) and deductive databases, is used throughout our approach: that is, if several component expressions in a query contain the same variable, then its instantiations must be the same throughout the query processing. The <Name> is the only compulsory part of a component expression. Note that a component expression matches both elements and attributes without the user needing to know which one it is in the document.

Variables in component expressions can refer to the extensional level, the intensional level, or both. For example, in the component expression N(smith) the variable N refers to a component with the value smith. Thus, it is related to the intensional level. In the component expression author(V) the variable V is related to the extensional level, since it will be instantiated with a value of the component named author. In the component expression N(V), the variable N is related to the intensional level, whereas the variable V is related to the extensional level. A component expression with two atomic values, for example, author(smith), expresses an explicit condition which, depending on the XML document at hand, may be true or false. A single atom or variable, for example, author or N, is a component expression referring to a known or unknown component name.
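To make the variable conventions concrete, the following is a small, hypothetical sketch (not the paper's Prolog implementation) of how a single component expression could be matched against an actual component. Terms beginning with an upper-case letter are treated as variables, and shared-variable bindings are enforced across matches.

```python
def is_var(term):
    # As in deductive databases, variables begin with an upper-case letter.
    return isinstance(term, str) and term[:1].isupper()

def match(expr, component, bindings):
    """Match a component expression (name, value) against an actual
    component, extending the shared-variable bindings; None on failure."""
    new = dict(bindings)
    for term, actual in zip(expr, component):
        if term is None:            # value part omitted: name-only expression
            continue
        if is_var(term):
            if term in new and new[term] != actual:
                return None         # shared variable already bound differently
            new[term] = actual
        elif term != actual:
            return None             # atomic term must match exactly
    return new

# N(smith): variable name, atomic value -> N bound at the intensional level.
print(match(("N", "smith"), ("author", "smith"), {}))   # {'N': 'author'}
# author(V): atomic name, variable value -> V bound at the extensional level.
print(match(("author", "V"), ("author", "smith"), {}))  # {'V': 'smith'}
# author(smith) is an explicit condition; it fails against ('year', '2004').
print(match(("author", "smith"), ("year", "2004"), {})) # None
```

A prior binding such as `{"V": "smith"}` makes `match(("author", "V"), ("author", "jones"), ...)` fail, which is exactly the shared-variable behavior described above.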

Figure 3. Three examples of data organization in XML documents

The is_component_of primitive may be used in several ways. For example, the expression

author(V) is_component_of publications in example.xml

returns all the components (elements and attributes) with the name author which are, immediate or indirect, subcomponents of the component publications in the XML document example.xml. The corresponding query in XPath would require the two expressions doc("example.xml")//publications//author and doc("example.xml")//publications//@author. Analogously, the expression

C(V) is_component_of E in D

refers to any name (C) and its value (V) of each subcomponent of any element (E) in any XML document (D) in the XML document collection. It is not possible to specify a single XPath expression corresponding to the above is_component_of expression because it presupposes the manipulation of the unknown intensional level.
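The element/attribute neutrality of a component expression can be illustrated with a short sketch using Python's standard xml.etree module. The document and names here are hypothetical, and this is an illustration of the idea rather than the paper's implementation.

```python
import xml.etree.ElementTree as ET

# A hypothetical document in which 'author' appears both as an
# attribute and as an element.
doc = ET.fromstring(
    "<publications>"
    "<article author='smith'><year>2004</year></article>"
    "<article><author>jones</author><year>2005</year></article>"
    "</publications>"
)

def components_named(root, name):
    """Yield (name, value) for every element *or* attribute with the
    given name, hiding the element/attribute distinction from the caller."""
    for elem in root.iter():
        if elem.tag == name:
            yield (name, (elem.text or "").strip())
        for attr, value in elem.attrib.items():
            if attr == name:
                yield (name, value)

print(list(components_named(doc, "author")))
# -> [('author', 'smith'), ('author', 'jones')]
```

A path-oriented query, by contrast, would need one path for the attribute case and another for the element case, as the XPath pair above shows.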

Lowest Common Ancestor Semantics

Compared to XQuery, the most obvious advantage of the is_component_of query primitive is its capability to query multiple components in a single query expression. In this case, the component expressions are written within braces ({,}) and separated by commas. For example, the expression

{author(A), year(Y), title(T)} is_component_of publications in example.xml

returns all the values of the subcomponents author, year, and title associated with the same publications component (the publications component may have several occurrences) in the XML document example.xml. The components within the braces may appear in any order and at any nesting levels within their common ancestor element. However, this approach as such contains one considerable disadvantage. In cases where the ancestor element contains multiple subcomponents with the same name, all the possible combinations of the occurrences of these subcomponents will be returned. For example, when evaluating the expression above against the XML document depicted in Figure 3 (a), eight sets of component combinations are returned, although it is easy to see that only two of them are meaningful. All the other combinations mix information from unrelated components. This is why one has to develop more advanced means to select only the meaningfully related components. For this purpose, several Lowest Common Ancestor (LCA) based semantics have been proposed (Liu, Yu & Jagadish, 2004; Xu & Papakonstantinou, 2005; Hristidis et al., 2006). The central idea in the LCA semantics is that the given nodes in an XML document are searched only in the context of their lowest common ancestor node, thus reducing the search space. To illustrate this, let us first assume that XML documents are indexed with the structural indices introduced in (Niemi, 1983). Thus, we denote the root node of an

XML document by the index <1>. Any other node in the document is denoted by the index <ξ, i>, where ξ is the index of its parent and i is an integer gained by traversing the children of the node ξ in preorder. For example, in the XML document tree in Figure 3 (a), the LCAs for the nodes author, year, and title are the nodes with the indices <1,1>, <1,2>, and <1>. Note that the application of the LCA semantics does not require such indexing; the indices are used here only for demonstration. The problem with the LCA semantics is that the root of an XML document is ultimately the LCA of all the other nodes in any XML document and it thus allows descendant nodes to be combined in an arbitrary manner. To prevent this, a restriction to the LCA semantics has recently been developed. The Smallest Lowest Common Ancestor (SLCA; Xu & Papakonstantinou, 2005) semantics, like the similar approaches in (Liu, Yu & Jagadish, 2004) and (Hristidis et al., 2006), modifies the LCA semantics by requiring that the result must not contain any node which is the root of a subtree that also contains all the desired nodes. This restriction is not, however, always sufficient. Next we consider two such cases:

1. A document contains partial information, that is, it has an unequal number of the nodes we are interested in. When querying, for example, the nodes author, year, and title in an XML document tree corresponding to Figure 3 (b), in all the LCA-based semantics developed so far the node (author, <1,1,1>) would also be wrongly associated with the nodes (year, <1,2,1>) and (title, <1,2,2>).

2. A piece of information at a higher level of the document tree must be replicated with the information represented at a lower level of the same tree structure. For example, when querying the nodes author, year, and title in the XML document tree in Figure 3

(c), the node (author, <1,1>) has to be replicated with the right combinations of the year and title nodes. In LCA-based semantics, the year and title nodes in the document would be combined in an arbitrary manner (e.g., the node (year, <1,2,1>) would be associated with the node (title, <1,3,1,2,1>)) because their LCA and SLCA is the node (publications, <1>), i.e., the root of the XML document tree.

The two examples above demonstrate the general problem with LCA-based semantics: they do not offer any mechanism to prevent combining unrelated nodes when some nodes are associated through the root node of the document tree (or of its subtree). We have developed the Smallest Possible Context (SPC) query evaluation strategy to produce only such node combinations that do not mix data that are not meaningfully related to each other.

Smallest Possible Context Evaluation Strategy

A query evaluated with the SPC query evaluation strategy returns the correct node combinations in all those cases where a query evaluated with some LCA-based semantics does so. In addition, it also returns the correct node combinations in the cases discussed above. The SPC query evaluation strategy does not offer reduced contexts for query evaluation, as LCA-based semantics do, but returns directly a set of meaningfully connected nodes of a given XML document. Unlike the LCA-based semantics, the SPC query evaluation strategy requires indexing of the nodes of XML documents. Next we demonstrate how meaningfully related nodes are selected in the SPC query evaluation strategy.

Let us assume that we want to extract meaningfully related author, year, and title nodes from an XML document organized as in Figure 3 (b). First, we construct the set A that contains combinations of any two (author, year, and title) nodes. Each node in a node pair is also equipped with its index. A node pair is chosen into the

set A by the criterion that it is not possible to construct any other pair of the same node types whose constituents would have a longer common prefix. The set A constructed from the nodes in Figure 3 (b) is

A = { ((author, <1,1,1>), (year, <1,1,2>)), ((author, <1,1,1>), (title, <1,1,3>)), ((year, <1,1,2>), (author, <1,1,1>)), ((year, <1,1,2>), (title, <1,1,3>)), ((year, <1,2,1>), (title, <1,2,2>)), ((title, <1,1,3>), (author, <1,1,1>)), ((title, <1,1,3>), (year, <1,1,2>)), ((title, <1,2,2>), (year, <1,2,1>)) }.

The pairs ((author, <1,1,1>), (year, <1,2,1>)) and ((author, <1,1,1>), (title, <1,2,2>)) are not included in the set A because their longest common prefix is <1>, that is, it is shorter than the longest common prefix of the indices of the node pairs ((author, <1,1,1>), (year, <1,1,2>)) and ((author, <1,1,1>), (title, <1,1,3>)), whose length is 2.

The node pairs of the set A can be viewed as arcs in a directed graph. Now, based on these arcs, we construct the set G that contains all the possible complete graphs they constitute. In a complete graph all the nodes are adjacent, that is, each node is directly connected to every other node in the graph. In our example, the elements of the set A constitute only one complete graph. Thus, the set G is

G = { ( ((author, <1,1,1>), (year, <1,1,2>)), ((year, <1,1,2>), (author, <1,1,1>)), ((year, <1,1,2>), (title, <1,1,3>)), ((title, <1,1,3>), (year, <1,1,2>)), ((title, <1,1,3>), (author, <1,1,1>)), ((author, <1,1,1>), (title, <1,1,3>)) ) }.

The arcs ((year, <1,2,1>), (title, <1,2,2>)) and ((title, <1,2,2>), (year, <1,2,1>)) in the set A do not constitute a complete graph because no author node can be reached from either component. The smallest possible contexts of the nodes we are interested in are framed by selecting the separate nodes from the graphs in the set G. In our example, the smallest possible context for the nodes author, year, and title is the sequence (author, <1,1,1>), (year, <1,1,2>), (title, <1,1,3>).

The advantage of the SPC query evaluation strategy lies in choosing the node pairs that constitute the complete graphs based on the longest common prefix of their indices instead of their mutual distance. If the nodes were chosen based on their mutual distance, then querying the nodes author, year, and title in an XML document organized as in Figure 3 (c) would not return the smallest possible context (author, <1,1>), (year, <1,3,1,1>), (title, <1,3,1,2,1>). This is due to the fact that the distance between the nodes (author, <1,1>) and (year, <1,3,1,1>) is longer than the distance between the nodes (author, <1,1>) and (year, <1,2,1>). The node (year, <1,3,1,1>) would be discarded and only the node (year, <1,2,1>) would be selected. However, the longest common prefix with the author node is <1> for both year nodes, and they are thus both selected in the SPC query evaluation strategy. It is worth noting that the longest common prefix of two structural indices is the same as the SLCA of the two nodes. We have embedded the SPC query evaluation strategy as a part of the is_component_of data extraction primitive.

Following the original idea of the LCA-based semantics, the SPC semantics offers a powerful tool for treating the structural obscurity in XML documents. It is often the case that the data integrator does not know the structure of the XML documents in detail but

(s)he, however, knows their contents. If the integrator wants to formulate an effective XQuery expression, (s)he must know the structure of the documents used in the query in some detail. By using the is_component_of data extraction primitive, the user only needs to know the names of the elements and attributes used in the XML document collection and, based on this information, the primitive is able to extract all the meaningfully related components from these XML documents.

DATA CUBE CONSTRUCTION FROM XML DOCUMENTS

The Data Cube Construction Process

In our approach, a data cube is constructed from the XML documents by the data cube construction operation CREATE CUBE, specified by the user. The operation consists of two parts:

CREATE CUBE <Cube_Specification>
WHERE <Conditions_Specification>

The cube specification part specifies the content of the data cube. The conditions part gives the data extraction conditions in terms of is_component_of expressions. The cube specification part consists of the cube name and a set of dimension and measure attribute specifications. A dimension attribute specification has the form dim(<Name>, <Vars>), where <Name> is the name of the dimension attribute and <Vars> refers to a variable (or a list of variables) used in the conditions part. A measure attribute specification has analogously the form mes(<Name>, <Var>, <Func>), where <Name> is the name of the measure attribute and <Var> is a variable whose instantiations are used for the calculation of the values of the attribute based on the aggregation function

<Func>. In our implementation, the following aggregation functions are available: count, sum, avg, min, and max.

In the conditions part, the user specifies the extraction of the desired data from the XML documents in terms of is_component_of expressions. By using shared variables in the is_component_of expressions and in the cube specification part, the user associates the cube attributes with the relevant information extracted from the documents. If an integration case requires the use of several alternative component expressions, then the user should write all the different is_component_of expressions separately. In order to facilitate the specification of a variety of queries, our system is able to automatically generate the needed is_component_of expressions based on the user's descriptions. Such descriptions consist of two sets of component expressions and of one sample is_component_of expression. Through this sample is_component_of expression, the user expresses the relationships between the component expressions in these two sets (i.e., which are subcomponents and which are supercomponents) in the actual is_component_of expressions. The generation process will be explained in detail in the context of the Sample Queries. The complete BNF syntax of the CREATE CUBE operation is given in Appendix B.

Figure 4. Overview of the system prototype

Before proceeding to the Sample Queries, we briefly look at our system prototype. It has been implemented in Prolog and has a simple text-based interface. An overview of the system implementation is given in Figure 4. First, the user specifies a query (i.e., a CREATE CUBE operation) which is parsed by the system. Second, if the query is correct, it will be analyzed for the is_component_of expression generation; otherwise the user is asked to rewrite the query. Third, a set of is_component_of expressions is generated. Fourth, each of the is_component_of expressions is run in the XML document collection and the extracted information is gathered in the runtime memory as an intermediate result. The intermediate result consists of value sequences whose components are the instantiations of dimension and measure attributes. Fifth, the measure attribute values are aggregated from the intermediate result: the measure attribute values associated with a specific combination of dimension attribute values are gathered and the indicated aggregation function is applied. Finally, the data cube is printed as an XML document.
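The aggregation step over the intermediate result can be pictured as grouping the value sequences by their dimension attribute values and applying the indicated aggregation function. The following is a minimal sketch with invented data and names (the prototype itself is written in Prolog):

```python
from collections import defaultdict

# The five aggregation functions available in the implementation.
AGG = {
    "count": len,
    "sum": sum,
    "avg": lambda vs: sum(vs) / len(vs),
    "min": min,
    "max": max,
}

def aggregate(rows, n_dims, func):
    """rows: value sequences whose first n_dims components are dimension
    attribute values and whose last component is the measure value."""
    groups = defaultdict(list)
    for row in rows:
        # Gather measure values per combination of dimension values.
        groups[tuple(row[:n_dims])].append(row[-1])
    return {dims: AGG[func](vals) for dims, vals in groups.items()}

# Invented intermediate result: (year, author, article contribution).
rows = [(2005, "smith", 1), (2005, "smith", 1), (2006, "jones", 1)]
print(aggregate(rows, 2, "count"))  # {(2005, 'smith'): 2, (2006, 'jones'): 1}
```

The same grouping is reused for every aggregation function; only the final reduction over each group changes.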

Figure 5. Sample of the XML representation of a result data cube

A sample of the resulting data cube (the first line of Table C1) is depicted in Figure 5. The root of the XML document is the datacube element. It has one attribute whose value is the data cube name given in the cube specification part. The rows of the data cube are represented within tuple elements. The dimension attributes are represented as dimension elements. Each of them has a name attribute that expresses the name of the dimension attribute, and the value of the dimension attribute is represented as the value of the element. The measure attributes are analogously represented as measure elements: they contain a name attribute whose value expresses their name, and the corresponding data values, calculated as given in the cube specification part, are the values of the elements.

Sample Queries

Our Sample Query 1 demonstrates a typical multidimensional informetric query by analyzing the annual productivity rates of authors. It constructs a data cube that contains the number of articles grouped by the year of publication, by the author, and by the type of an article. The structure of the data cube specification on line (1) is straightforward. The name of the data cube (article_cube), three dimension attributes (year, author, and type) and one measure attribute (number_of_articles) are given. The values of the measure attribute are calculated on the basis of the count of the articles.
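As a minimal sketch of this output form (the element and attribute names follow the description of Figure 5; the row data are invented), one data cube row can be serialized as follows:

```python
import xml.etree.ElementTree as ET

def cube_to_xml(name, dims, measures, rows):
    """Serialize data cube rows as a datacube element with tuple children."""
    root = ET.Element("datacube", {"name": name})
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        # Dimension attribute values come first in each row.
        for attr, value in zip(dims, row[:len(dims)]):
            d = ET.SubElement(tup, "dimension", {"name": attr})
            d.text = str(value)
        # Aggregated measure attribute values follow.
        for attr, value in zip(measures, row[len(dims):]):
            m = ET.SubElement(tup, "measure", {"name": attr})
            m.text = str(value)
    return ET.tostring(root, encoding="unicode")

xml = cube_to_xml("article_cube", ["year", "author"], ["number_of_articles"],
                  [(2005, "smith", 2)])
print(xml)
```

This yields one tuple element containing two dimension elements and one measure element, each carrying its name as an attribute and its data value as element content.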

Sample Query 1

(1) CREATE CUBE article_cube(dim(year,Y), dim(author,A), dim(type,T), mes(number_of_articles,N,count))
(2) WHERE C = {year(Y), author(A), title(N), type(T), T = refereed, T = non_refereed},
(3) S = {publications},
(4) C is_component_of S in DOC.

In the conditions part, the user specifies the XML structures based on the values of which the data cube attributes are populated. On lines (2) and (3), two sets of component expressions are specified. The is_component_of expression description on line (4) shows that the component expressions given in the set C on line (2) occur on the left-hand side, whereas the component expressions in the set S on line (3) occur on the right-hand side of the is_component_of expressions to be generated. In other words, the component expressions in the set S offer a reduced context in which the component expressions included in the set C are evaluated. The expressions T = refereed and T = non_refereed in the set C denote that a value of the variable T is bound to the given element or attribute names, and they are analogous to the component expressions refereed and non_refereed. This means that some generated is_component_of expressions deal with information represented at the intensional level. The variable DOC will be instantiated with some document name in the XML document collection.

Based on the sets C and S, the system generates a set of is_component_of expressions for extracting the desired information from the XML documents. The main principle in the generation process is that each variable occurring in the cube specification part must

occur only once in each generated is_component_of expression. In the context of Sample Query 1, this means that the following three is_component_of expressions are generated:

(a) {year(Y), author(A), title(N), type(T)} is_component_of publications in DOC,
(b) {year(Y), author(A), title(N), T = refereed} is_component_of publications in DOC,
(c) {year(Y), author(A), title(N), T = non_refereed} is_component_of publications in DOC.

The above three is_component_of expressions are, in turn, evaluated in the XML document collection and their result sets are gathered as an intermediate result, based on which the data cube is constructed after some data cleaning operations have been executed. Due to space limitations, the result of Sample Query 1 is represented in tabular form (see Table C1 in Appendix C).

As Sample Query 1 shows, the CREATE CUBE operation offers a very declarative tool for constructing data cubes. A data cube is specified in a simple way by naming the cube, its attributes, and the aggregation functions used. Likewise, the query primitives for extracting the required information from XML documents are specified simply by giving the relevant component names or data values. After that, the system generates the required query primitives and constructs the data cube automatically. Structures in the source documents may vary without affecting the query specification. These features facilitate data integration considerably from the data integrator's perspective.

Path-oriented XML query languages do not offer any straightforward expression for constructing data cubes. For example, by using XQuery one first needs to specify a set of sub-queries for extracting the desired information from the XML documents. After that, one needs to specify a new query that reorganizes the results of the previous sub-queries and aggregates the data values in them. This can become very laborious and troublesome,

in particular, if the user does not know the structures of the documents to be integrated in much detail. On the basis of the single is_component_of expression on line (a) above, one is able to extract information from all the XML documents where the information is structured according to the document fragments (a)-(e) in Figure 1 (and from all possible variations of them, too). By using XQuery, one would have to write 14 different queries (for all possible element-attribute combinations) to extract the same information. Further, those 14 XQuery queries do not construct any data cube; they only correspond to the conditions part of the CREATE CUBE operation. A sample XQuery query that extracts (partly) the same information as the is_component_of primitive on line (a) is given in Figure 6. This XQuery expression obviously presupposes that the user masters a set of predefined functions and iterative thinking and is also able to synchronize variable instantiations between loops.

Figure 6. An XQuery query corresponding to an is_component_of expression
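The generation principle behind expressions such as (a)-(c) above can be sketched as follows. Component expressions that mention the same variable are alternatives, and each generated is_component_of expression takes exactly one alternative per variable, paired with each supercomponent expression. This is a simplified, hypothetical rendering in Python; the actual system is implemented in Prolog:

```python
from itertools import product

def generate(c_set, s_set):
    """c_set: {variable: [alternative component expressions]};
    s_set: list of supercomponent expressions."""
    exprs = []
    # One combination = one alternative per variable (so each variable
    # occurs exactly once per generated expression).
    for choice in product(*c_set.values()):
        for sup in s_set:
            exprs.append(f"{{{', '.join(choice)}}} is_component_of {sup} in DOC")
    return exprs

# The sets C and S of Sample Query 1: only T has alternative forms.
C = {
    "Y": ["year(Y)"],
    "A": ["author(A)"],
    "N": ["title(N)"],
    "T": ["type(T)", "T = refereed", "T = non_refereed"],
}
for e in generate(C, ["publications"]):
    print(e)  # prints the three expressions (a)-(c)
```

With one supercomponent and three alternatives for a single variable, exactly three expressions result; the count grows multiplicatively as alternatives are added.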

When the number of alternative component expressions in the conditions part increases, the number of is_component_of primitives generated from it also increases. For example, if all the component expressions in the sets C and S in Sample Query 1 had three alternative forms, then 243 (3^5) is_component_of primitives should be generated. Most of them would probably return no result. However, we think that the overgeneration of is_component_of primitives is only a minor drawback compared to the great expressive power of the CREATE CUBE operation. Also, sometimes two or more generated is_component_of primitives may extract exactly the same information from an XML document. In this case, duplicate information is removed.

If the element and attribute names in the XML documents are ambiguous, that is, semantically different objects are denoted with the same name, some of the generated is_component_of primitives may in some cases return erroneous results. This is because they are too inclusive in combining information from too wide a context. However, this is not a problem solely in our system, since similar problems also occur in XQuery and in LCA-based query languages if the nodes in XML documents are named inconsistently.

Sample Query 2 demonstrates a more demanding multidimensional informetric analysis. The user analyzes the correlation between the number of published articles and the total sum of grants grouped by year, person, and domain. The required information has to be extracted from both types of our sample documents (publications and grants). Regardless of the differences in their complexity, the cube specification parts in Sample Queries 1 and 2 resemble each other. The only significant deviation between them is the need to specify two measure attributes in Sample Query 2.

Sample Query 2

(1) CREATE CUBE article_grant_cube(dim(year,Y), dim(person,P), dim(domain,D), mes(no_of_articles,N,count), mes(sum_of_grants,M,sum))
(2) WHERE C1 = {year(Y), author(P), title(N), domain(D), D = db, D = ir},
(3) S1 = {publications},
(4) C1 is_component_of S1 in DOC1,
(5) C2 = {year(Y), grantee(P), amount(M), domain(D), D = db, D = ir},
(6) S2 = {grants, institution},
(7) C2 is_component_of S2 in DOC2.

Now, in the conditions part, one needs to specify two is_component_of expression descriptions. On lines (2) to (3) and (5) to (6), four sets of component expressions are given. Utilizing these sets, the system generates nine is_component_of expressions based on the descriptions on lines (4) and (7), respectively. The first description produces the following three is_component_of expressions:

(a) {year(Y), author(P), title(N), domain(D)} is_component_of publications in DOC1,
(b) {year(Y), author(P), title(N), D = db} is_component_of publications in DOC1,
(c) {year(Y), author(P), title(N), D = ir} is_component_of publications in DOC1.

Analogously, the second is_component_of expression description generates the following six is_component_of expressions:

(d) {year(Y), grantee(P), amount(M), domain(D)} is_component_of grants in DOC2,
(e) {year(Y), grantee(P), amount(M), D = db} is_component_of grants in DOC2,
(f) {year(Y), grantee(P), amount(M), D = ir} is_component_of grants in DOC2,

(g) {year(Y), grantee(P), amount(M), domain(D)} is_component_of institution in DOC2,
(h) {year(Y), grantee(P), amount(M), D = db} is_component_of institution in DOC2,
(i) {year(Y), grantee(P), amount(M), D = ir} is_component_of institution in DOC2.

The use of the shared variables Y, P, and D in the data cube specification and in all the generated is_component_of expressions enables the system to associate the relevant XML structures with the correct data cube attributes. A part of the content of the resulting data cube is given in Table C2 in Appendix C.

The cube specification part of the CREATE CUBE operation remains relatively simple regardless of the complexity of the data cube. The complexity of the conditions part depends on the number of XML document types from which the required information is to be extracted, but it still remains considerably simple compared to the number of XQuery expressions needed. For example, if the user wants to construct a data cube corresponding to that of Sample Query 2 using XQuery, (s)he first needs to formulate explicitly two separate sets of queries. The first contains XQuery expressions for extracting information from the documents of type publications and the second from the documents of type grants. As we demonstrated in the context of Sample Query 1, an is_component_of expression often corresponds to multiple XQuery expressions. In the case of Sample Query 2, we have nine generated is_component_of expressions, so it is not difficult to guess that the number of XQuery expressions the user needs to formulate is large. After formulating and running these XQuery queries, the user needs to combine the result documents returned by each query and finally to execute the required aggregation operations for the data in the combined document. In the CREATE CUBE operation, all this is done automatically and it is invisible to the user.
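The count of nine expressions follows directly from the generation principle: each description yields one expression per combination of variable alternatives and supercomponent. A small illustrative calculation:

```python
from math import prod

def n_generated(alternatives_per_variable, n_supercomponents):
    # One expression per (combination of alternatives) x (supercomponent).
    return prod(alternatives_per_variable) * n_supercomponents

# Description 1: variables Y, P, N, D; only D has three alternative forms
# (domain(D), D = db, D = ir); S1 = {publications}.
first = n_generated([1, 1, 1, 3], 1)
# Description 2: variables Y, P, M, D; same alternatives for D;
# S2 = {grants, institution}.
second = n_generated([1, 1, 1, 3], 2)
print(first, second, first + second)  # 3 6 9
```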

DISCUSSION

XML greatly supports data sharing between organizations by providing a standard data exchange format. In spite of the wide adoption of XML, the problems related to semantic, syntactic, and structural heterogeneity remain. For this reason, XML-based data integration for OLAP data cube construction is very laborious and troublesome. Our contribution is a powerful tool that alleviates the structural integration problems.

Our study goals were stated in the third section. The sample application in informetrics was purposefully designed to represent structural heterogeneity. Goals 1 and 2 (requirement of structural knowledge and laboriousness of XML query languages) were discussed in the context of the LCA semantics and through the sample queries. Sample queries based on the CREATE CUBE operation remained compact compared to the corresponding expressions in XQuery (shown only partially). In addition, queries based on XQuery were complex and required quite detailed structural knowledge on the part of the data integrator. Goal 3 (inefficiency of LCA semantics) was demonstrated through an example showing that the LCA semantics and its enhancement (SLCA) do not always extract meaningfully related data from an XML document.

For Goal 4, we introduced the high-level is_component_of data extraction primitive. It is capable of extracting named components (elements or attributes) from any XML document without explicit path specifications or knowledge of the component type. Depending on the use of atomic values and variables in the is_component_of primitive, they can refer to the extensional level, the intensional level, or both. The proposed Smallest Possible Context (SPC) evaluation strategy extends the LCA semantics to extract meaningfully related data from XML documents.

We also proposed the CREATE CUBE operation for data cube construction (Goal 5). It is based on the is_component_of primitive and the SPC query evaluation strategy. The CREATE CUBE operation allows a high-level declarative specification of the target data cube. A set of is_component_of data extraction primitives is automatically generated for each CREATE CUBE operation. In fact, these comprise all the structural combinations which are needed in the extraction. Some of the generated expressions do not extract anything from a given set of XML documents if the specific structural combinations do not exist. On the one hand, such over-generation consumes some processing time, but on the other hand it relieves the integrator from writing long specifications that would require good knowledge of the underlying XML data.

We chose informetrics as our sample application area as it typically contains much heterogeneity in its data sources (Goal 6). The sample queries demonstrate the applicability and salient features of our approach in informetrics: data cubes may be constructed declaratively through compact expressions that do not require explicit navigation and thus support users who are not aware of the detailed structures of autonomously produced XML documents. The sample queries do not demonstrate the full capabilities of our CREATE CUBE data cube construction operation, for example, the range of aggregation primitives supported.

There are some limitations in the capabilities of our data cube construction operation. Consistent with our present focus, it handles only structural heterogeneity, both interesting and demanding per se. Still, both semantic and syntactic heterogeneity, also interesting and demanding problems, remain, and we will focus on them in later papers. Järvelin & Niemi (1991) describe our early prototype in automatic resolution of syntactic


INTRODUCTION TO BUSINESS INTELLIGENCE What to consider implementing a Data Warehouse and Business Intelligence INTRODUCTION TO BUSINESS INTELLIGENCE What to consider implementing a Data Warehouse and Business Intelligence Summary: This note gives some overall high-level introduction to Business Intelligence and

More information

Lightweight Data Integration using the WebComposition Data Grid Service

Lightweight Data Integration using the WebComposition Data Grid Service Lightweight Data Integration using the WebComposition Data Grid Service Ralph Sommermeier 1, Andreas Heil 2, Martin Gaedke 1 1 Chemnitz University of Technology, Faculty of Computer Science, Distributed

More information

CSE 132A. Database Systems Principles

CSE 132A. Database Systems Principles CSE 132A Database Systems Principles Prof. Victor Vianu 1 Data Management An evolving, expanding field: Classical stand-alone databases (Oracle, DB2, SQL Server) Computer science is becoming data-centric:

More information

Moving from CS 61A Scheme to CS 61B Java

Moving from CS 61A Scheme to CS 61B Java Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you

More information

A Design and implementation of a data warehouse for research administration universities

A Design and implementation of a data warehouse for research administration universities A Design and implementation of a data warehouse for research administration universities André Flory 1, Pierre Soupirot 2, and Anne Tchounikine 3 1 CRI : Centre de Ressources Informatiques INSA de Lyon

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Indexing XML Data in RDBMS using ORDPATH

Indexing XML Data in RDBMS using ORDPATH Indexing XML Data in RDBMS using ORDPATH Microsoft SQL Server 2005 Concepts developed by: Patrick O Neil,, Elizabeth O Neil, (University of Massachusetts Boston) Shankar Pal,, Istvan Cseri,, Oliver Seeliger,,

More information

Efficient Structure Oriented Storage of XML Documents Using ORDBMS

Efficient Structure Oriented Storage of XML Documents Using ORDBMS Efficient Structure Oriented Storage of XML Documents Using ORDBMS Alexander Kuckelberg 1 and Ralph Krieger 2 1 Chair of Railway Studies and Transport Economics, RWTH Aachen Mies-van-der-Rohe-Str. 1, D-52056

More information

Clustering through Decision Tree Construction in Geology

Clustering through Decision Tree Construction in Geology Nonlinear Analysis: Modelling and Control, 2001, v. 6, No. 2, 29-41 Clustering through Decision Tree Construction in Geology Received: 22.10.2001 Accepted: 31.10.2001 A. Juozapavičius, V. Rapševičius Faculty

More information

New Approach of Computing Data Cubes in Data Warehousing

New Approach of Computing Data Cubes in Data Warehousing International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 14 (2014), pp. 1411-1417 International Research Publications House http://www. irphouse.com New Approach of

More information

Time: A Coordinate for Web Site Modelling

Time: A Coordinate for Web Site Modelling Time: A Coordinate for Web Site Modelling Paolo Atzeni Dipartimento di Informatica e Automazione Università di Roma Tre Via della Vasca Navale, 79 00146 Roma, Italy http://www.dia.uniroma3.it/~atzeni/

More information

Using Object And Object-Oriented Technologies for XML-native Database Systems

Using Object And Object-Oriented Technologies for XML-native Database Systems Using Object And Object-Oriented Technologies for XML-native Database Systems David Toth and Michal Valenta David Toth and Michal Valenta Dept. of Computer Science and Engineering Dept. FEE, of Computer

More information

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011 DATA MINING CONCEPTS AND TECHNIQUES Marek Maurizio E-commerce, winter 2011 INTRODUCTION Overview of data mining Emphasis is placed on basic data mining concepts Techniques for uncovering interesting data

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

An Interactive Visualization Tool for the Analysis of Multi-Objective Embedded Systems Design Space Exploration

An Interactive Visualization Tool for the Analysis of Multi-Objective Embedded Systems Design Space Exploration An Interactive Visualization Tool for the Analysis of Multi-Objective Embedded Systems Design Space Exploration Toktam Taghavi, Andy D. Pimentel Computer Systems Architecture Group, Informatics Institute

More information

Log Analysis Software Architecture

Log Analysis Software Architecture Log Analysis Software Architecture Contents 1 Introduction 1 2 Definitions 2 3 Software goals 2 4 Requirements 2 4.1 User interaction.......................................... 3 4.2 Log file reading..........................................

More information

Semantic Analysis of Business Process Executions

Semantic Analysis of Business Process Executions Semantic Analysis of Business Process Executions Fabio Casati, Ming-Chien Shan Software Technology Laboratory HP Laboratories Palo Alto HPL-2001-328 December 17 th, 2001* E-mail: [casati, shan] @hpl.hp.com

More information

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner 24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

Week 3 lecture slides

Week 3 lecture slides Week 3 lecture slides Topics Data Warehouses Online Analytical Processing Introduction to Data Cubes Textbook reference: Chapter 3 Data Warehouses A data warehouse is a collection of data specifically

More information

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT Journal of Information Technology Management ISSN #1042-1319 A Publication of the Association of Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION MAJED ABUSAFIYA NEW MEXICO TECH

More information

Information Discovery on Electronic Medical Records

Information Discovery on Electronic Medical Records Information Discovery on Electronic Medical Records Vagelis Hristidis, Fernando Farfán, Redmond P. Burke, MD Anthony F. Rossi, MD Jeffrey A. White, FIU FIU Miami Children s Hospital Miami Children s Hospital

More information

Software quality improvement via pattern matching

Software quality improvement via pattern matching Software quality improvement via pattern matching Radu Kopetz and Pierre-Etienne Moreau INRIA & LORIA {Radu.Kopetz, Pierre-Etienne.Moreau@loria.fr Abstract. Nested if-then-else statements is the most common

More information

Microsoft' Excel & Access Integration

Microsoft' Excel & Access Integration Microsoft' Excel & Access Integration with Office 2007 Michael Alexander and Geoffrey Clark J1807 ; pwiueyb Wiley Publishing, Inc. Contents About the Authors Acknowledgments Introduction Part I: Basic

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: 1.800.529.0165 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This course is designed to deliver the fundamentals of SQL and PL/SQL along

More information

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 Outline More Complex SQL Retrieval Queries

More information

Oracle SQL. Course Summary. Duration. Objectives

Oracle SQL. Course Summary. Duration. Objectives Oracle SQL Course Summary Identify the major structural components of the Oracle Database 11g Create reports of aggregated data Write SELECT statements that include queries Retrieve row and column data

More information

Cognos 8 Best Practices

Cognos 8 Best Practices Northwestern University Business Intelligence Solutions Cognos 8 Best Practices Volume 2 Dimensional vs Relational Reporting Reporting Styles Relational Reports are composed primarily of list reports,

More information

CHAPTER 3 PROPOSED SCHEME

CHAPTER 3 PROPOSED SCHEME 79 CHAPTER 3 PROPOSED SCHEME In an interactive environment, there is a need to look at the information sharing amongst various information systems (For E.g. Banking, Military Services and Health care).

More information

Design and Implementation

Design and Implementation Pro SQL Server 2012 Relational Database Design and Implementation Louis Davidson with Jessica M. Moss Apress- Contents Foreword About the Author About the Technical Reviewer Acknowledgments Introduction

More information

General Purpose Database Summarization

General Purpose Database Summarization Table of Content General Purpose Database Summarization A web service architecture for on-line database summarization Régis Saint-Paul (speaker), Guillaume Raschia, Noureddine Mouaddib LINA - Polytech

More information

Multidimensional Modeling - Stocks

Multidimensional Modeling - Stocks Bases de Dados e Data Warehouse 06 BDDW 2006/2007 Notice! Author " João Moura Pires (jmp@di.fct.unl.pt)! This material can be freely used for personal or academic purposes without any previous authorization

More information

Introduction to XML Applications

Introduction to XML Applications EMC White Paper Introduction to XML Applications Umair Nauman Abstract: This document provides an overview of XML Applications. This is not a comprehensive guide to XML Applications and is intended for

More information

Completeness, Versatility, and Practicality in Role Based Administration

Completeness, Versatility, and Practicality in Role Based Administration Completeness, Versatility, and Practicality in Role Based Administration Slobodan Vukanović svuk002@ec.auckland.ac.nz Abstract Applying role based administration to role based access control systems has

More information

On XACML, role-based access control, and health grids

On XACML, role-based access control, and health grids On XACML, role-based access control, and health grids 01 On XACML, role-based access control, and health grids D. Power, M. Slaymaker, E. Politou and A. Simpson On XACML, role-based access control, and

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

An Eclipse Plug-In for Visualizing Java Code Dependencies on Relational Databases

An Eclipse Plug-In for Visualizing Java Code Dependencies on Relational Databases An Eclipse Plug-In for Visualizing Java Code Dependencies on Relational Databases Paul L. Bergstein, Priyanka Gariba, Vaibhavi Pisolkar, and Sheetal Subbanwad Dept. of Computer and Information Science,

More information

8. Business Intelligence Reference Architectures and Patterns

8. Business Intelligence Reference Architectures and Patterns 8. Business Intelligence Reference Architectures and Patterns Winter Semester 2008 / 2009 Prof. Dr. Bernhard Humm Darmstadt University of Applied Sciences Department of Computer Science 1 Prof. Dr. Bernhard

More information

Markup Languages and Semistructured Data - SS 02

Markup Languages and Semistructured Data - SS 02 Markup Languages and Semistructured Data - SS 02 http://www.pms.informatik.uni-muenchen.de/lehre/markupsemistrukt/02ss/ XPath 1.0 Tutorial 28th of May, 2002 Dan Olteanu XPath 1.0 - W3C Recommendation language

More information

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process ORACLE OLAP KEY FEATURES AND BENEFITS FAST ANSWERS TO TOUGH QUESTIONS EASILY KEY FEATURES & BENEFITS World class analytic engine Superior query performance Simple SQL access to advanced analytics Enhanced

More information

Tutorials for Project on Building a Business Analytic Model Using Data Mining Tool and Data Warehouse and OLAP Cubes IST 734

Tutorials for Project on Building a Business Analytic Model Using Data Mining Tool and Data Warehouse and OLAP Cubes IST 734 Cleveland State University Tutorials for Project on Building a Business Analytic Model Using Data Mining Tool and Data Warehouse and OLAP Cubes IST 734 SS Chung 14 Build a Data Mining Model using Data

More information

An Architecture for Integrated Operational Business Intelligence

An Architecture for Integrated Operational Business Intelligence An Architecture for Integrated Operational Business Intelligence Dr. Ulrich Christ SAP AG Dietmar-Hopp-Allee 16 69190 Walldorf ulrich.christ@sap.com Abstract: In recent years, Operational Business Intelligence

More information

A User Interface for XML Document Retrieval

A User Interface for XML Document Retrieval A User Interface for XML Document Retrieval Kai Großjohann Norbert Fuhr Daniel Effing Sascha Kriewel University of Dortmund, Germany {grossjohann,fuhr}@ls6.cs.uni-dortmund.de Abstract: XML document retrieval

More information

Oracle Service Bus Examples and Tutorials

Oracle Service Bus Examples and Tutorials March 2011 Contents 1 Oracle Service Bus Examples... 2 2 Introduction to the Oracle Service Bus Tutorials... 5 3 Getting Started with the Oracle Service Bus Tutorials... 12 4 Tutorial 1. Routing a Loan

More information

Facilitating the Interaction with Data Warehouse Schemas through a Visual Web-Based Approach

Facilitating the Interaction with Data Warehouse Schemas through a Visual Web-Based Approach Computer Science and Information Systems 11(2):481 501 DOI: 10.2298/CSIS131130032B Facilitating the Interaction with Data Warehouse Schemas through a Visual Web-Based Approach Clemente R. Borges and José

More information

SDMX technical standards Data validation and other major enhancements

SDMX technical standards Data validation and other major enhancements SDMX technical standards Data validation and other major enhancements Vincenzo Del Vecchio - Bank of Italy 1 Statistical Data and Metadata exchange Original scope: the exchange Statistical Institutions

More information

A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems

A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems Vincenzo Grassi Università di Roma Tor Vergata, Italy Raffaela Mirandola {vgrassi, mirandola}@info.uniroma2.it Abstract.

More information

DATA WAREHOUSING - OLAP

DATA WAREHOUSING - OLAP http://www.tutorialspoint.com/dwh/dwh_olap.htm DATA WAREHOUSING - OLAP Copyright tutorialspoint.com Online Analytical Processing Server OLAP is based on the multidimensional data model. It allows managers,

More information

How To Use X Query For Data Collection

How To Use X Query For Data Collection TECHNICAL PAPER BUILDING XQUERY BASED WEB SERVICE AGGREGATION AND REPORTING APPLICATIONS TABLE OF CONTENTS Introduction... 1 Scenario... 1 Writing the solution in XQuery... 3 Achieving the result... 6

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART A: Architecture Chapter 1: Motivation and Definitions Motivation Goal: to build an operational general view on a company to support decisions in

More information