Hybrid Similarity Measure for XML Data Integration and Transformation

Thesis for the Degree of Doctor of Philosophy

Hybrid Similarity Measure for XML Data Integration and Transformation

Pham Thi Thu Thuy

Department of Computer Engineering
Graduate School
Kyung Hee University
Seoul, Korea
August, 2012

Advised by
Professor Young-Koo Lee
Professor Sungyoung Lee

Submitted to the Department of Computer Engineering and the Faculty of the Graduate School of Kyung Hee University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Committee:
Professor Byeong-soo Jeong, Ph.D.
Professor Brian J. d'Auriol, Ph.D.
Professor Jin-Ho Kim, Ph.D.
Professor Donghai Guan, Ph.D.
Professor Young-Koo Lee, Ph.D.

Submitted to the Department of Computer Engineering on July 8, 2012, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

XML (eXtensible Markup Language) is widely used as a standard for sharing data between web-based applications. To share XML data with another XML application system, the various XML data sources must be integrated into a coherent XML data set. Moreover, to share XML data with a semantics-supporting system, such as one based on the Web Ontology Language (OWL), the XML data must also be transformed into an OWL ontology. However, XML data are heterogeneous: the same information can be published in many different ways in terms of tag names and structures, and the same tag names can represent different contents. As a result, the sharing of XML data is not yet fully automatic. This heterogeneity has led to research on measuring the similarity of elements between XML schemas and the similarity of elements within a schema. The similarity measure of XML schemas therefore plays a crucial role in both the integration and the transformation tasks.

In this thesis, we deal with the problem of data transformation and integration for XML data sources. This data format presents many challenges that need XML-specific solutions: an XML document is not required to have a schema; if a schema exists, it may be expressed in one of several schema languages, such as XML Schema (XSD) or Document Type Definition (DTD); and resolving the heterogeneity of schemas is not straightforward because of the hierarchical nature of XML data. We propose an approach based on a hybrid similarity measure that handles the distinct problems of syntactic, semantic, and schematic heterogeneity of XML data. Our similarity measure addresses both structural and semantic components and can be applied to both XML schema types. Because integration and transformation have different targets, we propose two types of similarity measure: similarity of elements between two schemas for data integration, and similarity of elements within a schema for data transformation. Accordingly, the thesis is divided into two main parts, both aimed at enhancing the sharing of XML data.

The first part focuses on the similarity measure between schemas for data integration. We propose a novel similarity measure that concurrently considers both the structural and the semantic information of two XML Schemas. Specifically, we introduce new metrics to compute the data type and cardinality constraint similarities, which improve the quality of the current semantic assessments. Based on the similarity between element pairs, we put forward an algorithm to calculate the similarity between two XML Schema trees. Based on the similarity measure, we propose an integration method that merges two or more disparate XML data sources into a single coherent data set to support the information needs of the target business or enterprise. Experimental results indicate that our methodology provides better similarity values than the other methods compared, with regard to the accuracy of semantic and structural similarities.

The second part concerns the similarity measure of duplicate elements within a schema and the transformation of XML Schema into OWL. This part is divided into two sub-parts. The first focuses on the problem of duplicate elements in XML Schema. Recent studies on transforming XML Schema into OWL have shown that the associated duplicate problem can be solved by creating a unique identifier for each element. However, this solution treats duplicate elements as different nodes, whereas most duplicates represent the same information. We present a novel method to measure the semantic similarity between duplicate elements within an XML Schema. Semantic similarity is the combination of the declaration and context features, which capture all the descriptions and relationships of the duplicate elements. Based on the similarity values, we classify the duplicates into two groups, similar and non-similar, and then propose a suitable strategy to transform these duplicates into appropriate OWL concepts.

In the second sub-part, we present S-Trans, a mechanism that eases the interpretation and automates the semantic transformation of XML data into an OWL ontology, which allows easier and better semantic communication among information systems. On the basis of the XML schemas (XSD or DTD), we extract the document structure and add more descriptions for XML elements. Experimental results show that the proposed method reliably predicts the semantic similarity of duplicates and produces a better-quality OWL ontology.

Thesis Supervisor: Young-Koo Lee
Title: Professor
Thesis Co-Supervisor: Sungyoung Lee
Title: Professor

Acknowledgments

Countless people have supported, directed, assisted, and encouraged me in completing this PhD, and I would like to thank them.

First of all, I would like to express my deepest gratitude to my supervisor, Professor Young-Koo Lee, for supervising my work and for always having the right suggestions during our discussions. He not only led me to the research area of semantic similarity measures for data integration but also offered many insightful suggestions on which I developed and completed my dissertation. I would like to thank Kyung Hee University and the IITA scholarship for giving me the opportunity, and providing the funds, to carry out this PhD. I am also grateful to my co-supervisor, Professor Sungyoung Lee, who guided me in the proper direction with his inquisitive questions and helpful comments. I would also like to thank Professor Brian J. d'Auriol, whose advice on my presentation and visualization skills has stayed with me and will continue to do so throughout my future research career.

I would like to thank a number of professors in the Computer Engineering Department for their excellent lectures: Professor Tae-Choong Chung, Professor Ok-Sam Chae, Professor Byeong-Soo Jeong, Professor Choong-Seon Hong, and Professor Eui-Nam Huh. Their wisdom greatly contributed to consolidating and widening my knowledge of computer engineering, which is an important background for my dissertation.

My thanks also go to the many colleagues who helped and encouraged me during my stay in Korea, especially Prof. Dr. Donghai Guan, Dr. Phan Tran Ho Truc, Dr. Dang Viet Hung, senior Le Tuan Anh, senior Nguyen Hoang Viet, senior Vo Thi Luu Phuong, the couple Nguyen Van Mui and Tran Thi Kim Loc, Korean friends Yongkoo Han, Jinseung Kim, and Kisung Park, and lab-mates La The Vinh, Dinh Dong Luong, Pham The Anh, Iram Fatima, and many others, who have shared their knowledge and technical expertise with me. Without their suggestions, my life and my research would have been much harder. I would also like to express my deep thanks to the dissertation committee members, whose helpful comments have helped me improve and complete this dissertation.

Last but not least, my thanks go to my family, to whom I dedicate this achievement: my parents and my parents-in-law, who, with love and understanding, have always encouraged me to pursue a PhD. I would like to send my love to my sweet husband and my two lovely daughters, Bin and Su, who always stayed beside me during a somewhat stressful time.

Thanks to you all.

Contents

Table of Contents
List of Figures
List of Tables

1 Introduction
    Introduction
    XML data sharing scenario
    Motivation and contributions
    Thesis Organization
2 Background and Related Work
    Background on XML Data and OWL Ontology
        XML data
        Ontology
        OWL fundamental constructs
        Term definitions
    Related Work
        Similarity between documents and XML integration
        Duplicate similarity and XML schema transformation
3 ESim: Element Similarity measure for XML integration
    Similarity Measure Framework
    Semantic Similarity Measurement (SeSim)
        Name similarity (NSim)
        Data type similarity
        Constraint similarity
    Structural Similarity Measurement (StSim)
        Ancestor similarity
        Sibling similarity
        Children similarity
    Similarity between Two Schema Trees
    XML Schema Integration
4 S-Trans: Duplicate Similarity Measure for XML2OWL
    General modules of XML2OWL Transformation
    Semantic Similarity of Duplicate Elements
        Motivating example
        Ancestor similarity (ASim)
        Sibling similarity (SbSim)
        Children similarity (ChSim)
    Transforming DTD/XSD into the OWL Ontology
5 Experimental Results
    Experiments on XML Schemas Similarity Measure
        Determining of parameter values
        Results based on real-world XSDs
    Experiments on Duplicate Similarity in XML Transformation
        Experimental setup
        Results
    Experimental Summarization
6 Conclusion and Future Researches
    Conclusion
        Thesis summary
        Contributions
    Future Researches
Appendix A: ESim - Evaluation Results
Appendix B: Sample of XML Schema for Transformation
Appendix C: OWL Ontology Result
References

List of Figures

1.1 Semantic Web stack architecture
1.2 Different solutions to integration and transformation of XML data
1.3 Thesis organization
2.1 Example of an XML document and its respective DTD
2.2 Example of the respective XSD of document in Figure 2.1
OWL root classes
OWL subclass definition
OWL class individual
OWL Datatype property definition
OWL Class instance with datatype property
General framework of similarity measure method
Tree representation for Schema Patient A
Tree representation for Schema Patient B
Expressions for Schema Patient A
A fragment of WordNet
The structure similarity algorithm
XML Schema integration framework architecture
General syntactic to semantic architecture
Example of a DTD document, prescription.dtd
4.3 Example of a DTD and a part of its corresponding XSD document
The corresponding tree of XML schema (XSD/DTD) in Figure
The ancestor similarities at different ancestor levels with five candidate values
The ancestor similarity algorithm
Transforming framework from XML into OWL
The transforming correspondences between DTD/XSD and OWL
OWL results of duplicates which are highly similar
OWL results of duplicates which are less similar
Tree representation for Schema Patient C
Tree representation for Schema Patient D
Determining weights of ESim function
Determining weights of SeSim function
Matching comparisons of ESim to COMA, XMLSim, and XClust
Quality of name measure
Quality of data type measure
Quality of cardinality constraint measure
Quality of structure measure
F measure comparison
The error rate of classification at different thresholds
Evaluation results, drug medicament schema
Evaluation results, patient admission schema
Evaluation results, healthcaremetadata schema
Evaluation results, pathology.report schema
Quality of S-Trans, PrSim, ChSim, and CaSim
A.1 Evaluation results of matching system for schemas in Table

List of Tables

3.1 Data type compatibility table
Cardinality constraint similarity table
The similarity of synthetic XSDs
The characteristics of the tested schemas
Element similarity result of the two schemas (Patient A and Patient B)
The characteristics of the tested schemas

Chapter 1
Introduction

1.1 Introduction

Many web applications and services now publish their data in XML, the standard for sharing data, because a common data representation format makes it easier to share data with other applications and services. Usually, to improve the sharing of XML data within the same XML application system, all XML data sources are integrated into a coherent data set that supports the information needs of the target applications. Moreover, to enhance the sharing of XML data with semantics-supporting systems that use OWL, the XML data are transformed into a target OWL ontology. However, XML data are heterogeneous: the same information can be published in many different ways in terms of tag names and structures, and the same tag names can represent different contents. As a result, the exchange of XML data is not fully automatic. To address this heterogeneity problem, many studies have proposed similarity measures that compute the similarity of heterogeneous XML data before integrating or transforming them. Algorithms that automate these similarity computations help to reduce the time and effort spent on creating and maintaining data sharing in many application areas [90], such as e-business [12], [100], [93], e-government [70], [73], [48], [25], [33], e-learning [18], [11], [98], and e-health [95], [88], [68].
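The heterogeneity described above can be made concrete with a small sketch. The two records below are hypothetical (they are not taken from the thesis): they carry the same information under different tag names and nesting, so exact tag matching finds nothing in common even though the data agree.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records with the same content but different tag names
# and structure (illustrative data only).
SOURCE_A = "<patient><name>Ann Lee</name><dob>1980-02-01</dob></patient>"
SOURCE_B = ("<person><fullName>Ann Lee</fullName>"
            "<birth><date>1980-02-01</date></birth></person>")


def leaf_values(xml_text):
    """Map the tag of each leaf element (no children) to its text."""
    root = ET.fromstring(xml_text)
    return {el.tag: el.text for el in root.iter() if len(el) == 0}


a = leaf_values(SOURCE_A)
b = leaf_values(SOURCE_B)

# Exact tag-name matching sees no overlap between the two sources ...
shared_tags = set(a) & set(b)
# ... although the values they carry are identical.
shared_values = set(a.values()) & set(b.values())

print(shared_tags, shared_values)
```

This is why a measure that goes beyond literal tag comparison is needed before such sources can be integrated or transformed.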

Figure 1.1: Semantic Web stack architecture (layers shown include Unicode, URI, XML, XML Schema, RDF/RDF Schema, OWL, Inference Engine, Reasoning/Proof, and Trust)

To illustrate the importance of XML data integration, consider an integration example from an e-health system. Such a system holds a wide variety of XML healthcare data, collected from a large number of environmental and patient sensors and actuators that monitor and improve patients' physical and mental conditions [86]. The volume of XML healthcare data keeps increasing, so healthcare providers need to integrate these data in order to keep them as electronic health records (EHRs) [32]. The integration of XML healthcare data therefore plays an important role in enhancing the quality of patient care and the exchange of information among medical systems.

In general, although heterogeneous XML sources may have similar content, they may be described using different tag names and structures. Integrating similar XML documents from different data sources benefits applications that use the same XML language: it gives them access to more complete and useful information, and it lets query systems retrieve information from a single integrated source instead of various sources.

On the other hand, the Semantic Web has recently been developed and is widely used by many semantic applications. This development creates a need to share existing XML data with semantic applications. However, XML is at a disadvantage when it comes to semantic interoperability, because it focuses primarily on syntax and offers no way to describe the semantics of the data [34]. This lack of semantic description causes problems when semantic agents seek to understand and reason about XML data. Therefore, to enable the sharing of XML data with semantics-supporting systems, XML data must be mapped or transformed into a semantics-supporting language. In this thesis, we choose OWL as the target language for the transformation, since OWL is described as a higher semantic language in the Semantic Web stack architecture [36]. Moreover, because duplicate elements in heterogeneous XML data may represent either different or the same information, we improve the semantics of the transformation by adding a preliminary step that computes the semantic similarity of XML elements, specifically the duplicate elements, before the XML transformation process.

In general, this thesis tackles the problem of sharing XML data between XML applications and between an XML application and a semantics-supporting application. In particular, we have developed an approach to the integration and transformation of heterogeneous XML data sources. Our approach is based on a similarity measure: its output is a set of similarity scores of elements between XML schema documents, in an XML data integration scenario, or a set of similarity scores of duplicate elements within an XML schema, in an XML data transformation scenario. Figure 1.2 gives an overview of the different solutions for enhancing data sharing and highlights the focus of our research.

Figure 1.2: Different solutions to integration and transformation of XML data (labels in the diagram: XML data sharing; schema matching/mapping; schema integration; schema transformation; integrate DTDs; integrate XSDs; XML2RDF; XML2OWL; similarity between docs.; similarity within doc.; similarity measure)

The rest of this chapter is organized as follows. Section 1.2 introduces the different scenarios in the broad area of data sharing. Section 1.3 presents the motivation and contributions of the work described in this thesis. Section 1.4 gives an overview of the thesis organization.

1.2 XML data sharing scenario

The sharing of XML data across applications and services may involve several scenarios, including XML schema integration and XML schema transformation. All scenarios share the same similarity measurement process: similarity between documents for the integration scenario, and similarity within a document for the transformation scenario. We introduce below the major scenarios and processes in schema integration and transformation.

XML schema integration is an XML data sharing scenario in which XML data from multiple data sources are combined in order to give users a single integrated source. This task may retain all of the original logical structures and tag names of the source XML schemas (XSD or DTD), since it generates a union or global XML schema which combines the data sources in more complex ways.

XML schema transformation is an XML data sharing scenario in which one needs to define rules for transforming a source XML schema S1 and its associated XML instances DS1 into the structure of a target schema S2, which is defined in a different modeling language from S1, for the purposes of query processing or materialization of S2 using the data DS1. XML data exchange is a stricter form of XML data transformation, which also respects the constraints defined within the target XML schema, and not just its structure.

Element similarity between XML schemas is the automatic or semi-automatic process of determining the similarity scores between the elements of an XML schema S1 and those of another XML schema S2. The next step is classification, in which highly similar element pairs are combined into an integrated source. The choice of the classification value is discussed in the experiment section.

Similarity of elements within an XML schema is the automatic or semi-automatic process of determining the similarity scores between elements within a schema S1. The similarity results can then be used to transform data from the data source of S1 into S2. In this thesis, we compute the similarity value of duplicate elements in an XML Schema and then classify them into the similar or non-similar group for the transformation.
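The classification step mentioned in the two similarity processes above can be sketched as a simple threshold test. The element names, the scores, and the threshold of 0.6 are hypothetical values chosen for illustration; how the classification value is actually determined is not described in this chapter.

```python
# Hypothetical similarity scores in [0, 1] for element pairs taken from two
# schemas S1 and S2 (illustrative values only).
scores = {
    ("Patient", "Person"): 0.82,
    ("Name", "FullName"): 0.91,
    ("Address", "Diagnosis"): 0.12,
}

# Hypothetical classification value; pairs at or above it count as similar.
THRESHOLD = 0.6


def classify(pair_scores, threshold):
    """Split element pairs into a similar and a non-similar group."""
    similar, non_similar = [], []
    for pair, score in sorted(pair_scores.items()):
        if score >= threshold:
            similar.append(pair)
        else:
            non_similar.append(pair)
    return similar, non_similar


similar, non_similar = classify(scores, THRESHOLD)
print("similar:", similar)
print("non-similar:", non_similar)
```

In an integration scenario the pairs in the similar group would be candidates for merging; in a transformation scenario the same split would be applied to duplicate elements within one schema.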

1.3 Motivation and contributions

From the above overview, a number of research questions arise regarding XML data sharing, which form the motivation for our research:

- How can we improve data sharing between applications that use the same XML system, or between an XML system and a higher semantics-supporting language such as OWL?
- Different XML data sources may be associated with different XML schema types, or may not share a schema type at all. Can we encompass all types of XML data sources with a data transformation or integration approach?
- How can we solve the heterogeneity problem of XML data during integration or transformation?
- Which aspects of XML data transformation and integration can be automated? Are they clearly distinguishable from the manual aspects? Can we minimize the manual aspects?
- XML data sources may be structurally incompatible, which may lead to loss of information when transforming or integrating them. How can we solve this problem automatically?
- Have existing approaches performed the integration or transformation of XML data? If so, do they have problems that still need to be resolved?

With these research questions as a starting point, this thesis proposes a similarity-measure-based approach for the integration and transformation of heterogeneous XML data sources and makes the following contributions:

1. We propose an integration method based on a similarity measure to improve data sharing between XML applications. For sharing data with higher semantic applications, we propose a method for transforming XML into an OWL ontology that takes into account the similarity of duplicates in the XML schema.
2. Our approach can be applied to any type of XML data source, regardless of the schema type used, XSD or DTD.
3. We propose a hybrid similarity measure that computes both the semantic and the structural similarities of XML elements.
4. We automate the similarity measurement process for data integration and transformation by providing metrics to compute all similarity factors. No similarity value is supplied by users. Our proposed metrics generate more precise similarity values than manually assigned ones. Moreover, we minimize the manual aspect by proposing a method to determine the weights that balance the roles of the similarity factors.
5. To solve the loss-of-information problem, in the integration process our integrator takes the union of all elements in the XML schemas instead of retaining only the common elements. In the transformation process, we follow the structural descriptions of the XML schemas to transform all elements and their relationships with other elements into appropriate OWL concepts.
6. Several approaches have been proposed to integrate and transform XML data. Our methods improve on this existing work for the following reasons. In most related integration approaches, the data type similarity, cardinality constraint similarity, and weight parameter values are given manually, whereas we provide novel metrics to determine those values. Most existing transformation methods solve the duplicate problem of XML data by simply giving each XML element a unique identifier, which may cause data redundancy when duplicates represent the same information. We resolve this duplicate problem by measuring the similarity of duplicates and providing an appropriate strategy to transform them.
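One common way to combine a semantic and a structural score with a balancing weight, as contributions 3 and 4 describe in general terms, is a convex combination. The sketch below assumes that form purely for illustration; the function name, the single weight parameter, and the example values are hypothetical, and the thesis's own formulas are defined in later chapters rather than here.

```python
def hybrid_similarity(semantic, structural, weight):
    """Combine two scores in [0, 1] with a weight in [0, 1].

    Assumed form (not taken from the thesis):
        weight * semantic + (1 - weight) * structural
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * semantic + (1.0 - weight) * structural


# Hypothetical scores for one element pair, weighted equally.
score = hybrid_similarity(semantic=0.8, structural=0.4, weight=0.5)
print(round(score, 3))
```

With a weight of 1 the result depends on the semantic score alone, and with a weight of 0 on the structural score alone, which is what makes the choice of weights worth determining systematically rather than by hand.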

With respect to existing approaches to XML schema transformation and integration, our approach makes a number of contributions:

1. We propose a new metric to measure the data type similarity between two attribute types, whereas the data type similarity value is given manually in related work.
2. We present a novel metric to measure the similarity of cardinality constraints, which is also given manually by the user in related work.
3. To handle the case in which two nodes have the same structure but different names, we compute the structural similarity of two concepts by relying on the semantic similarity of each pair of their neighborhood elements.
4. We present an algorithm to calculate the similarity between two schema trees based on the similarity values of the element pairs.
5. We propose a method to determine the weight parameters that balance the roles of the similarity measuring factors.
6. We identify the semantic problem that arises when duplicate elements in an XML schema are transformed into an ontology.
7. We propose a method to measure the semantic similarity between repeated elements, which considers not only the relationship similarity but also the internal descriptions of each duplicate node.
8. We propose a method to formally determine the duplicate classification value.
9. We propose a strategy to transform XML schemas and their duplicates into an ontology.
10. Finally, our approach addresses the problem of human intervention during integration and of data redundancy in the transformation of XML data. Experimental results show that our method outperforms related work in terms of semantics and accuracy.

1.4 Thesis Organization

This section describes the road map for the entire thesis. The thesis organization is shown in Figure 1.3. A brief summary of each chapter follows.

Chapter 1 Introduction. This chapter briefly introduces the popularity of XML data and an example of XML in an e-health system. It addresses the challenges and disadvantages that stem from XML's flexibility in creating new documents and from its lack of semantic support. It then states the focus and contributions of the dissertation.

Chapter 2 Background and Related Work. This chapter has two sections. First, we review background knowledge on XML data and OWL ontology. Second, we give a comprehensive survey of existing work, especially work that relates to two problems: measuring the similarity between XML Schema documents and transforming XML into an OWL ontology. The state of the art and the limitations of existing work are addressed.

Chapter 3 Semantic and Structural Similarity between XML Schemas. This chapter describes in detail the proposed solution to the semantic and structural similarity measurement problem.

Chapter 4 Duplicate and Transforming XML schemas into OWL ontology. This chapter describes the semantic similarity measure for duplicate elements in XML schemas, proposes a solution for each similarity level, and transforms all XML schema elements into an OWL ontology.

Chapter 5 Experimental results and discussions. Comprehensive experiments are conducted, and the results are analyzed to highlight the advantages of the proposed algorithms.

Chapter 6 Conclusion and future work. This chapter gives a conclusion. It also points out some limitations of the work, together with potential solutions that may need further research effort.

Figure 1.3: Thesis organization.

Chap. 1: Introduction
    Motivations of proposed integration and transformation XML data based similarity measure

Chap. 2: Related Work
    Overview of XML and ontology
    Related work
    - XML integration and similarity between documents.
      + Structure based approaches
      + Semantics based approaches
      + Hybrid approaches
    - XML transformation and similarity within document.
      + XML2OWL
      + Element similarity within single document.

Chap. 3: XML similarity measure for data integration
    Propose a complete hybrid similarity framework.
    Propose novel metrics to compute data type and constraint similarities
    Provide novel method to balance similarity factors.

Chap. 4: Duplicate similarity measure for transformation XML into OWL
    Propose a novel method to solve the duplicate problem in XML2OWL.
    Propose novel metrics to measure duplicate similarities.
    Present effective method to determine the classification value.
    Propose strategy to transform duplicates.

Chap. 5: Experiments and Discussions
    Propose a complete hybrid similarity framework.
    Propose novel metrics to compute data type and constraint similarities
    Provide novel method to balance similarity factors.

Chap. 6: Conclusion and Future Research
    Summary of proposed approaches.
    Future researches:
    - Measure the similarity between different data models.
    - Match different data models.
    - Measure the similarity between Web pages.

Chapter 2
Background and Related Work

Since XML data and ontologies are the two main objects of this dissertation, this chapter gives a brief introduction to their characteristics and technologies. After that, we discuss the research related to our work.

2.1 Background on XML Data and OWL Ontology

XML data

XML (eXtensible Markup Language) is a flexible representation language. There are two varieties of XML data: XML documents and XML schemas. An XML schema provides the data definitions and structure of an XML document [65], while XML documents are instances of an XML schema, each giving a snapshot of what a document may contain. A schema specifies which elements are and are not allowed, which attributes an element may have, how many times an element may occur, and so on. A schema for a document may be included either internally (located within the XML document) or externally (located in a separate file outside the XML document).

There are several XML schema languages, but only two are commonly used: DTD (Document Type Definition) and XML Schema, also called XML Schema Definition (XSD). Both allow the structure of XML documents to be described and their contents to be constrained [79]. A DTD specifies the structure of an XML element by specifying the names of its subelements and attributes. Subelement structure is specified using operators such as * (zero or more elements), + (one or more elements), ? (optional), and | (or), together with the types #PCDATA, ID, IDREF, and enumeration. Compared with XSD, the DTD language is at a disadvantage: it supports only a limited set of data types, has loose structural constraints, uses a syntax different from XML, and so on. To overcome these limitations, the XSD language provides new features such as simple and complex types, a rich set of data types, occurrence constraints, and, in particular, a syntax that is itself XML. An XML Schema usually comprises a set of schema components, such as data type definitions and cardinality constraint declarations. They can be used to evaluate the validity of well-formed element information items. It is believed that XSD will soon replace DTD due to its flexibility [41]. Throughout this thesis, we use the term XML schema to refer to both DTD and XSD, while XML Schema refers to XSD only. Figure 2.1 illustrates a simple example of an XML document and its corresponding DTD. Figure 2.2 shows the corresponding XML Schema.

Ontology

In computer science, an ontology is an explicit specification of a conceptualization [31], i.e., an ontology is a model that describes the concepts of a problem domain, as well as the associations between those concepts. An ontology can be used as an interface to one or more data sources, which means that it can be used as a schema, or it can be used to reason about the problem domain.
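Returning to the DTD occurrence operators listed in the XML data subsection, the following sketch shows how they correspond to minimum and maximum occurrence counts, which is the information XSD expresses with minOccurs and maxOccurs. The function name and the use of None for "unbounded" are choices made for this illustration; the parser is deliberately minimal and handles only flat, comma-separated content models.

```python
import re

# Standard DTD occurrence indicators mapped to (min, max) occurrences.
# None stands for "unbounded" (an encoding chosen for this sketch).
OCCURRENCE = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def child_cardinalities(content_model):
    """Return {child name: (min, max)} for a flat DTD content model.

    Operators applied to a parenthesised group, '|' choices, and
    keywords such as #PCDATA are not handled by this sketch.
    """
    result = {}
    for name, op in re.findall(r"([A-Za-z_][\w.\-]*)([?*+]?)", content_model):
        result[name] = OCCURRENCE[op]
    return result


# A content model of the kind used for a Company element with an optional
# Sector child, and one for a root element with one or more Company children.
company = child_cardinalities("(Symbol, Name, Sector?, Industry, (Profile))")
companies = child_cardinalities("(Company+)")

print(company)
print(companies)
```

Read this way, Sector? corresponds to an element that may occur zero or one time, and Company+ to an element that must occur at least once with no upper bound.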

    <?xml version="1.0" encoding="UTF-8"?>
    <Companies>
      <Company>
        <Symbol> Eagle.img </Symbol>
        <Name> EagleFarm </Name>
        <Industry> Dairy </Industry>
        <Profile>
          <MarketCap> 1000 </MarketCap>
          <EmployeeNo> 20 </EmployeeNo>
          <Address>
            <State> QLD </State>
          </Address>
          <Description> gdsfkls </Description>
        </Profile>
      </Company>
      ...
    </Companies>

    <!DOCTYPE Companies [
      <!ELEMENT Companies (Company+)>
      <!ELEMENT Company (Symbol, Name, Sector?, Industry, (Profile))>
      <!ELEMENT Profile (MarketCap, EmployeeNo, (Address), Description)>
      <!ELEMENT Address (State, City?)>
      <!ELEMENT Symbol (#PCDATA)>
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Sector (#PCDATA)>
      <!ELEMENT Industry (#PCDATA)>
      <!ELEMENT MarketCap (#PCDATA)>
      <!ELEMENT EmployeeNo (#PCDATA)>
      <!ELEMENT State (#PCDATA)>
      <!ELEMENT City (#PCDATA)>
      <!ELEMENT Description (#PCDATA)>
    ]>

Figure 2.1: Example of an XML document and its respective DTD

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="Companies">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="Company" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name="Symbol" type="xsd:string"/>
                  <xsd:element name="Name" type="xsd:string"/>
                  <xsd:element name="Sector" type="xsd:string" minOccurs="0"/>
                  <xsd:element name="Industry" type="xsd:string"/>
                  <xsd:element name="Profile">
                    <xsd:complexType>
                      <xsd:sequence>
                        <xsd:element name="MarketCap" type="xsd:string"/>
                        <xsd:element name="EmployeeNo" type="xsd:unsignedInt"/>
                        <xsd:element name="Address">
                          <xsd:complexType>
                            <xsd:sequence>
                              <xsd:element name="State" type="xsd:string"/>
                              <xsd:element name="City" type="xsd:string" minOccurs="0"/>
                            </xsd:sequence>
                          </xsd:complexType>
                        </xsd:element>
                        <xsd:element name="Description" type="xsd:string"/>
                      </xsd:sequence>
                    </xsd:complexType>
                  </xsd:element>
                </xsd:sequence>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>

Figure 2.2: Example of the respective XSD of the document in Figure 2.1

RDF (Resource Description Framework) [64] is a family of W3C specifications used primarily to specify information about a problem domain. An RDF statement is a triple of the form subject-predicate-object; a set of RDF statements therefore forms a labeled, directed graph. RDF Schema is one of the W3C RDF specifications and allows the definition of RDF vocabularies. Note that RDF can also be used as a data format for exchanging and integrating data from different information systems.

OWL (Web Ontology Language) [37], like RDF Schema, is used to define ontologies. OWL is a Semantic Web language designed to represent richer and more complex knowledge about things, groups of things, and relations between things than RDF can. OWL is a logic-based language, so knowledge expressed in OWL can be reasoned over by computer programs, either to verify the consistency of that knowledge or to make implicit knowledge explicit. OWL documents, known as ontologies, can be published on the World Wide Web and may refer to, or be referred to from, other OWL ontologies. The OWL language has three increasingly expressive sublanguages, as follows:

OWL Lite [59], [1] supports users who primarily need a classification hierarchy and simple constraint features. For example, the cardinality constraints in OWL Lite allow only the cardinality values 0 and 1. OWL Lite thus provides a quick migration path for thesauri and other taxonomies.

OWL DL [71], [60] supports users who want maximum expressiveness while retaining computational completeness (all conclusions are guaranteed to be computable) and decidability (all computations finish in finite time) in reasoning systems. OWL DL includes all the OWL language constructs, but with restrictions such as type separation (for instance, a class cannot also be an individual or a property, and a property cannot also be an individual or a class). OWL DL is so named because of its correspondence with description logics, a field of research that has studied a particular decidable fragment of first-order logic. OWL DL was designed to support the existing Description Logic business segment and has desirable computational properties for reasoning systems.

OWL Full [44], [15] is meant for users who want maximum expressiveness and the syntactic freedom of RDF with no computational guarantees. For example, in OWL Full a class can be treated simultaneously as a collection of individuals and as an individual in its own right. Another significant difference from OWL DL is that an OWL Full data type property may be inverse functional. OWL Full allows an ontology to augment the meaning of the predefined (RDF or OWL) vocabulary. It is unlikely that any reasoning software will be able to support every feature of OWL Full.

Each of these sublanguages is an extension of its simpler predecessor, both in what can be legally expressed and in what can be validly concluded. The following set of relations holds.
Every legal OWL Lite ontology is a legal OWL DL ontology.
Every legal OWL DL ontology is a legal OWL Full ontology.
Every valid OWL Lite conclusion is a valid OWL DL conclusion.
Every valid OWL DL conclusion is a valid OWL Full conclusion.

OWL fundamental constructs

In this section, we present the fundamental elements of OWL: classes, properties, and individuals. Every named OWL construct is uniquely identified by an rdf:ID. OWL classes describe sets of individuals that share common properties and belong to the same group. The most basic concepts of a domain correspond to classes that are the roots of various taxonomic trees. Every individual in an OWL document is a member of the class owl:Thing; thus, each user-defined class is implicitly a subclass of owl:Thing. Domain-specific root classes are defined by simply declaring a named class. OWL also defines the empty class, owl:Nothing. Figure 2.3 shows two declarations of root classes inside an OWL ontology.

    <owl:Class rdf:ID="RedWine"/>
    <owl:Class rdf:ID="Winery"/>

Figure 2.3: OWL root classes

OWL classes are defined inside an <owl:Class> element. The declarations in Figure 2.3 give only the unique IDs of the classes, without further detail. A class can be defined as the union, intersection, or complement of other classes by using the constructs owl:unionOf, owl:intersectionOf, and owl:complementOf, respectively, or as an enumeration of its members by using the construct owl:oneOf. Moreover, the fundamental taxonomic construct for classes is rdfs:subClassOf. It connects a more

particular class with a more general one. The rdfs:subClassOf relation implies instance membership: if X is a subclass of the class Y, then every instance of X is also an instance of Y. The rdfs:subClassOf relation is also transitive, so that if X is a subclass of class Y and Y is a subclass of class Z, then X is a subclass of Z. Moreover, an OWL class can carry further descriptions that extend the definition of a resource. For example, see the use of rdf:about in Figure 2.4.

    <owl:Class rdf:about="#RedWine">
      <rdfs:subClassOf rdf:resource="#Wine"/>
    </owl:Class>

Figure 2.4: OWL subclass definition

Figure 2.4 shows how the class RedWine is derived from the general class Wine. The construct rdf:about is used because the class RedWine has already been declared and we now want to extend it by relating it to a more general class through the subclass mechanism, so that it inherits the properties and characteristics of Wine. Furthermore, two OWL classes may be declared equivalent or disjoint by using the mapping constructs owl:equivalentClass and owl:disjointWith, respectively.

OWL individuals are the instances of classes; see the example in Figure 2.5. Instances are declared either by using the rdf:type construct or by using the name of the class as the name of the element in which the individual is defined. Individuals may have properties and must satisfy all the constraints predefined for the corresponding OWL class.

OWL properties state general facts about classes and specific facts about class individuals. There are two categories of properties: object properties and data type properties.
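The two properties of rdfs:subClassOf stated above (transitivity, and instance membership in every superclass) can be sketched in a few lines of Python. This is a minimal illustration, not an OWL reasoner: the dictionary layout and the function names superclasses and types_of are assumptions made for this example, and the class Beverage is a hypothetical superclass of Wine added only to show a two-step chain.

```python
def superclasses(cls, subclass_of):
    """Return every class reachable from cls via subClassOf (transitive closure)."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def types_of(individual, direct_type, subclass_of):
    """An instance of X is also an instance of every superclass of X."""
    cls = direct_type[individual]
    return {cls} | superclasses(cls, subclass_of)


# Direct rdfs:subClassOf statements; "Beverage" is hypothetical.
subclass_of = {"RedWine": {"Wine"}, "Wine": {"Beverage"}}
# Direct rdf:type statement for the individual from Figure 2.5.
direct_type = {"Syrah": "RedWine"}

if __name__ == "__main__":
    print(sorted(superclasses("RedWine", subclass_of)))
    print(sorted(types_of("Syrah", direct_type, subclass_of)))
```

The visited set keeps the traversal finite even if the subclass statements contain a cycle, which OWL permits (mutually subsumed classes are simply equivalent).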

    <RedWine rdf:ID="Syrah"/>

    OR

    <owl:Thing rdf:ID="Syrah"/>
    <owl:Thing rdf:about="#Syrah">
      <rdf:type rdf:resource="#RedWine"/>
    </owl:Thing>

Figure 2.5: OWL class individual

Object properties are relations between the instances of two classes. An object property is described using the owl:ObjectProperty construct, which connects individuals of the domain class with individuals of the range class. Data type properties are relations between class instances and RDF literals or XML Schema data types. A data type property is defined by using the owl:DatatypeProperty construct, which relates individuals of the domain class to values of the range data type. An example of a data type property is illustrated in Figure 2.6.

    <owl:Class rdf:ID="VintageYear"/>
    <owl:DatatypeProperty rdf:ID="yearValue">
      <rdfs:domain rdf:resource="#VintageYear"/>
      <rdfs:range rdf:resource="&xsd;positiveInteger"/>
    </owl:DatatypeProperty>

Figure 2.6: OWL Datatype property definition

Figure 2.6 defines a data type property that relates the vintage years of a wine production to positive integers. An instance of the VintageYear class is shown in Figure 2.7.
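To make the roles of domain and range concrete, the sketch below models the yearValue property of Figure 2.6 in Python and checks a literal against the lexical form of xsd:positiveInteger. It rests on a simplifying assumption that should be stated plainly: it treats rdfs:domain and rdfs:range as validation constraints, whereas an OWL reasoner uses them to infer types. The class DatatypeProperty and the functions is_positive_integer and conforms are hypothetical names for this illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable


def is_positive_integer(lexical: str) -> bool:
    """True if the literal is an optional '+', then ASCII digits, with value > 0."""
    text = lexical.strip()
    if not re.fullmatch(r"\+?[0-9]+", text):
        return False
    return int(text) > 0


@dataclass
class DatatypeProperty:
    name: str                           # e.g. "yearValue"
    domain: str                         # class whose individuals carry the property
    range_check: Callable[[str], bool]  # predicate for the range data type


year_value = DatatypeProperty("yearValue", "VintageYear", is_positive_integer)


def conforms(prop: DatatypeProperty, individual_class: str, literal: str) -> bool:
    """Check both the domain class and the range data type (simplified view)."""
    return individual_class == prop.domain and prop.range_check(literal)


if __name__ == "__main__":
    print(conforms(year_value, "VintageYear", "1998"))
    print(conforms(year_value, "VintageYear", "-5"))
```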

    <VintageYear rdf:ID="Year1998">
      <yearValue rdf:datatype="&xsd;positiveInteger">1998</yearValue>
    </VintageYear>

Figure 2.7: OWL Class instance with datatype property

Term definitions

Since this thesis frequently uses the terms structure and semantics, we restate their definitions in this section. According to the business dictionary [30], structure is the construction of identifiable elements in which each element is functionally connected to the others, and in which the interrelationships between elements are fixed or change only occasionally or slowly. Based on this definition, we infer that the structure of an XML element is the relation of that element to its ancestor, sibling, and descendant elements. Therefore, the structural similarity of XML elements is the combination of the similarity scores of those related elements.

According to Kamil [99], semantics is the scientific study of the meaning of words. This meaning is analyzed in terms of semantic features, that is, the way a word is used in a document. From this definition, we conclude that the semantic similarity between XML elements is the combination of the similarity in meaning of their element names and the similarities of their other characteristics, such as data type and cardinality constraint.
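As a purely illustrative reading of the structural definition above, the following Python sketch combines three per-relation scores (ancestors, siblings, descendants) into one structural similarity value. Everything specific here is an assumption made for the example: the Jaccard coefficient as the per-relation score, the equal weights, the function names, and the second element context, which is hypothetical. The measures actually proposed in this thesis may combine the scores differently.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections of element names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty contexts are treated as identical
    return len(a & b) / len(a | b)


RELATIONS = ("ancestors", "siblings", "descendants")


def structure_similarity(ctx1, ctx2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the per-relation similarity scores."""
    return sum(w * jaccard(ctx1[r], ctx2[r]) for w, r in zip(weights, RELATIONS))


# Context of the Address element in Figure 2.1.
address_a = {
    "ancestors": {"Companies", "Company", "Profile"},
    "siblings": {"MarketCap", "EmployeeNo", "Description"},
    "descendants": {"State", "City"},
}
# Hypothetical Address element from another schema.
address_b = {
    "ancestors": {"Companies", "Company"},
    "siblings": {"Name"},
    "descendants": {"State", "City", "Zip"},
}

if __name__ == "__main__":
    print(round(structure_similarity(address_a, address_b), 4))
```

With these inputs the ancestor and descendant scores are both 2/3 and the sibling score is 0, so the equally weighted combination is 4/9.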

Related Work

As mentioned in the previous chapter, our goal is to enhance data sharing between XML applications through the integration of XML Schemas (XSDs) and the transformation of XML schemas (XSD/DTD) into OWL ontologies, both based on similarity measures. To perform these two tasks, it is necessary to measure the similarity of elements in XML schemas. The main difference between the similarity measures used in the two tasks is that the first is based on the similarity of elements in two different documents, whereas the second relies on the similarity of elements within a single document. Therefore, this section has two subsections: XML integration with element similarity between different documents, and XML transformation with element similarity within a document.

Similarity between documents and XML integration

Much work has addressed the similarity between XML documents. Similarity can be computed at different layers of abstraction: at the instance layer (i.e., similarity between instance documents), at the schema type layer (i.e., similarity between data types, also referred to as schemas, models, or structures, depending on the application domain), or between the two layers, instance and schema. Approaches to XML similarity fall into three categories: (1) structural similarity, (2) semantic (content) similarity, and (3) hybrid approaches that combine semantic (content) and structural similarity.

Structural similarity

Structural similarity focuses mainly on the similarity of the relationships among elements in schema graphs. David Buttler [14] summarized three approaches to structural similarity: (1) tag similarity, (2) tree edit distance (TED), and (3) Fourier transform similarity.

Tag similarity

This is the simplest way to measure the structural similarity between XML documents. It

measures how close the element names of the two XML documents are. Documents that use similar element names are likely to have similar schemas. The measure is the number of element names shared by the compared documents divided by the number of element names in their union. However, this approach is unsuitable for several reasons. One critical problem is that some XML documents derived from the same schema may use only a limited number of element names, whereas others may contain a large number of occurrences of one particular element name. In addition, tag similarity completely ignores the similarity of the relationships between elements and thus yields low-quality similarity results.

Tree edit distance (TED)

According to Bille [9], the tree edit distance between two labeled trees, $T_1$ and $T_2$, is the cost of an optimal (minimum-cost) sequence of edit operations that turns $T_1$ into $T_2$. The edit operations are insertion, deletion, and substitution. Early approaches applied these edit operations only to single nodes. A typical example is Chawathe's method [17]. It restricts insertion and deletion to leaf nodes and allows node labels to be substituted anywhere in the tree, but it does not consider the move operation. The overall complexity of Chawathe's algorithm is $O(N^2)$, where $N$ is the maximum number of nodes of the compared trees. This complexity is expensive and leads to long run times. Therefore, Chawathe's approach is not practical for measuring the similarity of large XML data.

On the other hand, a typical approach that uses more complex edit operations is that of Shasha et al. [103]. They introduce a TED metric that permits the insertion and deletion of single nodes anywhere in the tree, not just at the leaf level. However, entire subtrees cannot be inserted or deleted in one step. The complexity of this approach is $O(|T_1|\,|T_2|\,\mathrm{depth}(T_1)\,\mathrm{depth}(T_2))$, where $|T_1|$ and $|T_2|$ denote the numbers of nodes in the labeled trees $T_1$ and $T_2$, respectively.
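Returning to the tag similarity measure described earlier in this subsection, the following Python sketch computes it with the standard library's xml.etree.ElementTree: the number of distinct element names the two documents share, divided by the number of distinct element names in their union. The function names tag_set and tag_similarity and the small sample documents are assumptions made for this illustration; the element names are borrowed from Figure 2.1.

```python
import xml.etree.ElementTree as ET


def tag_set(xml_text):
    """Collect the distinct element names used anywhere in the document."""
    root = ET.fromstring(xml_text)
    return {element.tag for element in root.iter()}


def tag_similarity(xml_a, xml_b):
    """Shared element names divided by the union of element names."""
    tags_a, tags_b = tag_set(xml_a), tag_set(xml_b)
    # The union is never empty: every document has at least a root element.
    return len(tags_a & tags_b) / len(tags_a | tags_b)


doc_a = "<Company><Name>EagleFarm</Name><Industry>Dairy</Industry></Company>"
doc_b = "<Company><Name>x</Name><Sector>y</Sector><Name>z</Name></Company>"
# Same element names as doc_a, but a completely different nesting.
doc_c = "<Name><Company/><Industry/></Name>"

if __name__ == "__main__":
    print(tag_similarity(doc_a, doc_b))
    print(tag_similarity(doc_a, doc_c))
```

The comparison of doc_a with doc_c shows the weakness noted in the text: the two documents nest their elements differently, yet they receive the maximum score because the measure looks only at element names. Repeated elements, such as the second Name in doc_b, also have no effect on the score.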

Nierman and Jagadish [69] focus on the structural similarity of subtrees. Their edit operations are similar to Chawathe's, but they add two new operations: insert tree and delete tree. To determine subtree similarities, they introduce a containment relationship between trees or subtrees. A labeled tree $T_1$ is said to be contained in a labeled tree $T_2$ if all nodes of $T_1$ occur in $T_2$ with the same parent/child edge relationships and node order. The overall complexity of this algorithm is $O(N^2)$. This approach proved more accurate in detecting XML structural similarities than those of either Chawathe or Shasha.

Also based on Chawathe's method, Dalamagas et al. [23] introduce a framework for clustering XML documents on the basis of their structural similarities. They represent XML documents as rooted, ordered, labeled trees and then study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. Wei Li et al. [51] extend Dalamagas's method to cluster dynamic XML documents whose structures change frequently.

There are three other approaches that are based on structural similarity but achieve higher accuracy than the TED methods. First, Lian et al. [53] represent XML document structures as directed graphs called s-graphs and define a distance metric based on the number of edges common to the graph representations of two XML documents:

$$\mathrm{Dist}(G_1, G_2) = 1 - \frac{|\mathrm{Edges}(G_1) \cap \mathrm{Edges}(G_2)|}{\max\{|\mathrm{Edges}(G_1)|,\ |\mathrm{Edges}(G_2)|\}} \qquad (2.1)$$

Equation 2.1 is more effective than TED-based measures at separating documents that are structurally different. It can be applied not only to tree-structured documents but also to document collections of arbitrary (graph) structure. Second, Bertino et al. [8] proposed a matching algorithm for measuring the structural sim-