Learning to Match XML Schemas: A Decision Tree Based Approach

A. Rajesh (1) and S. K. Srivatsa (2)
(1) Research Scholar, Dr. MGR Educational and Research Institute University, Maduravoyil, Chennai, India. Email: amrajesh2@rediffmail.com
(2) Senior Professor, St. Joseph College of Engineering, Chennai, India. Email: profsks@rediffmail.com

Abstract
Schema matching, the task of finding semantic correspondences between the elements of two schemas, is an essential activity in many application domains such as data mining, data warehousing, web data integration, and XML message mapping. In this paper, a decision tree based approach is presented to determine the similarity between two XML schemas. Experimental results reveal that the proposed approach is efficient and reliable in predicting the similarity between two XML schemas.

Index Terms
XML schema matching, Schema matching, Semantic matching, Schema integration, Machine learning

I. INTRODUCTION
The schema matching process typically takes two schemas as input and outputs a set of matching elements between the two schemas. This is a vital step in many application domains such as XML message mapping, web data integration, data warehousing, and information retrieval. Manual identification of matching schema elements is a tedious, time consuming, and error prone process. Hence the need for automatic or semi-automatic approaches to schema matching. Though many approaches to schema matching have been proposed in the literature, there is still room for improvement in terms of efficiency and reliability. In this paper, a decision tree based approach is proposed for matching XML schemas. The approach focuses on XML schemas because they are used as the standard means to express and exchange information among enterprise web applications.
The paper is organized as follows: Section II gives a brief description of some of the successful approaches proposed in the literature, Section III gives the details of the proposed decision tree based approach for matching XML schemas, Section IV gives the experimental evaluation of the proposed approach, and Section V gives the conclusion and future work in this direction.

II. LITERATURE SURVEY
A survey of the schema matching algorithms in the literature indicates that these techniques in general rely on the following factors: label similarity, structural similarity, and constraint similarity. In addition, they may use auxiliary information such as dictionaries, thesauri, and domain specific ontologies [1],[2],[3],[4],[5],[6],[7],[8],[9] to match schemas. Some of these techniques use only schema information [1],[2],[3],[5],[6],[7],[9] for identifying matches and some use instances [5],[8]. A still broader taxonomy of the schema matching approaches is presented in [10]. In the subsequent paragraphs a few of the prominent approaches to the schema matching problem are discussed to give an idea of the progress of research in this area and to highlight how the proposed system differs from these existing approaches.

LSD [10] (Learning Source Descriptions) uses a machine learning approach to the schema matching problem. It uses a set of base learners to learn linguistic, structural, data, and domain specific information. The results from the base learners are given as input to a meta learner, which determines the match between the schema elements.

Cupid [6] is a hybrid matching system that uses a combination of linguistic and structural matching methods to perform the matching process. The linguistic method proposed in the system tokenizes the labels of the given elements and determines the similarity between the elements based on the number of similar tokens between the labels of the elements.
The structural approach determines the similarity between the elements based on the number of similar leaf elements in the subtrees originating from these elements. The final similarity between elements is computed by adding the weighted linguistic and structural similarity measures.

COMA [4] is a composite schema matching system that proposes an architecture for including multiple matching algorithms in the schema matching process. It also proposes methods to combine the results of the various matching algorithms in the system and to determine the similarity between the element pairs.

QMatch [9] is a hybrid matching system that uses a combination of linguistic, property, and structural matching approaches to perform the matching process. The structural matching process uses a path based approach, dominated by the number of similar child elements, to determine the similarity between the elements in the schema.

Based on the analysis of the various approaches to schema matching in the literature, certain characteristics exhibited by similar elements have been identified. These characteristics depend on the level at which the elements occur within the XML schema: leaf, interior node, or root. They are used in the proposed approach to learn effectively how to match XML schemas.

III. DECISION TREE BASED APPROACH FOR XML SCHEMA MATCHING
The proposed approach comprises the following phases: preprocessing, feature similarity computation, decision tree induction, and matching.

A. Preprocessing
The given pair of XML schemas to be matched is parsed into schema tree structures as shown in Fig. 1. A depth first enumeration of each schema tree is obtained. Then similarity measures between the elements of the enumerations are computed under various heads as described in the next section.

B. Feature similarity computation
The feature heads used in determining similarity between schema elements are linguistic similarity, descendants similarity, sibling similarity, and path similarity. The proposed approach uses Edit Distance (Levenshtein Distance), affix match, TF/IDF, and a domain specific thesaurus to determine the linguistic similarity between the elements. The maximum of the similarity measures returned by these linguistic matching approaches is taken as the linguistic similarity measure between the matched elements, as shown in (1).

\[ l(s,t) = \max\bigl( ld(s,t),\ a(s,t),\ tf(s,t),\ th(s,t) \bigr) \tag{1} \]

s: source element
t: target element
ld(s,t): similarity as per edit distance between s and t
a(s,t): similarity as per affix match between s and t
tf(s,t): similarity as per TF/IDF between s and t
th(s,t): similarity as per look up in the thesaurus

The descendant similarity measure is computed as shown in (2).

\[ dessim(s,t) = \frac{lsimleaf(s,t)}{leaf(s) + leaf(t)} \tag{2} \]

s: an interior node from the source tree
t: an interior node from the target tree
lsimleaf(s,t): number of linguistically similar leaves originating from s and t
leaf(n): number of leaves originating from node n

Two elements are considered to be similar if they share the maximum number of similar children between them. Similarly, two elements are considered to be similar if they share the maximum number of similar siblings between them. The sibling similarity measure between any two elements is computed as shown in (3).

\[ sibsim(s,t) = \frac{lsimsib(s,t)}{sib(s) + sib(t)} \tag{3} \]

lsimsib(s,t): number of linguistically similar siblings between the source and target interior nodes
sib(n): number of siblings of an interior node n

The path similarity measure between any two elements is computed as shown in (4).

\[ pathsim(sp,tp) = \frac{lsimelems(sp,tp)}{elems(sp) + elems(tp)} \tag{4} \]

sp: source path
tp: target path
lsimelems(sp,tp): number of linguistically similar elements between the source and target paths
elems(p): number of elements in the path p
pathsim(sp,tp): similarity measure between the source and target paths

Fig. 1. An example schema tree

The various similarity measures between the elements of the source and target schemas are stored in a table. Each tuple in the table contains the following information: source schema element label, target schema element label, linguistic similarity, descendants similarity, sibling similarity, path similarity, source schema element type, and target schema element type. Here the element type indicates whether the element is a leaf (l), interior (i), or root (r) element of the schema tree.

C. Decision Tree Induction
The feature similarity measures described in the previous section are computed for the elements of different pairs of example schemas. Some of the example schema pairs are matched by domain experts. These matched pairs, along with the feature similarity measures of their elements, are used as training samples to build a decision tree. The match information is added as a new column in the table containing the similarity measures between the two schemas, as shown in Table 1. The purpose of the decision tree is to classify a given pair of schema elements as similar or dissimilar based on their feature similarity measures.
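
As an illustration of the feature similarity measures (1) to (4) of Section III.B, the following Python sketch shows one way they could be computed. It is not the authors' implementation. The normalization of the edit distance, the affix measure, the `cutoff` used to call two labels "linguistically similar", and the choice to count similar elements on both the source and the target side are all assumptions made for this example. The TF/IDF term of (1) is omitted because it needs a document corpus. The function names are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """ld(s,t): edit distance scaled to [0, 1] (assumed normalization)."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))


def affix_similarity(s: str, t: str) -> float:
    """a(s,t): longest common prefix or suffix over the longer label (assumed)."""
    if not s or not t:
        return 0.0
    prefix = 0
    for x, y in zip(s, t):
        if x != y:
            break
        prefix += 1
    suffix = 0
    for x, y in zip(reversed(s), reversed(t)):
        if x != y:
            break
        suffix += 1
    return max(prefix, suffix) / max(len(s), len(t))


def linguistic_similarity(s: str, t: str,
                          thesaurus: FrozenSet[Tuple[str, str]] = frozenset()) -> float:
    """Equation (1) without the TF/IDF term: maximum of the individual measures."""
    th = 1.0 if (s, t) in thesaurus or (t, s) in thesaurus else 0.0
    return max(edit_similarity(s, t), affix_similarity(s, t), th)


def ratio_similarity(source: Sequence[str], target: Sequence[str],
                     cutoff: float = 0.5,
                     thesaurus: FrozenSet[Tuple[str, str]] = frozenset()) -> float:
    """Common form of (2), (3), and (4): similar labels over the summed sizes.

    Pass leaf labels for dessim, sibling labels for sibsim, or path element
    labels for pathsim. Similar labels are counted on both sides (assumption).
    """
    total = len(source) + len(target)
    if total == 0:
        return 0.0

    def similar(x: str, y: str) -> bool:
        return linguistic_similarity(x, y, thesaurus) >= cutoff

    matched_source = sum(any(similar(s, t) for t in target) for s in source)
    matched_target = sum(any(similar(s, t) for s in source) for t in target)
    return (matched_source + matched_target) / total
```

Because the three structural measures share the same ratio form, a single helper is used for all of them and only the label lists passed in differ.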
The study of the various schema matching approaches in the literature indicates that the methods used to measure similarity between elements vary in effectiveness depending on the type of the elements, i.e., leaf, interior, or root. For example, the descendants similarity measure does not contribute to measuring similarity between leaf elements, as leaf elements have no descendants. On the other hand, the descendants similarity measure contributes the most to measuring similarity between interior and root elements. Other approaches in the literature account for these variations by assigning weights to the various measures. The proposed approach accounts for them by using a separate decision tree to classify each type of element pair.

The basic algorithm for decision tree induction is a version of ID3, a well known decision tree induction algorithm. The basic strategy is as follows:
1. The root node represents all the training samples.
2. A node becomes a leaf node if the samples are all of the same class, and it is labeled with that class.
3. Otherwise, an attribute is chosen that will best separate the samples into individual classes. Once an attribute is chosen, it is never selected again for separating the samples.
4. A branch is created for each known value of the test attribute, and the samples are partitioned accordingly.
5. The algorithm uses the same process recursively to form a decision tree for the samples at each partition.
6. The recursive partitioning stops only when
   a) all samples at a node belong to the same class,
   b) there are no remaining attributes to test, or
   c) there are no samples for the test attribute.
   For conditions b and c, the node is labeled with the majority class among the samples.

Step 4 of the algorithm assumes that the attributes are discrete valued, but the attributes in the matching examples are continuous valued. Hence, we introduce a small change in the way the algorithm is used for the matching problem. This change concerns steps 3 and 4. The test attribute in step 3 is chosen in the following way:
1. For each attribute, find the minimum value such that all matching element pairs have a value for the attribute that is greater than or equal to this minimum.
2. For each attribute, consider the element pairs in the samples at the node whose value for the attribute is greater than or equal to this minimum, and compute the proportion of matching element pairs among them.
3. The attribute that scores the highest in step 2 is chosen as the test attribute. In case of a tie, the first attribute in the attribute list with the highest score is chosen.
Thus, a distinct decision tree is grown for each type of schema element pair.

D. Matching
In the matching phase we use the decision trees induced from examples in the previous phase to classify unlabeled samples. Based on the type of the elements, an appropriate decision tree is chosen for matching. Once a decision tree is chosen, the test attributes at each level are applied to the samples, and the samples are split successively based on their values of the test attribute at each level until a leaf node is reached. The samples that are carried over to a leaf node are labeled with the class of that leaf node. That is, if the class of the leaf node is match, then the element pairs carried over to this leaf are considered to match each other.

Table 1. A sample similarity information table after manual matching

Element1     Element2            Lmatch  Desmatch  Sibmatch  Pathmatch  Eltype1  Eltype2  correct
ponumber49   ordernumber88       0.58    0.0       0.37      1.0        l        l        1
ponumber49   ouraccountcode90    0.26    0.0       0.36      0.82       l        l        0
ponumber49   youraccountcode91   0.21    0.0       0.35      0.83       l        l        0
ponumber49   telephone101        0.14    0.0       0.30      0.76       l        l        0
ponumber49   postalcode123       0.55    0.0       0.23      0.93       l        l        0
ponumber49   country124          0.25    0.0       0.28      0.56       l        l        0
podate50     telephone101        0.18    0.0       0.27      0.79       l        l        0
poheader44   youraccountcode91   0.26    0.0       0.41      0.90       i        l        0
poheader44   email100            0.12    0.0       0.43      0.77       i        l        0
poheader44   invoiceto86         0.28    0.31      0.59      0.69       i        i        0
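
The modified attribute selection and the matching phase described in Sections III.C and III.D can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the binary split into "at or above the cutoff" and "below the cutoff" branches is inferred from the cutoff labels in the paper's decision trees, an empty partition is labeled with the majority class of its parent node, and the names `Node`, `induce`, `choose_attribute`, and `classify` are hypothetical. Each sample is a pair of a feature dictionary and a class label (1 for match, 0 for no match).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[Dict[str, float], int]  # (feature similarities, 1 = match / 0 = no match)


@dataclass
class Node:
    label: Optional[int] = None      # set on leaf nodes only
    attribute: Optional[str] = None  # test attribute on internal nodes
    cutoff: float = 0.0              # learned minimum value for a match
    high: Optional["Node"] = None    # branch for value >= cutoff
    low: Optional["Node"] = None     # branch for value < cutoff


def majority(samples: Sequence[Sample]) -> int:
    return Counter(label for _, label in samples).most_common(1)[0][0]


def choose_attribute(samples: Sequence[Sample],
                     attributes: Sequence[str]) -> Tuple[str, float]:
    """Steps 1 to 3 of the modified selection; ties go to the first attribute."""
    best_attr, best_cutoff, best_score = None, 0.0, -1.0
    for attr in attributes:
        # Step 1: smallest value taken by any matching pair.
        cutoff = min(feats[attr] for feats, label in samples if label == 1)
        # Step 2: share of matches among the pairs at or above that value.
        covered = [label for feats, label in samples if feats[attr] >= cutoff]
        score = sum(covered) / len(covered)
        if score > best_score:  # strict comparison keeps the first on ties
            best_attr, best_cutoff, best_score = attr, cutoff, score
    return best_attr, best_cutoff


def induce(samples: Sequence[Sample], attributes: Sequence[str],
           parent_majority: int = 0) -> Node:
    if not samples:                      # stopping condition c
        return Node(label=parent_majority)
    labels = {label for _, label in samples}
    if len(labels) == 1:                 # stopping condition a
        return Node(label=labels.pop())
    if not attributes:                   # stopping condition b
        return Node(label=majority(samples))
    attr, cutoff = choose_attribute(samples, attributes)
    remaining: List[str] = [a for a in attributes if a != attr]
    high = [s for s in samples if s[0][attr] >= cutoff]
    low = [s for s in samples if s[0][attr] < cutoff]
    node_majority = majority(samples)
    return Node(attribute=attr, cutoff=cutoff,
                high=induce(high, remaining, node_majority),
                low=induce(low, remaining, node_majority))


def classify(tree: Node, features: Dict[str, float]) -> int:
    """Matching phase: follow the cutoff tests down to a leaf."""
    node = tree
    while node.label is None:
        node = node.high if features[node.attribute] >= node.cutoff else node.low
    return node.label
```

In keeping with the paper's design, one such tree would be induced per element type pair (for example, one from the leaf-leaf rows of Table 1 and another from the interior-interior rows), and `classify` would be called on the tree that corresponds to the types of the pair being matched.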

2009 International Journal of Recent Trends in Engineering, Vol 2, No. 1, November

IV. EXPERIMENTAL EVALUATION
The proposed system was evaluated using XML schemas from diverse domains that have been used extensively in testing similar systems in the literature. The samples and their characteristics are shown in Table 2. Several trials were conducted using these samples in constructing and testing the decision trees. In each trial the samples were divided into training and test sets in random proportions. Some of the resultant decision trees for each type of schema element, with their respective test attributes at each level, are shown in Fig. 2. In Fig. 2, lcutoff is the linguistic match cutoff, scutoff the sibling match cutoff, dcutoff the descendant match cutoff, and pcutoff the path match cutoff. Based on the trials conducted, the optimum values learned by the method for the test attributes are shown in Table 3.

The accuracy of the approach is measured in terms of the precision and recall of the system.

Precision: the proportion of real matches among the candidate matches returned by the system. This gives an estimate of the reliability of the match predictions.

Recall: the proportion of the manually identified real matches that are also identified by the system. This specifies the share of real matches discovered by the system.

Based on the experiments conducted, the recall of the proposed system was found to be one, i.e., all the matches identified manually were also identified by the system. The precision of the system for the various sample schemas is shown in Graph 1.

Table 2. Characteristics of the sample data sets

XML Schemas                     No. of elements  Max. depth  No. of leaves
po, purchase Order variant 1    10 x 9           4 x 3       7 x 7
po, purchase Order variant 2    13 x 15          4 x 4       8 x 8
po, purchase Order variant 3    40 x 43          4 x 4       33 x 33
course schemas variant 1        14 x 16          4 x 4       10 x 12
course schemas variant 2        14 x 20          4 x 5       10 x 15
course schemas variant 3        14 x 20          4 x 4       10 x 16
University schemas              8 x 9            3 x 3       5 x 6
Statistic schemas               14 x 14          2 x 3       10 x 9
Supplier schemas                17 x 43          5 x 2       10 x 34

Fig. 2. Sample decision tree generated (panels: leaf-leaf and root-root; branches test the attributes against lcutoff, dcutoff, scutoff, and pcutoff, and end in match leaves)
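
The precision and recall measures defined in this section can be computed as in the short Python sketch below. The representation of a match as a (source label, target label) pair and the function name `precision_recall` are assumptions made for illustration. The element labels in the usage lines are invented and are not taken from the experimental data.

```python
from typing import Set, Tuple

Match = Tuple[str, str]  # (source element label, target element label)


def precision_recall(predicted: Set[Match], manual: Set[Match]) -> Tuple[float, float]:
    """Precision and recall of system matches against manually identified matches."""
    real_found = len(predicted & manual)  # real matches returned by the system
    precision = real_found / len(predicted) if predicted else 0.0
    recall = real_found / len(manual) if manual else 0.0
    return precision, recall


# Usage with made-up labels: the system returns four candidates,
# three of which are the manually identified matches.
manual = {("a", "x"), ("b", "y"), ("c", "z")}
predicted = manual | {("d", "w")}
precision, recall = precision_recall(predicted, manual)  # 0.75 and 1.0
```

A recall of one with a precision below one, as in this usage example, means that every manual match was found but some extra candidates were also returned.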

Table 3. Parameters learned by the approach

Element types       lcutoff  scutoff  dcutoff  pcutoff
leaf-leaf           0.428    0.233    -        1.0
interior-interior   0.400    -        0.933    1.0
root-root           0.153    -        0.800    -
interior-root       0.655    -        0.899    -
interior-leaf       0.899    0.270    0.921    0.656

Graph 1. System performance (precision and recall, on a scale of 0 to 1, for the schema pairs popair1, popair2, popair3, course1, course2, course3, university, statistics, and suppliers)

The results obtained are quite promising compared to other similar approaches based on machine learning. The precision of the machine learning based systems proposed earlier averaged around 0.75, whereas the precision of the proposed system averages around 0.88.

V. CONCLUSION
The proposed approach is quite effective in identifying one to one correspondences compared with other schemes proposed in the literature. The approach can also be modified to identify many to many correspondences. If a schema violates the pattern learned by the approach, then relearning is required. In future, methods can be devised to learn incrementally based on user feedback on the matches identified by the system.

REFERENCES
[1] Bergamaschi, S., Castano, S., Vincini, M., Beneventano, D., Semantic Integration of Heterogeneous Information Sources, Data & Knowledge Engineering, vol. 36, no. 3, pp. 215-249, 2001.
[2] Bright, M.W., Hurson, A.R., and Pakzad, S.H., Automated Resolution of Semantic Heterogeneity in Multidatabases, in the proceedings of TODS, vol. 19, no. 2, pp. 212-253, 1994.
[3] Berlin, J., and Motro, A., AutoPlex: Automated Discovery of Content for Virtual Databases, in the proceedings of CoopIS, pp. 108-122, 2001.
[4] Hong Hai Do and Rahm, E., COMA - A System for Flexible Combination of Schema Matching Approaches, in the proceedings of Int. Conference on Very Large Data Bases, 2002.
[5] Li, W., Clifton, C., SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks, Data & Knowledge Engineering, vol. 33, no. 1, pp. 49-84, 2000.
[6] Madhavan, J., Bernstein, P., and Rahm, E., Generic Schema Matching with Cupid, in the proceedings of Int. Conference on Very Large Data Bases, pp. 49-58, 2001.
[7] Melnik, S., Garcia-Molina, H., Rahm, E., Similarity Flooding: A Versatile Graph Matching Algorithm, in the proceedings of ICDE, 2002.
[8] Miller, R.J., et al., The Clio Project: Managing Heterogeneity, SIGMOD Record, vol. 30, no. 1, pp. 78-83, 2001.
[9] Naiyana Tansalarak, Kajal T. Claypool, QMatch - Using paths to match XML Schemas, Data & Knowledge Engineering, vol. 60, pp. 260-282, 2007.
[10] Rahm, E., Bernstein, P.A., A Survey of Approaches to Automatic Schema Matching, VLDB Journal, vol. 10, no. 4, 2001.