Mining Equivalent Relations from Linked Data

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Mining Equivalent Relations from Linked Data"

Transcription

1 Mining Equivalent Relations from Linked Data Ziqi Zhang 1 Anna Lisa Gentile 1 Isabelle Augenstein 1 Eva Blomqvist 2 Fabio Ciravegna 1 1 Deartment of Comuter Science, 2 Deartment of Comuter and Information University of Sheffield, UK Science, Linköing University, Sweden {z.zhang, a.l.gentile, i.augenstein, Abstract Linking heterogeneous resources is a major research challenge in the Semantic Web. This aer studies the task of mining equivalent relations from Linked Data, which was insufficiently addressed before. We roduce an unsuervised method to measure equivalency of relation airs and cluster equivalent relations. Early exeriments have shown encouraging results with an average of 0.75~0.87 recision in redicting relation air equivalency and 0.78~0.98 recision in relation clustering. 1 Introduction Linked Data defines best ractices for exosing, sharing, and connecting data on the Semantic Web using uniform means such as URIs and RDF. It constitutes the conjunction between the Web and the Semantic Web, balancing the richness of semantics offered by Semantic Web with the easiness of data ublishing. For the last few years Linked Oen Data has grown to a gigantic knowledge base, which, as of 2013, comrised 31 billion triles in 295 datasets 1. A major research question concerning Linked Data is linking heterogeneous resources, the fact that ublishers may describe analogous information using different vocabulary, or may assign different identifiers to the same referents. Among such work, many study maings between ontology concets and data instances (e.g., Isaac et al, 2007; Mi et al., 2009; Le et al., 2010; Duan et al., 2012). An insufficiently addressed roblem is linking heterogeneous relations, which is also widely found in data and can cause roblems in information retrieval (Fu et al., 2012). Existing work in linking relations tyically emloy string similarity metrics or semantic similarity mea- 1 htt://lod-cloud.net/state/ sures that require a-riori domain knowledge and are limited in different ways (Zhong et al., 2002; Volz et al., 2009; Han et al., 2011; Zhao and Ichise, 2011; Zhao and Ichise, 2012). This aer roduces a novel method to discover equivalent grous of relations for Linked Data concets. It consists of two comonents: 1) a measure of equivalency between airs of relations of a concet and 2) a clustering rocess to grou equivalent relations. The method is unsuervised; comletely data-driven requiring no a- riori domain knowledge; and also language indeendent. Two tyes of exeriments have been carried out using two major Linked Data sets: 1) evaluating the recision of redicting equivalency of relation airs and 2) evaluating the recision of clustering equivalent relations. Preliminary results have shown encouraging results as the method achieves between 0.75~0.85 recision in the first set of exeriments while 0.78~0.98 in the latter. 2 Related Work Research on linking heterogeneous ontological resources mostly addresses maing classes (or concets) and instances (Isaac et al, 2007; Mi et al., 2009; Le et al., 2010; Duan et al., 2012; Schoman et al., 2012), tyically based on the notions of similarity. This is often evaluated by string similarity (e.g. string edit distance), semantic similarity (Budanitsky and Hirst, 2006), and distributional similarity based on the overla in data usage (Duan et al., 2012; Schoman et al., 2012). There have been insufficient studies on maing relations (or roerties) across ontologies. Tyical methods make use of a combination of string similarity and semantic similarity metrics (Zhong et al., 2002; Volz et al., 2009; Han et al., 2011; Zhao and Ichise, 2012). While string similarity fails to identify equivalent relations if their lexicalizations are distinct, semantic similarity often deends on taxonomic structures 289 Proceedings of the 51st Annual Meeting of the Association for Comutational Linguistics, ages , Sofia, Bulgaria, August c 2013 Association for Comutational Linguistics

2 in existing ontologies (Budanitsky and Hirst, 2006). Unfortunately many Linked Data instances use relations that are invented arbitrarily or originate in rudimentary ontologies (Parundekar et al., 2012). Distributional similarity has also been used to discover equivalent or similar relations. Mauge et al. (2012) extract roduct roerties from an e-commerce website and align equivalent roerties using a suervised maximum entroy classification method. We study linking relations on Linked Data and roose an unsuervised method. Fu et al. (2012) identify similar relations using the overla of the subjects of two relations and the overla of their objects. On the contrary, we aim at identifying strictly equivalent relations rather than similarity in general. Additionally, the techniques roduced our work is also related to work on aligning multilingual Wikiedia resources (Adar et al., 2009; Bouma et al., 2009) and semantic relatedness (Budanitsky and Hirst, 2006). 3 Method Let t denote a 3-tule (trile) consisting of a subject (t s ), redicate (t ) and object (t o ). Linked Data resources are tyed and its tye is called class. We write tye (t s ) = c meaning that t s is of class c. denotes a relation and r is a set of triles whose t =, i.e., r ={t t = }. Given a secific class c, and its airs of relations (, ) such that r ={t t =, tye(t s )=c} and r ={t t =, tye (t s )=c}, we measure the equivalency of and and then cluster equivalent relations. The equivalency is calculated locally (within same class c) rather than globally (across all classes) because two relations can have identical meaning in secific class context but not necessarily so in general. For examle, for the class Book, the relations db:title and foaf:name are used with the same meaning, however for Actor, db:title is used erchangeably with awards db:awards (e.g., Oscar best actor). In ractice, given a class c, our method starts with retrieving all t from a Linked Data set where tye(t s )=c, using the universal query language SPARQL with any SPARQL data endo. This data is then used to measure equivalency for each air of relations (Section 3.1). The equivalence scores are then used to grou relations in equivalent clusters (Section 3.2). 3.1 Measure of equivalence The equivalence for each distinct air of relations deends on three comonents. Trile overla evaluates the degree of overla 2 in terms of the usage of relations in triles. Let SO() be the collection of subject-object airs from r and SO the ersection SO (,' ) SO( r ) SO( r ) [1] then the trile overla TO(, ) is calculated as SO( r,r' ) SO( r,r' ) MAX {, } [2] r r Intuitively, if two relations and have a large overla of subject-object airs in their data instances, they are likely to have identical meaning. The MAX function allows addressing infrequently used, but still equivalent relations (i.e., where the overla covers most triles of an infrequently used relation but only a very small roortion of a much more frequently used). Subject agreement While trile overla looks at the data in general, subject agreement looks at the overla of subjects of two relations, and the degree to which these subjects have overlaing objects. Let S() return the set of subjects of relation, and O( s) returns the set of objects of relation whose subjects are s, i.e.: O( s ) we define: S ' ' O( r s ) {t t,t s} [3] o (,' ) S( r ) S( r ) [4] s S (,' ) 0, S ( ' 1,if O( s ) O( ' s ) 0, ' ) s otherwise [5] S(,' ) / S( ) S( ' ) [6] then the agreement AG(, ) is AG (,' ) [7] Equation [5] counts the number of overlaing subjects whose objects have at least one overla. The higher the value of α, the more the two relations agree in terms of their shared subjects. For each shared subject of and we count 1 if they have at least 1 overlaing object and 0 otherwise. This is because both and can be 1:many relations and a low overla value could mean that one is densely oulated while the other is not, which does not necessarily mean they do not agree. Equation [6] evaluates the degree to which two relations share the same set of subjects. The agreement AG(, ) balances the two factors by taking the roduct. As a result, 2 In this aer overla is based on exact match. 290

3 relations that have high level of agreement will have more subjects in common, and higher roortion of shared subjects with shared objects. Cardinality ratio is a ratio between cardinality of the two relations. Cardinality of a relation CD() is calculated based on data: r CD( ) [8] S( r ) and the cardinality ratio is calculated as MIN {CD( ),CD( ' )} CDR(, ' ) [9] MAX {CD( ),CD( ' )} The final equivalency measure egrates all the three comonents to return a value in [0, 2]: TO(,' ) AG(,' ) E(,' ) [10] CDR(,' ) The measure will favor two relations that have similar cardinality. 3.2 Clustering We aly the measure to every air of relations of a concet, and kee those with a non-zero equivalence score. The goal of clustering is to create grous of equivalent relations based on the air-wise equivalence scores. We use a simle rule-based agglomerative clustering algorithm for this urose. First, we rank all relation airs by their equivalence score, then we kee a air if (i) its score and (ii) the number of triles covered by each relation are above a certain threshold, T mineqvl and T mintp resectively. Each air forms an initial cluster. To merge clusters, given an existing cluster c and a new air (, ) where either c or c, the air is added to c if E(, ) is close (as a fractional number above the threshold T mineqvlrel ) to the average scores of all connected airs in c. This reserves the strong connectivity in a cluster. This is reeated until no merge action is taken. Adjusting these thresholds allows balancing between recision and recall. 4 Exeriment Design To our knowledge, there is no ublically available gold standard for relation equivalency using Linked Data. We randomly selected 21 concets (Figure 1) from the DBedia ontology (v3.8): Actor, Aircraft, Airline, Airort, Automobile, Band, BasketballPlayer, Book, Bridge, Comedian, Film, Hosital, Magazine, Museum, Restaurant, Scientist, TelevisionShow, TennisPlayer, Theatre, University, Writer Figure 1. Concets selected for evaluation. We aly our method to each concet to discover clusters of equivalent relations, using as SPARQL endo both DBedia 3 and Sindice 4 and reort results searately. This is to study how the method erforms in different conditions: on one hand on a smaller and cleaner dataset (DBedia); on the other hand on a larger and multi-lingual dataset (Sindice) to also test crosslingual caability of our method. We chose relatively low thresholds, i.e. T mineqvl =0.1, T mintp = 0.01% and T mineqvlrel =0.6, in order to ensure high recall without sacrificing much recision. Four human annotators manually annotated the outut for each concet. For this reliminary evaluation, we have limited the amount of annotations to a maximum of 100 to scoring airs of relations er concet, resulting in 16~100 airs er concet (avg. 40) for DBedia exeriment and 29~100 airs for Sindice (avg. 91). The annotators were asked to rate each edge in each cluster with -1 (wrong), 1 (correct) or 0 (cannot decide). Pairs with 0 are ignored in the evaluation (about 12% for DBedia; and 17% for Sindice mainly due to unreadable encoded URLs for certain languages). To evaluate cross-lingual airs, we asked annotators to use translation tools. Inter-Annotator-Agreement (observed IAA) is shown in Table 1. Also using this data, we derived a gold standard for clustering based on edge connectivity and we evaluate (i) the recision of to n% ranked equivalent relation airs and (ii) the recision of clustering for each concet. Mean High Low DBedia Sindice Table 1. IAA on annotating air equivalency So far the outut of 13 concets has been annotated. This dataset 5 contains 1800 relation airs and is larger than the one by Fu et al. (2012). Annotation rocess shows that over 75% of relation airs in the Sindice exeriment contain non-english relations and mostly are crosslingual. We used this data to reort erformance, although the method has been alied to all the 21 concets, and the comlete results can be visualized at our demo website link. Some examles are shown in Figure 2. 3 htt://dbedia.org/sarql 4 htt://sarql.sindice.com/ 5 htt://staffwww.dcs.shef.ac.uk/eole/z.zhang/ resources/aer/acl2013short/web/ 291

4 Figure 2. Examles of visualized clusters 5 Result and Discussion Figure 3 for air equivalency 6 and Figure 4 shows clustering recision. Figure The box lots show the ranges of recision at each n%; the lines show the average. Figure 4. Clustering recision As it is shown in Figure 2, Linked Data relations are often heterogeneous. Therefore, finding equivalent relations to imrove coverage is imortant. Results in Figure 3 show that in most cases the method identifies equivalent relations with high recision. It is effective for both single- and cross-language relation airs. The worst erforming case for DBedia is Aircraft (for all n%), mostly due to dulicating numeric valued objects of different relations (e.g., weight, length, caacity). The decreasing recision with resect to n% suggests the measure effectively ranks correct airs to the to. This is a useful feature from IR o of view. Figure 4 shows that the method effectively clusters equivalent relations with very high recision: 0.8~0.98 in most cases. Overall we believe the results of this early roofof-concet are encouraging. As a concrete examle to comare against Fu et al. (2012), for BasketballPlayer, our method creates searate clusters for relations meaning draft team and former team because although they are similar they are not equivalent. We noticed that annotating equivalent relations is a non-trivial task. Sometimes relations and their corresonding schemata (if any) are oorly documented and it is imossible to understand the meaning of relations (e.g., due to acronyms) and even very difficult to reason based on data. Analyses of the evaluation outut show that errors are tyically found between highly similar relations, or whose object values are numeric tyes. In both cases, there is a very high robability of having a high overla of subject-object airs between relations. For examle, for Aircraft, the relations db:heightin and db: weight are redicted to be equivalent because many instances have the same numeric value for the roerties. Another examle are the Airort roerties db:runwaysurface, db:r1surface, db:r2surface etc., which according to the data seem to describe the construction material (e.g., concrete, ashalt) of airort runways. The relations are semantically highly similar and the object values have a high overla. A otential solution to such issues is incororating ontological knowledge if available. For examle, if an ontology defines the two distinct roerties of Airort without exlicitly defining an equivalence relation between them, they are unlikely to be equivalent even if the data suggests the oosite. 6 Conclusion This aer roduced a data-driven, unsuervised and domain and language indeendent method to learn equivalent relations for Linked Data concets. Preliminary exeriments show encouraging results as it effectively discovers equivalent relations in both single- and multilingual settings. In future, we will revise the equivalence measure and also exeriment with clustering algorithms such as (Beeferman et al., 2000). We will also study the contribution of individual comonents of the measure in such task. Large scale comarative evaluations (incl. recall) are lanned and this work will be extended to address other tasks such as ontology maing and ontology attern mining (Nuzzolese et al., 2011). 6 Per-concet results are available on our website. 292

5 Acknowledgement Part of this research has been sonsored by the EPSRC funded roject LODIE: Linked Oen Data for Information Extraction, EP/J019488/1. Additionally, we also thank the reviewers for their valuable comments given for this work. References Eytan Adar, Michael Skinner, Daniel Weld Information Arbitrage across Multilingual Wikiedia. Proceedings of the Second ACM International Conference on Web Search and Data Mining, Gosse Bouma, Sergio Duarte, Zahurul Islam Cross-lingual Alignment and Comletion of Wikiedia Temlates. Proceedings of the Third International Worksho on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, Doug Beeferman, Adam Berger Agglomerative clustering of a search engine query log. Proceedings of the sixth ACM SIGKDD ernational conference on Knowledge discovery and data mining, Alexander Budanitsky and Graeme Hirst Evaluating WordNet-based Measures of Semantic Distance. Comutational Linguistics, 32(1), Songyun Duan, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, and Michael J. Ward Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. ISWC 2012, Linyun Fu, Haofen Wang, Wei Jin, Yong Yu Towards better understanding and utilizing relations in DBedia. Web Intelligence and Agent Systems, Volume 10 (3) Andrea Nuzzolese, Aldo Gangemi, Valentina Presutti, Paolo Ciancarini Encycloedic Knowledge Patterns from Wikiedia Links. Proceedings of the 10th International Semantic Web Conference, Lushan Han, Tim Finin and Anuam Joshi GoRelations: An Intuitive Query System for DBedia. Proceedings of the Jo International Semantic Technology Conference Antoine Isaac, Lourens van der Meij, Stefan Schlobach, Shenghui Wang An emirical study of instance-based ontology matching. Proceedings of the 6th International Semantic Web Conference and the 2nd Asian conference on Asian Semantic Web Conference, Ngoc-Thanh Le, Ryutaro Ichise, Hoai-Bac Le Detecting hidden relations in geograhic data. Proceedings of the 4th International Conference on Advances in Semantic Processing, Karin Mauge, Khash Rohanimanesh, Jean-David Ruvini Structuring E-Commerce Inventory. Proceedings of ACL2012, Jinhua Mi, Huajun Chen, Bin Lu, Tong Yu, Gang Pan Deriving similarity grahs from oen linked data on semantic web. Proceedings of the 10th IEEE International Conference on Information Reuse and Integration, Rahul Parundekar, Craig Knoblock, José Luis. Ambite Discovering Concet Coverings in Ontologies of Linked Data Sources. Proceedings of ISWC2012, Balthasar Schoman, Shenghui Wang, Antoine Isaac, Stefan Schlobach Instance-Based Ontology Matching by Instance Enrichment. Journal on Data Semantics, 1(4), Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov Silk A Link Discovery Framework for the Web of Data. Proceedings of the 2nd Worksho on Linked Data on the Web Lihua Zhao, Ryutaro Ichise Mid-ontology learning from linked data. Proceedings of the Jo International Semantic Technology Conference, Lihua Zhao, Ryutaro Ichise Grah-based ontology analysis in the linked oen data. Proceedings of the 8th International Conference on Semantic Systems, Jiwei Zhong, Haiing Zhu, Jianming Li and Yong Yu Concetual Grah Matching for Semantic Search. The 2002 International Conference on Comutational Science. 293