A typology of ontology-based semantic measures



Similar documents
A Framework for Ontology-Based Knowledge Management System

Visualizing WordNet Structure

How To Find Influence Between Two Concepts In A Network

User Profile Refinement using explicit User Interest Modeling

Word Sense Disambiguation for Text Mining

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

A Non-Linear Schema Theorem for Genetic Algorithms

Discuss the size of the instance for the minimum spanning tree problem.

Does it fit? KOS evaluation using the ICE-Map Visualization.

ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

Semantic Analysis of. Tag Similarity Measures in. Collaborative Tagging Systems

How Well Do Semantic Relatedness Measures Perform? A Meta-Study

A Semantic Dissimilarity Measure for Concept Descriptions in Ontological Knowledge Bases

Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Part-Based Recognition

Why? A central concept in Computer Science. Algorithms are ubiquitous.

An ontology-based approach for semantic ranking of the web search engines results

CHAPTER 1 INTRODUCTION

Characteristics of Computational Intelligence (Quantitative Approach)

Software quality measurement Magne Jørgensen University of Oslo, Department of Informatics, Norway

WikiRelate! Computing Semantic Relatedness Using Wikipedia

The Open University s repository of research publications and other research outputs. PowerAqua: fishing the semantic web

Appendix B Data Quality Dimensions

Chapter 8 The Enhanced Entity- Relationship (EER) Model

Myth or Fact: The Diminishing Marginal Returns of Variable Creation in Data Mining Solutions

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

Cross-lingual Synonymy Overlap

Web-Based Genomic Information Integration with Gene Ontology

Analysis of Algorithms, I

E3: PROBABILITY AND STATISTICS lecture notes

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

Big Ideas in Mathematics

Holland s GA Schema Theorem

SIMS 255 Foundations of Software Design. Complexity and NP-completeness

Ontology Evolution for Customer Services

RECIFE: a MCDSS for Railway Capacity

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Identifying free text plagiarism based on semantic similarity

Cost Models for Vehicle Routing Problems Stanford Boulevard, Suite 260 R. H. Smith School of Business

Protein Protein Interaction Networks

Digital resource discovery: semantic annotation and matchmaking techniques

1 o Semestre 2007/2008

SEMANTIC INTEGRATION OF INFORMATION SYSTEMS

Utilising Ontology-based Modelling for Learning Content Management

Analyzing the Scope of a Change in a Business Process Model

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Measurement Information Model

TREE BASIC TERMINOLOGIES

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

Modeling coherence in ESOL learner texts

6.3 Conditional Probability and Independence

Multi-Algorithm Ontology Mapping with Automatic Weight Assignment and Background Knowledge

5 Discussion and Implications

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

3.4 Statistical inference for 2 populations based on two samples

Overview for Families

n 2 + 4n + 3. The answer in decimal form (for the Blitz): 0, 75. Solution. (n + 1)(n + 3) = n lim m 2 1

Extracting user interests from search query logs: A clustering approach

Representing Reversible Cellular Automata with Reversible Block Cellular Automata

Performance Evaluation and Optimization of Math-Similarity Search

Fairfield Public Schools

Semantic analysis of text and speech

A Neural-Networks-Based Approach for Ontology Alignment

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

A Search Engine to Categorize LOs

On the k-path cover problem for cacti

1 Example of Time Series Analysis by SSA 1

Using semantic properties for real time scheduling

Metadata Management for Data Warehouse Projects

Ontological Modeling: Part 6

Semantic Search in Portals using Ontologies

Depth-of-Knowledge Levels for Four Content Areas Norman L. Webb March 28, Reading (based on Wixson, 1999)

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

Words with Attitude. Jaap Kamps and Maarten Marx. Abstract. 1 Introduction

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

On the effect of data set size on bias and variance in classification learning

Internet of Things, data management for healthcare applications. Ontology and automatic classifications

LOGISTIC REGRESSION ANALYSIS

Integrating Benders decomposition within Constraint Programming

Overview of the TACITUS Project

Note on growth and growth accounting

Graph Mining and Social Network Analysis

Words with Attitude. Abstract. 1 Introduction

Overview of MT techniques. Malek Boualem (FT)

An Introduction to. Metrics. used during. Software Development

Lecture 16 : Relations and Functions DRAFT

A Comparison of LSA, WordNet and PMI-IR for Predicting User Click Behavior

Behavioral Entropy of a Cellular Phone User

Vilnius University. Faculty of Mathematics and Informatics. Gintautas Bareikis

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity

arxiv: v1 [cs.ir] 12 Jun 2015

Cluster analysis and Association analysis for the same data

Facilitating Business Process Discovery using Analysis

SEMI AUTOMATIC DATA CLEANING FROM MULTISOURCES BASED ON SEMANTIC HETEROGENOUS

Does Artificial Tutoring foster Inquiry Based Learning?

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Metric Spaces. Chapter Metrics

Transcription:

A typology of ontology-based semantic measures Emmanuel Blanchard, Mounira Harzallah, Henri Briand, and Pascale Kuntz Laboratoire d Informatique de Nantes Atlantique Site École polytechnique de l université de Nantes rue Christian Pauc BP 50609-44306 Nantes Cedex 3 - France emmanuel.blanchard@univ-nantes.fr Abstract. Ontologies are in the heart of the knowledge management process. Different semantic measures have been proposed in the literature to evaluate the strength of the semantic link between two concepts or two groups of concepts from either two different ontologies ontology alignment) or the same ontology. This article presents an off-context study of eight semantic measures based on an ontology restricted to subsomption links. We first present some common principles, and then propose a comparative study based on a set of semantic and theoretical criteria. 1 Introduction A consensus is now established about the definition and the role of an ontology in knowledge engineering: "an ontology is a formal, explicit, specification of a shared conceptualization" [1], it constitutes knowledge repository that supports information exchanges between computer systems. Several application fields search to exploit their wealth: semantic web, system interoperability, competence management, e-learning, natural language processing, etc. Numerous semantic measures on ontology have been proposed in literature to evaluate the strength of the semantic link between two concepts or two groups of concepts from two different ontologies ontology alignment) or inside an ontology. The majority of the semantic measures defined on a unique ontology is developed and validated in a specific context. This article presents an off-context study of eight semantic measures whose definitions take only into account subsomption links. The synthesis of the parameters which appear in at least one of these measures has provided us a basis of comparison to develop a semantic measure typology. First, we identify ontology-based semantic measure characteristics and the parameters that influence these measures. Then, we present a comparative study of eight measures structured according to the previously defined criterion. 2 Semantic measure classification We consider here the basic primitives of an ontology which are concepts and relations. Among the set of possible relations, some of them are not used systematically. For instance, the taxonomic relations hyperonymy/hyponymy) which

correspond to the subsomption link is-a) are the commonly used relations. Additional relations may also appear e.g. partonomic relations meronymy/holonymy) which correspond to the composition link part-of), lexical relations synonymy, antonymy, etc.) [2][3]. We have here decided to focus ourselves on taxonomic relations before we generalize this study to other relation types. This choice is guided by most of previous works which are limited to one or some relation types but which always consider taxonomic relations. Measure validation is evoked according to three ways [4]: mathematical analysis, comparison with human judgment and application specific evaluation. In this paper we favour the mathematical analysis, and we introduce the different characteristics of semantic measures, and the different parameters which influence them. When defining a measure, three characteristics are generally specified: Information sources. Each considered measure is based on a given ontology most often WordNet). Some definitions require a corpus of texts to add information such as the distribution of concept frequencies. Principles. Most of measures are based on axiomatic principles e.g. they make functions of the information content or the shortest path length. Semantic class. Different classes have been introduced in the literature: semantic distance, semantic similarity and semantic relatedness between two concepts in the same ontology. The semantic similarity evaluates the resemblance between two concepts from a subset of significant semantic links e.g is-a and part-of). The semantic relatedness evaluates the closeness between two concepts from the whole set of their semantic links. All pairs of concepts with a high semantic similarity value have a high semantic relatedness value whereas the inverse is not necessarily true. The semantic distance evaluates the disaffection between two concepts; it is an inverse notion to the semantic relatedness. We have identified four parameters associated with the ontology taxonomic hierarchy which influence at least one of the measures: p 1 p 2 p 3 p 4 the length of the shortest path: the length of the shortest path between two concepts c i and c j ; the depth of the most specific common subsumer: the length of the shortest path between the root and the most specific common subsumer of c i and c j ; the density of the concepts of the shortest path: the number of sons of each concept which belongs to the shortest path between two concepts c i and c j ; the density of the concepts from the root to the most specific common subsumer: the number of sons of the concepts which belong to the shortest path from the root to the most specific common subsumer of two concepts c i and c j. Since our study is restricted to the taxonomic hierarchy, p 1 is equal to the sum of the two shortest path from the two concepts to their most specific common subsumer. The parameter p 2 is the difference between the length of the shortest path between the root and one of the two concepts and the length of the shortest path between this concept and the most specific common subsumer.

The measures which also use a corpus consider the information content of some concepts. Let us consider a concept c. The information content is defined by CIc) = logp c)) where P c) corresponds to the occurrence probability, in a consequent corpus of texts, of c or one of its directly or indirectly subsumed concepts. Let us notice that this definition contains the information on the shortest path from the root to the concept c depth of c). As P c) is an exponential decreasing function of the depth of c, CIc) is proportional to this latter. In addition, the information extracted from a corpus by this approach contains also the information of the density of the concepts on the same path. Indeed, let us consider the set S of the concepts which have the same father direct subsumer) as c. When the cardinality of S increases, the average occurrence probability of each element of S decreases. Consequently, CIc) increases. 3 Semantic measure presentation In the following, we present eight measures using information sources and principles defined in the previous section. Then, we analyse the theoretic definition of each measure. Some of the following ontology or corpus characteristics are considered in the definitions: rt: ontology root pthsx, y): set of paths between the concepts x and y len e x): length in number of edges of the path x len n x): length in number of nodes of the path x P x): occurrence probability of a concept x in a corpus mscsx, y): the most specific common subsumer of x and y trnx): number of direction changes of the path x min r : minimum weight assigned to the relation r max r : maximum weight assigned to the relation r n r x): number of relations of type of r which leave x Rada et al. s distance[5]. It is based on the shortest path between two concepts c i and c j p 1 ) in an ontology restricted to taxonomic links: dist rmbb c i, c j ) = min len ep) 1) p pthsc i,c j) When considering the shortest path, all the taxonomic links between two adjacent concepts are supposed to have a same value. Resnik s similarity[6]. It is established on the following hypothesis: the more the information two concepts share in common, the more similar they are. Like the previous measure 1), this one only considers taxonomic links. On the basis of the information theory, Resnik proposes to add the information content. The information shared by two concepts is indicated by the information content of their most specific common subsumer: sim r c i, c j ) = log P mscsc i, c j )) 2)

The use of the information content of the most specific common subsumer implies that this measure depends on two parameters: the length of the shortest path from the root to the most specific common subsumer of c i and c j p 2 ) and the density of concepts on this path p 4 ). Leacock and Chodorow s similarity[7]. It corresponds to a transformation of the Rada distance into a similarity. The shortest path between two concepts of the ontology restricted to taxonomic links is normalized by introducing a division by the double of the maximum hierarchy depth: sim lc c i, c j ) = log 2 max c cpts min len np) p pthsc i,c j) max len np) p pthsc,rt) ) 3) Like the Rada s measure, only the shortest path length p 1 ) influences this measure. Wu and Palmer s similarity[8]. It is a measure between concepts in an ontology restricted to taxonomic links. The two parameters which are the length of the two paths from c i to mscsc i, c j ) and from c j to mscsc i, c j ) in the Wu and Palmer s definition have been added. Their addition corresponds to the shortest path between c i and c j in the formula below: 2 min len ep) p pthsmscsc i,c j),rt) sim wp c i, c j ) = min len ep) + 2 min len ep) p pthsc i,c j) p pthsmscsc i,c j),rt) 4) Here, the depth of the most specific common subsumer p 2 ) has a non linear influence. We can observe this evolution if we set the shortest path length to a constant k : inf luencex) = x/x + k)). Furthermore, this measure is sensitive to the shortest path p 1 ). Jiang and Conrath s distance[3]. The authors have used, like Resnik, a corpus in addition to the ontology restricted to taxonomic links. Jiang and Conrath formulate the distance between two concepts as the difference between the sum of the information content of the two concepts and the information content of their most specific common subsumer: dist jc c i, c j ) = 2 log P ) mscsc i, c j ) ) log P c i ) + log P c j ) This definition is composed of two interesting components which are the information content of the two concepts and the information content of their most specific common subsumer. We can suppose that it varies according to all the proposed parameters. But the combination revokes the effect of two parameters. Finally, this measure is sensitive to the shortest path length between c i and c j p 1 ) and the density of concepts along this same path p 3 ). Lin s similarity[9]. Lin deduces from an axiomatic approach a measure based on an ontology restricted to taxonomic links and a corpus. This similarity takes into account the information shared by two concepts like Resnik, but also 5)

the difference between them. The definition contains the same components as in the previous measure but the combination is not a difference but a ratio: ) 2 log P mscsc i, c j ) sim l c i, c j ) = ) 6) log P c i ) + log P c j ) In this case, the combination allows this measure to be sensitive to the whole parameter set p 1, p 2, p 3, p 4 ). Lin notices that the Wu and Palmer measure is a particular case of his measure. Indeed if, for c the father of c, we consider that P c c ) is constant, then we obtain the Wu and Palmer s measure. Sussna s distance[2]. It is based on all the possible links. For each relation r, we define a weight wc i r c j ) from a given interval [min r ; max r ]. This weight is calculated with the local density which corresponds to the number of relations of the type r which go from c i : wc i r c j ) = max r max r min r n r c i ) Sussna defines the distance between two adjacent concepts. This link corresponds to the relations r and its inverse r. dist s c i, c j ) = 7) wc i r c j ) + wc j r c i ) [ ] 8) 2 max min len ep); min len ep) p pthsc i,rt) p pthsc j,rt) This formula is defined for adjacent nodes only. To calculate the distance between two concepts, we have to sum the distances of all the links which compose the shortest path between these two concepts. The distance obtained is sensitive to three parameters: the shortest path length between c i and c j p 1 ), the density of the concepts along this same path p 2 ) and the shortest path length from the root to the most specific common subsumer of c i and c j p 3 ). Hirst and St Onge s relatedness[10]. It is based on an ontology. Hirst and St Onge distinguish four relation types between two concepts qualified of extra-strong, strong, medium and week. Hirst and St Onge propose a different way to calculate the relatedness functions of the relation type. In the following formula, C and K are two constants: 3 Cextra-strong); 2 Cstrong); 0week); rel hs c i, c j ) = C min p pthsci,c j) len e p) K min p pthsci,c j) trnp) medium) 9) Synthesis and conclusion. The table 1 summarizes the studied characteristics and the influential parameters of the measures. The four parameters are independent and issued from the ontology. However, no proposed measure takes into account the whole parameter set without the use of a corpus. Concerning the introduction of a corpus in addition to an ontology when building a measure, we believe that this latter brings few information in comparison with its algorithmic complexity.

Table 1. Typology of semantic measures characteristics Parameters sources semantics p 1 p 2 p 3 p 4 Rada ontology distance Resnik ontology+corpus similarity Leacock-Chodorow ontology similarity Jiang-Conrath ontology+corpus distance Wu-Palmer ontology similarity Lin ontology+corpus similarity Sussna ontology distance Hirst-StOnge ontology relatedness In the next future, we plan to define a semantic similarity measure based on an ontology which will be sensitive to all parameters: the length and density of concepts of the shortest path between two concepts and of the shortest path from the root to their most specific common subsumer. References 1. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5 1993) 199 220 2. Sussna, M.: Word sense disambiguation for free-text indexing using a massive semantic network. In: Proceedings of the Second International Conference on Information and Knowledge Management. 1993) 67 74 3. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of International Conference on Research in Computational Linguistics. 1997) 4. Budanitsky, A.: Lexical semantic relatedness and its application in natural language processing. Technical report, Computer Systems Research Group - University of Toronto 1999) 5. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 19 1989) 17 30 6. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. Volume 1. 1995) 448 453 7. Leacock, C., Chodorow, M.: Combining local context and wordnet similarity for word sense identification. In Fellbaum, C., ed.: WordNet: An electronic lexical database. MIT Press 1998) 265 283 8. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Associations for Computational Linguistics. 1994) 133 138 9. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann 1998) 296 304 10. Hirst, G., StOnge, D.: Lexical chains as representation of context for the detection and correction of malaproprisms. In Fellbaum, C., ed.: WordNet: An electronic lexical database. MIT Press 1998) 305 332