International Journal of Computer Science & Applications


International Journal of Computer Science & Applications (ISSN )
Volume 4, Issue 2, July 2007

Special Issue on Communications, Interactions and Interoperability in Information Systems

Editor-in-Chief: Rajendra Akerkar
Editors of the Special Issue: Colette Rolland, Oscar Pastor and Jean-Louis Cavarero

ADVISORY EDITOR
Douglas Comer, Department of Computer Science, Purdue University, USA

EDITOR-IN-CHIEF
Rajendra Akerkar, Technomathematics Research Foundation, 204/17 KH, New Shahupuri, Kolhapur, INDIA

MANAGING EDITOR
David Camacho, Universidad Carlos III de Madrid, Spain

ASSOCIATE EDITORS
Ngoc Thanh Nguyen, Wroclaw University of Technology, Poland

COUNCIL OF EDITORS
Stuart Aitken, University of Edinburgh, UK
Tetsuo Asano, JAIST, Japan
Costin Badica, University of Craiova, Craiova, Romania
JF Baldwin, University of Bristol, UK
Pavel Brazdil, LIACC/FEP, University of Porto, Portugal
Ivan Bruha, McMaster University, Canada
Jacques Calmet, Universität Karlsruhe, Germany
Narendra S. Chaudhari, Nanyang Technological University, Singapore
Walter Daelemans, University of Antwerp, Belgium
K. V. Dinesha, IIIT, Bangalore, India
David Hung-Chang Du, University of Minnesota, USA
Hai-Bin Duan, Beihang University, P. R. China
Yakov I. Fet, Russian Academy of Sciences, Russia
Maria Ganzha, Gizycko Private Higher Educational Institute, Gizycko, Poland
S. K. Gupta, IIT, New Delhi, India
Henry Hexmoor, University of Arkansas, Fayetteville, USA
Ray Jarvis, Monash University, Victoria, Australia
Peter Kacsuk, MTA SZTAKI Research Institute, Budapest, Hungary
Pawan Lingras, Saint Mary's University, Halifax, Nova Scotia, Canada
Huan Liu, Arizona State University, USA
Pericles Loucopoulos, UMIST, Manchester, UK
Wolfram-Manfred Lippe, University of Muenster, Germany
Lorraine McGinty, University College Dublin, Belfield, Ireland
C. R. Muthukrishnan, Indian Institute of Technology, Chennai, India
Marcin Paprzycki, SWPS and IBS PAN, Warsaw
Lalit M. Patnaik, Indian Institute of Science, Bangalore, India
Dana Petcu, Western University of Timisoara, Romania
Shahram Rahimi, Southern Illinois University, Illinois, USA
Sugata Sanyal, Tata Institute of Fundamental Research, Mumbai, India
Dharmendra Sharma, University of Canberra, Australia
Ion O. Stamatescu, FEST, Heidelberg, Germany
José M. Valls Ferrán, Universidad Carlos III, Spain
Rajeev Wankar, University of Hyderabad, Hyderabad, India
Krzysztof Wecel, The Poznan University of Economics, Poland

Editorial Office: Technomathematics Research Foundation, 204/17 KH, New Shahupuri, Kolhapur, India. editor@tmrfindia.org

Copyright 2007 by Technomathematics Research Foundation. All rights reserved. This journal issue or parts thereof may not be reproduced in any form or by any means, electrical or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the copyright owner. Permission to quote from this journal is granted provided that the customary acknowledgement is given to the source.

International Journal of Computer Science & Applications (ISSN ) is a high-quality electronic journal published six-monthly by Technomathematics Research Foundation, Kolhapur, India. The www-site of IJCSA is

Contents

Editorial (v)

1. A New Quantitative Trust Model for Negotiating Agents using Argumentation (1-21)
Jamal Bentahar, Concordia Institute for Information Systems Engineering; John-Jules Ch. Meyer, Department of Information and Computing Sciences, Utrecht University, The Netherlands

2. Protocol Management Systems as a Middleware for Inter-Organizational Workflow Coordination (23-41)
Eric Andonoff, IRIT/UT1; Wassim Bouaziz, IRIT/UT1; Chihab Hanachi, IRIT/UT1

3. Adaptability of Methods for Processing XML Data using Relational Databases - the State of the Art and Open Problems (43-62)
Irena Mlynkova, Department of Software Engineering, Charles University; Jaroslav Pokorny, Department of Software Engineering, Charles University

4. XML View Based Access to Relational Data in Workflow Management Systems (63-74)
Marek Lehmann, University of Vienna, Department of Knowledge and Business Engineering; Johann Eder, University of Vienna, Department of Knowledge and Business Engineering; Christian Dreier, University of Klagenfurt; Jurgen Mangler

5. Incremental Trade-Off Management for Preference-Based Queries (75-91)
Wolf-Tilo Balke, L3S Research Center, University of Hannover, Germany; Ulrich Güntzer, University of Tübingen, Germany; Christoph Lofi, L3S Research Center, University of Hannover, Germany

6. What Enterprise Architecture and Enterprise Systems Usage Can and Can not Tell about Each Other (93-109)
Maya Daneva, University of Twente; Pascal van Eck, University of Twente

7. UNISC-Phone - A Case Study ( )

Jacques Schreiber, Gunter Feldens, Eduardo Lawisch, Luciano Alves, Informatics Department, UNISC - Santa Cruz do Sul University

8. Fuzzy Ontologies and Scale-free Networks Analysis ( )
Silvia Calegari, DISCo, University of Milano-Bicocca; Fabio Farina, DISCo, University of Milano-Bicocca

9. Extracted Knowledge Interpretation in Mining Biological Data: a Survey ( )
Martine Collard, University of Nice; Ricardo Martinez, University of Nice

Editorial

The First International Conference on Research Challenges in Information Science (RCIS) aimed at providing an international forum for scientists, researchers, engineers and developers from a wide range of information science areas to exchange ideas and approaches in this evolving field. While presenting research findings and state-of-the-art solutions, authors were especially invited to share experiences on new research challenges. High-quality papers in all information science areas were solicited, and original papers exploring research challenges received especially careful attention from reviewers. Papers that had already been accepted or were currently under review for other conferences or journals were not considered for publication at RCIS. Of the papers submitted, 31 were accepted; they will be published in the RCIS'07 proceedings.

This special issue of the International Journal of Computer Science & Applications, dedicated to Communications, Interactions and Interoperability in Information Systems, presents 9 papers that obtained the highest marks in the reviewing process. They are presented here in an extended version. In the context of RCIS'07, they are linked to Information System Modelling and Intelligent Agents, Description Logics, Ontologies and XML-based techniques.

Colette Rolland, Oscar Pastor, Jean-Louis Cavarero

6 Vol. IV, No. II, pp A New Quantitative Trust Model for Negotiating Agents using Argumentation Jamal Bentahar 1, John-Jules Ch. Meyer 2 1 Concordia University, Concordia Institute for Information Systems Engineering, Canada bentahar@ciise.concordia.ca 2 Utrecht University, Department of Information and Computing Sciences, The Netherlands jj@cs.uu.nl Abstract In this paper, we propose a new quantitative trust model for argumentation-based negotiating agents. The purpose of such a model is to provide a secure environment for agent negotiation within multi-agent systems. The problem of securing agent negotiation in a distributed setting is core to a number of applications, particularly the emerging semantic grid computing-based applications such as e-business. Current approaches to trust fail to adequately address the challenges for trust in these emerging applications. These approaches are either centralized on mechanisms such as digital certificates, and thus are particularly vulnerable to attacks, or are not suitable for argumentation-based negotiation in which agents use arguments to reason about trust. Key words: Intelligent Agents, Negotiating Agents, Security, Trust. 1 Introduction Research in agent communication protocols has received much attention during the last years. In multi-agent systems (MAS), protocols are means of achieving meaningful interactions between software autonomous agents. Agents use these protocols to guide their interactions with each other. Such protocols describe the allowed communicative acts that agents can perform when conversing and specify the rules governing a dialogue between these agents. Protocols for multi-agent interaction need to be flexible because of the open and dynamic nature of MAS. Traditionally, these protocols are specified as finite state machines or Petri nets without taking into account the agents autonomy. Therefore, Jamal Bentahar, John-Jules Ch. Meyer 1

7 Vol. IV, No. II, pp they are not flexible enough to be used by agents expected to be autonomous in open MAS [16]. This is due to the fact that agents must respect the whole protocol specification from the beginning to the end without reasoning about them. To solve this problem, several researchers recently proposed protocols using dialogue games [6, 11, 15, 17]. Dialogue games are interactions between players, in which each player moves by performing utterances according to a pre-defined set of roles. The flexibility is achieved by combining different small games to construct complete and more complex protocols. This combination can be specified using logical rules about which agents can reason [11]. The idea of these logic-based dialogue game protocols is to enable agents to effectively and flexibly participate in various interactions with each other. One such type of interaction that is gaining increasing prominence in the agent community is negotiation. Negotiation is a form of interaction in which a group of agents, with conflicting interests, but a desire to cooperate, try to come to a mutually acceptable agreement on the division of scarce resources. A particularly challenging problem in this context is security. The problem of securing agent negotiation in a distributed setting is core to a number of applications, particularly the emerging semantic grid computingbased applications such as e-science (science that is enabled by the use of distributed computing resources by end-user scientists) and e-business [9, 10]. The objective of this paper is to address this challenging issue by proposing a new quantitative, probabilistic-based model to trust negotiating agents, which is efficient, in terms of computational complexity. The idea is that in order to share resources and allow mutual access, involved agents in e-infrastructures need to establish a framework of trust that establishes what they each expect of the other. Such a framework must allow one entity to assume that a second entity will behave exactly as the first entity expects. Current approaches to trust fail to adequately address the challenges for trust in the emerging e-computing. These approaches are mostly centralized on mechanisms such as digital certificates, and thus are particularly vulnerable to attacks. This is because if some authorities who are trusted implicitly are compromised, then there is no other check in the system. By contrast, in the decentralized approach we propose in this paper and where the principals maintain trust in each other for more reasons than a single certificate, any invaders can cause limited harm before being detected. Recently, some decentralized trust models have been proposed [2, 3, 4, 7, 13, 19] (see [18] for a survey). However, these models are not suitable for argumentation-based negotiation, in which agents use their argumentation abilities as a reasoning mechanism. In addition, some of these models do not consider the case where false information is collected from other partners. This paper aims at overcoming these limits. The rest of this paper is organized as follows. In Section 2, we present the negotiation framework. In Section 3, we present our trustworthiness model. We highlight its formulation, algorithmic description, and computational complexity. In Section 4, we describe and discuss implementation issues. In Sections 5, we compare our framework to related work, and in Section 6, we conclude. 
2 Negotiation Framework In this section, we briefly present the dialogue game-based framework for negotiating agents [11, 12]. These agents have a BDI architecture (Beliefs, Desires, and Intention) Jamal Bentahar, John-Jules Ch. Meyer 2

augmented with argumentation and logical and social reasoning. The architecture is composed of three models: the mental model, the social model, and the reasoning model. The mental model includes beliefs, desires, goals, etc. The social model captures social concepts such as conventions, roles, etc. Social commitments made by agents when negotiating are a significant component of this model because they reflect mental states. Thus, agents must use their reasoning capabilities to reason about their mental states before creating social commitments. The agent's reasoning capabilities are represented by the reasoning model using an argumentation system. Agents also have general knowledge, such as knowledge about the conversation subject. This architecture has the advantage of taking into account the three important aspects of agent communication: mental, social, and reasoning. It is motivated by the fact that conversation is a cognitive and social activity, which requires a mechanism making it possible to reason about mental states, about what other agents say (public aspects), and about the social aspects (conventions, standards, obligations, etc.).

The main idea of our negotiation framework is that agents use their argumentation abilities in order to justify their negotiation stances, or to influence other agents' negotiation stances, considering interacting preferences and utilities. Argumentation can be abstractly defined as a dialectical process for the interaction of different arguments for and against some conclusion. Our negotiation dialogue games are based on formal dialectics in which arguments are used as a way of expressing decision-making [8, 14]. Generally, argumentation can help multiple agents to interact rationally, by giving and receiving reasons for conclusions and decisions, within an enriching dialectical process that aims at reaching mutually agreeable joint decisions. During negotiation, agents can establish a common knowledge of each other's commitments, find compromises, and persuade each other to make commitments. In contrast to traditional approaches to negotiation that are based on numerical values, argument-based negotiation is based on logic. An argumentation system is simply a set of arguments and a binary relation representing the attack relation between the arguments. The following definitions formally describe these notions. Here Σ indicates a possibly inconsistent knowledge base, ⊢ stands for classical inference and ≡ for logical equivalence.

Definition 1 (Argument). An argument is a pair (H, h) where h is a formula of a logical language and H a subset of Σ such that: (i) H is consistent, (ii) H ⊢ h, and (iii) H is minimal, so no subset of H satisfying both (i) and (ii) exists. H is called the support of the argument and h its conclusion.

Definition 2 (Attack Relation). Let (H_1, h_1) and (H_2, h_2) be two arguments. (H_1, h_1) attacks (H_2, h_2) iff h_1 ⊢ ¬h_2.

Negotiation dialogue games are specified using a set of logical rules. The allowed communicative acts are: Make-Offer, Make-Counter-Offer, Accept, Refuse, Challenge, Inform, Justify, and Attack. For example, according to a logical rule, before making an offer h, the speaker agent must use its argumentation system to build an argument (H, h). The idea is to be able to persuade the addressee agent about h, if he decides to refuse the offer. On the other side, the addressee agent must use his own argumentation system to select the answer he will give (Make-Counter-Offer, Accept, etc.).
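To make Definitions 1 and 2 concrete, the following Java sketch shows one possible in-memory representation of an argument and of the attack test. It is purely illustrative: it is not part of the prototype described later in the paper, and the Formula and Entailment types are assumptions standing in for a propositional language and an entailment oracle.

import java.util.Set;

interface Formula { Formula negate(); }

interface Entailment {
    // Hypothetical oracle: do the premises classically entail the conclusion?
    boolean entails(Set<Formula> premises, Formula conclusion);
}

final class Argument {
    final Set<Formula> support;   // H: a consistent, minimal subset of the knowledge base
    final Formula conclusion;     // h: the conclusion supported by H

    Argument(Set<Formula> support, Formula conclusion) {
        this.support = support;
        this.conclusion = conclusion;
    }

    // Definition 2: (H1, h1) attacks (H2, h2) iff h1 entails the negation of h2.
    boolean attacks(Argument other, Entailment logic) {
        return logic.entails(Set.of(conclusion), other.conclusion.negate());
    }
}

With such an oracle, deciding whether one argument attacks another reduces to a single entailment query on the attacker's conclusion.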

3 Trustworthiness Model for Negotiating Agents

In recent years, several models of trust have been developed in the context of MAS [2, 3, 4, 13, 18, 19]. However, these models are not designed to trust argumentation-based negotiating agents. Their formulations do not take into account the elements we use in our negotiation approach (accepted and refused arguments, satisfied and violated commitments). In addition, these models have some limitations regarding the inaccuracy of the information collected from other agents. In this section we present our argumentation- and probabilistic-based model to trust negotiating agents, which overcomes some limitations of these models.

3.1 Formulation

Let A be the set of agents. We define an agent's trustworthiness in a distributed setting as a probability function as follows:

TRUST : A × A × D → [0, 1]

This function associates with each agent a probability measure representing its trustworthiness in the domain D according to another agent. To simplify the notation, we omit the domain D from the TRUST function because we suppose that it is always known. Let X be a random variable representing an agent's trustworthiness. To evaluate the trustworthiness of an agent Ag_b, an agent Ag_a uses the history of its interactions with Ag_b. Equation 1 indicates how to calculate this trustworthiness as a probability measure (number of successful outcomes / total number of possible outcomes).

TRUST(Ag_b)^{Ag_a} = ( Nb_Arg(Ag_b)^{Ag_a} + Nb_C(Ag_b)^{Ag_a} ) / ( T_Nb_Arg(Ag_b)^{Ag_a} + T_Nb_C(Ag_b)^{Ag_a} )    (1)

TRUST(Ag_b)^{Ag_a} indicates the trustworthiness of Ag_b from Ag_a's point of view. Nb_Arg(Ag_b)^{Ag_a} is the number of Ag_b's arguments that are accepted by Ag_a. Nb_C(Ag_b)^{Ag_a} is the number of satisfied commitments made by Ag_b towards Ag_a. T_Nb_Arg(Ag_b)^{Ag_a} is the total number of Ag_b's arguments towards Ag_a. T_Nb_C(Ag_b)^{Ag_a} is the total number of commitments made by Ag_b towards Ag_a. All these commitments and arguments are related to the domain D. The basic idea is that the trust degree of an agent can be induced according to how much information acquired from him has been accepted as belief in the past. Using the number of accepted arguments when computing the trust value reflects the agent's knowledge level in the domain D. Particularly, in argumentation-based negotiation, the accepted arguments capture the agent's reputation level. If some argument conflicts in the domain D exist between the two agents, this will affect the confidence they have in each other. However, this is related only to the domain D, and is not generalized to other domains in which the two agents can trust each other.

In a negotiation setting, the existence of argument conflicts reflects a disagreement in the perception of the negotiation domain. Because all the factors of Equation 1 are related to the past, this information is finite. Trustworthiness is a dynamic characteristic that changes according to the interactions taking place between Ag_a and Ag_b. This supposes that Ag_a knows Ag_b. If not, or if the number of interactions is not sufficient to determine this trustworthiness, the consultation of other agents becomes necessary. As proposed in [1, 2, 3], each agent has two kinds of beliefs when evaluating the trustworthiness of another agent: local beliefs and total beliefs. Local beliefs are based on the direct interactions between agents. Total beliefs are based on the combination of the different testimonies of other agents that we call witnesses. In our model, local beliefs are given by Equation 1. Total beliefs require studying how the different probability measures offered by witnesses can be combined. We deal with this aspect in the following section.

3.2 Estimating an Agent's Trustworthiness

Let us suppose that an agent Ag_a wants to evaluate the trustworthiness of an agent Ag_b with whom he has never (or not enough) interacted before. This agent must ask agents he knows to be trustworthy (we call these agents confidence agents). To determine whether an agent is confident or not, a trustworthiness threshold w must be fixed. Thus, Ag_b will be considered trustworthy by Ag_a iff TRUST(Ag_b)^{Ag_a} is higher than or equal to w. Ag_a attributes a trustworthiness measure to each confidence agent Ag_i. When he is consulted by Ag_a, each confidence agent Ag_i provides a trustworthiness value for Ag_b if Ag_i knows Ag_b. Confidence agents use their local beliefs to assess this value (Equation 1). Thus, the problem consists in evaluating Ag_b's trustworthiness using the trustworthiness values transmitted by confidence agents. Fig. 1 illustrates this issue.

[Fig. 1. Problem of measuring Ag_b's trustworthiness by Ag_a]

We notice that this problem cannot be formulated as a problem of conditional probability. Consequently, it is not possible to use Bayes' theorem or the total probability theorem. The reason is that the events in our problem are not mutually exclusive, whereas this condition is necessary for these two theorems.

Here an event is the fact that a confidence agent is trustworthy. Consequently, the events are not mutually exclusive because the probability that two confidence agents are trustworthy at the same time is not equal to 0. To solve this problem, we must investigate the distribution of the random variable X representing the trustworthiness of Ag_b. Since X takes only two values, 0 (the agent is not trustworthy) or 1 (the agent is trustworthy), the variable X follows a Bernoulli distribution β(1, p). According to this distribution, we have Equation 2:

E(X) = p    (2)

where E(X) is the expectation of the random variable X and p is the probability that the agent is trustworthy. Thus, p is the probability that we seek. Therefore, it is enough to evaluate the expectation E(X) to find TRUST(Ag_b)^{Ag_a}. However, this expectation is a theoretical mean that we must estimate. To this end, we can use the Central Limit Theorem (CLT) and the law of large numbers. The CLT states that whenever a random sample of size n (X_1, ..., X_n) is taken from any distribution with mean μ, the sample mean (X_1 + ... + X_n)/n will be approximately normally distributed with mean μ. As an application of this theorem, the arithmetic mean (average) (X_1 + ... + X_n)/n approaches a normal distribution with mean μ (the expectation) and standard deviation σ/√n. Generally, and according to the law of large numbers, the expectation can be estimated by the weighted arithmetic mean. Our random variable X is the weighted average of n independent random variables X_i that correspond to Ag_b's trustworthiness from the point of view of the confidence agents Ag_i. These random variables follow the same distribution: the Bernoulli distribution. They are also independent because the probability that Ag_b is trustworthy according to an agent Ag_t is independent of the probability that this agent (Ag_b) is trustworthy according to another agent Ag_r. Consequently, the random variable X follows a normal distribution whose average is the weighted average of the expectations of the independent random variables X_i. The mathematical estimation of the expectation E(X) is given by Equation 3.

M_0 = ( Σ_{i=1..n} TRUST(Ag_i)^{Ag_a} · TRUST(Ag_b)^{Ag_i} ) / ( Σ_{i=1..n} TRUST(Ag_i)^{Ag_a} )    (3)

The value M_0 represents an estimation of TRUST(Ag_b)^{Ag_a}. Equation 3 does not take into account the number of interactions between the confidence agents and Ag_b. This number is an important factor because it makes it possible to promote information coming from agents that know Ag_b better. In addition, another factor might be used to reflect the timely relevance of the transmitted information. This is because the agents' environment is dynamic and may change quickly. The idea is to promote recent information and to deal with out-of-date information with less emphasis. Equation 4 gives us an estimation of TRUST(Ag_b)^{Ag_a} if we take these factors into account and suppose that all confidence agents have the same trustworthiness.

M_1 = ( Σ_{i=1..n} N(Ag_i)^{Ag_b} · TR(Ag_i)^{Ag_b} · TRUST(Ag_b)^{Ag_i} ) / ( Σ_{i=1..n} N(Ag_i)^{Ag_b} · TR(Ag_i)^{Ag_b} )    (4)

The factor N(Ag_i)^{Ag_b} indicates the number of interactions between a confidence agent Ag_i and Ag_b. This number can be identified by the total number of Ag_b's commitments and arguments. The factor TR(Ag_i)^{Ag_b} represents the timely relevance coefficient of the information transmitted by Ag_i about Ag_b's trust (TR denotes Timely Relevance). We note here that removing TR(Ag_i)^{Ag_b} from Equation 4 results in the classical probability equation used to calculate the expectation E(X). In our model, we assess the factor TR(Ag_i)^{Ag_b} by using the function defined in Equation 5. We call this function the Timely Relevance function.

TR(Δt)^{Ag_b}_{Ag_i} = e^(-λ · ln(Δt))    (5)

Δt is the time difference between the current time and the time at which Ag_i updated its information about Ag_b's trust. λ is an application-dependent coefficient. The intuition behind this formula is to use a function decreasing with the time difference (Fig. 2). Consequently, the more recent the information is, the higher the timely relevance coefficient. The function ln is used for computational reasons when dealing with large numbers. Intuitively, the function used in Equation 5 reflects the reliability of the transmitted information. Indeed, this function is similar to the well-known reliability function for systems engineering (R(t) = e^(-λt)).

[Fig. 2. The timely relevance function]

The combination of Equation 3 and Equation 4 gives us a good estimation of TRUST(Ag_b)^{Ag_a} (Equation 6) that takes into account the four most important factors: (1) the trustworthiness of the confidence agents from the point of view of Ag_a; (2) Ag_b's trustworthiness from the point of view of the confidence agents; (3) the number of interactions between the confidence agents and Ag_b; and (4) the timely relevance of the information transmitted by the confidence agents.

This number is an important factor because it makes it possible to highlight information coming from agents that know Ag_b better.

M_2 = ( Σ_{i=1..n} TRUST(Ag_i)^{Ag_a} · N(Ag_i)^{Ag_b} · TR(Ag_i)^{Ag_b} · TRUST(Ag_b)^{Ag_i} ) / ( Σ_{i=1..n} TRUST(Ag_i)^{Ag_a} · N(Ag_i)^{Ag_b} · TR(Ag_i)^{Ag_b} )    (6)

The way of combining Equation 3 (M_0) and Equation 4 (M_1) in the calculation of Equation 6 (M_2) is justified by the fact that it reflects the mathematical expectation of the random variable X representing Ag_b's trustworthiness. This equation represents the sum of the probability of each possible outcome multiplied by its payoff. It shows how trust can be obtained by merging the trustworthiness values transmitted by some mediators. This merging method takes into account the proportional relevance of each trustworthiness value, rather than treating them equally. According to Equation 6, we have:

∀i, TRUST(Ag_b)^{Ag_i} ≤ w
⟹ M = ( Σ_{i=1..n} TRUST(Ag_i)^{Ag_a} · N(Ag_i)^{Ag_b} · TR(Ag_i)^{Ag_b} · TRUST(Ag_b)^{Ag_i} ) / ( Σ_{i=1..n} TRUST(Ag_i)^{Ag_a} · N(Ag_i)^{Ag_b} · TR(Ag_i)^{Ag_b} ) ≤ w

Consequently, if all the trust values sent by the consulted agents about Ag_b are less than the threshold w, then Ag_b cannot be considered trustworthy. Thus, the well-known Kyburg lottery paradox can never happen. The lottery paradox was designed to demonstrate that three attractive principles governing rational acceptance lead to contradiction, namely that: (1) it is rational to accept a proposition that is very likely true; (2) it is not rational to accept a proposition that you are aware is inconsistent; and (3) if it is rational to accept a proposition A and it is rational to accept another proposition B, then it is rational to accept A ∧ B; these three principles are jointly inconsistent. In our situation, we do not have such a contradiction. To assess M, we need the trustworthiness of other agents. To deal with this issue, we propose the notion of a trust graph.
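As an illustration of how Equations 1, 5 and 6 fit together, the following Java sketch computes a local trust value, a timely relevance coefficient, and the merged estimate over a set of witness reports. It is only a sketch: the class and field names (WitnessReport, TrustEstimation, lambda, and so on) are not taken from the paper's prototype, and the witness reports are assumed to have already been collected.

final class WitnessReport {
    final double witnessTrust;   // TRUST(Ag_i) from Ag_a's point of view
    final double reportedTrust;  // TRUST(Ag_b) from Ag_i's point of view, computed with Equation 1
    final int interactions;      // N(Ag_i)^{Ag_b}: number of interactions between Ag_i and Ag_b
    final double deltaT;         // time elapsed since Ag_i last updated its rating of Ag_b

    WitnessReport(double witnessTrust, double reportedTrust, int interactions, double deltaT) {
        this.witnessTrust = witnessTrust;
        this.reportedTrust = reportedTrust;
        this.interactions = interactions;
        this.deltaT = deltaT;
    }
}

final class TrustEstimation {
    // Equation 1: accepted arguments and satisfied commitments over their totals.
    static double localTrust(int acceptedArgs, int satisfiedCommitments, int totalArgs, int totalCommitments) {
        return (double) (acceptedArgs + satisfiedCommitments) / (totalArgs + totalCommitments);
    }

    // Equation 5: TR(deltaT) = exp(-lambda * ln(deltaT)); lambda is application-dependent.
    static double timelyRelevance(double deltaT, double lambda) {
        return Math.exp(-lambda * Math.log(deltaT));
    }

    // Equation 6: weighted average of the witnesses' reports about Ag_b.
    static double estimateTrust(java.util.List<WitnessReport> reports, double lambda) {
        double numerator = 0.0, denominator = 0.0;
        for (WitnessReport r : reports) {
            double weight = r.witnessTrust * r.interactions * timelyRelevance(r.deltaT, lambda);
            numerator += weight * r.reportedTrust;
            denominator += weight;
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }
}

For instance, with lambda = 1, a report (0.9, 0.8, 10, 2.0) and a report (0.6, 0.3, 2, 20.0) yield an estimate of about 0.79, much closer to 0.8 than to 0.3, because the first witness is more trusted, has interacted more with Ag_b, and reports fresher information.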

3.3 Trust Graph

In the previous section, we provided a solution to the trustworthiness combination problem to evaluate the trustworthiness of a new agent (Ag_b). To simplify the problem, we supposed that each consulted agent (a confidence agent) offers a trustworthiness value for Ag_b if he knows him. If a confidence agent does not offer any trustworthiness value, it will not be taken into account at the moment of the evaluation of Ag_b's trustworthiness by Ag_a. However, a confidence agent who does not know Ag_b can offer to Ag_a a set of agents who possibly know Ag_b. In this case, Ag_a will ask the proposed agents. These agents also have a trustworthiness value according to the point of view of the agent who proposed them. For this reason, Ag_a applies Equation 5 to assess the trustworthiness values of these agents. These new values will be used to evaluate Ag_b's trustworthiness. We can build a trust graph in order to deal with this issue. We define such a graph as follows:

Definition 3 (Trust Graph). A trust graph is a directed and weighted graph. The nodes are agents and an edge (Ag_i, Ag_j) means that agent Ag_i knows agent Ag_j. The weight of the edge (Ag_i, Ag_j) is a pair (x, y) where x is Ag_j's trustworthiness according to the point of view of Ag_i and y is the number of interactions between Ag_i and Ag_j. The weight of a node is the agent's trustworthiness according to the point of view of the source agent.

According to this definition, in order to determine the trustworthiness of the target agent Ag_b, it is necessary to find the weight of the node representing this agent in the graph. The graph is constructed while Ag_a receives answers from the consulted agents. The evaluation process of the nodes starts when the whole graph is built. This means that this process only starts when Ag_a has received all the answers from the consulted agents. The process terminates when the node representing Ag_b is evaluated. The graph construction and node evaluation algorithms are given respectively by Algorithms 1 and 2.

Correctness of Algorithm 1: The construction of the trust graph is described as follows:
1- Agent Ag_a sends a request about Ag_b's trustworthiness to all the confidence agents Ag_i. The nodes representing these agents (denoted Node(Ag_i)) are added to the graph. Since the trustworthiness values of these agents are known, the weights of these nodes (denoted Weight(Node(Ag_i))) can be evaluated. These weights are represented by TRUST(Ag_i)^{Ag_a} (i.e. by Ag_i's trustworthiness from the point of view of Ag_a).
2- Ag_a uses the primitive Send(Ag_i, Investigation(Ag_b)) in order to ask Ag_i to offer a trustworthiness value for Ag_b. Ag_i's answers are recovered, when they are offered, in a variable denoted Str by Str := Receive(Ag_i). Str.Agents represents the set of agents referred by Ag_i. Str.TRUST(Ag_j)^{Ag_i} is the trustworthiness value of an agent Ag_j (belonging to the set Str.Agents) from the point of view of the agent who referred him (i.e. Ag_i).
3- When a consulted agent answers by indicating a set of agents, these agents will also be consulted. They can be regarded as potential witnesses. These witnesses are added to a set called Potential_Witnesses. When a potential witness is consulted, he is removed from the set.
4- To ensure that the evaluation process terminates, two limits are used: the maximum number of agents to be consulted (Limit_Nbr_Visited_Agents) and the maximum number of witnesses who must offer an answer (Limit_Nbr_Witnesses). The variable Nbr_Additional_Agents is used to make sure that the first limit is respected when Ag_a starts to receive the answers of the consulted agents.

Construct-Graph(Ag_a, Ag_b, Limit_Nbr_Visited_Agents, Limit_Nbr_Witnesses)
{
  Graph := ∅
  Nbr_Witnesses := 0
  Nbr_Visited_Agents := 0
  Nbr_Additional_Agents := Max(0, Limit_Nbr_Visited_Agents - Size(Confidence(Ag_a)))
  Potential_Witnesses := Confidence(Ag_a)
  Add Node(Ag_b) to Graph
  While (Potential_Witnesses ≠ ∅) and (Nbr_Witnesses < Limit_Nbr_Witnesses) and
        (Nbr_Visited_Agents < Limit_Nbr_Visited_Agents)
  {
    n := Limit_Nbr_Visited_Agents - Nbr_Visited_Agents
    m := Limit_Nbr_Witnesses - Nbr_Witnesses
    For (i := 1, i ≤ min(n, m), i++)
    {
      Ag_1 := Potential_Witnesses(i)
      If Node(Ag_1) ∉ Graph Then Add Node(Ag_1) to Graph
      If Ag_1 ∈ Confidence(Ag_a) Then Weight(Node(Ag_1)) := TRUST(Ag_1)^{Ag_a}
      Send(Ag_1, Investigation(Ag_b))
      Nbr_Visited_Agents := Nbr_Visited_Agents + 1
    }
  }
  For (i := 1, i ≤ min(n, m), i++)
  {
    Ag_1 := Potential_Witnesses(1)
    Str := Receive(Ag_1)
    Potential_Witnesses := Potential_Witnesses / {Ag_1}
    While (Str.Agents ≠ ∅) and (Nbr_Additional_Agents > 0)
    {
      If Str.Agents = {Ag_b} Then
      {
        Nbr_Witnesses := Nbr_Witnesses + 1
        Add Arc(Ag_1, Ag_b)
        Weight1(Arc(Ag_1, Ag_b)) := Str.TRUST(Ag_b)^{Ag_1}
        Weight2(Arc(Ag_1, Ag_b)) := Str.n(Ag_b)^{Ag_1}
        Str.Agents := ∅
      }
      Else
      {
        Nbr_Additional_Agents := Nbr_Additional_Agents - 1
        Ag_2 := Str.Agents(1)
        Str.Agents := Str.Agents / {Ag_2}
        If Node(Ag_2) ∉ Graph Then Add Ag_2 to Graph
        Weight1(Arc(Ag_1, Ag_2)) := Str.TRUST(Ag_2)^{Ag_1}
        Weight2(Arc(Ag_1, Ag_2)) := Str.n(Ag_2)^{Ag_1}
        Potential_Witnesses := Potential_Witnesses ∪ {Ag_2}
      }
    }
  }
}

Algorithm 1
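The trust graph that Algorithm 1 builds, and that Algorithm 2 (below) traverses, can be encoded very simply. The following Java sketch is one possible representation of Definition 3, with nodes weighted by trust values and arcs weighted by (trust value, interaction count) pairs; the class and method names are illustrative and do not reproduce the Java structure used in the prototype.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TrustGraph {
    // Node weights: an agent's trustworthiness from the source agent's point of view.
    final Map<String, Double> nodeWeight = new HashMap<>();
    // Arc weights: for an arc (from, to), the pair {trust value, number of interactions}.
    final Map<String, Map<String, double[]>> outgoing = new HashMap<>();

    void addNode(String agent) {
        outgoing.putIfAbsent(agent, new HashMap<>());
    }

    void addArc(String from, String to, double trust, int interactions) {
        addNode(from);
        addNode(to);
        outgoing.get(from).put(to, new double[] { trust, interactions });
    }

    // Agents that know 'agent', i.e. the sources of its incoming arcs;
    // these are the nodes that must be evaluated before 'agent' itself.
    Set<String> predecessors(String agent) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Map<String, double[]>> e : outgoing.entrySet()) {
            if (e.getValue().containsKey(agent)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}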

Evaluate-Node(Ag_y)
{
  ∀ Arc(Ag_x, Ag_y)
  {
    If Node(Ag_x) is not evaluated Then Evaluate-Node(Ag_x)
  }
  m1 := 0, m2 := 0
  ∀ Arc(Ag_x, Ag_y)
  {
    m1 := m1 + Weight(Node(Ag_x)) * Weight(Arc(Ag_x, Ag_y))
    m2 := m2 + Weight(Node(Ag_x))
  }
  Weight(Node(Ag_y)) := m1 / m2
}

Algorithm 2

Correctness of Algorithm 2: The trustworthiness combination formula (Equation 5) is used to evaluate the graph nodes. The weight of each node indicates the trustworthiness value of the agent represented by the node. Such a weight is assessed using the weights of the adjacent nodes. For example, let Arc(Ag_x, Ag_y) be an arc in the graph; before evaluating Ag_y it is necessary to evaluate Ag_x. Consequently, the evaluation algorithm is recursive. The algorithm terminates because the nodes of the set Confidence(Ag_a) are already evaluated by Algorithm 1. Since the evaluation is done recursively, the call of this algorithm in the main program has the agent Ag_b as parameter.

Complexity Analysis. Our trustworthiness model is based on the construction of a trust graph and on a recursive call to the function Evaluate-Node(Ag_y) to assess the weight of all the nodes. Since each node is visited exactly once, there are n recursive calls, where n is the number of nodes in the graph. To assess the weight of a node we need the weights of its neighboring nodes and the weights of the input edges. Thus, the algorithm takes time in O(n) for the recursive calls and time in O(a) to assess the agents' trustworthiness, where a is the number of edges. The run time of the trustworthiness algorithm is therefore in O(max(a, n)), i.e. linear in the size of the graph. Consequently, our algorithm is an efficient one.

4 Implementation

In this section we describe the implementation of our negotiation dialogue game framework and the trustworthiness model using the Jack™ platform (The Agent Oriented Software Group, 2004). We select this language for three main reasons:
1- It is an agent-oriented language offering a framework for multi-agent system development. This framework can support different agent models.
2- It is built on top of and fully integrated with the Java programming language. It includes all components of Java and it offers specific extensions to implement agents' behaviors.
3- It supports logical variables and cursors.

A cursor is a representation of the results of a query. It is an enumerator which provides query result enumeration by means of rebinding the logical variables used in the query. These features are particularly helpful when querying the state of an agent's beliefs. Their semantics is mid-way between that of logic programming languages (with the addition of Java-style type checking) and that of embedded SQL.

4.1 General Architecture

Our system consists of two types of agents: negotiating agents and trust model agents. These agents are implemented as Jack™ agents, i.e. they inherit from the basic Jack™ Agent class. Negotiating agents are agents that take part in the negotiation protocol. Trust model agents are agents that can inform an agent about the trustworthiness of another agent (Fig. 3). Agents must have knowledge and argumentation systems. Agents' knowledge is implemented using Jack™ data structures called beliefsets. The argumentation systems are implemented as Java modules using a logical programming paradigm. These modules use the agents' beliefsets to build arguments for or against certain propositional formulae. The actions that agents perform on commitments or on their contents are programmed as events. When an agent receives such an event, it seeks a plan to handle it.

[Fig. 3. The general architecture of the system]

The trustworthiness model is implemented using the same principle (events + plans). The requests sent by an agent about the trustworthiness of another agent are events, and the evaluations of agents' trustworthiness are programmed in plans. The trust graph is implemented as a Java data structure (oriented graph). As Java classes, negotiating agents and trust model agents have private data called Belief Data. For example, the different commitments and arguments that are made and manipulated are given by a data structure called CAN, implemented using tables, and the different actions expected by an agent in the context of a particular negotiation game are given by a data structure (table) called data_expected_actions. The trustworthiness values that an agent holds about other agents are recorded in a data structure (table) called data_trust. These data and their types are given in Fig. 4 and Fig. 5.

18 Vol. IV, No. II, pp Fig. 4. Belief Data used in our prototype 4.2 Implementation of the Trustworthiness Model The trustworthiness model is implemented by agents of type: trust model agent. Each agent of this type has a knowledge base implemented using JackTM beliefsets. This knowledge base, called table_trust, has the following structure: Agent_name, Agent_trust, and Interaction_number. Thus, each agent has information on other agents about their trustworthiness and the number of times that he interacted with them. The visited agents during the evaluation process and the agents added in the trust graph are recorded in two JackTM beliefsets called: table_visited_agents and table_graph_trust. The two limits used in Algorithm 1 (Limit_Nbr_Visited_Agents and Limit_Nbr_Witnesses) and the trustworthiness threshold w are passed as parameters to the JackTM constructor of the original agent Aga that seeks to know if his interlocutor Agb is trustworthy or not. This original agent is a negotiating agent. Jamal Bentahar, John-Jules Ch. Meyer 13
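As a rough illustration of the table_trust structure just described, the following Java sketch shows the shape of one entry and how an agent might select its confidence agents with respect to the threshold w. The class names and the selection method are assumptions made for this example; the actual prototype stores this information in a Jack beliefset rather than in plain Java collections.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TrustEntry {
    final String agentName;        // Agent_name
    final double agentTrust;       // Agent_trust
    final int interactionNumber;   // Interaction_number

    TrustEntry(String agentName, double agentTrust, int interactionNumber) {
        this.agentName = agentName;
        this.agentTrust = agentTrust;
        this.interactionNumber = interactionNumber;
    }
}

final class TrustBase {
    final List<TrustEntry> entries = new ArrayList<>();

    // Confidence agents: known agents whose trust reaches the threshold w,
    // sorted so that the most trusted agents are consulted first.
    List<TrustEntry> confidenceAgents(double w) {
        List<TrustEntry> result = new ArrayList<>();
        for (TrustEntry e : entries) {
            if (e.agentTrust >= w) {
                result.add(e);
            }
        }
        result.sort(Comparator.comparingDouble((TrustEntry e) -> e.agentTrust).reversed());
        return result;
    }
}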

19 Vol. IV, No. II, pp Fig. 5. Beliefsets used in our prototype The main steps of the evaluation process of Ag b s trustworthiness are implemented as follows: 1- By respecting the two limits and the threshold w, Ag a consults his knowledge base data_trust of type table_trust and sends a request to his confidence agents Ag i (i = 1,.., n) about Ag b s trustworthiness. The Jack TM primitive Send makes it possible to send the request as a Jack TM message that we call Ask_Trust of MessageEvent type. Ag a sends this request starting by confidence agents whose trustworthiness value is highest. 2- In order to answer the Ag a s request, each agent Ag i executes a Jack TM plan instance that we call Plan_ev_Ask_Trust. Thus, using his knowledge base, each agent Ag i offers to Ag a an Ag b s trustworthiness value if Ag b is known by Ag i. If not, Ag i proposes a set of confidence agents from his point of view, with their trustworthiness values and the number of times that he interacted with them. In the first case, Ag i sends to Ag a a Jack TM message that we call Trust_Value. In the second case, Ag i sends a message that we call Confidence_Agent. These two messages are of type MessageEvent. 3- When Ag a receives the Trust_Value message, he executes a plan: Plan_ev_Trust_Value. According to this plan, Ag a adds to a graph structure called graph_data_trust two information: 1) the agent Ag i and his trustworthiness value as graph node; 2) the trustworthiness value that Ag i offers for Ag b and the number of times that Ag i interacted with Ag b as arc relating the node Ag i and the node Ag b. This first part Jamal Bentahar, John-Jules Ch. Meyer 14

20 Vol. IV, No. II, pp of the trust graph is recorded until the end of the evaluation process of Ag b s trustworthiness. When Ag a receives the Confidence_Agent message, he executes another plan: Plan_ev_Confidence_Agent. According to this plan, Ag a adds to another graph structure: graph_data_trust_sub_level three information for each Ag i agent: 1) the agent Ag i and his trustworthiness value as a sub-graph node; 2) the nodes Ag j representing the agents proposed by Ag i ; 3) For each agent Ag j, the trustworthiness value that Ag i assigns to Ag j and the number of times that Ag i interacted with Ag j as arc between Ag i and Ag j. This information that constitutes a sub-graph of the trust graph will be used to evaluate Ag j s trustworthiness values using Equation 5. These values are recorded in a new structure: new_data_trust. Thus, the structure graph_data_trust_sub_level releases the memory once Ag j s trustworthiness values are evaluated. This technique allows us to decrease the space complexity of our algorithm. 4- Steps 1, 2, and 3 are applied again by substituting data_trust by new_data_trust, until all the consulted agents offer a trustworthiness value for Ag b or until one of the two limits (Limit_Nbr_Visited_Agents or Limit_Nbr_Witnesses) is reached. 5- Evaluate the Ag b s trustworthiness value using the information recorded in the structure graph_data_trust by applying Equation 5. The different events and plans implementing our trustworthiness model and the negotiating agent constructor are illustrated by Fig. 6. Fig. 7 illustrates an example generated by our prototype of the process allowing an agent Ag 1 to assess the trustworthiness of another agent Ag 2. In this example, Ag 2 is considered trustworthy by Ag 1 because its trustworthiness value (0.79) is higher than the threshold (0.7). 4.3 Implementation of the Negotiation Dialogue Games In our system, agents knowledge bases contain propositional formulae and arguments. These knowledge bases are implemented as Jack TM beliefsets. Beliefsets are used to maintain an agent s beliefs about the world. These beliefs are represented in a first order logic and tuple-based relational model. The logical consistency of the beliefs contained in a beliefset is automatically maintained. The advantage of using beliefsets over normal Java data structures is that beliefsets have been specifically designed to work within the agent-oriented paradigm. Our knowledge bases (KBs) contain two types of information: arguments and beliefs. Arguments have the form ([Support], Conclusion), where Support is a set of propositional formulae and Conclusion is a propositional formula. Beliefs have the form ([Belief], Belief) i.e. Support and Conclusion are identical. The meaning of the propositional formulae (i.e. the ontology) is recorded in a beliefset called table_ontology whose access is shared between the two agents. This beliefset has two fields: Proposition and Meaning. Jamal Bentahar, John-Jules Ch. Meyer 15

21 Vol. IV, No. II, pp Fig. 6. Events, plans and the conversational agent constructor implementing the trustworthiness model Agent communication is done by sending and receiving messages. These messages are events that extend the basic Jack TM event: MessageEvent class. MessageEvents represent events that are used to communicate with other agents. Whenever an agent needs to send a message to another agent, this information is packaged and sent as a MessageEvent. A MessageEvent can be sent using the primitive: Send(Destination, Message). Our negotiation dialogue games are implemented as a set of events (MessageEvents) and plans. A plan describes a sequence of actions that an agent can perform when an event occurs. Whenever an event is posted and an agent chooses a task to handle it, the first thing the agent does is to try to find a plan to handle the event. Plans are reasoning methods describing what an agent should do when a given event occurs. Each dialogue game corresponds to an event and a plan. These games are not implemented within the agents program, but as event classes and plan classes that are external to agents. Thus, each negotiating agent can instantiate these classes. An agent Ag 1 starts a dialogue game by generating an event and by sending it to his interlocutor Ag 2. Ag 2 executes the plan corresponding to the received event and answers by generating another event and by sending it to Ag 1. Consequently, the two agents can communicate by using the same protocol since they can instantiate the same classes Jamal Bentahar, John-Jules Ch. Meyer 16

22 Vol. IV, No. II, pp representing the events and the plans. For example, the event Event_Attack_Commitment and the plan Plan_ev_Attack_commitment implement the Attack game. The architecture of our negotiating agents is illustrated in Fig. 8. Fig. 7. The screen shot of a trustworthiness evaluation process 5 Related Work Recently, some online trust models have been developed (see [20] for a detailed survey). The most widely used are those on ebay and Amazon Auctions. Both of these are implemented as a centralized trust system so that their users can rate and learn about each other s reputation. For example, on ebay, trust values (or ratings) are +1, 0, or 1 and user, after an interaction, can rate its partner. The ratings are stored centrally and summed up to give an overall rating. Thus, reputation in these models is a global single value. However, the model can be unreliable, particularly when some buyers do not return ratings. In addition, these models are not suitable for applications in open MAS Jamal Bentahar, John-Jules Ch. Meyer 17

such as agent negotiation, because they are too simple in terms of their trust rating values and the way they are aggregated.

[Fig. 8. The architecture of the negotiating agents]

Another centralized approach, called SPORAS, has been proposed by Zacharia and Maes [7]. SPORAS does not store all the trust values, but rather updates the global reputation value of an agent according to its most recent rating. The model uses a learning function for the updating process so that the reputation value can reflect an agent's trust. In addition, it introduces a reliability measure based on the standard deviations of the trust values. However, unlike our model, SPORAS deals with all ratings equally without considering the different trust degrees. Consequently, it suffers from rating noise. In addition, like eBay, SPORAS is a centralized approach, so it is not suitable for open negotiation systems.

Broadly speaking, there are three main approaches to trust in open multi-agent systems. The first approach is built on an agent's direct experience of an interaction partner. The second approach uses information provided by other agents [2, 3, 4]. The third approach uses certified information provided by referees [9, 19]. In the first approach, methods by which agents can learn and make decisions to deal with trustworthy or untrustworthy agents should be considered. In the models based on the second and the third approaches, agents should be able to reliably acquire and reason about the transmitted information. In the third approach, agents should provide third-party referees to witness about their previous performance. Because the first approach is only based on a history of interactions, the resulting models are poor, because agents with no prior interaction histories could trust dishonest agents until a sufficient number of interactions is built.

Sabater [13] proposes a decentralized trust model called Regret. Unlike the first-approach models, Regret uses an evaluation technique not only based on an agent's direct experience of its partners' reliability, but also on a witness reputation component. In addition, trust values (called ratings) are dealt with according to their recency relevance. Thus, old ratings are given less importance compared to new ones.

24 Vol. IV, No. II, pp However, unlike our model, Regret does not show how witnesses can be located, and thus, this component is of limited use. In addition, this model does not deal with the possibility that an agent may lie about its rating of another agent, and because the ratings are simply equally summed, the technique can be sensitive to noise. In our model, this issue is managed by considering the witnesses trust and because our merging method takes into account the proportional relevance of each trustworthiness value, rather than treating them equally (see Equation 6 Section III.B) Yu and Singh [2, 3, 4] propose an approach based on social networks in which agents, acting as witnesses, can transmit information about each other. The purpose is to tackle the problem of retrieving ratings from a social network through the use of referrals. Referrals are pointers to other sources of information similar to links that a search engine would plough through to obtain a Web page. Through referrals, an agent can provide another agent with alternative sources of information about a potential interaction partner. The social network is presented using a referral network called TrustNet. The trust graph we propose in this paper is similar to TrustNet, however there are several differences between our approach and Yu and Singh s approach. Unlike Yu and Singh s approach in which agents do not use any particular reasoning, our approach is conceived to secure argumentation-based negotiation in which agents use an argumentation-based reasoning. In addition, Yu and Singh do not consider the possibility that an agent may lie about its rating of another agent. They assume all witnesses are totally honest. However, this problem of inaccurate reports is considered in our approach by taking into account the trust of all the agents in the trust graph, particularly the witnesses. Also, unlike our model, Yu and Singh s model do not treat the timely relevance information and all ratings are dealt with equally. Consequently, this approach cannot manage the situation where the agents behavior changes. Huynh, Jennings, and Shadbot [19] tackle the problem of collecting the required information by the evaluator itself to assess the trust of its partner, called the target. The problem is due to the fact that the models based on witness implicitly assume that witnesses are willing to share their experiences. For this reason, they propose an approach, called certified reputation, based not only on direct and indirect experiences, but also on third-party references provided by the target agent itself. The idea is that the target agent can present arguments about its reputation. These arguments are references produced by the agents that have interacted with the target agents certifying its credibility (the model proposed by Maximilien and Singh [5] uses the same idea). This approach has the advantage of quickly producing an assessment of the target s trust because it only needs a small number of interactions and it does not require the construction of a trust graph. However, this approach has some serious limitations. Because the referees are proposed by the target agent, this agent can provide only referees that will give positive ratings about it and avoid other referees, probably more credible than the provided ones. Even if the provided agents are credible, their witness could not reflect the real picture of the target s honesty. 
This approach can privilege opportunistic agents, which are agents only credible with potential referees. For all these reasons, this approach is not suitable for trusting negotiating agents. In addition, in this approach, the evaluator agent should be able to evaluate the honesty of the referees using a witness-based model. Consequently, a trust graph like the one proposed in this paper could be used. This means that, in some situations, the target s trust might not be assessed without asking for witness agents. Jamal Bentahar, John-Jules Ch. Meyer 19

25 Vol. IV, No. II, pp Conclusion The contribution of this paper is the proposition and the implementation of a new probabilistic model to trust argumentation-based negotiating agents. The purpose of such a model is to provide a secure environment for agent negotiation within multi-agent systems. To our knowledge, this paper is the first work addressing the security issue of argumentation-based negotiation in multi-agent settings. Our model has the advantage of being computationally efficient and of gathering four most important factors: (1) the trustworthiness of confidence agents; (2) the target s trustworthiness according to the point of view of confidence agents; (3) the number of interactions between confidence agents and the target agent; and (4) the timely relevance of information transmitted by confidence agents. The resulting model allows us to produce a comprehensive assessment of the agents credibility in an argumentation-based negotiation setting. Acknowledgements We would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC), le fonds québécois de la recherche sur la nature et les technologies (NATEQ), and le fonds québécois de la recherche sur la société et la culture (FQRSC) for their financial support. The first author is also supported in part by Concordia University, Faculty of Engineering and Computer Science (Start-up Grant). Also, we would like to thank the three anonymous reviewers for their interesting comments and suggestions. References [1] A. Abdul-Rahman, and S. Hailes. Supporting trust in virtual communities. In Proceedings of the 33rd Hawaii International Conference on System Sciences, 6, IEEE Computer Society Press [2] B. Yu, and M. P. Singh. An evidential model of distributed reputation management. In Proceedings of the First International Joint Conference on Autonomous Agents and Multi- Agent Systems. ACM Press, pages , [3] B. Yu, and M. P. Singh. Detecting deception in reputation management. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multi-Agent Systems. ACM Press, pages 73-80, [4] B. Yu, and M. P. Singh. Searching social networks. In Proceedings of the second International Joint Conference on Autonomous Agents and Multi-Agent Systems. ACM Press, pp , [5] E. M. Maximilien, and M. P. Singh. Reputation and endorsement for web services. ACM SIGEcom Exchanges, 3(1):24-31, [6] F. Sadri, F. Toni, and P. Torroni. Dialogues for negotiation: agent varieties and dialogue sequences. In Proceedings of the International workshop on Agents, Theories, Architectures and Languages. Lecture Notes in Artificial Intelligence (2333): , [7] G. Zacharia, and P. Maes. Trust management through reputation mechanisms. Applied Artificial Intelligence, 14(9): , [8] H. Prakken. Relating protocols for dynamic dispute with logics for defeasible argumentation. In Synthese (127): , [9] H. Skogsrud, B. Benatallah, and F. Casati. Model-driven trust negotiation for web services. IEEE Internet Computing, 7(6):45-52, Jamal Bentahar, John-Jules Ch. Meyer 20



Protocol Management Systems as a Middleware for Inter-Organizational Workflow Coordination

Eric ANDONOFF, Wassim BOUAZIZ, Chihab HANACHI
IRIT/UT1, 1 Place Anatole France, Toulouse Cedex, France
{Eric.Andonoff, Wassim.Bouaziz, Chihab.Hanachi}@univ-tlse1.fr

Abstract

Interaction protocols are well identifiable and recurrent in Inter-Organizational Workflow (IOW): they notably support finding partners, negotiation and contract establishment between partners. It is therefore useful to isolate these interaction protocols in order to better study, design and implement them as specific entities, so as to allow the different Workflow Management Systems (WfMS) involved in an IOW to share protocols and reuse them at run-time. Consequently, our aim in this paper is to propose a Protocol Management System (PMS) architecture as a middleware to support the design, the selection and the enactment of protocols on behalf of an IOW system. The paper then gives a protocol meta-model on top of which this PMS should be built. Finally, it presents a partial implementation of such a PMS combining agent and semantic Web technologies. While agent technology eases the cooperation between the IOW components, semantic Web technology supports the definition, the sharing and the selection of protocols.

Keywords: Inter-Organizational Workflow, Protocol, Protocol Management System, Agent Technology.

1 Introduction

Inter-Organizational Workflow (IOW) is essential given the growing need for organizations to cooperate and coordinate their activities in order to meet the new demands of highly dynamic and open markets. The different organizations involved in such cooperation must correlate their respective resources and skills, and coordinate their respective business processes towards a common goal corresponding to a value-added service [1], [2]. A fundamental issue for IOW is the coordination of these different distributed, heterogeneous and autonomous business processes, in order both to support semantic interoperability between the participating processes and to efficiently synchronize their distributed execution. Coordination in IOW raises several problems such as: (i) the definition of the universe of discourse, without which it would not be possible to solve the various semantic conflicts that are bound to occur between several autonomous and
heterogeneous workflows, (ii) finding partners able to realize a business/workflow process, (iii) the negotiation of a workflow process between partners according to criteria such as due time, price, quality of service, visibility of the process evolution or way of doing it, (iv) the signature of contracts between partners, and (v) the synchronization of the distributed and concurrent execution of these different processes.

Today, organizations are shifting from a tight/static case of cooperation (e.g. virtual enterprises) to a loose/dynamic case (e.g. e-commerce) where dynamic relations and alliances are established between organizations. IOW coordination has been widely studied for the static case, investigating issues concerning the formal specification of workflow interactions [1], [2], interoperability [3], finding partners [4] and contract specification [4], [5]. Conversely, the dynamic case has been less examined, and tools developed for the static case cannot be straightforwardly adapted. Indeed, the context in which loose IOW is deployed has three main specific and additional features:
- Flexibility, which means that cooperation should be free from structural constraints in order to maintain organizations' autonomy, i.e. the ability to decide by themselves the conditions of the cooperation: when, how and with whom.
- Openness, which means that the set of partners involved in an IOW can evolve through time, and that it is not necessarily fixed a priori but may be dynamically decided at run-time in an opportunistic way.
- Scalability, mainly in the context of the Internet, which increases the complexity of IOW coordination: its design, its enactment and its efficiency.

Therefore, IOW coordination must be revisited and adapted to this highly dynamic context, notably finding partners, negotiation between partners, and contract enactment and monitoring. Besides, new issues must be considered, such as the definition of mechanisms for business process specification, discovery and matching.

This paper is based on the observation that, whatever the coordination problem considered in loose IOW, it follows a recurrent schema. After an informal interaction, the participating partners are committed to follow a strict interaction protocol. This protocol rules the conversation by a set of laws which constrain the behavior of the participating partners, assigns roles to each of them, and therefore organizes their cooperation. Since interaction protocols constitute well identifiable and recurrent coordination patterns in loose IOW, it is useful to isolate them in order to better study, design and implement them as first-class entities, so as to allow the different Workflow Management Systems (WfMS) involved in a loose IOW to share them and reuse them at run-time [6]. Following this abstraction implies the application of the principle of separation of concerns, which allows the separation of the individual and intrinsic capabilities of each workflow system from what relates to loose IOW coordination. This principle of separation of concerns is widely recognized as a good design practice from a software engineering point of view [7] and has led to the advent of new technologies in Information Systems, as discussed in [8].
Indeed, [8] explains how data, user interfaces and, more recently, business processes have been pushed out of applications, leading to specific software to handle them (respectively Database Management Systems, User Interface Management Systems and WfMS). Following this perspective, we argue that interaction protocols have to be pushed out of IOW applications, and that a Protocol Management System has to be defined. Hence our objective is to specify a Protocol Management System (PMS) to support interaction protocol-based coordination in loose IOW. Such a PMS provides a loose IOW system with the following three services: the description of useful interaction protocols for IOW coordination, their selection and their execution.
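The three services just listed can be pictured as the programming contract that a PMS offers to a workflow system. The Java sketch below merely illustrates such a contract; the interface and type names (ProtocolManagementSystem, ProtocolSpec, Conversation) are assumptions made for the example and do not denote an API defined in this paper.

import java.util.List;
import java.util.Map;

// Illustrative contract of the three PMS services: design, selection and execution.
interface ProtocolManagementSystem {

    // Design: register a new interaction protocol specification in the protocol ontology.
    void registerProtocol(ProtocolSpec spec);

    // Selection: return the protocols matching a coordination objective
    // (finding partners, negotiation, contracting) and additional criteria.
    List<ProtocolSpec> selectProtocols(String objective, Map<String, Object> criteria);

    // Execution: create a moderator playing the chosen protocol and open a conversation.
    Conversation startConversation(ProtocolSpec chosen, String initiatorAddress);
}

// Minimal placeholder types so that the sketch is self-contained.
record ProtocolSpec(String name, String objective, Map<String, Object> attributes) { }

record Conversation(String id, String moderatorAddress, ProtocolSpec protocol) { }

class PmsContractDemo {
    public static void main(String[] args) {
        ProtocolSpec matchmaker = new ProtocolSpec(
            "Matchmaker", "FindingPartner", Map.of("P2PExecution", true));
        System.out.println(matchmaker);   // the kind of specification a requester could submit
    }
}

As described in the next section, a workflow system would not call such an interface directly: it only needs to know the address of the Message Dispatcher, which routes its requests to the agents implementing these services.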

This paper also relies on the exploitation of agent and semantic Web approaches, viewed as enabling technologies to deal with the computing context in which IOW is deployed. The agent approach brings technical solutions and abstractions to deal with distribution, autonomy and openness [9], which are inherent to loose IOW. This approach also provides organizational concepts, such as groups, roles and commitments, which are useful to structure and rule at a macro level the coordination of the different partners involved in a loose IOW [10]. Using this technology, we also inherit numerous concrete solutions to deal with coordination in multi-agent systems [11] (middleware components, sophisticated interaction protocols). The semantic Web approach facilitates communication and semantic interoperability between organizations involved in a loose IOW. It provides means to describe and access a common business vocabulary and shared interaction protocols.

The contribution of this paper is fourfold. First, this paper defines a multi-agent PMS architecture and shows how this architecture can be connected to any WfMS whose architecture is compliant with the WfMC reference architecture. Second, it proposes an organizational model, an instance of the Agent Group Role meta-model [12], to structure and rule the interactions between the components of the PMS and WfMS architectures. Third, it provides a protocol meta-model specified with OWL to constitute a shared ontology of coordination protocols. This meta-model contains the information necessary to select an appropriate interaction protocol at run-time according to the current coordination problem to be dealt with. This meta-model is then refined to integrate a classification of interaction protocols devoted to loose IOW coordination. Fourth, this paper presents a partial implementation of this work, limited to a matchmaker protocol useful for dealing with the partner-finding problem.

The remainder of this paper is organized as follows. Section 2 presents the PMS architecture, stating the role of its components and explaining how they interact with each other. Section 3 shows how to implement any WfMS engine connectable to a PMS, while remaining compliant with the WfMC reference architecture. For reasons related to homogeneity and flexibility, this engine is also provided with an agent-based architecture. Section 4 gives an organizational model that structures and rules the communication between the different agents involved in a loose IOW. Section 5 addresses engineering issues for protocols. It presents the protocol meta-model and also identifies, among multi-agent system interaction protocols, the ones which are appropriate for the loose IOW context. Section 6 gives a brief overview of the implementation of the matchmaker protocol to deal with the partner-finding problem. Finally, section 7 compares our contribution to related works and concludes the paper.

2 The Protocol Management System Architecture

The Protocol Management System (PMS) follows a multi-agent architecture represented in figure 1. The PMS is composed of persistent agents (represented by rectangles) and dynamic ones (represented by ellipses) created and killed at run-time. The PMS architecture consists of two blocks of components, each one providing specific services.
The left-hand side block supports the design and selection of interaction/coordination protocols appropriate for loose IOW coordination, while the right-hand side block supports the execution of the selected protocols.

The Protocol Design and Selection block is organized around three agents (Protocol Design Agent, Protocol Selection Agent and Protocol Launcher Agent) and two knowledge sources (Domain Ontology and Coordination Protocol Ontology) described below.

The Protocol Design Agent (PDA) is an agent that is responsible for the specification of protocols. It proposes tools to allow users to graphically or textually
specify protocols, i.e. their control structures, the actors involved in them and the information necessary for their execution.

The Protocol Selection Agent (PSA) is an agent whose aim is to help a WfMS requester to select the most appropriate coordination protocol according to the objective of the conversation to be initiated (finding partners, negotiation between partners...) and some specific criteria (quality of service, due time...) depending on the type of the chosen coordination protocol. These criteria will be presented in section 4.

The Protocol Launcher Agent (PLA) is an agent that creates and launches agents called Moderators that will play protocols.

The Domain Ontology source structures and records different business taxonomies to solve the semantic interoperability problems that are bound to occur in a loose IOW context. Indeed, organizations involved in such an IOW must adopt a shared business view through a common terminology before starting their cooperation.

The Coordination Protocol Ontology source describes and records protocol descriptions, which may be queried by the PSA agent or used as models of behavior by the PLA agent when creating moderators.

[Figure 1 depicts the PMS architecture: the Protocol Design and Selection block (Protocol Design Agent, Protocol Selection Agent, Protocol Launcher Agent, with the Domain Ontology and Coordination Protocol Ontology sources) and the Protocol Execution block (Moderators with their Communication Act databases, and the Conversation Server with the Conversation database), connected through the Message Dispatcher and the Agent Communication Channel.]
Figure 1: PMS Architecture

The Protocol Execution block is composed of two types of agents: the conversation server and as many moderators as the number of conversations in progress. It exploits the Domain Ontology source, maintains the Conversation database and handles a Communication Act database for each moderator. We now describe the two types of agents and how they interact with these databases and knowledge sources.

Each moderator manages a single conversation which conforms to a given coordination protocol, and a moderator has the same lifetime as the conversation it manages. A moderator grants roles to agents and ensures that any communication act that takes place in its conversation is compliant with the protocol's rules. It also records all the communication acts within its conversation in the Communication Act database. A moderator also exploits the Domain Ontology source to interact with the agents involved in its conversation using an adequate vocabulary.

The Conversation Server publishes and makes accessible global information about each current conversation (such as its protocol, the identity of the moderator supervising the conversation, the date of its creation, the requester initiating the conversation and the participants involved in it). This information is stored in the Conversation database. By allowing participants to get information about
the current (and past) conversations and to be notified of new conversations [6], the Conversation Server makes the interaction space explicit and available. This interaction space may be public or private according to the policies followed by moderators. A database-oriented view mechanism may be used to specify what is public or private, and for whom.

In addition to the two main blocks of the PMS, two other components are needed: the Message Dispatcher and the Agent Communication Channel. The Message Dispatcher is an agent that interfaces the PMS with any WfMS. Thus, any WfMS intending to invoke a PMS's service only needs to know the Message Dispatcher's address. Finally, the Agent Communication Channel is an agent that supports the transport of interaction messages between all the agents of the PMS architecture.

3 Connection of a Workflow Management System to the PMS Architecture

This section first presents the Workflow Management Coalition (WfMC) reference architecture and then explains why this architecture is insufficient to support the connection to the PMS architecture. It finally explains how we revisit the reference architecture with agents in order to support this connection.

3.1 Insufficiency of the Reference Architecture

The reference architecture proposed by the WfMC [13] is defined by giving the role of its software components and by specifying how they interact. The main component of this architecture is the Workflow Enactment Service (WES), which manages the execution of workflow processes and which interacts, on the one hand, with workflow definition, execution and monitoring components and, on the other hand, with external WESs. The five interfaces supporting the communication between the different components are called Workflow APIs (WAPI). These interfaces are:
- Interface 1 with Process Definition Tools,
- Interface 2 with Workflow Client Applications,
- Interface 3 with Invoked Applications,
- Interface 4 with other WESs,
- Interface 5 with Administration and Monitoring Tools.

It is relevant to be compliant with the reference architecture in order to ensure the adaptability of our solution. Unfortunately, despite its advantages, this architecture is insufficient in our context for two main reasons. First, in loose IOW the WES must not only manage the execution of workflow process instances but should also drive different concurrent activities such as, of course, process instance execution, but also finding partners, negotiation between partners, signature of contracts and cooperation with other WESs, as workflow client or server. Second, interfaces 3 and 4 are not appropriate for interacting with the PMS.

3.2 Revisiting the Reference Architecture

Consequently, we revisit this WfMC reference architecture, and more precisely its WES, using agent technology and introducing a new interface and a specific component to support the connection with the PMS. However, our proposal remains compliant with the WfMC architecture since the existing WAPIs (1 to 5) are not modified.

We have chosen agent technology for several reasons. First, as mentioned in the introduction, this technology is well suited to the loose IOW context since it provides natural abstractions to deal with distribution, heterogeneity and autonomy, which are inherent to this context. Therefore, each organization involved in a loose IOW may be seen as an autonomous agent whose mission is to coordinate with other workflow agents, acting on behalf of its organization. Second, as agent technology is also the basis of the PMS architecture, it will be easier and more homogeneous, using agent coordination techniques, to structure and rule the interaction between the agents of both the PMS and WfMS architectures. Third, as defended in [14], [15], [16], the use of this technology gives greater flexibility to the modeled workflow processes: the agents implementing them can easily adapt to their specific requirements.

Figure 2 presents the agent-based architecture we propose for the WES. This architecture includes: (i) as many agents, called workflow agents, as the number of workflow process instances currently in progress, (ii) an agent manager in charge of these agents, (iii) a connection server and a new interface, interface 6, that help workflow agents to solicit the PMS for coordination purposes, and finally (iv) an agent communication channel to support the interaction between these agents.

Regarding the Workflow Agents, the idea is to implement each instance of a workflow process (stored in the Workflow Process Database) as a software process, and to encapsulate this process within an agent. Such a Workflow Agent includes a workflow engine that, as the workflow process instance progresses, reads the workflow definition and triggers the action(s) to be performed according to its current state. This Workflow Agent supports interface 3 with the applications that are used to perform the pieces of work associated with process tasks.

[Figure 2 shows the revisited Workflow Enactment Service: Workflow Agents (each supporting interfaces 3 and 4) controlled by an Agent Manager (supporting interfaces 1, 2 and 5), a Connection Server with its Knowledge Database exposing the new interface 6, the Workflow Process Database, and an Agent Communication Channel linking these components.]
Figure 2: Workflow Enactment Service Revisited

The Agent Manager controls and monitors the running of Workflow Agents:
- Upon a request for a new instance of a workflow process, the Agent Manager creates a new instance of the corresponding Workflow Agent type, initializes its parameters according to the context, and launches the running of its workflow engine.
- It ensures the persistency of Workflow Agents that execute long-term business processes, in which task performances are interleaved with periods of inactivity.
- It coordinates Workflow Agents in their use of the local shared resources.
- It assumes interfaces 1, 2 and 5 of the WfMS.

In the loose IOW context, workflow agents need to find external workflow agents running in other organizations and able to contribute to the achievement of their goal. Connecting them requires finding, negotiation and contracting capacities, but also
maintaining knowledge about the resources of the environment. The role of the Connection Server is to manage this knowledge (stored in the Knowledge database) and to help agents connect to the partners they need. To do this, the connection server interacts with the PMS and other WESs using a new interface, interface 6. For instance, this interface supports the communication between the connection server of a WES and a moderator agent of the PMS (via the Message Dispatcher agent), but also between the connection servers of two different WESs.

The agent manager and the connection server relieve workflow agents of technical tasks concerning relations with their internal and external environments. Each agent, being in charge of a single workflow process instance, can be more easily adapted to the specific requirements of this process. Indeed, each instance of a business process is a specific case featuring distinctive characteristics with regard to its objectives, constraints and environment. Beyond the common generic facilities for supporting flexibility, a workflow agent is provided with two additional capabilities. First, it includes its own definition of the process (what is to be performed is specific to it) and, second, it includes its own engine for interpreting this definition (how to perform it is also specific). Moreover, tailoring an agent to the distinctive features of its workflow process takes place at its instantiation, but also occurs dynamically.

4 Organizational View on the PMS's Interactions

To specify and describe the functioning of the PMS, i.e. how its agents interact among themselves and with the WfMSs' agents, we adopt an organizational view providing macro-level coordination rules. The organizational model structures the communication between the IOW agents and thus highlights the coordination of the different organizations involved in a loose IOW while finding partners, negotiating between partners, etc. For that purpose, we use the Agent Group Role (AGR) meta-model [17], which is a possible framework for defining the organizational dimension of a multi-agent system (MAS) and which is particularly appropriate for loose IOW [10], [15]. The remainder of this section first presents AGR. It then describes how, using this meta-model, we structure and rule the interactions between the agents of the PMS and WfMS architectures. Finally, it gives an AUML sequence diagram that illustrates the exchange of messages between these agents.

4.1 The AGR Meta Model

According to AGR, the organization of a system is defined as a set of related groups, agents and roles. A group is a set of agents that also determines an interaction space: an agent may communicate with another agent only if they belong to the same group. The cohesion of the whole system is maintained by the fact that agents may belong to any number of groups, so that the communication between two groups may be done by agents that belong to both. Each group also defines a set of roles, and the group manager is the specific role fulfilled by the agent that initiates the group. The membership of an agent in a group requires that the group manager authorizes this agent to play some role, and each role determines how the agents playing that role may interact with other agents. So the whole behavior of the system is framed by the structure of the groups that may be created and by the behaviors allowed to agents by the roles.
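The group and role mechanics just described can be illustrated in a few lines of Java. The sketch below is a self-contained simplification of AGR and deliberately does not use the MadKit API, which provides these primitives natively; the class names and the admission rule are our own assumptions.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative AGR mechanics: agents may only interact inside a group,
// and only in the roles the group manager has granted them.
class Group {
    final String name;
    final String manager;                                     // agent that created the group
    final Map<String, Set<String>> roles = new HashMap<>();   // agent -> roles held in this group

    Group(String name, String manager) {
        this.name = name;
        this.manager = manager;
        roles.put(manager, new HashSet<>(Set.of("group-manager")));
    }

    // Membership requires an authorization decision by the group manager.
    boolean requestRole(String agent, String role, boolean managerAuthorizes) {
        if (!managerAuthorizes) return false;
        roles.computeIfAbsent(agent, a -> new HashSet<>()).add(role);
        return true;
    }

    // Two agents may talk only if both belong to this group.
    boolean mayCommunicate(String a, String b) {
        return roles.containsKey(a) && roles.containsKey(b);
    }
}

public class AgrDemo {
    public static void main(String[] args) {
        Group participation = new Group("Requester-Participation", "Requester-Connection-Server");
        participation.requestRole("Requester-Workflow-Agent", "requester", true);
        System.out.println(participation.mayCommunicate(
            "Requester-Workflow-Agent", "Requester-Connection-Server"));   // true
        System.out.println(participation.mayCommunicate(
            "Requester-Workflow-Agent", "Provider-Agent-Manager"));        // false: not in the group
    }
}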
AGR has three interesting advantages in our context [17]. It improves the security of the application being developed, since interactions are organized within a group, which is a private space open only to agents having the capacities and the authorization to enter it. AGR also provides modularity, by organizing the work and
the interaction space in small and manageable units thanks to the notions of role and group. Openness is also facilitated, since AGR imposes no constraint on the internal structure of agents.

4.2 The Organizational Model

This model, as shown in figure 3, is organized around the following components:
- Seven types of groups, represented by ellipses: Participation, Request-Creation-Coordination-Protocol, Coordination-Protocol-Selection, Coordination-Protocol-Execution, Creation-Coordination-Protocol, Request-Participation-Coordination-Protocol and Coordination-Protocol. In this figure, we have two Participation groups (Requester-Participation and Provider-Participation).
- Eight types of agents, represented by candles: Requester-Workflow-Agent, Connection-Server, Provider-Agent-Manager, Message-Dispatcher, Protocol-Selection-Agent, Protocol-Launcher-Agent, Moderator and Conversation-Server. In this figure, we have two Connection-Server agents (Requester-Connection-Server and Provider-Connection-Server).
- Nineteen roles, since each agent plays a specific role within each group.

The communication between agents belonging to the different groups corresponds either to internal communications supported by the agent communication channel (thin arrows) or to external communications supported by interface 6 (large arrows).

[Figure 3 shows the organizational model: the groups listed above as ellipses, populated by the Requester-Workflow-Agent, the Requester- and Provider-Connection-Servers, the Provider-Agent-Manager, the Message-Dispatcher, the Protocol-Selection-Agent, the Protocol-Launcher-Agent, the Moderators and the Conversation-Server, with the multiplicities of their memberships.]
Figure 3: The Organizational Model

Let us now detail how each group operates. First, the Requester-Participation group enables a requester workflow agent to solicit its connection server in order to contact the PMS to deal with a coordination problem (finding partners, negotiation between partners...). The Request-Creation-Coordination-Protocol group enables the connection server to forward this request to the message dispatcher agent. The latter then contacts, via the Coordination-Protocol-Selection group, the protocol selection agent that helps the requester workflow agent to select a convenient coordination protocol. The Coordination-Protocol-Execution group then enables the message dispatcher to connect with the Protocol Launcher Agent (PLA) and ask it to create a new conversation. More precisely,
the Creation-Coordination-Protocol group enables the PLA (i) to create a moderator implementing the coordination protocol underlying the new conversation, and (ii) to inform the conversation server of the creation of a new conversation.

It is now possible for either a requester or a provider workflow agent to participate in a coordination protocol or to get information about it. Indeed, the requester and provider workflow agents' connection servers belonging to the Request-Participation-Coordination-Protocol group solicit the message dispatcher to forward, via the Coordination-Protocol group, their request to either the moderator (for instance the submission of a new communication act) or the conversation server (for instance an information request about a conversation).

4.3 Message Exchange Between Agents

The standard FIPA-ACL communication language [18] is used to support the interaction, through message exchange, between the different agents involved in the organizational model. FIPA-ACL offers a convenient set of performatives to deal with the different coordination problems introduced before (for instance agree, cancel, refuse, request, inform, confirm for finding partners, or propose, accept-proposal for negotiation between partners). Moreover, FIPA-ACL supports message exchange between heterogeneous agents since (i) the language used to specify the message content is free and (ii) a message can refer to an ontology. The latter point is very interesting since it is possible, through FIPA-ACL messages, to refer to a domain ontology, which can be used to solve semantic interoperability problems [10].

Figure 4 below illustrates, with an AUML sequence diagram, this message exchange during the creation of a conversation. This sequence diagram only shows the FIPA-ACL interactions between agents belonging to the Requester-Participation, Request-Creation-Coordination-Protocol, Coordination-Protocol-Selection, Coordination-Protocol-Execution and Creation-Coordination-Protocol groups.

[Figure 4 shows the AUML sequence diagram: the Requester Workflow Agent sends request(conversation-creation) to its Connection Server, which forwards it to the Message Dispatcher; the Message Dispatcher obtains candidate protocols from the Protocol Selection Agent (inform/in-reply-to(protocols)), then sends request(protocol-creation) to the Protocol Launcher Agent, which creates the Moderator (confirm(protocol-creation)) and informs the Conversation Server (inform(conversation-creation)); inform/in-reply-to(conversation-creation) messages then flow back to the requester.]
Figure 4: AUML Sequence Diagram illustrating Message Exchange

5 Models for Engineering Protocols

The design, selection and enactment of protocols by the PMS require a precise and non-ambiguous definition of what we call a protocol. In this section, we try to give
such a definition, distinguishing three different abstraction levels for protocol description. We also provide a meta-model for protocols and a protocol classification model taking into account only the interaction protocols devoted to the loose IOW context. These models have been specified with OWL [19] using the Protege-2000 software. In addition to these OWL models, we also give in this paper their equivalent UML models for readability and popularization reasons. Finally, this section shows how the protocol classification can be exploited by the PMS.

5.1 The Three Levels of a Protocol Definition

We distinguish three abstraction levels for protocol description:
- The first level is a concrete or execution level. At this level, we find conversations (occurrences or instances of a protocol) between participating workflow agents, each one playing a role in the conversation. For instance, a requester workflow agent (A) plays the role of manager and evaluates workflow processes offered by several provider workflow agents (B, C, D...), chooses one of them (D) and delegates the workflow process to be done to the winner (D).
- The second level describes protocol specifications (protocols for short) defining the rules to which a class of conversations must obey. Each conversation is an instance of a single protocol specification, but several conversations referring to the same protocol may be running simultaneously. As an example of a protocol specification, we can consider the specification of the Contract Net Protocol (CNP) [20], stating that: (i) CNP involves a single manager and any number of contractors, and the manager cannot be a contractor; (ii) at the beginning the manager, who has a task/workflow to subcontract, submits the specification of the task to the contractor agents, waits for their bids and then awards the contractor having the best bid; and (iii) at the end the task is subcontracted to the winner.
- The third and most abstract level corresponds to the meta-model of a protocol, i.e. the invariant structure shared by all protocols. This paper focuses on this last level, which is independent of any protocol and any target technology. This level is described in the next section.

5.2 Protocol Meta Model

The protocol meta-model is given by figure 5. Figure 5a gives just an extract of this meta-model expressed in OWL, which is the language we have used for its implementation. For readability reasons, figure 5b gives a complete UML representation of this meta-model. In the following, we only comment on this UML representation. This meta-model is built around three core and inter-related concepts: Intervention Types, Roles and Protocols. We describe them in detail below.

Intervention Types abstract elementary conversation acts. An Intervention Type is defined by its name and an action with its input and output parameters, and it also includes intervention behavioral constraints such as its level of priority (with regard to other intervention types) or how many times it may be performed during one conversation. The PreCondition and PostCondition associations represent the requirements that must hold before and be fulfilled after an occurrence of the Intervention Type. They contribute to the statement of the behavioral constraints of the protocol since they define an order relation between interventions, and thus the control structure of the protocol, that is the sequences of interventions that may occur in the course of a conversation.

38 Vol. IV, No. II, pp <owl:class rdf:id="protocol"> <rdfs:subclassof> <owl:restriction> <owl:maxcardinality rdf:datatype= " 1 </owl:maxcardinality> <owl:onproperty> <owl:objectproperty rdf:id="hasterminalstate"/> </owl:onproperty> </owl:restriction> </rdfs:subclassof> <rdfs:subclassof rdf:resource=" <owl:datatypeproperty rdf:id="name"> <rdfs:domain rdf:resource="#protocol"/> <rdfs:range rdf:resource=" </owl:datatypeproperty> <owl:datatypeproperty rdf:id="description"> <rdfs:domain rdf:resource="#protocol"/> <rdfs:range rdf:resource=" </owl:datatypeproperty>... </owl:class> Figure 5a: The Protocol Meta Model: OWL representation 0..* PermissionToPerform 1..* PreCondition InterventionType Name Action Input Output Behavioural const. Role Name Skill Casting const. 0..* TerminalState PostCondition 1..1 State Condition 1..1 Creator * InitialState 1..1 Member 1..1 Protocol Name Description Casting const. Behavioural const. Parameters * Business Domain 1..* Ontology Figure 5b: The Protocol Meta Model: UML representation Eric Andonoff, Wassim Bouaziz, Chihab Hanachi 33

An Intervention Type belongs to the role linked to it, and only agents playing that role may perform it. A Protocol includes a set of member Roles, one of which is allowed to initiate a new conversation following the Protocol, and a set of Intervention Types linked to these Roles. The Business Domain link gives access to all the possible Ontologies to which a Protocol may refer. The InitialState link describes the configuration required to start a conversation, while the TerminalState link establishes a configuration to be reached in order to consider a conversation as completed. A Protocol may include a Description in natural language for documentation purposes, or information about the Protocol at the agents' disposal. The protocol casting constraints attribute records constraints that involve several Roles and cannot be defined with regard to individual Roles, such as the total number of agents that are allowed to take part in a conversation. Similarly, the protocol behavioral constraints attribute records constraints that cannot be defined with regard to individual Intervention Types, such as the total number of interventions that may occur in the course of a conversation. Some of these casting or behavioral constraints can involve Parameters of the Protocol, i.e. properties whose value is not fixed by the Protocol but is proper to each conversation and set by the conversation's initiator.

5.3 Protocol Classification

In addition to the previous meta-model, we also need additional information to better handle protocols. We propose a classification of protocols to distinguish them and to easily select them according to the objective to be reached: finding partners, negotiation between partners or contract establishment support. This classification only takes into account the protocols useful in the IOW context. The appropriateness of these protocols to loose IOW was discussed in a previous paper [15].

[Figure 6 shows the protocol classification: the Protocol class (linked to Business Domain Ontology) is specialized into FindingPartner, Negotiation and Contract; FindingPartner is refined into Matchmaker (with attributes P2PExecution, ComparisonMode, QualityRate and NumberOfProviders) and Broker; Negotiation into Heuristic, Argumentation, Auction and ContractNet; Contract into ContractTemplate.]
Figure 6: The Protocol Classification

In the UML schema of figure 6, the protocol class of figure 5b is refined. Protocols are specialized, according to their objective, into three abstract classes: FindingPartner, Negotiation and Contract. These abstract classes are in their turn recursively specialized until obtaining leaves corresponding to models of protocols such as Matchmaker, Broker, Argumentation, Heuristic or ContractTemplate. Each of these protocols may be used in one of the coordination steps of IOW. However, at this last level, the different classes feature new attributes specific to each one and possibly related to the quality of service.

If we consider for example the Matchmaker protocol (the only one developed in figure 6), we can make the following observations. First, it differs from the broker by the fact that it implements a peer-to-peer execution mode with the provider: the
identity of the provider is known and a direct link is established between the requester and the provider at run-time. Then, one can be interested in its comparison mode (plug-in, exact and/or subsume) [21], a quality rating to compare it to other matchmakers, the minimum number of providers it is able to manage, and so on.

5.4 Engineering Protocols

In this paper, we focus on three activities: the design, the selection and the execution of protocols. Let us give some hints on how to drive these three activities thanks to the models previously given.

Protocol Design. The design process consists of producing protocol specifications compliant with the meta-model presented in section 5.2 and refined in section 5.3.

Protocol Selection. Given a query specifying the objective of a Protocol and the value of some additional attributes (P2PExecution, NumberOfProviders...), the Protocol Selection Agent follows a three-step process to give an answer. First, it navigates the hierarchy of protocols and selects possible models of protocols. For instance, if the objective of a requester is to find a partner, the requester will obtain a set of FindingPartner protocols. Second, the values of the other attributes can be used to select a specific protocol model. For instance, if a query specifies a P2P execution, it will obtain the Matchmaker protocol model; otherwise the Broker will be suggested. Third, the Protocol Selection Agent must check that the behavioral and casting constraints of the selected protocol model are compatible with the requester's requirements and capabilities. This process may be iterative and interactive to guide the requester in its choice, in case there are still several possible solutions. In our implementation, queries are expressed in nRQL [22], a language allowing the querying of OWL or RDF ontologies. We use the following nRQL query syntax: (retrieve (?<x>) (?<x> <class>) [(<condition>)]), where <x> is the variable corresponding to the query result, <class> is the name of the queried class, and <condition>, which may be omitted, is the condition that <x> must satisfy. As an example, considering our protocol classification (see figure 6), the following query (retrieve (?x) (?x FindingPartner) (= P2PExecution True)) returns the Matchmaker protocol.

Protocol Execution. Once a protocol model has been selected, the Protocol Launcher Agent creates and launches a moderator agent to play that protocol.

6 Implementation

This work has given rise to an implementation project, called ProMediaFlow (Protocols for Mediating Workflow), aiming at developing the whole PMS architecture. The first step of this project is to evaluate its feasibility. For this purpose, we have developed a simulator, limited for the moment to a subset of components and considering a single protocol. Regarding the PMS architecture, we have implemented only the Protocol Execution block and considered the Matchmaker protocol. Regarding the WfMS architecture, we have implemented both requester and provider Workflow Agents and their corresponding Agent Managers and Connection Servers. This work has been implemented with the MadKit platform [23], which permits the development of distributed applications using multi-agent principles. MadKit also supports the organizational AGR meta-model and thus permits a straightforward implementation of our organizational model (presented in section 4).
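Returning briefly to the selection step of section 5.4, the following Java sketch mimics, over a small in-memory classification, what the Protocol Selection Agent obtains with its nRQL queries: filtering first by objective and then by attribute values. The data and helper names are assumptions made for the example, and the third step (checking casting and behavioural constraints) is omitted.

import java.util.List;
import java.util.Map;

// Illustrative stand-in for the Protocol Selection Agent's nRQL queries:
// filter the protocol classification by objective, then by attribute values.
public class ProtocolSelection {

    record ProtocolModel(String name, String objective, Map<String, Object> attributes) { }

    static final List<ProtocolModel> CLASSIFICATION = List.of(
        new ProtocolModel("Matchmaker", "FindingPartner", Map.of("P2PExecution", true)),
        new ProtocolModel("Broker", "FindingPartner", Map.of("P2PExecution", false)),
        new ProtocolModel("ContractNet", "Negotiation", Map.of()));

    static List<ProtocolModel> select(String objective, Map<String, Object> wanted) {
        return CLASSIFICATION.stream()
            .filter(p -> p.objective().equals(objective))                        // step 1: objective
            .filter(p -> wanted.entrySet().stream()                              // step 2: attributes
                .allMatch(e -> e.getValue().equals(p.attributes().get(e.getKey()))))
            .toList();                                                           // step 3 (constraint check) omitted
    }

    public static void main(String[] args) {
        // Mirrors the nRQL example: FindingPartner protocols with P2PExecution = true.
        System.out.println(select("FindingPartner", Map.of("P2PExecution", true)));
    }
}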
In the current version of our implementation, the system is only able to produce moderators playing the Matchmaker protocol. More precisely, the Matchmaker is able to compare offers and requests of workflow services (i.e. services implementing a workflow process) by establishing flexible comparisons (exact, plug-in, subsume) [21]
based on a domain ontology. For that purpose, we have also included in the simulator facilities for describing workflow services. So, as presented in [24], these offers and requests are described using both the Petri Nets with Objects (PNO) formalism [25] and the OWL-S language [26]: the PNO formalism is used to design, analyze, simulate, check and validate workflow services, which are then automatically derived into OWL-S specifications to be published through the Matchmaker.

Figure 7 below shows some screenshots of our implementation. More precisely, this figure shows the following four agents: a Requester Workflow Agent, a Provider Workflow Agent, the Conversation Server, and a Moderator, which is a Matchmaker. While windows 1 and 2 represent agents belonging to WfMSs (respectively a requester workflow agent and a provider workflow agent), windows 3 and 4 represent agents belonging to the PMS (respectively the conversation server and a moderator agent playing the Matchmaker protocol).

Figure 7: Overview of the Implementation

The top left window (number 1) represents a requester workflow agent belonging to a WfMS involved in a loose IOW and implementing a workflow process instance. As shown by window 1, this agent can: (i) specify a requested workflow service (Specification menu), (ii) advertise this specification through the Matchmaker (Submission menu), (iii) visualize the providers offering services corresponding to the specification (Visualization menu), (iv) establish peer-to-peer connections with one of these providers (Contact menu), and (v) launch the execution of this requested service (WorkSpace menu).

In a symmetric way, the bottom left window (number 2) represents an agent playing the role of a workflow service provider, and a set of menus enables it to manage its offered services. As shown by window 2, the Specification menu includes three commands to support PNO and OWL-S specifications. The first command permits the specification of a workflow service using the PNO formalism, the second one permits the analysis and validation of the specified PNO, and the third one derives the corresponding OWL-S specifications.

The top right window (number 3) represents the Conversation Server agent belonging to the PMS architecture. As shown by window 3, this agent can: (i) display
all the conversations (Conversations menu, List of conversations option), (ii) select a given conversation (Conversations menu, Select a conversation option), and (iii) obtain all the information related to the selected conversation (its moderator, its initiator, its participants...) (Conversations menu, Detail of a conversation option).

Finally, the bottom right window (number 4) represents a Moderator agent playing the Matchmaker protocol. This agent can: (i) display all the communication acts of the supervised conversation (Communication act menu, List of acts option), (ii) select a given communication act (Communication act menu, Select an act option), (iii) obtain all the information related to the selected communication act (its sender, its content...) (Communication act menu, Detail of an act option), and (iv) know the underlying domain ontology (Ontology domain menu).

Let us now give some indications about the efficiency of our implementation, and more precisely of our Matchmaker protocol. Let us first remark that, in the IOW context, workflows are most often long-term processes which may last for several days. Consequently, we do not need an instantaneous matchmaking process. However, in order to prove the feasibility of our proposal, we have measured the matchmaker processing time according to some of the parameters (notably the number of offers and the comparison modes) intervening in the complexity formulae of the matchmaking process [27]. The measures have been realized in the context of the conference review system case study, where a PC chair subcontracts to the matchmaker the search for reviewers able to evaluate papers. The reviewers are supposed to have submitted to the matchmaker their capabilities in terms of topics. Papers are classified according to topics belonging to an OWL ontology. Figure 8 visualizes the matchmaker's average processing time for a number of offers (services) varying from 100 to 1000, considering the plug-in, exact and subsume comparison modes. As illustrated in figure 8, and in accordance with the complexity study of [27], the exact mode is the most efficient in terms of processing time. To better analyze the Matchmaker's behavior, we also plan to measure its recall and precision rates, well known in Information Retrieval [28].

[Figure 8 plots the matchmaker's average processing time in milliseconds against the number of offers (services), from 100 to 1000, for the exact, plug-in and subsume comparison modes.]
Figure 8: Quantitative Evaluation of the Matchmaker
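The exact, plug-in and subsume comparison modes measured in figure 8 can be made concrete with a small example. The Java sketch below classifies the match between a requested and an offered service category over a toy concept hierarchy; it is a strong simplification of the ontology-based matching of OWL-S descriptions actually performed by the Matchmaker [21], and the concept names are invented for the example.

import java.util.Map;

// Toy degree-of-match computation over a concept hierarchy (child -> parent).
public class DegreeOfMatch {

    static final Map<String, String> PARENT = Map.of(
        "DatabaseReview", "ComputerScienceReview",
        "XMLReview", "DatabaseReview",
        "ComputerScienceReview", "Review");

    static boolean subsumes(String general, String specific) {
        for (String c = specific; c != null; c = PARENT.get(c))
            if (c.equals(general)) return true;
        return false;
    }

    // Exact: same concept; plug-in: the offer is more general than the request;
    // subsume: the offer is more specific than the request; otherwise: fail.
    static String match(String requested, String offered) {
        if (requested.equals(offered)) return "exact";
        if (subsumes(offered, requested)) return "plug-in";
        if (subsumes(requested, offered)) return "subsume";
        return "fail";
    }

    public static void main(String[] args) {
        System.out.println(match("XMLReview", "XMLReview"));        // exact
        System.out.println(match("XMLReview", "DatabaseReview"));   // plug-in
        System.out.println(match("DatabaseReview", "XMLReview"));   // subsume
    }
}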

7 Discussion and Conclusion

This paper has proposed a multi-agent protocol management system architecture by considering protocols as first-class entities. This architecture has three advantages:
- Easy design and development of loose IOW systems. The principle of separation of concerns improves the understandability of the system, and therefore eases and speeds up its design, its development and its maintenance. Following this practice, protocols may be thought of as autonomous and reusable components that may be specified and verified independently of any specific workflow system behavior. Doing so, we can focus on specific coordination issues and build a library of easy-to-use and reliable protocols. The same holds for workflow systems, since it becomes possible to focus on their specific capabilities, regardless of how they interact with others.
- Workflow agent reusability. As a consequence of introducing moderators, workflow agents and agent managers no longer interact with each other directly. Instead, they just have to agree on the type of conversation they want to adopt. Consequently, agents impose fewer requirements on their partners, and they are loosely coupled. Hence, heterogeneous workflow agents can easily be composed.
- Visibility of conversations. Some conversations are private and concern only the protagonists. But, most of the time, meeting the goal of the conversation depends to a certain extent on its transparency, i.e. on the possibility given to the agents to be informed of the conversation's progress. With a good knowledge of the state of the conversation and the rules of the protocol, each agent can operate with relevance and in accordance with its objectives. In the absence of a Moderator, the information concerning a conversation is distributed among the participating agents. Thus, it is difficult to know which agent holds the sought information, assuming that this agent has even been designed to supply it. By contrast, moderators support the transparency of conversations.

Related works may be considered according to two complementary points of view: the loose IOW coordination and the protocol engineering points of view. Regarding loose IOW coordination, it can be noted that most of the works ([10], [14], [15], [16], [29], [30], [31]) adopt one of the following multi-agent coordination protocols: organizational structuring, matchmaking, negotiation, contracting... However, these works neither address all the protocols at the same time nor follow the principle of separation of concerns. Consequently, they do not address protocol engineering issues, notably protocol design, selection and execution. Regarding protocol engineering, the most significant works are [6] and [32]. [6] has inspired our software engineering approach, but it differs from our work since it addresses neither workflow applications nor the protocol classification and selection issues. [32] is a work complementary to ours. It deals with protocol engineering issues, focusing particularly on the notions of protocol compatibility, equivalence and replaceability. In fact, this work aims at defining a protocol algebra which can be very useful to our PMS. At design time, it can be used to establish links between protocols, while, at run-time, these links can be used by the Protocol Selection Agent. Finally, we must also mention work which addresses both loose IOW coordination and protocol engineering issues ([33], [34]). [33] isolates protocols and represents them as ontologies, but this work only considers negotiation protocols in e-commerce.
[34] considers interaction protocols as design abstractions for business processes, provides an ontology of protocols and investigates protocol composition issues. [34] differs from our work since it only focuses on the description and composition aspects. Finally, none of these works ([33], [34]) proposes means for classifying, retrieving and executing protocols, nor addresses architectural issues as we did through the PMS definition.

Regarding future work, we plan to complete the implementation of the PMS. The current implementation is limited to the Matchmaker protocol, and so we intend to design and implement other coordination protocols (broker, argumentation, heuristic). We also believe that an adequate combination of this work and the comparison
mechanisms of protocols presented in [32] could improve the classification of protocols in our PMS. Finally, to provide uniform access to our PMS, moderator agents playing protocols could be encapsulated inside Web services. Doing so, we would follow the promising Service-Oriented Multiagent Architecture recently introduced in [35], which provides, on the one hand, flexibility to our workflows, inherited from agent technology, and, on the other hand, interoperability and openness, thanks to the use of a Service-Oriented Architecture.

References

[1] W. van der Aalst: Inter-Organizational Workflows: An Approach Based on Message Sequence Charts and Petri Nets. Systems Analysis, Modeling and Simulation, 34(3), 1999, pp
[2] M. Divitini, C. Hanachi, C. Sibertin-Blanc: Inter-Organizational Workflows for Enterprise Coordination. In: A. Omicini, F. Zambonelli, M. Klusch, and R. Tolksdorf (eds): Coordination of Internet Agents, Springer-Verlag, 2001, pp
[3] F. Casati, A. Discenza: Supporting Workflow Cooperation within and across Organizations. 15th Int. Symposium on Applied Computing, Como, Italy, March 2000, pp
[4] P. Grefen, K. Aberer, Y. Hoffner, H. Ludwig: CrossFlow: Cross-Organizational Workflow Management in Dynamic Virtual Enterprises. Computer Systems Science and Engineering, 15(5), 2000, pp
[5] O. Perrin, F. Wynen, J. Bitcheva, C. Godart: A Model to Support Collaborative Work in Virtual Enterprises. 1st Int. Conference on Business Process Management, Eindhoven, The Netherlands, June 2003, pp
[6] C. Hanachi, C. Sibertin-Blanc: Protocol Moderators as Active Middle-Agents in Multi-Agent Systems. Autonomous Agents and Multi-Agent Systems, 8(3), March 2004, pp
[7] C. Ghezzi, M. Jazayeri, D. Mandrioli: Fundamentals of Software Engineering. Prentice-Hall International,
[8] W. van der Aalst: The Application of Petri Nets to Workflow Management. Journal of Circuits, Systems and Computers, 8(1), February 1998, pp
[9] M. Genesereth, S. Ketchpel: Software Agents. Communications of the ACM, 37(7), July 1994, pp
[10] E. Andonoff, L. Bouzguenda, C. Hanachi, C. Sibertin-Blanc: Finding Partners in the Coordination of Loose Inter-Organizational Workflow. 6th Int. Conference on the Design of Cooperative Systems, Hyeres, France, May 2004, pp
[11] N. Jennings, P. Faratin, A. Lomuscio, S. Parsons, C. Sierra, M. Wooldridge: Automated Negotiation: Prospects, Methods and Challenges. Group Decision and Negotiation, 10(2), 2001, pp
[12] J. Ferber, O. Gutknecht: A Meta-Model for the Analysis and Design of Organizations in Multi-Agent Systems. 3rd Int. Conference on Multi-Agent Systems, Paris, France, July 1998, pp
[13] The Workflow Management Coalition: The Workflow Reference Model. Technical Report WfMC-TC-1003, November
[14] L. Zeng, A. Ngu, B. Benatallah, M. O'Dell: An Agent-Based Approach for Supporting Cross-Enterprise Workflows. 12th Australian Database Conference, Bond, Australia, February 2001, pp
[15] E. Andonoff, L. Bouzguenda: Agent-Based Negotiation between Partners in Loose Inter-Organizational Workflow. 5th Int. Conference on Intelligent Agent Technology, Compiègne, France, September 2005, pp
[16] P. Buhler, J. Vidal: Towards Adaptive Workflow Enactment Using Multi Agent Systems. Information Technology and Management, 6(1), 2005, pp
[17] J. Ferber, O. Gutknecht, F. Michel: From Agents to Organizations: an Organizational View of Multi-Agent Systems. 4th Int. Workshop on Agent-Oriented Software Engineering, Melbourne, Australia, July 2003, pp
[18] Foundation for Intelligent Physical Agents: FIPA ACL Message Structure Specification. December 2002, available at
[19] World Wide Web Consortium: OWL Web Ontology Language. Available at
[20] Foundation for Intelligent Physical Agents: FIPA Contract Net Interaction Protocol Specification. December 2002, available at
[21] A. Ankolekar, M. Burstein, J. Hobbs, O. Lassila, D. Martin, D. McDermott, S. McIlraith, S. Narayanan, M. Paolucci, T. Payne, K. Sycara: DAML-S: Web Service Description for the Semantic Web. 1st Int. Semantic Web Conference, Sardinia, Italy, June 2002, pp
[22] V. Haarslev, R. Moeller, M. Wessel: Querying the Semantic Web with Racer and nRQL. 3rd Int. Workshop on Applications of Description Logic, Ulm, Germany, September
[23] J. Ferber, O. Gutknecht: The MadKit Project: a Multi-Agent Development Kit. Available at
[24] E. Andonoff, L. Bouzguenda, C. Hanachi: Specifying Web Workflow Services for Finding Partners in the Context of Loose Inter-Organizational Workflow. 3rd Int. Conference on Business Process Management, Nancy, France, September 2005, pp
[25] C. Sibertin-Blanc: High Level Petri Nets with Data Structure. 6th Int. Workshop on Petri Nets and Applications, Espoo, Finland, June
[26] World Wide Web Consortium: OWL-S: Semantic Markup for Web Services. Available at
[27] L. Bouzguenda: Agent-Based Coordination for Loose Inter-Organizational Workflow. PhD dissertation (in French), University of Toulouse 1, May 2006.
[28] M. Klusch, B. Fries, K. Sycara: Automated Semantic Web Service Discovery with OWLS-MX. 5th Int. Joint Conference on Autonomous Agents and Multiagent Systems, Hakodate, Japan, May 2006, pp
[29] C. Aberg, C. Lambrix, N. Shahmehri: An Agent-Based Framework for Integrating Workflows and Web Services. 14th Int. Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, Linköping, Sweden, June 2005, pp
[30] A. Negri, A. Poggi, M. Tomaiuolo, P. Turci: Agents for e-Business Applications. 5th Int. Conference on Autonomous Agents and Multi-Agent Systems, Hakodate, Japan, May 2004, pp
[31] M. Sensoy, P. Yolum: A Context-Aware Approach for Service Selection Using Ontologies. 5th Int. Conference on Autonomous Agents and Multi-Agent Systems, Hakodate, Japan, May 2004, pp
[32] B. Benatallah, F. Casati, F. Toumani: Representing, Analyzing and Managing Web Service Protocols. Data and Knowledge Engineering, 58(3), September 2006, pp
[33] V. Tamma, S. Phelps, I. Dickinson, M. Wooldridge: Ontologies for Supporting Negotiation in E-Commerce. Engineering Applications of Artificial Intelligence, 18(2), March 2005, pp
[34] N. Desai, A. Mallya, A. Chopra, M. Singh: Interaction Protocols as Design Abstractions for Business Processes. IEEE Transactions on Software Engineering, December 2005, pp
[35] M. Huhns, M. Singh, M. Burstein, K. Decker, E. Durfee, T. Finin, L. Gasser, H. Goradia, N. Jennings, K. Lakkaraju, H. Nakashima, H. Van Dyke Parunak, J. Rosenschein, A. Ruvinski, G. Sukthankar, S. Swarup, K. Sycara, M. Tambe,

46 Vol. IV, No. II, pp Wagner, L. Zavala: Research Directions for Service-Oriented Multiagent Systems. IEEE Internet Computing, 9 (6), December 2005, pp Eric Andonoff, Wassim Bouaziz, Chihab Hanachi 41


Adaptability of Methods for Processing XML Data using Relational Databases: the State of the Art and Open Problems

Irena Mlynkova and Jaroslav Pokorny
Charles University, Faculty of Mathematics and Physics, Department of Software Engineering, Malostranske nam. 25, Prague 1, Czech Republic
{irena.mlynkova,jaroslav.pokorny}@mff.cuni.cz
(This work was supported in part by the Czech Science Foundation (GACR), grant number 201/06/0756.)

Abstract

As XML technologies have become a standard for data representation, it is essential to propose and implement efficient techniques for managing XML data. A natural alternative is to exploit the tools and functions offered by (object-)relational database systems. Unfortunately, this approach has many opponents, especially due to the inefficiency caused by structural differences between XML data and relations. On the other hand, (object-)relational databases have a long theoretical and practical history and represent a mature technology, i.e. they can offer properties that no native XML database can offer yet. In this paper we study techniques which improve XML processing based on relational databases, the so-called adaptive or flexible mapping methods. We provide an overview of existing approaches, classify their main features, and sum up the most important findings and characteristics. Finally, we discuss possible improvements and the corresponding key problems.

Keywords: XML-to-relational mapping, state of the art, adaptability, relational databases

1 Introduction

Without any doubt, XML [9] is currently one of the most popular formats for data representation. It is well-defined, easy to use, and involves various recommendations such as languages for structural specification, transformation, querying, updating, etc. This popularity has triggered an enormous endeavor to propose more efficient methods and tools for managing and processing XML data. The four most popular ones are methods which store XML data in a file system, methods which store and process XML data using an (object-)relational database system, methods which exploit a pure object-oriented approach, and native methods that use special indices, numbering schemes [17], and/or data structures [12] proposed particularly for the tree structure of XML data.

Naturally, each of the approaches has both keen advocates and objectors. The situation is not good especially for file system-based and object-oriented methods. The former suffer from the inability to query the data without any additional

preprocessing, whereas the latter approach fails especially in finding a corresponding efficient and comprehensive tool. The highest-performance techniques are the native ones, since they are proposed particularly for XML processing and do not need to artificially adapt existing structures to a new purpose. But the most practically used ones are methods which exploit features of (object-)relational databases. The reason is that they are still regarded as universal data processing tools and their long theoretical and practical history can guarantee a reasonable level of reliability. Contrary to native methods it is not necessary to start from scratch; we can rely on a mature and verified technology, i.e. properties that no native XML database can offer yet.

Under closer investigation, the database-based methods (in the rest of the paper the term database represents an (object-)relational database) can be further classified and analyzed [19]. We usually distinguish generic methods which store XML data regardless of the existence of a corresponding XML schema (e.g. [10] [16]), schema-driven methods based on structural information from an existing schema of XML documents (e.g. [26] [18]), and user-defined methods which leave all the storage decisions in the hands of future users (e.g. [2] [1]).

Techniques of the first type usually view an XML document as a directed labelled tree with several types of nodes. We can further distinguish generic techniques which purely store the components of the tree and their mutual relationships [10] and techniques which store additional structural information, usually using a kind of numbering scheme [16]. Such a scheme speeds up certain types of queries, but usually at the cost of inefficient data updates. The fact that they do not exploit a possibly existing XML schema can be regarded as both an advantage and a disadvantage. On the one hand, they do not depend on its existence; on the other hand, they cannot exploit the additional structural information. But together with the finding that a significant portion of real XML documents (52% [5] of randomly crawled or 7.4% [20] of semi-automatically collected ones, i.e. data collected with the interference of a human operator who removes damaged, artificial, too simple, or otherwise useless XML data) have no schema at all, they seem to be the most practical choice.

By contrast, schema-driven methods have contradictory (dis)advantages. The situation is even worse for methods which are based particularly on XML Schema [28] [7] definitions (XSDs) and focus on their special features [18]. As might be expected, XSDs are used even less (only for 0.09% [5] of randomly crawled or 38% [20] of semi-automatically collected XML documents) and even if they are used, they often (in 85% of cases [6]) define so-called local tree grammars [22], i.e. languages that can be defined using DTD [9] as well. The most exploited non-DTD features are usually simple types [6], whose absence in DTD is a crucial shortcoming but which have only a side optimization effect on XML data processing. Another problem of purely schema-driven methods is that the information XML schemas provide is not sufficient. Analysis of both XML documents and XML schemas together [20] shows that XML schemas are too general. Typical examples are recursion or the * operator, which allow theoretically infinitely deep or wide XML documents. Naturally, XML schemas also cannot provide any information about, e.g., the retrieval frequency of an element / attribute or the way they are retrieved. Thus not only XML schemas but also the corresponding

XML documents and XML queries need to be taken into account to get an overall notion of the intended XML-processing application.

The last mentioned type of approach, i.e. the user-defined one, is a bit different. It does not involve methods for automatic database storage but rather tools for the specification of the target database schema and the required XML-to-relational mapping. It is commonly offered by most known (object-)relational database systems [3] as a feature that enables users to define what suits them most instead of being restricted by the disadvantages of a particular technique. Nevertheless, the key problem is evident: it assumes that the user is skilled in both database and XML technologies.

Apparently, the advantages of all three approaches are closely related to the particular situation. Thus it is advisable to propose a method which is able to exploit the current situation or at least to conform to it. If we analyze database-based methods more deeply, we can distinguish so-called flexible or adaptive methods (e.g. [13] [25] [29] [31]). They take into account a given sample set of XML data and/or XML queries which specify the future usage and adapt the resulting database schema to them. Such techniques naturally have better performance results than the fixed ones (e.g. [10] [16] [26] [18]), i.e. methods which use a pre-defined set of mapping rules and heuristics regardless of the intended future usage. Nevertheless, they also have a great disadvantage: the target database schema is adapted only once. Thus if the expected usage changes, the efficiency of such techniques can be even worse than in the corresponding fixed case. Consequently, the adaptability needs to be dynamic.

The idea to adapt a technique to a sample set of data is closely related to analyses of typical features of real XML documents [20]. If we combine the two ideas, we can assume that a method which focuses especially on common features will be more efficient than a general one. A similar observation is already exploited, e.g., in techniques which represent XML documents as a set of points in multidimensional space [14]. The efficiency of such techniques depends strongly on the depth of the XML documents or the number of distinct paths. Fortunately, XML analyses confirm that real XML documents are surprisingly shallow: the average depth does not exceed 10 levels [5] [20]. Considering all the mentioned points, adaptive enhancement of XML-processing methods focusing on given or typical situations seems to be a promising type of improvement.

In this paper we study adaptive techniques from various points of view. We provide an overview of existing approaches, classify them and their main features, and sum up the most important findings and characteristics. Finally, we discuss possible improvements and the corresponding key problems. The analysis should serve as a starting point for proposing enhancements of existing adaptive methods as well as an entirely new approach. Thus we also discuss possible improvements of weak points of existing methods and solutions to the stated open problems.

The paper is structured as follows: Section 2 contains a brief introduction to the formalism used throughout the paper. Section 3 describes and classifies the existing related works, both practical and theoretical, and Section 4 sums up their main characteristics. Section 5 discusses possible ways of improving the recent approaches and, finally, the sixth section provides conclusions.

2 Definitions and Formalism

Before we begin to describe and classify adaptive methods, we state several basic terms used in the rest of the text. An XML document is usually viewed as a directed labelled tree with several types of nodes whose edges represent relationships among them. Side structures, such as entities, comments, CDATA sections, processing instructions, etc., are omitted without loss of generality.

Definition 1. An XML document is a directed labelled tree T = (V, E, Σ_E, Σ_A, Γ, lab, r), where V is a finite set of nodes, E ⊆ V × V is a set of edges, Σ_E is a finite set of element names, Σ_A is a finite set of attribute names, Γ is a finite set of text values, lab : V → Σ_E ∪ Σ_A ∪ Γ is a surjective function which assigns a label to each v ∈ V, whereas v is an element if lab(v) ∈ Σ_E, an attribute if lab(v) ∈ Σ_A, or a text value if lab(v) ∈ Γ, and r is the root node of the tree.

A schema of an XML document is usually described using DTD or XML Schema, which describe the allowed structure of an element using its content model. An XML document is valid against a schema if each element matches its content model. (We state the definitions for DTDs only, for the sake of paper length.)

Definition 2. A content model α over a set of element names Σ_E is a regular expression defined as α = ε | pcdata | f | (α_1, α_2, ..., α_n) | (α_1 | α_2 | ... | α_n) | β* | β+ | β?, where ε denotes the empty content model, pcdata denotes the text content, f ∈ Σ_E, "," and "|" stand for concatenation and union (of content models α_1, α_2, ..., α_n), and *, +, and ? stand for zero or more, one or more, and optional occurrence(s) (of content model β), respectively.

Definition 3. An XML schema S is a four-tuple (Σ_E, Σ_A, Δ, s), where Σ_E is a finite set of element names, Σ_A is a finite set of attribute names, Δ is a finite set of declarations of the form e → α or e → β, where e ∈ Σ_E, α is a content model over Σ_E, and β ⊆ Σ_A, and s ∈ Σ_E is a start symbol.

To simplify the XML-to-relational mapping process, an XML schema is often transformed into a graph representation. Probably the first occurrence of this representation, the so-called DTD graph, can be found in [26]. There are also various other types of graph representation of an XML schema; if necessary, we mention the slight differences later in the text.

Definition 4. A schema graph of a schema S = (Σ_E, Σ_A, Δ, s) is a directed, labelled graph G = (V', E', lab'), where V' is a finite set of nodes, E' ⊆ V' × V' is a set of edges, lab' : V' → Σ_E ∪ Σ_A ∪ {",", "*", "+", "?", "|"} ∪ {pcdata} is a surjective function which assigns a label to each v ∈ V', and s is the root node of the graph.

The core idea of XML-to-relational mapping methods is to decompose a given schema graph into fragments, which are mapped to corresponding relations.

Definition 5. A fragment f of a schema graph G is any connected subgraph of G.

Definition 6. A decomposition of a schema graph G is a set of its fragments {f_1, ..., f_n}, where each v ∈ V' is a member of at least one fragment.
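To make the formalism concrete, the following Python sketch (ours, not part of the original paper) shows one possible in-memory representation of a schema graph together with a check of the decomposition property from Definition 6; all class and function names are illustrative assumptions.

from collections import defaultdict

class SchemaGraph:
    """A directed, labelled schema graph in the spirit of Definition 4."""
    def __init__(self, root):
        self.root = root
        self.labels = {}               # node id -> element/attribute name, operator, or 'pcdata'
        self.edges = defaultdict(set)  # node id -> set of child node ids

    def add_node(self, node, label):
        self.labels[node] = label

    def add_edge(self, parent, child):
        self.edges[parent].add(child)

    def is_connected(self, nodes):
        """True if the given node set induces a connected subgraph (direction ignored)."""
        nodes = set(nodes)
        if not nodes:
            return False
        undirected = defaultdict(set)
        for u in nodes:
            for v in self.edges[u]:
                if v in nodes:
                    undirected[u].add(v)
                    undirected[v].add(u)
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(undirected[n] - seen)
        return seen == nodes

def is_decomposition(graph, fragments):
    """Every node belongs to at least one fragment and each fragment is connected (Definitions 5 and 6)."""
    covered = set().union(*fragments) if fragments else set()
    return covered == set(graph.labels) and all(graph.is_connected(f) for f in fragments)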

3 Existing Approaches

Up to now only a few papers have focused on a proposal of an adaptive database-based XML-processing method. We distinguish two main directions: cost-driven and user-driven. Techniques of the former group can choose the most efficient XML-to-relational storage strategy automatically. They usually evaluate a subset of possible mappings and choose the best one according to the given sample of XML data, query workload, etc. The main advantage is expressed by the adverb automatically, i.e. without necessary or undesirable user interference. By contrast, techniques of the latter group also support several storage strategies, but the final decision is left in the hands of users. We distinguish these techniques from the user-defined ones, since their approach is slightly different: by default they offer a fixed mapping, but users can influence the mapping process by annotating fragments of the input XML schema with the demanded storage strategies. Similarly to the user-defined techniques this approach also assumes a skilled user, but most of the work is done by the system itself. The user is expected to help the mapping process, not to perform it.

3.1 Cost-Driven Techniques

As mentioned above, cost-driven techniques can choose the best storage strategy for a particular application automatically, without any interference of a user. Thus the user can influence the mapping process only through the provided XML schema, the set of sample XML documents or data statistics, the set of XML queries and possibly their weights, etc. Each of the techniques can be characterized by the following five features:

1. an initial XML schema S_init,
2. a set of XML schema transformations T = {t_1, t_2, ..., t_n}, where each t_i transforms a given schema S into a schema S_i,
3. a fixed XML-to-relational mapping function f_map which transforms a given XML schema S into a relational schema R,
4. a set of sample data D_sample characterizing the future application, which usually consists of a set of XML documents {d_1, d_2, ..., d_k} valid against S_init, and a set of XML queries {q_1, q_2, ..., q_l} over S_init, possibly with corresponding weights {w_1, w_2, ..., w_l}, where each w_i ∈ [0, 1], and
5. a cost function f_cost which evaluates the cost of the given relational schema R with regard to the set D_sample.

The required result is an optimal relational schema R_opt, i.e. a schema where f_cost(R_opt, D_sample) is minimal.

A naive but illustrative cost-driven storage strategy based on the idea of brute force is depicted by Algorithm 1. It first generates a set of possible XML schemas S using the transformations from the set T, starting from the initial schema S_init (lines 1-4). Then it searches for the schema s ∈ S with minimal cost f_cost(f_map(s), D_sample) (lines 5-12) and returns the corresponding optimal relational schema R_opt = f_map(s).
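As an executable illustration only (ours, not part of the paper), the brute-force idea behind Algorithm 1 below can be sketched in Python as follows; the schema objects are assumed to be hashable, and f_map, f_cost, and the transformation set are supplied by the caller.

import math

def naive_search(s_init, transformations, f_map, f_cost, d_sample):
    # Close the schema set under all transformations (this may not terminate
    # if the transformation set generates infinitely many schemas).
    schemas = {s_init}
    frontier = [s_init]
    while frontier:
        s = frontier.pop()
        for t in transformations:
            s_new = t(s)
            if s_new not in schemas:
                schemas.add(s_new)
                frontier.append(s_new)
    # Pick the schema whose relational mapping has minimal estimated cost.
    best_r, best_cost = None, math.inf
    for s in schemas:
        r = f_map(s)
        cost = f_cost(r, d_sample)
        if cost < best_cost:
            best_r, best_cost = r, cost
    return best_r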

Algorithm 1 Naive Search Algorithm
Input: S_init, T, f_map, D_sample, f_cost
Output: R_opt
1: S ← {S_init}
2: while ∃ t ∈ T, s ∈ S : t(s) ∉ S do
3:   S ← S ∪ {t(s)}
4: end while
5: cost_opt ← ∞
6: for all s ∈ S do
7:   R_tmp ← f_map(s)
8:   cost_tmp ← f_cost(R_tmp, D_sample)
9:   if cost_tmp < cost_opt then
10:    R_opt ← R_tmp ; cost_opt ← cost_tmp
11:  end if
12: end for
13: return R_opt

Obviously, the complexity of such an algorithm depends strongly on the set T. It can be proven that even a simple set of transformations causes the problem of finding the optimal schema to be NP-hard [29] [31] [15]. Thus the existing techniques in fact search for a suboptimal solution using various heuristics, greedy strategies, approximation algorithms, terminal conditions, etc. We can also observe that fixed methods can be considered a special type of cost-driven methods, where T = ∅, D_sample = ∅, and f_cost(R, ∅) = const for each R.

3.1.1 Hybrid Object-Relational Mapping

One of the first attempts at a cost-driven adaptive approach is a method called Hybrid object-relational mapping [13]. It is based on the fact that if XML documents are mostly semi-structured, a classical decomposition of the less structured XML parts into relations leads to inefficient query processing caused by plenty of join operations. The algorithm exploits the idea of storing well-structured parts into relations and semi-structured parts using a so-called XML data type, which supports path queries and XML-aware full-text operations. The fixed mapping for structured parts is similar to the classical Hybrid algorithm [26], whereas, in addition, it exploits NF2 relations using constructs such as set-of, tuple-of, and list-of. The main concern of the method is to identify the structured and semi-structured parts. It consists of the following steps:

1. A schema graph G_1 = (V_1, E_1, lab_1) is built for a given DTD.
2. For each v ∈ V_1 a measure of significance ω_v (see below) is determined.
3. Each v ∈ V_1 which satisfies the following conditions is identified:
(a) v is not a leaf node.
(b) For v and each of its descendants v_i, 1 ≤ i ≤ k: ω_v < ω_LOD and ω_vi < ω_LOD, where ω_LOD is the required level of detail of the resulting schema.

(c) v does not have a parent node which would satisfy the conditions.
4. Each fragment f of G_1 which consists of a previously identified node v and its descendants is replaced with an attribute node having the XML data type, resulting in a schema graph G_2.
5. G_2 is mapped to a relational schema using a fixed mapping strategy.

The measure of significance ω_v of a node v is defined as:

ω_v = 1/2 · ω_Sv + 1/4 · ω_Dv + 1/4 · ω_Qv = 1/2 · ω_Sv + 1/4 · card(D_v)/card(D) + 1/4 · card(Q_v)/card(Q)    (1)

where ω_Sv is derived from the DTD structure as a combination of weights expressing the position of v in the graph and the complexity of its content model (see [13]), D ⊆ D_sample is the set of all given documents, D_v ⊆ D is the set of documents containing v, Q ⊆ D_sample is the set of all given queries, and Q_v ⊆ Q is the set of queries containing v.

As we can see, the algorithm optimizes the naive approach mainly by the facts that the schema graph is preprocessed, i.e. ω_v is determined for each v ∈ V_1, that the set of transformations T is a singleton, and that the transformation is performed only if the current node satisfies the above-mentioned conditions (a)-(c). Obviously, the preprocessing ensures that the complexity of the search algorithm is given by K_1 · card(V_1) + K_2 · card(E_1), where K_1, K_2 ∈ N. On the other hand, the optimization is too restrictive in terms of the number of possible XML-to-relational mappings.

3.1.2 FlexMap Mapping

Another example of adaptive cost-driven methods was implemented as the so-called FlexMap framework [25]. The algorithm optimizes the naive approach using a simple greedy strategy, as depicted in Algorithm 2. The main differences in comparison with the naive approach are the choice of the least expensive transformation at each iteration (lines 3-9) and the termination of the search if there exists no transformation t ∈ T that can reduce the current (sub)optimum (lines 10-14). The set T of XML-to-XML transformations involves the following operations:

- Inlining and outlining: inverse operations which enable storing the columns of a subelement / attribute either in the parent table or in a separate table
- Splitting and merging elements: inverse operations which enable storing a shared element (i.e. an element with multiple parent elements in the schema, see [26]) either in a common table or in separate tables
- Associativity and commutativity
- Union distribution and factorization: inverse operations which enable separating out components of a union using the equation (a, (b | c)) = ((a, b) | (a, c))
- Splitting and merging repetitions: exploitation of the equation (a+) = (a, a*)
- Simplifying unions: exploitation of the relation (a | b) ⊆ (a?, b?)
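As an executable illustration only (ours, not from the paper), the greedy strategy of Algorithm 2 below can be restated in Python as follows; f_map, f_cost, and the set of transformations are assumed to be supplied by the caller.

import math

def greedy_search(s_init, transformations, f_map, f_cost, d_sample):
    # At each step take the single cheapest transformation; stop when no
    # transformation improves the current (sub)optimum.
    s_opt = s_init
    cost_opt = f_cost(f_map(s_opt), d_sample)
    while True:
        best_t, best_cost = None, math.inf
        for t in transformations:
            cost_t = f_cost(f_map(t(s_opt)), d_sample)
            if cost_t < best_cost:
                best_t, best_cost = t, cost_t
        if best_cost < cost_opt:
            s_opt, cost_opt = best_t(s_opt), best_cost
        else:
            return f_map(s_opt)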

Algorithm 2 Greedy Search Algorithm
Input: S_init, T, f_map, D_sample, f_cost
Output: R_opt
1: S_opt ← S_init ; R_opt ← f_map(S_opt) ; cost_opt ← f_cost(R_opt, D_sample)
2: loop
3:   cost_min ← ∞
4:   for all t ∈ T do
5:     cost_t ← f_cost(f_map(t(S_opt)), D_sample)
6:     if cost_t < cost_min then
7:       t_min ← t ; cost_min ← cost_t
8:     end if
9:   end for
10:  if cost_min < cost_opt then
11:    S_opt ← t_min(S_opt) ; R_opt ← f_map(S_opt) ; cost_opt ← f_cost(R_opt, D_sample)
12:  else
13:    break
14:  end if
15: end loop
16: return R_opt

Note that, except for commutativity and simplifying unions, the transformations generate an equivalent schema in terms of equivalence of the sets of document instances. Commutativity does not retain the order of the schema, whereas simplifying unions generates a more general schema, i.e. a schema with a larger set of document instances. (However, only inlining and outlining were implemented and experimentally tested in the FlexMap system.)

The fixed mapping again uses a strategy similar to the Hybrid algorithm, but it is applied locally on each fragment of the schema specified by the transformation rules stated by the search algorithm. For example, elements determined to be outlined are not inlined even though a traditional Hybrid algorithm would do so.

The process of evaluating f_cost is significantly optimized. A naive approach would require the construction of a particular relational schema, loading sample XML data into the relations, and a cost analysis of the resulting relational structures. The FlexMap evaluation exploits an XML Schema-aware statistics framework, StatiX [11], which analyzes the structure of a given XSD and XML documents and computes their statistical summary, which is then mapped to relational statistics with regard to the fixed XML-to-relational mapping. Together with the sample query workload they are used as an input for a classical relational optimizer which estimates the resulting cost. Thus no relational schema has to be constructed and, as the statistics are updated accordingly at each XML-to-XML transformation, the XML documents need to be processed only once.

3.1.3 An Adjustable and Adaptable Method (AAM)

The following method, which is also based on the idea of searching a space of possible mappings, is presented in [29] as an Adjustable and adaptable method

(AAM). In this case the authors adapt the given problem to the features of genetic algorithms. It is also the first paper that mentions that the problem of finding a relational schema R for a given set of XML documents and queries D_sample, s.t. f_cost(R, D_sample) is minimal, is NP-hard in the size of the data.

The set T of XML-to-XML transformations consists of inlining and outlining of subelements. For the purpose of the genetic algorithm each transformed schema is represented using a bit string, where each bit corresponds to an edge of the schema graph and is set to 1 if the element the edge points to is stored in a separate table, or to 0 if the element the edge points to is stored in the parent table. The bits set to 1 represent borders among fragments, whereas each fragment is stored in one table corresponding to a so-called Universal table [10]. The extreme instances correspond to one table for the whole schema (in the case of the all-zero bit string), resulting in many null values, and one table per element (in the case of the all-one bit string), resulting in many join operations. Similarly to the previous strategy, the algorithm chooses only the best possible continuation at each iteration. The algorithm consists of the following steps:

1. The initial population P_0 (i.e. the set of bit strings) is generated randomly.
2. The following steps are repeated until the terminating conditions are met:
(a) Each member of the current population P_i is evaluated and only the best representatives are selected for further reproduction.
(b) The next generation P_i+1 is produced by the genetic operators crossover, mutation, and propagate.

The algorithm terminates either after a certain number of transformations or if a good-enough schema is achieved. The cost function f_cost is expressed as:

f_cost(R, D_sample) = f_M(R, D_sample) + f_Q(R, D_sample) = Σ_{l=1}^{q} C_l · R_l + ( Σ_{i=1}^{m} S_i · P_Si + Σ_{k=1}^{n} J_k · P_Jk )    (2)

where f_M is a space-cost function, where C_l is the number of columns and R_l is the number of rows in the table T_l created for the l-th element in the schema, q is the number of all elements in the schema, f_Q is a query-cost function, where S_i is the cost and P_Si the probability of the i-th select query, J_k is the cost and P_Jk the probability of the k-th join query, m is the number of select queries in D_sample, and n is the number of join queries in D_sample. In other words, f_M represents the total memory cost of the mapping instance, whereas f_Q represents the total query cost. The probabilities P_Si and P_Jk make it possible to specify which elements will (not) be retrieved often and which sets of elements will (not) be combined often in searches. Also note that this algorithm represents another way of finding a reasonable suboptimal solution in the theoretically infinite set of possibilities, using (in this case two) terminal conditions.
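Purely as an illustration (our sketch, not the authors' code), the cost function of equation (2) could be evaluated for one bit-string individual as follows; the derivation of the per-table and per-query statistics from the bit string is assumed to happen elsewhere.

def aam_cost(tables, select_queries, join_queries):
    # tables: list of (num_columns, num_rows) pairs, one per resulting table
    # select_queries / join_queries: lists of (cost, probability) pairs
    space_cost = sum(cols * rows for cols, rows in tables)
    query_cost = sum(c * p for c, p in select_queries) + \
                 sum(c * p for c, p in join_queries)
    return space_cost + query_cost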

3.1.4 A Hill Climbing Algorithm

The last but not least cost-driven adaptive representative can be found in paper [31]. The approach is again based on a greedy type of algorithm, in this case a hill climbing strategy, which is depicted by Algorithm 3.

Algorithm 3 Hill Climbing Algorithm
Input: S_init, T, f_map, D_sample, f_cost
Output: R_opt
1: S_opt ← S_init ; R_opt ← f_map(S_opt) ; cost_opt ← f_cost(R_opt, D_sample)
2: T_tmp ← T
3: while T_tmp ≠ ∅ do
4:   t ← any member of T_tmp
5:   T_tmp ← T_tmp \ {t}
6:   S_tmp ← t(S_opt)
7:   cost_tmp ← f_cost(f_map(S_tmp), D_sample)
8:   if cost_tmp < cost_opt then
9:     S_opt ← S_tmp ; R_opt ← f_map(S_tmp) ; cost_opt ← cost_tmp
10:    T_tmp ← T
11:  end if
12: end while
13: return R_opt

As we can see, the hill climbing strategy differs from the simple greedy strategy depicted in Algorithm 2 in the way it chooses the appropriate transformation t ∈ T. In the previous case the least expensive transformation that can reduce the current (sub)optimum is chosen; in this case it is the first such transformation found.

The schema transformations are based on the idea of vertical (V) or horizontal (H) cutting and merging of the given XML schema fragment(s). The set T consists of the following four types of (pairwise inverse) operations, where ext(f_i) denotes the set of all instance fragments conforming to the schema fragment f_i, and fragments f_1 and f_2 are called twins if ext(f_1) ∩ ext(f_2) = ∅ and for each node u ∈ f_1 there is a node v ∈ f_2 with the same label and vice versa:

- V-Cut(f, (u,v)): cuts fragment f into fragments f_1 and f_2, s.t. f_1 ∪ f_2 = f, where (u, v) is an edge from f_1 to f_2, i.e. u ∈ f_1 and v ∈ f_2
- V-Merge(f_1, f_2): merges fragments f_1 and f_2 into the fragment f = f_1 ∪ f_2
- H-Cut(f, (u,v)): splits fragment f into twin fragments f_1 and f_2 horizontally from the edge (u, v), where u ∉ f and v ∈ f, s.t. ext(f_1) ∪ ext(f_2) = ext(f) and ext(f_1) ∩ ext(f_2) = ∅
- H-Merge(f_1, f_2): merges two twin fragments f_1 and f_2 into one fragment f s.t. ext(f_1) ∪ ext(f_2) = ext(f)

As we can observe, the V-Cut and V-Merge operations are similar to outlining and inlining of the fragment f_2 out of or into the fragment f_1. Conversely, the H-Cut operation corresponds to the splitting of elements used in the FlexMap mapping, i.e. duplication of the shared part, and the H-Merge operation corresponds to the inverse merging of elements.
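For comparison with the greedy sketch above, a minimal Python restatement (ours, under the same assumptions) of the hill-climbing behaviour of Algorithm 3 might look as follows: the first improving transformation found is applied, and the full transformation set is then reconsidered.

def hill_climbing(s_init, transformations, f_map, f_cost, d_sample):
    s_opt = s_init
    cost_opt = f_cost(f_map(s_opt), d_sample)
    pending = list(transformations)
    while pending:
        t = pending.pop()                      # "any member" of the remaining set
        s_tmp = t(s_opt)
        cost_tmp = f_cost(f_map(s_tmp), d_sample)
        if cost_tmp < cost_opt:
            s_opt, cost_opt = s_tmp, cost_tmp
            pending = list(transformations)    # reset to the full set after an improvement
    return f_map(s_opt)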

The fixed XML-to-relational mapping maps each fragment f_i which consists of nodes {v_1, v_2, ..., v_n} to the relation

R_i = (id(r_i) : int, id(r_i.parent) : int, lab(v_1) : type(v_1), ..., lab(v_n) : type(v_n))

where r_i is the root element of f_i. Note that such a mapping is again similar to a locally applied Universal table. The cost function f_cost is expressed as:

f_cost(R, D_sample) = Σ_{i=1}^{n} w_i · cost(Q_i, R)    (3)

where D_sample consists of a sample set of XML documents and a given query workload {(Q_i, w_i)}_{i=1,2,...,n}, where Q_i is an XML query and w_i is its weight. The cost function cost(Q_i, R) for a query Q_i which accesses the fragment set {f_i1, ..., f_im} is expressed as:

cost(Q_i, R) = |f_i1| if m = 1, and cost(Q_i, R) = Σ_{j,k} ( |f_ij| · Sel_ij + δ · (E_ij + E_ik) / 2 ) if m > 1    (4)

where f_ij and f_ik, j ≠ k, are two joined fragments, E_ij is the number of elements in ext(f_ij), and Sel_ij is the selectivity of the path from the root to f_ij estimated using a Markov table. In other words, the formula simulates the cost of joining the relations corresponding to the fragments f_ij and f_ik.

The authors further analyze the influence of the choice of the initial schema S_init on the efficiency of the search algorithm. They use three types of initial schema decompositions, leading to the Binary [10], Shared, or Hybrid [26] mapping. The paper concludes with the finding that a good choice of an initial schema is crucial and can lead to faster searches for the suboptimal mapping.

3.2 User-Driven Techniques

As mentioned above, the most flexible approach is the user-defined mapping, i.e. the idea to leave the whole process in the hands of a user who defines both the target database schema and the required mapping. Due to its simple implementation it is supported in most commercial database systems [3]. At first sight the idea is correct: users can decide what suits them most and are not restricted by the disadvantages of a particular technique. The problem is that such an approach assumes users skilled in two complex technologies, and for more complex applications the design of an optimal relational schema is generally a difficult task. On this account new techniques, in this paper called user-driven mapping strategies, were proposed. The main difference is that the user can influence a default fixed mapping strategy using annotations which specify the required mapping for particular schema fragments. The set of allowed mappings is naturally limited but still powerful enough to define various mapping strategies. Each of the techniques is characterized by the following four features:

1. an initial XML schema S_init,
2. a set of allowed fixed XML-to-relational mappings {f^i_map}_{i=1,...,n},

3. a set of annotations A, each of which is specified by its name, target, allowed values, and function, and
4. a default mapping strategy f_def for not annotated fragments.

3.2.1 MDF

Probably the first approach which faces the mentioned issues is proposed in paper [8] as a Mapping definition framework (MDF). It allows users to specify the required mapping, checks its correctness and completeness, and completes possible incompleteness. The mapping specifications are made by annotating the input XSD with a predefined set of attributes A listed in Table 1.

Table 1: Annotation attributes for MDF
- outline (target: attribute or element; value: true, false): if the value is true, a separate table is created for the attribute / element; otherwise, it is inlined.
- tablename (target: attribute, element, or group; value: string): the string is used as the table name.
- columnname (target: attribute, element, or simple type; value: string): the string is used as the column name.
- sqltype (target: attribute, element, or simple type; value: string): the string defines the SQL type of a column.
- structurescheme (target: root element; value: KFO, Interval, Dewey): defines the way of capturing the structure of the whole schema.
- edgemapping (target: element; value: true, false): if the value is true, the element and all its subelements are mapped using the Edge mapping.
- maptoclob (target: attribute or element; value: true, false): if the value is true, the element / attribute is mapped to a CLOB column.

As we can see, the set of allowed XML-to-relational mappings {f^i_map}_{i=1,...,n} involves inlining and outlining of an element / attribute, the Edge mapping [10] strategy, and mapping an element or an attribute to a CLOB column. Furthermore, it enables specifying the required way of capturing the structure of the whole schema using one of the following three approaches:

- Key, Foreign Key, and Ordinal Strategy (KFO): each node is assigned a unique integer ID and a foreign key pointing to the parent ID; the sibling order is captured using an ordinal value
- Interval Encoding: a unique (start, end) interval is assigned to each node, corresponding to the entering times of a preorder and postorder traversal
- Dewey Decimal Classification: each node is assigned a path to the root node described using a concatenation of the node IDs along the path

Attributes for specifying the names of tables or columns and the data types of columns can be considered side effects. Parts that are not annotated are stored using user-predefined rules, whereas such a mapping is always a fixed one.
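As a purely hypothetical illustration of how such annotations might drive per-fragment decisions (only the attribute names follow Table 1; the function, its defaults, and the annotation encoding are our assumptions), consider:

def choose_mapping(node_annotations, default_strategy="inline"):
    # node_annotations: dict of MDF-style attribute values attached to one schema node
    if node_annotations.get("maptoclob") == "true":
        return "clob"
    if node_annotations.get("edgemapping") == "true":
        return "edge"
    if node_annotations.get("outline") == "true":
        return "separate_table"
    return default_strategy

# An annotated element is outlined; an unannotated one falls back to the default fixed strategy.
print(choose_mapping({"outline": "true", "tablename": "Customer"}))  # separate_table
print(choose_mapping({}))                                            # inline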

3.2.2 XCacheDB System

Paper [4] also proposes a user-driven mapping strategy, which is implemented and experimentally tested as the XCacheDB system; it considers only unordered and acyclic XML schemas and omits mixed-content elements. The set of annotating attributes A that can be assigned to any node v ∈ S_init is listed in Table 2.

Table 2: Annotation attributes for XCacheDB
- INLINE: if placed on a node v, the fragment rooted at v is inlined into the parent table.
- TABLE: if placed on a node v, a new table is created for the fragment rooted at v.
- STORE BLOB: if placed on a node v, the fragment rooted at v is also stored into a BLOB column.
- BLOB ONLY: if placed on a node v, the fragment rooted at v is stored into a BLOB column.
- RENAME (value: string): the value specifies the name of the corresponding table or column created for node v.
- DATATYPE (value: string): the value specifies the data type of the corresponding column created for node v.

It enables inlining and outlining of a node, storing a fragment into a BLOB column, specifying table names or column names, and specifying column data types. The main difference is in the data redundancy allowed by the attribute STORE BLOB, which enables shredding the data into table(s) and at the same time storing pre-parsed XML fragments into a BLOB column.

The fixed mapping uses a slightly different strategy: Each element or attribute node is assigned a unique ID. Each fragment f is mapped to a table T_f which has an attribute a_vid of the ID data type for each element or attribute node v ∈ f. If v is an atomic node (i.e. an attribute node or an element node having no subelements), T_f also has an attribute a_v of the same data type as v. For each distinct path that leads to f from a repeatable ancestor v, T_f has a parent reference column of the ID type which points to the ID of v.

3.3 Theoretic Issues

Besides the proposals of cost-driven and user-driven techniques there are also papers which discuss the corresponding open issues on a theoretical level.

3.3.1 Data Redundancy

As mentioned above, the XCacheDB system allows a certain degree of redundancy, in particular duplication in BLOB columns and the violation of the BCNF or 3NF condition. Paper [4] also discusses the strategy on a theoretical level and defines four classes of XML schema decompositions. Before we state the definitions we have to note that the approach is based on a slightly different

graph representation than Definition 4. The nodes of the graph correspond to elements, attributes, or pcdata, whereas the edges are labelled with the corresponding operators.

Definition 7. A schema decomposition is minimal if all edges connecting nodes of different fragments are labelled with * or +.

Definition 8. A schema decomposition is 4NF if all fragments are 4NF fragments. A fragment is 4NF if no two nodes of the fragment are connected by a * or + labelled edge.

Definition 9. A schema decomposition is non-MVD if all fragments are non-MVD fragments. A fragment is non-MVD if all * or + labelled edges appear in a single path.

Definition 10. A schema decomposition is inlined if it is non-MVD but it is not a 4NF decomposition. A fragment is inlined if it is non-MVD but it is not a 4NF fragment.

According to these definitions, fixed mapping strategies (e.g. [26] [18]) naturally consider only 4NF decompositions, which are the least space-consuming and seem to be the best choice if we do not consider any other information. Paper [4] shows that, having further information (in this particular case given by a user), the choice of another type of decomposition can lead to more efficient query processing, though it requires a certain level of redundancy.

3.3.2 Grouping Problem

Paper [15] deals with the idea that searching for a (sub)optimal relational decomposition is not only related to the given XML schema, query workload, and XML data, but is also highly influenced by the chosen query translation algorithm (i.e. an algorithm for translating XML queries into SQL queries) and the cost model. For theoretical purposes a subset of the problem, the so-called grouping problem, is considered. It deals with possible storage strategies for shared subelements, i.e. storing them either in one common table (the so-called fully grouped strategy) or in separate tables (the so-called fully partitioned strategy). For the analysis of its complexity the authors define two simple cost metrics:

- RelCount: the cost of a relational query is the number of relation instances in the query expression
- RelSize: the cost of a relational query is the sum of the numbers of tuples in the relation instances in the query expression

and three query translation algorithms:

- Naive Translation: performs a join between the relations corresponding to all the elements appearing in the query; a wild-card query (i.e. a query containing // or /* operators) is converted into a union of several queries, one for each satisfying wild-card substitution

- Single Scan: a separate relational query is issued for each leaf element and joins all relations on the path until the least common ancestor of all the leaf elements is reached
- Multiple Scan: the Single Scan algorithm is applied to each relation containing a part of the result, and the resulting query consists of the union of the partial queries

On a simple example the authors show that for a wild-card query Q which retrieves a shared fragment f, the fully partitioned strategy performs better with the Naive Translation algorithm, whereas the fully grouped strategy performs better with the Multiple Scan algorithm. Furthermore, they illustrate that the reliability of the chosen cost model is also closely related to the query translation strategy. If a query contains a not very selective predicate, then the optimizer may choose a plan that scans the corresponding relations, and thus RelSize is a good metric. On the other hand, in the case of a highly selective predicate the optimizer may choose an index lookup plan, and thus RelCount is a good metric.

4 Summary

We can sum up the state of the art of adaptability of database-based XML-processing methods in the following natural but important findings:

1. As the storage strategy has a crucial impact on query-processing performance, a fixed mapping based on predefined rules and heuristics is not universally efficient.
2. It is not an easy task to choose an optimal mapping strategy for a particular application and thus it is not advisable to rely only on the user's experience and intuition.
3. As the space of possible XML-to-relational mappings is very large (usually theoretically infinite) and most of the subproblems are even NP-hard, an exhaustive search is often impossible. It is necessary to define search heuristics, approximation algorithms, and/or reliable terminal conditions.
4. The choice of an initial schema can strongly influence the efficiency of the search algorithm. It is reasonable to start with an at least locally good schema.
5. A strategy for finding a (sub)optimal XML schema should take into account not only the given schema, query workload, and XML data statistics, but also possible query translations, cost metrics, and their consequences.
6. Cost evaluation of a particular XML-to-relational mapping should not involve time-consuming construction of the relational schema, loading of XML data, and analysis of the resulting relational structures. It can be optimized using cost estimation of XML queries, XML data statistics, etc.
7. Despite the previous claim, the user should be allowed to influence the mapping strategy. On the other hand, the approach should not demand a full schema specification, but should complete the user-given hints.
8. Even though a storage strategy is able to adapt to a given sample of schemas, data, queries, etc., its efficiency is still endangered by later changes in the expected usage.

5 Open Issues

Although each of the existing approaches brings certain interesting ideas and optimizations, there is still room for future improvements of the adaptive methods. We describe and discuss them in this section, starting from (in our opinion) the least complex ones.

Missing Input Data

As we already know, for cost-driven techniques there are three types of input data: an XML schema S_init, a set of XML documents {d_1, d_2, ..., d_k}, and a set of XML queries {q_1, q_2, ..., q_l}. The problem of a missing schema S_init was already outlined in the introduction in connection with the (dis)advantages of generic and schema-driven methods. As we suppose that adaptability is the ability to adapt to the given situation, a method which does not depend on the existence of an XML schema but can exploit this information if it is given is probably a natural first improvement. This idea is also strongly related to the mentioned problem of the choice of a locally good initial schema S_init. The corresponding questions are: Can the user-given schema be considered a good candidate for S_init? How can we find a possibly better candidate? Can we find such a candidate for schema-less XML documents? A possible solution can be found in the exploitation of methods for the automatic construction of an XML schema for a given set of XML documents (e.g. [21] [23]). Assuming that documents are more precise sources of structural information, we can expect that a schema generated on their basis will have better characteristics too.

On the other hand, the problem of missing input XML documents can be at least partly solved using reasonable default settings based on general analyses of real XML data (e.g. [5] [20]). Furthermore, the surveys show that real XML data are surprisingly simple and thus the default mapping strategy does not have to be complex either. It should rather focus on efficient processing of frequently used XML patterns.

Finally, the presence of a sample query workload is crucial, since (to our knowledge) there are no analyses of real XML queries, i.e. no source of information for default settings. The reason is that collecting such real representatives is not as straightforward as in the case of XML documents. Currently the best sources of XML queries are XML benchmarking projects (e.g. [24] [30]), but as the data and especially the queries are supposed to be used for rating the performance of a system in various situations, they cannot be considered an example of a real workload. Naturally, the query statistics can be gathered by the system itself and the schema can be adapted continuously, as discussed later in the text.

Efficient Solution of Subproblems

A surprising fact we have encountered is the numerous simplifications in the chosen solutions. As mentioned, some of the techniques omit, e.g., ordering of elements, mixed contents, or recursion. This is a somewhat confusing finding considering the fact that there are proposals for efficient processing of these XML constructs (e.g. [27]) and that adaptive methods should cope with various situations. A similar observation can be made for user-driven methods. Though the proposed systems are able to store schema fragments in various ways, the default

strategy for not annotated parts of the schema is again a fixed one. It could be an interesting optimization to join the two ideas and search for a (sub)optimal mapping for the not annotated parts using a cost-driven method.

Deeper Exploitation of Information

Another open issue is a possible deeper exploitation of the information given by the user. We can identify two main questions: How can the user-given information be better exploited? Is there any other information a user can provide to increase the efficiency? A possible answer can be found in the idea of pattern matching, i.e. using the user-given schema annotations as hints on how to store particular XML patterns. We can naturally expect that structurally similar fragments should be stored similarly and thus focus on finding these fragments in the rest of the schema. The main problem is how to identify the structurally similar fragments. If we consider the variety of XML-to-XML transformations, two structurally identical fragments can be expressed using, at first glance, different regular expressions. Thus it is necessary to propose particular levels of equivalence of XML schema fragments and algorithms to determine them. Last but not least, such a system should focus on the scalability of the similarity metric and particularly on its reasonable default setting (based on existing analyses of real-world data).

Theoretical Analysis of the Problem

As the overview shows, there are various types of XML-to-XML transformations, whereas the mentioned ones certainly do not cover the whole set of possibilities. Unfortunately, there seems to be no theoretical study of these transformations, their key characteristics, and possible classifications. Such a study could, among others, focus on equivalent and generalizing transformations and as such serve as a good basis for the pattern matching strategy. Especially interesting will be the question of NP-hardness in connection with the set of allowed transformations and its complexity (similarly to paper [15], which analyzes the theoretical complexity of combinations of cost metrics and query translation algorithms). Such a survey would provide useful information especially for optimizations of the search algorithm.

Dynamic Adaptability

The last but not least issue is connected with the most striking disadvantage of adaptive methods: the problem of possible changes of XML queries or XML data that can lead to a crucial worsening of the efficiency. As mentioned above, it is also related to the problem of missing input XML queries and the ways to gather them. The question of changes of XML data also opens another wide research area of updatability of the stored data, a feature that is often omitted in current approaches although its importance is crucial. The solution to these issues, i.e. a system that is able to adapt dynamically, is obvious and challenging, but it is not an easy task. It should especially avoid total reconstructions of the whole relational schema and the corresponding necessary reinsertion of all the stored data, or such an operation should be done only in very special cases. On the other hand, this brute-force approach can serve as an inspiration. Supposing that the changes, especially in the case of XML queries, will not be radical, the modifications of the relational schema will be mostly local and

we can apply the expensive reconstruction just locally. Furthermore, we can again exploit the idea of pattern matching and find the XML pattern defined by the modified schema fragment in the rest of the schema. Another question is how often the relational schema should be reconstructed. The natural answer is, of course, not too often. But, on the other hand, research can be done on the idea of performing gradual minor changes. It is probable that such an approach will lead to a less expensive (in terms of reconstruction) and at the same time more efficient (in terms of query processing) system. The former hypothesis should be verified; the latter can be almost certainly expected. The key issue is how to find a reasonable compromise.

6 Conclusion

The main goal of this paper was to describe and discuss the current state of the art and the open issues of adaptability in database-based XML-processing methods. Firstly, we have stated the reasons why this topic should be studied at all. Then we have provided an overview and classification of the existing approaches and summed up the key findings. Finally, we have discussed the corresponding open issues and their possible solutions. Our aim was to show that the idea of processing XML data using relational databases is still up to date and should be further developed. From the overview we can see that even though there are interesting and inspiring approaches, there is still a variety of open problems whose solution can further improve database-based XML processing.

Our future work will naturally follow the open issues stated at the end of this paper and especially investigate the solutions we have mentioned. Firstly, we will focus on the idea of improving the user-driven techniques using an adaptive algorithm for not annotated parts of the schema, together with a deeper exploitation of the user-given hints using pattern-matching methods, i.e. a hybrid user-driven cost-based system. Secondly, we will deal with the problem of the missing theoretical study of schema transformations, their classification, and particularly their influence on the complexity of the search algorithm. And finally, on the basis of the theoretical study and the hybrid system, we will study and experimentally analyze the dynamic enhancement of the system.

References

[1] DB2 XML Extender. IBM.
[2] Oracle XML DB. Oracle Corporation.
[3] S. Amer-Yahia. Storage Techniques and Mapping Schemas for XML. Technical Report TD-5P4L7B, AT&T Labs-Research.
[4] A. Balmin and Y. Papakonstantinou. Storing and Querying XML Data Using Denormalized Relational Databases. The VLDB Journal, 14(1):30-49.
[5] D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: Gathering Statistics from an XML Sample. World Wide Web, 8(4).

[6] G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: a Practical Study. In WebDB'04: Proc. of the 7th Int. Workshop on the Web and Databases, pages 79-84, New York, NY, USA. ACM Press.
[7] P. V. Biron and A. Malhotra. XML Schema Part 2: Datatypes (Second Edition). W3C, October.
[8] F. Du, S. Amer-Yahia, and J. Freire. ShreX: Managing XML Documents in Relational Databases. In VLDB'04: Proc. of the 30th Int. Conf. on Very Large Data Bases, Toronto, ON, Canada. Morgan Kaufmann Publishers Inc.
[9] T. Bray et al. Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C, September.
[10] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDBMS. IEEE Data Eng. Bull., 22(3):27-34.
[11] J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Simeon. StatiX: Making XML Count. In SIGMOD'02: Proc. of the 21st Int. Conf. on Management of Data, Madison, Wisconsin, USA. ACM Press.
[12] T. Grust. Accelerating XPath Location Steps. In SIGMOD'02: Proc. of the ACM SIGMOD Int. Conf. on Management of Data, New York, NY, USA. ACM Press.
[13] M. Klettke and H. Meyer. XML and Object-Relational Database Systems: Enhancing Structural Mappings Based on Statistics. In Lecture Notes in Computer Science, volume 1997.
[14] M. Kratky, J. Pokorny, and V. Snasel. Implementation of XPath Axes in the Multi-Dimensional Approach to Indexing XML Data. In Proc. of Current Trends in Database Technology, EDBT'04 Workshops, pages 46-60, Heraklion, Crete, Greece. Springer.
[15] R. Krishnamurthy, V. Chakaravarthy, and J. Naughton. On the Difficulty of Finding Optimal Relational Decompositions for XML Workloads: A Complexity Theoretic Perspective. In ICDT'03: Proc. of the 9th Int. Conf. on Database Theory, Siena, Italy. Springer.
[16] A. Kuckelberg and R. Krieger. Efficient Structure Oriented Storage of XML Documents Using ORDBMS. In Proc. of the VLDB'02 Workshop EEXTT and CAiSE'02 Workshop DTWeb, London, UK. Springer-Verlag.
[17] Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path Expressions. In VLDB'01: Proc. of the 27th Int. Conf. on Very Large Data Bases, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[18] I. Mlynkova and J. Pokorny. From XML Schema to Object-Relational Database: an XML Schema-Driven Mapping Algorithm. In ICWI'04: Proc. of the IADIS Int. Conf. WWW/Internet, Madrid, Spain. IADIS.
[19] I. Mlynkova and J. Pokorny. XML in the World of (Object-)Relational Database Systems. In ISD'04: Proc. of the 13th Int. Conf. on Information Systems Development, pages 63-76, Vilnius, Lithuania. Springer Science+Business Media, Inc.
[20] I. Mlynkova, K. Toman, and J. Pokorny. Statistical Analysis of Real XML Data Collections. In COMAD'06: Proc. of the 13th Int. Conf. on Management of Data, pages 20-31, New Delhi, India. Tata McGraw-Hill Publishing Company Limited.
[21] C.-H. Moh, E.-P. Lim, and W. K. Ng. DTD-Miner: A Tool for Mining DTD from XML Documents. In WECWIS'00: Proc. of the 2nd Int. Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, Milpitas, CA, USA. IEEE.
[22] M. Murata, D. Lee, and M. Mani. Taxonomy of XML Schema Languages Using Formal Language Theory. In Proc. of the Extreme Markup Languages Conf., Montreal, Quebec, Canada.
[23] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting Schema from Semistructured Data. In SIGMOD'98: Proc. of the ACM Int. Conf. on Management of Data, Seattle, Washington, USA. ACM Press.
[24] E. Rahm and T. Bohme. XMach-1: A Benchmark for XML Data Management. Database Group Leipzig.
[25] M. Ramanath, J. Freire, J. Haritsa, and P. Roy. Searching for Efficient XML-to-Relational Mappings. In XSym'03: Proc. of the 1st Int. XML Database Symposium, volume 2824, pages 19-36, Berlin, Germany. Springer.
[26] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In VLDB'99: Proc. of the 25th Int. Conf. on Very Large Data Bases, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[27] I. Tatarinov, S. D. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang. Storing and Querying Ordered XML Using a Relational Database System. In SIGMOD'02: Proc. of the 21st Int. Conf. on Management of Data, Madison, Wisconsin, USA. ACM Press.
[28] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures (Second Edition). W3C, October.
[29] W. Xiao-ling, L. Jin-feng, and D. Yi-sheng. An Adaptable and Adjustable Mapping from XML Data to Tables in RDB. In Proc. of the VLDB'02 Workshop EEXTT and CAiSE'02 Workshop DTWeb, London, UK. Springer-Verlag.
[30] B. B. Yao and M. T. Ozsu. XBench: A Family of Benchmarks for XML DBMSs. University of Waterloo, School of Computer Science, Database Research Group.
[31] S. Zheng, J. Wen, and H. Lu. Cost-Driven Storage Schema Selection for XML. In DASFAA'03: Proc. of the 8th Int. Conf. on Database Systems for Advanced Applications, Kyoto, Japan. IEEE Computer Society.

XML View Based Access to Relational Data in Workflow Management Systems

Christian Dreier
Department of Informatics-Systems, University of Klagenfurt, Austria

Johann Eder, Marek Lehmann, Juergen Mangler
Department of Knowledge and Business Engineering, University of Vienna, Austria
johann.eder@univie.ac.at, marek.lehmann@univie.ac.at, juergen.mangler@univie.ac.at

Abstract

XML has become the most significant standard for data exchange and publication over the internet, but most business data remain stored in relational databases. Access to business data in workflows or Web services is frequently realized by accessing XML views published over relational databases. In these loosely coupled environments, where activities can execute on XML views generated from relational data without a possibility to lock the base data, it is necessary to provide view freshness control and invalidation mechanisms. In this paper we present an invalidation method for XML views published over relational data, developed for our prototype workflow management system.

Keywords: Workflow management, workflow data, XML, XML views, view invalidation

1 Introduction

XML has become the most significant standard for data exchange and publication over the internet. Nevertheless, most business data remain stored in relational databases. XML views over relational data are seen as a general way to publish relational data as XML documents. There are many proposals to overcome the mismatch between the flat relational and the hierarchical XML models (e.g. [1, 2]). Commercial relational database management systems also offer a possibility to publish relational data as XML (e.g. [3, 4]).

The importance of XML technology is increasing tremendously in process management. Web services [5], workflow management systems [6] and B2B standards [7, 8] use XML as a data format. Complex XML documents published and exchanged by business processes are usually defined with XML Schema types. Process activities expect and produce XML documents as parameters. XML documents encapsulated in messages (e.g.

XML documents encapsulated in messages (e.g. WSDL) can trigger new process instances. Frequently, hierarchical XML data used by processes and activities have to be translated into the flat relational data model used by external databases. These systems are very often loosely coupled, and it is impossible or very difficult to provide view maintenance. On the other hand, the original data can be accessed and modified by other systems or application programs. Therefore, a method of controlling the freshness of a view and invalidating views becomes vital.

We developed a view invalidation method for our prototype workflow management system. A special data access module, the so-called generic data access plug-in (GDAP), enables the definition of XML views over relational data. The GDAP offers a possibility to check the view freshness and can invalidate a stale view. In case of view update operations the GDAP automatically checks whether the view is not stale before propagating the update to the original database.

The remainder of this paper is organized as follows: Section 2 presents the overall architecture of our prototype workflow management system and introduces the idea of data access plug-ins used to provide uniform access to external data in workflows. Section 3 discusses invalidation mechanisms for XML views defined over relational data and Section 4 describes their actual implementation in our system. We give an overview of related work in Section 5 and finally draw conclusions in Section 6.

2 Uniform and Flexible Data Access in Workflow Management Systems

Workflow management systems are not intended to provide general data management capabilities, although they have to be able to work with large amounts of data coming from different sources. Business data, describing persistent business information necessary to run an enterprise, may be controlled either by the workflow management system or be managed in external systems (e.g. a corporate database). The workflow management system needs direct data access to make control flow decisions based upon data values. An important drawback is that data external to the workflow management system can only be used indirectly for this purpose, e.g. be queried for control decisions. Therefore most of the activity programming is related to accessing external databases [9].

We propose to provide the workflow management system with a uniform and transparent access method to all business data stored in any data source. The workflow management system should be able to use data coming from external and independent systems to determine a state transition or to pass them between activities as parameters. This is achieved by an abstraction layer called data access plug-ins.

A general architecture of our workflow management prototype is presented in Fig. 1. The workflow engine provides operational functions to support the execution of processes. The workflow repository stores both workflow definition and instance data. The program interaction manager calls programs implementing automated activities. The worklist manager is responsible for the worklists of the human actors and for the interaction with the worklist handlers. The data access plug-in manager is responsible for registering and managing data access plug-ins. Apart from the generic data access plug-in there may be specialized plug-ins for specific data sources (e.g. legacy systems). Our implementation included a generic data access plug-in for relational databases and another one for XML files stored in a file system.

Figure 1: Workflow management system architecture with data access plug-ins

2.1 Data Access Plug-ins

Data access plug-ins are reusable and interchangeable wrappers around external data sources which present the content of the underlying data sources to the workflow management system and manage the access to them. The functionality of external data sources is abstracted in these plug-ins. Each data access plug-in provides documents in one or several predefined XML Schema types. Both a data access plug-in and the XML Schema types served by this plug-in are registered with the workflow management system. Once registered, a data access plug-in can be reused in many workflow definitions to access external data as XML documents of a given type.

Consider the following frequent scenario: an enterprise has a large database with customer data stored in several relations and used in many processes. In our approach the company defines a complex XML Schema type describing customer data and implements a data access plug-in which wraps this database and retrieves and stores customer data in XML format. This has several advantages:

- Business data from external systems are accessible by the workflow management system. Thus, these data can be passed to activities and used to make control flow decisions.
- Activities can be parameterized with XML documents of predefined types. The logic for accessing external data sources is hidden in a data access plug-in fetching documents passed to activities at runtime. This allows activities to be truly reusable and independent of physical data location.
- Making external data access explicit with the data access plug-ins rather than hiding it in the activities improves the understandability, maintainability and auditability of process definitions.
- Both data access plug-ins and XML Schema types are reusable.
- This solution is easily evolvable. If the customer data have to be moved to a different database, it is sufficient to use another data access plug-in. The process definition and activities remain basically unchanged.
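A minimal sketch of such a plug-in contract is given below in Python (not necessarily the implementation language of the prototype); the class and method names are our own assumptions and merely mirror the operations described in this and the following subsection (reading, writing and creating documents, and evaluating XPath expressions).

```python
from abc import ABC, abstractmethod

class DataAccessPlugin(ABC):
    """Hypothetical wrapper around one external data source.

    A plug-in serves XML documents of one or more registered XML Schema
    types and hides the source-specific access logic from the workflow
    management system."""

    @abstractmethod
    def read_document(self, doc_id: str) -> str:
        """Return the XML document identified by doc_id."""

    @abstractmethod
    def write_document(self, doc_id: str, xml: str) -> None:
        """Propagate a changed XML document back to the data source."""

    @abstractmethod
    def create_document(self, xml: str) -> str:
        """Insert a new document into the collection and return its identifier."""

    @abstractmethod
    def evaluate_xpath(self, doc_id: str, xpath: str):
        """Evaluate an XPath expression on the selected document,
        e.g. to decide a conditional split in the workflow engine."""
```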

The task of a data access plug-in is to translate the operations on XML documents to the underlying data sources. A data access plug-in exposes to the workflow management system a simple interface which allows XML documents to be read, written or created in a collection of many documents of the same XML Schema type. Each document in the collection is identified by a unique identifier. The plug-in must be able to identify the document in the collection given only this identifier. Each data access plug-in also allows an XPath expression to be evaluated on a selected XML document.

The XML documents used within a workflow can be used by the workflow engine to control the flow of workflow processing. This is done in conditional split nodes by evaluating XPath conditions on documents. If a given document is stored in an external data source and accessed by a data access plug-in, then the XPath condition has to be evaluated by this plug-in. XPath is also used to access data values in XML documents.

2.2 Generic Data Access Plug-in for Relational Data Sources

Most business data remain stored in relational databases. Therefore, a generic and expandable solution for relational data sources was needed. A generic data access plug-in (GDAP) offers basic operations and can be extended by users for their specific data sources. The GDAP is responsible for mapping the hierarchical XML documents used by workflows and activities into the flat relational data model used by external databases. Thus, documents produced by the GDAP can be seen as XML views of relational data.

The workflows and activities managed by the workflow management system can run for a long time. In a loosely coupled workflow scenario it is neither reasonable nor possible to lock data in the original database for the processing time of a workflow. At the same time these data can be modified by other systems or workflows. In order to provide optimistic concurrency control, some form of view invalidation is required [4]. Therefore, the GDAP provides a view freshness control and view invalidation method. In case of view update operations the GDAP automatically checks whether the view is not stale before propagating the update to the original database.

3 Change Detection and View Invalidation

In our GDAP we analyzed and implemented a mechanism for the invalidation of XML views of relational data by detecting relevant base data changes. Change detection in base relational data can be done in two ways: passive or active. In our approach, passive change detection means that a GDAP is informed about changes in relational data used in views managed by this plug-in. Therefore, it is necessary that a (passive) plug-in is able to subscribe certain views to this notification process. Additionally, a view that is not used any longer needs to be unsubscribed. We identified three passive mechanisms for change detection:

1. Passive change detection using concepts of active databases: triggers are defined on the base relations containing data published in the views, informing the GDAP about changes in the database.

2. Passive change detection by change methods: This mechanism is based on object-oriented and object-relational databases, which provide the possibility to implement change methods. Change methods that implement change operations on the underlying base data can be extended by functionality to inform the plug-ins in case of potential view invalidations.

3. Passive change detection by event mechanisms: This is the most general approach because an additional publish-subscribe mechanism is assumed. GDAPs subscribe views for different events on the database (e.g. change operations on base data). If an event occurs, a notification is sent to the GDAP, initiating the view invalidation process.

On the other hand, using active change detection techniques, it is the GDAP's own responsibility to check periodically, or at defined points in time, whether the underlying base data have changed. Because no concepts of active or object-oriented databases and no publish-subscribe mechanisms are required, these techniques are universal. We distinguish three active change detection methods:

1. A naive approach is to back up the relevant relational base data at view creation time and compare it to the relational base data at the time when the view validity has to be checked. Differences between these two data sets may lead to view invalidation.

2. To avoid storing huge amounts of backup data, a hash function can be used to compute a hash value of the relevant base data and back that up. When a view validity check becomes necessary, a hash value is computed on the new database state and compared to the backed-up value to determine changes. Notice that in this case collisions may occur, i.e. hash values could be the same for different data, so that a change may go undetected.

3. Usage of change logs: Change logs record all changes within the database caused by internal or external actors. Because no database state needs to be backed up at view creation time, less space is used.

After changes in base data have been detected, the second GDAP task is to determine whether the corresponding views became stale, i.e. to check whether they need to be invalidated or not. Not all changes in the base relational data lead to invalid views. We developed an algorithm that considers both the type of change operation and the view definition to check the view invalidation. Figure 2 gives an overview of this algorithm. It shows the different effects caused by different change operations: The change of a tuple that is displayed in the view (i.e. that satisfies the selection condition) always leads to view invalidation. In certain cases change operations on tuples that are not displayed in the view lead to invalidation too: (1) Changes of tuples selected by the where-clause make views stale. (2) Changes invalidate all views whose view definitions do not contain a where-clause, if the changed tuples occur in relations listed in the from-clause of the view definition. These cases also apply to deletes and inserts. A tuple inserted by an insert operation invalidates the view if it is displayed in the view, it is selected by the where-clause, or the view definition does not contain a where-clause and the tuple is inserted into a relation occurring in the from-clause of the view definition. The same applies to the case of deletes: a view is invalidated if the deleted tuple is displayed in the view, it is selected by the where-clause, or it is deleted from a relation occurring in the view definition's from-clause.
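The decision logic of Figure 2 can be summarized in a few lines; the following Python sketch is our own reading of the algorithm, and the boolean inputs (selection condition satisfied, where-clause present, where-clause selects the tuple, tuple's relation occurs in the from-clause) are assumed to be determined by the caller for the changed, inserted or deleted tuple.

```python
def view_becomes_stale(tuple_shown_in_view: bool,
                       where_clause_exists: bool,
                       where_clause_selects_tuple: bool,
                       relation_in_from_clause: bool) -> bool:
    """Invalidation decision of Figure 2; applies uniformly to
    updates, inserts and deletes of a single base tuple."""
    if tuple_shown_in_view:
        # The tuple satisfies the selection condition and is displayed.
        return True
    if where_clause_exists:
        # Hidden tuples only matter if the where-clause selects them.
        return where_clause_selects_tuple
    # No where-clause: any change to a relation of the from-clause counts.
    return relation_in_from_clause
```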

Figure 2: View invalidation algorithm

If we assume that for update-propagation reasons the identifying primary keys of the base tuples are contained in the view, every tuple of the view can be associated with the base tuple and vice versa. Thus, every single change within the base data can be associated with the view.

4 Implementation

To validate our approach we implemented a generic data access plug-in which was integrated into our prototype workflow management system described in Section 2. The current implementation of the GDAP for relational databases (Fig. 3) takes advantage of the XML-DBMS middleware for transferring data between XML documents and relational databases [10]. XML-DBMS maps an XML document to the database according to an object-relational mapping in which element types are generally viewed as classes, and attributes and XML text data as properties of those classes. An XML-based mapping language allows the user to define an XML view of relational data by specifying these mappings. XML-DBMS also supports insert, update and delete operations. In our implementation we follow an assumption made by XML-DBMS that the view updateability problem has already been resolved. XML-DBMS checks only basic properties for view updateability, e.g. the presence of primary keys and obligatory attributes in an updateable view. Other issues, like content and structural duplication, are not addressed.

The GDAP controls the freshness of generated XML views using predefined triggers and so-called view-tuple lists (VTLs). A VTL contains the primary keys of the tuples which were selected into the view. VTLs are managed and stored internally by the GDAP. A sample view-tuple list, which contains the primary keys of the displayed tuples in each involved relation, is shown in Table 1. Our GDAP also uses triggers defined on the tables which were used to create a view. These triggers are created together with the view definition and store the primary keys of all tuples inserted, deleted or modified within these relations in a special log. This log is inspected by the GDAP later, and the information gathered during this inspection is used for the invalidation process of the view. Thus, our implementation uses a variant of active change detection with change logs as described in Section 3. Two different VTLs are used in the invalidation process: VTL_old is computed when the view is generated, VTL_new is computed at the time of the invalidation process.
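The bookkeeping structures can be pictured as follows; this Python sketch uses our own names (the prototype keeps VTLs in internal files and the change log in a table written by the triggers), and the sample values are taken from Table 1.

```python
from dataclasses import dataclass, field

@dataclass
class ViewTupleList:
    """Primary keys of the tuples selected into one view, per base relation."""
    view_id: str
    keys: dict = field(default_factory=dict)   # relation -> set of key tuples

@dataclass
class ChangeLogEntry:
    """One row written by a trigger on a base relation."""
    relation: str
    primary_key: tuple

# Illustrative VTL for "view 002" of Table 1.
vtl_old = ViewTupleList("view 002", {"position": {(5, 1), (5, 2), (5, 4)}})
```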

Figure 3: Generic data access plug-in architecture

Table 1: VTL example

view-id    | relation  | primary key         | key values
view 001   | order     | ordernr             |
           | customer  | customernr          |
view 002   | position  | ordernr, positionnr | 5,1  5,2  5,4
view 003   | article   | articlenr           |

The primary keys of the modified records in the original tables, logged by a trigger, are used to detect possible changes in a view. The algorithm is as follows. For each tuple T in the change log check one of the following (ID_T denotes the identifying primary key of the tuple T):

Case 1: ID_T is contained both in VTL_old and in VTL_new. This denotes an update operation.
Case 2: ID_T is contained only in VTL_new. This denotes an insert operation.
Case 3: ID_T is contained only in VTL_old. This denotes a delete operation.

The check procedure stops as soon as one of Cases 1-3 is true. This means that one of the modified tuples logged in the change log would be selected into the view and the view must be invalidated. But the view should also be invalidated if the selection condition of the view definition is not satisfied. This is described by the next two cases, which also must be checked.
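The per-tuple check of Cases 1-3 amounts to a single scan over the change log; the sketch below builds on the ViewTupleList structure assumed above (Cases 4 and 5, described next in the text, are checked separately).

```python
def change_log_invalidates(change_log, vtl_old, vtl_new) -> bool:
    """Cases 1-3: stop at the first logged tuple whose primary key occurs
    in VTL_old and/or VTL_new (update, insert or delete of a relevant tuple)."""
    for entry in change_log:
        in_old = entry.primary_key in vtl_old.keys.get(entry.relation, set())
        in_new = entry.primary_key in vtl_new.keys.get(entry.relation, set())
        if in_old and in_new:        # Case 1: update of a selected tuple
            return True
        if in_new and not in_old:    # Case 2: insert of a selected tuple
            return True
        if in_old and not in_new:    # Case 3: delete of a selected tuple
            return True
    return False
```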

Table 2: View example

tupleid | name | salary | maxsalary
1       | Joe  |        |
        | Bill |        |

Figure 4: View invalidation method implemented in GDAP

View_old and View_new denote the original view and the view generated during the validation checking process, respectively:

Case 4: If VTL_old is not equal to VTL_new, the view is invalid because the set of selected tuples has changed.
Case 5: If VTL_old is equal to VTL_new and View_old is not equal to View_new, the view has to be invalidated. That means that the tuples remained the same, but values within these tuples have changed.

To clarify why it is necessary to check Case 5, see the following view definition and the corresponding view listed in Table 2:

SELECT tupleid, name, salary,
       (SELECT max(salary) AS maxsalary FROM employees WHERE department = 'IT')
FROM employees
WHERE department = 'IT' AND tupleid < 3

If the salary of the employee with the maximum salary is changed (notice that this employee is not selected by the selection condition), still the same tuples are selected, but within the view maxsalary changes.

The invalidation checking in Cases 1-4 does not require view recomputation, and Case 5 only needs to be checked if Cases 1-4 are not satisfied. Notice that while Cases 4 and 5 have to be checked only once, Cases 1-3 have to be checked for every tuple occurring in the change log. The invalidation algorithm used in our GDAPs is summarized in Fig. 4. Case 5 is checked by comparing two views, as shown above. The disadvantages of this comparison are that the recomputation of View_new is time and resource consuming, and the comparison itself may be very inefficient. A more efficient way is to replace Case 5 by Case 5a:

Table 3: View example

tupleid | name | salary | comparisonsalary
1       | Joe  |        |
        | Bill |        |

Case 5a: If VTL_old is equal to VTL_new, there is an additional sub-select clause within the select clause, and any relation in the from-clause of this sub-select has been changed, then the view is invalidated.

This way the view shown in Table 2 can be invalidated after the attribute maxsalary has changed. It is also possible that the sub-select clause does not contain group functions, as in the following view definition:

SELECT tupleid, name, salary,
       (SELECT salary AS comparisonsalary FROM employees WHERE tupleid = 10)
FROM employees
WHERE department = 'IT' AND tupleid < 3

The resulting view is listed in Table 3. If there is a change operation on the salary of the employee with tupleid = 10, the view has to be invalidated. This is checked by Case 5a. Additionally, all changes on relations occurring in the from-clause of the sub-select lead to an invalidation of the view. With Case 5a even changes that do not affect the view itself can make it stale; thus, over-invalidation may occur to a high degree. Still, this mechanism of checking view freshness and view invalidation seems to be a more efficient alternative to Case 5.

5 Related Work

In most existing workflow management systems, data used to control the flow of the workflow instances (i.e. workflow relevant data) are controlled by the workflow management system itself and stored in the workflow repository. If these data originate in external data sources, then the external data are usually copied into the workflow repository. There is no universal standard for accessing external data in workflow management systems; basically each product uses a different solution [11].

There has been recent interest in publishing relational data as XML documents, often called XML views over relational data. Most of these proposals focus on converting queries over XML documents into SQL queries over the relational tables and on efficient methods of tagging and structuring relational data as XML (e.g. [1, 2, 12]). The view updateability problem is well known in relational and object-relational databases [13]. The mismatch between the flat relational and the hierarchical XML model is an additional challenge; this problem is addressed in [14]. However, most proposals for updateable XML views [15] and commercial RDBMSs (e.g. [3]) assume that the XML view updateability problem is already solved. The term view freshness is not used in a uniform way, dealing with the currency and timeliness [16] of a (materialized) view.

Additional dimensions of view freshness regarding the frequency of changes are discussed in [17]. We do not distinguish all these dimensions in this paper; here view freshness means the validity of a view. Different types of invalidation, including over- and under-invalidation, are discussed in [18].

A view is not fresh (stale) if data used to generate the view were modified. It is important to detect the relevant modifications. In [19] the authors proposed to store both the original XML documents and their materialized XML views in special relational tables and to use an update log to detect relevant updates. A new view generated from the modified base data may be different from a previously generated view. Several methods for change detection in XML documents have been proposed (e.g. [20, 21]). The authors of [22] proposed to first store XML in special relational tables and then to use SQL queries to detect content changes of such documents. In [4] before and after images of an updateable XML view are compared in order to find the differences, which are later used to generate the corresponding SQL statements responsible for updating the relational data. The before and after images are also used to provide optimistic concurrency control.

6 Conclusions

The data aspect of workflows requires more attention. Since workflows typically access databases for performing activities or making flow decisions, the correct synchronization between the base data and the copies of these data in workflow systems is of great importance for the correctness of the workflow execution. We described a way of recognizing the invalidation of materialized views of relational data used in workflow execution. To check the freshness of generated views, our algorithm does not require any special data structures in the RDBMS except a log table and triggers. Additionally, view-tuple lists are managed to store the primary keys of tuples selected into a view. Thus, only a very small amount of overhead data is stored and can be used to invalidate stale views.

The implemented generic data access plug-in enables flexible publication of relational data as XML documents used in loosely coupled workflows. This brings obvious advantages for intra- and interorganizational exchange of data. In particular, it makes the definition of workflows easier and the coupling between workflow system and databases more transparent, since it is no longer necessary to perform all the necessary checks in the individual activities of a workflow.

Acknowledgements

This work is partly supported by the Commission of the European Union within the project WS-Diamond in FP6.STREP.

References

[1] M. Fernández, Y. Kadiyska, D. Suciu, A. Morishima, and W.-C. Tan. SilkRoute: A framework for publishing relational data in XML. ACM Trans. Database Syst., vol. 27, no. 4.

[2] J. Shanmugasundaram, J. Kiernan, E. J. Shekita, C. Fan, and J. Funderburk. Querying XML views of relational data. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., 2001.

[3] Oracle. XML Database Developer's Guide - Oracle XML DB, Release 2 (9.2). Oracle Corporation, October.

[4] M. Rys. Bringing the internet to your database: Using SQL Server 2000 and XML to build loosely-coupled systems. In Proceedings of the 17th International Conference on Data Engineering (ICDE), April 2-6, 2001, Heidelberg, Germany. IEEE Computer Society, 2001.

[5] T. Andrews, F. Curbera, H. Dholakia, Y. Goland, J. Klein, F. Leymann, K. Liu, D. Roller, D. Smith, S. Thatte, I. Trickovic, and S. Weerawarana. Business Process Execution Language for Web Services (BPEL4WS). BEA, IBM, Microsoft, SAP, Siebel Systems, Tech. Rep. 1.1, 5 May.

[6] WfMC. Process Definition Interface - XML Process Definition Language (XPDL 2.0). Workflow Management Coalition, Tech. Rep. WFMC-TC-1025.

[7] ebXML. ebXML Technical Architecture Specification v1.0.4. ebXML Technical Architecture Project Team.

[8] M. Sayal, F. Casati, U. Dayal, and M.-C. Shan. Integrating workflow management systems with business-to-business interaction standards. In Proceedings of the 18th International Conference on Data Engineering (ICDE '02). IEEE Computer Society, 2002.

[9] M. Ader. Workflow and business process management comparative study, volume 2. Workflow & Groupware Stratégies, Tech. Rep., June.

[10] R. Bourret. XML-DBMS middleware. Viewed: May 2005.

[11] N. Russell, A. H. M. ter Hofstede, D. Edmond, and W. van der Aalst. Workflow data patterns. Queensland University of Technology, Brisbane, Australia, Tech. Rep. FIT-TR, April.

[12] J. Shanmugasundaram, E. J. Shekita, R. Barr, M. J. Carey, B. G. Lindsay, H. Pirahesh, and B. Reinwald. Efficiently publishing relational data as XML documents. VLDB J., vol. 10, no. 2-3.

[13] C. Date. An Introduction to Database Systems, Eighth Edition. Addison Wesley.

[14] L. Wang and E. A. Rundensteiner. On the updatability of XML views published over relational data. In Conceptual Modeling - ER 2004, 23rd International Conference on Conceptual Modeling, Shanghai, China, November 2004, Proceedings, ser. Lecture Notes in Computer Science, P. Atzeni, W. W. Chu, H. Lu, S. Zhou, and T. W. Ling, Eds. Springer, 2004.

[15] I. Tatarinov, Z. G. Ives, A. Y. Halevy, and D. S. Weld. Updating XML. In SIGMOD Conference.

[16] M. Bouzeghoub and V. Peralta. A framework for analysis of data freshness. In IQIS 2004, International Workshop on Information Quality in Information Systems, 18 June 2004, Paris, France (SIGMOD 2004 Workshop), F. Naumann and M. Scannapieco, Eds. ACM, 2004.

[17] M. Bouzeghoub and V. Peralta. On the evaluation of data freshness in data integration systems. In 20èmes Journées de Bases de Données Avancées (BDA 2004).

[18] K. S. Candan, D. Agrawal, W.-S. Li, O. Po, and W.-P. Hsiung. View invalidation for dynamic content caching in multitiered architectures. In VLDB, 2002.

[19] H. Kang, H. Sung, and C. Moon. Deferred incremental refresh of XML materialized views: Algorithms and performance evaluation. In Database Technologies 2003, Proceedings of the 14th Australasian Database Conference, ADC 2003, Adelaide, South Australia, February 2003, ser. CRPIT, vol. 17. Australian Computer Society, 2003.

[20] G. Cobena, S. Abiteboul, and A. Marian. Detecting changes in XML documents. In Proceedings of the 18th International Conference on Data Engineering (ICDE), 26 February - 1 March 2002, San Jose, CA. IEEE Computer Society, 2002.

[21] Y. Wang, D. J. DeWitt, and J.-Y. Cai. X-Diff: An effective change detection algorithm for XML documents. In Proceedings of the 19th International Conference on Data Engineering (ICDE), March 5-8, 2003, Bangalore, India, U. Dayal, K. Ramamritham, and T. M. Vijayaraman, Eds. IEEE Computer Society, 2003.

[22] E. Leonardi, S. S. Bhowmick, T. S. Dharma, and S. K. Madria. Detecting content changes on ordered XML documents using relational databases. In Database and Expert Systems Applications, 15th International Conference, DEXA 2004, Zaragoza, Spain, August 30 - September 3, 2004, Proceedings, ser. Lecture Notes in Computer Science, F. Galindo, M. Takizawa, and R. Traunmüller, Eds. Springer, 2004.

Incremental Trade-Off Management for Preference-Based Queries (1)

Wolf-Tilo Balke, Christoph Lofi
L3S Research Center, Leibniz University Hannover, Appelstr. 9a, Hannover, Germany
{balke, lofi}@l3s.de

Ulrich Güntzer
Institute of Computer Science, University of Tübingen, Sand, Tübingen, Germany
ulrich.guentzer@informatik.uni-tuebingen.de

Abstract

Preference-based queries, often referred to as skyline queries, play an important role in cooperative query processing. However, their prohibitive result sizes pose a severe challenge to the paradigm. In this paper we discuss the incremental re-computation of skylines based on additional information elicited from the user. Extending the traditional case of totally ordered domains, we consider preferences in their most general form as strict partial orders of attribute values. After getting an initial skyline set, our approach incrementally elicits further information from the user. This additional knowledge is then incorporated into the preference information and constantly reduces skyline sizes. In particular, our approach also allows users to specify trade-offs between different query attributes, thus effectively decreasing the query dimensionality. We provide the required theoretical foundations for modeling preferences and equivalences, show how to compute incremented skylines, and prove the correctness of the algorithm. Moreover, we show that incremented skyline computation can take advantage of locality and database indices, and thus the performance of the algorithm can be additionally increased.

Keywords: Personalized Queries, Skylines, Trade-Off Management, Preference Elicitation

1 Introduction

Preference-based queries, usually called skyline queries in database research [9], [4], [19], have become a prime paradigm for cooperative information systems. Their major appeal is the intuitiveness of use in contrast to other query paradigms like e.g. rigid set-based SQL queries, which only too often return an empty result set, or efficient, but hard to use, top-k queries, where the success of a query depends on choosing the right scoring or utility functions.

(1) Part of this work was supported by a grant of the German Research Foundation (DFG) within the Emmy Noether Program of Excellence.

Skyline queries offer user-centered querying, as the user just has to specify the basic attributes to be queried and in return retrieves the Pareto-optimal result set. In this set all possible best answers (with respect to any monotonic optimization function) are returned. Hence, a user cannot miss any important answer. However, the intuitiveness of querying comes at a price. Skyline sets are known to grow exponentially in size [8], [14] with the number of query attributes and may reach unreasonable sizes (of about half of the original database size, cf. [3], [7]) already for as little as six independent query predicates.

The problem becomes even worse if, instead of totally ordered domains, user preferences on arbitrary predicates over attribute-based domains are considered. In database retrieval, preferences are usually understood as partial orders [12], [15], [20] of domain values that allow for incomparability between attribute values. This incomparability is reflected in the respective skyline sizes, which are generally significantly bigger than in the totally ordered case. On the other hand, such attribute-based domains like colors, book titles, or document formats play an important role in practical applications, e.g. digital libraries or e-commerce applications. As a general rule of thumb it can be stated that the more preference information (including its transitive implications) is given by the user with respect to each attribute, the smaller the average skyline set can be expected to be.

In addition to prohibitive result set sizes, skyline queries are expensive to compute. Evaluation times in the range of several minutes or even hours over large databases are not unheard of. One possible solution is based on the idea of refining skyline queries incrementally by taking advantage of user interaction. This approach is promising since it benefits skyline sizes as well as evaluation times. Recently, several approaches have been proposed for user-centered refinement:

- using an interactive, exploratory process steering the progressive computation of skyline objects [17],
- exploiting feedback on a representative sample of the original skyline result [8], [16],
- projecting the complete skyline on subsets of predicates using pre-computed skycubes [20], [23].

The benefit of offering intuitive querying and a cooperative system behavior to the user in all three approaches can be obtained with a minimum of user interaction to guide the further refinement of the skyline. However, when dealing with a massive amount of result tuples, the first approach needs a certain user expertise for steering the progressive computation effectively. The second approach faces the problem of deriving representative samples efficiently, i.e. avoiding a complete skyline computation for each sample. In the third approach the necessary pre-computations are expensive in the face of updates of the database instance. Moreover, basic theoretical properties of incremented preferences with respect to possible preference collisions and the induced query modification and query evaluation have been outlined in [13]. In this paper we will

- provide the theoretical foundations of modeling partially ordered preferences and equivalences on attribute domains,
- provide algorithms for incrementally and interactively computing skyline sets, and
- prove the soundness and consistency of the algorithms

(and thus give a comprehensive view of [1], [2], [6]).
Seeing preferences in their most general form as partial orders between domain values implicitly includes the case of totally ordered domains. After getting a (usually too big) initial skyline set, our approach aims at interactively eliciting additional user wishes. The additional knowledge is then incorporated into the preference information and helps to reduce skyline sets. Our contribution thus is:

- Users are enabled to specify additional preference information (in the sense of domination), as well as equivalences (in the sense of indifference), between attributes, leading to an incremental reduction of the skyline. Here our system efficiently supports the user by automatically taking care that newly specified preferences and equivalences never violate the consistency of the previously stated preferences.
- Our skyline evaluation algorithm allows specifying such additional information within a certain attribute domain. That means that more preference information about an attribute is elicited from the user. Thus the respective preference will be more complete and skylines will usually become smaller. This can reduce skylines to the (on average considerably smaller) sizes of total-order skylines by canceling out incomparability between attribute values.
- In addition, our evaluation algorithm also allows specifying additional relationships between preferences on different attributes. This feature allows defining the qualitative importance or equivalence of attributes in different domains and thus forms a good tool to compare the respective utility or desirability of certain attribute values. The user can thus express trade-offs or compromises he/she is willing to take and can also adjust imbalances between fine-grained and coarse preference specifications.
- We show that the efficiency of incremented skyline computation can be considerably increased by employing preference diagrams. We derive an algorithm which takes advantage of the locality of incremented skyline set changes depending on the changes made by the user to the preference diagram. By that, the algorithm can operate on a considerably smaller dataset with increased efficiency.

Spanning preferences across attributes (by specifying trade-offs) is the only way, short of dropping entire query predicates, to reduce the dimensionality of the skyline computation and thus severely reduce skyline sizes. Nevertheless, the user stays in full control of the information specified, and all information is added only in a qualitative way, not by unintuitive weightings.

2 A Skyline Query Use-Case and Modeling

Before discussing the basic concepts of skyline processing, let us first take a closer look at a motivating scenario which illustrates our modeling approach with a practical example:

Figure 1. Three typical user preferences (left) and an enhanced preference (right)

Figure 2. Original and induced preference relationships for trade-offs

Example: Anna is currently looking for a new apartment. Naturally, she has some preferences how and where she wants to live. Figure 1 shows preference diagrams of base preferences modeled as a strict partial order on the domain values of three attributes (cf. [15], [12]): location, apartment type and price. These preferences might either be stated explicitly by Anna together with the query or might be derived from her profile or activity history [11]. Some of these preferences may even be common domain knowledge (cf. [5]), like for instance that in case of two equally desirable objects the less costly alternative is generally preferred. Based on such preferences, Anna may now retrieve the skyline over a real-estate database. The result is the Pareto-optimal set of available apartments consisting of all apartments which are not dominated by others; e.g. a cheap beach area loft immediately dominates all more expensive 2-bedrooms or studios, but can, for instance, not dominate any maisonette.

After the first retrieval Anna has to manually review a probably large skyline. In the first few retrieval steps skylines usually contain a large portion of all database objects due to the incomparability between many objects. But the size of the Pareto-optimal set may be reduced incrementally by providing suitable additional information on top of the stated preferences, which will then result in new domination relationships on the level of the database objects and thus remove less preferred objects from the skyline.

Naturally, existing preferences might be extended by adding some new preference relationships. But also explicit equivalences may be stated between certain attribute values, expressing actual indifference and thus resulting in new domination relationships, too.

Example (cont.): Assume that the skyline still contains too many apartments. Thus, Anna interactively refines her originally stated preferences. For example, she might state that she actually prefers the arts district over the university district and the latter over the commercial district, which would turn the preference P1 into a totally ordered relation. This would for instance allow apartments located in the arts and university district to dominate those located in the commercial district with respect to P1, resulting in a decrease of the size of the Pareto-optimal set of skyline objects. Alternatively, Anna might state that she actually does not care whether her flat is located in the university district or the commercial district, i.e. that these two attribute values are equally desirable for her. This is illustrated in the right-hand part of Figure 1 as the enhanced preference P1'. In this case, it is reasonable to deduce that all arts district apartments will dominate commercial district apartments with respect to the location preference.

Preference relationships over attribute domains lead to domination relationships on database objects when the skyline operator is applied to a given database. These resulting domination relationships are illustrated by the solid arrows in Figure 2. However, users might also weigh some predicates as more important than others and hence might want to model trade-offs they are willing to consider. Our preference modeling approach introduced in [1] allows expressing such trade-offs by providing new preference relations or equivalence relations between different attributes, thereby amalgamating some attributes in the skyline query and subsequently reducing the dimensionality of the query.

Example (cont.): While refining her preferences, Anna realizes that she actually would consider the area in which her new apartment is located as more important than the actual apartment type; in other words: for her, a relaxation in the apartment type is less severe than a relaxation in the area attribute. Thus she states that she would consider a beach area studio (the least desired apartment type in the best area) still as equally desirable to a loft in the arts district (the best apartment type in a less preferred area). By doing that, she stated a preference on an amalgamation of the attributes apartment type and location. This statement induces new domination relations on database objects (illustrated as the dotted arrows in Figure 2), allowing for example any cheaper beach area 2-bedroom to dominate all equally priced or more expensive arts district lofts (by using the ceteris paribus [18] assumption). In this way, the result set size of the skyline query can be decreased.
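To illustrate how such base preferences translate into object-level domination, the following Python sketch encodes a few of the "better-than" edges of Figure 1 and tests Pareto dominance between two apartments; the attribute values are taken from the example (price is left out for brevity), while the encoding and the dominance test merely anticipate the formal definitions of Section 3.

```python
# Base preferences as sets of (worse, better) pairs; only the edges that can
# be read off the example are listed.
P_location = {(d, "beach area")
              for d in ("commercial district", "arts district",
                        "outer suburbs", "university district")}
P_type = {("studio", "2-bedroom"), ("2-bedroom", "loft"),
          ("2-bedroom", "maisonette"), ("studio", "loft"),
          ("studio", "maisonette")}

def dominates(prefs, a, b):
    """True if apartment b Pareto-dominates apartment a: b is better or
    equal in every attribute and strictly better in at least one."""
    strictly = False
    for p, va, vb in zip(prefs, a, b):
        if (vb, va) in p:          # a is strictly better here -> no domination
            return False
        if (va, vb) in p:          # b is strictly better here
            strictly = True
        elif va != vb:             # incomparable values -> no domination
            return False
    return strictly

beach_loft = ("beach area", "loft")
print(dominates([P_location, P_type], ("beach area", "studio"), beach_loft))      # True
print(dominates([P_location, P_type], ("arts district", "maisonette"), beach_loft))  # False: loft and maisonette are incomparable
```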

3 Theoretical Foundation and Formalization

In this section we formalize the semantics of adding incremental preference or equivalence information on top of already existing base preferences or base equivalences. First, we provide the basic definitions required to model base and amalgamated preferences and equivalence relationships. Then, we provide basic theorems which allow for consistent incremented skyline computation (cf. [1]). Moreover, we show that it suffices to calculate incremental changes on transitively reduced preference diagrams. We show that local changes in the preference graph only result in locally restricted recomputations for the incremented skyline and thus lead to superior performance (cf. [2]).

3.1 Base Preferences and the Induced Pareto Aggregation

In this section we will provide the basic definitions which are prerequisites for Sections 3.2 and 3.3. We will introduce the notions of base preferences, base equivalences, their amalgamated counterparts, a generalized Pareto composition and a generalized skyline. The basic constructs are so-called base preferences defining strict partial orders on attribute domains of database objects (based on [12], [15]):

Definition 1: (Base Preference) Let D_1, D_2, ..., D_m be a non-empty set of m domains (i.e. sets of attribute values) for the attributes Attr_1, Attr_2, ..., Attr_m, so that D_i is the domain of Attr_i. Furthermore, let O ⊆ D_1 × D_2 × ... × D_m be a set of database objects and let attr_i : O → D_i be a function mapping each object in O to a value of the domain D_i. Then a Base Preference P_i ⊆ D_i × D_i is a strict partial order on the domain D_i. The intended interpretation of (x, y) ∈ P_i with x, y ∈ D_i (alternatively written x <_Pi y) is "attribute value y (of the domain D_i) is better than attribute value x (of the same domain)". This implies that for o_1, o_2 ∈ O, (attr_i(o_1), attr_i(o_2)) ∈ P_i means "object o_2 is better than object o_1 with respect to its i-th attribute value".

In addition to specifying preferences on a domain D_i, we also allow equivalences to be defined, as given in Definition 2.

Definition 2: (Base Equivalence and Compatibility) Let O be a set of database objects and P_i a base preference on D_i as given in Definition 1. Then we define a Base Equivalence Q_i ⊆ D_i × D_i as an equivalence relation (i.e. Q_i is reflexive, symmetric and transitive) which is compatible with P_i, defined as:

a) Q_i ∩ P_i = ∅ (meaning no equivalence in Q_i contradicts any strict preference in P_i)
b) P_i ∘ Q_i = Q_i ∘ P_i = P_i (the domination relationships expressed transitively using P_i and Q_i must always be contained in P_i)

In particular, as Q_i is an equivalence relation, Q_i trivially contains the pairs (x, x) for all x ∈ D_i. The interpretation of base equivalences is similarly intuitive as for base preferences: (x, y) ∈ Q_i with x, y ∈ D_i (alternatively written x ~_Qi y) means "I am indifferent between the attribute values x and y of the domain D_i".

As mentioned in Definition 2, a given base preference P_i and base equivalence Q_i have to be compatible with each other. This means that, on the one hand, an attribute value x can never be considered (transitively) equivalent to and (transitively) preferred to some attribute value y at the same time. On the other hand, preference relationships between attribute values should always extend to all equivalent attribute values, too. Please note that generally there still may remain values x, y ∈ D_i where neither x <_Pi y nor y <_Pi x nor x ~_Qi y holds. We call these values incomparable.

The base preferences P_1, ..., P_m together with the base equivalences Q_1, ..., Q_m induce a partial order on the set of database objects according to the notion of Pareto optimality. This partial order is created by the Generalized Pareto Aggregation (cf. [6]), which is given in Definition 3.

Definition 3: (Generalized Pareto Aggregation for Base Preferences and Equivalences) Let O be a set of database objects, P_1, ..., P_m be a set of m base preferences as given in Definition 1 and Q_1, ..., Q_m be a set of m compatible base equivalence relations as defined in Definition 2. Then we define the Generalized Pareto Aggregation for base preferences and equivalences as:

Pareto(O, P_1, ..., P_m, Q_1, ..., Q_m) := { (o_1, o_2) ∈ O² | ∀i ≤ m: (attr_i(o_1), attr_i(o_2)) ∈ (P_i ∪ Q_i) ∧ ∃j ≤ m: (attr_j(o_1), attr_j(o_2)) ∉ Q_j }

As stated before, the generalized Pareto aggregation induces an order on the actual database objects. This order can be extended to an Object Preference as given by Definition 4.

Definition 4: (Object Preference P and Object Equivalence Q) An Object Preference P ⊆ O² is defined as a strict partial order on the set of database objects O containing the generalized Pareto aggregation of the underlying set of base preferences and base equivalences: P ⊇ Pareto(O, P_1, ..., P_m, Q_1, ..., Q_m). Furthermore, we define the Object Equivalence Q ⊆ O² as an equivalence relation on O (i.e. Q is reflexive, symmetric and transitive) which is compatible with P (cf. Definition 2) and respects Q_1, ..., Q_m in the following way:

∀i ≤ m: (x_i, y_i) ∈ Q_i ⟹ ((x_1, ..., x_m), (y_1, ..., y_m)) ∈ Q

In particular, Q contains at least the identity tuples (o, o) for each o ∈ O.

An object-level preference P, as given by Definition 4, contains at least the order induced by the Pareto aggregation function on the base preferences and equivalences. Additionally, it can be enhanced and extended by other user-provided relationships (thus leaving the strict Pareto domain). Often, users are willing to perform a trade-off, i.e. relax their preferences in one attribute in favor of a better performance in another attribute. For modeling trade-offs, we therefore introduce the notion of amalgamated preferences and amalgamated equivalences. Building on the example given in Section 2, an amalgamated equivalence could be a statement like "I am indifferent between an arts district studio and a university district loft" (as illustrated in Figure 3). Thus, this equivalence statement is modeled on an amalgamation of domains (location and type) and results in new equivalence and preference relationships on the database instance.
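A direct transcription of Definition 3 might look as follows; this is a sketch under our own naming, where each object is a tuple of attribute values, attr_i is realized as tuple indexing, and each P_i and Q_i is given as a set of value pairs (with Q_i assumed reflexive, as required by Definition 2).

```python
def pareto_pair(o1, o2, P, Q):
    """True iff (o1, o2) is in Pareto(O, P_1..P_m, Q_1..Q_m):
    o2 is preferred or equivalent to o1 in every attribute and
    strictly preferred in at least one (P_i and Q_i are disjoint)."""
    strictly_preferred = False
    for i in range(len(P)):
        pair = (o1[i], o2[i])
        if pair in P[i]:
            strictly_preferred = True
        elif pair not in Q[i]:        # neither preferred nor equivalent
            return False
    return strictly_preferred
```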

Figure 3. Modeling a trade-off using an Amalgamated Preference

Definition 5: (Amalgamated Preference Functions) Let Θ ⊆ {1, ..., m} be a set with cardinality k. Using π as the projection in the sense of relational algebra, we define the function

AmalPref_Θ : (∏_{i∈Θ} D_i)² → ℘(O²)

(x_Θ, y_Θ) ↦ { (o_1, o_2) ∈ O² | ∀i ∈ Θ: (π_Attr_i(o_1) = π_Attr_i(x_Θ) ∧ π_Attr_i(o_2) = π_Attr_i(y_Θ)) ∧ ∀i ∈ {1, ..., m}\Θ: π_Attr_i(o_1) = π_Attr_i(o_2) }

This means: given two tuples x_Θ, y_Θ from the same amalgamated domains described by Θ, the function AmalPref_Θ(x_Θ, y_Θ) returns a set of relationships between database objects of the form (o_1, o_2), where the attributes of o_1 projected on the amalgamated domains equal those of x_Θ, the attributes of o_2 projected on the amalgamated domains equal those of y_Θ, and furthermore all attributes which are not within the amalgamated attributes are identical for o_1 and o_2. The last requirement denotes the well-known ceteris paribus [18] condition. The relationships created by that function may be incorporated into P as long as they do not violate the consistency of P. The conditions and detailed mechanics allowing this incorporation are the topic of Section 3.2.

Please note the typical cross shape introduced by trade-offs: a relaxation in one attribute is compared to a relaxation in the second attribute. Though in the Pareto sense the two respective objects are not comparable (in Figure 3 the arts district studio has a better value with respect to location, whereas the university district loft has a better value with respect to type), amalgamation adds respective preference and equivalence relationships between those objects.

Definition 6: (Amalgamated Equivalence Functions) Let Θ ⊆ {1, ..., m} be a set with cardinality k. Using π as the projection in the sense of relational algebra, we define the function

AmalEq_Θ : (∏_{i∈Θ} D_i)² → ℘(O²)

(x_Θ, y_Θ) ↦ { (o_1, o_2) ∈ O² | ∀i ∈ Θ: [ (π_Attr_i(o_1) = π_Attr_i(x_Θ) ∧ π_Attr_i(o_2) = π_Attr_i(y_Θ)) ∨ (π_Attr_i(o_2) = π_Attr_i(x_Θ) ∧ π_Attr_i(o_1) = π_Attr_i(y_Θ)) ] ∧ ∀i ∈ {1, ..., m}\Θ: π_Attr_i(o_1) = π_Attr_i(o_2) }
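The ceteris paribus semantics of Definition 5 can be sketched as a simple filter over all object pairs; the function below is our own illustrative rendering (x_theta and y_theta are dicts mapping the amalgamated attribute positions to values, objects are plain tuples), and the symmetric variant of Definition 6 would additionally add the mirrored pairs.

```python
def amal_pref(objects, x_theta, y_theta):
    """Object pairs (o1, o2) induced by the trade-off x_theta -> y_theta.

    x_theta and y_theta map attribute positions (the amalgamated set) to
    values; all attributes outside that set must coincide (ceteris paribus)."""
    theta = set(x_theta)                       # amalgamated attribute positions
    pairs = set()
    for o1 in objects:
        for o2 in objects:
            matches = all(o1[i] == x_theta[i] and o2[i] == y_theta[i]
                          for i in theta)
            ceteris_paribus = all(o1[i] == o2[i]
                                  for i in range(len(o1)) if i not in theta)
            if matches and ceteris_paribus:
                pairs.add((o1, o2))
    return pairs
```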

The function differs from amalgamated preferences in that it returns symmetric relationships, i.e. if (o_1, o_2) ∈ Q, then (o_2, o_1) also has to be in Q. Furthermore, these relationships have to be incorporated into Q instead of P in order to preserve consistency. But due to the compatibility characteristic, P can also be affected by new relationships in Q.

Based on an object preference P we can finally derive the skyline set (given by Definition 7) which is returned to the user. This set contains all possible best objects with respect to the underlying user preferences. This means that the set contains only those objects that are not dominated by any other object in the object-level preference P. Note that we call this set the generalized skyline, as it is derived from P, which is initially the Pareto order but may also be extended with additional relationships (e.g. trade-offs). Note that the generalized skyline set is solely derived using P; still it respects Q by introducing new relationships into P based on recent additions to Q (cf. Definition 8).

Definition 7: (Generalized Skyline) The Generalized Skyline S ⊆ O is the set containing all optimal database objects with respect to a given object preference P and is defined as

S := { o ∈ O | ¬∃ o' ∈ O : (o, o') ∈ P }

3.2 Incremental Preference and Equivalence Sets

The last section provided the basic definitions required for dealing with preference and equivalence sets. In the following sections we provide a method for the incremental specification and enhancement of object preference sets P and equivalence sets Q. Also, we show under which conditions the addition of new object preferences / equivalences is safe and how compatibility and soundness can be ensured.

The basic approach for dealing with incremental preference and equivalence sets is illustrated in Figure 4. First, the base preferences P_1 to P_m (Definition 1) and the corresponding base equivalences Q_1 to Q_m (Definition 2) are elicited. Based on these, the initial object preference P (Definition 4) is created by using the generalized Pareto aggregation (Definition 3). The initial object equivalence Q starts as a minimal relation as defined in Definition 4. The generalized Pareto skyline (Definition 7) of P is then displayed to the user and the iterative phase of the process starts. Users now have the opportunity to specify additional base preferences or equivalences or amalgamated relationships (Definition 5, Definition 6) as described in the previous section. The set of new object relationships resulting from the ceteris paribus functions of the newly stated user information is then checked for compatibility using the constraints given in this section. If the new object relationships are compatible with P and Q, they are inserted and thus incremented sets P* and Q* are formed. If the relationships are not consistent with the previously stated information, then the last addition is discarded and the user is notified. The user thus can state more and more information until the generalized skyline is lean enough for manual inspection.

Figure 4. General Approach for Iterated Preference / Equivalence Sets
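Definition 7 translates into a straightforward (quadratic) computation over the object-level preference; the following sketch is purely illustrative and treats P as a set of object pairs.

```python
def generalized_skyline(objects, P):
    """S := { o in O | there is no o' in O with (o, o') in P },
    i.e. all objects that are not dominated by any other object."""
    return [o for o in objects
            if not any((o, other) in P for other in objects)]
```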

Definition 8: (Incremented Preference and Equivalence Set) Let O be a set of database objects, P ⊆ O² be a strict preference relation, P^conv ⊆ O² be the set of converse preferences with respect to P, and Q ⊆ O² be an equivalence relation that is compatible with P. Let further S ⊆ O² be a set of object pairs (called incremental preferences) such that ∀x, y ∈ O: (x, y) ∈ S ⟹ (y, x) ∉ S and S ∩ (P ∪ P^conv ∪ Q) = ∅, and let E ⊆ O² be a set of object pairs (called incremental equivalences) such that ∀(x, y) ∈ O²: (x, y) ∈ E ⟹ (y, x) ∈ E and E ∩ (P ∪ P^conv ∪ Q ∪ S) = ∅. Then we define T as the transitive closure T := (P ∪ Q ∪ S ∪ E)⁺ and the incremented preference relation P* and the incremented equivalence relation Q* as

P* := { (x, y) ∈ T | (y, x) ∉ T }  and  Q* := { (x, y) ∈ T | (y, x) ∈ T }

The basic intuition is that S and E contain the new preference and equivalence relationships that have been elicited from the user in addition to those given in P and Q. For example, S and E can result from the user specifying a trade-off and, in this case, are induced using the ceteris paribus semantics (cf. Definition 5 and Definition 6). The only conditions on S and E are that they can neither directly contradict each other, nor are they allowed to contradict already known information. The sets P* and Q* then are the new preference/equivalence sets that incorporate all the information from S and E and that will be used to calculate the new generalized and probably smaller skyline set. Definition 8 indeed results in the desired incremental skyline set, as we will prove in Theorem 1:

Theorem 1: (Correct Incremental Skyline Evaluation with P* and Q*) Let P* and Q* be defined as in Definition 8. Then the following statements hold:

1) P* defines a strict partial order (specifically: P* does not contain cycles)
2) Q* is an equivalence relation compatible with the preference relation P*
3) Q ∪ E ⊆ Q*
4) The following statements are equivalent:
   a) P ∪ S ⊆ P*
   b) P* ∩ (P ∪ S)^conv = ∅ and Q* ∩ (P ∪ S)^conv = ∅
   c) No cycle in (P ∪ Q ∪ S ∪ E) contains an element from (P ∪ S)
   and from either one of these statements it follows that Q* = (Q ∪ E)⁺.

Proof: Let us first show two short lemmas:

Lemma 1: T ∘ P* ⊆ P*

Proof: Due to T ∘ T ⊆ T, T ∘ P* ⊆ T holds. If there existed objects x, y, z ∈ O with (x, y) ∈ T, (y, z) ∈ P*, but (x, z) ∉ P*, then it would follow that (x, z) ∈ Q*, because T is transitive and the disjoint union of P* and Q*. Due to the symmetry of Q* we also get (z, x) ∈ Q*, and thus (z, y) = (z, x) ∘ (x, y) ∈ T ∘ T ⊆ T. Hence we have (y, z), (z, y) ∈ T, i.e. (y, z) ∈ Q*, in contradiction to (y, z) ∈ P*.

Lemma 2: P* ∘ T ⊆ P*
Proof: analogous to Lemma 1.

ad 1) From Lemma 1 it directly follows that P* ∘ P* ⊆ P* and thus P* is transitive. Since by Definition 8 P* is also anti-symmetric and irreflexive, P* defines a strict partial order.

ad 2) We have to show the three conditions for compatibility:
a) Q* is an equivalence relation. This can be shown as follows: Q* is symmetric by definition, is transitive because T is transitive, and is reflexive because Q ⊆ T and trivially all pairs (q, q) ∈ Q.
b) Q* ∩ P* = ∅ is true by Definition 8.
c) From Lemma 1 we get Q* ∘ P* ⊆ P* and, due to Q* being reflexive, also P* ⊆ Q* ∘ P*. Thus P* = Q* ∘ P*. Analogously we get P* ∘ Q* = P* from Lemma 2.
Since a), b) and c) hold, the equivalence relation Q* is compatible with P*.

ad 3) Since Q ⊆ T and Q is symmetric, Q ⊆ Q*. Analogously, E ⊆ T and E is symmetric, hence E ⊆ Q*. Thus, Q ∪ E ⊆ Q*.

ad 4) We have to show three implications for the equivalence of a), b) and c):
a) ⇒ c): Assume there would exist a cycle (x_0, x_1, ..., x_{n-1}, x_n) with x_0 = x_n and edges from (P ∪ Q ∪ S ∪ E) where at least one edge is from P ∪ S; further assume without loss of generality (x_0, x_1) ∈ P ∪ S. We know (x_1, x_n) ∈ T and thus (x_1, x_0) ∈ T; therefore (x_0, x_1) ∈ Q* and (x_0, x_1) ∉ P*. Thus, the statement P ∪ S ⊆ P* cannot hold, in contradiction to a).
c) ⇒ b): We have to show T ∩ (P ∪ S)^conv = ∅. Assume there would exist a chain (x_0, x_1, ..., x_{n-1}, x_n) with (x_0, x_n) ∈ (P ∪ S)^conv and (x_{i-1}, x_i) ∈ (P ∪ Q ∪ S ∪ E) for 1 ≤ i ≤ n. Because of (x_0, x_n) ∈ (P ∪ S)^conv it follows that (x_n, x_0) ∈ P ∪ S, and thus (x_0, x_1, ..., x_{n-1}, x_n, x_0) would have been a cycle in (P ∪ Q ∪ S ∪ E) with at least one edge from P or S, which is a contradiction to c).
b) ⇒ a): If the statement P ∪ S ⊆ P* would not hold, there would be x and y with (x, y) ∈ P ∪ S, but (x, y) ∉ P*. Since (x, y) ∈ T, it would follow that (x, y) ∈ Q*. But then also (y, x) ∈ Q* ∩ (P ∪ S)^conv would hold, which is a contradiction to b).
This completes the equivalence of the three conditions; now we have to show that from any of them we can deduce Q* = (Q ∪ E)⁺. Let us assume condition c) holds. First we show Q* ⊆ (Q ∪ E)⁺. Let (x, y) ∈ Q*, then also (y, x) ∈ Q*. Thus we have two representations (x, y) = (x_0, x_1) ∘ ... ∘ (x_{n-1}, x_n) and (y, x) = (y_0, y_1) ∘ ... ∘ (y_{m-1}, y_m), where all edges are in (P ∪ Q ∪ S ∪ E) and x_n = y = y_0 and x_0 = x = y_m. If both representations are concatenated, a cycle is formed with edges from (P ∪ Q ∪ S ∪ E). Using condition c) we know that none of these edges can be in P ∪ S. Thus, (x, y) ∈ (Q ∪ E)⁺. The inclusion (Q ∪ E)⁺ ⊆ Q* holds trivially due to (Q ∪ E)⁺ ⊆ T and (Q ∪ E)⁺ being symmetric, since both Q and E are symmetric.
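As an illustration of how Definition 8 and the consistency condition of Theorem 1 could be operationalized, the following Python sketch builds T as the transitive closure of P ∪ Q ∪ S ∪ E, splits it into P* and Q*, and accepts an increment only if P ∪ S ⊆ P* (statement 4.a, which by Theorem 1 is equivalent to the cycle condition 4.c). This is an assumed, minimal implementation and not code from the paper; the naive closure is only meant for small examples.

    # Relations are sets of pairs of hashable object identifiers (assumed representation).
    def transitive_closure(R):
        closure = set(R)
        while True:
            new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
            if new_pairs <= closure:
                return closure
            closure |= new_pairs

    def increment(P, Q, S, E):
        # Definition 8: T := (P | Q | S | E)+, P* = asymmetric part, Q* = symmetric part.
        T = transitive_closure(P | Q | S | E)
        P_star = {(x, y) for (x, y) in T if (y, x) not in T}
        Q_star = {(x, y) for (x, y) in T if (y, x) in T}
        return P_star, Q_star

    def is_safe_increment(P, Q, S, E):
        # Theorem 1, statement 4.a): the increment is consistent iff P and S survive into P*.
        P_star, _ = increment(P, Q, S, E)
        return (P | S) <= P_star

    # Hypothetical example: extending two chains with (o2, o3) is safe,
    # but additionally stating (o4, o1) would close a preference cycle.
    P = {("o1", "o2"), ("o3", "o4")}
    Q = {("o1", "o1"), ("o2", "o2"), ("o3", "o3"), ("o4", "o4")}
    print(is_safe_increment(P, Q, {("o2", "o3")}, set()))                  # True
    print(is_safe_increment(P, Q, {("o2", "o3"), ("o4", "o1")}, set()))    # False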

The evaluation of skylines thus comes down to calculating P* and Q* as given by Definition 8 after we have checked their consistency as described in Theorem 1, i.e. verified that no inconsistent information has been added. It is a nice advantage of our system that at any point we can incrementally check the applicability of a statement elicited from the user or from a different source (e.g. profile information) and then accept or reject it. Therefore, skyline computation and preference elicitation are interleaved in a transparent process.

3.2 Efficient Incremental Skyline Computation

In the last sections, we provided the basic theoretical foundations for incremented preference and equivalence sets. In addition, we showed how to use the generalized Pareto aggregation for incremented skyline computation based on the previous skyline objects. In this section, we exploit the local nature of incrementally added preference / equivalence information. While in the last section we facilitated skyline computation by modeling the full object preference P and object equivalence Q, we will now enhance the algorithm to be based on transitively reduced (and thus considerably smaller) preference diagrams. Preference diagrams are based on the concept of Hasse diagrams but, in contrast, do not require a full intransitive reduction (i.e. some transitive information may remain in the diagram). Basically, a preference diagram is a simple graph representation of attribute values and preference edges as follows:

Definition 9: (Preference Diagrams) Let P be a preference in the form of a finite strict partial order. A preference diagram PD(P) for preference P denotes a (not necessarily minimal) graph such that the transitive closure PD(P)⁺ = P.

Please note that there may be several preference diagrams provided (or incrementally completed) by the user to express the same preference information (which is given by the transitive closure of the graph). Thus the preference diagram may contain redundant transitive information if it was explicitly stated by the user during the elicitation process. This is particularly useful when the diagram is used for user interface purposes [2]. In the remainder of this section, we want to avoid the handling of the bulky and complex-to-manage incremented preference P* and rather only incorporate increments of new preference information as well as new equivalence information into the preference diagram instead. The following two theorems show how to do this.

Theorem 2: (Calculation of P*) Let O be a set of database objects and P, P^conv, and Q as in Definition 8 and E := {(x, y), (y, x)} new equivalence information such that (x, y), (y, x) ∉ (P ∪ P^conv ∪ Q). Then P* can be calculated as P* = (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)).

Proof: Assume (a, b) ∈ T as defined in Definition 8. The edge can be represented by a chain (a_0, a_1) ∘ ... ∘ (a_{n-1}, a_n), where each edge (a_{i-1}, a_i) ∈ (P ∪ Q ∪ E) and a_0 := a, a_n := b. This chain can even be transcribed into a representation with edges from (P ∪ Q ∪ E) where at most one single edge is from E. This is because, if there would be two (or more) edges from E, namely (a_{i-1}, a_i) and (a_{j-1}, a_j) (with i < j), then there are four possibilities:

a) both edges are (x, y) or both edges are (y, x), in both of which cases the sequence (a_i, a_{i+1}) ∘ ... ∘ (a_{j-1}, a_j) forms a cycle and can be omitted,
b) the first edge is (x, y) and the second edge is (y, x), or vice versa, in both of which cases (a_{i-1}, a_i) ∘ ... ∘ (a_{j-1}, a_j) forms a cycle and can be omitted, leaving no edge from E at all.

Since we have defined Q as compatible with P in Definition 8, we know that (P ∪ Q)⁺ = (P ∪ Q) and, since elements of T can be represented with at most one edge from E, we get T = P ∪ Q ∪ ((P ∪ Q) ∘ E ∘ (P ∪ Q)). In this case both edges in E are consistent with the already known information, because there are no cyclic paths in T containing edges of P (cf. condition 1.4.c in [1]): if there would be a cycle with edges in (P ∪ Q ∪ E) and at least one edge from P (i.e. the new equivalence information would create an inconsistency in P*), the cycle could be represented as (a_0, a_1) ∈ P and (a_1, a_2) ∘ ... ∘ (a_{n-1}, a_0), which would contain at most one edge from E, and thus the cycle is either of the form P ∘ (P ∪ Q), or of the form P ∘ (P ∪ Q) ∘ E ∘ (P ∪ Q). In the first case there can be no cycle, because otherwise P and Q would already have been inconsistent, and if there would be a cycle in the second case, there would exist an object a ∈ O such that (a, x) ∈ P and (y, a) ∈ (P ∪ Q), and thus (y, x) = (y, a) ∘ (a, x) ∈ (P ∪ Q) ∘ P ⊆ P, contradicting (x, y) ∉ P^conv.

Because of T = (P ∪ Q ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q) ∪ (Q ∘ E ∘ Q)) and P* = T \ Q* and since (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)) ∩ Q* = ∅ (if the intersection would not be empty, then due to Q* being symmetric there would be a cycle with edges from (P ∪ Q ∪ E) and at least one edge from P, contradicting the condition cited above), we finally get P* = (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)).

We have now found a way to derive P* in the case of a new incremental equivalence relationship, but still P* is a large relation containing all transitive information. We will now show that we can also obtain P* by just manipulating a respective preference diagram in a very local fashion. Locality here refers to only having to deal with edges that are directly adjacent in the preference diagram to the additional edges in E. Let us define an abbreviated form of writing such edges:

Definition 10: (Set Shorthand Notations) Let R be a binary relation over a set of database objects O and let x ∈ O. We write:
(_ R x) := { y ∈ O | (y, x) ∈ R } and (x R _) := { y ∈ O | (x, y) ∈ R }
If R is an equivalence relation, we write the set of objects in the equivalence class of x in R as:
R[x] := { y ∈ O | (x, y), (y, x) ∈ R }

With these abbreviations we will show which object sets have to be considered for actually calculating P* via a given preference diagram:

Theorem 3: (Calculation of PD(P)*) Let O be a set of database objects and P, P^conv, and Q as in Definition 8 and E := {(x, y), (y, x)} new equivalence information such that (x, y), (y, x) ∉ (P ∪ P^conv ∪ Q). If PD(P) ⊆ P is some preference diagram of P, then with

PD(P)* := (PD(P) ∪ (PD(P) ∘ E ∘ Q) ∪ (Q ∘ E ∘ PD(P))) it holds that (PD(P)*)⁺ = P*, i.e. PD(P)* is a preference diagram for P*, which can be calculated as:
PD(P)* = PD(P) ∪ ((_ PD(P) x) × Q[y]) ∪ ((_ PD(P) y) × Q[x]) ∪ (Q[x] × (y PD(P) _)) ∪ (Q[y] × (x PD(P) _)).

Proof: We know from Theorem 2 that P* = (P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q)) and for preference diagrams PD(P) of P it holds that:
a) P = PD(P)⁺ ⊆ (PD(P)*)⁺
b) (P ∘ E ∘ P) = (P ∘ E) ∘ P ⊆ (P ∘ E ∘ Q) ∘ P = (PD(P)⁺ ∘ E ∘ Q) ∘ PD(P)⁺ ⊆ (PD(P)*)⁺, because (PD(P)⁺ ∘ E ∘ Q) ⊆ (PD(P)*)⁺ and PD(P)⁺ ⊆ (PD(P)*)⁺.
c) Furthermore (P ∘ E ∘ Q) = PD(P)⁺ ∘ E ∘ Q ⊆ (PD(P) ∪ (PD(P) ∘ E ∘ Q))⁺ ⊆ (PD(P)*)⁺
d) And similarly (Q ∘ E ∘ P) = Q ∘ E ∘ PD(P)⁺ ⊆ (PD(P) ∪ (Q ∘ E ∘ PD(P)))⁺ ⊆ (PD(P)*)⁺
Using a)-d) we get P* ⊆ (PD(P)*)⁺ and, since PD(P)* ⊆ P*, we get (PD(P)*)⁺ ⊆ (P*)⁺ = P* and thus (PD(P)*)⁺ = P*.

To calculate PD(P)* we have to consider the terms in PD(P) ∪ (PD(P) ∘ E ∘ Q) ∪ (Q ∘ E ∘ PD(P)): The first term is just the old preference diagram. Since the second and third terms both contain a single edge from E (i.e. either (x, y) or (y, x)), the terms can be written as
(PD(P) ∘ E ∘ Q) = ((_ PD(P) x) × Q[y]) ∪ ((_ PD(P) y) × Q[x]) and
(Q ∘ E ∘ PD(P)) = (Q[x] × (y PD(P) _)) ∪ (Q[y] × (x PD(P) _)).
In general these sets will be rather small because, first, they are only derived from the preference diagram, which is usually considerably smaller than the preference P, and second, in these small sets there will usually be only few edges originating or ending in x or y. Furthermore, these sets can be computed easily using an index on the first and on the second entry of the binary relations PD(P) and Q; getting a set like, e.g., (_ PD(P) x) then is just a lookup of all edges in PD(P) whose second entry is x. Therefore we can calculate the incremented preference P* by simple manipulations on PD(P) and the computation of a transitive closure as shown in the commuting diagram in Figure 5.

Figure 5. Diagram for deriving incremented skylines using preference diagrams

Having completed incremental changes introduced by new equivalence information, we will now consider incremental changes introduced by new preference information.

Theorem 4: (Incremental calculation of P*) Let O be a set of database objects and P, P^conv, and Q as in Definition 8 and S := {(x, y)} new preference information such that (x, y) ∉ (P ∪ P^conv ∪ Q). Then P* can be calculated as P* = (P ∪ (P ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ P) ∪ (Q ∘ S ∘ Q)).

Proof: The proof is similar to the proof of Theorem 2. Assume (a, b) ∈ T as defined in Definition 8. The edge can be represented by a chain (a_0, a_1) ∘ ... ∘ (a_{n-1}, a_n), where

each edge (a_{i-1}, a_i) ∈ (P ∪ Q ∪ S) and a_0 := a, a_n := b. This chain can even be transcribed into a representation with edges from (P ∪ Q ∪ S) where the edge (x, y) occurs at most once. This is because, if (x, y) occurs twice, the two edges would enclose a cycle that can be removed. Since we have assumed Q to be compatible with P in Definition 8, we know T = P ∪ Q ∪ ((P ∪ Q) ∘ S ∘ (P ∪ Q)) and (like in Theorem 2) the edge (x, y) is consistent with the already known information, because there are no cyclic paths in T containing edges of P (cf. condition 4.c in Theorem 1): if there would be a cycle with edges in (P ∪ Q ∪ S) and at least one edge from P (i.e. the new preference information would create an inconsistency in P*), the cycle could be represented as (a_0, a_1) ∈ P and (a_1, a_2) ∘ ... ∘ (a_{n-1}, a_0), which would contain at most one edge from S, and thus the cycle is either of the form P ∘ (P ∪ Q), or of the form P ∘ (P ∪ Q) ∘ S ∘ (P ∪ Q). In the first case there can be no cycle, because otherwise P would already have been inconsistent, and if there would be a cycle in the second case, there would exist an object a ∈ O such that (a, x) ∈ P and (y, a) ∈ (P ∪ Q), and thus (y, x) = (y, a) ∘ (a, x) ∈ (P ∪ Q) ∘ P ⊆ P, contradicting (x, y) ∉ P^conv. Similarly, there is no cycle with edges in (Q ∪ S) and at least one edge from S either: if there would be such a cycle, it could be transformed into the form S ∘ Q, i.e. (x, y) ∘ (a, b) would be a cycle with (a, b) ∈ Q, forcing (a, b) = (y, x) ∈ Q and thus, due to the symmetry of Q, a contradiction to (x, y) ∉ Q.

Because of T = (P ∪ Q ∪ (P ∘ S ∘ P) ∪ (Q ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ Q)) and P* = T \ Q* and since (P ∪ (P ∘ S ∘ P) ∪ (Q ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ Q)) ∩ Q* = ∅ (if the intersection would not be empty, then due to Q* being symmetric there would be a cycle with edges from (P ∪ Q ∪ S) and at least one edge from P, contradicting the condition cited above), we finally get P* = (P ∪ (P ∘ S ∘ P) ∪ (Q ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ Q)).

Analogously to Theorem 2 and Theorem 3, in the case of a new incremental preference relationship we can also derive P* very efficiently by just working on the small preference diagram instead of on the large preference relation P.

Theorem 5: (Incremental calculation of PD(P)*) Let O be a set of database objects and P, P^conv, and Q as in Definition 8 and S := {(x, y)} new preference information such that (x, y) ∉ (P ∪ P^conv ∪ Q). If PD(P) ⊆ P is some preference diagram of P, then with PD(P)* := (PD(P) ∪ (Q ∘ S ∘ Q)) it holds that (PD(P)*)⁺ = P*, i.e. PD(P)* is a preference diagram for P*, which can be calculated as: PD(P)* = PD(P) ∪ (Q[x] × Q[y]) with PD(P) ∩ (Q[x] × Q[y]) = ∅.

Proof: We know from Theorem 4 that P* = (P ∪ (P ∘ S ∘ P) ∪ (P ∘ S ∘ Q) ∪ (Q ∘ S ∘ P) ∪ (Q ∘ S ∘ Q)) and for preference diagrams PD(P) of P it holds that:
a) P = PD(P)⁺ ⊆ (PD(P)*)⁺
b) since S ⊆ Q ∘ S ∘ Q ⊆ PD(P)*, it follows that (P ∘ S ∘ P) ⊆ (PD(P)*)⁺ ∘ PD(P)* ∘ (PD(P)*)⁺ ⊆ (PD(P)*)⁺
c) since Q ∘ S ⊆ (Q ∘ S) ∘ Q ⊆ PD(P)*, it follows that ((Q ∘ S) ∘ P) ⊆ PD(P)* ∘ (PD(P)*)⁺ ⊆ (PD(P)*)⁺
d) analogously (P ∘ (S ∘ Q)) ⊆ (PD(P)*)⁺ ∘ PD(P)* ⊆ (PD(P)*)⁺
e) finally, by definition (Q ∘ S ∘ Q) ⊆ PD(P)* ⊆ (PD(P)*)⁺

Using a)-e) we get P* ⊆ (PD(P)*)⁺ and, since PD(P)* ⊆ P*, we get (PD(P)*)⁺ ⊆ (P*)⁺ = P* and thus (PD(P)*)⁺ = P*.

To calculate PD(P)*, analogously to Theorem 3, we have to consider the terms in PD(P) ∪ (Q ∘ S ∘ Q): The first term again is just the old preference diagram. Since the second term contains (x, y), it can be written as (Q ∘ S ∘ Q) = (Q[x] × Q[y]). Moreover, if there would exist (a, b) ∈ PD(P) ∩ (Q[x] × Q[y]), then (a, b) ∈ PD(P) and there would also exist (x, a), (b, y) ∈ Q (since Q is symmetric). But then (x, y) = (x, a) ∘ (a, b) ∘ (b, y) ∈ P, because Q is compatible with P, which is a contradiction to (x, y) ∉ P. Thus, we can also calculate the incremented preference P* by simple manipulations on PD(P) in the case of incremental preference information. Again, the necessary sets can efficiently be indexed for fast retrieval. In summary, we have shown that the incremental refinement of skylines is possible efficiently by manipulating only the preference diagrams.

4 Conclusion

In this paper we laid the foundation to efficiently compute incremented skylines driven by user interaction. Building on and extending the often-used notion of Pareto optimality, our approach allows users to interactively model their preferences and explore the resulting generalized skyline sets. New domination relationships can be specified by incrementally providing additional information like new preferences, equivalence relations, or acceptable trade-offs. Moreover, we investigated the efficient evaluation of incremented generalized skylines by considering only those relations that are directly affected by the newly added preference information. The actual computation takes advantage of the local nature of incremental changes in preference information, leading to far superior performance over the baseline algorithms.

Although this work is an advance for the application of the skyline paradigm in real world applications, several challenges still remain largely unresolved. For instance, the time necessary for computing initial skylines is still too high, hampering the applicability in large scale scenarios. Here, introducing suitable index structures, heuristics, and statistics might prove beneficial.

References
[1] W.-T. Balke, U. Güntzer, C. Lofi. Eliciting Matters - Controlling Skyline Sizes by Incremental Integration of User Preferences. Int. Conf. on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand, 2007.
[2] W.-T. Balke, U. Güntzer, C. Lofi. User Interaction Support for Incremental Refinement of Preference-Based Queries. 1st IEEE Int. Conf. on Research Challenges in Information Science (RCIS), Ouarzazate, Morocco.
[3] W.-T. Balke, U. Güntzer, W. Siberski. Getting Prime Cuts from Skylines over Partially Ordered Domains. Datenbanksysteme in Business, Technologie und Web (BTW 2007), Aachen, Germany, 2007.
[4] W.-T. Balke, U. Güntzer. Multi-objective Query Processing for Database Systems. Int. Conf. on Very Large Data Bases (VLDB), Toronto, Canada.
[5] W.-T. Balke, M. Wagner. Through Different Eyes - Assessing Multiple Conceptual Views for Querying Web Services. Int. World Wide Web Conference (WWW), New York, USA.
[6] W.-T. Balke, U. Güntzer, W. Siberski. Exploiting Indifference for Customization of Partial Order Skylines. Int. Database Engineering and Applications Symp. (IDEAS), Delhi, India.
[7] W.-T. Balke, J. Zheng, U. Güntzer. Efficient Distributed Skylining for Web Information Systems. Int. Conf. on Extending Database Technology (EDBT), Heraklion, Greece.
[8] W.-T. Balke, J. Zheng, U. Güntzer. Approaching the Efficient Frontier: Cooperative Database Retrieval Using High-Dimensional Skylines. Int. Conf. on Database Systems for Advanced Applications (DASFAA), Beijing, China.
[8] J. Bentley, H. Kung, M. Schkolnick, C. Thompson. On the Average Number of Maxima in a Set of Vectors and Applications. Journal of the ACM (JACM), vol. 25(4), ACM.
[9] S. Börzsönyi, D. Kossmann, K. Stocker. The Skyline Operator. Int. Conf. on Data Engineering (ICDE), Heidelberg, Germany.
[10] C. Boutilier, R. Brafman, C. Geib, D. Poole. A Constraint-Based Approach to Preference Elicitation and Decision Making. AAAI Spring Symposium on Qualitative Decision Theory, Stanford, USA.
[11] L. Chen, P. Pu. Survey of Preference Elicitation Methods. EPFL Technical Report IC/2004/67, Lausanne, Switzerland.
[12] J. Chomicki. Preference Formulas in Relational Queries. ACM Transactions on Database Systems (TODS), Vol. 28(4).
[13] J. Chomicki. Iterative Modification and Incremental Evaluation of Preference Queries. Int. Symp. on Foundations of Information and Knowledge Systems (FoIKS), Budapest, Hungary.
[14] P. Godfrey. Skyline Cardinality for Relational Processing. Int. Symp. on Foundations of Information and Knowledge Systems (FoIKS), Wilhelminenburg Castle, Austria.
[15] W. Kießling. Foundations of Preferences in Database Systems. Int. Conf. on Very Large Databases (VLDB), Hong Kong, China.
[16] V. Koltun, C. Papadimitriou. Approximately Dominating Representatives. Int. Conf. on Database Theory (ICDT), Edinburgh, UK.
[17] D. Kossmann, F. Ramsak, S. Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. Int. Conf. on Very Large Data Bases (VLDB), Hong Kong, China.
[18] M. McGeachie, J. Doyle. Efficient Utility Functions for Ceteris Paribus Preferences. In Proc. of the Conf. on Artificial Intelligence and the Conf. on Innovative Applications of Artificial Intelligence, Edmonton, Canada.
[19] D. Papadias, Y. Tao, G. Fu, B. Seeger. An Optimal and Progressive Algorithm for Skyline Queries. Int. Conf. on Management of Data (SIGMOD), San Diego, USA.
[20] J. Pei, W. Jin, M. Ester, Y. Tao. Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces. Int. Conf. on Very Large Databases (VLDB), Trondheim, Norway.
[21] T. Saaty. A Scaling Method for Priorities in Hierarchical Structures. Journal of Mathematical Psychology, 1977.
[22] T. Xia, D. Zhang. Refreshing the Sky: The Compressed Skycube with Efficient Support for Frequent Updates. Int. Conf. on Management of Data (SIGMOD), Chicago, USA.
[23] Y. Yuan, X. Lin, Q. Liu, W. Wang, J. Yu, Q. Zhang. Efficient Computation of the Skyline Cube. Int. Conf. on Very Large Databases (VLDB), Trondheim, Norway.


What Enterprise Architecture and Enterprise Systems Usage Can and Can not Tell about Each Other

Maya Daneva, Pascal van Eck
Dept. of Computer Science, University of Twente, The Netherlands

Abstract
There is an increased awareness of the roles that enterprise architecture (EA) and enterprise systems (ES) play in today's organizations. EA and ES usage maturity models are used to assess how well companies are capable of deploying these two concepts while striving to achieve strategic corporate goals. The existence of various architecture and ES usage models raises questions about how they both refer to each other, e.g. whether a higher level of architecture maturity implies a higher ES usage level. This paper compares these two types of models by using literature survey results and case-study experiences. We conclude that (i) EA and ES usage maturity models agree on a number of critical success factors and (ii) in a company with a mature architecture function, one is likely to observe, at the early stages of ES initiatives, certain practices associated with a higher level of ES usage maturity.

Keywords: maturity models, enterprise resource planning, enterprise architecture.

1 Introduction

In the past decade, companies and public sector organizations developed an increased understanding that true connectedness and participation in the networked economy or in virtual value webs would not happen merely through applications of technology, like Enterprise Resource Planning (ERP), Enterprise Application Integration middleware, or web services. The key lesson they learnt was that it would happen only if organizations changed the way they run their operations and integrated them well into cross-organizational business processes [1]. This takes at least 2-3 years and implies the need to (i) align changes in the business processes to technology changes and (ii) be able to anticipate and support complex decisions impacting each of the partner organizations in a network and their enterprise systems (ES). In this context, Enterprise Architecture (EA) increasingly becomes critical, for

it provides to both business and IT managers a clear and synthetic vision of an organization's business processes and of the IT resources they rely on. For the purpose of this research, we use the term enterprise architecture to refer to the constituents of an enterprise at both the social level (roles, organizational units, processes) as well as the technical level (information technology and related technology), and the synergetic relations between these constituents. Enterprise architecture explains how the constituents of an enterprise are related and how these relations jointly create added value. EA also implies a model that drives the process of aligning programs and initiatives with solution architectures integrating both ES and legacy applications.

Observations from EA and ES literature [2,3,9,13,14,15,20] indicate that, in practice, the many facets of EA and ES are commonly seen as complementing each other. For example, EA and ES represent two of the five major decision areas encompassed in IT governance at high-performing organizations [21]. The experiences of these companies suggest that EA is the common enforcer of standards from which a high-level, strategic and management-oriented view of potential solutions can be driven to the implementation level. Moreover, EA processes are critical in implementing coordinated sets of governance mechanisms for ERP programs that simultaneously change technology support, ways of doing business, and people's job content. However, due to a lack of adequate principles, theories, and tools to support consistent application of the concepts of EA and ES usage, the interplay between them is still rarely studied. ES usage and evolution processes and EA processes are analyzed in isolation, by using different research methods. Clearly, there is a need for approaches including definitions, assessment aspects and models that allow architects and IT decision makers to reason about these two aspects of IT governance. Examples include reasoning about the choices that guide an organization's approach to ES investments, or about situations when changing business requirements can be addressed within the architecture and when changes justify an exception to enterprise standards.

The present paper responds to this need. Its objective is to add to our understanding of how the concepts of EA and ES usage are linked, how the processes of EA and ES usage are linked, and how those processes can be organized differently to create improved organizational results. The paper seeks to make the linkages between EA and ES usage explicit so that requirements engineers working on corporate-wide or networked ES implementation projects can use this knowledge and leverage EA assets to achieve feasible RE processes. To get insights into these two concepts, we apply a maturity-based view of ES-adopting organizations. This perspective provides grounds for the practical application of commonly available maturity models that could be used with minimal disruption to the areas being examined in a case study.

The paper is structured as follows: In Section 2 we motivate our research approach. In Section 3, we give a background of how we use existing architecture maturity models to build a framework and provide a rationale for using the DoC's ACMM [4] to mould our case study assessment process. Section 4 discusses the concept of ES usage maturity along with three specific models.

Section 5 reports on how both classes of models agree and disagree. Section 6 reports on and discusses findings from our case study. In Section 7, we check the consistency between the findings from the literature survey and the ones from the case study. We summarize conclusions and research plans in Section 8.

2 Background and Research Approach

The goal of our study is to collect information that would help us assess the interplay of architecture and ES usage in an ES-adopting organization. Since research studies in architecture maturity and in ERP usage maturity have focused either on organization-specific architecture aspects or on ES factors, there is a distinct challenge to develop a research model that adopts the most appropriate constructs from prior research and integrates them with constructs that are most suitable to our context. Given the lack of research on the phenomenon we are interested in and the fact that the boundaries between phenomenon and context are not clearly evident, it seems appropriate to take a qualitative approach to our research goal. Specifically, we chose to apply an approach based on the positivist case study research method [5,22] because of the following: (i) evidence suggests its particular suitability to IS research situations in which both an in-depth investigation is needed and the phenomenon in question cannot be studied outside the context where it occurs, (ii) it offers a great deal of flexibility in terms of research perspectives to be adopted and qualitative data collection methods, and (iii) case studies open up opportunities to get the subtle data we need to increase our understanding of complex IS phenomena like ERP adoption and enterprise architecture.

In this research, we take the view that the linkages between EA and ES usage can be interrogated via artifacts subjected to maturity assessments, such as (a) visible practices deployed by an organization, (b) underlying assumptions behind these practices, (c) architecture and ES project deliverables, (d) architecture and ES project roles, and (e) shared codes of meaning that undergird what an organization thinks a good practice is and what it is not [19]. According to this view, we see maturity assessment frameworks as vehicles that help organizations and external observers integrate their experiences into coherent systems of meaning. Our view is consistent with the understanding of assessment models as (i) normative behaviour models, based on an organization's values and beliefs, as well as (ii) process theories that help explain why organizations do not always succeed in EA and ES initiatives [10,16,23].

We selected architecture maturity models [4,8,11,12,18,23] and ES usage models [7,13,16] as the lens through which we examine the linkages between EA and ES usage. The reason for choosing these models is threefold: (i) the models support decision making in the context of organizational change and this is certainly relevant to understanding IT governance, (ii) the models suggest how organizations can proceed from a less controlled to a more controlled fashion of organizing architecture and ES processes, and through this we can analyze how to leverage architecture and ES assets to achieve better business results, and

(iii) both classes of models provide a perspective allowing us to see the evolution of EA and ES usage as moving through stages characterized by key role players, typical activities and challenges, appropriate performance metrics, and a range of possible outcomes.

Our view of maturity models as normative systems of meaning brought us to the idea of using the methods of semiotic analysis [6,19] for uncovering the facets of the relationship between EA maturity and ES usage maturity. From the semiotics standpoint, organizational settings are treated as a system of signs, where a sign is defined as the relationship between a symbol and the content that this symbol conveys. This relationship is determined by the conventions of the stakeholders involved (e.g., business users, architects and ES implementation project team members). In semiotic analysis, these conventions are termed codes. A code is defined by a set of symbols, a set of contents and rules that map symbols to contents [19]. Codes specify meanings of a set of symbols within organizational settings. On the manifest level, certain practices, roles, and symbols are carriers of architecture and ES usage maturity. On the core level, stakeholders share beliefs, values, and understandings that guide their actions. Thus, in order to fully understand the maturity of EA or ES usage in an organization's settings, we should uncover the relevant symbols, the contents conveyed by these symbols, and the relationships that bind them. If we can do this, we should be able to get a clear picture of the extent to which the EA and ES usage maturity models agree and disagree in terms of pertinent symbols, contents, and codes.

Our analytical approach has three specific objectives, namely: (i) to identify how existing architecture frameworks and ES usage models relate to each other, (ii) to assess the possible mappings between their assessment criteria, and (iii) to examine if the mappings between architecture maturity assessment criteria and ES usage maturity criteria can be used to judge the ES usage maturity of an ES-adopting organization, provided the architecture maturity of this organization is known.

Our research approach is multi-analytical in nature. It draws on the idea of merging a literature survey and a case study. It involved five stages:
1. Literature survey and mapping of the assessment criteria of existing architecture maturity models.
2. Literature survey of existing ES usage maturity models.
3. Identification of assessment criteria for architecture and ES usage maturity that seem (i) to overlap, (ii) to correlate, and (iii) to explain each other.
4. Selection and application of one architecture maturity model and one ES usage model to organizational settings in a case study.
5. Post-application analysis to understand the relationships between the two maturity models.
We discuss each of these stages in more detail in the sections that follow.

3 Mapping Architecture Maturity Criteria

At least six methods for assessing the ability of EA to deliver to promise were introduced in the past five years: (i) the IT ACMM of the Department of Commerce (DoC) of the USA [4], (ii) the Federal Enterprise Architecture Maturity Framework [8], (iii) the Information Technology Balanced Score Card model [12], (iv) the models for extended-enterprise-architects [23], (v) the Gartner Enterprise Architecture Maturity Model [11] and (vi) the META Enterprise Architecture Program Maturity Model [18]. We analyzed these models by studying the following aspects: what assessment criteria they deem important to judge maturity, what practices, roles and artifacts are surveyed for compliance to these criteria, and how the artifacts surveyed are mapped to these criteria.

Our findings indicate that these six models all define the concept of maturity differently, but all implicitly aim at adopting or adapting some good practices within an improvement initiative targeting repeatable outcomes. The models assume that organizations reach a plateau in which at least one architecture process is transformed from a lower level to a new level of capability. We found that they all share the following common properties:
- a number of dimensions or process areas at several discrete levels of maturity (typically five or six),
- a qualifier for each level (such as initial, repeatable, defined, managed, optimized),
- a hierarchical model of assessment criteria for each process area,
- a description of each assessment criterion which codifies what the authors regard as good and not so good practice and which could be observed at each maturity level,
- an assessment procedure that provides qualitative or quantitative ratings for each of the process areas.

To get more insights into how the assessment criteria of each model refer to the ones from the other models (e.g. whether assessment criteria overlap, or whether they complement each other), we did a comparison on a definition-by-definition basis. We tried to understand if there exists a semantic equivalence between the assessment criteria of the six models. We termed two assessment criteria semantically equivalent if their definitions suggest an identical set of symbols, an identical set of contents, and an identical set of mappings from symbols to contents. This definition ensures that two assessment criteria are equivalent when they have the same meaning and they use the same artifact to judge identical maturity factors. In our definition, the term artifact means one of the following [10]: a process (e.g. an EA process, activity or practice), a product (e.g. an architecture deliverable, a business requirements document), or a resource (e.g. architects, architecture modeling tools). For example, the Operating-Unit-Participation criterion from the DoC ACMM is semantically equivalent to the Business-Unit-Involvement criterion from the models for extended-enterprise-

architects (E2ACMM). These two criteria both mean to assess the extent to which business stakeholders are actively kept involved in the architecture processes. When compared on a symbol-by-symbol, contents-by-contents and code-by-code basis, the definitions of these two criteria indicate that they both mean to measure a common aspect, namely how frequently and how actively business representatives participate in the architecture process and what the level of business representatives' awareness of architecture is. An extract of our analysis findings is presented in Table 1. It reports on a set of assessment criteria that we found to be semantically equivalent in two models, namely the E2ACMM [23] and the DoC ACMM [4].

E2ACMM | DoC ACMM
Extended Enterprise Involvement, Business units involvement | Operating Unit Participation
Enterprise Program Management, Business & Technology Strategy Alignment | Business Linkage
Executive Management Involvement | Senior Management Involvement
Strategic Governance | Governance
Enterprise budget & Procurement strategy | IT investment & Acquisition Strategy
Holistic Extended Enterprise Architecture, Extended Enterprise Architecture Programme Office | Architecture Process
Extended Enterprise Architecture Development | Architecture Development
Enterprise Program Management | Architecture Communication
IT security
Enterprise budget & Procurement strategy | IT investment & Acquisition Strategy
Extended Enterprise Architecture Results, Extended Enterprise Architecture Development | Architecture Development
Table 1: Two ACMMs compared and contrasted

Next, we analyzed the distribution of the assessment criteria according to maturity levels in order to understand what the relative contribution of each criterion is to a certain maturity level. Our general observation was that the ACMMs may use correlating criteria, but these may be linked to different maturity levels. For example, the DoC ACMM defines the formal alignment of business strategy and IT strategies to be a Level 4 criterion, while the E2ACMM checks it at Level 3.

4 Mapping ES Usage Maturity Criteria

The ES literature, to the best of our knowledge, indicates that there are three relatively popular ES usage maturity models: (i) the ES experience model by Markus et al [16], (ii) the ERP Maturity Model by Ernst & Young, India [7], and (iii) the staged ES Usage Maturity Model by Holland et al [13]. All three models take different views of the way companies make decisions on their organization structure, process and data definitions, configuration, security and

training. What these models have in common is that they are all meant as theoretical frameworks for analysing, both retrospectively and prospectively, the business value of ES. It is important to note that organizations repeatedly go through various maturity stages when they undertake major upgrades or replacements of ES. As system evolution adds the concept of time to these frameworks, they tend to structure ES experiences in terms of stages, starting conditions, goals, plans and quality of execution.

First, the model by Markus et al [16] allocates elements of ES success to three different points in time during the system life cycle in an organization: (i) the project phase, in which the system is configured and rolled out, (ii) the shakedown phase, in which the organization goes live and integrates the system into its daily routine, and (iii) the onward and upward phase, in which the organization gets used to the system and is going to implement additions. Success in the shakedown phase and in the onward and upward phase is influenced by ES usage maturity. For example, observations like (i) a high level of successful improvement initiatives, (ii) a high level of employees' willingness to work with the system, and (iii) frequent adaptations in new releases are directly related to a high level of ES usage maturity.

Second, the ERP Maturity Model by Ernst & Young, India [7] places the experiences in the context of creating an adaptable ERP solution that meets changing processes, organization structures and demand patterns. This model structures the ERP adopter's experiences into three stages: (i) chaos, in which the adopter may lose the alignment of processes and ERP definition, reverts to old habits and routines, and complements the ERP system usage with workarounds, (ii) stagnancy, in which organizations are reasonably satisfied with the implemented solution but had hoped for higher return-on-investment rates and, therefore, refine and improve the ES usage to get better business performance, and (iii) growth, in which the adopter seeks strategic support from the ES and moves its focus over to profit, working capital management and people growth.

Third, the staged maturity model by Holland et al [13] suggests three stages as shown in Table 2. It is based on five assessment criteria that reflect how ERP adopters progress to a more mature level based on increased ES usage. Our comparative analysis of the definitions of the assessment criteria pointed out that the number of common factors that make up the criteria of these three models is less than 30%. The common factors are: (1) a shared vision of how the ES contributes to the organization's bottom line, (2) use of the ES for strategic purposes, (3) tight integration of processes and ES, and (4) executive sponsorship. In the next section, we refer to these common criteria when we compare the models for assessing ES usage maturity to the ones for assessing architecture maturity.

Constructs (assessed at Stages 1-3): Strategic Use of IT; Organizational Sophistication; Penetration of the ERP System; Drivers & Lessons; Vision.
Stage 1: Retention of responsible people; no CIO (anymore); IS does not support strategic decision-making; no process orientation; very little thought about information flows; no culture change; the system is used by less than 50% of the organization; cost-based issues prohibit the number of users; little training; staff retention issues; Key drivers: priority with management information, costs; Lessons: mistakes are hard to correct, high learning curve; no clear vision; simple transaction processing.
Stage 2: ES is on a low level used for strategic decision-making; IT strategy is regularly reviewed; high ES importance; significant organizational change; improved transactional efficiency; most business groups / departments are supported; high usage by employees; Key drivers: reduction in costs, replacement of legacy systems, integrating all business processes, improved access of management information; performance oriented culture; internal and external benchmarking.
Stage 3: Strong vision; organization-wide IT strategy; CIO on the senior management team; process oriented organization; top level support and strong understanding of ERP implications; truly integrated organization; users find the system easy to use; Key drivers: single supply chain, replacement of legacy systems, higher level uses are identified, other IT systems can be connected.
Table 2: ES Usage MM (based on [13])

5 Mapping Architecture Maturity Criteria to ES Usage Maturity Criteria: Insights from the Survey Study

The underlying hypothesis of this paper is that the criteria of the ACMMs and the ones of the ES UMMs differ and correlate, but do not explain one another. Table 3 summarizes the similarities and the differences of the two types of models. The rightmost column indicates that the models agree on seven factors in terms of what they contain. Table 3 also identifies significant differences between the two model types. For example, the ES usage models do not explicitly address a number of areas critical to ACMM compliance, e.g.: use of a framework, existence of communication loops and acquisition processes, security, and governance. Having analyzed the linkages between these two model types, our findings from the literature survey study suggest the following two implications for ES-adopting organizations: (1) if an organization scores high in terms of ES usage maturity, it still has to do something to comply with a higher level of an ACMM, and (2) if the architecture team of an ES-adopting organization complies with a higher level (than 3) of an ACMM, the ES usage model still has value to offer. This is due to the focus of the ES usage model on the management of ES-supported change and the evolveability of the ES.

ACMMs contain: Definition of standards (incl. frameworks); Implementation of architecture methods; Scoping in depth & breadth of architecture definitions; Planning; Feedback-loop based revision; Implementation of metrics program; Responsibility for acquisition; Responsibility for corporate security; Governance.
ES UMMs contain: Managed ES-supported change; Speed of adaptation to changing demand patterns; Responsibility for maintaining stability of information & process environments; Periodic reviews.
Both types of models contain: Vision; Strategic decision-making, transformation & support; Coherence between big picture view & local project views; Process definition; Alignment of people, processes & applications with goals; Business involvement & buy-in; Drivers.
Table 3: Similarities and differences between ACMMs and ES Usage Models

This alone offers significant complementary guidance in the areas of refocusing resources to better balance the choices between ES rigidity and business flexibility, as well as the choice between short-term and long-term benefits. ES usage models also explicitly account for the role of factors beyond the direct control of the organization. For example, they address the need to stay aware of changes in market demands and to take responsibility for maintaining a stable environment in the face of rapid change.

6 Linkages between Architecture Maturity and ES Usage Maturity: Insights from a Case Study

The case company in this study is a Canadian wireless communications services provider that serves both corporate and consumer markets with different subscriber segments in different geographic areas. To maintain the big-picture view of the key business processes and supporting applications while adapting to changing markets, the organization relied on an established architecture team. To support its fast growth, the company also started an ES initiative that included 13 ERP projects within five years. For the purpose of our research, the unit of analysis [5] is the ES-adopting organization. We investigate two aspects of the adopter: (i) the maturity of its architecture function and (ii) the maturity of its ES usage.

6.1 Architecture Maturity

In 2000, after a series of corporate mergers, the company initiated a strategic planning exercise as part of a major business processes and systems alignment program. A key component of the strategic planning effort was the assessment of architecture maturity and the capability of the organization's architecture process. The DoC ACMM was used among other standards as a foundation, and an assessment process was devised based on a series of reviews of (i) the architecture deliverables created for small, mid-sized and large projects, (ii) architecture usage scenarios, (iii) architecture roles, (iv) architecture standards, and (v) architecture process documentation. There are nine unique maturity assessment criteria in the DoC ACMM (as can be seen in the second column in Table 1). These were mapped into the types of architecture deliverables produced and used at the company. The highlights of the assessment are listed below:

Operating unit participation: Since 1996, a business process analyst and a data analyst have been involved in a consistent way in any business (re)engineering initiative. Process and data modeling were established as functions, they were visible to the business, the business knew about the value the architecture services provided and sought architecture support for their projects. Each core process and each data subject area had a process owner and a data owner. Their sign-off was important for keeping the repositories of process and data models current.

Business linkage: The architecture deliverables have been completed on behalf of the business, but it was the business who took ownership of these deliverables. The architecture team was the custodian of the resulting architecture deliverables; however, these were maintained and changed based on requests by the business.

Senior management involvement / Governance: All mid-sized and large projects were strategically important, as the telecommunication industry implies constant change and a dynamic business environment. The projects were seen

as business initiatives rather than IT projects and had strong commitment from top management.

IT investment and acquisition strategy: IT was critical to the company's success and market share. Investments in applications were made as a result of a strategic planning process.

Architecture process: The architecture process was institutionalized as a part of the corporate Project Office. It was documented in terms of key activities and key deliverables. It was supported by means of standards and tools.

Architecture development: All major areas of business, e.g. all core business processes, a major portion of the support processes, and all data subject areas, were architected according to Martin's methodology [17]. The architecture team had quite a good understanding of which architecture elements were rigid and which were flexible.

Architecture communication: Architecture was communicated by the Project Office Department and by the process owners. The IT team has not been consistently successful in marketing the architecture services. There were ups and downs, as poor stakeholder involvement impacted the effectiveness of the architecture team's interventions.

IT security: IT security was considered one of the highest corporate priorities. The manager of this function was part of the business, and not of the IT function. He reported directly to the Vice-President Business Development.

6.2 ES Usage Maturity

To assess the ES usage maturity in this case, the ES UMM from Table 2 is used. Assessments were done at two points in time: (i) after the completion of the multi-phase roll-out of the ERP package and (ii) after a major business process and systems alignment initiative run by three merging telecommunication businesses. The first assessment rated the ERP adopter at Maturity Stage 1, while the second assessment indicated Stage 2. Details on the five assessment criteria are discussed as follows:

Strategic use of IT: The organization started with a strong IT vision; the senior managers were highly committed to the projects. The CFO was responsible for the choice of an enterprise system, and therefore moving to a new ERP platform was a business decision. The company also had their CIO in the management team. Assessments of strategically important implementation options were done consistently by the executives themselves. For example, ERP-supported processes were not adopted in all areas because this would have reduced the organization's competitive advantage. Instead, the executive team approved the option to complement the ERP modules with a telecom-business-specific package that supports the competitively important domain of wireless service delivery (including client activations, client care, and rate plan management). This decision was in line with the key priorities of the company, namely quality of service provisioning and client intimacy.

Organizational Sophistication: Business users wanted to keep processes diverse; however, the system pushed them towards process standardization and

this led to cultural conflicts. Another problem was the unwillingness to change the organization. People were afraid that the new ways of working were not as easy as before and, therefore, they undermined the process.

Penetration of the ERP system: The degree of involvement of process owners in the implementation was directly reflected in the results. The process owners were committed to reusing their old processes, which led to significant customization efforts. The penetration of the ERP can be assessed according to two indicators: the number of people who use the system or the number of processes covered. The latter gives a clearer picture of the use than the former, because many employees can be in functions in which they have nothing to do with the ES. Examples of such functions were field technicians in cell site building and call center representatives. In our case study organization, 30-40% of the business processes are covered by SAP and coverage is still being extended.

Vision: The company wanted to achieve a competitive advantage by implementing ES. Because this was a pricy initiative, they made consistent efforts to maximize the value of ES investments and extend it to non-core activities and the back office.

Drivers & Lessons: The company's drivers were: (i) integration of sites and locations, (ii) reducing transaction costs, and (iii) replacement of legacy applications. There was a very high learning curve through the process. Some requirements engineering activities, like requirements prioritization and negotiation, went wrong at first, but solutions were found during the process. More about the lessons learned in the requirements process can be found in [2].

6.3 Mapping of the Case Study Findings

This section provides a list of the most important findings from our architecture and ES usage assessment results. Later, in Section 7, this list is compared to the results of our literature survey study (Section 5). The list reports on the following:

1. There appears to be a relationship between the DoC ACMM criterion of Business Linkage and the ES UMM criterion of Strategic Use of IT. Strong business/architecture linkage strengthened the stakeholders' involvement in the ERP initiative: for example, we observed that those business process owners who had collected positive experiences of using architecture deliverables in earlier process automation projects maintained, in a consistent way, a positive attitude towards the architecture-driven ERP implementation projects.

2. There appears to be a relationship between the DoC ACMM criterion of Senior Management Involvement and the ES UMM criterion of Vision. The executive sponsorship for architecture made it easy for the ES adopter to develop the capability to consistently maintain a shared vision throughout all ES projects. Although the ES adopter was rated as a Stage 1 organization on the majority of the ES UMM criteria, they managed to maintain at all times a sense of shared vision and identity of who they were, and this rated them, regarding the Vision criterion, at Stage 3. This may also be an example of how a mature

architecture team can positively influence a Stage 1 organization and help it practice earlier what other ES adopters experience when arriving at Stage 3.

3. Our observations found no correlation between the DoC ACMM criterion of Architecture Communication and the ES UMM criterion of Organizational Sophistication. At first glance, it appeared that the organization was rated low on the Organizational-Sophistication criterion of the ES UMM due to the low level scored on the Architecture-Communication criterion of the ACMM. However, a deeper look indicated that the Organizational-Sophistication criterion was influenced by a number of events over which the architecture team's willingness and efforts to communicate architecture had neither direct nor indirect control.

4. There appears to be no relationship between the DoC ACMM criterion of Operating Units Participation and the ES UMM criterion of Penetration of the ES. The ES adopter had designated process and data owners on board in both the architecture process and the ES implementation process. Despite the intuitive belief that a high Operating Units Participation positively influences the Penetration-of-the-ES rate, we found the contrary to be part of the case study reality. Owners of ERP-supported processes could not tie the depth and the breadth of ERP usage to architecture. One of the most difficult questions in ERP implementation was how many jobs and job-specific roles would change and how many people would be supposed to work in these roles. This key question is captured in the Penetration-of-the-ES criterion of the ES UMM, but its resolution was not found based on architecture. Also, both architects and ERP teams saw little correlation between these two aspects.

5. We observed no clear connection between a highly mature Architecture Process and the ES UMM criterion of Drivers and Lessons. A mature architecture process implies clarity on what the business drivers for ES initiatives are. In our experience, however, the organization defined business drivers for each project but found later that some of them were in conflict. This led to unnecessarily complex ERP customization and the needless installation of multiple system versions [2]. However, the ES team did better in the next series of roll-outs, and their improvement was attributed to the role of architecture. Architecture frameworks, architecture-level metrics, and reusable model repositories were made part of the requirements definition process and were consistently used in the prioritization and negotiation of ERP customization requirements in most of the projects that followed. This suggests that an architecture process alone does not determine a project's success but can assist ES adopters in correcting and doing things better the next time.

6. We found no correlation between a highly mature Architecture Development and the ES UMM criterion of Drivers and Lessons. In the early projects, the organization failed to see the ES initiative as a learning process. Process owners shared a readiness to change their ways of working, but found themselves unprepared to spend time on learning the newly designed integrated end-to-end processes, the new system, the way it is configured, and the future options being kept open. Inconsistent definitions of business drivers and inconsistent learning from trials and failures favoured a low rating on the

111 Vol. IV, No. II, pp Drivers-and-Lessons criterion. 7. We found no correlation between a highly-mature Architecture Development and the ES UMM criterion of Organizational Sophistication. Stakeholders saw process architecture deliverables as tools to communicate their workflow models to other process owners. All agreed that process models made process knowledge explicit. But business users also raised a shared concern about the short life-span of the architecture-compliant ERP process models. Due to the market dynamics in the telecommunication sector, process models had the tendency to get outdated in average each 6 weeks. Modelling turned out to be an expensive exercise and took in average at least 3 days of full-time architect s work and one day of process owner s time. Keeping the models intact was found resource-consuming and business users saw little value in doing this. To sum up, high architecture maturity does not necessarily imply coordination in determining ES priorities and drivers; neither can it turn an ES initiative into a systematic learning process. While the architecture maturity in the beginning of the project was very high, the organization could not set up a smooth implementation process for the first six ERP projects. So, at the time of the first assessment, the ES usage maturity was low (stage 1) although the company had clarity on the strategic use of IT and treated the ES implementation projects as business initiatives and not as IT projects. 7 Comparison with the Survey Study This section addresses the question whether the factors identified from our survey study are consistent with the ones identified in our case study. We did this to see if our multi-analyses approach can help uncover subtle information about both the interplay of EA and ES and the research method itself. The factors resulting from the survey and the ones from the case study are compared in Table 4. It indicates a number of overlapping factors in the two case studies: both studies identified four factors that are linked to a mature ES usage and EA. Factor Survey Study Case Study Vision yes yes Strategic decision-making, transformation & support yes yes Coherence between big picture view & local project views yes no Process definition yes yes Alignment of people, processes & applications with goals yes no Business involvement & buy-in yes yes Drivers yes no Making knowledge explicit no yes Table 4: Consistency check in the findings of the survey and the case study Next, our findings suggest that three factors were identified in the survey but not Maya Daneva, Pascal van Eck 106

112 Vol. IV, No. II, pp in the case study. One factor was found in the case study but not in the survey. 8 Conclusions In the past decade, awareness of IT governance in organizations increased and many have also increased their spending in EA and ES with the expectation that these investments will bring improved business results. However, some organizations appear to be more mature than others in how they use EA and ES for their advantage and do get better value out of their spending. This has opened the need to understand what it takes for an organization to be more mature in EA and ES usage and how an organization measures up by using one of the numerous maturity models available in the market. Our study is one attempt to answer this question. We outlined a comparative strategy for researching the multiple facets of a correlation relationship existing between these two types of maturity models, namely for EA and ES. We used a survey study and a case study of one company s ERP experiences in order to get a deeper understanding of how these assessment criteria refer to each other. We found that the two types of maturity models rest on a number of overlapping assessment criteria, however, the interpretation of these criteria in each maturity model can be different. Furthermore, our findings suggest that a well-established architecture function in a company does not imply that there is support for an ES-implementation. This leads to the conclusion that high architecture maturity does not automatically guarantee high ES usage maturity. In terms of research methods, our experiences in merging a case study and a literature survey study suggest that a multi-analyses approach is necessary for a deeper understanding of the correlations between architecture and ES usage. The present study shows that a multi-analyses method helps revise our view of maturity to better accommodate the cases of ES and EA from an IT governance perspective and provides rationale for doing so. By applying a multi-analyses approach to this research problem, our study departs from past framework comparison studies. Moreover, this study extends previous research by providing a conceptual basis to explicitly link the assessment criteria of two types of models in terms of symbols, contents and codified good practices. In our case study, we have chosen to use qualitative assessments of EA and ES maturity, instead of determining quantitative maturity measurement according to the models. The nature of the semiotic analysis, however, makes specific descriptions of linkages between EA and ES usage difficult. Many open and far-reaching questions result from this first exploration. Our initial but not exhaustive list includes the following lines for future research: 1. Apply content analysis methods [24] to selected architecture and ES usage models to check the repeatability of the findings of this research. 2. Analyze how EA is used in managing strategic change. This will be done by carrying out case studies at companies sites. 3. Refining ES UMM concepts. The ES UMM was developed at the time of the year 2000 ERP boom and certainly needs revisions to reflect the most recent Maya Daneva, Pascal van Eck 107

113 Vol. IV, No. II, pp ERP market developments [13]. 4. Investigate how capability assessments and maturity advancement are used to achieve IT-business alignment. Our present results suggest this research is certainly warranted. References [1] J. Champy, X-Engineering the Corporation, Warner Books, New York, [2] M. Daneva, ERP requirements engineering practice: lessons learned, IEEE Software, March/April [3] T. Davenport, The future of enterprise system-enabled organizations, Information Systems Frontiers 2(2), 2000, pp [4] Department of Commerce (DoC), USA Government: Introduction IT Architecture Capability Maturity Model, 2003, acmm_rev1_1_ pdf [5] L. Dube, G. Pare, Rigor in information systems positivist research: current practices, trends, and recommendations, MIS Quarterly, 27(4), 2003, pp N., [6] U. Eco, A Theory of Semiotics, Bloomington, Indiana, University of Indiana Press, [7] Ernst & Young LLT: Are you getting the most from your ERP: an ERP maturity model, Bombay, India, April [8] Federal Enterprise Architecture Program Management Office, NASCIO, US Government, April, [9] Fonstad, D. Robertson, Transforming a Company, Project by Project: The IT Engagement Model, Sloan School of Management, CISR report, WP No363, Sept [10]N. Fenton, S.L. Pfleeger, Software Metrics: a Rigorous & Practical Approach, International Thompson Publ., London, [11]Gartner Architecture Maturity Assessment, Gartner Group Stamford, Connecticut, Nov [12]W. van Grembergen, R. Saull, Aligning business and information technology through the balanced scorecard at a major Canadian financial group: its status measured with an IT BSC maturity model. Proc.of the 34th Hawaii Int l Conf. on System Sciences, [13]C. Holland, B. Light, A stage maturity model for enterprise resource planning systems use, The DATABASE for Advances in Information Systems, 32 (2), 2001, pp [14]J. Lee, K. Siau, S. Hong, Enterprise integration with ERP and EAI, Communications of the ACM, 46(2), [15]M. Markus, Paradigm shifts E-business and business / systems integration, Communications of the AIS, 4(10), [16]M. Markus, S. Axline, D. Petrie, C. Tanis, Learning from adopters experiences with ERP: problems encountered and success achieved, Journal of Information Technology, 15, 2000, pp [17]J. Martin, Strategic Data-planning Methodologies. Prentice Hall, [18]META Architecture Program Maturity Assessment: Findings and Trends, META Group, Stamford, Connecticut, Sept [19]A. Rafaeli, M. Worline, Organizational symbols and organizational culture, in Ashkenasy & C.P.M. Wilderom (Eds.) International Handbook of Organizational Climate and Culture, 2001, [20]J. Ross, P.Weill, D. Robertson, Enterprise Architecture as Strategy: Building a Foundation for Business Execution, Harvard Business School Press, July [21]P. Weill P, J.W. Ross, How effective is your IT governance? MIT Sloan Research Briefing, March [22] R.K. Yin, Case Study Research, Design and Methods, 3rd ed. Newbury Park, Sage Publications, [23] J. Schekkerman, Enterprise Architecture Score Card, Institute for Enterprise Architecture Developments, Amersfoort, The Netherlands, [24] S. Stempler, An overview of content analysis, Journal of Practical Assessment, Research & Evaluation, 7(17), Maya Daneva, Pascal van Eck 108

Acknowledgement: This research has been carried out with the financial support of the Netherlands Organisation for Scientific Research (NWO) under the Collaborative Alignment of Cross-organizational Enterprise Resource Planning Systems (CARES) project. The authors thank Claudia Steghuis for providing us with Table 1.


116 Vol. IV, No. II, pp UNISC-Phone A case study SCHREIBER Jacques FELDENS Gunter LAWISCH Eduardo ALVES Luciano Informatics Departament UNISC Universidade de Santa Cruz do Sul - Av. Independência, 2293 CEP Santa Cruz do Sul Brasil jacques@unisc.br gunterfeldens@gmail.com eduardo.lawisch@gmail.com alvesluciano@gmail.com Abstract The Internet has had an exponential growth over the past decade and its impact on society has been rising steadily. Nonetheless, two situations persist: first, the available interfaces are not appropriate for a huge number of citizens (visual deficient people), second, it is still difficult for people to access the internet in our Country. On the other hand, the proliferation of cell phones as tools for accessing the Internet, equally paved the way for new applications and business models. The interaction with Web pages through the voice is an ever-increasing technological challenge and added solution in terms of ubiquitous interfaces. This paper presents a solution to access Web contents through a bidirectional voice interface. It also features a practical application of this technology, giving users an access to an academic enrolment system using the telephone to surf the Web pages of the present system. Keywords: disabled people, voice recognition 1 Introduction Humanity has long been feeling the need to arrange mechanisms that make sluggish, routine and complex tasks easier. Calculations, data storing are some of the examples and lots of tools have been created to solve these problems. The personal computer comes as a working instrument that carries out many of them. The interconnection of these devices allows for information release and gives users the chance to share contents thus improving task execution efficiency. The Internet is inevitably a consequence of the use and proliferation of computers. Its constant evolution could be framed into several features, such as infrastructure, development tools (increasingly easy to use) and interfaces with the user. The objective of this paper is to explore the new man-computer interface and its implications upon society. The Web interfaces are based on languages of their own, which are used to write the documents and, by means of appropriate software (navigators, interpreters) can be read, visualized and executed. The first programs to present the contents of these documents used simple interfaces, which only contained the text. Nowadays, they support texts in various formats, charts, images, videos, sounds, among others. As information visualization software progresses, along with the need to make this Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 111

117 Vol. IV, No. II, pp information available to more users, more complex interfaces arise, using hardware systems and software innovators. Then come technologies that utilize voice and dialogue recognition systems to provide and collect data. The motivation for developing this work comes from the ever-expanding fixed and mobile telephone services in Brazil, particularly over the past decade, a development that will allow a broader range of users to access the contents of the Web. There are also motivations of social character, for example, access to the Internet for special needs citizens, like visual deficient people. 1.1 Voice interfaces What solutions? Nowadays, mobile phones are fairly common devices everywhere and increasingly perform activities that go beyond the range of simple conversation. On the other hand, the Internet is currently a giant database not easily accessible by voice through a cell phone. An interface that recognizes the voice and is capable of producing a coherent dialogue from a text, could be the solution to provide these contents. These devices would turn out to be very useful for visual deficient people. As an initial approach toward the solution of the problem we could consider the WAP (Wireless Application Protocol), a specially designed protocol for mobile phones to access the contents of the Web. This type of access includes the microbrowser, a navigator specifically designed to function through a screen and reduced resources devices. The protocol was projected to make it possible to execute multimedia data on mobile phones, as well as on other wireless devices. The development was essentially based on the existing principles and patterns for the Internet.(TCP/IP-like) [1]. However, due to the technological deficiencies of these devices, new solutions were adopted (protocol battery) to optimize its utilization within this context. The architecture is similar to the one of the Internet, as shown in Figure 1. Figure 1: WAP architecture Extracted from the paper [6] In a concise manner, we could say that the WAP architecture functions as an interface which allows for communication, more precisely for data exchange, between handhelds and the World Wide Web. The sequence of actions that occur while accessing the WWW through a WAP protocol is as follows: Microbrowser on the mobile device starts; The device seeks signal; Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 112

118 Vol. IV, No. II, pp A connection is established with the server of the telephone company; A Web Site is selected to be displayed; An order is sent to proxy using WAP; The proxy accesses the desired information on the site by means of the HTTP protocol; The proxy codifies the HTTP data to WML; The WML codified data are sent to the device; The microbrowser displays the wireless version of the Web Site. In spite of it all, this protocol shows some limitations, among them, the inability to establish a channel that allows for transporting audio data in real time. This is a considerable limitation and makes it impossible to use the WAP as a solution to supply Web contents through a voice interface. 2 Speech Recognizing Systems The automatic speech recognition systems (Automated Speech Recognition - ASR) were greatly developed over the past years with the creation of improved algorithms and acoustic models and with the higher processing capacity of the computers. With a relatively accessible ASR system installed in the personal computer and with a good quality microphone quality recognition can be achieved if the system is trained for the user s voice. Through a telephone, and a system that has not been trained, the recognition system needs a set of speech grammars to be able to recognize the answers. This is one possible manner to increase the possibilities to recognize, for example, a name among a rather big number of hypotheses, without the need for many interactions. Speech recognition through mobile phones, which are sometimes used in noisy environments, requires more complex algorithms and simple, well built grammars. Nowadays, there are many ASR commercial applications in an array of languages and action areas, for example, voice, finance, bank, telecommunications and trade portals ( There are also evolutions in speech synthesis and in the transformation of texts into speech (TTS, Text-To-Speech). Many of the present TTS systems still have problems in terms of easily perceived speeches. However, a new form of speech synthesis under development is based on the concatenation of wave forms. Within this technique, speech is not entirely generated from the text, but recognition also relies on a series of pre-recorded sounds [2]. There are other recognizing manners, namely, mobile phones execute calls and other operations to voice commands of their owners habitual user. The recognition system functions through the use of previous recordings of the user s voice, which are then used to compare with the entry data so as to make order recognition possible. This is a rather simple recognition manner which works well and is widely used. The recognition techniques are also used by systems that rely on a traditional operational system in order to allow the user to manipulate programs and access contents by means of a voice interface. An example of this type of application is the IN CUBE Voice Command [3]. This application uses a technology specifically developed to facilitate repetitive tasks. It is an application of the Windows(9X, NT, 2000) system but does not modify its configurations, and there are also versions available for other operational systems. 
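To make the template-matching idea just described concrete, the sketch below compares the feature vector of an utterance with pre-recorded templates of the owner's voice and picks the best-matching command. This is a deliberately simplified illustration and not the mechanism of any particular product: the feature-extraction step is assumed to be provided elsewhere, and real recognizers use far richer acoustic models.

import java.util.Map;

public class CommandMatcher {

    // Cosine similarity between two feature vectors of equal length.
    static double similarity(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the command whose stored template best matches the utterance,
    // or null when no template is similar enough.
    static String recognize(double[] utterance, Map<String, double[]> templates,
                            double threshold) {
        String best = null;
        double bestScore = threshold;
        for (Map.Entry<String, double[]> e : templates.entrySet()) {
            double s = similarity(utterance, e.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }
}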
On their own, these systems cannot be regarded as solutions to the problem of creating a voice interface for the Web because, as mentioned above, full recognition requires systems with high processing capacity. Nevertheless, as shown later in this paper, this is an important technology and might even become part of the global architecture of a system that presents a

119 Vol. IV, No. II, pp solution to the problem. The applications that execute commands of an operational system through voice recognition also poise some considerable disadvantages. If the systems are easy to use and are reliable, they do not have mechanisms that would allow to return in speech form, for example, any information that might be obtained from the Internet (from a XML document). 2.1 The advent of the VoiceXML The working group Speech Interface Framework of the W3C (World Wide Web Consortium) created patterns that regulate access to Web contents using a voice interface. Access to information is normally done through visualization programs (browsers), which interpret a markup language (HTML, XML). The specification of the Speech Synthesis Markup Language is a relevant component of this new set of rules for voice navigators and was projected to provide a richer markup language, based on XML (Extended Markup Language), so as to make it possible to create applications for this new type of Web interface. The essential rule for working out the markup language is to establish mechanisms which allow the authors of the synthesizable contents to control such features of speech as pronunciation, volume, emphasis on certain words or sentences in the different platforms. This is how VoiceXml was created, a language based on XML which allows for the development of documents containing dialogues to facilitate access to Web contents [4]. This language, like the HTML, is used for Man-Computer dialogues. However, whilst the HTML assumes the existence of a graphic navigator (browser), a monitor, keyboard and mouse, the VoiceXML assumes the existence of a voice navigator with an audio outlet synthesized by the computer or a pre-recorded audio with an audio inlet via voice and/or keyboard tones. This technology frees the Internet for the development of voice applications, thus simplifying drastically the previously difficult tasks, creating new business opportunities. The VoiceXML is a specification of the VoiceXml Forum, an industrial consortium comprising more than 300 companies. This forum is now engaged in certifying, testing and spreading this language, while the control of its development is in the hands of the World Wide Web Consortium (W3C). As the VoiceXML is a specification, the applications that function in accordance with this specification can be used in different platforms. Telephones had a paramount importance in its development, but it is not restricted to the telephone market and reaches others, for example, personal computers. The technology s architecture (Figure 2) is analogous to the one of the Internet, differing only in that it included a gateway whose functions will be explained later on (5). Figure 2: The Architecture of a Client-Server system based on VoiceXml, extracted from [6] Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 114

120 Vol. IV, No. II, pp In a succinct manner, let us look at how information transference is processed in a system that implements this technology. First, the user calls a certain number over the mobile phone, or even over a conventional telephone. The answer to the call is made by a computerized system, the VoiceXml gateway, which then sends an order to the document server (the server could be in any place of the Internet) which will return the document referenced by the call. A constituent component of the gateway, called VoiceXml interpreter, executes the commands in order to make the contents speakable through the system. It listens to the answers and passes them on to the speech recognizing motor, which is also a part of the gateway. The normalization process followed the Speech Synthesis Markup Requirements for Voice Markup Languages privileging some aspects, of which the following are of note [4]: Interoperationality: compatibility with other W3C specifications, including the Dialog Markup Language and the Audio Cascading Style Sheets. Generality: supports speech outlets to an array of applications and several contents. Internationalization: provides speech outlet in a big number of languages, with the possibility to use these languages simultaneously in a chosen document. Reading generation and capacity: capable of automatically generating easy-to-read documents. Consistency: will allow for a predictable control of outgoing data regardless of the implementation platform and the characteristics of the speech s synthesis device. Implementable: the specification should be accomplishable with the existing technology and there should be a minimum number of operational functionalities. Upon analyzing the architecture in detail (Figure 3) it becomes clear that the server (for example, a Web Server) processes the client s application orders, the VoiceXml Interpreter, through the VoiceXml Interpreter context. The server produces VoiceXml documents in reply, which are then processed and interpreted by the VoiceXml Interpreter. The VoiceXml Interpreter Context can monitor the data furnished by the user in parallel with the VoiceXml Interprete. For example, a VoiceXml Interpreter Context may be listening to a request of the user to access the aid system, and the other might be soliciting profile alteration orders. The implementation platform is controlled by the VoiceXml Interpreter context and by the VoiceXml interpreter. For example, in a voice interactive application, the VoiceXml Interpreter context may be responsible for detecting a call, read the initial VoiceXml document and answer the call, while the VoiceXml Interpreter conducts the dialogue after the answer. The implementation platform generates events in reply to user actions (for example: speech, entry digits, request for call termination) and system events (for example, end of temporizer counting). Some of these events are manipulated by the VoiceXml Interpreter itself, as specified by the VoiceXml document, while others are manipulated by the VoiceXml Interpreter context. Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 115
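As an illustration of this client-server flow, the sketch below shows a document server returning a minimal VoiceXML page to the gateway over HTTP, exactly as a web server would return HTML to a browser. It is a hand-made example, not code from the UNISC-Phone system described later; the servlet class, the form contents and the target page menu.jsp are hypothetical.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloVoiceXmlServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // The gateway fetches this document over HTTP, just like a graphical browser.
        // MIME type commonly used for VoiceXML documents.
        response.setContentType("application/voicexml+xml");
        PrintWriter out = response.getWriter();
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">");
        out.println("  <form id=\"greeting\">");
        out.println("    <field name=\"answer\">");
        out.println("      <prompt>Welcome. Do you want to hear today's menu?</prompt>");
        out.println("      <grammar>[yes no]</grammar>"); // simple inline yes/no grammar
        out.println("      <filled>");
        out.println("        <submit next=\"menu.jsp\" namelist=\"answer\"/>");
        out.println("      </filled>");
        out.println("    </field>");
        out.println("  </form>");
        out.println("</vxml>");
    }
}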

Figure 3: Architecture details (server, VoiceXML Interpreter Context, VoiceXML Interpreter, and implementation platform)

Analyzing the tasks executed by the VoiceXML gateway in more detail (Figure 4), it becomes clear that the interpretation of the scripts and the interaction with the user are actions controlled by the gateway. To execute them, the gateway consists of a set of hardware and software elements which form the heart of the VoiceXML technology (the VoiceXML Interpreter and the above-described VoiceXML Interpreter Context are also components of the gateway). Essentially, these components provide the user-interaction mechanisms, analogously to the browsers in a conventional HTTP service. The calls are answered by the telephone and signal-processing services.

Figure 4: The VoiceXML gateway components [5] (VoiceXML Interpreter, telephone and signal-processing services, speech recognizer, audio reproduction, TTS services, and HTTP clients)

The gateways are fitted into the web in a manner very similar to the IVR (Interactive Voice Response) systems and may be placed before or after the small-

122 Vol. IV, No. II, pp scale telephone centers utilized by many institutions. The architecture allows the users to request the transference of their call to an operator and also allows the technology to implement the order easily. When a call is received, the VoiceXml Interpreter starts to check and execute the instructions contained in the scripts VoiceXml. As mentioned before, when the script, which is executing, requests an answer from the user, the interpreter directs the control to the recognition system, which listens to and interprets the user s reply. The recognition system is totally independent from other gateway components. The interpreter may use a compatible client/server recognition system or may change the system during the execution with the purpose to improve the performance. Another manner of collecting data is the recognition of keys which results into DTMF controls which are interpreted to allow the user to furnish information to the system, like access passwords. 3 The Unisc-Phone prototype The prototype originated from the difficulties faced by the students of Unisc University of Santa Cruz do Sul at the enrolment period. Although efficient in itself, the academic enrolment system implemented on the Web requires access to the Internet and knowledge of micro-informatics, a distant reality for many students. This gave origin to the idea of creating a system that provides academic information and makes enrolment possible in the respective disciplines of the course attended by the student, at any place, any moment and automatically. The idea resulted into the Unisc-Phone project, and uses the telephone, broader in scope and available everywhere, to provide appropriate information to any student who wishes to enroll at the university. 3.1 The functionality of the Unisc-Phone While figuring out the project, the main concern of the team was the creation of a grammar flexible enough to provide for a dialogue as natural as possible. It would be up to the student to conduct the dialogue with the system, giving the student the chance to start the dialogue or leave it to the system to make the questions. In case the system assumes control, a set of questions in a pre-defined order would be made, if not, the system should be able to interact with the student, perceiving their needs, clearing the doubts as they arise. Upon assuming the dialogue, the student could interact with the system in several manners, as follows: <Student> Hi! I m a student of the Engineering course, my enrolment number is 54667, I want to know which disciplines I can enroll in? <System> Hello, Mr <XXXX>, please type in your password on the telephone keyboard. Figure 5: A first example of possible dialogue Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 117

123 Vol. IV, No. II, pp After the validation by the user, the dialogue proceeds... <System> Mr <XXXX>, the available disciplines are as follows: (Discipline A), (Discipline B), (Discipline C), (Discipline D) <Student> I want to enroll in disciplines A and C <System> You ve requested to enroll in disciplines A and C, please confirm! <Student> That s Right, ok Figure 6: A continue example of possible dialogue Another dialogue could occur, for example: <student> Hi, I want to enroll in disckiplines A and C, my name is <XXXXX> <System> Good morning, Mr <XXXXX>, please inform your number of enrolment and type in the password on the keyboard of the phone <Student> Oh..yes, my enrolment number is (typing in password) <System> You ve requested to enroll in disciplines A and C, please confirm! <Student> That s Right, ok Figure 7: A second example of possible dialogue These two dialogues are just examples of the many dialogues that could take place. Obviously, the VoiceXML technology does not yet provide for an intelligent system able to conduct any dialogue, however, with mixed-initiative forms, the <initial> tag of the VoiceXML, a system capable of interacting with a considerable array of dialogues, was implemented, imparting on the user-student the feeling of interacting with a system sufficiently intelligent, conveying confidence and security to this user The system s architecture A database, now one of the main components of the system, was built during the implementation phase. It contains detailed information on the disciplines (whether or not attended by the student), the pre-requisites and available schedules. These data will later be used to provide a solution to the student s problem. The global architecture was implemented as shown in Figure 2, however, such technologies as the VoiceXML, JSP MySQL, were utilized, and the concept of architecture was applied in three layers, see Figure 8: Presentation Logic (VoiceXML) Business Logic (JSP) Data Access Logic (MySQL) Figure 8: UNISC-Phone s Three-Layer Architecture Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 118
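As a sketch of how the lower two layers of Figure 8 can interact, the following class shows business logic querying a MySQL database for the disciplines a student may still take, playing the role that the system's own data-access classes (described below) play in UNISC-Phone. It is illustrative only: the connection parameters and the table and column names are hypothetical, not taken from the actual UNISC schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class DisciplineDao {
    private final String url = "jdbc:mysql://localhost/unisc";

    // Lists disciplines offered on a given weekday that the student has not yet completed
    // and that still have vacancies.
    public List<String> listAvailableDisciplines(String enrolmentNumber, int dayOfWeek)
            throws SQLException {
        List<String> names = new ArrayList<>();
        String sql = "SELECT name FROM discipline "
                   + "WHERE day_of_week = ? AND vacancies > 0 "
                   + "AND code NOT IN (SELECT discipline_code FROM completed WHERE student = ?)";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, dayOfWeek);
            ps.setString(2, enrolmentNumber);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }
}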

124 Vol. IV, No. II, pp For the development of this system, the JSP program free server was used - and a free VoiceXML gateway, site The following phase of the project consisted in the development of the VoiceXML documents that make up the interface. The objective was to create a voice interface in Portuguese, but unfortunately free VoiceXml gateways, capable of supporting this language, do not exist yet. Therefore, we developed a version that does not utilize the graphic signs typical to the Portuguese language, like accentuation and cedillas; even so, the dialogues are understood by Brazilian users. An accurate analysis of the implementation showed that the alteration of the system to make it support the Portuguese language is relatively simple and intuitive, once the only thing to do is to alter the text to be pronounced by the Portuguese text platform and force the system to utilize this language (there is a VoiceXML element to turn it into «speak xml:lang="pt-br"»). A. System with dialogues in Portuguese The interface and the dialogues it is supposed to provide, as well as the information to be produced by the system, were studied and projected by the team, under the guidance of the professor of Special Topics, at Unisc s (University of Santa Cruz do Sul) Computer Science College. After agreeing on the idea that the functionalities already offered by the system on the Web should be identical to the service existing on the Web, it was necessary to start implementing the dialogues in VoiceXml and also create dynamic pages so that the information contained on the database could be offered to the user-students. Dialogue organization and the corresponding database were established to make the resulting interface comply with the specifications (Figure 9). DisciplineRN.java Function.Java Start.jsp SaveEnrollment.jsp EnrollmentRN.java Figure 9: Organizing the system s files In short, the functionality of each program is as follows: DisciplineRN.java: implements a class that lends support to the recovery of disciplines suitable to each student, in other words, the methods of this class only seek those disciplines possible to be attended, in compliance with the pre-requisites, vacancy availability and viable schedules. Functions.java: implements the grammar of the system, in other words, all the plausible dialogues are the result of hundreds of possible combinations of the grammar elements existing in this class. Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 119

125 Vol. IV, No. II, pp Enrollment.java: this class implements persistence, in other words, after the enrolment process, the disciplines chosen by the student are entered in the institution s database. An analysis of Figure 9 shows that the user starts interacting with the system through the program start.jsp which contains an initial greeting, asks for student s name and enrollment number: <%try { ConnectionBean conbean = new ConnectionBean(); Connection con = conbean.getconnection(); DisciplineRN disc = new DisciplineRN(con); GeneralED edresearch = new GeralED(); String no.enrolment = request.getparameter("nromatricula"); //int nrodia = 2; //Integer.valueOf(request.getParameter("nroDia")).intValue (); edresearch.put("codestudent", numberenrolment); String name = disc.liststudent(edresearch); DevCollection lstdisciplines; String dia[] = {"","","Monday","Tuesday","Wednesday","Thursday","Friday", "end"}; %> <vxml version="2.0" xmlns=" <field name= enrolment > <prompt> Welcome to Unisc-Phone, now you can do your enrolment at home, by phone. Please inform your enrolment number. </prompt> </field>... <block> <%=name%>, now we will start with your reenrolment. </block> <form id="enrolment"> <% for (int no.day=2;no.day<=6;no.day++) { %> <field name="re-enrol_<%=day[no. of Day]%>"> Figure 10: The first part of source code Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 120

<prompt> do you want to re-enroll on <%=day[no.day]%>? </prompt>
<grammar> [yes no] </grammar>
<filled>
<if cond="re-enrol_<%=day[no.day]%> == 'yes'">
Let's get started...
<goto nextitem="discipline_<%=day[no.day]%>"/>
<else/>
you won't have lessons on <%=day[no.day]%>,
<% if (no.day==6){ %>
The re-enrolment process has been concluded, sending data to server...
<submit method="post" namelist="discipline_<%=day[2]%> discipline_<%=day[3]%> discipline_<%=day[4]%> discipline_<%=day[5]%> discipline_<%=day[6]%>" next="SaveEnrolment.jsp"/>
...
</vxml>

Figure 11: The second part of the source code

Although built in JSP, the program mixes in fragments of VoiceXML code which provide the interaction with the student. It should be noted that the Portuguese spelling omits all accentuation and other graphic signs (apostrophe, cedilla). For each day of the week, the program lists the available disciplines, which are then chosen by the student. A code analysis shows that the Function class implements a method known as getGrammar:

public static String getGrammar(DevCollection listDiscipline, String nameDay) throws Exception {
    String retorno = "";
    retorno += "<![CDATA[\n"+
        " (\n"+
        "?[ eu ]\n"+
        "?[ I want (I wish to) I would like to]\n"+
        "?[ (re-enrol) (do my re-enrolment) ]\n"+
        "?[ (in discipline) (subject) for ]\n"+
        "?[ de ]\n"+
        " [\n";

Figure 12: The getGrammar method

This grammar provides the student with several ways to ask for enrolment. What follows is an example of the dialogue:

127 Vol. IV, No. II, pp System: Welcome to Unisc-Phone, now you can do your enrolment at home, by phone. Please inform your enrolment number. User-Student: System: John, now we will start your re-enrolment process. System: do you want to re-enroll on Monday? User-Student: Yes System: The disciplines available for you on that day are: algorithms, Logics and discreet mathematics User-Student: I would like to re-enroll in Logics System: Do you want to enroll on Tuesday?... Figure 13: The GetGrammar Method Other manners for a user-student to ask for his/her enrolment could include requests like: I want to re-enroll for the discipline of Algorithm on Monday, or something like this It is my wish to enroll in Logics. This flexibility is possible thanks to the grammar definitions and the utilization of mixed-initiative forms. Once the user-student has made his/her choice, the interpreter creates a matrix containing all the intended disciplines and document interpretation calls for the SaveEnrolment.jsp program. This program asks for the confirmation and enters the re-enrolment data into the system s database. After confirmation, the system cordially says farewell and ends the enrolment procedure. 4 Conclusions The rising business opportunities offered by the Internet as a result of neverending technology improvement, both at infrastructure and interface level, and the growing number of users translate into a diversification of needs at interface level, triggering the appearance of new technologies. It was this context that led to the creation of the voice interface. The VoiceXML comes as a response to these needs due to its characteristics, once it allows for Computer-Human dialogues as a means of providing the users with information. Although entirely specified, this technology is still at its study and development stage. The companies comprised by the VoiceXML Forum are doing their best in spreading the system rapidly. Nevertheless, there are still some shortfalls like, for example, the lack of speech recognition motors and TTS transformation into languages like Portuguese, compatible with the existing gateways. The system here in above described represents a step forward for the Brazilian scientific community, which lacks practical applications that materialize their research works and signal the right course toward the new means of Human- Computer interfaces within the Brazilian context. Jacques Schreiber, Gunter Feldens Eduardo Lawisch, Luciano Alves 122

References

[1] "WAP Forum".
[2] Peter A. Heeman, "Modeling Speech Repairs and Intonational Phrasing to Improve Speech Recognition", Computer Science and Engineering, Oregon Graduate Institute of Science and Technology.
[3] "Speech Recognition Technologies Are NOT All Alike".
[4] "Speech Synthesis Markup Language Specification for the Speech Interface Framework", W3C.
[5] Steve Ihnen, VP Applications Development, SpeechHost, Inc., "Developing With VoiceXML: Overview and System Architecture".
[6] WAP White Paper.


130 Vol. IV, No. II, pp Fuzzy Ontologies and Scale-free Networks Analysis CALEGARI Silvia Universitá degli Studi di Milano Bicocca - DISCo Via Bicocca degli Arcimboldi 8, Milano (Italy) calegari@disco.unimib.it FARINA Fabio INFN, Sezione di Milano-Bicocca Piazza della Scienza 3, Milano (Italy) fabio.farina@cern.ch Abstract In the recent years ontologies have played a major role in knowledge representation, both in the theoretic aspects and in many application domains (e.g., Semantic Web, Semantic Web Services, Information Retrieval Systems). The structure provided by an ontology lets us to semantically reason with the concepts. In this paper, we present a novel kind of concept network based on the evolution of a dynamical fuzzy ontology. A dynamical fuzzy ontology lets us to manage vague and imprecise information. Fuzzy ontologies have been defined by integrating Fuzzy Set Theory into ontology domain, so that a truth value is assigned to each concept and relation. In particular, we have examined the case where the truth values change in time according to the queries executed on the represented knowledge domain. Empirically we show how the concepts and relations evolve towards a power-law statistical distribution. This distribution is the same that characterizes complex network systems. The fuzzy concept network evolution is analyzed as a new case of a scale-free system. Two efficiency measures are evaluated on such a network at different evolution stages. A novel information retrieval algorithm using fuzzy concept networks is also proposed. Keywords: Fuzzy Ontology, Scale-free Networks, Information Retrieval. 1 Introduction In the last years, the ontology plays a key role in the Semantic Web [1] area of research. This term has been used in various areas in Artificial Intelligence [2] (i.e. knowledge representation, database design, information retrieval, knowledge management, and so on), so that to find an unique its meaning becomes a subtle topic. In a philosophical sense, the term ontology refers to a system of categories in order to achieve a common sense of the world [3]. From the FRISCO Report [4] point of Silvia Calegari, Fabio Farina 125

131 Vol. IV, No. II, pp view, this agreement has to be made not only by the relationships between humans and objects, but also from the interactions established by humans-to-humans. In the Semantic Web an ontology is a formal conceptualization of a domain of interest, shared among heterogeneous applications. It consists of entities, attributes, relationships and axioms to provide a common understanding of the real world [5, 3, 6, 4]. With the support of ontologies, users and systems can communicate with each other through an easy information integration [7]. Ontologies help people and machines to communicate concisely by supporting information exchange based on semantics rather than just syntax. Nowadays, there are ontology applications where information is often vague and imprecise, for instance, the semantic-based applications of the Semantic Web, such as e-commerce, knowledge management, web portals, etc. Thus, one of the key issues in the development of the Semantic Web is to enable machines to exchange meaningful knowledge across heterogeneous applications to reach the users goals. Ontology provides a semantic structure for sharing concepts across different applications in an unambiguous way. The conceptual formalism supported by a typical ontology may not be sufficient to represent uncertain information that is commonly found in many application domains. For example, keywords extracted by many queries in the same domain may not be considered with the same relevance, since some keywords may be more significant than others. Therefore, the need to give a different interpretation according to the context emerges. Furthermore, humans use linguistic adverbs and adjectives to specify their interests and needs (i.e., users can be interested in finding a very fast car, a wine with a very strong taste, a fairly cold drink, and so on). The necessity to handle the richness of natural languages used by humans emerges. A possible solution to treat uncertain data and, hence, to tackle these problems, is to incorporate fuzzy logic into ontologies. The aim of fuzzy set theory [8] introduced by L. A. Zadeh [9] is to describe vague concepts through a generalized notion of set, according to which an object may belong to a set with a certain degree (typically a real number in the interval [0,1]). For instance, the semantic content of a statement like Cabernet is a deep red acidic wine might have the degree, or truth-value, of 0.6. Up to now, fuzzy sets and ontologies are jointly used to resolve uncertain information problems in various areas, for example, in text retrieval [10, 11, 12] or to generate a scholarly ontology from a database in ESKIMO [13] and FOGA [14] frameworks. The FOGA framework has been recently applied in the Semantic Web context [15]. However, there is not a complete fusion of Fuzzy Set Theory with ontologies in any of these examples. In literature we can find some attempts to directly integrate fuzzy logic in ontology, for instance in the context of medical document retrieval [16] and in ontology-based queries [17]. In particular, in [16] the integration is obtained by adding a degree of membership to all terms in the ontology to overcome the overloading problem; while in [17] a query enrichment is performed. This is done with the insertion of a weight that introduces a similarity measure among the taxonomic relations of the ontology. Another proposal is an extension of the ontology domain with fuzzy concepts and relations [18]. 
However, it is applied only to Chinese news summarization. Two well-formed definitions of a fuzzy ontology can be found in [19, 20].

132 Vol. IV, No. II, pp In this paper we will refer to the formal definition stated in [19]. A fuzzy ontology is an ontology extended with fuzzy values assigned to entities and relations of the ontology. Furthermore, in [19] it has been showed how to insert fuzzy logic in ontology domain extending the KAON 1 editor [21] to directly handle uncertainty information during the ontology definition, in order to enrich the knowledge domain. In a recent work, fuzzy ontologies have been used to model knowledge in creative environments [22]. The goal was to build a digitally enhanced environment supporting creative learning process in architecture and interaction design education. The numerical values are assigned during the fuzzy ontology definition by the domain expert and by the user queries. There is a continuous evolution of new relations among concepts and of new concepts inserted in the fuzzy ontology. This evolutive process lets the concepts to arrange in a characteristic topological structure describing a weighted complex network. Such a network is neither a periodic lattice nor a random graph [23]. This network has been introduced in an information retrieval algorithm, in particular this has been adopted in a computer-aided creative environment. Many dynamical systems can be modelled as a network, where vertices are the elements of the system and hedges identify the interaction between them [24]. Some examples are biological and chemical systems, neural networks, social interacting species, computer networks, the WWW, and so on [25]. Thus, it is very important to understand the emerging behaviour of a complex network and to study its fundamental properties. In this paper, we present how fuzzy ontology relations evolve in time, producing a typical structure of complex network systems. Two efficiency measures are used to study how information is suitably exchanged over the network, and how the concepts are closely tied [26]. The rest of paper is organized as follows: Section 2 presents the importance of the use of the fuzzy ontology and the definition of a new concept network based on its evolution in time. Section 3 introduces scale-free network notation, smallworld phenomena and efficiency measures on weighted network topologies, while in Section 4 a new information retrieval algorithm exploiting the concept network is discussed. In Section 5 we present some experimental results that confirms the scale-free nature of the fuzzy concept network. In Section 6 some other relevant works we found in literature are presented for introducing our scope. Finally, in Section 7 some conclusions are reported. 2 Fuzzy Ontology In this section an in depth study about how the Fuzzy Set Theory [9] has been integrated into the ontology definition will be discussed. Some fuzzy ontology preliminary applications are presented. We will show how to construct a novel concept network model also relying on this definition. 1 The KAON project is a meta-project carried out at the Institute AIFB, University of Karlsruhe and at the Research Center for Information Technologies(FZI). Silvia Calegari, Fabio Farina 127

133 Vol. IV, No. II, pp Toward a Fuzzy Ontology Definition Nowadays, the knowledge is mainly represented with ontologies. For example, applications of the Semantic Web (i.e., e-commerce, knowledge management, web portals, etc.) are based on ontologies. With the support of ontologies, users and systems can communicate each other through an easy information exchange and integration [7]. Unfortunately, the only ontology structure is not sufficient to handle all the nuances of natural languages. Humans use linguistic adverbs and adjectives to describe what they want. For example, a user can be interested in finding information using web portals about the topic A fun holiday. But what is a fun holiday? How to handle this type of request? Fuzzy Set Theory introduced by Zadeh [9] lets us tackle this problem denoting and reasoning with non-crisp concepts. A degree of truth (typically a real number from the interval [0,1]) is assigned to a sentence. The previous statement a fun holiday might have truth-value of 0.8. At first, let us remember the definition of a fuzzy set. Definition 1 Let U be the universe of discourse, U = {u 1, u 2,..., u n }, where u i U is an object of U and let A be a fuzzy set in U, then the fuzzy set A can be represented as: A = {(u 1, f A (u 1 )), (u 2, f A (u 2 )),..., (u n, f A (u n ))}, (1) where f A, f A : U [0, 1], is the membership function of the fuzzy set A; f A (u i ) indicates the degree of membership of u i in A. Finally, we can give the definition of fuzzy ontology presented in [19]. Definition 2 A fuzzy ontology is an ontology extended with fuzzy values assigned through the two functions g :(Concepts Instances) (P roperties P roperty value) [0, 1] (2) h :Concepts Instances [0, 1]. (3) where g is defined on the relations and h is defined on the concepts of the ontology. 2.2 Some Applications of the Fuzzy Ontology From the practical point of view, using the given fuzzy ontology definition we can denote not only non-crisp concepts, but we can also directly include the property value according to the definition given in [27]. In particular, the knowledge domain has been extended with the quality concept becoming an application of the property values. This solution can be used in the tourist context to better define the meaning of a sentence like this is a hot day. Furthermore, it is an usual practice to extend the set of concepts already present in the query with other ones which can be derived from an ontology. Generally, given a concept, the query is extended with its parents and children to enrich the set of displayed documents. With a fuzzy ontology it is possible to establish a threshold value (defined by the domain expert) in order to extend queries with instances of concepts which satisfies the chosen value [19]. This approach can be compared with [17] where the Silvia Calegari, Fabio Farina 128

evaluation of queries is determined through the similarity among the concepts and the hyponymy relations of the ontology. In the literature, the problem of efficient query refinement has been tackled with a large number of different approaches over the last years. PASS is a method developed to automatically construct a fuzzy ontology (the associations among the concepts are found by analysing the document keywords [28]) that can be used to refine a user's query [29]. Another possible use of the fuzzy value associated with concepts has been adopted in the context of medical document retrieval, to limit the problems due to the overloading of a concept in an ontology [16]. This also reduces the number of documents found, hiding those that do not fulfil the user's request. The most relevant goal achieved using a fuzzy ontology has been the direct handling of concept modifiers in the knowledge domain. A concept modifier [30] has the effect of altering the fuzzy value of a property. Given a set of linguistic hedges such as very, more or less, slightly, a concept modifier is a chain of one or more hedges, such as very slightly or very very slightly, and so on. So, a user can write a statement like Cabernet has a very dry taste. It is necessary to associate a membership modifier to any (linguistic) concept modifier. A membership modifier has a value β > 0 which is used as an exponent to modify the value of the associated concepts [19, 31]. According to their effect on a fuzzy value, hedges can be classified in two groups: concentration type and dilation type. The effect of a concentration modifier is to reduce the grade of a membership value; thus, in this case, it must be β > 1. For instance, the hedge very is usually assigned β = 2. So, if Cabernet has a dry taste with value 0.8, then Cabernet has a very dry taste with value 0.8² = 0.64. On the contrary, a dilation hedge has the effect of raising a membership value, that is, β ∈ (0, 1); the example is analogous to the previous one. This not only enriches the semantics that ontologies usually offer, but also gives the user the possibility to make a request without mandatory constraints.

2.3 Fuzzy Concept Network

Every time a query is performed, the fuzzy values given to the concepts or to the relations set by the expert during the ontology definition are updated. In [22] two formulae, both to update and to initialize the fuzzy values, are given. Such expressions take into account the use of concept modifiers. The dynamical behaviour of a fuzzy ontology is also given by the introduction of new concepts when a query is performed. In [22] a system has been presented that allows the fuzzy ontology to adapt to the context in which it is used, in order to propose an exhaustive approach to directly handle the knowledge-based fuzzy information. This consists in the determination of a semantic correlation [22] among the entities (i.e., concepts and instances) that are searched together in a query.

Definition 3 A correlation is a binary and symmetric relation between entities. It is characterized by a fuzzy value: corr : O × O → [0, 1], where the set O = {o_1, o_2, ..., o_n} is the set of the entities contained in the ontology. This defines the degree of relevance for the entities. The closer the corr value is

135 Vol. IV, No. II, pp to 1, the more the two considered, for instance, concepts are correlated. Obviously an updating formula for each existent correlation is also given. A similar technique is known in literature as co-occurrence metric [32, 33]. To integrate the correlation values into the fuzzy ontology is a crucial topic. Indeed, the knowledge of a domain is given, also considering the use of the objects inside the context. An important topic is to handle the trade off between the correct definition of an object (given by the ontology represented definition of the domain) and the actual means assigned to the artifact by humans (i.e. the experience-based context assumed by every person according to his specific knowledge). Figure 1: Fuzzy concept network. In this way, the fuzzy ontology reflects all the aspects of the knowledge-base and allows to dynamically adapt to the context in which it is introduced. When a query is executed (e.g., to insert a new document or to search for other documents) new correlations can be created or updated altering their weights. A fuzzy weight to the concepts and to correlations is also assigned during the definition of the ontological domain by the expert. In this paper, we propose a new concept network for the dynamical fuzzy ontologies. A concept network consists of n nodes and a set of directed links [34, 35]. Each node represents a concept or a document and each link is labelled with a real number [0, 1]. Finally, we can give the definition of fuzzy concept network. Definition 4 A Fuzzy Concept Network (FCN) is a complete weighted graph N f = {O, F, m}, where O denotes the set of the ontology entities. The edges among the nodes are described by the function F : O O [0, 1], if F (o i, o j ) = 0 then the entities are considered uncorrelated. In particular F := corr. Each node o i is characterised by a membership value defined by the function m : O [0, 1], which determines the importance of the entity by its own in the ontology. By definition F (o i, o i ) = m(o i ). In Fig.1 a graphical representation of a small fuzzy concept network is shown: in this chart the edges with F (o i, o j ) = 0 are omitted in order to increase the readability. The membership values m(o i ) are reported beside the respective instances. Silvia Calegari, Fabio Farina 130
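A minimal sketch of these structures is given below: it stores the membership values m(o_i) and the symmetric correlations F(o_i, o_j) of Definition 4, and applies the hedge exponents of Section 2.2. The co-occurrence update shown is only a bounded-increment placeholder for the updating formulae of [22], which are not reproduced in this paper, and all names are illustrative.

import java.util.HashMap;
import java.util.Map;

public class FuzzyConceptNetwork {
    private final Map<String, Double> membership = new HashMap<>();   // m(o_i)
    private final Map<String, Double> correlation = new HashMap<>();  // F(o_i, o_j), symmetric

    private String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;       // one key per unordered pair
    }

    public void addEntity(String entity, double m) {
        membership.put(entity, m);
    }

    public double corr(String a, String b) {
        if (a.equals(b)) return membership.getOrDefault(a, 0.0);      // F(o_i, o_i) = m(o_i)
        return correlation.getOrDefault(key(a, b), 0.0);              // 0 means uncorrelated
    }

    // Placeholder update: entities searched together in the same query become more
    // correlated, keeping the value inside [0, 1]. The real formulae are those of [22].
    public void coOccurrence(String a, String b, double increment) {
        correlation.put(key(a, b), Math.min(1.0, corr(a, b) + increment));
    }

    // Apply a chain of hedges, each given as a beta exponent (e.g. "very" = 2.0).
    public static double applyHedges(double value, double... betas) {
        for (double beta : betas) value = Math.pow(value, beta);
        return value;
    }

    public static void main(String[] args) {
        // "Cabernet has a dry taste" with value 0.8; "very dry" gives 0.8^2 = 0.64
        System.out.println(applyHedges(0.8, 2.0));
    }
}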

136 Vol. IV, No. II, pp Small-world behaviour and efficiency measures of a scale-free network From the dynamical nature of the FCN some important topological properties can be determined. In particular, the correlations time evolution plays a dominant role in this kind of insight. In this section some formal tools and some efficiency measures are presented. These have been adopted to numerically analyse the FNC evolution and the underlying fuzzy ontology. The study of the structural properties of complex systems underlying networks can be very important. For instance, the efficiency of communication and navigation over the Net is strongly related to the topological properties of the Internet and of the World Wide Web. The connectivity structure of a population (the set of social contacts) acts on the way ideas are diffused. Only very recently the increasing accessibility of databases of real networks on one side, and the availability of powerful computers on the other side, have made possible a series of empirical studies on the social networks properties. In their seminal work [23], Watts and Strogatz have shown that the connection topology of some real networks is neither completely regular nor completely random. These networks, named small-world networks [36], exhibit a high clustering coefficient (a measure of the connectedness of a network), like regular lattices, and small average distance between two generic points (small characteristic path length), like random graphs. Small average distance and high clustering are not all the common features of complex networks. Albert, Barabasi et al. [37] have studied P(k), the degree distribution of a network, and found that many large networks are scale-free, i.e., have a power-law degree distribution P (k) k γ. Watts and Strogatz have named these networks, that are somehow in between regular and random networks, small-worlds, in analogy with the small-world phenomenon, empirically observed in social systems more than 30 years ago [36]. The mathematical characterization of the small-world behaviour is based on the evaluation of two quantities, the characteristic path length L, measuring the typical separation between two generic nodes in the network and the clustering coefficient C, measuring the average cliquishness of a node. Small-world networks are highly clustered, like regular lattices, having small characteristic path lengths, like random graphs. Generic network is usually represented by a weighted graph G = (N, K), where N is a finite set of vertices and K (N N) are the edges connecting the nodes. The information related to G is described by both an adjacency matrix A M( N, {0, 1}) and by a weight matrix W M( N, R + ). Both the matrix A and W are symmetric. The entry a ij in A are 1 if there is an edge joining vertex i to vertex j, and 0 otherwise. The matrix W contains a weight w ij related to any edge a ij. If a ij = 0 then w ij =. If the condition w ij = 1 for any a ij = 1 is assumed the graph G corresponds to an unweighted relational network. In network analysis a very important quantity is the degree of a vertex, i.e., the number of incident with i N. The degree k(i) (N) of generic vertex i is defined as: k(i) = {(i, j) : (i, j) K} (4) Silvia Calegari, Fabio Farina 131

The average value of $k(i)$ is

$$\langle k(i) \rangle = \frac{\sum_{i} k(i)}{2\,|N|} \qquad (5)$$

The coefficient 2 in Equation (5) appears in the denominator because each link in $A$ is counted twice.

The shortest path length $d_{ij} : (N \times N) \to \mathbb{R}^+ \cup \{\infty\}$ between two vertices has to be calculated in order to define $L$. In unweighted social networks $d_{ij}$ corresponds to the geodesic distance between nodes and it is measured as the minimum number of edges traversed to get from a vertex $i$ to another vertex $j$. The distances $d_{ij}$ can be calculated using any all-to-all shortest path algorithm (e.g., the Floyd-Warshall algorithm), either for a weighted or for a relational network. Let us remember that, according to Definition 4, the weights in the network correspond to the fuzzy correlations joining two concepts in the concept network induced by the fuzzy ontology.

The characteristic path length $L$ of the graph $G$ is defined as the average of the shortest path lengths between two generic vertices:

$$L(G) = \frac{1}{|N|\,(|N|-1)} \sum_{i \neq j \in N} d_{ij} \qquad (6)$$

The definition given in Equation (6) is valid for a totally connected $G$, where at least one finite path connecting any couple of vertices exists. Otherwise, when from a node $i$ we cannot reach a node $j$, the distance is $d_{ij} = \infty$ and the sum in $L(G)$ diverges.

The clustering coefficient $C(G)$ is a measure depending on the connectivity of the subgraph $G_i$ induced by a generic node $i$ and its neighbours. Formally, the subgraph $G_i = (N_i, K_i)$ of a node $i \in N$ can be defined as the pair:

$$N_i = \{j \in N : (i, j) \in K\} \qquad (7)$$

$$K_i = \{(j, k) \in K : j \in N_i \wedge k \in N_i\} \qquad (8)$$

An upper bound on the cardinality of $K_i$ can be stated according to the following observation: if the degree of a given node is $k(i)$, following Equation (4), then $G_i$ has

$$|K_i| \leq \frac{k(i)\,(k(i) - 1)}{2} \qquad (9)$$

Let us stress that the subgraph $G_i$ does not contain the node $i$ itself. $G_i$ turns out to be useful in studying the connectivity of the neighbours of a node $i$ after the elimination of the node itself. The upper bound on the number of edges in a subgraph $G_i$ suggests considering the ratio of the actual number of edges in $G_i$ to the right-hand side of Equation (9). Formally this ratio is defined as:

$$C_{sub}(i) = \frac{2\,|K_i|}{k(i)\,(k(i) - 1)} \qquad (10)$$

The quantities $C_{sub}(i)$ are used to calculate the clustering coefficient $C(G)$ as their mean value:

$$C(G) = \frac{1}{|N|} \sum_{i \in N} C_{sub}(i) \qquad (11)$$
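A minimal Python sketch of how L(G) (Equation 6) and C(G) (Equations 7-11) could be computed on the unweighted adjacency matrix is given below; it uses a plain Floyd-Warshall pass, as mentioned above, and all names and the toy matrix are illustrative assumptions.

```python
import math

def floyd_warshall(dist):
    """All-to-all shortest paths; dist is a mutable |N| x |N| matrix."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def characteristic_path_length(A):
    """L(G), Equation (6); assumes the graph is totally connected."""
    n = len(A)
    dist = [[0 if i == j else (1 if A[i][j] else math.inf) for j in range(n)]
            for i in range(n)]
    floyd_warshall(dist)
    return sum(dist[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

def clustering_coefficient(A):
    """C(G), Equations (7)-(11): mean edge density of the neighbour subgraphs."""
    n = len(A)
    c_sub = []
    for i in range(n):
        neigh = [j for j in range(n) if j != i and A[i][j]]        # N_i, Eq. (7)
        k_i = len(neigh)
        if k_i < 2:
            c_sub.append(0.0)   # convention for nodes with fewer than 2 neighbours
            continue
        edges = sum(A[j][k] for j in neigh for k in neigh if j < k)  # |K_i|, Eq. (8)
        c_sub.append(2.0 * edges / (k_i * (k_i - 1)))                # Eq. (10)
    return sum(c_sub) / n                                            # Eq. (11)

# Toy example (invented): a small connected graph.
A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 0, 0]]
print(characteristic_path_length(A), clustering_coefficient(A))
```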

A network exhibits the small-world phenomenon if it is characterized by small values of $L$ and high values of the clustering coefficient $C$. Scale-free networks are usually identified, in addition to the power-law degree distribution $P(k)$ of the edges, by small values of both $L$ and $C$.

When studying the networks of real systems, such as the collaboration graph of actors or the links among WWW documents, the probability of encountering non-connected graphs is very high. The $L(G)$ and $C(G)$ formalism is clearly not suited to treat these situations. In such cases the alternative formalism proposed in [38] is much more effective, even in the case of disconnected networks. This approach defines two measures of efficiency which give well-posed characterizations of the mean path length and of the mean cliquishness of a node, respectively. To introduce the efficiency coefficients, it is necessary to consider the efficiency $\epsilon_{ij}$ of a generic pair of nodes $i$ and $j$. This quantity measures the speed of information propagation between a node $i$ and a node $j$; in particular, $\epsilon_{ij} = 1/d_{ij}$. With this definition, when there is no path in the graph between $i$ and $j$, $d_{ij} = \infty$ and consistently $\epsilon_{ij} = 0$. The global efficiency of the graph $G$ results to be:

$$E_{glob}(G) = \frac{\sum_{i \neq j \in G} \epsilon_{ij}}{|N|\,(|N|-1)} = \frac{1}{|N|\,(|N|-1)} \sum_{i \neq j \in G} \frac{1}{d_{ij}} \qquad (12)$$

and the local efficiency, in analogy with $C$, can be defined as the average efficiency of the local subgraphs:

$$E(G_i) = \frac{1}{k(i)\,(k(i)-1)} \sum_{l \neq m \in G_i} \frac{1}{d_{lm}} \qquad (13)$$

$$E_{loc}(G) = \frac{1}{|N|} \sum_{i \in G} E(G_i) \qquad (14)$$

where $G_i$, as previously defined, is the subgraph of the neighbours of $i$, which is composed of $k(i)$ nodes. The two definitions originally given in [24] have the important property that both the global and the local efficiency are normalized quantities, that is: $E_{glob}(G) \leq 1$ and $E_{loc}(G) \leq 1$. The conditions $E_{glob}(G) = 1$ and $E_{loc}(G) = 1$ hold in the case of a completely connected graph where the weight of the edges is a node-independent positive constant. In the efficiency-based formalism, the small-world phenomenon emerges for systems with high $E_{glob}$ (corresponding to low $L$) and high $E_{loc}$ (corresponding to high clustering $C$). Scale-free networks without small-world behaviour show high $E_{glob}$ and low $E_{loc}$.
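The following sketch shows one way Equations (12)-(14) might be computed on a weighted graph that may be disconnected, reusing a Floyd-Warshall style pass and mapping unreachable pairs to zero efficiency; the function names and the toy weights are assumptions made for illustration.

```python
import math

def shortest_paths(weights):
    """All-pairs shortest path lengths; 'weights' maps (i, j) -> distance."""
    nodes = sorted({x for e in weights for x in e})
    d = {(i, j): (0 if i == j else weights.get((i, j), weights.get((j, i), math.inf)))
         for i in nodes for j in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i, k] + d[k, j] < d[i, j]:
                    d[i, j] = d[i, k] + d[k, j]
    return nodes, d

def global_efficiency(weights):
    """E_glob(G), Equation (12): mean of 1/d_ij over ordered pairs i != j."""
    nodes, d = shortest_paths(weights)
    n = len(nodes)
    return sum(1.0 / d[i, j] for i in nodes for j in nodes
               if i != j and d[i, j] != math.inf) / (n * (n - 1))

def local_efficiency(weights):
    """E_loc(G), Equations (13)-(14): average efficiency of the neighbour subgraphs."""
    nodes = sorted({x for e in weights for x in e})
    adj = {i: {j for j in nodes if (i, j) in weights or (j, i) in weights} for i in nodes}
    total = 0.0
    for i in nodes:
        neigh = adj[i]
        k_i = len(neigh)
        if k_i < 2:
            continue  # a node with fewer than 2 neighbours contributes 0
        sub = {(a, b): w for (a, b), w in weights.items() if a in neigh and b in neigh}
        _, d = shortest_paths(sub) if sub else (list(neigh), {})
        eff = sum(1.0 / d[a, b] for a in neigh for b in neigh
                  if a != b and (a, b) in d and d[a, b] != math.inf)
        total += eff / (k_i * (k_i - 1))   # Equation (13)
    return total / len(nodes)              # Equation (14)

# Toy weighted (distance-like) graph, possibly disconnected (invented data).
weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 2.0, ("d", "e"): 1.0}
print(global_efficiency(weights), local_efficiency(weights))
```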

4 Information Retrieval Algorithm using Fuzzy Concept Network

In the last decade, there has been a rapid and wide development of the Internet, which has brought online an increasingly large amount of documents and textual information. The necessity of a better definition of the Information Retrieval System (IRS) emerged in order to retrieve the information considered pertinent to a user query. Information Retrieval is a discipline that involves the organization, storage, retrieval and display of information. IRSs are designed with the objective of providing references to documents which contain the information requested by the user [32]. In an IRS, problems arise when there is the need to handle the uncertainty and vagueness that appear in many different parts of the retrieval process. On one hand, an IRS is required to understand queries expressed in natural language. On the other hand, it needs to handle the uncertain representation of a document. In the literature, there are many models of IRS, classified into the following categories: boolean logic, vector space, probabilistic and fuzzy logic [39, 40]. However, both the efficiency and the effectiveness of these methods are not satisfactory [34]. Thus, other approaches have been proposed to directly handle the knowledge-based fuzzy information. In preliminary attempts the knowledge was represented by a concept matrix, whose elements identify relevance values among concepts [34]. Other more relevant approaches have added fuzzy types to object-oriented database systems [41].

As stated in Section 2, a crucial topic for semantic information handling is to face the trade-off between the proper definition of an object and its common-sense counterpart. The characteristic weights of the FCN are initially set by an expert of the domain. In particular, the expert sets the initial correlation values on the fuzzy ontology, and the fuzzy concept network construction procedure takes these as initial values for the links among the objects in $O$ (see Definition 4). From then on, the correlation values $F(o_i, o_j)$ are updated according to the queries (both selections and insertions) performed on the documents. Most of all, using the FCN gives the possibility to directly incorporate the semantics expressed in natural language into the graph spanning. This feature lets us intrinsically obtain fuzzy information retrieval algorithms without introducing fuzzification and defuzzification operators. Let us stress that this is possible because the fuzzy logic is directly embedded in the knowledge expressed by the fuzzy ontology.

An example of this kind of approach is presented in the following. The original crisp information retrieval algorithm taken into account has been successfully applied to support the creative processes of architects and interaction designers. More in detail, a new formalization of the algorithm adopted in the ATELIER project (see [22] and Section 5.1) is presented, including a brief step-by-step description. The FCN is involved in steps (1) and (4) in order to semantically enrich the results obtained. The algorithm input is the vector of the keywords in the query. The first step of the algorithm uses these keywords to locate the documents (e.g., stored in a relational database) containing them. The keyword vector is extended with all the new keywords related to each selected document. In step (1) the queries are extended by navigating the FCN recursively. For each keyword specified in the query, a depth-first visit is performed, stopping the expansion at a fixed level. In [22] this threshold was set to 3.
The edges whose $F(o_i, o_j)$ is 0 are excluded and the neighbour keywords are collected without repetitions.
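A hedged sketch of how this step (1) expansion could be implemented over the FCN of Definition 4 is shown below, before the overview of the whole algorithm in Figure 2; the depth limit of 3 follows [22], while the function names and the toy correlation data are assumptions made for illustration.

```python
def expand_keywords(F, keywords, max_depth=3):
    """Step (1): extend the query keywords by a depth-first visit of the concept
    network, stopping at a fixed depth and skipping edges with F(o_i, o_j) = 0.
    F is a dict mapping each concept to its {neighbour: correlation} map."""
    collected = set(keywords)              # neighbours collected without repetitions

    def visit(node, depth):
        if depth == max_depth:
            return
        for other, corr in F.get(node, {}).items():
            if corr > 0 and other not in collected:
                collected.add(other)
                visit(other, depth + 1)

    for kw in keywords:
        visit(kw, 0)
    return collected

# Toy correlations (invented): a people-portrait-face chain plus an isolated pair.
F = {
    "people":   {"portrait": 0.8, "man": 0.6},
    "portrait": {"people": 0.8, "face": 0.7},
    "face":     {"portrait": 0.7},
    "man":      {"people": 0.6},
    "sea":      {"landscape": 0.5},
}
print(expand_keywords(F, ["people"]))   # {'people', 'portrait', 'man', 'face'}
```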

FCN-IR Search ( kv : keyword vector )
1: FCN-based kv extension
2: kv pruning
3: kv-based documents extraction
4: FCN-based relevance calculation
return ranking of the documents

Figure 2: Information Retrieval Algorithm.

Usually the techniques of Information Retrieval simply extend queries with the parents and children of the concepts. In step (2) the final list of neighbouring entities is pruned by navigating the fuzzy ontology, namely the set of candidates is reduced by excluding the keywords that are not connected by a direct path in the taxonomy, i.e. that are not parents or children of the terms contained in the query. In the third phase the documents containing the resulting keywords are extracted from the knowledge base. In the last step, the FCN is used to calculate the relevance of the documents, thus arranging them in the desired order. In particular, thanks to the FCN characterising functions $F$ and $m$, the weight of each keyword $o_i$ in a selected document is determined according to the following equation:

$$w(o_i) = m(o_i)^{\beta_{o_i}} \prod_{o_j \in K,\ o_j \neq o_i} [F(o_i, o_j)]^{\beta_{o_i, o_j}} \qquad (15)$$

where $K$ is the set of the keywords obtained from step (3), and $\beta_{o_i, o_j} \in \mathbb{R}$ is a modifier value used to express the effects of concept modifiers (see [19] and Section 2.2 for details). The final score of a document is evaluated through a cosine distance among the weights of its keywords; this is done for normalisation purposes. The scores are finally sorted in order to obtain a ranking of the documents.

5 Test validation

This section is organised as follows: in the first part we introduce the environment used to experiment with the FCN, whereas in the second part the analytic study of the scale-free properties of these networks is given.

5.1 Description of the experiment

A creative learning environment is the context chosen to study the behaviour of fuzzy concept networks. In particular, the ATELIER (Architecture and Technologies for Inspirational Learning Environments) project has been examined. ATELIER is an EU-funded project that is part of the Disappearing Computer initiative. The aim of this project is to build a digitally enhanced environment supporting a creative learning process in architecture and interaction design education. The work of the students is supported by many kinds of devices (e.g., large displays, RFID technology, barcodes, ...) and a hyper-media database (HMDB) is used to store all the digital material produced.

An approach has been studied to help and support the creative practices of the students. To achieve this goal, an ontology-driven selection tool has been developed. In a recent contribution [22] it has been shown how dynamical fuzzy ontologies are suitable for this context. Every day the students create a very large number of documents and artifacts and they collect a lot of material (e.g., digital pictures, notes, videos, and so on). Thus, new concepts are produced during the life cycle of a project. Indeed, the ontology evolves in time and the necessity emerges of making the dynamical fuzzy ontology suited to the context taken into account. As presented in Section 2, the fuzzy ontology is an exhaustive approach to handle the knowledge-based fuzzy information. Furthermore, it emerges that the evolution of the fuzzy concept network is mainly driven by the keywords of the documents inserted in the HMDB and by the concepts written during the definition of a query by the users. The algorithm presented in Section 4 makes extensive use of these properties. The experiments consider all these aspects.

We have examined the trend of the fuzzy concept network in three different scenarios: the contribution of the keywords in the HMDB, the contribution of the concepts from the queries, and their combined effects. In the first case, 485 documents have been examined; each document has four keywords on average, and the resulting final correlations are 431. In the second case, 500 queries were performed by users aged from 20 to 60. For each query a user had the opportunity to include up to 5 different concepts, and each user had the possibility to semantically enrich their requests using the following list of concept modifiers: little, enough, moderately, quite, very, totally. In this experiment we have obtained 232 correlations and 32 new concepts which were introduced in the fuzzy ontology domain. In the last test we examined the two types of queries jointly: the keywords of the documents and the requests of the users. The number of induced correlations is 615, and new concepts are introduced as well.

5.2 Analytic Results of the experiments

During the construction of the fuzzy concept network some snapshots have been periodically dumped to file (one snapshot every 50 queries) to be analyzed. To obtain a graphical topological representation, a network analysis tool called AGNA has been used. AGNA (Applied Graph and Network Analysis) is a platform-independent application designed for scientists and researchers who employ specific mathematical methods, such as social network analysis, sociometry and sequential analysis. Specifically, AGNA can assist in the study of communication relations in groups, organizational analysis and team building, kinship relations or the laws of organization of animal behavior. The most recent version is AGNA 2.1.1 and it has been used to produce the following pictures of the fuzzy concept networks, starting from the weighted adjacency matrix defined in Section 3. The link color intensity is proportional to the function $F$ introduced in Definition 2, so the more marked lines correspond to $F$ values near to 1.
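As a small illustration of this bookkeeping, the sketch below shows one possible way to dump a snapshot of the FCN weight matrix every 50 queries so that it can be loaded into an external analysis tool; the file format, the counter logic and the names are assumptions for the example, not the procedure actually used in the experiments.

```python
import csv

SNAPSHOT_EVERY = 50          # one snapshot every 50 queries, as in the experiments
query_counter = 0

def maybe_dump_snapshot(F, concepts, path_prefix="fcn_snapshot"):
    """Write the weighted adjacency matrix of the FCN to a CSV file
    every SNAPSHOT_EVERY processed queries (illustrative format).
    F maps each concept to its {neighbour: correlation} map."""
    global query_counter
    query_counter += 1
    if query_counter % SNAPSHOT_EVERY != 0:
        return None
    path = f"{path_prefix}_{query_counter}.csv"
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([""] + concepts)
        for c_i in concepts:
            row = [F.get(c_i, {}).get(c_j, 0.0) for c_j in concepts]
            writer.writerow([c_i] + row)
    return path
```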

Because of the large number of concepts and correlations, the pictures in Figure 4 have the purpose of showing qualitatively the link distributions and the hub locations; little semantic information about the ontology can be effectively extracted from these pictures. Both the global and the local efficiency, as defined in Equation (12) and Equation (14), are calculated on each of these snapshots. The evolution of the efficiency measures is reported in Figures 3(a), 3(b) and 3(c). The solid line corresponds to the global efficiency while the dashed one is the local efficiency.

In Figure 3(a) the efficiency evolution for the HMDB is reported. The global efficiency becomes the dominant effect after an initial transient where the local efficiency dominates. So, we can deduce the emergence of a hub-connected fuzzy concept network. This consideration is graphically confirmed by the network reported in Figure 4(a). To increase readability we located the hubs on the borders of the figure. It can be seen clearly that the hub concepts are people, man, woman, hat, face and portrait. These central concepts have been isolated using the betweenness [42] sociometric measure. The high betweenness values of the hubs, with respect to the values of the other concepts, confirm the behaviour captured by $E_{glob}$. Indeed, the mean distance among the concepts is kept low thanks to these points appearing very frequently in the paths from and to all the other nodes. We want to stress that the global efficiency quantifies the presence of hubs in a given network, while the betweenness of the nodes gives a way to identify which of the concepts are actually hub points.

On the other hand, Figure 3(b) shows how the local efficiency in a fuzzy concept network built using the user queries is higher than its global counterpart. This suggests that the network topology lacks hubs. Furthermore, many nodes have roughly the same number of neighbours. A confirmation of this analysis is given by the fuzzy concept network reported in Figure 4(b). In this case the betweenness index for the concepts shows that no particular node appears with significantly higher frequency in the paths between the other nodes.

Finally, Figure 3(c) reports the effect on the network obtained from both the documents and the queries. In this combined test a total of about 1000 queries is taken into account. It is interesting how both the data from the HMDB and the user queries act in a non-linear way on the efficiency measures. The resulting fuzzy concept network shows a dominant $E_{glob}$ with respect to its $E_{loc}$ and some hubs emerge. In particular, Figure 4(c) highlights the fact that the hubs collect a large number of links coming from the other concepts. The betweenness index for this fuzzy concept network identifies one main hub, the concept people, with an extremely high value (5 times the mean value of the other hubs). The other principal hubs are woman, man, landscape, sea, portrait, red and fruit. In this case the Freeman General Coefficient evaluated on these indexes is slightly lower than in the HMDB case; this is due to the higher clustering among concepts (higher $E_{loc}$), see Table 1. The Freeman Coefficient is a function that consolidates node-level measures into a single value related to the properties of the whole network [42].
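A hedged sketch of this hub-identification step using a standard betweenness centrality routine is given below; networkx is only one possible tool here (the paper itself reports using AGNA), and correlations are converted to distances by taking 1 - F, which is an assumption of the example, not a choice documented in the paper.

```python
import networkx as nx

def hub_concepts(F, top=6):
    """Rank concepts by weighted betweenness centrality.
    F maps unordered concept pairs to fuzzy correlations in (0, 1]."""
    G = nx.Graph()
    for (a, b), corr in F.items():
        if corr > 0:
            # Betweenness treats edge weights as distances, so strong
            # correlations are mapped to short distances (assumption: 1 - corr).
            G.add_edge(a, b, distance=1.0 - corr + 1e-9)
    bc = nx.betweenness_centrality(G, weight="distance")
    return sorted(bc.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Toy correlations (invented) just to exercise the function.
F = {("people", "portrait"): 0.9, ("people", "man"): 0.8, ("people", "woman"): 0.8,
     ("portrait", "face"): 0.7, ("man", "hat"): 0.6, ("woman", "hat"): 0.5}
print(hub_concepts(F))
```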
The strength of these connections is much more marked than in the similar situation treated for the HMDB-induced fuzzy concept network. This means that the contribution of the queries reinforces the semantic correlations among the hub concepts.

Figure 3: (a) Efficiency measures for the concept network induced by the knowledge base. (b) Efficiency for the user queries. (c) Efficiency for the joined queries and knowledge base documents.

To confirm the scale-free nature of the hubs in the fuzzy concept networks, we analyzed the statistical distributions of $k(i)$ (see Equation (4)) reported in Figure 5. For Figures 5(a) and 5(c) the frequencies decrease according to a power law. This confirms the theoretical expectations very well. The user query distributions, in Figure 5(b), behave differently. Their high values of $E_{loc}$ imply, as already stated, a highly clustered structure.

Let us consider how $E_{loc}$ is related to other classical social network parameters such as the density and the weighted density [42]. The density reflects the connectedness of a given network with respect to its complete graph. It is easy to note that this criterion is quite similar to what is stated in Equation (14): we can consider $E_{loc}$ as a mean value of the densities evaluated locally in each node of the fuzzy concept network. Unexpectedly, the numerical results in Table 1 show an inversely proportional relation between $E_{loc}$ and the density; more investigations are required. The weighted density can be interpreted as a measure of the mean link weight, namely the mean semantic correlation (see Section 2) among the concepts in the fuzzy concept network. In Table 1 it can be seen that the weighted density values are higher for the systems exhibiting hubs. This is graphically confirmed by Figures 4(a) and 4(c), where the links among the concepts are more marked (i.e. more coloured lines correspond to stronger correlations).
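To illustrate how such a degree-distribution check could be carried out, the sketch below histograms k(i) and fits a line in log-log space; the least-squares fit and the toy degree sequence are assumptions made for the example, not the analysis pipeline used by the authors.

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Empirical P(k): relative frequency of each observed degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

def loglog_slope(pk):
    """Least-squares slope of log P(k) vs log k; for a power law P(k) ~ k^-gamma
    the slope is approximately -gamma."""
    pts = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den

# Toy degree sequence (invented): a few hubs and many low-degree concepts.
degrees = [1] * 60 + [2] * 25 + [3] * 8 + [5] * 4 + [12] * 2 + [20]
pk = degree_distribution(degrees)
print(pk)
print("estimated -gamma:", loglog_slope(pk))
```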

Figure 4: Topological views of the different fuzzy concept networks obtained with AGNA 2.1.1: (a) HMDB network. (b) User queries network. (c) Complete knowledge-base fuzzy concept network.
