Uncertainty in Data Integration and Dataspace Support Platforms


Anish Das Sarma (Yahoo! Research, anish@yahoo-inc.com), Xin Luna Dong (AT&T Labs-Research, lunadong@research.att.com), and Alon Y. Halevy (Google, halevy@google.com)

Abstract. Data integration has been an important area of research for several years. However, data integration systems suffer from one of the main drawbacks of database systems: the need to invest significant modeling effort upfront. Dataspace Support Platforms (DSSPs) envision a system that offers useful services on its data without any setup effort, and that improves with time in a pay-as-you-go fashion. We argue that in order to support DSSPs, the system needs to model uncertainty at its core. We describe the concepts of probabilistic mediated schemas and probabilistic mappings as enabling concepts for DSSPs.

1 Introduction

Data integration and exchange systems offer a uniform interface to a multitude of data sources and the ability to share data across multiple systems. These systems have recently enjoyed significant research and commercial success (Halevy et al, 2005, 2006b). Current data integration systems are essentially a natural extension of traditional database systems in that queries are specified in a structured form and data are modeled in one of the traditional data models (relational, XML). In addition, the data integration system has exact knowledge of how the data in the sources map to the schema used by the data integration system.

In this chapter we argue that as the scope of data integration applications broadens, such systems need to be able to model uncertainty at their core. Uncertainty can arise for multiple reasons in data integration.

First, the semantic mappings between the data sources and the mediated schema may be approximate. For example, in an application like Google Base (GoogleBase, 2005) that enables anyone to upload structured data, or when mapping millions of sources on the deep web (Madhavan et al, 2007), we cannot imagine specifying exact mappings. In some domains (e.g., bioinformatics), we do not necessarily even know what the exact mapping is. Second, data are often extracted from unstructured sources using information extraction techniques. Since these techniques are approximate, the data obtained from the sources may be uncertain. Third, if the intended users of the application are not necessarily familiar with schemata, or if the domain of the system is too broad to offer form-based query interfaces (such as web forms), we need to support keyword queries. Hence, another source of uncertainty is the transformation between keyword queries and a set of candidate structured queries. Finally, if the scope of the domain is very broad, there can even be uncertainty about the concepts in the mediated schema.

Another reason for data integration systems to model uncertainty is to support pay-as-you-go integration. Dataspace Support Platforms (Halevy et al, 2006a) envision data integration systems where sources are added with no effort and the system constantly evolves in a pay-as-you-go fashion to improve the quality of semantic mappings and query answering. This means that as the system evolves, there will be uncertainty about the semantic mappings to its sources, about its mediated schema, and even about the semantics of the queries posed to it.

This chapter describes some of the formal foundations for data integration with uncertainty. We define probabilistic schema mappings and probabilistic mediated schemas, and show how to answer queries in their presence. With these foundations, we show that it is possible to completely automatically bootstrap a pay-as-you-go integration system. This chapter is largely based on previous papers (Dong et al, 2007; Sarma et al, 2008); the proofs of the theorems we state and the experimental results validating some of our claims can be found there. We also place several other works on uncertainty in data integration in the context of the system we envision. In the next section, we begin by describing an architecture for a data integration system that incorporates uncertainty.

2 Overview of the System

This section describes the requirements on a data integration system that supports uncertainty and the overall architecture of such a system.

2.1 Uncertainty in data integration

A data integration system needs to handle uncertainty at four levels.

Uncertain mediated schema: The mediated schema is the set of schema terms in which queries are posed. These terms do not necessarily cover all the attributes appearing in the sources, but rather the aspects of the domain that the application builder wishes to expose to the users. Uncertainty in the mediated schema can arise for several reasons. First, as we describe in Section 4, if the mediated schema is automatically inferred from the data sources in a pay-as-you-go integration system, there will be some uncertainty about the results. Second, when domains get broad, there will be some uncertainty about how to model the domain. For example, if we model all the topics in Computer Science, there will be some uncertainty about the degree of overlap between different topics.

Uncertain schema mappings: Data integration systems rely on schema mappings to specify the semantic relationships between the data in the sources and the terms used in the mediated schema. However, schema mappings can be inaccurate. In many applications it is impossible to create and maintain precise mappings between data sources. This can be because the users are not skilled enough to provide precise mappings, such as in personal information management (Dong and Halevy, 2005); because people do not understand the domain well and thus do not even know what the correct mappings are, such as in bioinformatics; or because the scale of the data prevents generating and maintaining precise mappings, such as when integrating web-scale data (Madhavan et al, 2007). Hence, in practice, schema mappings are often generated by semi-automatic tools and are not necessarily verified by domain experts.

Uncertain data: By nature, data integration systems need to handle uncertain data. One reason for uncertainty is that data are often extracted from unstructured or semi-structured sources (e.g., HTML pages, emails, blogs) by automatic methods. A second reason is that data may come from sources that are unreliable or not up to date. For example, in enterprise settings, it is common for informational data such as gender, race, and income level to be dirty or missing, even when the transactional data is precise.

Uncertain queries: In some data integration applications, especially on the web, queries will be posed as keywords rather than as structured queries against a well-defined schema. The system needs to translate these queries into some structured form so they can be reformulated with respect to the data sources. At this step, the system may generate multiple candidate structured queries and have some uncertainty about which one captures the real intent of the user.

2.2 System architecture

Given the previously discussed requirements, we describe the architecture of a data integration system that manages uncertainty at its core. We describe the system by contrasting it to a traditional data integration system.

Fig. 1 Architecture of a data-integration system that handles uncertainty. (A query Q is first translated by keyword reformulation into candidate queries Q_1,...,Q_m on the mediated schema; query reformulation then produces queries Q_11,...,Q_1n, ..., Q_k1,...,Q_kn, which the query processor executes over the sources D_1,...,D_k.)

The first and most fundamental characteristic of this system is that it is based on a probabilistic data model. This means that we attach probabilities to: tuples that we process in the system, schema mappings, mediated schemas, and possible interpretations of keyword queries posed to the system. In contrast, a traditional data integration system includes a single mediated schema and a single (and presumably correct) schema mapping between the mediated schema and each source. The data in the sources is also assumed to be correct.

Traditional data integration systems assume that the query is posed in a structured fashion (i.e., can be translated to some subset of SQL). Here, we assume that queries can be posed as keywords (to accommodate a much broader class of users and applications). Hence, whereas traditional data integration systems begin by reformulating a query onto the schemas of the data sources, a data integration system with uncertainty needs to first reformulate a keyword query into a set of candidate structured queries. We refer to this step as keyword reformulation. Note that keyword reformulation is different from techniques for keyword search on structured data (e.g., (Hristidis and Papakonstantinou, 2002; Agrawal et al, 2002)) in that (a) it does not assume access to all the data in the sources or that the sources support keyword search, and (b) it tries to distinguish different structural elements in the query in order to pose more precise queries to the sources (e.g., realizing that in the keyword query "Chicago weather", weather is an attribute label and Chicago is an instance name). That being said, keyword reformulation should benefit from techniques that support answering keyword search on structured data.
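To make the keyword-reformulation step concrete, the following Python sketch enumerates role assignments for the keywords of a query and normalizes their scores into probabilities. It is purely illustrative: the role scores and all names in it (ROLE_SCORES, candidate_queries) are hypothetical and not part of any system described in this chapter.

    from itertools import product

    # Hypothetical per-keyword scores for each structural role; in practice
    # such scores could be learned from query logs and schema statistics.
    ROLE_SCORES = {
        "weather": {"attribute": 0.9, "instance": 0.1},
        "Chicago": {"attribute": 0.05, "instance": 0.95},
    }

    def candidate_queries(keywords):
        # Enumerate every assignment of roles to keywords; the score of an
        # assignment is the product of per-keyword role scores, normalized
        # into a probability over candidate structured queries.
        candidates = []
        for roles in product(("attribute", "instance"), repeat=len(keywords)):
            score = 1.0
            for kw, role in zip(keywords, roles):
                score *= ROLE_SCORES[kw][role]
            candidates.append((dict(zip(keywords, roles)), score))
        total = sum(score for _, score in candidates)
        return [(cand, score / total) for cand, score in candidates]

    for cand, prob in candidate_queries(["Chicago", "weather"]):
        print(prob, cand)

The most probable assignment (Chicago as an instance value, weather as an attribute label) corresponds to the structured interpretation discussed above.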

The query answering model is also different. Instead of necessarily finding all answers to a given query, our goal is typically to find the top-k answers and to rank these answers most effectively.

The final difference from traditional data integration systems is that our query processing will need to be more adaptive than usual. Instead of generating a query answering plan and executing it, the steps we take in query processing will depend on the results of previous steps. We note that adaptive query processing has been discussed quite a bit in the data integration literature, where the need for adaptivity arises from the fact that data sources may not answer as quickly as expected or that we may not have accurate statistics about their contents to properly order our operations. In our work, however, the goal of adaptivity is to obtain the answers with high probabilities faster.

The architecture of the system is shown in Figure 1. The system contains a number of data sources and a mediated schema (we omit probabilistic mediated schemas from this figure). When the user poses a query Q, which can be either a structured query on the mediated schema or a keyword query, the system returns a set of answer tuples, each with a probability. If Q is a keyword query, the system first performs keyword reformulation to translate it into a set of candidate structured queries on the mediated schema. Otherwise, the candidate query is Q itself.

2.3 Source of probabilities

A critical issue in any system that manages uncertainty is whether we have a reliable source of probabilities. Whereas obtaining reliable probabilities for such a system is one of the most interesting areas for future research, there is quite a bit to build on. For keyword reformulation, it is possible to train and test reformulators on large numbers of queries such that each reformulation result is given a probability based on its performance statistics. For information extraction, current techniques are often based on statistical machine learning methods and can be extended to compute probabilities of each extraction result. Finally, in the case of schema matching, it is standard practice for schema matchers to associate numbers with the candidates they propose (e.g., (Rahm and Bernstein, 2001; Berlin and Motro, 2002; Dhamankar et al, 2004; Do and Rahm, 2002; Doan et al, 2002; He and Chang, 2003; Kang and Naughton, 2003; Wang et al, 2004)). The issue here is that the numbers are meant only as a ranking mechanism rather than as true probabilities. However, as schema matching techniques start looking at larger numbers of schemas, one can imagine ascribing probabilities (or estimates thereof) to their measures.

2.4 Outline of the chapter

We begin by discussing probabilistic schema mappings in Section 3. We also discuss how to answer queries in their presence and how to answer top-k queries. In Section 4 we discuss probabilistic mediated schemas. We begin by motivating them and showing that in some cases they add expressive power to the resulting system. Then we describe an algorithm for generating probabilistic mediated schemas from a collection of data sources.

3 Uncertainty in Mappings

The key to resolving heterogeneity at the schema level is to specify schema mappings between data sources. These mappings describe the relationship between the contents of the different sources and are used to reformulate a query posed over one source (or a mediated schema) into queries over the sources that are deemed relevant. However, in many applications we are not able to provide all the schema mappings upfront. In this section we describe how we use probabilistic schema mappings (p-mappings, defined in Definition 3 in Chapter 3) to capture uncertainty about mappings between schemas.

We start by presenting a running example for this section that also motivates p-mappings (Section 3.1). Then, Section 3.2 formally defines p-mappings and their semantics in query answering. After that, Section 3.3 describes algorithms for query answering with respect to probabilistic mappings and discusses the complexity. Next, Section 3.4 shows how to leverage previous work on schema matching to automatically create probabilistic mappings. Finally, Section 3.5 briefly describes various extensions to the basic definition, and Section 3.6 describes other types of approximate schema mappings that have been proposed in the literature.

3.1 Motivating probabilistic mappings

Possible Mapping                                                       Prob
m_1 = {(pname, name), (email-addr, email),
       (current-addr, mailing-addr), (permanent-addr, home-addr)}      0.5
m_2 = {(pname, name), (email-addr, email),
       (permanent-addr, mailing-addr), (current-addr, home-addr)}      0.4
m_3 = {(pname, name), (email-addr, mailing-addr),
       (current-addr, home-addr)}                                      0.1
(a)

pname  email-addr  current-addr   permanent-addr
Alice  alice@      Mountain View  Sunnyvale
Bob    bob@        Sunnyvale      Sunnyvale
(b)

Tuple (mailing-addr)  Prob
('Sunnyvale')         0.9
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(c)

Fig. 2 The running example: (a) a probabilistic schema mapping between S and T; (b) a source instance D_S; (c) the answers of Q over D_S with respect to the probabilistic mapping.

Example 1. Consider a data source S, which describes a person by her email address, current address, and permanent address,

and the mediated schema T, which describes a person by her name, email, mailing address, home address, and office address:

S = (pname, email-addr, current-addr, permanent-addr)
T = (name, email, mailing-addr, home-addr, office-addr)

A semi-automatic schema-mapping tool may generate three possible mappings between S and T, assigning each a probability. Whereas the three mappings all map pname to name, they map the other attributes in the source and the target differently. Figure 2(a) describes the three mappings using sets of attribute correspondences. For example, mapping m_1 maps pname to name, email-addr to email, current-addr to mailing-addr, and permanent-addr to home-addr. Because of the uncertainty about which mapping is correct, we consider all of these mappings in query answering.

Suppose the system receives a query Q composed on the mediated schema and asking for people's mailing addresses:

Q: SELECT mailing-addr FROM T

Using the possible mappings, we can reformulate Q into different queries:

Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S

If the user requires all possible answers, the system generates a single aggregation query based on Q1, Q2 and Q3 to compute the probability of each returned tuple, and sends the query to the data source. Suppose the data source contains a table D_S as shown in Figure 2(b); the system will then retrieve four answer tuples, each with a probability, as shown in Figure 2(c).

If the user requires only the top-1 answer (i.e., the answer tuple with the highest probability), the system decides at runtime which reformulated queries to execute. For example, after executing Q1 and Q2 at the source, the system can already conclude that ('Sunnyvale') is the top-1 answer and can skip query Q3.
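As a concrete reading of this example, the following sketch (a minimal Python rendering, not the chapter's implementation) reformulates Q under each possible mapping and sums mapping probabilities per distinct answer, reproducing Figure 2(c).

    # D_S from Figure 2(b); each possible mapping is summarized here by the
    # source attribute it maps to mailing-addr, with its probability.
    SOURCE = [
        {"pname": "Alice", "email-addr": "alice@",
         "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
        {"pname": "Bob", "email-addr": "bob@",
         "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
    ]
    MAPPINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

    def by_table_answers(source, mappings):
        answers = {}
        for src_attr, pr in mappings:              # one reformulation per mapping
            for value in {t[src_attr] for t in source}:   # DISTINCT projection
                answers[value] = answers.get(value, 0.0) + pr
        return answers

    print(by_table_answers(SOURCE, MAPPINGS))
    # {'Mountain View': 0.5, 'Sunnyvale': 0.9, 'alice@': 0.1, 'bob@': 0.1}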

3.2 Definition and Semantics

3.2.1 Schema mappings and p-mappings

We begin by reviewing non-probabilistic schema mappings. The goal of a schema mapping is to specify the semantic relationships between a source schema and a target schema. We refer to the source schema as S̄, and to a relation in S̄ as S = ⟨s_1,...,s_m⟩. Similarly, we refer to the target schema as T̄, and to a relation in T̄ as T = ⟨t_1,...,t_n⟩.

We consider a limited form of schema mappings that are also referred to as schema matchings in the literature. Specifically, a schema matching contains a set of attribute correspondences. An attribute correspondence is of the form c_ij = (s_i, t_j), where s_i is a source attribute in the schema S and t_j is a target attribute in the schema T. Intuitively, c_ij specifies that there is a relationship between s_i and t_j. In practice, a correspondence also involves a function that transforms the value of s_i into the value of t_j. For example, the correspondence (c-degree, temperature) can be specified as temperature = c-degree * 1.8 + 32, describing a transformation from Celsius to Fahrenheit. These functions are irrelevant to our discussion, and therefore we omit them. This class of mappings is quite common in practice and already exposes many of the novel issues involved in probabilistic mappings. In Section 3.5 we briefly discuss extensions to a broader class of mappings.

Formally, relation mappings and schema mappings are defined as follows.

Definition 1 (Schema Mapping). Let S̄ and T̄ be relational schemas. A relation mapping M is a triple (S, T, m), where S is a relation in S̄, T is a relation in T̄, and m is a set of attribute correspondences between S and T.

When each source and target attribute occurs in at most one correspondence in m, we call M a one-to-one relation mapping. A schema mapping M̄ is a set of one-to-one relation mappings between relations in S̄ and in T̄, where every relation in either S̄ or T̄ appears at most once.

A pair of instances D_S and D_T satisfies a relation mapping m if for every source tuple t_s ∈ D_S there exists a target tuple t_t ∈ D_T such that for every attribute correspondence (s,t) ∈ m, the value of attribute s in t_s is the same as the value of attribute t in t_t.

Example 2. Consider the mappings in Example 1. The source database in Figure 2(b) (repeated in Figure 3(a)) and the target database in Figure 3(b) satisfy m_1.

Intuitively, a probabilistic schema mapping describes a probability distribution over a set of possible schema mappings between a source schema and a target schema. For completeness, we repeat its definition as follows (see also Definition 3 in Chapter 3).

Definition 2 (Probabilistic Mapping). Let S̄ and T̄ be relational schemas. A probabilistic mapping (p-mapping), pm, is a triple (S, T, m̄), where S ∈ S̄, T ∈ T̄, and m̄ is a set {(m_1, Pr(m_1)),...,(m_l, Pr(m_l))}, such that

- for i ∈ [1,l], m_i is a one-to-one mapping between S and T, and for every i, j ∈ [1,l], i ≠ j implies m_i ≠ m_j;
- Pr(m_i) ∈ [0,1] and Σ_{i=1}^l Pr(m_i) = 1.

A schema p-mapping, pM, is a set of p-mappings between relations in S̄ and in T̄, where every relation in either S̄ or T̄ appears in at most one p-mapping.

We refer to a non-probabilistic mapping as an ordinary mapping. A schema p-mapping may contain both p-mappings and ordinary mappings. Example 1 shows a p-mapping (see Figure 2(a)) that contains three possible mappings.

pname  email-addr  current-addr   permanent-addr
Alice  alice@      Mountain View  Sunnyvale
Bob    bob@        Sunnyvale      Sunnyvale
(a)

name   email   mailing-addr   home-addr  office-addr
Alice  alice@  Mountain View  Sunnyvale  office
Bob    bob@    Sunnyvale      Sunnyvale  office
(b)

name   email   mailing-addr  home-addr      office-addr
Alice  alice@  Sunnyvale     Mountain View  office
Bob    email   bob@          Sunnyvale      office
(c)

Tuple (mailing-addr)  Prob
('Sunnyvale')         0.9
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(d)

Tuple (mailing-addr)  Prob
('Sunnyvale')         0.94
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(e)

Fig. 3 Example 3: (a) a source instance D_S; (b) a target instance that is by-table consistent with D_S and m_1; (c) a target instance that is by-tuple consistent with D_S and <m_2, m_3>; (d) Q^table(D_S); (e) Q^tuple(D_S).

3.2.2 Semantics of probabilistic mappings

Intuitively, a probabilistic schema mapping models the uncertainty about which of the mappings in pm is the correct one. When a schema matching system produces a set of candidate matches, there are two ways to interpret the uncertainty: (1) a single mapping in pm is the correct one and it applies to all the data in S, or (2) several mappings are partially correct and each is suitable for a subset of the tuples in S, though it is not known which mapping is the right one for a specific tuple. Figure 3(b) illustrates the first interpretation and applies mapping m_1. For the same example, the second interpretation is equally valid: some people may choose to use their current address as their mailing address while others use their permanent address; thus, for different tuples we may apply different mappings, and the correct mapping depends on the particular tuple.

We define query answering under both interpretations. The first interpretation is referred to as the by-table semantics and the second as the by-tuple semantics of probabilistic mappings. Note that one cannot argue for one interpretation over the other; the needs of the application should dictate the appropriate semantics. Furthermore, the complexity results for query answering, which will show advantages of by-table semantics, should not be taken as an argument in favor of by-table semantics.

We next define query answering with respect to p-mappings in detail; the definitions for schema p-mappings are the obvious extensions. Recall that given a query and an ordinary mapping, we can compute the certain answers to the query with respect to the mapping.

Query answering with respect to p-mappings is defined as a natural extension of certain answers, which we now review.

A mapping defines a relationship between instances of S and instances of T that are consistent with the mapping.

Definition 3 (Consistent Target Instance). Let M = (S, T, m) be a relation mapping and D_S be an instance of S. An instance D_T of T is said to be consistent with D_S and M, if for each tuple t_s ∈ D_S, there exists a tuple t_t ∈ D_T, such that for every attribute correspondence (a_s, a_t) ∈ m, the value of a_s in t_s is the same as the value of a_t in t_t.

For a relation mapping M and a source instance D_S, there can be an infinite number of target instances that are consistent with D_S and M. We denote by Tar_M(D_S) the set of all such target instances. The set of answers to a query Q is the intersection of the answers on all instances in Tar_M(D_S).

Definition 4 (Certain Answer). Let M = (S, T, m) be a relation mapping. Let Q be a query over T and let D_S be an instance of S. A tuple t is said to be a certain answer of Q with respect to D_S and M, if for every instance D_T ∈ Tar_M(D_S), t ∈ Q(D_T).

By-table semantics: We now generalize these notions to the probabilistic setting, beginning with the by-table semantics. Intuitively, a p-mapping pm describes a set of possible worlds, each with a possible mapping m ∈ m̄. In by-table semantics, a source table can fall in one of the possible worlds; that is, the possible mapping associated with that possible world applies to the whole source table. Following this intuition, we define target instances that are consistent with the source instance.

Definition 5 (By-table Consistent Instance). Let pm = (S, T, m̄) be a p-mapping and D_S be an instance of S. An instance D_T of T is said to be by-table consistent with D_S and pm, if there exists a mapping m ∈ m̄ such that D_S and D_T satisfy m.

Given a source instance D_S and a possible mapping m ∈ m̄, there can be an infinite number of target instances that are consistent with D_S and m. We denote by Tar_m(D_S) the set of all such instances.

In the probabilistic context, we assign a probability to every answer. Intuitively, we consider the certain answers with respect to each possible mapping in isolation. The probability of an answer t is the sum of the probabilities of the mappings for which t is deemed to be a certain answer. We define by-table answers as follows:

Definition 6 (By-table Answer). Let pm = (S, T, m̄) be a p-mapping. Let Q be a query over T and let D_S be an instance of S. Let t be a tuple. Let m̄(t) be the subset of m̄ such that for each m ∈ m̄(t) and for each D_T ∈ Tar_m(D_S), t ∈ Q(D_T). Let p = Σ_{m ∈ m̄(t)} Pr(m). If p > 0, then we say (t, p) is a by-table answer of Q with respect to D_S and pm.

By-tuple semantics: Following the possible-world notion, in by-tuple semantics different tuples in a source table can fall in different possible worlds; that is, different possible mappings associated with those possible worlds can apply to the different source tuples.

Formally, the key difference between the definition of by-tuple semantics and that of by-table semantics is that a consistent target instance is defined by a mapping sequence that assigns a (possibly different) mapping in m̄ to each source tuple in D_S. (Without loss of generality, in order to compare between such sequences, we assign some order to the tuples in the instance.)

Definition 7 (By-tuple Consistent Instance). Let pm = (S, T, m̄) be a p-mapping and let D_S be an instance of S with d tuples. An instance D_T of T is said to be by-tuple consistent with D_S and pm, if there is a sequence ⟨m^1,...,m^d⟩ such that d is the number of tuples in D_S, and for every 1 ≤ i ≤ d, m^i ∈ m̄, and for the i-th tuple t_i of D_S there exists a target tuple t'_i ∈ D_T such that for each attribute correspondence (a_s, a_t) ∈ m^i, the value of a_s in t_i is the same as the value of a_t in t'_i.

Given a mapping sequence seq = ⟨m^1,...,m^d⟩, we denote by Tar_seq(D_S) the set of all target instances that are consistent with D_S and seq. Note that if D_T is by-table consistent with D_S and m, then D_T is also by-tuple consistent with D_S and a mapping sequence in which each mapping is m.

We can think of every sequence of mappings seq = ⟨m^1,...,m^d⟩ as a separate event whose probability is Pr(seq) = Π_{i=1}^d Pr(m^i). (Section 3.5 relaxes this independence assumption and introduces conditional mappings.) If there are l mappings in pm, then there are l^d sequences of length d, and their probabilities add up to 1. We denote by seq_d(pm) the set of mapping sequences of length d generated from pm.

Definition 8 (By-tuple Answer). Let pm = (S, T, m̄) be a p-mapping. Let Q be a query over T and D_S be an instance of S with d tuples. Let t be a tuple. Let seq(t) be the subset of seq_d(pm), such that for each seq ∈ seq(t) and for each D_T ∈ Tar_seq(D_S), t ∈ Q(D_T). Let p = Σ_{seq ∈ seq(t)} Pr(seq). If p > 0, we call (t, p) a by-tuple answer of Q with respect to D_S and pm.

The set of by-table answers for Q with respect to D_S is denoted by Q^table(D_S) and the set of by-tuple answers for Q with respect to D_S is denoted by Q^tuple(D_S).
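The following sketch (again an illustrative Python rendering, with the same encoding of the p-mapping as in the earlier by-table sketch) computes by-tuple answers exactly as Definition 8 prescribes, by enumerating all l^d mapping sequences; the exponential cost of this enumeration is the subject of Section 3.3.

    from itertools import product

    SOURCE = [
        {"current-addr": "Mountain View", "permanent-addr": "Sunnyvale",
         "email-addr": "alice@"},
        {"current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale",
         "email-addr": "bob@"},
    ]
    MAPPINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

    def by_tuple_answers(source, mappings):
        answers = {}
        # One mapping choice per source tuple: l^d sequences for d tuples.
        for seq in product(mappings, repeat=len(source)):
            pr_seq = 1.0
            values = set()
            for tup, (src_attr, pr) in zip(source, seq):
                pr_seq *= pr
                values.add(tup[src_attr])    # mailing-addr under this mapping
            for v in values:                 # v is an answer under this sequence
                answers[v] = answers.get(v, 0.0) + pr_seq
        return answers

    print(by_tuple_answers(SOURCE, MAPPINGS))
    # ('Sunnyvale') now gets probability 0.94, as in Figure 3(e)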

Example 3. Consider the p-mapping pm, the source instance D_S, and the query Q in the motivating example.

In by-table semantics, Figure 3(b) shows a target instance that is consistent with D_S (repeated in Figure 3(a)) and the possible mapping m_1. Figure 3(d) shows the by-table answers of Q with respect to D_S and pm. As an example, for tuple t = ('Sunnyvale') we have m̄(t) = {m_1, m_2}, so the possible tuple ('Sunnyvale', 0.9) is an answer.

In by-tuple semantics, Figure 3(c) shows a target instance that is by-tuple consistent with D_S and the mapping sequence <m_2, m_3>. Figure 3(e) shows the by-tuple answers of Q with respect to D_S and pm. Note that the probability of the tuple t = ('Sunnyvale') in the by-table answers is different from that in the by-tuple answers. We describe how to compute the probabilities in detail in the next section.

3.3 Query Answering

This section studies query answering in the presence of probabilistic mappings. We start by describing algorithms for returning all answer tuples with their probabilities, and discuss the complexity of query answering in terms of the size of the data (data complexity) and the size of the p-mapping (mapping complexity). We then consider returning the top-k query answers, which are the k answer tuples with the highest probabilities, and answering aggregate queries.

3.3.1 By-table query answering

In the case of by-table semantics, answering queries is conceptually simple. Given a p-mapping pm = (S, T, m̄) and an SPJ query Q, we can compute the certain answers of Q under each of the mappings m ∈ m̄. We attach the probability Pr(m) to every certain answer under m. If a tuple is an answer to Q under multiple mappings in m̄, then we add up the probabilities of the different mappings.

Algorithm BYTABLE takes as input an SPJ query Q that mentions the relations T_1,...,T_l in the FROM clause. Assume that we have the p-mapping pm_i associated with each table T_i. The algorithm proceeds as follows.

Step 1: We generate the possible reformulations of Q (a reformulation query computes all certain answers when executed on the source data) by considering every combination of the form (m_1,...,m_l), where m_i is one of the possible mappings in pm_i. Denote the set of reformulations by Q'_1,...,Q'_k. The probability of a reformulation Q' = (m_1,...,m_l) is Π_{i=1}^l Pr(m_i).

Step 2: For each reformulation Q', retrieve each of the unique answers from the sources. For each answer obtained by Q'_1 ∪ ... ∪ Q'_k, its probability is computed by summing the probabilities of the Q''s in which it is returned.

Importantly, note that it is possible to express both steps as an SQL query with grouping and aggregation. Therefore, if the underlying sources support SQL, we can leverage their optimizations to compute the answers.
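A minimal Python sketch of the two steps of BYTABLE, with the evaluation of a reformulation at the sources abstracted behind a callback (the evaluate interface is an assumption for illustration, not part of the chapter):

    from itertools import product

    def bytable(evaluate, p_mappings):
        # p_mappings: for each table T_i in the FROM clause, a list of
        # (mapping, Pr) pairs. Step 1: every combination of per-table
        # mappings yields one reformulation whose probability is the
        # product of the chosen mappings' probabilities.
        answers = {}
        for combo in product(*p_mappings):
            pr = 1.0
            for _, p in combo:
                pr *= p
            # Step 2: evaluate the reformulation at the sources and add
            # its probability to every answer tuple it returns.
            for t in evaluate([m for m, _ in combo]):
                answers[t] = answers.get(t, 0.0) + pr
        return answers

As noted above, when the sources support SQL both steps can instead be folded into a single query with grouping and aggregation, as Example 5 below illustrates.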

With our restricted form of schema mapping, the algorithm takes time polynomial in the size of the data and the mappings. We thus have the following complexity result.

Theorem 1. Let pM be a schema p-mapping and let Q be an SPJ query. Answering Q with respect to pM in by-table semantics is in PTIME in the size of the data and the mapping.

3.3.2 By-tuple query answering

Tuple (mailing-addr)  Pr
('Sunnyvale')         0.94
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(a)

Tuple (mailing-addr)  Pr
('Sunnyvale')         0.8
('Mountain View')     0.8
(b)

Fig. 4 Example 4: (a) Q_1^tuple(D) and (b) Q_2^tuple(D).

To extend the by-table query-answering strategy to by-tuple semantics, we would need to compute the certain answers for every mapping sequence generated by pm. However, the number of such mapping sequences is exponential in the size of the input data. The following example shows that for certain queries this exponential time complexity is inevitable.

Example 4. Suppose that in addition to the tables in Example 1, we also have U(city) in the source and V(hightech) in the target. The p-mapping for V contains two possible mappings: ({(city, hightech)}, 0.8) and (∅, 0.2).

Consider the following query Q, which decides whether there are any people living in a high-tech city.

Q: SELECT 'true'
   FROM T, V
   WHERE T.mailing-addr = V.hightech

An incorrect way of answering the query is to first execute the following two sub-queries Q1 and Q2, then join the answers of Q1 and Q2 and sum up the probabilities.

Q1: SELECT mailing-addr FROM T
Q2: SELECT hightech FROM V

Now consider the source instance D, where D_S is shown in Figure 2(b) and D_U has two tuples ('Mountain View') and ('Sunnyvale'). Figures 4(a) and 4(b) show Q_1^tuple(D) and Q_2^tuple(D). If we join the results of Q1 and Q2, we obtain for the 'true' tuple the probability 0.94 * 0.8 + 0.5 * 0.8 = 1.152. However, this is incorrect.

By enumerating all consistent target tables, we in fact compute 0.864 as the probability. The reason for this error is that on some target instances that are by-tuple consistent with the source instance, the answers to both Q1 and Q2 contain the tuple ('Sunnyvale') and the tuple ('Mountain View'). Thus, generating the tuple ('Sunnyvale') as an answer for both Q1 and Q2 and generating the tuple ('Mountain View') for both queries are not independent events, and so simply adding up their probabilities leads to incorrect results.

Indeed, it is not clear if there exists a better algorithm to answer Q than enumerating all by-tuple consistent target instances and answering Q on each of them. In fact, it has been proved that, in general, answering SPJ queries in by-tuple semantics with respect to schema p-mappings is hard.

Theorem 2. Let Q be an SPJ query and let pM be a schema p-mapping. The problem of finding the probability for a by-tuple answer to Q with respect to pM is #P-complete with respect to data complexity and is in PTIME with respect to mapping complexity.

Recall that #P is the complexity class of certain hard counting problems (e.g., counting the number of variable assignments that satisfy a Boolean formula). It is believed that a #P-complete problem cannot be solved in polynomial time unless P = NP.

Although by-tuple query answering is hard in general, there are two restricted but common classes of queries for which by-tuple query answering takes polynomial time. The first class of queries are those that include only a single subgoal that is the target of a p-mapping; here, we refer to an occurrence of a table in the FROM clause of a query as a subgoal of the query. Relations in the other subgoals are either involved in ordinary mappings or do not require a mapping. Hence, if we have uncertainty with respect to only one part of the domain, our queries will typically fall in this class. The second class of queries can include multiple subgoals involved in p-mappings, but must return the join attributes for such subgoals. We next illustrate these two classes of queries and query answering for them using two examples.

Example 5. Consider rewriting Q in the motivating example, repeated as follows:

Q: SELECT mailing-addr FROM T

To answer the query, we first rewrite Q into query Q' by adding the id column:

Q': SELECT id, mailing-addr FROM T

We then invoke BYTABLE and generate the following SQL query to compute by-table answers for Q':

Qa: SELECT id, mailing-addr, SUM(pr)
    FROM (
      SELECT DISTINCT id, current-addr AS mailing-addr, 0.5 AS pr
      FROM S

      UNION ALL
      SELECT DISTINCT id, permanent-addr AS mailing-addr, 0.4 AS pr
      FROM S
      UNION ALL
      SELECT DISTINCT id, email-addr AS mailing-addr, 0.1 AS pr
      FROM S)
    GROUP BY id, mailing-addr

Finally, we generate the results using the following query:

Qu: SELECT mailing-addr, NOR(pr) AS pr
    FROM Qa
    GROUP BY mailing-addr

where for a set of probabilities pr_1,...,pr_n, NOR computes 1 - Π_{i=1}^n (1 - pr_i).

Example 6. Consider the schema p-mapping in Example 4. If we revise Q slightly by returning the join attribute, as shown below, we can answer the query in polynomial time.

Q': SELECT V.hightech
    FROM T, V
    WHERE T.mailing-addr = V.hightech

We answer the query by dividing it into two sub-queries, Q1 and Q2, as shown in Example 4. We can compute Q1 with the query Qu (shown in Example 5) and compute Q2 similarly with a query Qu'. We then compute the by-tuple answers of Q' as follows:

SELECT Qu'.hightech, Qu.pr * Qu'.pr
FROM Qu, Qu'
WHERE Qu.mailing-addr = Qu'.hightech

3.3.3 Top-K Query Answering

The main challenge in designing an algorithm for returning top-k query answers is to perform only the necessary reformulations at every step and halt when the top-k answers are found. We focus on top-k query answering for by-table semantics; the algorithm can be modified for by-tuple semantics.

Recall that in by-table query answering, the probability of an answer is the sum of the probabilities of the reformulated queries that generate the answer. Our goal is to reduce the number of reformulated queries we execute. The algorithm we describe next proceeds in a greedy fashion: it executes queries in descending order of probabilities. For each tuple t, it maintains an upper bound p_max(t) and a lower bound p_min(t) on its probability.

This process halts when it finds k tuples whose p_min values are higher than the p_max of the rest of the tuples.

TOPKBYTABLE takes as input an SPJ query Q, a schema p-mapping pM, an instance D_S of the source schema, and an integer k, and outputs the top-k answers in Q^table(D_S). The algorithm proceeds in three steps.

Step 1: Rewrite Q according to pM into a set of queries Q_1,...,Q_n, each with a probability assigned in a similar way as in Algorithm BYTABLE.

Step 2: Execute Q_1,...,Q_n in descending order of their probabilities. Maintain the following measures:

- The highest possible probability, PMax, for the tuples that have not been generated yet. We initialize PMax to 1; after executing query Q_i and updating the list of answers (see the third bullet), we decrease PMax by Pr(Q_i).
- The threshold th determining which answers are potentially in the top-k. We initialize th to 0; after executing Q_i and updating the answer list, we set th to the k-th largest p_min for tuples in the answer list.
- A list L of answers whose p_max is no less than th, with bounds p_min and p_max for each answer in L. After executing query Q_i, we update the list as follows: (1) for each t ∈ L with t ∈ Q_i(D_S), we increase p_min(t) by Pr(Q_i); (2) for each t ∈ L with t ∉ Q_i(D_S), we decrease p_max(t) by Pr(Q_i); (3) if PMax ≥ th, then for each t ∉ L with t ∈ Q_i(D_S), we insert t into L, set p_min(t) to Pr(Q_i), and set p_max(t) to PMax.
- A list T of the k tuples with top p_min values.

Step 3: When th > PMax and for each t ∉ T, th > p_max(t), halt and return T.

Example 7. Consider Example 1, where we seek the top-1 answer. We answer the reformulated queries in the order Q1, Q2, Q3. After answering Q1, for tuple ('Sunnyvale') we have p_min = 0.5 and p_max = 1, and for tuple ('Mountain View') we have the same bounds. In addition, PMax = 0.5 and th = 0.5.

In the second round, we answer Q2. Then, for tuple ('Sunnyvale') we have p_min = 0.9 and p_max = 1, and for tuple ('Mountain View') we have p_min = 0.5 and p_max = 0.6. Now PMax = 0.1 and th = 0.9. Because th > PMax and th is above the p_max of the ('Mountain View') tuple, we can halt and return ('Sunnyvale') as the top-1 answer.
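The following Python sketch captures TOPKBYTABLE's bound maintenance in a slightly simplified form (it always inserts newly seen tuples rather than pruning by th; the execute functions and their interface are assumptions for illustration):

    def top_k_by_table(reformulations, k):
        # reformulations: list of (execute_fn, Pr) pairs; execute_fn() runs one
        # reformulated query at the sources and returns its answer tuples.
        reformulations = sorted(reformulations, key=lambda qp: -qp[1])
        p_min, p_max = {}, {}
        pmax_unseen = 1.0                  # PMax: bound for not-yet-seen tuples
        for execute, pr in reformulations:
            answers = set(execute())
            for t in answers:
                if t not in p_min:         # first time t is generated
                    p_min[t], p_max[t] = 0.0, pmax_unseen
                p_min[t] += pr
            for t in set(p_min) - answers: # t missed this reformulation
                p_max[t] -= pr
            pmax_unseen -= pr
            top = sorted(p_min, key=p_min.get, reverse=True)[:k]
            if len(top) == k:
                th = p_min[top[-1]]        # k-th largest lower bound
                rest = [p_max[t] for t in p_min if t not in top]
                if th > pmax_unseen and all(th > pm for pm in rest):
                    return top             # halt early: top-k determined
        return sorted(p_min, key=p_min.get, reverse=True)[:k]

On the reformulations of Example 7 this returns ('Sunnyvale') after executing only Q1 and Q2.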

3.3.4 Answering aggregate queries

Finally, we discuss queries with the aggregate operators COUNT, SUM, AVG, MAX, and MIN, based on results from (Gal et al, 2009). We consider three common semantics for aggregate queries over probabilistic information: the range semantics returns the range of the aggregate (i.e., the minimum and the maximum value); the expected-value semantics returns the expected value of the aggregate; and the distribution semantics returns all possible values with their probabilities. Note that the answers under the former two semantics can be derived from the answer under the distribution semantics; in other words, the distribution semantics is the richest of the three. We next formally define the three semantics.

Definition 9 (Semantics of Aggregate Query). Let pm = (S, T, m̄) be a p-mapping, Q be an aggregate query over T, and D_S be an instance of S. Let V be the set of result values of evaluating Q on D_S w.r.t. pm under by-table (resp. by-tuple) semantics, and let Pr(v) be the probability of value v ∈ V.

1. Range semantics: The result is the interval [min(V), max(V)].
2. Expected-value semantics: The result is Σ_{v ∈ V} Pr(v) * v.
3. Distribution semantics: The result is a random variable X such that for each distinct value v ∈ V, Pr(X = v) = Pr(v).

According to the definition, there are six combinations of semantics for aggregate queries w.r.t. a p-mapping. Since results under the range or expected-value semantics can be derived from results under the distribution semantics in polynomial time, the complexity of query answering w.r.t. the former two semantics is no higher than that w.r.t. the distribution semantics; in fact, in some cases query answering w.r.t. the former two semantics can be performed more efficiently without obtaining the distribution. Table 1 summarizes the complexity results for each aggregate operator; we now explain them briefly.

Table 1 Complexity of answering aggregate queries under different semantics.

Semantics  Operator                   Range  Expected-Value  Distribution
By-table   COUNT, SUM, AVG, MIN, MAX  PTIME  PTIME           PTIME
By-tuple   COUNT                      PTIME  PTIME           PTIME
By-tuple   SUM                        PTIME  PTIME           ?
By-tuple   AVG, MIN, MAX              PTIME  ?               ?

In by-table query answering, we can enumerate all answers and compute their probabilities in polynomial time; thus, query answering is in PTIME for all semantics. In by-tuple query answering, we can enumerate all answers (without computing their probabilities) in polynomial time; thus, query answering under the range semantics (where we do not need to know the probability of each answer) is in PTIME. We can prove that under the expected-value semantics the answer for the SUM operator is the same under by-table and by-tuple semantics; thus, query answering for SUM under the by-tuple and expected-value semantics is in PTIME. For the COUNT operator, even query answering under the by-tuple semantics is in PTIME for the distribution semantics, and thus also for the other semantics; we illustrate this in the example below. For the rest of the combinations, we conjecture that query answering cannot be done in polynomial time, and its exact complexity remains open.
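As a direct reading of Definition 9, the following illustrative Python sketch derives the range and expected-value semantics from a distribution of result values; the distribution used is the COUNT distribution that Example 8 below computes.

    def range_semantics(dist):
        # dist maps each result value v to Pr(v)
        return (min(dist), max(dist))

    def expected_value_semantics(dist):
        return sum(p * v for v, p in dist.items())

    # The distribution semantics is dist itself, read as a random variable X
    # with Pr(X = v) = dist[v].
    dist = {0: 0.06, 1: 0.58, 2: 0.36}        # COUNT distribution of Example 8
    print(range_semantics(dist))               # (0, 2)
    print(expected_value_semantics(dist))      # 1.3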

Example 8. Continuing with the running example (Figure 2), consider the following query.

Qc: SELECT COUNT(*) FROM T
    WHERE mailing-addr = 'Sunnyvale'

Table 2 shows how the probability of each answer value changes as we process each source tuple under the by-tuple semantics.

Table 2 Trace of query answering in Example 8.

TupleID  COUNT = 0  COUNT = 1  COUNT = 2
1        0.6        0.4        -
2        0.06       0.58       0.36

After processing the first tuple (Alice), the probability of COUNT = 0 is 0.5 + 0.1 = 0.6 (m_1, m_3) and that of COUNT = 1 is 0.4 (m_2). After processing the second tuple (Bob), the probability of COUNT = 0 is the probability that COUNT = 0 after the first tuple times the probability of applying m_3 to the second tuple, i.e., 0.6 * 0.1 = 0.06. The probability of COUNT = 1 is 0.6 * (0.5 + 0.4) + 0.4 * 0.1 = 0.58; that is, either COUNT = 0 after we process the first tuple and we apply m_1 or m_2 to the second tuple, or COUNT = 1 after the first tuple and we apply m_3 to the second tuple. Similarly, the probability of COUNT = 2 is 0.4 * (0.5 + 0.4) = 0.36.
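The polynomial-time behavior of COUNT comes from the fact that the Table 2 trace is a dynamic program over the running count. A minimal sketch, using the same illustrative encoding of the p-mapping as in the earlier sketches:

    def count_distribution(source, mappings, predicate):
        # dist[c] = Pr[COUNT = c] over the tuples processed so far.
        dist = {0: 1.0}
        for tup in source:
            new_dist = {}
            for c, p in dist.items():
                for src_attr, pr in mappings:     # mapping choice for this tuple
                    c2 = c + (1 if predicate(tup[src_attr]) else 0)
                    new_dist[c2] = new_dist.get(c2, 0.0) + p * pr
            dist = new_dist
        return dist

    SOURCE = [
        {"current-addr": "Mountain View", "permanent-addr": "Sunnyvale",
         "email-addr": "alice@"},
        {"current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale",
         "email-addr": "bob@"},
    ]
    MAPPINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]
    print(count_distribution(SOURCE, MAPPINGS, lambda v: v == "Sunnyvale"))
    # {0: 0.06, 1: 0.58, 2: 0.36}, matching Table 2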

3.4 Creating P-mappings

We now address the problem of generating a p-mapping between a source schema and a target schema. We begin by assuming we have a set of weighted correspondences between the source attributes and the target attributes. These weighted correspondences are created by a set of schema matching modules. However, as we explain shortly, there can be multiple p-mappings that are consistent with a given set of weighted correspondences, and the question is which of them to choose. We describe an approach to creating p-mappings that is based on choosing the mapping that maximizes the entropy of the probability assignment.

3.4.1 Computing weighted correspondences

A weighted correspondence between a pair of attributes specifies the degree of semantic similarity between them. Let S(s_1,...,s_m) be a source schema and T(t_1,...,t_n) be a target schema. We denote by C_{i,j}, i ∈ [1,m], j ∈ [1,n], the weighted correspondence between s_i and t_j, and by w_{i,j} the weight of C_{i,j}. The first step is to compute a weighted correspondence between every pair of attributes, which can be done by applying existing schema matching techniques.

Although weighted correspondences tell us the degree of similarity between pairs of attributes, they do not tell us which target attribute a source attribute should map to. For example, a target attribute mailing-address can be similar both to the source attribute current-addr and to permanent-addr, so it makes sense to map either of them to mailing-address in a schema mapping. In fact, given a set of weighted correspondences, there can be a set of p-mappings that are consistent with it. We can define the one-to-many relationship between sets of weighted correspondences and p-mappings by specifying when a p-mapping is consistent with a set of weighted correspondences.

Definition 10 (Consistent p-mapping). A p-mapping pm is consistent with a weighted correspondence C_{i,j} between a pair of source and target attributes if the sum of the probabilities of all mappings m ∈ m̄ containing correspondence (i,j) equals w_{i,j}; that is,

w_{i,j} = Σ_{m ∈ m̄, (i,j) ∈ m} Pr(m).

A p-mapping is consistent with a set of weighted correspondences C if it is consistent with each weighted correspondence C ∈ C.

However, not every set of weighted correspondences admits a consistent p-mapping. The following theorem shows under which conditions a consistent p-mapping exists, and it establishes a normalization factor for weighted correspondences that guarantees the existence of a consistent p-mapping.

Theorem 3. Let C be a set of weighted correspondences between a source schema S(s_1,...,s_m) and a target schema T(t_1,...,t_n).

- There exists a consistent p-mapping with respect to C if and only if (1) for every i ∈ [1,m], Σ_{j=1}^n w_{i,j} ≤ 1, and (2) for every j ∈ [1,n], Σ_{i=1}^m w_{i,j} ≤ 1.
- Let M' = max{max_i {Σ_{j=1}^n w_{i,j}}, max_j {Σ_{i=1}^m w_{i,j}}}. Then, for each i ∈ [1,m], Σ_{j=1}^n (w_{i,j} / M') ≤ 1, and for each j ∈ [1,n], Σ_{i=1}^m (w_{i,j} / M') ≤ 1.

Based on Theorem 3, we normalize the weighted correspondences generated as described previously by dividing them by M'; that is, w'_{i,j} = w_{i,j} / M'.

3.4.2 Generating p-mappings

To motivate our approach to generating p-mappings, consider the following example. Consider a source schema (A, B) and a target schema (A', B'). Assume we have computed the following weighted correspondences between the source and target attributes: w_{A,A'} = 0.6 and w_{B,B'} = 0.5 (the rest are 0). As we explained above, there are an infinite number of p-mappings that are consistent with this set of weighted correspondences; below we list two:

pm_1:
  m_1: {(A,A'), (B,B')}: 0.3
  m_2: {(A,A')}: 0.3
  m_3: {(B,B')}: 0.2
  m_4: empty: 0.2

pm_2:
  m_1: {(A,A'), (B,B')}: 0.5
  m_2: {(A,A')}: 0.1
  m_3: empty: 0.4

In a sense, pm_1 seems better than pm_2 because it assumes that the similarity between A and A' is independent of the similarity between B and B'.

In the general case, among the many p-mappings that are consistent with a set of weighted correspondences C, we choose the one with the maximum entropy; that is, the p-mapping whose probability distribution attains the maximum value of -Σ_{i=1}^l p_i log p_i. In the above example, pm_1 attains the maximum entropy. The intuition behind maximum entropy is that when we need to select among multiple possible distributions on a set of exclusive events, we choose the one that does not favor any of the events over the others. Hence, we choose the distribution that does not introduce new information that we did not have a priori. The principle of maximum entropy is widely used in other areas such as natural language processing.

To create the p-mapping, we proceed in two steps. First, we enumerate all possible one-to-one schema mappings between S and T that contain a subset of the correspondences in C. Second, we assign probabilities to each of the mappings in a way that maximizes the entropy of the resulting p-mapping.

Enumerating all possible schema mappings given C is trivial: for each subset of correspondences, if it corresponds to a one-to-one mapping, we consider the mapping as a possible mapping. Given the possible mappings m_1,...,m_l, we assign probabilities p_1,...,p_l to m_1,...,m_l by solving the following constrained optimization problem (OPT):

maximize -Σ_{k=1}^l p_k log p_k subject to:
1. ∀k ∈ [1,l]: 0 ≤ p_k ≤ 1,
2. Σ_{k=1}^l p_k = 1, and
3. ∀i,j: Σ_{k ∈ [1,l], (i,j) ∈ m_k} p_k = w_{i,j}.

We can apply existing technology to solve the OPT optimization problem. Although finding maximum-entropy solutions in general is costly, the experiments described in (Sarma et al, 2008) show that the execution time is reasonable for a one-time process.
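A minimal sketch of OPT for the (A,B)/(A',B') example above, assuming SciPy is available; it enumerates the one-to-one mappings and maximizes entropy with an off-the-shelf constrained solver.

    import numpy as np
    from scipy.optimize import minimize

    # The four one-to-one mappings consistent with w_{A,A'}=0.6, w_{B,B'}=0.5.
    mappings = [{"AA'", "BB'"}, {"AA'"}, {"BB'"}, set()]
    weights = {"AA'": 0.6, "BB'": 0.5}

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)             # avoid log(0)
        return float(np.sum(p * np.log(p)))    # minimize -H(p)

    constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
    for corr, w in weights.items():
        mask = np.array([corr in m for m in mappings], dtype=float)
        constraints.append({"type": "eq",
                            "fun": lambda p, mask=mask, w=w: float(mask @ p - w)})

    res = minimize(neg_entropy, x0=np.full(len(mappings), 0.25),
                   bounds=[(0.0, 1.0)] * len(mappings), constraints=constraints,
                   method="SLSQP")
    print(np.round(res.x, 3))   # -> [0.3, 0.3, 0.2, 0.2], i.e. pm_1

The solver recovers pm_1, the maximum-entropy p-mapping singled out above.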

3.5 Broader Classes of Mappings

In this section we describe several practical extensions to the basic mapping language. The query answering techniques and complexity results we have described carry over to these extensions.

GLAV mappings: The common formalism for schema mappings, GLAV (a.k.a. tuple-generating dependencies), is based on expressions of the form

m: ∀x (ϕ(x) → ∃y ψ(x,y)).

In the expression, ϕ is the body of a conjunctive query over S̄ and ψ is the body of a conjunctive query over T̄. A pair of instances D_S and D_T satisfies a GLAV mapping m if for every assignment of x in D_S that satisfies ϕ there exists an assignment of y in D_T that satisfies ψ.

We define general p-mappings to be triples of the form pgm = (S̄, T̄, gm̄), where gm̄ is a set {(gm_i, Pr(gm_i)) | i ∈ [1,n]}, such that for each i ∈ [1,n], gm_i is a general GLAV mapping. The definition of by-table semantics for such mappings is a simple generalization of Definition 6, and query answering can be conducted in PTIME. Extending by-tuple semantics to arbitrary GLAV mappings is much trickier than by-table semantics: it would involve considering mapping sequences whose length is the product of the number of tuples in each source table, and the results are much less intuitive.

Theorem 4. Let pgm be a general p-mapping between a source schema S̄ and a target schema T̄. Let D_S be an instance of S̄. Let Q be an SPJ query with only equality conditions over T̄. The problem of computing Q^table(D_S) with respect to pgm is in PTIME in the size of the data and the mapping.

Complex mappings: Complex mappings map a set of attributes in the source to a set of attributes in the target. For example, we can map the attribute address to the concatenation of street, city, and state.

Formally, a set correspondence between S and T is a relationship between a subset of attributes in S and a subset of attributes in T. Here, the function associated with the relationship specifies a single value for each of the target attributes given a value for each of the source attributes. Again, the actual functions are irrelevant to our discussion. A complex mapping is a triple (S, T, cm̄), where cm̄ is a set of set correspondences such that each attribute in S or T is involved in at most one set correspondence. A complex p-mapping is of the form pcm = {(cm_i, Pr(cm_i)) | i ∈ [1,n]}, where Σ_{i=1}^n Pr(cm_i) = 1.

Theorem 5. Let pcm be a complex schema p-mapping between schemas S̄ and T̄. Let D_S be an instance of S̄. Let Q be an SPJ query over T̄. The data complexity and mapping complexity of computing Q^table(D_S) with respect to pcm are PTIME. The data complexity of computing Q^tuple(D_S) with respect to pcm is #P-complete. The mapping complexity of computing Q^tuple(D_S) with respect to pcm is in PTIME.

Union mapping: Union mappings specify relationships such as: both attribute home-address and attribute office-address can be mapped to address. Formally, a union mapping is a triple (S, T, m̄), where m̄ is a set of mappings between S and T. Given a source relation D_S and a target relation D_T, we say D_S and D_T are consistent with respect to the union mapping if for each source tuple t and m ∈ m̄, there


Online and Offline Selling in Limit Order Markets Online and Offline Selling in Limit Order Markets Kevin L. Chang 1 and Aaron Johnson 2 1 Yahoo Inc. klchang@yahoo-inc.com 2 Yale University ajohnson@cs.yale.edu Abstract. Completely automated electronic

More information

Software Active Online Monitoring Under. Anticipatory Semantics

Software Active Online Monitoring Under. Anticipatory Semantics Software Active Online Monitoring Under Anticipatory Semantics Changzhi Zhao, Wei Dong, Ji Wang, Zhichang Qi National Laboratory for Parallel and Distributed Processing P.R.China 7/21/2009 Overview Software

More information

9.2 Summation Notation

9.2 Summation Notation 9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a

More information

Efficient Auditing For Complex SQL queries

Efficient Auditing For Complex SQL queries Efficient Auditing For Complex SQL queries Raghav Kaushik Microsoft Research skaushi@microsoft.com Ravi Ramamurthy Microsoft Research ravirama@microsoft.com ABSTRACT We address the problem of data auditing

More information

The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

More information

On the Unique Games Conjecture

On the Unique Games Conjecture On the Unique Games Conjecture Antonios Angelakis National Technical University of Athens June 16, 2015 Antonios Angelakis (NTUA) Theory of Computation June 16, 2015 1 / 20 Overview 1 Introduction 2 Preliminary

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy The Deep Web: Surfacing Hidden Value Michael K. Bergman Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy Presented by Mat Kelly CS895 Web-based Information Retrieval

More information

XML Data Integration

XML Data Integration XML Data Integration Lucja Kot Cornell University 11 November 2010 Lucja Kot (Cornell University) XML Data Integration 11 November 2010 1 / 42 Introduction Data Integration and Query Answering A data integration

More information

Factoring & Primality

Factoring & Primality Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

More information

PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2

More information

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

More information

Testing LTL Formula Translation into Büchi Automata

Testing LTL Formula Translation into Büchi Automata Testing LTL Formula Translation into Büchi Automata Heikki Tauriainen and Keijo Heljanko Helsinki University of Technology, Laboratory for Theoretical Computer Science, P. O. Box 5400, FIN-02015 HUT, Finland

More information

1 Construction of CCA-secure encryption

1 Construction of CCA-secure encryption CSCI 5440: Cryptography Lecture 5 The Chinese University of Hong Kong 10 October 2012 1 Construction of -secure encryption We now show how the MAC can be applied to obtain a -secure encryption scheme.

More information

CHAPTER 7 GENERAL PROOF SYSTEMS

CHAPTER 7 GENERAL PROOF SYSTEMS CHAPTER 7 GENERAL PROOF SYSTEMS 1 Introduction Proof systems are built to prove statements. They can be thought as an inference machine with special statements, called provable statements, or sometimes

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

An example of a computable

An example of a computable An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

More information

DEGREES OF ORDERS ON TORSION-FREE ABELIAN GROUPS

DEGREES OF ORDERS ON TORSION-FREE ABELIAN GROUPS DEGREES OF ORDERS ON TORSION-FREE ABELIAN GROUPS ASHER M. KACH, KAREN LANGE, AND REED SOLOMON Abstract. We construct two computable presentations of computable torsion-free abelian groups, one of isomorphism

More information

Chapter 5. SQL: Queries, Constraints, Triggers

Chapter 5. SQL: Queries, Constraints, Triggers Chapter 5 SQL: Queries, Constraints, Triggers 1 Overview: aspects of SQL DML: Data Management Language. Pose queries (Ch. 5) and insert, delete, modify rows (Ch. 3) DDL: Data Definition Language. Creation,

More information

Database Security. The Need for Database Security

Database Security. The Need for Database Security Database Security Public domain NASA image L-1957-00989 of people working with an IBM type 704 electronic data processing machine. 1 The Need for Database Security Because databases play such an important

More information

Efficiently Identifying Inclusion Dependencies in RDBMS

Efficiently Identifying Inclusion Dependencies in RDBMS Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany bauckmann@informatik.hu-berlin.de

More information

Relational Databases

Relational Databases Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 18 Relational data model Domain domain: predefined set of atomic values: integers, strings,... every attribute

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

SQL: Queries, Programming, Triggers

SQL: Queries, Programming, Triggers SQL: Queries, Programming, Triggers Chapter 5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 R1 Example Instances We will use these instances of the Sailors and Reserves relations in

More information

The Classes P and NP

The Classes P and NP The Classes P and NP We now shift gears slightly and restrict our attention to the examination of two families of problems which are very important to computer scientists. These families constitute the

More information

Example Instances. SQL: Queries, Programming, Triggers. Conceptual Evaluation Strategy. Basic SQL Query. A Note on Range Variables

Example Instances. SQL: Queries, Programming, Triggers. Conceptual Evaluation Strategy. Basic SQL Query. A Note on Range Variables SQL: Queries, Programming, Triggers Chapter 5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Example Instances We will use these instances of the Sailors and Reserves relations in our

More information

So let us begin our quest to find the holy grail of real analysis.

So let us begin our quest to find the holy grail of real analysis. 1 Section 5.2 The Complete Ordered Field: Purpose of Section We present an axiomatic description of the real numbers as a complete ordered field. The axioms which describe the arithmetic of the real numbers

More information

Cleaning Uncertain Data with Quality Guarantees

Cleaning Uncertain Data with Quality Guarantees Cleaning Uncertain Data with Quality Guarantees Reynold Cheng Jinchuan Chen Xike Xie Department of Computing The Hong Kong Polytechnic University Hung Hom, Kowloon, Hong Kong {csckcheng,csjcchen,csxxie}@comp.polyu.edu.hk

More information

SQL: Queries, Programming, Triggers

SQL: Queries, Programming, Triggers SQL: Queries, Programming, Triggers CSC343 Introduction to Databases - A. Vaisman 1 R1 Example Instances We will use these instances of the Sailors and Reserves relations in our examples. If the key for

More information

IT2305 Database Systems I (Compulsory)

IT2305 Database Systems I (Compulsory) Database Systems I (Compulsory) INTRODUCTION This is one of the 4 modules designed for Semester 2 of Bachelor of Information Technology Degree program. CREDITS: 04 LEARNING OUTCOMES On completion of this

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

Lecture 7: NP-Complete Problems

Lecture 7: NP-Complete Problems IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Basic Course on Computational Complexity Lecture 7: NP-Complete Problems David Mix Barrington and Alexis Maciel July 25, 2000 1. Circuit

More information

On-Line/Off-Line Digital Signatures

On-Line/Off-Line Digital Signatures J. Cryptology (996) 9: 35 67 996 International Association for Cryptologic Research On-Line/Off-Line Digital Signatures Shimon Even Computer Science Department, Technion Israel Institute of Technology,

More information

Improving Data Quality: Consistency and Accuracy

Improving Data Quality: Consistency and Accuracy Improving Data Quality: Consistency and Accuracy Gao Cong 1 Wenfei Fan 2,3 Floris Geerts 2,4,5 Xibei Jia 2 Shuai Ma 2 1 Microsoft Research Asia 2 University of Edinburgh 4 Hasselt University 3 Bell Laboratories

More information

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

More information

Answering Structured Queries on Unstructured Data

Answering Structured Queries on Unstructured Data Answering Structured Queries on Unstructured Data Jing Liu University of Washington Seattle, WA 9895 liujing@cs.washington.edu Xin Dong University of Washington Seattle, WA 9895 lunadong @cs.washington.edu

More information

Information Systems SQL. Nikolaj Popov

Information Systems SQL. Nikolaj Popov Information Systems SQL Nikolaj Popov Research Institute for Symbolic Computation Johannes Kepler University of Linz, Austria popov@risc.uni-linz.ac.at Outline SQL Table Creation Populating and Modifying

More information

Simulation-Based Security with Inexhaustible Interactive Turing Machines

Simulation-Based Security with Inexhaustible Interactive Turing Machines Simulation-Based Security with Inexhaustible Interactive Turing Machines Ralf Küsters Institut für Informatik Christian-Albrechts-Universität zu Kiel 24098 Kiel, Germany kuesters@ti.informatik.uni-kiel.de

More information

A single minimal complement for the c.e. degrees

A single minimal complement for the c.e. degrees A single minimal complement for the c.e. degrees Andrew Lewis Leeds University, April 2002 Abstract We show that there exists a single minimal (Turing) degree b < 0 s.t. for all c.e. degrees 0 < a < 0,

More information

Introduction. Digital Signature

Introduction. Digital Signature Introduction Electronic transactions and activities taken place over Internet need to be protected against all kinds of interference, accidental or malicious. The general task of the information technology

More information

1 Message Authentication

1 Message Authentication Theoretical Foundations of Cryptography Lecture Georgia Tech, Spring 200 Message Authentication Message Authentication Instructor: Chris Peikert Scribe: Daniel Dadush We start with some simple questions

More information

Personalization of Web Search With Protected Privacy

Personalization of Web Search With Protected Privacy Personalization of Web Search With Protected Privacy S.S DIVYA, R.RUBINI,P.EZHIL Final year, Information Technology,KarpagaVinayaga College Engineering and Technology, Kanchipuram [D.t] Final year, Information

More information

The composition of Mappings in a Nautural Interface

The composition of Mappings in a Nautural Interface Composing Schema Mappings: Second-Order Dependencies to the Rescue Ronald Fagin IBM Almaden Research Center fagin@almaden.ibm.com Phokion G. Kolaitis UC Santa Cruz kolaitis@cs.ucsc.edu Wang-Chiew Tan UC

More information

Relational Database: Additional Operations on Relations; SQL

Relational Database: Additional Operations on Relations; SQL Relational Database: Additional Operations on Relations; SQL Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Overview The course packet

More information

Reading 13 : Finite State Automata and Regular Expressions

Reading 13 : Finite State Automata and Regular Expressions CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model

More information

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? introduction Many students seem to have trouble with the notion of a mathematical proof. People that come to a course like Math 216, who certainly

More information

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Automatic Annotation Wrapper Generation and Mining Web Database Search Result Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India

More information

Gambling Systems and Multiplication-Invariant Measures

Gambling Systems and Multiplication-Invariant Measures Gambling Systems and Multiplication-Invariant Measures by Jeffrey S. Rosenthal* and Peter O. Schwartz** (May 28, 997.. Introduction. This short paper describes a surprising connection between two previously

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Virtual Data Integration

Virtual Data Integration Virtual Data Integration Helena Galhardas Paulo Carreira DEI IST (based on the slides of the course: CIS 550 Database & Information Systems, Univ. Pennsylvania, Zachary Ives) Agenda Terminology Conjunctive

More information

A Model For Revelation Of Data Leakage In Data Distribution

A Model For Revelation Of Data Leakage In Data Distribution A Model For Revelation Of Data Leakage In Data Distribution Saranya.R Assistant Professor, Department Of Computer Science and Engineering Lord Jegannath college of Engineering and Technology Nagercoil,

More information

How To Find An Optimal Search Protocol For An Oblivious Cell

How To Find An Optimal Search Protocol For An Oblivious Cell The Conference Call Search Problem in Wireless Networks Leah Epstein 1, and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. lea@math.haifa.ac.il 2 Department of Statistics,

More information

A Probabilistic Database Extension

A Probabilistic Database Extension A Probabilistic Database Extension Ander de Keijzer Maurice van Keulen Faculty of EEMCS, University of Twente POBox 217, 7500 AE Enschede, The Netherlands {a.dekeijzer,m.vankeulen}@utwente.nl Abstract

More information

Functional Dependency Generation and Applications in pay-as-you-go data integration systems

Functional Dependency Generation and Applications in pay-as-you-go data integration systems Functional Dependency Generation and Applications in pay-as-you-go data integration systems Daisy Zhe Wang, Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy UC Berkeley and AT&T Research

More information

Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP.

Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Explication: In his paper, The Singularity: A philosophical analysis, David Chalmers defines an extendible

More information

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Jialin Zhang Tsinghua University zhanggl02@mails.tsinghua.edu.cn Wei Chen Microsoft Research Asia weic@microsoft.com Abstract

More information

Arithmetic Coding: Introduction

Arithmetic Coding: Introduction Data Compression Arithmetic coding Arithmetic Coding: Introduction Allows using fractional parts of bits!! Used in PPM, JPEG/MPEG (as option), Bzip More time costly than Huffman, but integer implementation

More information

THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER

THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER JASPER DEANTONIO Abstract. This paper is a study of the Turing Degrees, which are levels of incomputability naturally arising from sets of natural numbers.

More information

From Databases to Natural Language: The Unusual Direction

From Databases to Natural Language: The Unusual Direction From Databases to Natural Language: The Unusual Direction Yannis Ioannidis Dept. of Informatics & Telecommunications, MaDgIK Lab University of Athens, Hellas (Greece) yannis@di.uoa.gr http://www.di.uoa.gr/

More information

Exponential time algorithms for graph coloring

Exponential time algorithms for graph coloring Exponential time algorithms for graph coloring Uriel Feige Lecture notes, March 14, 2011 1 Introduction Let [n] denote the set {1,..., k}. A k-labeling of vertices of a graph G(V, E) is a function V [k].

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

14.1 Rent-or-buy problem

14.1 Rent-or-buy problem CS787: Advanced Algorithms Lecture 14: Online algorithms We now shift focus to a different kind of algorithmic problem where we need to perform some optimization without knowing the input in advance. Algorithms

More information

MINIMAL BOOKS OF RATIONALES

MINIMAL BOOKS OF RATIONALES MINIMAL BOOKS OF RATIONALES José Apesteguía Miguel A. Ballester D.T.2005/01 MINIMAL BOOKS OF RATIONALES JOSE APESTEGUIA AND MIGUEL A. BALLESTER Abstract. Kalai, Rubinstein, and Spiegler (2002) propose

More information

1 if 1 x 0 1 if 0 x 1

1 if 1 x 0 1 if 0 x 1 Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

More information

DYNAMIC QUERY FORMS WITH NoSQL

DYNAMIC QUERY FORMS WITH NoSQL IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: 1.800.529.0165 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This course is designed to deliver the fundamentals of SQL and PL/SQL along

More information

Notes from Week 1: Algorithms for sequential prediction

Notes from Week 1: Algorithms for sequential prediction CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are

More information

Databases 2011 The Relational Model and SQL

Databases 2011 The Relational Model and SQL Databases 2011 Christian S. Jensen Computer Science, Aarhus University What is a Database? Main Entry: da ta base Pronunciation: \ˈdā-tə-ˌbās, ˈda- also ˈdä-\ Function: noun Date: circa 1962 : a usually

More information

A Framework for Personalized Healthcare Service Recommendation

A Framework for Personalized Healthcare Service Recommendation A Framework for Personalized Healthcare Service Recommendation Choon-oh Lee, Minkyu Lee, Dongsoo Han School of Engineering Information and Communications University (ICU) Daejeon, Korea {lcol, niklaus,

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: + 38516306373 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training delivers the

More information

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by SUBGROUPS OF CYCLIC GROUPS KEITH CONRAD 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by g = {g k : k Z}. If G = g, then G itself is cyclic, with g as a generator. Examples

More information