Uncertainty in Data Integration and Dataspace Support Platforms


Anish Das Sarma (Yahoo! Research, anish@yahoo-inc.com), Xin Luna Dong (AT&T Labs-Research, lunadong@research.att.com), and Alon Y. Halevy (Google, halevy@google.com)

Abstract. Data integration has been an important area of research for several years. However, data integration systems suffer from one of the main drawbacks of database systems: the need to invest significant modeling effort upfront. Dataspace Support Platforms (DSSPs) envision a system that offers useful services on its data without any setup effort, and that improves with time in a pay-as-you-go fashion. We argue that in order to support DSSPs, the system needs to model uncertainty at its core. We describe the concepts of probabilistic mediated schemas and probabilistic mappings as enabling concepts for DSSPs.

1 Introduction

Data integration and exchange systems offer a uniform interface to a multitude of data sources and the ability to share data across multiple systems. These systems have recently enjoyed significant research and commercial success (Halevy et al, 2005, 2006b). Current data integration systems are essentially a natural extension of traditional database systems in that queries are specified in a structured form and data are modeled in one of the traditional data models (relational, XML). In addition, the data integration system has exact knowledge of how the data in the sources map to the schema used by the data integration system.

In this chapter we argue that as the scope of data integration applications broadens, such systems need to be able to model uncertainty at their core. Uncertainty can arise for multiple reasons in data integration.

First, the semantic mappings between the data sources and the mediated schema may be approximate. For example, in an application like Google Base (GoogleBase, 2005) that enables anyone to upload structured data, or when mapping millions of sources on the deep web (Madhavan et al, 2007), we cannot imagine specifying exact mappings. In some domains (e.g., bioinformatics), we do not necessarily even know what the exact mapping is. Second, data are often extracted from unstructured sources using information extraction techniques. Since these techniques are approximate, the data obtained from the sources may be uncertain. Third, if the intended users of the application are not necessarily familiar with schemata, or if the domain of the system is too broad to offer form-based query interfaces (such as web forms), we need to support keyword queries. Hence, another source of uncertainty is the transformation between keyword queries and a set of candidate structured queries. Finally, if the scope of the domain is very broad, there can even be uncertainty about the concepts in the mediated schema.

Another reason for data integration systems to model uncertainty is to support pay-as-you-go integration. Dataspace Support Platforms (Halevy et al, 2006a) envision data integration systems where sources are added with no effort and the system constantly evolves in a pay-as-you-go fashion to improve the quality of semantic mappings and query answering. This means that as the system evolves, there will be uncertainty about the semantic mappings to its sources, about its mediated schema, and even about the semantics of the queries posed to it.

This chapter describes some of the formal foundations for data integration with uncertainty. We define probabilistic schema mappings and probabilistic mediated schemas, and show how to answer queries in their presence. With these foundations, we show that it is possible to completely automatically bootstrap a pay-as-you-go integration system. This chapter is largely based on previous papers (Dong et al, 2007; Sarma et al, 2008); the proofs of the theorems we state and the experimental results validating some of our claims can be found there. We also place several other works on uncertainty in data integration in the context of the system we envision. In the next section, we begin by describing an architecture for a data integration system that incorporates uncertainty.

2 Overview of the System

This section describes the requirements on a data integration system that supports uncertainty and the overall architecture of such a system.

2.1 Uncertainty in data integration

A data integration system needs to handle uncertainty at four levels.

Uncertain mediated schema: The mediated schema is the set of schema terms in which queries are posed. These terms do not necessarily cover all the attributes appearing in the sources, but rather the aspects of the domain that the application builder wishes to expose to the users. Uncertainty in the mediated schema can arise for several reasons. First, as we describe in Section 4, if the mediated schema is automatically inferred from the data sources in a pay-as-you-go integration system, there will be some uncertainty about the results. Second, when domains get broad, there will be some uncertainty about how to model the domain. For example, if we model all the topics in Computer Science, there will be some uncertainty about the degree of overlap between different topics.

Uncertain schema mappings: Data integration systems rely on schema mappings to specify the semantic relationships between the data in the sources and the terms used in the mediated schema. However, schema mappings can be inaccurate. In many applications it is impossible to create and maintain precise mappings between data sources. This can be because the users are not skilled enough to provide precise mappings, such as in personal information management (Dong and Halevy, 2005); because people do not understand the domain well and thus do not even know what the correct mappings are, such as in bioinformatics; or because the scale of the data prevents generating and maintaining precise mappings, such as when integrating web-scale data (Madhavan et al, 2007). Hence, in practice, schema mappings are often generated by semi-automatic tools and are not necessarily verified by domain experts.

Uncertain data: By nature, data integration systems need to handle uncertain data. One reason for uncertainty is that data are often extracted from unstructured or semi-structured sources (e.g., HTML pages, emails, blogs) by automatic methods. A second reason is that data may come from sources that are unreliable or not up to date. For example, in enterprise settings, it is common for informational data such as gender, race, and income level to be dirty or missing, even when the transactional data is precise.

Uncertain queries: In some data integration applications, especially on the web, queries will be posed as keywords rather than as structured queries against a well-defined schema. The system needs to translate these queries into some structured form so they can be reformulated with respect to the data sources. At this step, the system may generate multiple candidate structured queries and have some uncertainty about which one captures the real intent of the user.

2.2 System architecture

Given the previously discussed requirements, we describe the architecture of a data integration system that manages uncertainty at its core. We describe the system by contrasting it to a traditional data integration system.

Fig. 1 Architecture of a data-integration system that handles uncertainty. (A query Q is first translated by keyword reformulation into candidate queries Q_1,...,Q_m on the mediated schema; query reformulation then produces queries Q_11,...,Q_1n, ..., Q_k1,...,Q_kn, which the query processor executes over the sources D_1,...,D_k.)

The first and most fundamental characteristic of this system is that it is based on a probabilistic data model. This means that we attach probabilities to: tuples that we process in the system, schema mappings, mediated schemas, and possible interpretations of keyword queries posed to the system. In contrast, a traditional data integration system includes a single mediated schema and a single (and presumably correct) schema mapping between the mediated schema and each source. The data in the sources is also assumed to be correct.

Traditional data integration systems assume that the query is posed in a structured fashion (i.e., can be translated to some subset of SQL). Here, we assume that queries can be posed as keywords (to accommodate a much broader class of users and applications). Hence, whereas traditional data integration systems begin by reformulating a query onto the schemas of the data sources, a data integration system with uncertainty needs to first reformulate a keyword query into a set of candidate structured queries. We refer to this step as keyword reformulation. Note that keyword reformulation is different from techniques for keyword search on structured data (e.g., (Hristidis and Papakonstantinou, 2002; Agrawal et al, 2002)) in that (a) it does not assume access to all the data in the sources or that the sources support keyword search, and (b) it tries to distinguish different structural elements in the query in order to pose more precise queries to the sources (e.g., realizing that in the keyword query "Chicago weather", weather is an attribute label and Chicago is an instance name). That being said, keyword reformulation should benefit from techniques that support answering keyword search on structured data.
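To make the keyword-reformulation step concrete, the following Python sketch enumerates role assignments for the keywords of a query and normalizes their scores into probabilities. It is purely illustrative: the role scores and all names in it (ROLE_SCORES, candidate_queries) are hypothetical and not part of any system described in this chapter.

    from itertools import product

    # Hypothetical per-keyword scores for each structural role; in practice
    # such scores could be learned from query logs and schema statistics.
    ROLE_SCORES = {
        "weather": {"attribute": 0.9, "instance": 0.1},
        "Chicago": {"attribute": 0.05, "instance": 0.95},
    }

    def candidate_queries(keywords):
        # Enumerate every assignment of roles to keywords; the score of an
        # assignment is the product of per-keyword role scores, normalized
        # into a probability over candidate structured queries.
        candidates = []
        for roles in product(("attribute", "instance"), repeat=len(keywords)):
            score = 1.0
            for kw, role in zip(keywords, roles):
                score *= ROLE_SCORES[kw][role]
            candidates.append((dict(zip(keywords, roles)), score))
        total = sum(score for _, score in candidates)
        return [(cand, score / total) for cand, score in candidates]

    for cand, prob in candidate_queries(["Chicago", "weather"]):
        print(prob, cand)

The most probable assignment (Chicago as an instance value, weather as an attribute label) corresponds to the structured interpretation discussed above.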

The query answering model is also different. Instead of necessarily finding all answers to a given query, our goal is typically to find the top-k answers and to rank these answers most effectively.

The final difference from traditional data integration systems is that our query processing will need to be more adaptive than usual. Instead of generating a query answering plan and executing it, the steps we take in query processing will depend on the results of previous steps. We note that adaptive query processing has been discussed quite a bit in the data integration literature, where the need for adaptivity arises from the fact that data sources may not answer as quickly as expected or that we may not have accurate statistics about their contents to properly order our operations. In our work, however, the goal of adaptivity is to obtain the answers with high probabilities faster.

The architecture of the system is shown in Figure 1. The system contains a number of data sources and a mediated schema (we omit probabilistic mediated schemas from this figure). When the user poses a query Q, which can be either a structured query on the mediated schema or a keyword query, the system returns a set of answer tuples, each with a probability. If Q is a keyword query, the system first performs keyword reformulation to translate it into a set of candidate structured queries on the mediated schema. Otherwise, the candidate query is Q itself.

2.3 Source of probabilities

A critical issue in any system that manages uncertainty is whether we have a reliable source of probabilities. Whereas obtaining reliable probabilities for such a system is one of the most interesting areas for future research, there is quite a bit to build on. For keyword reformulation, it is possible to train and test reformulators on large numbers of queries such that each reformulation result is given a probability based on its performance statistics. For information extraction, current techniques are often based on statistical machine learning methods and can be extended to compute probabilities of each extraction result. Finally, in the case of schema matching, it is standard practice for schema matchers to associate numbers with the candidates they propose (e.g., (Rahm and Bernstein, 2001; Berlin and Motro, 2002; Dhamankar et al, 2004; Do and Rahm, 2002; Doan et al, 2002; He and Chang, 2003; Kang and Naughton, 2003; Wang et al, 2004)). The issue here is that the numbers are meant only as a ranking mechanism rather than as true probabilities. However, as schema matching techniques start looking at larger numbers of schemas, one can imagine ascribing probabilities (or estimates thereof) to their measures.

2.4 Outline of the chapter

We begin by discussing probabilistic schema mappings in Section 3. We also discuss how to answer queries in their presence and how to answer top-k queries. In Section 4 we discuss probabilistic mediated schemas. We begin by motivating them and showing that in some cases they add expressive power to the resulting system. Then we describe an algorithm for generating probabilistic mediated schemas from a collection of data sources.

3 Uncertainty in Mappings

The key to resolving heterogeneity at the schema level is to specify schema mappings between data sources. These mappings describe the relationship between the contents of the different sources and are used to reformulate a query posed over one source (or a mediated schema) into queries over the sources that are deemed relevant. However, in many applications we are not able to provide all the schema mappings upfront. In this section we describe how we use probabilistic schema mappings (p-mappings, defined in Definition 3 in Chapter 3) to capture uncertainty about mappings between schemas.

We start by presenting a running example for this section that also motivates p-mappings (Section 3.1). Then, Section 3.2 formally defines p-mappings and their semantics in query answering. After that, Section 3.3 describes algorithms for query answering with respect to probabilistic mappings and discusses the complexity. Next, Section 3.4 shows how to leverage previous work on schema matching to automatically create probabilistic mappings. Finally, Section 3.5 briefly describes various extensions to the basic definition, and Section 3.6 describes other types of approximate schema mappings that have been proposed in the literature.

3.1 Motivating probabilistic mappings

Possible Mapping                                                       Prob
m_1 = {(pname, name), (email-addr, email),
       (current-addr, mailing-addr), (permanent-addr, home-addr)}      0.5
m_2 = {(pname, name), (email-addr, email),
       (permanent-addr, mailing-addr), (current-addr, home-addr)}      0.4
m_3 = {(pname, name), (email-addr, mailing-addr),
       (current-addr, home-addr)}                                      0.1
(a)

pname  email-addr  current-addr   permanent-addr
Alice  alice@      Mountain View  Sunnyvale
Bob    bob@        Sunnyvale      Sunnyvale
(b)

Tuple (mailing-addr)  Prob
('Sunnyvale')         0.9
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(c)

Fig. 2 The running example: (a) a probabilistic schema mapping between S and T; (b) a source instance D_S; (c) the answers of Q over D_S with respect to the probabilistic mapping.

Example 1. Consider a data source S, which describes a person by her email address, current address, and permanent address,

and the mediated schema T, which describes a person by her name, email, mailing address, home address, and office address:

S = (pname, email-addr, current-addr, permanent-addr)
T = (name, email, mailing-addr, home-addr, office-addr)

A semi-automatic schema-mapping tool may generate three possible mappings between S and T, assigning each a probability. Whereas the three mappings all map pname to name, they map the other attributes in the source and the target differently. Figure 2(a) describes the three mappings using sets of attribute correspondences. For example, mapping m_1 maps pname to name, email-addr to email, current-addr to mailing-addr, and permanent-addr to home-addr. Because of the uncertainty about which mapping is correct, we consider all of these mappings in query answering.

Suppose the system receives a query Q composed on the mediated schema and asking for people's mailing addresses:

Q: SELECT mailing-addr FROM T

Using the possible mappings, we can reformulate Q into different queries:

Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S

If the user requires all possible answers, the system generates a single aggregation query based on Q1, Q2 and Q3 to compute the probability of each returned tuple, and sends the query to the data source. Suppose the data source contains a table D_S as shown in Figure 2(b); the system will then retrieve four answer tuples, each with a probability, as shown in Figure 2(c).

If the user requires only the top-1 answer (i.e., the answer tuple with the highest probability), the system decides at runtime which reformulated queries to execute. For example, after executing Q1 and Q2 at the source, the system can already conclude that ('Sunnyvale') is the top-1 answer and can skip query Q3.
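As a concrete reading of this example, the following sketch (a minimal Python rendering, not the chapter's implementation) reformulates Q under each possible mapping and sums mapping probabilities per distinct answer, reproducing Figure 2(c).

    # D_S from Figure 2(b); each possible mapping is summarized here by the
    # source attribute it maps to mailing-addr, with its probability.
    SOURCE = [
        {"pname": "Alice", "email-addr": "alice@",
         "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
        {"pname": "Bob", "email-addr": "bob@",
         "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
    ]
    MAPPINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

    def by_table_answers(source, mappings):
        answers = {}
        for src_attr, pr in mappings:              # one reformulation per mapping
            for value in {t[src_attr] for t in source}:   # DISTINCT projection
                answers[value] = answers.get(value, 0.0) + pr
        return answers

    print(by_table_answers(SOURCE, MAPPINGS))
    # {'Mountain View': 0.5, 'Sunnyvale': 0.9, 'alice@': 0.1, 'bob@': 0.1}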

3.2 Definition and Semantics

3.2.1 Schema mappings and p-mappings

We begin by reviewing non-probabilistic schema mappings. The goal of a schema mapping is to specify the semantic relationships between a source schema and a target schema. We refer to the source schema as S̄, and to a relation in S̄ as S = ⟨s_1,...,s_m⟩. Similarly, we refer to the target schema as T̄, and to a relation in T̄ as T = ⟨t_1,...,t_n⟩.

We consider a limited form of schema mappings that are also referred to as schema matchings in the literature. Specifically, a schema matching contains a set of attribute correspondences. An attribute correspondence is of the form c_ij = (s_i, t_j), where s_i is a source attribute in the schema S and t_j is a target attribute in the schema T. Intuitively, c_ij specifies that there is a relationship between s_i and t_j. In practice, a correspondence also involves a function that transforms the value of s_i into the value of t_j. For example, the correspondence (c-degree, temperature) can be specified as temperature = c-degree * 1.8 + 32, describing a transformation from Celsius to Fahrenheit. These functions are irrelevant to our discussion, and therefore we omit them. This class of mappings is quite common in practice and already exposes many of the novel issues involved in probabilistic mappings. In Section 3.5 we briefly discuss extensions to a broader class of mappings.

Formally, relation mappings and schema mappings are defined as follows.

Definition 1 (Schema Mapping). Let S̄ and T̄ be relational schemas. A relation mapping M is a triple (S, T, m), where S is a relation in S̄, T is a relation in T̄, and m is a set of attribute correspondences between S and T.

When each source and target attribute occurs in at most one correspondence in m, we call M a one-to-one relation mapping. A schema mapping M̄ is a set of one-to-one relation mappings between relations in S̄ and in T̄, where every relation in either S̄ or T̄ appears at most once.

A pair of instances D_S and D_T satisfies a relation mapping m if for every source tuple t_s ∈ D_S there exists a target tuple t_t ∈ D_T such that for every attribute correspondence (s,t) ∈ m, the value of attribute s in t_s is the same as the value of attribute t in t_t.

Example 2. Consider the mappings in Example 1. The source database in Figure 2(b) (repeated in Figure 3(a)) and the target database in Figure 3(b) satisfy m_1.

Intuitively, a probabilistic schema mapping describes a probability distribution over a set of possible schema mappings between a source schema and a target schema. For completeness, we repeat its definition as follows (see also Definition 3 in Chapter 3).

Definition 2 (Probabilistic Mapping). Let S̄ and T̄ be relational schemas. A probabilistic mapping (p-mapping), pm, is a triple (S, T, m̄), where S ∈ S̄, T ∈ T̄, and m̄ is a set {(m_1, Pr(m_1)),...,(m_l, Pr(m_l))}, such that

- for i ∈ [1,l], m_i is a one-to-one mapping between S and T, and for every i, j ∈ [1,l], i ≠ j implies m_i ≠ m_j;
- Pr(m_i) ∈ [0,1] and Σ_{i=1}^l Pr(m_i) = 1.

A schema p-mapping, pM, is a set of p-mappings between relations in S̄ and in T̄, where every relation in either S̄ or T̄ appears in at most one p-mapping.

We refer to a non-probabilistic mapping as an ordinary mapping. A schema p-mapping may contain both p-mappings and ordinary mappings. Example 1 shows a p-mapping (see Figure 2(a)) that contains three possible mappings.

pname  email-addr  current-addr   permanent-addr
Alice  alice@      Mountain View  Sunnyvale
Bob    bob@        Sunnyvale      Sunnyvale
(a)

name   email   mailing-addr   home-addr  office-addr
Alice  alice@  Mountain View  Sunnyvale  office
Bob    bob@    Sunnyvale      Sunnyvale  office
(b)

name   email   mailing-addr  home-addr      office-addr
Alice  alice@  Sunnyvale     Mountain View  office
Bob    email   bob@          Sunnyvale      office
(c)

Tuple (mailing-addr)  Prob
('Sunnyvale')         0.9
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(d)

Tuple (mailing-addr)  Prob
('Sunnyvale')         0.94
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(e)

Fig. 3 Example 3: (a) a source instance D_S; (b) a target instance that is by-table consistent with D_S and m_1; (c) a target instance that is by-tuple consistent with D_S and <m_2, m_3>; (d) Q^table(D_S); (e) Q^tuple(D_S).

3.2.2 Semantics of probabilistic mappings

Intuitively, a probabilistic schema mapping models the uncertainty about which of the mappings in pm is the correct one. When a schema matching system produces a set of candidate matches, there are two ways to interpret the uncertainty: (1) a single mapping in pm is the correct one and it applies to all the data in S, or (2) several mappings are partially correct and each is suitable for a subset of the tuples in S, though it is not known which mapping is the right one for a specific tuple. Figure 3(b) illustrates the first interpretation and applies mapping m_1. For the same example, the second interpretation is equally valid: some people may choose to use their current address as their mailing address while others use their permanent address; thus, for different tuples we may apply different mappings, and the correct mapping depends on the particular tuple.

We define query answering under both interpretations. The first interpretation is referred to as the by-table semantics and the second as the by-tuple semantics of probabilistic mappings. Note that one cannot argue for one interpretation over the other; the needs of the application should dictate the appropriate semantics. Furthermore, the complexity results for query answering, which will show advantages of by-table semantics, should not be taken as an argument in favor of by-table semantics.

We next define query answering with respect to p-mappings in detail; the definitions for schema p-mappings are the obvious extensions. Recall that given a query and an ordinary mapping, we can compute the certain answers to the query with respect to the mapping.

Query answering with respect to p-mappings is defined as a natural extension of certain answers, which we now review.

A mapping defines a relationship between instances of S and instances of T that are consistent with the mapping.

Definition 3 (Consistent Target Instance). Let M = (S, T, m) be a relation mapping and D_S be an instance of S. An instance D_T of T is said to be consistent with D_S and M, if for each tuple t_s ∈ D_S, there exists a tuple t_t ∈ D_T, such that for every attribute correspondence (a_s, a_t) ∈ m, the value of a_s in t_s is the same as the value of a_t in t_t.

For a relation mapping M and a source instance D_S, there can be an infinite number of target instances that are consistent with D_S and M. We denote by Tar_M(D_S) the set of all such target instances. The set of answers to a query Q is the intersection of the answers on all instances in Tar_M(D_S).

Definition 4 (Certain Answer). Let M = (S, T, m) be a relation mapping. Let Q be a query over T and let D_S be an instance of S. A tuple t is said to be a certain answer of Q with respect to D_S and M, if for every instance D_T ∈ Tar_M(D_S), t ∈ Q(D_T).

By-table semantics: We now generalize these notions to the probabilistic setting, beginning with the by-table semantics. Intuitively, a p-mapping pm describes a set of possible worlds, each with a possible mapping m ∈ m̄. In by-table semantics, a source table can fall in one of the possible worlds; that is, the possible mapping associated with that possible world applies to the whole source table. Following this intuition, we define target instances that are consistent with the source instance.

Definition 5 (By-table Consistent Instance). Let pm = (S, T, m̄) be a p-mapping and D_S be an instance of S. An instance D_T of T is said to be by-table consistent with D_S and pm, if there exists a mapping m ∈ m̄ such that D_S and D_T satisfy m.

Given a source instance D_S and a possible mapping m ∈ m̄, there can be an infinite number of target instances that are consistent with D_S and m. We denote by Tar_m(D_S) the set of all such instances.

In the probabilistic context, we assign a probability to every answer. Intuitively, we consider the certain answers with respect to each possible mapping in isolation. The probability of an answer t is the sum of the probabilities of the mappings for which t is deemed to be a certain answer. We define by-table answers as follows:

Definition 6 (By-table Answer). Let pm = (S, T, m̄) be a p-mapping. Let Q be a query over T and let D_S be an instance of S. Let t be a tuple. Let m̄(t) be the subset of m̄ such that for each m ∈ m̄(t) and for each D_T ∈ Tar_m(D_S), t ∈ Q(D_T). Let p = Σ_{m ∈ m̄(t)} Pr(m). If p > 0, then we say (t, p) is a by-table answer of Q with respect to D_S and pm.

By-tuple semantics: Following the possible-world notion, in by-tuple semantics different tuples in a source table can fall in different possible worlds; that is, different possible mappings associated with those possible worlds can apply to the different source tuples.

Formally, the key difference between the definition of by-tuple semantics and that of by-table semantics is that a consistent target instance is defined by a mapping sequence that assigns a (possibly different) mapping in m̄ to each source tuple in D_S. (Without loss of generality, in order to compare between such sequences, we assign some order to the tuples in the instance.)

Definition 7 (By-tuple Consistent Instance). Let pm = (S, T, m̄) be a p-mapping and let D_S be an instance of S with d tuples. An instance D_T of T is said to be by-tuple consistent with D_S and pm, if there is a sequence ⟨m^1,...,m^d⟩ such that d is the number of tuples in D_S, and for every 1 ≤ i ≤ d, m^i ∈ m̄, and for the i-th tuple t_i of D_S there exists a target tuple t'_i ∈ D_T such that for each attribute correspondence (a_s, a_t) ∈ m^i, the value of a_s in t_i is the same as the value of a_t in t'_i.

Given a mapping sequence seq = ⟨m^1,...,m^d⟩, we denote by Tar_seq(D_S) the set of all target instances that are consistent with D_S and seq. Note that if D_T is by-table consistent with D_S and m, then D_T is also by-tuple consistent with D_S and a mapping sequence in which each mapping is m.

We can think of every sequence of mappings seq = ⟨m^1,...,m^d⟩ as a separate event whose probability is Pr(seq) = Π_{i=1}^d Pr(m^i). (Section 3.5 relaxes this independence assumption and introduces conditional mappings.) If there are l mappings in pm, then there are l^d sequences of length d, and their probabilities add up to 1. We denote by seq_d(pm) the set of mapping sequences of length d generated from pm.

Definition 8 (By-tuple Answer). Let pm = (S, T, m̄) be a p-mapping. Let Q be a query over T and D_S be an instance of S with d tuples. Let t be a tuple. Let seq(t) be the subset of seq_d(pm), such that for each seq ∈ seq(t) and for each D_T ∈ Tar_seq(D_S), t ∈ Q(D_T). Let p = Σ_{seq ∈ seq(t)} Pr(seq). If p > 0, we call (t, p) a by-tuple answer of Q with respect to D_S and pm.

The set of by-table answers for Q with respect to D_S is denoted by Q^table(D_S) and the set of by-tuple answers for Q with respect to D_S is denoted by Q^tuple(D_S).
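The following sketch (again an illustrative Python rendering, with the same encoding of the p-mapping as in the earlier by-table sketch) computes by-tuple answers exactly as Definition 8 prescribes, by enumerating all l^d mapping sequences; the exponential cost of this enumeration is the subject of Section 3.3.

    from itertools import product

    SOURCE = [
        {"current-addr": "Mountain View", "permanent-addr": "Sunnyvale",
         "email-addr": "alice@"},
        {"current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale",
         "email-addr": "bob@"},
    ]
    MAPPINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]

    def by_tuple_answers(source, mappings):
        answers = {}
        # One mapping choice per source tuple: l^d sequences for d tuples.
        for seq in product(mappings, repeat=len(source)):
            pr_seq = 1.0
            values = set()
            for tup, (src_attr, pr) in zip(source, seq):
                pr_seq *= pr
                values.add(tup[src_attr])    # mailing-addr under this mapping
            for v in values:                 # v is an answer under this sequence
                answers[v] = answers.get(v, 0.0) + pr_seq
        return answers

    print(by_tuple_answers(SOURCE, MAPPINGS))
    # ('Sunnyvale') now gets probability 0.94, as in Figure 3(e)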

Example 3. Consider the p-mapping pm, the source instance D_S, and the query Q in the motivating example.

In by-table semantics, Figure 3(b) shows a target instance that is consistent with D_S (repeated in Figure 3(a)) and the possible mapping m_1. Figure 3(d) shows the by-table answers of Q with respect to D_S and pm. As an example, for tuple t = ('Sunnyvale') we have m̄(t) = {m_1, m_2}, so the possible tuple ('Sunnyvale', 0.9) is an answer.

In by-tuple semantics, Figure 3(c) shows a target instance that is by-tuple consistent with D_S and the mapping sequence <m_2, m_3>. Figure 3(e) shows the by-tuple answers of Q with respect to D_S and pm. Note that the probability of the tuple t = ('Sunnyvale') in the by-table answers is different from that in the by-tuple answers. We describe how to compute the probabilities in detail in the next section.

3.3 Query Answering

This section studies query answering in the presence of probabilistic mappings. We start by describing algorithms for returning all answer tuples with their probabilities, and discuss the complexity of query answering in terms of the size of the data (data complexity) and the size of the p-mapping (mapping complexity). We then consider returning the top-k query answers, which are the k answer tuples with the highest probabilities, and answering aggregate queries.

3.3.1 By-table query answering

In the case of by-table semantics, answering queries is conceptually simple. Given a p-mapping pm = (S, T, m̄) and an SPJ query Q, we can compute the certain answers of Q under each of the mappings m ∈ m̄. We attach the probability Pr(m) to every certain answer under m. If a tuple is an answer to Q under multiple mappings in m̄, then we add up the probabilities of the different mappings.

Algorithm BYTABLE takes as input an SPJ query Q that mentions the relations T_1,...,T_l in the FROM clause. Assume that we have the p-mapping pm_i associated with each table T_i. The algorithm proceeds as follows.

Step 1: We generate the possible reformulations of Q (a reformulation query computes all certain answers when executed on the source data) by considering every combination of the form (m_1,...,m_l), where m_i is one of the possible mappings in pm_i. Denote the set of reformulations by Q'_1,...,Q'_k. The probability of a reformulation Q' = (m_1,...,m_l) is Π_{i=1}^l Pr(m_i).

Step 2: For each reformulation Q', retrieve each of the unique answers from the sources. For each answer obtained by Q'_1 ∪ ... ∪ Q'_k, its probability is computed by summing the probabilities of the Q''s in which it is returned.

Importantly, note that it is possible to express both steps as an SQL query with grouping and aggregation. Therefore, if the underlying sources support SQL, we can leverage their optimizations to compute the answers.
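A minimal Python sketch of the two steps of BYTABLE, with the evaluation of a reformulation at the sources abstracted behind a callback (the evaluate interface is an assumption for illustration, not part of the chapter):

    from itertools import product

    def bytable(evaluate, p_mappings):
        # p_mappings: for each table T_i in the FROM clause, a list of
        # (mapping, Pr) pairs. Step 1: every combination of per-table
        # mappings yields one reformulation whose probability is the
        # product of the chosen mappings' probabilities.
        answers = {}
        for combo in product(*p_mappings):
            pr = 1.0
            for _, p in combo:
                pr *= p
            # Step 2: evaluate the reformulation at the sources and add
            # its probability to every answer tuple it returns.
            for t in evaluate([m for m, _ in combo]):
                answers[t] = answers.get(t, 0.0) + pr
        return answers

As noted above, when the sources support SQL both steps can instead be folded into a single query with grouping and aggregation, as Example 5 below illustrates.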

With our restricted form of schema mapping, the algorithm takes time polynomial in the size of the data and the mappings. We thus have the following complexity result.

Theorem 1. Let pM be a schema p-mapping and let Q be an SPJ query. Answering Q with respect to pM in by-table semantics is in PTIME in the size of the data and the mapping.

3.3.2 By-tuple query answering

Tuple (mailing-addr)  Pr
('Sunnyvale')         0.94
('Mountain View')     0.5
('alice@')            0.1
('bob@')              0.1
(a)

Tuple (mailing-addr)  Pr
('Sunnyvale')         0.8
('Mountain View')     0.8
(b)

Fig. 4 Example 4: (a) Q_1^tuple(D) and (b) Q_2^tuple(D).

To extend the by-table query-answering strategy to by-tuple semantics, we would need to compute the certain answers for every mapping sequence generated by pm. However, the number of such mapping sequences is exponential in the size of the input data. The following example shows that for certain queries this exponential time complexity is inevitable.

Example 4. Suppose that in addition to the tables in Example 1, we also have U(city) in the source and V(hightech) in the target. The p-mapping for V contains two possible mappings: ({(city, hightech)}, 0.8) and (∅, 0.2).

Consider the following query Q, which decides whether there are any people living in a high-tech city.

Q: SELECT 'true'
   FROM T, V
   WHERE T.mailing-addr = V.hightech

An incorrect way of answering the query is to first execute the following two sub-queries Q1 and Q2, then join the answers of Q1 and Q2 and sum up the probabilities.

Q1: SELECT mailing-addr FROM T
Q2: SELECT hightech FROM V

Now consider the source instance D, where D_S is shown in Figure 2(b) and D_U has two tuples ('Mountain View') and ('Sunnyvale'). Figures 4(a) and 4(b) show Q_1^tuple(D) and Q_2^tuple(D). If we join the results of Q1 and Q2, we obtain for the 'true' tuple the probability 0.94 * 0.8 + 0.5 * 0.8 = 1.152. However, this is incorrect.

By enumerating all consistent target tables, we in fact compute 0.864 as the probability. The reason for this error is that on some target instances that are by-tuple consistent with the source instance, the answers to both Q1 and Q2 contain the tuple ('Sunnyvale') and the tuple ('Mountain View'). Thus, generating the tuple ('Sunnyvale') as an answer for both Q1 and Q2 and generating the tuple ('Mountain View') for both queries are not independent events, and so simply adding up their probabilities leads to incorrect results.

Indeed, it is not clear if there exists a better algorithm to answer Q than enumerating all by-tuple consistent target instances and answering Q on each of them. In fact, it has been proved that, in general, answering SPJ queries in by-tuple semantics with respect to schema p-mappings is hard.

Theorem 2. Let Q be an SPJ query and let pM be a schema p-mapping. The problem of finding the probability for a by-tuple answer to Q with respect to pM is #P-complete with respect to data complexity and is in PTIME with respect to mapping complexity.

Recall that #P is the complexity class of certain hard counting problems (e.g., counting the number of variable assignments that satisfy a Boolean formula). It is believed that a #P-complete problem cannot be solved in polynomial time unless P = NP.

Although by-tuple query answering is hard in general, there are two restricted but common classes of queries for which by-tuple query answering takes polynomial time. The first class of queries are those that include only a single subgoal that is the target of a p-mapping; here, we refer to an occurrence of a table in the FROM clause of a query as a subgoal of the query. Relations in the other subgoals are either involved in ordinary mappings or do not require a mapping. Hence, if we have uncertainty with respect to only one part of the domain, our queries will typically fall in this class. The second class of queries can include multiple subgoals involved in p-mappings, but must return the join attributes for such subgoals. We next illustrate these two classes of queries and query answering for them using two examples.

Example 5. Consider rewriting Q in the motivating example, repeated as follows:

Q: SELECT mailing-addr FROM T

To answer the query, we first rewrite Q into query Q' by adding the id column:

Q': SELECT id, mailing-addr FROM T

We then invoke BYTABLE and generate the following SQL query to compute by-table answers for Q':

Qa: SELECT id, mailing-addr, SUM(pr)
    FROM (
      SELECT DISTINCT id, current-addr AS mailing-addr, 0.5 AS pr
      FROM S

      UNION ALL
      SELECT DISTINCT id, permanent-addr AS mailing-addr, 0.4 AS pr
      FROM S
      UNION ALL
      SELECT DISTINCT id, email-addr AS mailing-addr, 0.1 AS pr
      FROM S)
    GROUP BY id, mailing-addr

Finally, we generate the results using the following query:

Qu: SELECT mailing-addr, NOR(pr) AS pr
    FROM Qa
    GROUP BY mailing-addr

where for a set of probabilities pr_1,...,pr_n, NOR computes 1 - Π_{i=1}^n (1 - pr_i).

Example 6. Consider the schema p-mapping in Example 4. If we revise Q slightly by returning the join attribute, as shown below, we can answer the query in polynomial time.

Q': SELECT V.hightech
    FROM T, V
    WHERE T.mailing-addr = V.hightech

We answer the query by dividing it into two sub-queries, Q1 and Q2, as shown in Example 4. We can compute Q1 with the query Qu (shown in Example 5) and compute Q2 similarly with a query Qu'. We then compute the by-tuple answers of Q' as follows:

SELECT Qu'.hightech, Qu.pr * Qu'.pr
FROM Qu, Qu'
WHERE Qu.mailing-addr = Qu'.hightech

3.3.3 Top-K Query Answering

The main challenge in designing an algorithm for returning top-k query answers is to perform only the necessary reformulations at every step and halt when the top-k answers are found. We focus on top-k query answering for by-table semantics; the algorithm can be modified for by-tuple semantics.

Recall that in by-table query answering, the probability of an answer is the sum of the probabilities of the reformulated queries that generate the answer. Our goal is to reduce the number of reformulated queries we execute. The algorithm we describe next proceeds in a greedy fashion: it executes queries in descending order of probabilities. For each tuple t, it maintains an upper bound p_max(t) and a lower bound p_min(t) on its probability.

This process halts when it finds k tuples whose p_min values are higher than the p_max of the rest of the tuples.

TOPKBYTABLE takes as input an SPJ query Q, a schema p-mapping pM, an instance D_S of the source schema, and an integer k, and outputs the top-k answers in Q^table(D_S). The algorithm proceeds in three steps.

Step 1: Rewrite Q according to pM into a set of queries Q_1,...,Q_n, each with a probability assigned in a similar way as in Algorithm BYTABLE.

Step 2: Execute Q_1,...,Q_n in descending order of their probabilities. Maintain the following measures:

- The highest possible probability, PMax, for the tuples that have not been generated yet. We initialize PMax to 1; after executing query Q_i and updating the list of answers (see the third bullet), we decrease PMax by Pr(Q_i).
- The threshold th determining which answers are potentially in the top-k. We initialize th to 0; after executing Q_i and updating the answer list, we set th to the k-th largest p_min for tuples in the answer list.
- A list L of answers whose p_max is no less than th, with bounds p_min and p_max for each answer in L. After executing query Q_i, we update the list as follows: (1) for each t ∈ L with t ∈ Q_i(D_S), we increase p_min(t) by Pr(Q_i); (2) for each t ∈ L with t ∉ Q_i(D_S), we decrease p_max(t) by Pr(Q_i); (3) if PMax ≥ th, then for each t ∉ L with t ∈ Q_i(D_S), we insert t into L, set p_min(t) to Pr(Q_i), and set p_max(t) to PMax.
- A list T of the k tuples with top p_min values.

Step 3: When th > PMax and for each t ∉ T, th > p_max(t), halt and return T.

Example 7. Consider Example 1, where we seek the top-1 answer. We answer the reformulated queries in the order Q1, Q2, Q3. After answering Q1, for tuple ('Sunnyvale') we have p_min = 0.5 and p_max = 1, and for tuple ('Mountain View') we have the same bounds. In addition, PMax = 0.5 and th = 0.5.

In the second round, we answer Q2. Then, for tuple ('Sunnyvale') we have p_min = 0.9 and p_max = 1, and for tuple ('Mountain View') we have p_min = 0.5 and p_max = 0.6. Now PMax = 0.1 and th = 0.9. Because th > PMax and th is above the p_max of the ('Mountain View') tuple, we can halt and return ('Sunnyvale') as the top-1 answer.
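The following Python sketch captures TOPKBYTABLE's bound maintenance in a slightly simplified form (it always inserts newly seen tuples rather than pruning by th; the execute functions and their interface are assumptions for illustration):

    def top_k_by_table(reformulations, k):
        # reformulations: list of (execute_fn, Pr) pairs; execute_fn() runs one
        # reformulated query at the sources and returns its answer tuples.
        reformulations = sorted(reformulations, key=lambda qp: -qp[1])
        p_min, p_max = {}, {}
        pmax_unseen = 1.0                  # PMax: bound for not-yet-seen tuples
        for execute, pr in reformulations:
            answers = set(execute())
            for t in answers:
                if t not in p_min:         # first time t is generated
                    p_min[t], p_max[t] = 0.0, pmax_unseen
                p_min[t] += pr
            for t in set(p_min) - answers: # t missed this reformulation
                p_max[t] -= pr
            pmax_unseen -= pr
            top = sorted(p_min, key=p_min.get, reverse=True)[:k]
            if len(top) == k:
                th = p_min[top[-1]]        # k-th largest lower bound
                rest = [p_max[t] for t in p_min if t not in top]
                if th > pmax_unseen and all(th > pm for pm in rest):
                    return top             # halt early: top-k determined
        return sorted(p_min, key=p_min.get, reverse=True)[:k]

On the reformulations of Example 7 this returns ('Sunnyvale') after executing only Q1 and Q2.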

3.3.4 Answering aggregate queries

Finally, we discuss queries with the aggregate operators COUNT, SUM, AVG, MAX, and MIN, based on results from (Gal et al, 2009). We consider three common semantics for aggregate queries over probabilistic information: the range semantics returns the range of the aggregate (i.e., the minimum and the maximum value); the expected-value semantics returns the expected value of the aggregate; and the distribution semantics returns all possible values with their probabilities. Note that the answers under the former two semantics can be derived from the answer under the distribution semantics; in other words, the distribution semantics is the richest of the three. We next formally define the three semantics.

Definition 9 (Semantics of Aggregate Query). Let pm = (S, T, m̄) be a p-mapping, Q be an aggregate query over T, and D_S be an instance of S. Let V be the set of result values of evaluating Q on D_S w.r.t. pm under by-table (resp. by-tuple) semantics, and let Pr(v) be the probability of value v ∈ V.

1. Range semantics: The result is the interval [min(V), max(V)].
2. Expected-value semantics: The result is Σ_{v ∈ V} Pr(v) * v.
3. Distribution semantics: The result is a random variable X such that for each distinct value v ∈ V, Pr(X = v) = Pr(v).

According to the definition, there are six combinations of semantics for aggregate queries w.r.t. a p-mapping. Since results under the range or expected-value semantics can be derived from results under the distribution semantics in polynomial time, the complexity of query answering w.r.t. the former two semantics is no higher than that w.r.t. the distribution semantics; in fact, in some cases query answering w.r.t. the former two semantics can be performed more efficiently without obtaining the distribution. Table 1 summarizes the complexity results for each aggregate operator; we now explain them briefly.

Table 1 Complexity of answering aggregate queries under different semantics.

Semantics  Operator                   Range  Expected-Value  Distribution
By-table   COUNT, SUM, AVG, MIN, MAX  PTIME  PTIME           PTIME
By-tuple   COUNT                      PTIME  PTIME           PTIME
By-tuple   SUM                        PTIME  PTIME           ?
By-tuple   AVG, MIN, MAX              PTIME  ?               ?

In by-table query answering, we can enumerate all answers and compute their probabilities in polynomial time; thus, query answering is in PTIME for all semantics. In by-tuple query answering, we can enumerate all answers (without computing their probabilities) in polynomial time; thus, query answering under the range semantics (where we do not need to know the probability of each answer) is in PTIME. We can prove that under the expected-value semantics the answer for the SUM operator is the same under by-table and by-tuple semantics; thus, query answering for SUM under the by-tuple and expected-value semantics is in PTIME. For the COUNT operator, even query answering under the by-tuple semantics is in PTIME for the distribution semantics, and thus also for the other semantics; we illustrate this in the example below. For the rest of the combinations, we conjecture that query answering cannot be done in polynomial time, and its exact complexity remains open.
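As a direct reading of Definition 9, the following illustrative Python sketch derives the range and expected-value semantics from a distribution of result values; the distribution used is the COUNT distribution that Example 8 below computes.

    def range_semantics(dist):
        # dist maps each result value v to Pr(v)
        return (min(dist), max(dist))

    def expected_value_semantics(dist):
        return sum(p * v for v, p in dist.items())

    # The distribution semantics is dist itself, read as a random variable X
    # with Pr(X = v) = dist[v].
    dist = {0: 0.06, 1: 0.58, 2: 0.36}        # COUNT distribution of Example 8
    print(range_semantics(dist))               # (0, 2)
    print(expected_value_semantics(dist))      # 1.3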

Example 8. Continuing with the running example (Figure 2), consider the following query.

Qc: SELECT COUNT(*) FROM T
    WHERE mailing-addr = 'Sunnyvale'

Table 2 shows how the probability of each answer value changes as we process each source tuple under the by-tuple semantics.

Table 2 Trace of query answering in Example 8.

TupleID  COUNT = 0  COUNT = 1  COUNT = 2
1        0.6        0.4        -
2        0.06       0.58       0.36

After processing the first tuple (Alice), the probability of COUNT = 0 is 0.5 + 0.1 = 0.6 (m_1, m_3) and that of COUNT = 1 is 0.4 (m_2). After processing the second tuple (Bob), the probability of COUNT = 0 is the probability that COUNT = 0 after the first tuple times the probability of applying m_3 to the second tuple, i.e., 0.6 * 0.1 = 0.06. The probability of COUNT = 1 is 0.6 * (0.5 + 0.4) + 0.4 * 0.1 = 0.58; that is, either COUNT = 0 after we process the first tuple and we apply m_1 or m_2 to the second tuple, or COUNT = 1 after the first tuple and we apply m_3 to the second tuple. Similarly, the probability of COUNT = 2 is 0.4 * (0.5 + 0.4) = 0.36.
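The polynomial-time behavior of COUNT comes from the fact that the Table 2 trace is a dynamic program over the running count. A minimal sketch, using the same illustrative encoding of the p-mapping as in the earlier sketches:

    def count_distribution(source, mappings, predicate):
        # dist[c] = Pr[COUNT = c] over the tuples processed so far.
        dist = {0: 1.0}
        for tup in source:
            new_dist = {}
            for c, p in dist.items():
                for src_attr, pr in mappings:     # mapping choice for this tuple
                    c2 = c + (1 if predicate(tup[src_attr]) else 0)
                    new_dist[c2] = new_dist.get(c2, 0.0) + p * pr
            dist = new_dist
        return dist

    SOURCE = [
        {"current-addr": "Mountain View", "permanent-addr": "Sunnyvale",
         "email-addr": "alice@"},
        {"current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale",
         "email-addr": "bob@"},
    ]
    MAPPINGS = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]
    print(count_distribution(SOURCE, MAPPINGS, lambda v: v == "Sunnyvale"))
    # {0: 0.06, 1: 0.58, 2: 0.36}, matching Table 2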

3.4 Creating P-mappings

We now address the problem of generating a p-mapping between a source schema and a target schema. We begin by assuming we have a set of weighted correspondences between the source attributes and the target attributes. These weighted correspondences are created by a set of schema matching modules. However, as we explain shortly, there can be multiple p-mappings that are consistent with a given set of weighted correspondences, and the question is which of them to choose. We describe an approach to creating p-mappings that is based on choosing the mapping that maximizes the entropy of the probability assignment.

3.4.1 Computing weighted correspondences

A weighted correspondence between a pair of attributes specifies the degree of semantic similarity between them. Let S(s_1,...,s_m) be a source schema and T(t_1,...,t_n) be a target schema. We denote by C_{i,j}, i ∈ [1,m], j ∈ [1,n], the weighted correspondence between s_i and t_j, and by w_{i,j} the weight of C_{i,j}. The first step is to compute a weighted correspondence between every pair of attributes, which can be done by applying existing schema matching techniques.

Although weighted correspondences tell us the degree of similarity between pairs of attributes, they do not tell us which target attribute a source attribute should map to. For example, a target attribute mailing-address can be similar both to the source attribute current-addr and to permanent-addr, so it makes sense to map either of them to mailing-address in a schema mapping. In fact, given a set of weighted correspondences, there can be a set of p-mappings that are consistent with it. We can define the one-to-many relationship between sets of weighted correspondences and p-mappings by specifying when a p-mapping is consistent with a set of weighted correspondences.

Definition 10 (Consistent p-mapping). A p-mapping pm is consistent with a weighted correspondence C_{i,j} between a pair of source and target attributes if the sum of the probabilities of all mappings m ∈ m̄ containing correspondence (i,j) equals w_{i,j}; that is,

w_{i,j} = Σ_{m ∈ m̄, (i,j) ∈ m} Pr(m).

A p-mapping is consistent with a set of weighted correspondences C if it is consistent with each weighted correspondence C ∈ C.

However, not every set of weighted correspondences admits a consistent p-mapping. The following theorem shows under which conditions a consistent p-mapping exists, and it establishes a normalization factor for weighted correspondences that guarantees the existence of a consistent p-mapping.

Theorem 3. Let C be a set of weighted correspondences between a source schema S(s_1,...,s_m) and a target schema T(t_1,...,t_n).

- There exists a consistent p-mapping with respect to C if and only if (1) for every i ∈ [1,m], Σ_{j=1}^n w_{i,j} ≤ 1, and (2) for every j ∈ [1,n], Σ_{i=1}^m w_{i,j} ≤ 1.
- Let M' = max{max_i {Σ_{j=1}^n w_{i,j}}, max_j {Σ_{i=1}^m w_{i,j}}}. Then, for each i ∈ [1,m], Σ_{j=1}^n (w_{i,j} / M') ≤ 1, and for each j ∈ [1,n], Σ_{i=1}^m (w_{i,j} / M') ≤ 1.

Based on Theorem 3, we normalize the weighted correspondences generated as described previously by dividing them by M'; that is, w'_{i,j} = w_{i,j} / M'.

3.4.2 Generating p-mappings

To motivate our approach to generating p-mappings, consider the following example. Consider a source schema (A, B) and a target schema (A', B'). Assume we have computed the following weighted correspondences between the source and target attributes: w_{A,A'} = 0.6 and w_{B,B'} = 0.5 (the rest are 0). As we explained above, there are an infinite number of p-mappings that are consistent with this set of weighted correspondences; below we list two:

pm_1:
  m_1: {(A,A'), (B,B')}: 0.3
  m_2: {(A,A')}: 0.3
  m_3: {(B,B')}: 0.2
  m_4: empty: 0.2

pm_2:
  m_1: {(A,A'), (B,B')}: 0.5
  m_2: {(A,A')}: 0.1
  m_3: empty: 0.4

In a sense, pm_1 seems better than pm_2 because it assumes that the similarity between A and A' is independent of the similarity between B and B'.

In the general case, among the many p-mappings that are consistent with a set of weighted correspondences C, we choose the one with the maximum entropy; that is, the p-mapping whose probability distribution attains the maximum value of -Σ_{i=1}^l p_i log p_i. In the above example, pm_1 attains the maximum entropy. The intuition behind maximum entropy is that when we need to select among multiple possible distributions on a set of exclusive events, we choose the one that does not favor any of the events over the others. Hence, we choose the distribution that does not introduce new information that we did not have a priori. The principle of maximum entropy is widely used in other areas such as natural language processing.

To create the p-mapping, we proceed in two steps. First, we enumerate all possible one-to-one schema mappings between S and T that contain a subset of the correspondences in C. Second, we assign probabilities to each of the mappings in a way that maximizes the entropy of the resulting p-mapping.

Enumerating all possible schema mappings given C is trivial: for each subset of correspondences, if it corresponds to a one-to-one mapping, we consider the mapping as a possible mapping. Given the possible mappings m_1,...,m_l, we assign probabilities p_1,...,p_l to m_1,...,m_l by solving the following constrained optimization problem (OPT):

maximize -Σ_{k=1}^l p_k log p_k subject to:
1. ∀k ∈ [1,l]: 0 ≤ p_k ≤ 1,
2. Σ_{k=1}^l p_k = 1, and
3. ∀i,j: Σ_{k ∈ [1,l], (i,j) ∈ m_k} p_k = w_{i,j}.

We can apply existing technology to solve the OPT optimization problem. Although finding maximum-entropy solutions in general is costly, the experiments described in (Sarma et al, 2008) show that the execution time is reasonable for a one-time process.
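A minimal sketch of OPT for the (A,B)/(A',B') example above, assuming SciPy is available; it enumerates the one-to-one mappings and maximizes entropy with an off-the-shelf constrained solver.

    import numpy as np
    from scipy.optimize import minimize

    # The four one-to-one mappings consistent with w_{A,A'}=0.6, w_{B,B'}=0.5.
    mappings = [{"AA'", "BB'"}, {"AA'"}, {"BB'"}, set()]
    weights = {"AA'": 0.6, "BB'": 0.5}

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)             # avoid log(0)
        return float(np.sum(p * np.log(p)))    # minimize -H(p)

    constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
    for corr, w in weights.items():
        mask = np.array([corr in m for m in mappings], dtype=float)
        constraints.append({"type": "eq",
                            "fun": lambda p, mask=mask, w=w: float(mask @ p - w)})

    res = minimize(neg_entropy, x0=np.full(len(mappings), 0.25),
                   bounds=[(0.0, 1.0)] * len(mappings), constraints=constraints,
                   method="SLSQP")
    print(np.round(res.x, 3))   # -> [0.3, 0.3, 0.2, 0.2], i.e. pm_1

The solver recovers pm_1, the maximum-entropy p-mapping singled out above.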

3.5 Broader Classes of Mappings

In this section we describe several practical extensions to the basic mapping language. The query answering techniques and complexity results we have described carry over to these extensions.

GLAV mappings: The common formalism for schema mappings, GLAV (a.k.a. tuple-generating dependencies), is based on expressions of the form

m: ∀x (ϕ(x) → ∃y ψ(x,y)).

In the expression, ϕ is the body of a conjunctive query over S̄ and ψ is the body of a conjunctive query over T̄. A pair of instances D_S and D_T satisfies a GLAV mapping m if for every assignment of x in D_S that satisfies ϕ there exists an assignment of y in D_T that satisfies ψ.

We define general p-mappings to be triples of the form pgm = (S̄, T̄, gm̄), where gm̄ is a set {(gm_i, Pr(gm_i)) | i ∈ [1,n]}, such that for each i ∈ [1,n], gm_i is a general GLAV mapping. The definition of by-table semantics for such mappings is a simple generalization of Definition 6, and query answering can be conducted in PTIME. Extending by-tuple semantics to arbitrary GLAV mappings is much trickier than by-table semantics: it would involve considering mapping sequences whose length is the product of the number of tuples in each source table, and the results are much less intuitive.

Theorem 4. Let pgm be a general p-mapping between a source schema S̄ and a target schema T̄. Let D_S be an instance of S̄. Let Q be an SPJ query with only equality conditions over T̄. The problem of computing Q^table(D_S) with respect to pgm is in PTIME in the size of the data and the mapping.

Complex mappings: Complex mappings map a set of attributes in the source to a set of attributes in the target. For example, we can map the attribute address to the concatenation of street, city, and state.

Formally, a set correspondence between S and T is a relationship between a subset of attributes in S and a subset of attributes in T. Here, the function associated with the relationship specifies a single value for each of the target attributes given a value for each of the source attributes. Again, the actual functions are irrelevant to our discussion. A complex mapping is a triple (S, T, cm̄), where cm̄ is a set of set correspondences such that each attribute in S or T is involved in at most one set correspondence. A complex p-mapping is of the form pcm = {(cm_i, Pr(cm_i)) | i ∈ [1,n]}, where Σ_{i=1}^n Pr(cm_i) = 1.

Theorem 5. Let pcm be a complex schema p-mapping between schemas S̄ and T̄. Let D_S be an instance of S̄. Let Q be an SPJ query over T̄. The data complexity and mapping complexity of computing Q^table(D_S) with respect to pcm are PTIME. The data complexity of computing Q^tuple(D_S) with respect to pcm is #P-complete. The mapping complexity of computing Q^tuple(D_S) with respect to pcm is in PTIME.

Union mapping: Union mappings specify relationships such as: both attribute home-address and attribute office-address can be mapped to address. Formally, a union mapping is a triple (S, T, m̄), where m̄ is a set of mappings between S and T. Given a source relation D_S and a target relation D_T, we say D_S and D_T are consistent with respect to the union mapping if for each source tuple t and m ∈ m̄, there


Online and Offline Selling in Limit Order Markets Online and Offline Selling in Limit Order Markets Kevin L. Chang 1 and Aaron Johnson 2 1 Yahoo Inc. klchang@yahoo-inc.com 2 Yale University ajohnson@cs.yale.edu Abstract. Completely automated electronic

More information

Software Active Online Monitoring Under. Anticipatory Semantics

Software Active Online Monitoring Under. Anticipatory Semantics Software Active Online Monitoring Under Anticipatory Semantics Changzhi Zhao, Wei Dong, Ji Wang, Zhichang Qi National Laboratory for Parallel and Distributed Processing P.R.China 7/21/2009 Overview Software

More information

9.2 Summation Notation

9.2 Summation Notation 9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a

More information

Efficient Auditing For Complex SQL queries

Efficient Auditing For Complex SQL queries Efficient Auditing For Complex SQL queries Raghav Kaushik Microsoft Research skaushi@microsoft.com Ravi Ramamurthy Microsoft Research ravirama@microsoft.com ABSTRACT We address the problem of data auditing

More information

The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

More information

On the Unique Games Conjecture

On the Unique Games Conjecture On the Unique Games Conjecture Antonios Angelakis National Technical University of Athens June 16, 2015 Antonios Angelakis (NTUA) Theory of Computation June 16, 2015 1 / 20 Overview 1 Introduction 2 Preliminary

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy The Deep Web: Surfacing Hidden Value Michael K. Bergman Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy Presented by Mat Kelly CS895 Web-based Information Retrieval

More information

XML Data Integration

XML Data Integration XML Data Integration Lucja Kot Cornell University 11 November 2010 Lucja Kot (Cornell University) XML Data Integration 11 November 2010 1 / 42 Introduction Data Integration and Query Answering A data integration

More information

Factoring & Primality

Factoring & Primality Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

More information

PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2

More information

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

More information

Testing LTL Formula Translation into Büchi Automata

Testing LTL Formula Translation into Büchi Automata Testing LTL Formula Translation into Büchi Automata Heikki Tauriainen and Keijo Heljanko Helsinki University of Technology, Laboratory for Theoretical Computer Science, P. O. Box 5400, FIN-02015 HUT, Finland

More information

1 Construction of CCA-secure encryption

1 Construction of CCA-secure encryption CSCI 5440: Cryptography Lecture 5 The Chinese University of Hong Kong 10 October 2012 1 Construction of -secure encryption We now show how the MAC can be applied to obtain a -secure encryption scheme.

More information

CHAPTER 7 GENERAL PROOF SYSTEMS

CHAPTER 7 GENERAL PROOF SYSTEMS CHAPTER 7 GENERAL PROOF SYSTEMS 1 Introduction Proof systems are built to prove statements. They can be thought as an inference machine with special statements, called provable statements, or sometimes

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

An example of a computable

An example of a computable An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

More information

DEGREES OF ORDERS ON TORSION-FREE ABELIAN GROUPS

DEGREES OF ORDERS ON TORSION-FREE ABELIAN GROUPS DEGREES OF ORDERS ON TORSION-FREE ABELIAN GROUPS ASHER M. KACH, KAREN LANGE, AND REED SOLOMON Abstract. We construct two computable presentations of computable torsion-free abelian groups, one of isomorphism

More information

Chapter 5. SQL: Queries, Constraints, Triggers

Chapter 5. SQL: Queries, Constraints, Triggers Chapter 5 SQL: Queries, Constraints, Triggers 1 Overview: aspects of SQL DML: Data Management Language. Pose queries (Ch. 5) and insert, delete, modify rows (Ch. 3) DDL: Data Definition Language. Creation,

More information

Database Security. The Need for Database Security

Database Security. The Need for Database Security Database Security Public domain NASA image L-1957-00989 of people working with an IBM type 704 electronic data processing machine. 1 The Need for Database Security Because databases play such an important

More information

Efficiently Identifying Inclusion Dependencies in RDBMS

Efficiently Identifying Inclusion Dependencies in RDBMS Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany bauckmann@informatik.hu-berlin.de

More information

Relational Databases

Relational Databases Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 18 Relational data model Domain domain: predefined set of atomic values: integers, strings,... every attribute

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

SQL: Queries, Programming, Triggers

SQL: Queries, Programming, Triggers SQL: Queries, Programming, Triggers Chapter 5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 R1 Example Instances We will use these instances of the Sailors and Reserves relations in

More information

The Classes P and NP

The Classes P and NP The Classes P and NP We now shift gears slightly and restrict our attention to the examination of two families of problems which are very important to computer scientists. These families constitute the

More information

Example Instances. SQL: Queries, Programming, Triggers. Conceptual Evaluation Strategy. Basic SQL Query. A Note on Range Variables

Example Instances. SQL: Queries, Programming, Triggers. Conceptual Evaluation Strategy. Basic SQL Query. A Note on Range Variables SQL: Queries, Programming, Triggers Chapter 5 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Example Instances We will use these instances of the Sailors and Reserves relations in our

More information

So let us begin our quest to find the holy grail of real analysis.

So let us begin our quest to find the holy grail of real analysis. 1 Section 5.2 The Complete Ordered Field: Purpose of Section We present an axiomatic description of the real numbers as a complete ordered field. The axioms which describe the arithmetic of the real numbers

More information

Cleaning Uncertain Data with Quality Guarantees

Cleaning Uncertain Data with Quality Guarantees Cleaning Uncertain Data with Quality Guarantees Reynold Cheng Jinchuan Chen Xike Xie Department of Computing The Hong Kong Polytechnic University Hung Hom, Kowloon, Hong Kong {csckcheng,csjcchen,csxxie}@comp.polyu.edu.hk

More information

SQL: Queries, Programming, Triggers

SQL: Queries, Programming, Triggers SQL: Queries, Programming, Triggers CSC343 Introduction to Databases - A. Vaisman 1 R1 Example Instances We will use these instances of the Sailors and Reserves relations in our examples. If the key for

More information

IT2305 Database Systems I (Compulsory)

IT2305 Database Systems I (Compulsory) Database Systems I (Compulsory) INTRODUCTION This is one of the 4 modules designed for Semester 2 of Bachelor of Information Technology Degree program. CREDITS: 04 LEARNING OUTCOMES On completion of this

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

Lecture 7: NP-Complete Problems

Lecture 7: NP-Complete Problems IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Basic Course on Computational Complexity Lecture 7: NP-Complete Problems David Mix Barrington and Alexis Maciel July 25, 2000 1. Circuit

More information

On-Line/Off-Line Digital Signatures

On-Line/Off-Line Digital Signatures J. Cryptology (996) 9: 35 67 996 International Association for Cryptologic Research On-Line/Off-Line Digital Signatures Shimon Even Computer Science Department, Technion Israel Institute of Technology,

More information

Improving Data Quality: Consistency and Accuracy

Improving Data Quality: Consistency and Accuracy Improving Data Quality: Consistency and Accuracy Gao Cong 1 Wenfei Fan 2,3 Floris Geerts 2,4,5 Xibei Jia 2 Shuai Ma 2 1 Microsoft Research Asia 2 University of Edinburgh 4 Hasselt University 3 Bell Laboratories

More information

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

More information

Answering Structured Queries on Unstructured Data

Answering Structured Queries on Unstructured Data Answering Structured Queries on Unstructured Data Jing Liu University of Washington Seattle, WA 9895 liujing@cs.washington.edu Xin Dong University of Washington Seattle, WA 9895 lunadong @cs.washington.edu

More information

Information Systems SQL. Nikolaj Popov

Information Systems SQL. Nikolaj Popov Information Systems SQL Nikolaj Popov Research Institute for Symbolic Computation Johannes Kepler University of Linz, Austria popov@risc.uni-linz.ac.at Outline SQL Table Creation Populating and Modifying

More information

Simulation-Based Security with Inexhaustible Interactive Turing Machines

Simulation-Based Security with Inexhaustible Interactive Turing Machines Simulation-Based Security with Inexhaustible Interactive Turing Machines Ralf Küsters Institut für Informatik Christian-Albrechts-Universität zu Kiel 24098 Kiel, Germany kuesters@ti.informatik.uni-kiel.de

More information

A single minimal complement for the c.e. degrees

A single minimal complement for the c.e. degrees A single minimal complement for the c.e. degrees Andrew Lewis Leeds University, April 2002 Abstract We show that there exists a single minimal (Turing) degree b < 0 s.t. for all c.e. degrees 0 < a < 0,

More information

Introduction. Digital Signature

Introduction. Digital Signature Introduction Electronic transactions and activities taken place over Internet need to be protected against all kinds of interference, accidental or malicious. The general task of the information technology

More information

1 Message Authentication

1 Message Authentication Theoretical Foundations of Cryptography Lecture Georgia Tech, Spring 200 Message Authentication Message Authentication Instructor: Chris Peikert Scribe: Daniel Dadush We start with some simple questions

More information

Personalization of Web Search With Protected Privacy

Personalization of Web Search With Protected Privacy Personalization of Web Search With Protected Privacy S.S DIVYA, R.RUBINI,P.EZHIL Final year, Information Technology,KarpagaVinayaga College Engineering and Technology, Kanchipuram [D.t] Final year, Information

More information

The composition of Mappings in a Nautural Interface

The composition of Mappings in a Nautural Interface Composing Schema Mappings: Second-Order Dependencies to the Rescue Ronald Fagin IBM Almaden Research Center fagin@almaden.ibm.com Phokion G. Kolaitis UC Santa Cruz kolaitis@cs.ucsc.edu Wang-Chiew Tan UC

More information

Relational Database: Additional Operations on Relations; SQL

Relational Database: Additional Operations on Relations; SQL Relational Database: Additional Operations on Relations; SQL Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Overview The course packet

More information

Reading 13 : Finite State Automata and Regular Expressions

Reading 13 : Finite State Automata and Regular Expressions CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model

More information

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? introduction Many students seem to have trouble with the notion of a mathematical proof. People that come to a course like Math 216, who certainly

More information

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Automatic Annotation Wrapper Generation and Mining Web Database Search Result Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India

More information

Gambling Systems and Multiplication-Invariant Measures

Gambling Systems and Multiplication-Invariant Measures Gambling Systems and Multiplication-Invariant Measures by Jeffrey S. Rosenthal* and Peter O. Schwartz** (May 28, 997.. Introduction. This short paper describes a surprising connection between two previously

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Virtual Data Integration

Virtual Data Integration Virtual Data Integration Helena Galhardas Paulo Carreira DEI IST (based on the slides of the course: CIS 550 Database & Information Systems, Univ. Pennsylvania, Zachary Ives) Agenda Terminology Conjunctive

More information

A Model For Revelation Of Data Leakage In Data Distribution

A Model For Revelation Of Data Leakage In Data Distribution A Model For Revelation Of Data Leakage In Data Distribution Saranya.R Assistant Professor, Department Of Computer Science and Engineering Lord Jegannath college of Engineering and Technology Nagercoil,

More information

How To Find An Optimal Search Protocol For An Oblivious Cell

How To Find An Optimal Search Protocol For An Oblivious Cell The Conference Call Search Problem in Wireless Networks Leah Epstein 1, and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. lea@math.haifa.ac.il 2 Department of Statistics,

More information

A Probabilistic Database Extension

A Probabilistic Database Extension A Probabilistic Database Extension Ander de Keijzer Maurice van Keulen Faculty of EEMCS, University of Twente POBox 217, 7500 AE Enschede, The Netherlands {a.dekeijzer,m.vankeulen}@utwente.nl Abstract

More information

Functional Dependency Generation and Applications in pay-as-you-go data integration systems

Functional Dependency Generation and Applications in pay-as-you-go data integration systems Functional Dependency Generation and Applications in pay-as-you-go data integration systems Daisy Zhe Wang, Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy UC Berkeley and AT&T Research

More information

Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP.

Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Theorem (informal statement): There are no extendible methods in David Chalmers s sense unless P = NP. Explication: In his paper, The Singularity: A philosophical analysis, David Chalmers defines an extendible

More information

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Jialin Zhang Tsinghua University zhanggl02@mails.tsinghua.edu.cn Wei Chen Microsoft Research Asia weic@microsoft.com Abstract

More information

Arithmetic Coding: Introduction

Arithmetic Coding: Introduction Data Compression Arithmetic coding Arithmetic Coding: Introduction Allows using fractional parts of bits!! Used in PPM, JPEG/MPEG (as option), Bzip More time costly than Huffman, but integer implementation

More information

THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER

THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER JASPER DEANTONIO Abstract. This paper is a study of the Turing Degrees, which are levels of incomputability naturally arising from sets of natural numbers.

More information

From Databases to Natural Language: The Unusual Direction

From Databases to Natural Language: The Unusual Direction From Databases to Natural Language: The Unusual Direction Yannis Ioannidis Dept. of Informatics & Telecommunications, MaDgIK Lab University of Athens, Hellas (Greece) yannis@di.uoa.gr http://www.di.uoa.gr/

More information

Exponential time algorithms for graph coloring

Exponential time algorithms for graph coloring Exponential time algorithms for graph coloring Uriel Feige Lecture notes, March 14, 2011 1 Introduction Let [n] denote the set {1,..., k}. A k-labeling of vertices of a graph G(V, E) is a function V [k].

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

14.1 Rent-or-buy problem

14.1 Rent-or-buy problem CS787: Advanced Algorithms Lecture 14: Online algorithms We now shift focus to a different kind of algorithmic problem where we need to perform some optimization without knowing the input in advance. Algorithms

More information

MINIMAL BOOKS OF RATIONALES

MINIMAL BOOKS OF RATIONALES MINIMAL BOOKS OF RATIONALES José Apesteguía Miguel A. Ballester D.T.2005/01 MINIMAL BOOKS OF RATIONALES JOSE APESTEGUIA AND MIGUEL A. BALLESTER Abstract. Kalai, Rubinstein, and Spiegler (2002) propose

More information

1 if 1 x 0 1 if 0 x 1

1 if 1 x 0 1 if 0 x 1 Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

More information

DYNAMIC QUERY FORMS WITH NoSQL

DYNAMIC QUERY FORMS WITH NoSQL IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: 1.800.529.0165 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This course is designed to deliver the fundamentals of SQL and PL/SQL along

More information

Notes from Week 1: Algorithms for sequential prediction

Notes from Week 1: Algorithms for sequential prediction CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are

More information

Databases 2011 The Relational Model and SQL

Databases 2011 The Relational Model and SQL Databases 2011 Christian S. Jensen Computer Science, Aarhus University What is a Database? Main Entry: da ta base Pronunciation: \ˈdā-tə-ˌbās, ˈda- also ˈdä-\ Function: noun Date: circa 1962 : a usually

More information

A Framework for Personalized Healthcare Service Recommendation

A Framework for Personalized Healthcare Service Recommendation A Framework for Personalized Healthcare Service Recommendation Choon-oh Lee, Minkyu Lee, Dongsoo Han School of Engineering Information and Communications University (ICU) Daejeon, Korea {lcol, niklaus,

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: + 38516306373 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training delivers the

More information

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by SUBGROUPS OF CYCLIC GROUPS KEITH CONRAD 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by g = {g k : k Z}. If G = g, then G itself is cyclic, with g as a generator. Examples

More information