# Clean Answers over Dirty Databases: A Probabilistic Approach


the vertices of $G$ are the relations used in $q$, and there is an arc from $R_i$ to $R_j$ if a non-identifier attribute of $R_i$ is equated with the identifier attribute of $R_j$.

    RewriteClean(q)
    Given an SPJ query q of the form:
        select A_1, ..., A_n
        from R_1, ..., R_m
        where W
    Output a query Q_rew of the form:
        select A_1, ..., A_n, sum(R_1.prob * ... * R_m.prob)
        from R_1, ..., R_m
        where W
        group by A_1, ..., A_n

Figure 4. Query rewriting

We now define the class of rewritable queries.

Definition 7 (rewritable query). Let $q$ be an SPJ query, and let $G$ be the join graph of $q$. We say that $q$ is rewritable if:

1. all the joins involve the identifier of at least one relation,
2. $G$ is a tree,
3. a relation appears in the from clause at most once,
4. the identifier of the relation at the root of $G$ appears in the select clause.

The first condition rules out joins on two non-identifier attributes. However, the definition does allow joins between identifiers (which correspond to the keys of the dirty relations), or between a non-identifier and an identifier (for example, a foreign key join). The second condition states that the join graph of the query must be a tree. The third condition restricts each relation of the schema to appear at most once in the from clause. This rules out self-joins, for example, but places no restriction on the number of relations that can appear in the query.

The last condition is related to the tree structure of the join graph. In particular, we require that the identifier of the relation at the root of the tree appear in the select clause. Notice that the query of Example 7 violates this condition because it does not project on the id attribute of order, the relation at the root of the tree. For the data cleaning tasks we wish to support using our rewriting, including the identifier in the select clause is not an onerous restriction. Our rewritings are designed not to support general analysis queries, but rather to help a user understand the entities in the dirty database.

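
The rewriting of Figure 4 is mechanical enough to sketch in a few lines of Python. The snippet below is only an illustration under stated assumptions: the helper `rewrite_clean`, the tiny `customer` and `orders` tables, and all probability values are hypothetical, and the query is passed as already-separated select, from and where clauses rather than parsed from SQL text. It uses SQLite from the Python standard library in place of the database system used in the experiments.

```python
import sqlite3


def rewrite_clean(select_attrs, relations, where):
    """Return the rewritten SQL text for an SPJ query given by its three clauses."""
    attrs = ", ".join(select_attrs)
    # Product of the prob columns of every relation in the from clause.
    prob_product = " * ".join(f"{rel}.prob" for rel in relations)
    return (
        f"select {attrs}, sum({prob_product}) "
        f"from {', '.join(relations)} "
        f"where {where} "
        f"group by {attrs}"
    )


def demo():
    """Run a rewritten foreign-key join on a small, made-up dirty database."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        create table customer (id text, name text, mktsegment text, prob real);
        create table orders (id text, custfk text, prob real);
        """
    )
    # Hypothetical probabilities; within each identifier they sum to 1.
    con.executemany(
        "insert into customer values (?, ?, ?, ?)",
        [
            ("c1", "Mary", "building", 0.3),
            ("c1", "Mary", "banking", 0.4),
            ("c1", "Marion", "banking", 0.3),
            ("c2", "John", "building", 0.5),
            ("c2", "John S.", "building", 0.5),
            ("c3", "John", "banking", 1.0),
        ],
    )
    con.executemany(
        "insert into orders values (?, ?, ?)",
        [("o1", "c1", 0.8), ("o1", "c3", 0.2), ("o2", "c2", 1.0)],
    )
    sql = rewrite_clean(
        ["orders.id"],
        ["orders", "customer"],
        "orders.custfk = customer.id and customer.mktsegment = 'banking'",
    )
    return sql, con.execute(sql).fetchall()


sql, rows = demo()
print(sql)
print(rows)
```

The example query satisfies the four conditions of Definition 7: the only join equates a non-identifier attribute of `orders` with the identifier of `customer`, the join graph is a single arc rooted at `orders`, no relation is repeated, and `orders.id` is selected. Order `o1` is returned with probability 0.8 · (0.4 + 0.3) + 0.2 · 1.0 = 0.76 (up to floating-point rounding), while `o2` is not returned because no tuple of `c2` is in the banking segment.
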
The next theorem is the main result of this section; it states the correctness of RewriteClean(q) for the class of rewritable queries. We give the proof of the theorem in the full version of this paper [2].

Theorem 1. Let $q$ be a rewritable query. Then RewriteClean(q) computes the clean answers for $q$ on every dirty database.

Input:

- a set of tuples $T$,
- a clustering $C = \{c_1, c_2, \ldots, c_k\}$ of $T$, where $c_i$ is the identifier of cluster $i$,
- a distance measure $d$.

Output: for every tuple $t$ in $T$, a probability $prob(t)$.

Main procedure:

- (Step 1) For $i = 1, \ldots, k$:
  * compute the cluster representative $rep_i$ for $c_i$ by merging all the tuples that belong to it;
  * initialize the sum of distances for $c_i$: $S(c_i) = 0$.
- (Step 2) For each tuple $t \in T$ that belongs to $c_i$:
  * compute $d_t = d(t, rep_i)$, the distance of $t$ to the representative of its cluster;
  * add $d_t$ to $S(c_i)$.
- (Step 3) For each tuple $t \in T$ that belongs to $c_i$:
  * compute the similarity $s_t = 1 - d_t / S(c_i)$;
  * set $prob(t) = 1.0$ if $|c_i| = 1$, or $prob(t) = s_t / (|c_i| - 1)$ otherwise.

Figure 5. Assigning Tuple Probabilities

## 4 Computing tuple probabilities

We have been assuming that the results from some tuple matching method are given. Even if such methods produce a clustering of the database, they usually do not produce a probability (or some kind of quantitative characterization) indicating how likely a potential duplicate is to be in the clean database. In this section, we present an approach that assigns such probabilities. Given a clustering, we compute a representative of each cluster. Then, we assign each tuple a probability based on the distance of this tuple from its cluster representative.

We give a generic procedure to compute the probabilities in Figure 5. In this procedure, we make use of a given clustering $C$ and a distance measure $d$. The distance measure is used to assign a probability to each tuple of a cluster.
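
A short check shows why Step 3 of Figure 5 divides by $|c_i| - 1$. Reading the step as $s_t = 1 - d_t / S(c_i)$ and $prob(t) = s_t / (|c_i| - 1)$, and assuming $S(c_i) > 0$ and $|c_i| > 1$, the similarities within a cluster add up to $|c_i| - 1$, so the probabilities add up to one:

$$
\sum_{t \in c_i} s_t = \sum_{t \in c_i} \left( 1 - \frac{d_t}{S(c_i)} \right) = |c_i| - \frac{\sum_{t \in c_i} d_t}{S(c_i)} = |c_i| - 1,
\qquad
\sum_{t \in c_i} prob(t) = \frac{|c_i| - 1}{|c_i| - 1} = 1 .
$$
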
The actual measure we use in our work is given in Section 4.1. The first step of the algorithm computes the cluster representatives and initializes the sum of distances $S$ in each cluster, which will be used as a normalization factor to compute the probabilities. Notice that this is a general procedure: any clustering method and any distance measure can be employed. For each tuple $t$ within a particular cluster with identifier $c_i$, Step 2 calculates the distance of this tuple to the representative $rep_i$ of its cluster and adds this distance to $S(c_i)$. Finally, in order to compute the probability of a tuple $t$, Step 3 turns the distance computed in the previous step for this tuple into a similarity. Intuitively, if this distance is small, then the tuple under consideration is almost as good as the representative and will acquire a high probability.

At this point, we would like to emphasize the role of the cluster representatives. In a semi-automatic process of assigning probabilities, a person's intuition guides the ranking of a tuple as more (or less) likely to be in the clean database. In our fully automatic approach, the representatives contain the most common features that appear in the cluster. Thus, quantifying the similarity between the tuples of a cluster and its representative gives us a hint of which tuple is more likely to act as a representative by itself. Overall, the higher the similarity, the higher the probability of a tuple appearing in the clean database.

Finally, the similarity of each tuple computed by the algorithm in Figure 5 is normalized to fall in the interval [0.0, 1.0], and the resulting value is the probability of the tuple being in the clean database (denoted $prob(t)$). Of course, if a cluster consists of a single tuple (i.e., $|c_i| = 1$), the probability assigned to the tuple in this cluster is equal to one, as we are certain about its existence in the clean database.

It is important to find a representative and a measure of distance that reflect the semantics of clean answers. To illustrate our approach, consider the customer relation of Figure 6, where $c_1$, $c_2$ and $c_3$ are the cluster identifiers for the three clusters that exist in this relation.

| | name | mktsegmt | nation | address | cluster id |
|---|---|---|---|---|---|
| $t_1$ | Mary | building | USA | Jones Ave | $c_1$ |
| $t_2$ | Mary | banking | USA | Jones Ave | $c_1$ |
| $t_3$ | Marion | banking | USA | Jones ave | $c_1$ |
| $t_4$ | John | building | America | Arrow | $c_2$ |
| $t_5$ | John S. | building | USA | Arrow | $c_2$ |
| $t_6$ | John | banking | Canada | Baldwin | $c_3$ |

Figure 6. A dirty customer relation.

To do query answering over such a dirty database, we need a way of understanding how likely each tuple is to be in the clean database. Intuitively, in cluster $c_1$, $t_2$ seems to be the most likely tuple since it shares all of its values with at least one other tuple in the group. This intuition is based on our query answering semantics. For cluster $c_1$, Mary is the most likely name value, banking the most likely mktsegmt value, USA is the certain nation value, and so on. Our method formalizes this intuition.

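
The generic procedure of Figure 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_probabilities`, the toy one-dimensional data, the mean as merge operation and the absolute difference as distance are all assumptions made for the example. The uniform fallback for a cluster whose distances sum to zero is also an assumption, since the procedure as stated does not cover that case.

```python
from collections import defaultdict


def assign_probabilities(values, cluster_of, merge, distance):
    """Assign prob(t) to every tuple id, given pluggable merge and distance functions."""
    clusters = defaultdict(list)
    for t, cid in cluster_of.items():
        clusters[cid].append(t)

    prob = {}
    for members in clusters.values():
        if len(members) == 1:
            # A singleton cluster is certain to be in the clean database.
            prob[members[0]] = 1.0
            continue
        # Step 1: build the representative by merging all tuples of the cluster.
        rep = merge([values[t] for t in members])
        # Step 2: distance of each tuple to the representative, and their sum S.
        dist = {t: distance(values[t], rep) for t in members}
        total = sum(dist.values())
        # Step 3: turn distances into similarities, then normalize.
        for t in members:
            if total == 0:
                # Assumption: identical tuples share the probability mass evenly.
                prob[t] = 1.0 / len(members)
            else:
                similarity = 1.0 - dist[t] / total
                prob[t] = similarity / (len(members) - 1)
    return prob


# Toy data (hypothetical): numeric "tuples" a, b, c in cluster k1 and d alone in k2.
values = {"a": 1.0, "b": 2.0, "c": 6.0, "d": 5.0}
cluster_of = {"a": "k1", "b": "k1", "c": "k1", "d": "k2"}
prob = assign_probabilities(
    values,
    cluster_of,
    merge=lambda vs: sum(vs) / len(vs),
    distance=lambda v, rep: abs(v - rep),
)
print(prob)
```

In the toy cluster `k1` the representative is 3.0, the distances are 2, 1 and 3, and the resulting probabilities are 1/3, 5/12 and 1/4: the tuple closest to the representative receives the highest probability, and the three values sum to one.
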
### 4.1 Dealing with Categorical Data

When we compare tuples, it is often assumed that there exists some well-defined notion of similarity, or distance, between them. When the objects are defined by a set of numerical attributes, there are natural definitions of distance based on geometric analogies. These definitions rely on the semantics of the data values. For example, the values \$10,000 and \$9,500 are more similar than \$10,000 and \$1. In many domains, data records are composed of a set of descriptive attributes, many of which are neither numerical nor inherently ordered in any way. For example, in the setting of our customer relation, it is not immediately obvious what the distance, or similarity, is between the values banking and building, or the tuples of John S. and Mary. Values without an inherent distance measure defined among them are called categorical.

We deal here with the problem of calculating the distance among tuples defined over categorical values. We employ a distance measure, based on information loss, that has been used in the clustering of large data sets with categorical values [4]. This distance measure provides a natural way of quantifying the duplication of values within tuples, and it has been used to detect redundancy in dirty legacy databases [3]. In the following subsections, we explain the data representation and the actual computation of the distance. Note that when a distance measure between tuples (e.g., string edit distance) is available, our method can incorporate it. However, since such measures are well studied in the deduplication and approximate matching literature, we do not investigate them further here.

#### Data representation

We now introduce some conventions and notation that we will use in the rest of the section. We assume that we have a set $\mathbf{T}$ of $n$ tuples. The tuples are defined over attributes $A_1, A_2, \ldots, A_m$.
We assume that the domain $V_i$ of each attribute $A_i$ is named in such a way that identical values from different attributes are treated as distinct values. For each attribute $A_i$, a tuple $t \in \mathbf{T}$ takes exactly one value from the set $V_i$. Let $\mathbf{V} = V_1 \cup \cdots \cup V_m$ denote the set of all possible attribute values, and let $|\mathbf{V}|$ denote the size of $\mathbf{V}$. We represent the data as an $n \times |\mathbf{V}|$ matrix $M$. In this matrix, the value of $M[t, v]$ is 1 if tuple $t \in \mathbf{T}$ contains attribute value $v \in \mathbf{V}$, and zero otherwise. Each tuple contains one value for each attribute, so each tuple vector contains exactly $m$ ones.

Now, let $T$ and $V$ be random variables that range over $\mathbf{T}$ (the set of tuples) and $\mathbf{V}$ (the set of attribute values), respectively. We normalize matrix $M$ so that the entries of each row sum up to 1. For a tuple $t \in \mathbf{T}$, the corresponding row of the normalized matrix holds the conditional probability distribution $p(V \mid t)$. Since each tuple contains exactly $m$ attribute values, for each $v \in \mathbf{V}$, $p(v \mid t) = 1/m$ if $v$ appears in tuple $t$, and zero otherwise.

Example 8. Table 1 contains the normalized matrix $M$ for the customer relation. In this matrix, given tuple $t_1$, we have a probability of 0.25 of choosing one of the four values that appear in it.

| | Mary | Marion | John | John S. | banking | building | USA | Canada | America | Jones Ave | Jones ave | Arrow | Baldwin | Cluster Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $t_1$ | 0.25 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 0 | 0.25 | 0 | 0 | 0 | $c_1$ |
| $t_2$ | 0.25 | 0 | 0 | 0 | 0.25 | 0 | 0.25 | 0 | 0 | 0.25 | 0 | 0 | 0 | $c_1$ |
| $t_3$ | 0 | 0.25 | 0 | 0 | 0.25 | 0 | 0.25 | 0 | 0 | 0 | 0.25 | 0 | 0 | $c_1$ |
| $t_4$ | 0 | 0 | 0.25 | 0 | 0 | 0.25 | 0 | 0 | 0.25 | 0 | 0 | 0.25 | 0 | $c_2$ |
| $t_5$ | 0 | 0 | 0 | 0.25 | 0 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0.25 | 0 | $c_2$ |
| $t_6$ | 0 | 0 | 0.25 | 0 | 0.25 | 0 | 0 | 0.25 | 0 | 0 | 0 | 0 | 0.25 | $c_3$ |

Table 1. The normalized customer matrix

#### Tuple and Cluster Summaries

Our task now is to build the cluster representatives, against which each tuple of a cluster will be compared. We summarize a cluster of tuples in a Distributional Cluster Feature (DCF) [4]. We will use the information in the relevant DCFs to compute the distance between a cluster summary (representative) and a tuple.

Let $\mathbf{T}$ denote a set of tuples over a set $\mathbf{V}$ of attribute values, and let $T$ and $V$ be the corresponding random variables, as described earlier. Also let $\mathbf{C}$ denote a clustering of the tuples in $\mathbf{T}$ and let $C$ be the corresponding random variable. The Distributional Cluster Feature (DCF) of a cluster with identifier $c_i$ is defined by the pair

$$\mathrm{DCF}(c_i) = \big( |c_i|,\ p(V \mid c_i) \big)$$

where $|c_i|$ is the cardinality of cluster $c_i$, and $p(V \mid c_i)$ is the conditional probability distribution of the attribute values given the cluster $c_i$. In practice, the cluster representative $rep_i$ of cluster $c_i$ is its DCF, $\mathrm{DCF}(c_i)$.

If $c_i$ consists of a single tuple, then $|c_i| = 1$, and $p(V \mid c_i)$ is computed as described in the previous subsection. For larger clusters, the DCF is computed recursively as follows. Let $c^*$ denote the cluster we obtain by merging two clusters with identifiers $c_1$ and $c_2$. The DCF of the cluster $c^*$ is equal to $\mathrm{DCF}(c^*) = \big( |c^*|,\ p(V \mid c^*) \big)$, where $|c^*|$ and $p(V \mid c^*)$ are computed using the following equations:

$$|c^*| = |c_1| + |c_2|$$

$$p(V \mid c^*) = \frac{|c_1|}{|c_1| + |c_2|}\, p(V \mid c_1) + \frac{|c_2|}{|c_1| + |c_2|}\, p(V \mid c_2)$$

Intuitively, when computing a new cluster representative, its cardinality becomes the sum of the cardinalities of the merged sub-clusters, while the conditional probability distribution of its values is the weighted average of the conditional probability distributions of the merged sub-clusters. Note that a cluster representative may not necessarily belong to the initial set of tuples.

Table 2 depicts the cluster representatives for the customer relation as dictated by the cluster identifiers given in the last column of Table 1. In this table, we see that $rep_1$ summarizes three tuples that all contain the value USA, since the probability of this value in this cluster remains the same as in the initial tuples. Similarly, both tuples $t_4$ and $t_5$ contain the values building and Arrow, which is reflected in their representative, $rep_2$.
Finally, $rep_3$ contains the last tuple of the relation, $t_6$.

#### Distance as Information Loss

Our notion of distance $d$ between tuples and summaries that include categorical values is based on an intuitive calculation of the loss of information when two distributions are merged. Information here is defined in terms of Information Theory [12]. More precisely, we use the mutual information $I(X; Y)$ of two random variables $X$ and $Y$ to quantify the information that variable $X$ contains about $Y$ and vice versa. To compute the value of $I(X; Y)$, we need the probability distributions of $X$ and $Y$. In our case, we quantify the information that the clusters in $\mathbf{C}$ contain about the values in $\mathbf{V}$, and we use the distributions that describe the tuples and clusters stored in the corresponding DCFs. The main objective of quantifying the loss of information is to determine which tuples share as many values as possible with their cluster representative. Thus, given the DCFs of summaries $s_1$ and $s_2$, the information loss (and hence the distance) between $s_1$ and $s_2$ is given by the following expression:

$$d(s_1, s_2) = I(C; V) - I(C^*; V)$$

where $C^*$ denotes the clustering after merging the summaries $s_1$ and $s_2$.

Example 9. Table 3 shows the distance between each tuple of the customer relation (3rd column) and its corresponding representative (2nd column). The same table contains the calculation of the similarity (4th column) and the final probability of each tuple being in the clean database (5th column). Notice that the smaller the distance of a tuple to its representative, the higher the similarity and, consequently, the higher the probability that it belongs to the clean database.

Table 3. Probability calculation in customer. Columns: tuple, rep, $d(t, rep)$, $s_t$, $p(t)$; rows: $t_1$, $t_2$ and $t_3$ compared with $rep_1$, $t_4$ and $t_5$ with $rep_2$, and $t_6$ with $rep_3$.

There are several observations we can make from Table 3. Concerning the tuples of cluster $c_1$, tuple $t_2$ is the most probable one to be in the clean database, which agrees with our intuition.
This tuple contains the values with the highest frequency in this cluster, and thus it is the one that can best replace its cluster representative. On the other hand, $c_2$ contains just two tuples, which are equally likely to be in the clean database. Finally, we have no uncertainty about having $t_6$ in the clean database, since it constitutes a cluster summary of its own.

In the next subsection, we present a brief discussion of the effectiveness of our method for computing tuple probabilities (a more detailed discussion can be found in the extended version of this paper [2]).
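
The computation behind Example 9 can be sketched in Python for the customer relation of Figure 6. The sketch rests on several assumptions that are not stated in the text above: the information loss $I(C; V) - I(C^*; V)$ is evaluated with the closed form known from the information bottleneck literature, $(p(s_1) + p(s_2))$ times the weighted Jensen-Shannon divergence of the two distributions; $p(s) = |s| / n$ with $n$ the number of tuples in the relation; and logarithms are taken base 2. All function names are hypothetical. Since Step 3 of Figure 5 only uses ratios of distances within a cluster, the overall scaling does not affect the resulting probabilities.

```python
import math
from collections import defaultdict

CUSTOMER = {
    "t1": ("Mary", "building", "USA", "Jones Ave"),
    "t2": ("Mary", "banking", "USA", "Jones Ave"),
    "t3": ("Marion", "banking", "USA", "Jones ave"),
    "t4": ("John", "building", "America", "Arrow"),
    "t5": ("John S.", "building", "USA", "Arrow"),
    "t6": ("John", "banking", "Canada", "Baldwin"),
}
CLUSTER = {"t1": "c1", "t2": "c1", "t3": "c1", "t4": "c2", "t5": "c2", "t6": "c3"}


def tuple_dcf(row):
    """DCF of one tuple: cardinality 1 and p(v|t) = 1/m for each of its m values."""
    m = len(row)
    # Keys carry the attribute position so equal strings in different attributes differ.
    return 1, {(i, v): 1.0 / m for i, v in enumerate(row)}


def merge(dcf1, dcf2):
    """Merge two DCFs: cardinalities add, distributions are averaged by cardinality."""
    (n1, p1), (n2, p2) = dcf1, dcf2
    n = n1 + n2
    p = defaultdict(float)
    for v, x in p1.items():
        p[v] += n1 / n * x
    for v, x in p2.items():
        p[v] += n2 / n * x
    return n, dict(p)


def info_loss(dcf1, dcf2, n_total):
    """Information loss of merging two summaries (assumed Jensen-Shannon closed form)."""
    (n1, p1), (n2, p2) = dcf1, dcf2
    _, mix = merge(dcf1, dcf2)

    def kl(p, q):
        return sum(x * math.log2(x / q[v]) for v, x in p.items() if x > 0)

    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return (n1 + n2) / n_total * (w1 * kl(p1, mix) + w2 * kl(p2, mix))


def probabilities(relation, cluster_of):
    """Steps 1-3 of Figure 5 with DCF representatives and information-loss distance."""
    members = defaultdict(list)
    for t in relation:
        members[cluster_of[t]].append(t)
    prob = {}
    for ts in members.values():
        if len(ts) == 1:
            prob[ts[0]] = 1.0
            continue
        dcfs = {t: tuple_dcf(relation[t]) for t in ts}
        rep = dcfs[ts[0]]
        for t in ts[1:]:
            rep = merge(rep, dcfs[t])
        dist = {t: info_loss(dcfs[t], rep, len(relation)) for t in ts}
        total = sum(dist.values())
        for t in ts:
            # Assumption: identical tuples (total == 0) share the mass evenly.
            s = 1.0 - dist[t] / total if total > 0 else (len(ts) - 1) / len(ts)
            prob[t] = s / (len(ts) - 1)
    return prob


p = probabilities(CUSTOMER, CLUSTER)
print(p)
```

Under these assumptions the sketch reproduces the qualitative observations made about Table 3: within $c_1$ the ranking is $t_2$, then $t_1$, then $t_3$; the two tuples of $c_2$ each receive probability 0.5; and $t_6$ receives probability 1. The numeric values it produces are not claimed to match those of Table 3.
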

| | Mary | Marion | John | John S. | banking | building | USA | Canada | America | Jones Ave | Jones ave | Arrow | Baldwin | $\lvert c \rvert$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $rep_1$ | 0.167 | 0.083 | 0 | 0 | 0.167 | 0.083 | 0.25 | 0 | 0 | 0.167 | 0.083 | 0 | 0 | 3 |
| $rep_2$ | 0 | 0 | 0.125 | 0.125 | 0 | 0.25 | 0.125 | 0 | 0.125 | 0 | 0 | 0.25 | 0 | 2 |
| $rep_3$ | 0 | 0 | 0.25 | 0 | 0.25 | 0 | 0 | 0.25 | 0 | 0 | 0 | 0 | 0.25 | 1 |

Table 2. The three cluster representatives for customer tuples

### 4.2 Qualitative Evaluation

Our technique for assigning tuple probabilities is performed on each relation of a database separately. Hence, we use a single relation to evaluate the technique and investigate whether the assigned probabilities agree with human intuition.

In our experiments [2], we used clusters from the Cora data set [19]. This data set contains computer science research papers integrated from several sources. It has been used in other data cleaning projects [19, 8], and we take advantage of previous labelings of the tuples into clusters. As an example, we considered a cluster that corresponds to a publication by Robert E. Schapire. It contains 56 tuples; due to space considerations, the full cluster is not given here (but can be found in our technical report [2]). In Table 4, we give the most frequent values in all attributes of this cluster, as well as the two most likely and the two least likely tuples as ranked by the assigned probabilities.

Table 4 confirms the intuitive ranking of the tuples within a cluster. In particular, the most likely tuple shares all its values with the set of most frequent values, while the next most likely tuple shares all but one of these values (the value of the volume attribute). On the other hand, the second least likely tuple (the penultimate tuple in the table) corresponds to a different publication, and thus it should have been placed in a different cluster. Finally, although the least likely tuple (the last tuple in the table) corresponds to the same publication by Schapire, its values are stored differently from those of the rest of the tuples for this publication.

## 5 Experimental Evaluation

In this section, we evaluate the efficiency of our approach.
All the experiments were performed on a machine with a 2.8 GHz Intel Pentium 4 CPU and 1GB of RAM (80% of which was allocated to the database manager). The queries were run on DB2 UDB Version under Windows XP Professional.

### 5.1 Data Set Generation

In order to assess the efficiency of our techniques on very large data sets, we used a synthetic data generator, the UIS Database Generator.⁴ This generator was written by Mauricio Hernández and has been used for the evaluation of duplicate detection [17]. Given a parameter that controls the number of tuples and a parameter that controls the number of clusters, the generator creates clusters of potential duplicates. We use this generator to produce data that conforms to the schema of the TPC-H specification (http://www.tpc.org/tpch).

⁴ A copy of the generator can be found at:

### 5.2 Parameter Setting

The parameters for the UIS generator that we set in our experiments are presented below.

- Scaling Factor (sf). The scaling factor is used to control the size of the relations created for the TPC-H specification. If sf = 1, we create a data set of size 1GB (approximately 8 million tuples); if sf = 2, the size is 2GB (approximately 16 million tuples), and so on.
- Inconsistency Factor (if). The UIS generator creates a number of clusters, where each cluster contains on average the same number of tuples. The cluster cardinalities are drawn from a uniform distribution whose mean is the value of if. More precisely, if $x$ is the value of if, the generator creates cluster cardinalities between 1 and $2x - 1$. As the value of if increases, the degree of inconsistency increases.

In our experiments, we create data tables for the TPC-H specification that are of size 0.1GB, 0.5GB, 1GB and 2GB, and clusters with an average of 1, 2, 3, 4, 5 and 25 tuples per cluster.

### 5.3 Efficiency Evaluation

We first evaluate the time required to annotate the database with probabilities. Then, we study the performance of our method for producing clean query answers.

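
The cluster-size distribution described for the inconsistency factor in Section 5.2 can be illustrated with a few lines of Python. This is only a sketch of the stated distribution, not code from the UIS generator; the function name `cluster_sizes` and the `seed` parameter are assumptions made for the example.

```python
import random


def cluster_sizes(num_clusters, inconsistency_factor, seed=0):
    """Draw cluster cardinalities uniformly from 1 .. 2x - 1, where x is the if value."""
    rng = random.Random(seed)
    upper = 2 * inconsistency_factor - 1
    # randint is inclusive on both ends, so the mean of the distribution is x.
    return [rng.randint(1, upper) for _ in range(num_clusters)]


sizes = cluster_sizes(10_000, inconsistency_factor=5)
mean_size = sum(sizes) / len(sizes)
print(min(sizes), max(sizes), mean_size)
```

With if = 1 the range collapses to the single value 1, so every cluster is a singleton and the database is clean; with if = 5 the sizes range from 1 to 9 and average close to 5.
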
#### Probability Computation

We used the technique of Section 4 to produce the probabilities. For the propagation of identifiers, we used the approach that replaces the values of the original keys of the relations with the identifier selected by the tuple matching tool. We experimented with data sets of size 1GB (sf = 1), and values 1, 5 and 25 for the if parameter.⁵ The total execution time for the propagation and probability calculation was, for all databases, less than 30 minutes. This is a reasonable time for an off-line technique.

⁵ Notice that the value if = 1 corresponds to a completely clean database, and hence the computation of probabilities need only assign a probability of 1.0 to each tuple. However, since this is an off-line procedure, we decided to keep an un-optimized version of our method for if = 1, in order to get a baseline reference.

The propagation technique is independent of the number of tuples in each cluster and is only sensitive to the total size of the relations. However, the time to compute probabilities increases as the number of tuples in each cluster increases. This happens because more tuples are used in the computation of the cluster representatives. We experimentally validate this fact with the results in Figure 7. This figure depicts the time taken to propagate and compute the probabilities for the largest relation of our database, the lineitem relation, as well as the time required to perform one linear scan over this relation.

Figure 7. Offline times for lineitem (y-axis: execution time in seconds; x-axis: if; series: Propagation, Probability Calculation, Linear Scan)

The running time for the computation of probabilities grows as the number of tuples within each cluster increases. When the value of if is small, the distance calculation of each tuple to its representative dominates the time, while with larger if values, it is the time of the creation of representatives that prevails. Hence, when we move from if = 1 to if = 5, the difference in computation time for the probabilities is mainly due to the merging of more tuples into cluster representatives. This difference is more pronounced as we move from if = 5 to if = 25.

| | author | title | venue | volume | year | pages |
|---|---|---|---|---|---|---|
| Most frequent values | robert e. schapire | the strength of weak learnability | machine learning | 5(2) | | |
| Top-2 tuples | robert e. schapire | the strength of weak learnability | machine learning | 5(2) | | |
| | robert e. schapire | the strength of weak learnability | machine learning | | | |
| Bottom-2 tuples | r. schapire | on the strength of weak learnability | proc of the 30th i.e.e.e. symposium... | NULL | 1989 | pp |
| | schapire, r.e., | the strength of weak learnability | machine learning | 5 | 2 (1990) | pp |

Table 4. Example from the Cora data set

#### Clean Query Answering

We performed our experiments on thirteen queries from the TPC-H specification, which contain from one to six joins. In particular, we focused on queries 1, 2, 3, 4, 6, 9, 10, 11, 12, 14, 17, 18, and 20 from the specification. The only change that we made to the queries was removing the aggregate expressions. The queries in the TPC-H specification are parameterized. For all queries, we used the parameters suggested by the standard for query validation. The thirteen queries used in the experiments appear in the full version of this paper [2].

For each instance, we created indices on the identifier, and ran the DB2 RUNSTATS command on all attributes.

In Figure 8, we show the running times of the thirteen queries on a 1GB database with an average cluster size of 3 tuples. Notice that the overhead of running the rewritten queries is not significant. All queries (except Query 9) execute within 1.5 times the time of the original query. Notably, the rewritten versions of eight queries take less than 1.05 times the execution time of the original query. These are Queries 2, 4, 6, 11, 14, 17, 18, and 20. Note that the running times for Query 9 and its rewriting extend well beyond the scale of this graph (the times are indicated in numbers at the top of the figure). This query has six joins and a high selectivity (a large fraction of tuples satisfy its conditions). With a large number of duplicates satisfying the query, the grouping and aggregation of the rewritten query become costly. As a result, this rewriting has the highest overhead (1.8 times the time of the original query).

In Figure 9, we show the effect of cluster size on the running time. These running times correspond to Query 3, which we show next. This query contains a three-way join, three selection conditions, and an order by clause.

    select l_orderkey, l_extendedprice*(1-l_discount) as revenue,
           o_orderdate, o_shippriority
    from customer, orders, lineitem
    where c_mktsegment = 'BUILDING'
      and c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and o_orderdate <
      and l_shipdate >
    order by revenue desc, o_orderdate

We report results on 1GB instances with if = 1, 2, 3, 4 and 5. The solid lines correspond to the running times of Query 3 and its rewriting. The running time of both queries increases with the average size of the clusters. The reason for this behavior is that the size of the result set for both queries increases with the number of tuples per cluster, because a tuple may join with more tuples of another relation.
Therefore, the order by of the original query and the order by and group by of the rewritten query become more costly. In order to gain better insight into the performance of the rewritten queries, we analyzed the running time of Query 3 and its rewritten version when the order by clause is removed. We show the results in Figure 9 using dashed lines. In this case, the running time of the original query without order by is not affected by the size of the clusters, as

### Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1 21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

### IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

### Efficiently Identifying Inclusion Dependencies in RDBMS

Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany bauckmann@informatik.hu-berlin.de

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### Designing and Using Views To Improve Performance of Aggregate Queries

Designing and Using Views To Improve Performance of Aggregate Queries Foto Afrati 1, Rada Chirkova 2, Shalu Gupta 2, and Charles Loftis 2 1 Computer Science Division, National Technical University of Athens,

### 2. Basic Relational Data Model

2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that

### Relational Model CENG 351 1

Relational Model CENG 351 1 Relational Database: Definitions Relational database: a set of relations Relation: made up of 2 parts: Instance : a table, with rows and columns. #Rows = cardinality, #fields

### virtual class local mappings semantically equivalent local classes ... Schema Integration

Data Integration Techniques based on Data Quality Aspects Michael Gertz Department of Computer Science University of California, Davis One Shields Avenue Davis, CA 95616, USA gertz@cs.ucdavis.edu Ingo

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### Course Notes on The Bases of the Relational Model

Course Notes on The Bases of the Relational Model Intuitive View of Relations Popular view of the relational model = information is structured as 2-dimensional tables of simple values (with lines, or rows,

### Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques

Enforcing Data Quality Rules for a Synchronized VM Log Audit Environment Using Transformation Mapping Techniques Sean Thorpe 1, Indrajit Ray 2, and Tyrone Grandison 3 1 Faculty of Engineering and Computing,

### Efficient Auditing For Complex SQL queries

Efficient Auditing For Complex SQL queries Raghav Kaushik Microsoft Research skaushi@microsoft.com Ravi Ramamurthy Microsoft Research ravirama@microsoft.com ABSTRACT We address the problem of data auditing

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Unraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets

Unraveling versus Unraveling: A Memo on Competitive Equilibriums and Trade in Insurance Markets Nathaniel Hendren January, 2014 Abstract Both Akerlof (1970) and Rothschild and Stiglitz (1976) show that

### Creating Probabilistic Databases from Duplicated Data

VLDB Journal manuscript No. (will be inserted by the editor) Creating Probabilistic Databases from Duplicated Data Oktie Hassanzadeh Renée J. Miller Received: 14 September 2008 / Revised: 1 April 2009

### Data Quality in Information Integration and Business Intelligence

Data Quality in Information Integration and Business Intelligence Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada : Faculty Fellow of the IBM Center for Advanced Studies

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### 2 Associating Facts with Time

TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points

### Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive

### The Entity-Relationship Model

The Entity-Relationship Model 221 After completing this chapter, you should be able to explain the three phases of database design, Why are multiple phases useful? evaluate the significance of the Entity-Relationship

### The Relational Model. Why Study the Relational Model? Relational Database: Definitions

The Relational Model Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Why Study the Relational Model? Most widely used model. Vendors: IBM, Microsoft, Oracle, Sybase, etc. Legacy systems in

### Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 8 Functional Dependencies and Normal Forms Two different kinds

### Data and Its Structure. The Relational Data Model. Physical Data Level. Conceptual Data Level. Conceptual Data Level (con t)

Data and Its Structure The Relational Data Model Chapter 4 Data is actually stored as bits, but it is difficult to work with data at this level. It is convenient to view data at different levels of abstraction.

### PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2

### Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology

COS 597A: Principles of Database and Information Systems Relational model Relational model A formal (mathematical) model to represent objects (data/information), relationships between objects Constraints

### STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

### ICS 434 Advanced Database Systems

ICS 434 Advanced Database Systems Dr. Abdallah Al-Sukairi sukairi@kfupm.edu.sa Second Semester 2003-2004 (032) King Fahd University of Petroleum & Minerals Information & Computer Science Department Outline

### Data Preprocessing. Week 2

Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 7 ER Model to Relational Mapping Hello and welcome to the next

### The Trip Scheduling Problem

The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems

### 1 Gaussian Elimination

Contents 1 Gaussian Elimination 1.1 Elementary Row Operations 1.2 Some matrices whose associated system of equations are easy to solve 1.3 Gaussian Elimination 1.4 Gauss-Jordan reduction and the Reduced

### Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

### The process of database development. Logical model: relational DBMS. Relation

The process of database development Reality (Universe of Discourse) Relational Databases and SQL Basic Concepts The 3rd normal form Structured Query Language (SQL) Conceptual model (e.g. Entity-Relationship

### The Relational Model. Why Study the Relational Model?

The Relational Model Chapter 3 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Why Study the Relational Model?

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### Min-cost flow problems and network simplex algorithm

Min-cost flow problems and network simplex algorithm The particular structure of some LP problems can be sometimes used for the design of solution techniques more efficient than the simplex algorithm.

### Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

### On-line Data De-duplication. Ιωάννης Κρομμύδας

On-line Data De-duplication Ιωάννης Κρομμύδας Roadmap Data Cleaning Importance Data Cleaning Tasks Challenges for Fuzzy Matching Baseline Method (fuzzy match data cleaning) Improvements: online data cleaning

### A Brief Introduction to Property Testing

A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underlie the study of property testing. It is meant to serve as

### Efficient Integration of Data Mining Techniques in Database Management Systems

Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

### Test of Hypotheses. Since the Neyman-Pearson approach involves two statistical hypotheses, one has to decide which one

Test of Hypotheses Hypothesis, Test Statistic, and Rejection Region Imagine that you play a repeated Bernoulli game: you win \$1 if head and lose \$1 if tail. After 10 plays, you lost \$2 in net (4 heads

### Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 1a Conceptual Design Hello in today session we would be looking

### XML Data Integration

XML Data Integration Lucja Kot Cornell University 11 November 2010 Lucja Kot (Cornell University) XML Data Integration 11 November 2010 1 / 42 Introduction Data Integration and Query Answering A data integration

### Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

### Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 Outline More Complex SQL Retrieval Queries

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### Simulation-Based Security with Inexhaustible Interactive Turing Machines

Simulation-Based Security with Inexhaustible Interactive Turing Machines Ralf Küsters Institut für Informatik Christian-Albrechts-Universität zu Kiel 24098 Kiel, Germany kuesters@ti.informatik.uni-kiel.de

### The Relational Model. Ramakrishnan&Gehrke, Chapter 3 CS4320 1

The Relational Model Ramakrishnan&Gehrke, Chapter 3 CS4320 1 Why Study the Relational Model? Most widely used model. Vendors: IBM, Informix, Microsoft, Oracle, Sybase, etc. Legacy systems in older models

### Implementing evolution: Database migration

2IS55 Software Evolution Sources Implementing evolution: Database migration Alexander Serebrenik / SET / W&I 7-6-2010 PAGE 1 Last week Recapitulation Assignment 8 How is it going? Questions to Marcel:

### December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

### INTRODUCTION TO PROOFS: HOMEWORK SOLUTIONS

INTRODUCTION TO PROOFS: HOMEWORK SOLUTIONS STEVEN HEILMAN Contents 1. Homework 1 1 2. Homework 2 6 3. Homework 3 10 4. Homework 4 16 5. Homework 5 19 6. Homework 6 21 7. Homework 7 25 8. Homework 8 28

### Linear Dependence Tests

Linear Dependence Tests The book omits a few key tests for checking the linear dependence of vectors. These short notes discuss these tests, as well as the reasoning behind them. Our first test checks

### Distributed Database Design and Distributed Query Execution. Designing with distribution in mind: top-down

Distributed Database Design and Distributed Query Execution Designing with distribution in mind: top-down 1 Data Fragmentation and Placement Fragmentation: How to split up the data into smaller fragments?

### The Optimality of Naive Bayes

The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada E3B 5A3 email: hzhang@unb.ca Abstract Naive Bayes is one of the most

### COMP 378 Database Systems Notes for Chapter 7 of Database System Concepts Database Design and the Entity-Relationship Model

COMP 378 Database Systems Notes for Chapter 7 of Database System Concepts Database Design and the Entity-Relationship Model The entity-relationship (E-R) model is a data model in which information stored

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

### The Relational Model. Chapter 3. Comp 521 Files and Databases Fall

The Relational Model Chapter 3 Comp 521 Files and Databases Fall 2016 1 Why Study the Relational Model? Most widely used model by industry. IBM, Informix, Microsoft, Oracle, Sybase, etc. It is simple,

### Load Balancing and Switch Scheduling

EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

### 1. LINEAR EQUATIONS. A linear equation in n unknowns $x_1, x_2, \ldots, x_n$ is an equation of the form

1. LINEAR EQUATIONS A linear equation in $n$ unknowns $x_1, x_2, \ldots, x_n$ is an equation of the form $a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b$, where $a_1, a_2, \ldots, a_n, b$ are given real numbers. For example, with x and

### Lecture 6. SQL, Logical DB Design

Lecture 6 SQL, Logical DB Design Relational Query Languages A major strength of the relational model: supports simple, powerful querying of data. Queries can be written intuitively, and the DBMS is responsible

### Search Result Optimization using Annotators

Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

### Bootstrapping Pay-As-You-Go Data Integration Systems

Bootstrapping Pay-As-You-Go Data Integration Systems Anish Das Sarma Stanford University California, USA anish@cs.stanford.edu Xin Dong AT&T Labs Research New Jersey, USA lunadong@research.att.com Alon

### Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

### Testing LTL Formula Translation into Büchi Automata

Testing LTL Formula Translation into Büchi Automata Heikki Tauriainen and Keijo Heljanko Helsinki University of Technology, Laboratory for Theoretical Computer Science, P. O. Box 5400, FIN-02015 HUT, Finland

### The Relational Model. Why Study the Relational Model? Relational Database: Definitions. Chapter 3

The Relational Model Chapter 3 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Why Study the Relational Model? Most widely used model. Vendors: IBM, Informix, Microsoft, Oracle, Sybase,

### We have defined, in section 1.3, a three-dimension strategy for our research regarding

Chapter 6 Improving Semantic Representation 6.1 Introduction We have defined, in section 1.3, a three-dimension strategy for our research regarding improving conceptual models Improve ER schema, Improve

### 13.3 Inference Using Full Joint Distribution

191 The probability distribution on a single variable must sum to 1 It is also true that any joint probability distribution on any set of variables must sum to 1 Recall that any proposition a is equivalent

### Mathematics for Economics (Part I) Note 5: Convex Sets and Concave Functions

Natalia Lazzati Mathematics for Economics (Part I) Note 5: Convex Sets and Concave Functions Note 5 is based on Madden (1986, Ch. 1, 2, 4 and 7) and Simon and Blume (1994, Ch. 13 and 21). Concave functions

### Can you design an algorithm that searches for the maximum value in the list?

The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We

### CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

### Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001

A comparison of the OpenGIS TM Abstract Specification with the CIDOC CRM 3.2 Draft Martin Doerr ICS-FORTH, Heraklion, Crete Oct 4, 2001 1 Introduction This Mapping has the purpose to identify, if the OpenGIS

### The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

### Notes 11: List Decoding Folded Reed-Solomon Codes

Introduction to Coding Theory CMU: Spring 2010 Notes 11: List Decoding Folded Reed-Solomon Codes April 2010 Lecturer: Venkatesan Guruswami Scribe: Venkatesan Guruswami At the end of the previous notes,

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### SQL Query & Modification

CS145 Lecture Notes #7 SQL Query & Modification Introduction SQL Structured Query Language Pronounced S-Q-L or sequel The query language of every commercial RDBMS Evolution of SQL standard: SQL89 SQL92

### Math 54. Selected Solutions for Week Is u in the plane in R 3 spanned by the columns

Math 5. Selected Solutions for Week 2 Section. (Page 2). Let u = and A = 5 2 6. Is u in the plane in R spanned by the columns of A? (See the figure omitted].) Why or why not? First of all, the plane in

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### Secure Cooperative Data Access in Multi-Cloud Environment

Mason Archival Repository Service http://mars.gmu.edu etd @ Mason (Electronic Theses and Dissertations) The Volgenau School of Engineering 2013 Secure Cooperative Data Access in Multi-Cloud Environment

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL Jasna S MTech Student TKM College of engineering Kollam Manu J Pillai Assistant Professor

### 3. Relational Model and Relational Algebra

ECS-165A WQ 11 36 3. Relational Model and Relational Algebra Contents Fundamental Concepts of the Relational Model Integrity Constraints Translation ER schema Relational Database Schema Relational Algebra

### We call this set an n-dimensional parallelogram (with one vertex 0). We also refer to the vectors x 1,..., x n as the edges of P.

Volumes of parallelograms 1 Chapter 8 Volumes of parallelograms In the present short chapter we are going to discuss the elementary geometrical objects which we call parallelograms. These are going to

### High School Algebra 1 Common Core Standards & Learning Targets

High School Algebra 1 Common Core Standards & Learning Targets Unit 1: Relationships between Quantities and Reasoning with Equations CCS Standards: Quantities N-Q.1. Use units as a way to understand problems

### a. Write a query to generate the number of charter flights to the same destination and shows the results in alphabetic order.

CS 631 _ DBMS Sample Final Exam Questions Question #1: Using the data given in the AviaCo database, perform the following queries and show the expected results: a. Write a query to generate the number

### PowerTeaching i3: Algebra I Mathematics

PowerTeaching i3: Algebra I Mathematics Alignment to the Common Core State Standards for Mathematics Standards for Mathematical Practice and Standards for Mathematical Content for Algebra I Key Ideas and

### Improving the Processing of Decision Support Queries: Strategies for a DSS Optimizer

Universität Stuttgart Fakultät Informatik Improving the Processing of Decision Support Queries: Strategies for a DSS Optimizer Authors: Dipl.-Inform. Holger Schwarz Dipl.-Inf. Ralf Wagner Prof. Dr. B.

### 2.5 Gaussian Elimination

CHAPTER 2 Matrices and Systems of Linear Equations the linear algebra package of Maple, the three elementary row operations are swaprow(a,i,j): permute rows i and j

### OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

### 2.1.1 Examples of Sets and their Elements

Chapter 2 Set Theory 2.1 Sets The most basic object in Mathematics is called a set. As rudimentary as it is, the exact, formal definition of a set is highly complex. For our purposes, we will simply define

### 8. Query Processing. Query Processing & Optimization

ECS-165A WQ 11 136 8. Query Processing Goals: Understand the basic concepts underlying the steps in query processing and optimization and estimating query processing cost; apply query optimization techniques;

### Checking for Dimensional Correctness in Physics Equations

Checking for Dimensional Correctness in Physics Equations C.W. Liew Department of Computer Science Lafayette College liew@cs.lafayette.edu D.E. Smith Department of Computer Science Rutgers University dsmith@cs.rutgers.edu

### For example, estimate the population of the United States as 3 times 10⁸ and the

CCSS: Mathematics The Number System CCSS: Grade 8 8.NS.A. Know that there are numbers that are not rational, and approximate them by rational numbers. 8.NS.A.1. Understand informally that every number

### You know from calculus that functions play a fundamental role in mathematics.

CHAPTER 12 Functions You know from calculus that functions play a fundamental role in mathematics. You likely view a function as a kind of formula that describes a relationship between two (or more) quantities.