# Discovering Queries based on Example Tuples

[Figure 3: Number of candidate queries vs. number of valid queries as the number of rows in the ET varies; panels (a) IMDB and (b) CUST.]

[...] vice name) with Office installed, another customer named Mary bought an iPad (but not sure what app was installed on it) and another customer named Bob bought a device with Dropbox installed on it (but not sure which device). She provides the above information in the form of the example table (ET) shown in Figure 2; she can do so by typing it into a spreadsheet-style interface (e.g., Microsoft Excel, Google Spreadsheet). Note that some cells are partially specified (e.g., customer, device, app names) and some are completely empty.

We propose to return a small set of queries so that the user can select one of them and easily modify it to obtain the final query. Each returned query should have the following desired characteristics: (i) Its output should have the desired schema. (ii) Its output should contain all the rows in ET. (iii) Considering the class of SPJ queries where all joins are foreign key joins, there are numerous queries that satisfy (i) and (ii); we refer to them as valid queries. We should return the fewest valid queries in order to avoid overwhelming the user. At the same time, all the distinct join paths and projected columns should be covered. Hence, we do not consider selections, i.e., we consider only project join queries. Furthermore, we consider only the minimal project join queries, i.e., the ones from which no relation or join condition can be removed while keeping the query valid.

For the ET in Figure 2, there is only one valid minimal query, as shown in the figure. The query is represented by its join tree. The red arrows show the projected column for each ET column. Each projected column is labeled with the name of the corresponding ET column (A, B and C). Note that our problem is different from query by output, which aims to find the query that produces the exact table [22, 23].
Technical Challenges: The key challenge is to compute the valid minimal queries efficiently. Existing approaches on keyword search in databases [1, 1, 2, 14, 9, 19, 4, 6, 15, 12] and sample driven schema mapping [18] focus on the single input tuple scenario. A naive adaptation to our problem is to first compute the set of valid minimal queries for each row in ET and then intersect those sets. Most valid minimal queries for the individual tuples are not valid for the overall ET; the latter set is much more constrained. The above adaptation does not push these constraints down. It executes join queries to compute valid queries for the individual tuples but subsequently throws most of them away because they are not valid for the overall ET. This is wasteful.

We identify a set of constraints that can be pushed below the joins. A valid minimal query must contain a projected column for each column C in ET, and all the values in C must appear in that projected column. We refer to this as the column constraint for C. A valid minimal query must satisfy the column constraints for all the columns in ET. We define the set of candidate queries as the set of minimal project join queries that satisfy all the column constraints. Figure 4 shows a few examples of candidate queries. We push the column constraints down by first computing the candidate set; this can be done very inexpensively (without performing any joins). We refer to this step as the candidate generation step. All the algorithms considered in the paper, including the baseline algorithms, first perform this candidate generation step.

[Figure 4: Candidate queries]

Not all the candidate queries are valid. Since the candidate generation step is relatively inexpensive, the main technical problem is: given the set of candidate queries defined above, compute the valid ones efficiently. We refer to it as the candidate verification problem. One option is to individually verify whether the candidate query contains each row in ET in its output.
We perform each candidate query-row verification by issuing a SQL join query with keyphrase containment predicates, shown in Section 4.1; we refer to these as CQ-row verifications. The resulting algorithm (referred to as VERIFYALL) performs a high number of CQ-row verifications. We observe that most candidate queries are invalid. Figures 3(a) and 3(b) show the number of candidate and valid queries for two real-life databases and real-life ETs (we vary the number of rows in the ETs). More than 9% candidate queries are invalid. The invalid candidate queries typically contain some of the example tuples in their output but not all. VERIFYALL performs many CQ-row verifications for such invalid queries; this is wasteful.

How can we quickly prune the invalid candidate queries without verifying them for any tuple? Our main insight is that there exist sub-join trees of the candidate queries which, when verified for carefully selected rows of ET, can prune many invalid candidate queries. We refer to these as filters. Pruning happens when a filter fails and its sub-join tree is shared by many candidate queries. Judiciously selecting the filters and evaluating them can significantly reduce cost. It is important to consider sub-join trees because the original candidate queries are rarely shared by other candidate queries, no matter how judiciously we select them. We illustrate this through an example.

EXAMPLE 2. For the ET in Figure 2, suppose there are 4 candidate queries generated, as shown in Figure 4. The baseline approach (VERIFYALL) performs 3 (CQ1 contains all 3 rows in its output) + 3 × 2 (CQ2, CQ3 and CQ4 contain row 1 in their output but not row 2) = 9 CQ-row verifications. There is a common sub-join tree that is shared by CQ2 and CQ3 (shown using a green dashed line). If we verify a common sub-join tree early on for a row that is unlikely to be contained in its output, we can prune all the candidate queries that contain that sub-join tree.
Suppose we evaluate the following filter first: the common sub-join tree between CQ2 and CQ3 for the second row. It fails. We can prune CQ2 and CQ3, thus reducing the number of CQ-row verifications to 3 (CQ1) + 2 (CQ4) + 1 (sub-join tree) = 6.

Contributions: Our contributions are summarized as follows.

We define the end-to-end system task, introduce the candidate generation-verification framework and define the main technical problem in the paper: the candidate verification problem (Section 2).

We develop efficient techniques to compute the valid queries from the set of candidate queries. All techniques considered in this paper produce the same output (the correct set of valid minimal queries); they differ only in efficiency. We introduce a set of filters which can be used to prune the invalid candidate queries. We then formalize the candidate verification problem as the following filter selection problem: find the set of filters that verifies the validity of all the valid candidate queries and prunes all the invalid ones while incurring the least evaluation cost. We show the problem is NP-hard even if we have perfect knowledge about which candidates are valid/invalid and which filters succeed/fail. In reality, where we do not have this knowledge, we show that an algorithm can be much worse than the optimal solution. We propose a novel algorithm and show that it has a provable performance guarantee with respect to the optimal solution (Sections 4 and 5).

We perform extensive experiments on two real-life datasets, a real customer support database as well as IMDB. Our experiments show that our proposed solution significantly outperforms the baseline approach (by a factor of 2-1) in terms of the number of CQ-row verifications as well as execution time (Section 6).

## 2. MODEL AND PROBLEM

We first introduce our data model and some notation. We then formally define the problem of discovering queries by example table. Finally, we introduce our candidate generation-verification framework to discover all the valid queries w.r.t. a given example table and discuss the challenges.

### 2.1 Notations and Data Model

We consider a database $D$ with $m$ relations $R_1, R_2, \ldots, R_m$. For a relation $R$, let $R[i]$ denote its $i$-th column, and let $\mathrm{col}(R) = \{R[i]\}_{i=1,2,\ldots,|\mathrm{col}(R)|}$ denote the set of columns of $R$. For a tuple $t$ in $R$, let $t[i]$ denote its value on the column $R[i]$.

Let $G(V,E)$ denote the directed schema graph of $D$, where the vertices in $V$ represent the relations in $D$ and the edges in $E$ represent foreign key references between two relations: there is an edge from $R_j$ to $R_k$ in $E$ iff the primary key defined on $R_k$ is referenced by a foreign key defined in $R_j$. In general, there can be multiple edges from $R_j$ to $R_k$, and we label each edge with the corresponding foreign key's attribute name. For simplicity, we omit edge labels in our examples and description if they are clear from the context.
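As an illustration of the schema graph just described, the following is a minimal Python sketch, not the authors' implementation. The class name `SchemaGraph` and its methods are hypothetical; the three `Sales` foreign keys are taken from the join conditions that the paper uses when verifying CQ1, and the remaining relations of the example database are omitted.

```python
from collections import defaultdict


class SchemaGraph:
    """Directed schema graph G(V, E): relations as vertices, foreign keys as edges."""

    def __init__(self):
        self.relations = set()
        # (referencing relation, referenced relation) -> list of FK attribute labels.
        # A list is used because several foreign keys may link the same pair.
        self.edges = defaultdict(list)

    def add_foreign_key(self, src, dst, fk_attr):
        """Add an edge src -> dst: src.fk_attr references the primary key of dst."""
        self.relations.update((src, dst))
        self.edges[(src, dst)].append(fk_attr)

    def neighbors(self, rel):
        """Relations joined to rel by some foreign key, ignoring edge direction."""
        found = set()
        for src, dst in self.edges:
            if src == rel:
                found.add(dst)
            elif dst == rel:
                found.add(src)
        return found


# Fragment of the example database (only the Sales foreign keys).
g = SchemaGraph()
g.add_foreign_key("Sales", "Customer", "CustId")
g.add_foreign_key("Sales", "Device", "DevId")
g.add_foreign_key("Sales", "App", "AppId")
```

Storing a list of labels per ordered pair keeps the multi-edge case representable, while `neighbors` gives the undirected view that join-tree enumeration would traverse.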
In a relation $R_i$, we refer to a column as a text column if keyword search is allowed on that column. Example 1 illustrates an example database of seven relations, with arrowed lines denoting foreign key references (i.e., edges in the directed schema graph). There are five text columns: Customer.CustName, Employee.EmpName, Device.DevName, App.AppName, and ESR.Desc. In the rest of this paper, when we refer to a column, it is a text column by default.

### 2.2 Discovering Queries by Example Table

For a user-provided example table $T$, the goal of our system is to discover (project join) queries whose answers in the database $D$ are consistent with $T$. We formally define an example table.

DEFINITION 1. (Example Table) An example table $T$ is a table with rows $\{r\}$ and columns $\mathrm{col}(T)$, where each cell of a row is either a string (i.e., one or more tokens) or empty. For a row $r \in T$, let $r[i]$ denote its value on the column $i \in \mathrm{col}(T)$, and let $r[i] = \emptyset$ if $r[i]$ is empty.

Without loss of generality, we assume that $T$ does not contain any empty row or column.

We focus on project join queries with foreign key joins. A project join query $(J, C)$ can be specified by: i) a join graph $J \subseteq G$, i.e., a subgraph of the schema graph $G$ of $D$ representing all the relations involved (vertices of $J$) and all the joins (edges of $J$); let $\mathrm{col}(J)$ be the set of columns of all the relations in $J$; and ii) a set of columns $C \subseteq \mathrm{col}(J)$ from the relations in $J$, onto which the join result is projected. Let $A(J, C)$ be the resulting relation.

Informally, a project join query is said to be valid w.r.t. an example table $T$ if every row of $T$ can be found in the join-project result in the database $D$, with the columns of $T$ properly mapped to the resulting relation.

DEFINITION 2. (Valid Project Join Queries) Given an example table $T$ in a database $D$, consider a project join query $Q = (J, C, \phi)$ with a mapping $\phi : \mathrm{col}(T) \to C$ from the columns of $T$ to the projection $C$. Let $A(Q)$ be the resulting relation of the project join query. $Q$ is valid w.r.t. a row $r \in T$ iff there exists a tuple $t \in A(Q)$ s.t. for any column $i \in \mathrm{col}(T)$,

$$r[i] = \emptyset \;\lor\; r[i] \subseteq t[\phi(i)], \qquad (1)$$

where $x \subseteq y$ denotes that string $x$ is contained in string $y$. A project join query $Q$ is valid w.r.t. an example table $T$ iff $Q$ is valid w.r.t. every row of $T$.

Remarks. There are various ways to define $x \subseteq y$. In our system, we define it as: the tokens in $x$ appear consecutively in $y$. Synonyms and approximate matching can be introduced to extend the definition of string containment. We can also enable users to specify that $x$ is a number and thus needs to exactly match $y$. Our system and algorithms can be easily revised to accommodate such extensions.

Among all valid project join queries, we are interested in the minimal ones. A valid project join query is said to be minimal if none of its relations and foreign key join constraints can be removed while keeping it valid.

DEFINITION 3. (Minimal Valid Project Join Queries) A valid project join query $Q = (J, C, \phi)$ w.r.t. an example table $T$ is minimal iff i) without directions on edges, $J$ is an undirected tree; and ii) for any degree-1 vertex (relation) $R$ in $J$, there exists a column $i \in \mathrm{col}(T)$ s.t. $\phi(i) \in \mathrm{col}(R)$, i.e., a column of $T$ is mapped to a column of $R$.

EXAMPLE 3. Figure 2 shows an example table with three rows and three columns $\{A, B, C\}$. Consider the database shown in Figure 1. Figure 2 shows the only minimal valid project join query for the example table. The mapping $\phi$ defined from $\{A, B, C\}$ to the projection $C$ = {Customer.CustName, Device.DevName, App.AppName} is shown using red arrows in Figure 2. We also label each projected column with the ET column (A, B or C) it is mapped from. The query is valid w.r.t. the first row as there is a tuple (Mike Jones, ThinkPad X1, Office 213) in the resulting relation that contains the three keywords in the first row. Similarly, it is valid w.r.t. the second and third rows.
It is minimal because, for each degree-1 relation (Customer, Device, App), there is an example table column mapped to at least one of its columns.

End-to-end system task. For a user-provided example table $T$ in a database $D$, the goal of this paper is to find all the minimal valid project join queries w.r.t. $T$ in $D$.

### 2.3 Candidate Generation-Verification

Candidates of valid queries. In order to discover all the minimal valid project join queries w.r.t. a given example table, our system first generates a list of candidates for the final results. The list of candidates is relatively cheap to generate and should be a superset of all the valid queries.

DEFINITION 4. (Candidate Project Join Queries) Given an example table $T$ in a database $D$, a project join query $Q = (J, C, \phi)$ is said to be a candidate project join query w.r.t. $T$ iff for any column $i \in \mathrm{col}(T)$,

$$\forall r \in T \;\; \exists t \in A(Q) \ \text{s.t.}\ r[i] \subseteq t[\phi(i)]. \qquad (2)$$
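The difference between Definition 2 (a whole row must match one tuple) and Definition 4 (each cell only needs to match some tuple in its column) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names, the case-insensitive whitespace tokenization, and the toy result relations are all hypothetical and not taken from the paper; the containment test follows the "tokens appear consecutively" definition from the Remarks.

```python
def contains(x, y):
    """x is contained in y: the tokens of x appear consecutively in y.

    An empty x is treated as contained in anything. Lowercasing and
    whitespace tokenization are assumptions of this sketch.
    """
    xs, ys = x.lower().split(), y.lower().split()
    if not xs:
        return True
    return any(ys[k:k + len(xs)] == xs for k in range(len(ys) - len(xs) + 1))


def is_valid(result, example_table, phi):
    """Definition 2: every ET row is contained in a single result tuple."""
    return all(
        any(all(contains(cell, t[phi[i]]) for i, cell in enumerate(row))
            for t in result)
        for row in example_table
    )


def is_candidate(result, example_table, phi):
    """Definition 4: every ET cell is contained in some tuple of its mapped column."""
    return all(
        any(contains(cell, t[phi[i]]) for t in result)
        for row in example_table
        for i, cell in enumerate(row)
    )


# Toy data (hypothetical): phi maps ET columns A, B, C to projected columns.
phi = ["CustName", "DevName", "AppName"]
et = [("Mike", "ThinkPad", "Office"), ("Mary", "iPad", ""), ("Bob", "", "Dropbox")]

good = [
    {"CustName": "Mike Jones", "DevName": "ThinkPad X1", "AppName": "Office"},
    {"CustName": "Mary Smith", "DevName": "iPad Air", "AppName": "Evernote"},
    {"CustName": "Bob Evans", "DevName": "Nexus 7", "AppName": "Dropbox"},
]
# Same values per column, but combined differently across tuples.
scrambled = [
    {"CustName": "Mike Jones", "DevName": "iPad Air", "AppName": "Office"},
    {"CustName": "Mary Smith", "DevName": "ThinkPad X1", "AppName": "Dropbox"},
    {"CustName": "Bob Evans", "DevName": "Nexus 7", "AppName": "Evernote"},
]
```

On this toy data, `scrambled` satisfies the column-wise condition of Equation (2) but not the row-wise condition of Equation (1), which is the situation of an invalid candidate query.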

[Figure 5: System architecture. Components: example table; candidate query generation (candidate projection column retrieval using an IR engine that maintains the inverted index (CI) on text columns, and schema graph traversal over the database schema); candidate query verification against the database instance; valid queries.]

As with valid queries, a candidate project join query is said to be minimal if it satisfies conditions i) and ii) in Definition 3. The minimal candidate project join queries are defined at the schema level, and no join is needed to generate them, so we can enumerate them efficiently. Details are discussed in Section 3. We call them candidate queries, or candidates for short, in the rest of this paper.

Intuitively, the strings in each column of the example table should appear in the corresponding (mapped) column of the resulting relation of the candidate query. Equation (2) states this intuition formally. We can easily prove the following result, which ensures that the set of candidates is complete.

COROLLARY 1. Any (minimal) valid project join query must also be a (minimal) candidate project join query, for a given example table in any database.

EXAMPLE 4. Figure 4 shows the four candidate queries for the example table in Figure 2. The candidate query CQ1 is the only valid query; the rest are invalid.

Candidate verification. Once the set $Q_C$ of all the candidate queries has been generated, the remaining task, and the focus of the rest of this paper, is to remove all the false positives.

## 3. SYSTEM ARCHITECTURE

The query discovery system has two components: offline pre-processing and query time processing.

### 3.1 Offline Pre-processing

There are 2 main steps in this component:

Build FTS and other indexes on relations: As discussed in Section 1, we perform CQ-row verifications by executing SQL join queries with keyphrase containment predicates. Hence, we build a full text search (FTS) index on each text column of each relation in the database. All major commercial DBMSs support FTS indexes on text columns.
We also assume an appropriate physical design for efficient execution of the SQL join queries, specifically indexes on the primary and foreign key columns.

Build master inverted index on columns: Our candidate generation step first identifies base table columns that satisfy the column constraints; we build a single master inverted index over all text columns in the database for that purpose. We refer to it as the column index (CI). Given a string $W$, $CI(W)$ returns the cells (and their corresponding columns) containing $W$. Since commercial DBMSs do not support such a master index, we implement CI using an information retrieval (IR) engine. We scan each text column in the database. For each token in each cell in each column, we create a posting containing the column identifier (which uniquely identifies it across all columns in the database), the cell identifier (which uniquely identifies it within the column) and the token position within the cell, and store it in the posting list of the token. We can support the operation $CI(W)$ using these posting lists. Any IR engine (e.g., Lucene) that allows customization of the payload in a posting can be used to implement CI.

Table 1: Query discovery algorithm

| Line | Step |
|---|---|
| 1 | for each column $i \in \mathrm{col}(T)$ |
| 2 | compute the set $C_i$ of candidate projection columns |
| 3 | $Q_C \leftarrow$ GenerateCandidateQueries($G, C_1, C_2, \ldots, C_{\lvert \mathrm{col}(T) \rvert}$) |
| 4 | $Q_V \leftarrow$ VerifyCandidateQueries($Q_C, T, D$) |

### 3.2 Query Time Processing

Figure 5 shows the architecture of this component. The pseudocode of the overall query discovery is shown in Table 1. There are two main steps in this component:

Candidate query generation. The input to this step is the ET $T$ and the output is the set $Q_C$ of candidate queries of $T$ (as defined in Definition 4). This step consists of two substeps:

(1) Candidate projection column retrieval: For each column $i \in \mathrm{col}(T)$ in the ET $T$, we first identify the set of base table columns that satisfy the column constraint (lines 1-2 in Table 1). These are the base table columns that contain all the values in column $i$.
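The column index and the retrieval of columns that contain all the values of an ET column can be sketched with an in-memory inverted index. This is a simplified illustration, not the IR-engine implementation the paper describes: the class `ColumnIndex`, the function `candidate_projection_columns`, the lowercase whitespace tokenization, and the sample cell values are all assumptions of this sketch.

```python
from collections import defaultdict


class ColumnIndex:
    """Toy column index (CI): token -> postings (column id, cell id, token position)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_cell(self, column, cell_id, text):
        """Index one cell; one posting per token occurrence."""
        for pos, tok in enumerate(text.lower().split()):
            self.postings[tok].add((column, cell_id, pos))

    def lookup(self, phrase):
        """CI(W): (column, cell) pairs whose cell contains the tokens of W consecutively."""
        toks = phrase.lower().split()
        if not toks:
            return set()
        hits = set()
        for col, cell, pos in self.postings.get(toks[0], ()):
            # The k-th token of the phrase must occur k positions later in the same cell.
            if all((col, cell, pos + k) in self.postings.get(toks[k], ())
                   for k in range(1, len(toks))):
                hits.add((col, cell))
        return hits


def candidate_projection_columns(ci, et_column_values):
    """Base table columns that contain every non-empty value of one ET column."""
    col_sets = [{col for col, _ in ci.lookup(v)} for v in et_column_values if v]
    return set.intersection(*col_sets) if col_sets else set()


# Hypothetical cell contents for three text columns.
ci = ColumnIndex()
for n, name in enumerate(["Mike Jones", "Mary Smith", "Bob Evans"]):
    ci.add_cell("Customer.CustName", n, name)
for n, name in enumerate(["Mike Stone", "Mary Lee", "Bob Ray"]):
    ci.add_cell("Employee.EmpName", n, name)
for n, name in enumerate(["ThinkPad X1", "iPad Air"]):
    ci.add_cell("Device.DevName", n, name)
```

Keeping the token position in each posting is what allows the phrase lookup to check that the tokens appear consecutively without touching the base tables.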
We refer to the columns that satisfy the column constraint for column $i$ as the candidate projection columns for column $i$. For example, the candidate projection columns for the ET in Figure 2 are Customer.CustName and Employee.EmpName (for column A), Device.DevName (for column B), and App.AppName and ESR.Desc (for column C). We leverage the column index (CI) for this purpose. For each non-empty cell $T[i,j]$ in column $j$, we issue a query $CI(T[i,j])$ against the CI. Let $\mathrm{cols}(T[i,j])$ denote the distinct columns in the output of $CI(T[i,j])$. Then the set $C_j$ of candidate projection columns for column $j$ of ET $T$ is:

$$C_j = \bigcap_{i:\; T[i,j] \neq \emptyset} \mathrm{cols}(T[i,j]) \qquad (3)$$

(2) Candidate query enumeration: Given the candidate projection columns $C_j$ of each column $j$ in ET, we enumerate the candidate queries $Q_C$ by traversing the schema graph. We use a straightforward adaptation of the candidate network generation algorithm proposed in [1] (line 3 in Table 1). This step does not perform any joins and represents a negligible fraction of the overall query processing time.

Candidate query verification. The input to this step is the ET $T$ and the set $Q_C$ of candidate queries. The output is the set $Q_V$ of valid queries. This is the main contribution of the paper, and the rest of the paper focuses on this step. Section 4 presents two baseline algorithms for candidate verification, while Section 5 presents the proposed approach.

## 4. BASELINES FOR CANDIDATE VERIFICATION

### 4.1 Baseline Algorithm (VERIFYALL)

One way to verify whether a candidate query is valid is to execute it and check whether its output contains all the rows in the ET [23]. This is typically very expensive; hence we do not follow this approach. Since ET typically contains a few rows (typically less than 1), a more efficient alternative is to individually verify whether the candidate query contains each of the rows in ET in its output. We henceforth refer to this as verifying the candidate query for the row.

[Figure 6: Failure dependency (candidate queries CQ2 and CQ5)]

Recall that a candidate query $(J, C, \phi)$ contains the row $r \in T$ in its output iff there exists a tuple $t$ in its output such that every non-empty cell $r[i]$ in that row is textually contained in $t[\phi(i)]$ (Definition 2). Hence, we can verify the candidate query $(J, C, \phi)$ for row $r$ by issuing the following SQL query and checking the results:

    SELECT TOP 1 * FROM V(J)
    WHERE E(J) AND (for each i in col(T) with r[i] non-empty) CONTAINS(φ(i), r[i])

where $V(J)$ denotes the set of vertices (relations) of $J$, $E(J)$ the set of edges (join conditions) in $J$, and the CONTAINS predicates for all non-empty cells of $r$ are combined with AND. For example, the SQL query to verify the candidate query CQ1 in Figure 4 for the second row in the ET in Figure 2 is:

    SELECT TOP 1 * FROM Sales, Customer, Device, App
    WHERE Sales.CustId = Customer.CustId
      AND Sales.DevId = Device.DevId
      AND Sales.AppId = App.AppId
      AND CONTAINS(CustName, 'Mary')
      AND CONTAINS(DevName, 'iPad')

If the result is non-empty, the candidate query $(J, C, \phi)$ contains row $r$; if it is empty, it does not. We also say that the candidate query satisfies the row $r$ or fails for the row $r$, respectively. Since we just need to check the existence of a row, we use TOP 1 in the SELECT clause to reduce the data transfer cost. We refer to these SQL queries as CQ-row verifications.

The baseline algorithm (referred to as VERIFYALL) iterates over the candidate queries $Q_C$ in the outer loop and the rows in ET in the inner loop (or vice versa) and executes the SQL queries. If a candidate query fails for a row, we eliminate it right away, i.e., we do not verify it for the remaining rows. Note that the order of the candidate queries does not affect the number of CQ-row verifications performed, but the order of the rows does. In our experiments, we consider a random ordering of rows. We also consider the rows in decreasing order of the number of non-empty cells.
This is because a candidate query is more likely to fail for densely populated rows; verifying them first might lead to early elimination of candidate queries.

EXAMPLE 5. Figure 4 shows a subset of the candidate queries (4 of them) for the ET in Figure 2 on the example database in Figure 1. VERIFYALL performs 3 CQ-row verifications for CQ1, as it satisfies all 3 rows, and 2 each for the other 3 candidate queries, as they all satisfy row 1 but fail for row 2. Hence, it performs a total of 9 CQ-row verifications.

Although simple, VERIFYALL is wasteful as it performs many CQ-row verifications for invalid candidate queries. In the worst case, it performs $|T| \times |Q_C|$ CQ-row verifications, where $|T|$ is the number of rows in ET. This is expensive as each CQ-row verification involves one or more joins.

### 4.2 Simple Pruning Algorithm

Since most candidate queries are invalid, we can speed up candidate verification by quickly pruning the invalid candidate queries. We observe that there exist dependencies among the results of CQ-row verifications. These can be used to prune candidate queries, i.e., eliminate them without verifying them for any rows.

Consider the ET in Figure 2 and the two candidate queries in Figure 6. CQ2 fails for the second row. This implies that CQ5 will also fail for that row. This is because: (1) the join tree in CQ2 is a subtree of that in CQ5 and (2) the non-empty columns in the second row (i.e., columns A and B) are mapped to the same base table columns; both queries map A and B to Employee.EmpName and Device.DevName respectively. Since the output of CQ2 does not contain a tuple whose values on Employee.EmpName and Device.DevName contain Mary and iPad respectively, any supertree (e.g., CQ5) is more restrictive and hence cannot contain such a tuple in its output. We refer to this as the failure dependency:

LEMMA 1 (FAILURE DEPENDENCY).
For an ET $T$, the failure of candidate query $Q_1 = (J_1, C_1, \phi_1)$ on row $r$ in $T$ implies the failure of candidate query $Q_2 = (J_2, C_2, \phi_2)$ for row $r$ in $T$ if: i) $J_1$ is a subtree of $J_2$ and ii) for any column $i \in \mathrm{col}(T)$, $r[i] = \emptyset \lor \phi_1(i) = \phi_2(i)$.

We denote $(Q_1, r) \succ (Q_2, r)$ iff $Q_1$ and $Q_2$ satisfy i) and ii) in Lemma 1. If we evaluate CQ2 on row 2 earlier, we can prune CQ5 right away.

We present a simple algorithm, SIMPLEPRUNE, that leverages the above observation to save CQ-row verifications. Like VERIFYALL, SIMPLEPRUNE iterates over the candidate queries in the outer loop and the rows in the inner loop. It maintains the list of CQ-row verifications performed so far that failed. When we encounter a new candidate query $Q$ during the outer loop iteration, we first check whether there exists a CQ-row evaluation $(Q', r)$ in that list such that $(Q', r) \succ (Q, r)$. If so, we can prune $Q$ immediately as it is guaranteed to fail for row $r$ and hence is not valid. Otherwise, we verify it for all the rows. Since the join trees corresponding to the candidate queries are small, the overhead of checking failure dependencies is negligible; the cost is dominated by that of executing the CQ-row verifications.

Unlike VERIFYALL, the order of candidate queries affects the number of CQ-row verifications performed by SIMPLEPRUNE. Intuitively, smaller join trees are more likely to have supertrees, and verifying them first will result in more pruning. Hence, we order the candidate queries by increasing join tree size.

EXAMPLE 6. Consider the ET in Figure 2 and the two candidate queries in Figure 6. VERIFYALL performs 4 CQ-row verifications (both candidate queries satisfy row 1 and fail for row 2). SIMPLEPRUNE verifies CQ2 first and then CQ5. It performs 2 CQ-row verifications for CQ2. When it encounters CQ5, it finds a CQ-row verification (CQ2, 2) in the failed list. Since $(\mathrm{CQ}_2, 2) \succ (\mathrm{CQ}_5, 2)$, it prunes CQ5. The total number of CQ-row verifications reduces from 4 to 2.

## 5. FAST VERIFICATION VIA FILTERS

We focus on how to speed up candidate verification, which is the bottleneck of our system. We first introduce the concept of filters together with its properties in Section 5.1 and give intuitions on why it is the key to the speedup. Intuitively, a filter is a substructure of a candidate query. A candidate is valid only if all of its filters are valid, and a single invalid filter implies that the candidate is invalid. Different candidates may share common filters, and evaluating a filter is usually cheaper than evaluating the candidate itself, so evaluating those shared filters first is usually beneficial. On the other hand, the number of filters is usually much larger than the number of candidates, so we need to select the filters to be evaluated judiciously; for this we define the filter selection problem in Section 5.2. This problem is shown to be hard in theory, but we propose an effective adaptive solution in Section 5.3, which works well in practice and has a provable performance guarantee.
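To make the verification counts of the two Section 4 baselines concrete, here is a small Python simulation of Example 6. It is a sketch under several assumptions and not the authors' code: the `verify` callback stands in for issuing the SQL CQ-row verification, join trees are compared by their edge sets (which assumes every join tree has at least one edge), and the exact shape of CQ2 and CQ5, in particular how ESR is joined, is guessed from the Figure 6 residue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CQ:
    name: str
    edges: frozenset  # undirected join edges, each a frozenset of two relations
    phi: tuple        # phi[i] = base table column projected for ET column i


def implies_failure(q1, q2, row):
    """Lemma 1: q1 failing on row implies q2 fails on row."""
    same_mapping = all(q1.phi[i] == q2.phi[i] for i, cell in enumerate(row) if cell)
    return q1.edges <= q2.edges and same_mapping


def verify_all(cqs, et, verify):
    """VERIFYALL: verify each candidate row by row, stopping at its first failure."""
    count, valid = 0, []
    for q in cqs:
        ok = True
        for r in range(len(et)):
            count += 1
            if not verify(q, r):
                ok = False
                break
        if ok:
            valid.append(q.name)
    return valid, count


def simple_prune(cqs, et, verify):
    """SIMPLEPRUNE: skip a candidate if a recorded failure implies its failure."""
    count, valid, failed = 0, [], []
    for q in sorted(cqs, key=lambda c: len(c.edges)):  # smaller join trees first
        if any(implies_failure(f, q, et[r]) for f, r in failed):
            continue
        ok = True
        for r in range(len(et)):
            count += 1
            if not verify(q, r):
                failed.append((q, r))
                ok = False
                break
        if ok:
            valid.append(q.name)
    return valid, count


def edge(a, b):
    return frozenset((a, b))


base = {edge("Owner", "Employee"), edge("Owner", "Device"), edge("Owner", "App")}
cq2 = CQ("CQ2", frozenset(base),
         ("Employee.EmpName", "Device.DevName", "App.AppName"))
cq5 = CQ("CQ5", frozenset(base | {edge("App", "ESR")}),  # ESR join is assumed
         ("Employee.EmpName", "Device.DevName", "ESR.Desc"))

et = [("Mike", "ThinkPad", "Office"), ("Mary", "iPad", ""), ("Bob", "", "Dropbox")]
# Simulated outcomes from Example 6: both satisfy row 1 and fail for row 2.
outcomes = {("CQ2", 0): True, ("CQ2", 1): False, ("CQ5", 0): True, ("CQ5", 1): False}
verify = lambda q, r: outcomes[(q.name, r)]

print(verify_all([cq2, cq5], et, verify))    # ([], 4)
print(simple_prune([cq2, cq5], et, verify))  # ([], 2)
```

Both functions return the same set of valid queries; only the number of simulated CQ-row verifications differs, matching the 4 versus 2 of Example 6.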

### KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

### PartJoin: An Efficient Storage and Query Execution for Data Warehouses

PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

### Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### The Minimum Consistent Subset Cover Problem and its Applications in Data Mining

The Minimum Consistent Subset Cover Problem and its Applications in Data Mining Byron J Gao 1,2, Martin Ester 1, Jin-Yi Cai 2, Oliver Schulte 1, and Hui Xiong 3 1 School of Computing Science, Simon Fraser

### Index Selection Techniques in Data Warehouse Systems

Index Selection Techniques in Data Warehouse Systems Aliaksei Holubeu as a part of a Seminar Databases and Data Warehouses. Implementation and usage. Konstanz, June 3, 2005 2 Contents 1 DATA WAREHOUSES

### In Search of Influential Event Organizers in Online Social Networks

In Search of Influential Event Organizers in Online Social Networks ABSTRACT Kaiyu Feng 1,2 Gao Cong 2 Sourav S. Bhowmick 2 Shuai Ma 3 1 LILY, Interdisciplinary Graduate School. Nanyang Technological University,

### Lecture 7: Approximation via Randomized Rounding

Lecture 7: Approximation via Randomized Rounding Often LPs return a fractional solution where the solution x, which is supposed to be in {0, } n, is in [0, ] n instead. There is a generic way of obtaining

### 2.3 Scheduling jobs on identical parallel machines

2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

### Clean Answers over Dirty Databases: A Probabilistic Approach

Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

### Permutation Betting Markets: Singleton Betting with Extra Information

Permutation Betting Markets: Singleton Betting with Extra Information Mohammad Ghodsi Sharif University of Technology ghodsi@sharif.edu Hamid Mahini Sharif University of Technology mahini@ce.sharif.edu

### Database Design Patterns. Winter 2006-2007 Lecture 24

Database Design Patterns Winter 2006-2007 Lecture 24 Trees and Hierarchies Many schemas need to represent trees or hierarchies of some sort Common way of representing trees: An adjacency list model Each

### Efficient Algorithms for Masking and Finding Quasi-Identifiers

Efficient Algorithms for Masking and Finding Quasi-Identifiers Rajeev Motwani Stanford University rajeev@cs.stanford.edu Ying Xu Stanford University xuying@cs.stanford.edu ABSTRACT A quasi-identifier refers

### Automated Selection of Materialized Views and Indexes for SQL Databases

Automated Selection of Materialized Views and Indexes for SQL Databases Sanjay Agrawal Surajit Chaudhuri Vivek Narasayya Microsoft Research Microsoft Research Microsoft Research sagrawal@microsoft.com

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like

### Solve problem to optimality. Solve problem in poly-time. Solve arbitrary instances of the problem. ρ-approximation algorithm.

Approximation Algorithms 11 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one of three

### Chapter 20: Data Analysis

Chapter 20: Data Analysis Database System Concepts, 6th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

### Week 5 Integral Polyhedra

Week 5 Integral Polyhedra We have seen some examples of linear programming formulations that are integral, meaning that every basic feasible solution is an integral vector. This week we develop a theory

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

### An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

### ON THE COMPLEXITY OF THE GAME OF SET. {kamalika,pbg,dratajcz,hoeteck}@cs.berkeley.edu

ON THE COMPLEXITY OF THE GAME OF SET KAMALIKA CHAUDHURI, BRIGHTEN GODFREY, DAVID RATAJCZAK, AND HOETECK WEE {kamalika,pbg,dratajcz,hoeteck}@cs.berkeley.edu ABSTRACT. Set® is a card game played with a

### Scheduling Shop Scheduling. Tim Nieberg

Scheduling Shop Scheduling Tim Nieberg Shop models: General Introduction Remark: Consider non preemptive problems with regular objectives Notation Shop Problems: m machines, n jobs 1,..., n operations

### D1.1 Service Discovery system: Load balancing mechanisms

D1.1 Service Discovery system: Load balancing mechanisms VERSION 1.0 DATE 2011 EDITORIAL MANAGER Eddy Caron AUTHORS STAFF Eddy Caron, Cédric Tedeschi Copyright ANR SPADES. 08-ANR-SEGI-025. Contents Introduction

### Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

### CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms

### The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

### Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

### Finding Patterns in a Knowledge Base using Keywords to Compose Table Answers

Finding Patterns in a Knowledge Base using Keywords to Compose Table Answers Mohan Yang Bolin Ding 2 Surajit Chaudhuri 2 Kaushik Chakrabarti 2 University of California, Los Angeles, CA 2 Microsoft Research,

### Victor Shoup Avi Rubin. {shoup,rubin}@bellcore.com. Abstract

Session Key Distribution Using Smart Cards Victor Shoup Avi Rubin Bellcore, 445 South St., Morristown, NJ 07960 {shoup,rubin}@bellcore.com Abstract In this paper, we investigate a method by which smart

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

### Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nashat Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

### EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES

EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES Myint Myint Thein 1 and Mie Mie Su Thwin 2 1 University of Computer Studies, Mandalay, Myanmar mmyintt@gmail.com 2 University of Computer Studies,

### Integrating Pattern Mining in Relational Databases

Integrating Pattern Mining in Relational Databases Toon Calders, Bart Goethals, and Adriana Prado University of Antwerp, Belgium {toon.calders, bart.goethals, adriana.prado}@ua.ac.be Abstract. Almost a

### You know from calculus that functions play a fundamental role in mathematics.

CHAPTER 12 Functions You know from calculus that functions play a fundamental role in mathematics. You likely view a function as a kind of formula that describes a relationship between two (or more) quantities.

### Network (Tree) Topology Inference Based on Prüfer Sequence

Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 vanniarajanc@hcl.in,

### Why? A central concept in Computer Science. Algorithms are ubiquitous.

Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

### Protein Protein Interaction Networks

Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University My Definition of Bioinformatics

### Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

### Two General Methods to Reduce Delay and Change of Enumeration Algorithms

ISSN 1346-5597 NII Technical Report Two General Methods to Reduce Delay and Change of Enumeration Algorithms Takeaki Uno NII-2003-004E Apr.2003 Two General Methods to Reduce Delay and Change of Enumeration

### Exponential time algorithms for graph coloring

Exponential time algorithms for graph coloring Uriel Feige Lecture notes, March 14, 2011 1 Introduction Let [k] denote the set {1,..., k}. A k-labeling of vertices of a graph G(V, E) is a function V → [k].

### Bicolored Shortest Paths in Graphs with Applications to Network Overlay Design

Bicolored Shortest Paths in Graphs with Applications to Network Overlay Design Hongsik Choi and Hyeong-Ah Choi Department of Electrical Engineering and Computer Science George Washington University Washington,

### The Union-Find Problem Kruskal's algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

The Union-Find Problem Kruskal's algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints

### Security-Aware Beacon Based Network Monitoring

Security-Aware Beacon Based Network Monitoring Masahiro Sasaki, Liang Zhao, Hiroshi Nagamochi Graduate School of Informatics, Kyoto University, Kyoto, Japan Email: {sasaki, liang, nag}@amp.i.kyoto-u.ac.jp

### Security in Outsourcing of Association Rule Mining

Security in Outsourcing of Association Rule Mining W. K. Wong The University of Hong Kong wkwong2@cs.hku.hk David W. Cheung The University of Hong Kong dcheung@cs.hku.hk Ben Kao The University of Hong

### Multi-dimensional index structures Part I: motivation

Multi-dimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for

### Cloud Service Placement via Subgraph Matching

Cloud Service Placement via Subgraph Matching Bo Zong, Ramya Raghavendra *, Mudhakar Srivatsa *, Xifeng Yan, Ambuj K. Singh, Kang-Won Lee * Dept. of Computer Science, University of California at Santa

### Procedia Computer Science 00 (2012) 1–21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1–21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

### Solutions to Homework 6

Solutions to Homework 6 Debasish Das EECS Department, Northwestern University ddas@northwestern.edu 1 Problem 5.24 We want to find light spanning trees with certain special properties. Given is one example

### School Timetabling in Theory and Practice

School Timetabling in Theory and Practice Irving van Heuven van Staereling VU University, Amsterdam Faculty of Sciences December 24, 2012 Preface At almost every secondary school and university, some

### Efficient Computation of Multiple Group By Queries Zhimin Chen Vivek Narasayya

Efficient Computation of Multiple Group By Queries Zhimin Chen Vivek Narasayya Microsoft Research {zmchen, viveknar}@microsoft.com ABSTRACT Data analysts need to understand the quality of data in the warehouse.

### COLORED GRAPHS AND THEIR PROPERTIES

COLORED GRAPHS AND THEIR PROPERTIES BEN STEVENS 1. Introduction This paper is concerned with the upper bound on the chromatic number for graphs of maximum vertex degree under three different sets of coloring

### Beyond the Stars: Revisiting Virtual Cluster Embeddings

Beyond the Stars: Revisiting Virtual Cluster Embeddings Matthias Rost Technische Universität Berlin September 7th, 2015, Télécom-ParisTech Joint work with Carlo Fuerst, Stefan Schmid Published in ACM SIGCOMM

### Designing and Using Views To Improve Performance of Aggregate Queries

Designing and Using Views To Improve Performance of Aggregate Queries Foto Afrati 1, Rada Chirkova 2, Shalu Gupta 2, and Charles Loftis 2 1 Computer Science Division, National Technical University of Athens,

### Approximation Algorithms: LP Relaxation, Rounding, and Randomized Rounding Techniques. My T. Thai

Approximation Algorithms: LP Relaxation, Rounding, and Randomized Rounding Techniques My T. Thai 1 Overview An overview of LP relaxation and rounding method is as follows: 1. Formulate an optimization

### Graph Security Testing

JOURNAL OF APPLIED COMPUTER SCIENCE Vol. 23 No. 1 (2015), pp. 29-45 Graph Security Testing Tomasz Gieniusz 1, Robert Lewoń 1, Michał Małafiejski 1 1 Gdańsk University of Technology, Poland Department of

### Homework 15 Solutions

PROBLEM ONE (Trees) Homework 15 Solutions 1. Recall the definition of a tree: a tree is a connected, undirected graph which has no cycles. Which of the following definitions are equivalent to this definition

### So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

### Testing LTL Formula Translation into Büchi Automata

Testing LTL Formula Translation into Büchi Automata Heikki Tauriainen and Keijo Heljanko Helsinki University of Technology, Laboratory for Theoretical Computer Science, P. O. Box 5400, FIN-02015 HUT, Finland

### Chapter 4. Trees. 4.1 Basics

Chapter 4 Trees 4.1 Basics A tree is a connected graph with no cycles. A forest is a collection of trees. A vertex of degree one, particularly in a tree, is called a leaf. Trees arise in a variety of applications.

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### THE SEARCH FOR NATURAL DEFINABILITY IN THE TURING DEGREES

THE SEARCH FOR NATURAL DEFINABILITY IN THE TURING DEGREES ANDREW E.M. LEWIS 1. Introduction This will be a course on the Turing degrees. We shall assume very little background knowledge: familiarity with

### Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

### Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

### Can you design an algorithm that searches for the maximum value in the list?

The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We

### CS5314 Randomized Algorithms. Lecture 16: Balls, Bins, Random Graphs (Random Graphs, Hamiltonian Cycles)

CS5314 Randomized Algorithms Lecture 16: Balls, Bins, Random Graphs (Random Graphs, Hamiltonian Cycles) 1 Objectives Introduce Random Graph Model used to define a probability space for all graphs with

### Simulation-Based Security with Inexhaustible Interactive Turing Machines

Simulation-Based Security with Inexhaustible Interactive Turing Machines Ralf Küsters Institut für Informatik Christian-Albrechts-Universität zu Kiel 24098 Kiel, Germany kuesters@ti.informatik.uni-kiel.de

### Locality Based Protocol for MultiWriter Replication systems

Locality Based Protocol for MultiWriter Replication systems Lei Gao Department of Computer Science The University of Texas at Austin lgao@cs.utexas.edu One of the challenging problems in building replication

### 4 Basics of Trees. Petr Hliněný, FI MU Brno 1 FI: MA010: Trees and Forests

4 Basics of Trees Trees, actually acyclic connected simple graphs, are among the simplest graph classes. Despite their simplicity, they still have rich structure and many useful application, such as in

### Nan Kong, Andrew J. Schaefer. Department of Industrial Engineering, University of Pittsburgh, PA 15261, USA

A Factor 1/2 Approximation Algorithm for Two-Stage Stochastic Matching Problems Nan Kong, Andrew J. Schaefer Department of Industrial Engineering, University of Pittsburgh, PA 15261, USA Abstract We introduce

### princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora Scribe: One of the running themes in this course is the notion of

### Generating models of a matched formula with a polynomial delay

Generating models of a matched formula with a polynomial delay Petr Savicky Institute of Computer Science, Academy of Sciences of Czech Republic, Pod Vodárenskou Věží 2, 182 07 Praha 8, Czech Republic

### each college c_i ∈ C has a capacity q_i - the maximum number of students it will admit

n colleges in a set C, m applicants in a set A, where m is much larger than n. Each college c_i ∈ C has a capacity q_i - the maximum number of students it will admit. Each college c_i has a strict order i

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### SQL DATA DEFINITION: KEY CONSTRAINTS. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 7

SQL DATA DEFINITION: KEY CONSTRAINTS CS121: Introduction to Relational Database Systems Fall 2015 Lecture 7 Data Definition 2 Covered most of SQL data manipulation operations Continue exploration of SQL

### Classification with Decision Trees

Classification with Decision Trees Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong 1 / 24 Y Tao Classification with Decision Trees In this lecture, we will discuss

### (67902) Topics in Theory and Complexity Nov 2, 2006. Lecture 7

(67902) Topics in Theory and Complexity Nov 2, 2006 Lecturer: Irit Dinur Lecture 7 Scribe: Rani Lekach 1 Lecture overview This Lecture consists of two parts In the first part we will refresh the definition

### Effective Keyword-based Selection of Relational Databases

Effective Keyword-based Selection of Relational Databases Bei Yu National University of Singapore Guoliang Li Tsinghua University Anthony K. H. Tung National University of Singapore Karen Sollins MIT ABSTRACT

### Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

### Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

### A Sublinear Bipartiteness Tester for Bounded Degree Graphs

A Sublinear Bipartiteness Tester for Bounded Degree Graphs Oded Goldreich Dana Ron February 5, 1998 Abstract We present a sublinear-time algorithm for testing whether a bounded degree graph is bipartite

### CS 598CSC: Combinatorial Optimization Lecture date: 2/4/2010

CS 598CSC: Combinatorial Optimization Lecture date: 2/4/2010 Instructor: Chandra Chekuri Scribe: David Morrison Gomory-Hu Trees (The work in this section closely follows [3]) Let G = (V, E) be an undirected

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### Efficient Use of the Query Optimizer for Automated Physical Design

Efficient Use of the Query Optimizer for Automated Physical Design Stratos Papadomanolakis Debabrata Dash Anastasia Ailamaki Carnegie Mellon University Ecole Polytechnique Fédérale de Lausanne {stratos,ddash,natassa}@cs.cmu.edu

### A Mathematical Programming Solution to the Mars Express Memory Dumping Problem

A Mathematical Programming Solution to the Mars Express Memory Dumping Problem Giovanni Righini and Emanuele Tresoldi Dipartimento di Tecnologie dell'Informazione Università degli Studi di Milano Via Bramante

### Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design

Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design Sanjay Agrawal Microsoft Research sagrawal@microsoft.com Vivek Narasayya Microsoft Research viveknar@microsoft.com

### Relations Graphical View

Relations Slides by Christopher M. Bourke Instructor: Berthe Y. Choueiry Introduction Recall that a relation between elements of two sets is a subset of their Cartesian product (of ordered pairs). A binary

### The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m^(2/3), n^(1/2)}

### Semi-Automatic Index Tuning: Keeping DBAs in the Loop

Semi-Automatic Index Tuning: Keeping DBAs in the Loop Karl Schnaitter Teradata Aster karl.schnaitter@teradata.com Neoklis Polyzotis UC Santa Cruz alkis@ucsc.edu ABSTRACT To obtain good system performance,

### Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities