Efficient Auditing For Complex SQL Queries


Raghav Kaushik, Microsoft Research; Ravi Ramamurthy, Microsoft Research

ABSTRACT

We address the problem of data auditing that asks for an audit trail of all users and queries that potentially breached information about sensitive data. A lot of the previous work in data auditing has focused on providing strong privacy guarantees and studied the class of queries that can be audited efficiently while retaining the guarantees. In this paper, we approach data auditing from a different perspective. Our goal is to design an auditing system for arbitrary SQL queries containing constructs such as grouping, aggregation and correlated subqueries. Pivoted on the ability to feasibly address arbitrary queries, we study (1) what privacy guarantees we can expect, and (2) how we can efficiently perform auditing.

Categories and Subject Descriptors
H.2 [Database Management]: Systems

General Terms
Security, Performance

Keywords
Security, Auditing, Access Control, Privacy, Query Processing

1. INTRODUCTION

Database systems are used today as the primary repository of the most valuable information in any organization. As the volume of data stored in these repositories has increased, protecting the security of the data has gained increasing importance. Further, the responsible management of sensitive data is mandated through laws such as the Sarbanes-Oxley Act, the United States Fair Information Practices Act, the European Union Privacy Directive and the Health Insurance Portability and Accountability Act (HIPAA). One of the important components of the DBMS security infrastructure is an auditing system that can be used to a posteriori investigate potential security breaches. Accordingly, there has been an increase in database auditing products on the market [1, 2, 3], including from the major database vendors.
[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD '11, June 12-16, 2011, Athens, Greece. Copyright 2011 ACM.]

As the database system is in production, these products monitor various operations such as user logins, queries, data updates and DDL statements to obtain an audit trail. The audit trail is analyzed offline, either periodically or when needed, to answer questions about access to schema objects such as: (1) find failed login attempts and (2) find queries and corresponding users that accessed columns containing PII (personally identifiable information). An important class of auditing is data auditing. A simple example is single-tuple auditing, where the goal is to find all queries and update statements that accessed a particular tuple, e.g., the PII of a specific individual. Such queries potentially reveal sensitive information. While we are not aware of any commercial database auditing system that supports this functionality, there has been prior research that studies this form of auditing. Prior work has proposed two fundamentally different semantics for data auditing, which we can classify broadly as (data) instance dependent and (data) instance independent. The basis for all data auditing semantics is to define what it means for a query to have accessed a particular tuple (that is, single-tuple auditing). In the instance-dependent approach [4], a query is said to access a tuple if deleting the tuple changes the query result on the database instance where the query was originally run; a tuple accessed by the query is said to be indispensable to the query.
Unfortunately, subsequent work has shown that the instance-dependent approach can lead to breaches of privacy [5]. In contrast, an instance-independent approach can be used to obtain strong privacy guarantees [5, 6, 7]. Under the instance-independent approach, a query is said to have accessed a tuple if there is some database instance where deleting it changes the query result. Previous work developing the instance-independent approach has focused on increasing the class of queries that can be audited efficiently while retaining strong privacy guarantees. Efficient auditing techniques have been developed for interesting subclasses of select-project-join (SPJ) queries. However, real-world queries such as the benchmark TPC-H queries are often complex, using constructs like grouping, aggregation and correlated subqueries. While it may be acceptable to consider a restricted class of audit expressions, an auditing system that restricts the class of audited queries is fundamentally incomplete. Therefore, in this paper we approach data auditing from a different perspective. Our goal is to design an auditing system for arbitrary SQL queries. Pivoted on the ability to feasibly address arbitrary queries, we ask (1) what privacy guarantees we can expect, and (2) how we can efficiently perform auditing. We first show that previously proposed instance-independent semantics are computationally incompatible with complex SQL: in the presence of subqueries, enforcing the semantics becomes undecidable (§3).

Therefore, we revisit the instance-dependent approach to define our auditing semantics. We begin with the special case of single-tuple auditing (§4). There are several real-world scenarios where a single tuple is a natural unit of analysis. For example, in an employee-department database, an employee tuple contains sensitive information such as the employee's salary. We define an auditing semantics by applying the informal notion of an indispensable tuple uniformly to all queries. In contrast, the approach proposed by Agrawal et al. [4] (1) is non-uniform, (2) does not cover arbitrary SQL and (3) becomes similar to the instance-independent approach in the presence of group-by and having clauses (§4.1). By adopting an instance-dependent semantics uniformly, we obtain a feasible implementation for an arbitrary query: it is possible to perform single-tuple auditing by running the query and a rewritten version that excludes the tuple, and checking if the results are equal. But in this process we also inherit the known privacy limitations of the instance-dependent approach [5]. We then ask whether there is at least a weaker privacy guarantee our semantics can provide. We answer in the affirmative. We introduce the notion of a risk-free attack and show that under our instance-dependent semantics, no attack is risk-free (§4.2). Intuitively, this means that an attacker may get access to sensitive information, but not without taking a risk of getting detected. While the above guarantee falls short of the stronger privacy guarantees yielded by the instance-independent approach, we believe it offers us an interesting way forward in addressing the full complexity of SQL. We then study how we can efficiently audit a workload of queries (§5). The straightforward implementation suggested by our definition can be inefficient (especially for a workload of complex queries). Even though auditing is typically an offline process, the efficiency of auditing is important.
For instance, auditing may be invoked as a response to an information breach in an attempt to narrow down the cause of the breach, in which case the efficiency of auditing is critical. In order to address the whole of SQL, we propose a novel rule-based optimization framework (§5.2) that attempts to audit a query without any query execution. The idea is to start with an algebraic plan for a query and find if we can reach a plan for the rewritten query by transforming the initial plan. The plan is transformed using equivalence rules in the usual way deployed by any rule-based query optimizer. The main difference is that in addition to the standard rules that hold for all database instances (for example, pushing a selection below a join, join commutativity), we also handle rules that are instance specific. Instance-specific rules are naturally derived from audit checks: a query that passes the audit is equivalent to the rewritten query. We refer to our framework as an audit optimizer. Just as the extensibility of rule-based query optimizers is key to ensuring the efficient compilation of arbitrary queries, our audit optimizer is naturally extensible, allowing us to add more rules over time. We also present a technique that considers reordering the queries in the workload to further improve efficiency (§5.4). In general, we would like to audit not only a single tuple but more complex audit expressions. Similar to previous work, we formulate our audit expression in the form of a forbidden view that captures the sensitive information. We discuss how our semantics, privacy guarantees and optimization techniques naturally extend to cover our class of forbidden views (§6). We then report the results of our initial empirical evaluation (§7), conducted over benchmark data in order to study the impact of our optimization techniques for complex queries.
Our evaluation shows that using our framework, we can reduce the time taken for auditing by up to an order of magnitude even when the queries are complex (containing correlated subqueries). We discuss related work in §8 and conclude in §9. To summarize, we propose an auditing semantics that can be feasibly implemented for all of SQL. We characterize its privacy guarantees and present novel techniques for efficient auditing, while not compromising the class of queries that we can support. We view our paper as an important first step in efficiently auditing complex SQL queries with privacy guarantees.

[Figure 1: Auditing Infrastructure. An offline auditing tool reads the audit log produced by the online monitoring infrastructure of the DBMS serving the application.]

2. PRELIMINARIES

In this section, we discuss the auditing infrastructure and overall auditing algorithm. We defer a discussion of the auditing semantics to §4.

2.1 Auditing Infrastructure

As with most systems that perform database auditing [1, 2, 3, 4, 5], our auditing system consists of two components, an online component and an offline component, shown in Figure 1. The online component is used in production to log the query and update statements issued to the database system. This is implemented using the monitoring infrastructure supported by all commercial database systems. Along with each query/update statement, we also log the corresponding user id. (The user id does not necessarily have to be a DBMS user id; many applications manage their users internally. We provide a call-back mechanism that can be used to ascertain the application user id.) We refer to the sequence of query and update statements with associated user ids as the workload. The workload is logged in a separate secure database called the audit log (we use previous work to ensure the audit log is tamper proof). The audit log is used to perform auditing in the offline component. In general, auditing is performed not only against the current database state but also over past database states.
Accordingly, the auditing component needs to be able to reconstruct past database states. We use the point-in-time recovery API provided by commercial database systems that lets us rewind the database state to any point in the past using the database transaction log.

2.2 Audit Expressions

In general, auditing is performed given an audit expression as input. Similar to previous work, our audit expression specifies sensitive data via a forbidden view (defined formally in §6). We define a notion of a query being safe with respect to the given view. We are given the workload W collected by the online component. Our goal is to find all query and update statements in W that are not safe, along with the corresponding user ids. We introduce the formal definition of safety for the special case of single-tuple auditing in §4, and the general case in §6. While we are not aware of any commercial database auditing system that supports such a

functionality, there has been prior research that studies this form of auditing [4, 5, 6, 7].

2.3 Overall Auditing Algorithm

In this paper, we primarily focus on the offline component of the auditing tool (see Figure 1). We now outline our overall algorithm in Algorithm 2.1.

Algorithm 2.1 Overall Auditing Algorithm
Input: Workload W of query/update statements with corresponding user ids; Audit Expression AE
Output: Subset of W that is unsafe
1: Use the database log to rewind to the appropriate state of the database
2: Split W into query-only workloads alternating with updates
3: Proceed in the same sequence as W
4: For each query-only workload QW
5:   Call QUERYAUDIT(QW, AE) to find all queries in QW that are unsafe
6: For each update statement U
7:   Call UPDATEAUDIT(U) to check if U is unsafe
8:   Update the current state of the database
9: Return all queries and updates that are unsafe along with corresponding user ids

Our auditing is performed on the state of the database when the query was originally run. Accordingly, we first use the transaction log to rewind to the state of the database when the first query in the workload was run (we can further optimize this step by taking periodic backups of the database). We then process the workload in order, splitting it into query-only portions alternating with data updates. We refer to the query-only portion of the workload as a query workload. The query workloads are audited by a query auditing algorithm QUERYAUDIT that is invoked over the corresponding state of the database, which we refer to as the current instance in the rest of the paper. The query auditing algorithm returns the unsafe queries. Each update statement is audited using an update auditing algorithm UPDATEAUDIT. The update is also replayed to obtain a new state of the database. Updates are handled in our approach by finding the query underlying the update and extending our semantics for queries. Updates can have a cascading effect.
For example, a sensitive record may be copied by an update statement and future queries could reference the copy. In this example, our auditing system will flag the original update statement that performs the copy. It is possible to track the copy in a subsequent auditing session. We could think of extending our auditing semantics to automatically include cascading effects, but we defer the extension to future work. For ease of exposition, in the rest of the paper we consider a fixed instance of the database and focus only on auditing queries. The details of the auditing algorithms are presented in §4, §5 and §6.

3. INSTANCE INDEPENDENT SEMANTICS

Previously proposed instance-independent semantics are pivoted on strong privacy guarantees. We have the notion of perfect privacy introduced by Miklau and Suciu [7] and the (weaker) notion of weak syntactic suspiciousness introduced by Motwani et al. [5]. For the special case of single-tuple auditing, both of the above definitions reduce to checking whether a tuple is critical to a query. A tuple drawn from the domain of the corresponding table is said to be critical to a query if there is some database instance where dropping the tuple changes the query result. We illustrate with an example.

Example 3.1 Consider the TPC-H benchmark database. Consider the query: select * from customer where c_custkey = 100 that asks for all details of the customer with id 100. Any customer tuple t with id 100 is critical, since we can construct a database with a customer with id 100 where deleting t makes the query result empty.

Previous work developing the instance-independent approach has focused on increasing the class of queries that can be audited efficiently while retaining strong privacy guarantees. Efficient techniques have been developed for checking whether a tuple is critical to a query for large subclasses of conjunctive queries [6].
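The criticality check of Example 3.1 can be made concrete. The following is a minimal, hypothetical sketch using Python and SQLite: we construct a witness instance, delete the tuple, and observe that the query result changes. The two-column schema is a pared-down stand-in for the TPC-H customer table, not the paper's implementation.

```python
import sqlite3

# A minimal, hypothetical sketch of the instance-independent criticality
# check behind Example 3.1: a tuple is critical if on SOME instance,
# deleting it changes the query result. We construct such a witness
# instance with sqlite3; the schema is a pared-down TPC-H customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_custkey INTEGER, c_name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(100, "Alice"), (101, "Bob")])  # witness instance

query = "SELECT * FROM customer WHERE c_custkey = 100"
before = conn.execute(query).fetchall()
conn.execute("DELETE FROM customer WHERE c_custkey = 100")  # drop tuple t
after = conn.execute(query).fetchall()

# Deleting t changed the result (one row -> empty) on this instance,
# which witnesses that t is critical to the query.
print(before != after and after == [])  # True
```

Note that a single witness instance suffices for criticality; deciding it in general requires reasoning over all instances, which is what becomes undecidable for complex queries.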
As stated in §1, our goal is different: we wish to be able to handle arbitrary SQL containing constructs such as grouping, aggregation and correlated subqueries. Even though the notion of a critical tuple applies to the whole of SQL, we show that checking whether a tuple is critical to a complex query involving subqueries is undecidable.

THEOREM 3.2. Checking if a tuple is critical to a query that is allowed to contain subqueries is undecidable.

PROOF. Recent work shows that the query containment problem under bag semantics in the presence of inequalities is undecidable [8]. We reduce this problem to checking criticality. Suppose we want to check if query Q1 is contained in query Q2 under bag semantics, denoted Q1 ⊆b Q2. We create a new query Q: select * from T where EXISTS (Q1 except all Q2), where T is a new table not referenced in either Q1 or Q2. We use the except all clause in SQL to compute the bag difference between Q1 and Q2. If Q1 ⊆b Q2, then Q is always empty and hence no tuple in the domain of T is critical. On the other hand, if Q1 ⊈b Q2, then every tuple in the domain of T is critical to Q.

It is known that checking perfect privacy and weak syntactic suspiciousness is at least as hard as checking whether a tuple is critical to a query [5, 7]. Thus, from Theorem 3.2, it follows that the previously proposed instance-independent semantics are computationally incompatible with the complexity of full SQL. We therefore revisit the instance-dependent semantics for auditing.

4. SINGLE TUPLE AUDITING

Motivated by the need to audit complex queries, we reconsider an instance-dependent semantics. We present our overall auditing semantics by beginning with the special case of single-tuple auditing. As noted in §1, there are several scenarios where a single tuple is a natural unit of auditing.
We present our extension to more general audit expressions in §6.

4.1 Query Differentials

We define the notion of query differentials to capture instance-dependent single-tuple auditing.

DEFINITION 4.1. Given a database instance D, a query Q and a tuple t specified by the value v of its primary key, we rewrite Q to exclude t from its table T by adding the predicate T.id ≠ v (we refer to this expression as T − t). The rewritten query is called the differential of Q with respect to t, denoted Q−t. We say that query Q accesses tuple t if Q(D) ≠ Q−t(D). If Q(D) = Q−t(D), then we say that Q is safe with respect to t.

We illustrate Definition 4.1 with an example.

Example 4.2 Consider the data and query in Example 3.1. Suppose there is a customer tuple t with id 100 in the current instance; then by Definition 4.1, tuple t is accessed by the query.

Our notion of a tuple being accessed by a query is identical to the informal notion of an indispensable tuple introduced by Agrawal et al. [4]. For the class of SPJ queries, the formal definitions are also identical. However:

1. The formal definition proposed by Agrawal et al. [4] is non-uniform. In fact, no definition of indispensability is offered for multi-block SQL queries that cannot be decorrelated.

2. For select-project-groupby-having queries, the approach proposed by Agrawal et al. [4] proceeds by dropping the having clause, which reduces it to the instance-independent semantics (we illustrate in Example 4.3).

In contrast, Definition 4.1 is uniformly instance dependent. The following example illustrates the difference between the instance-dependent and instance-independent semantics (and also the difference between our semantics and that of Agrawal et al. [4]).

Example 4.3 We continue with the database in Example 3.1. Consider the query Q: select o_custkey, count(*) from orders group by o_custkey having count(*) >= 10 that finds the number of orders placed per customer, but only for those customers that have at least 10 orders. Suppose that customer with id 100 has placed only 5 (< 10) orders, corresponding to tuples {o1, ..., o5} in the orders table. The output of Q on the current database would not have an entry corresponding to customer 100. Deleting any of the customer's orders does not change the output of Q. Thus, by Definition 4.1, none of the tuples in {o1, ..., o5} is accessed by Q. However, each of the tuples in {o1, ..., o5} is critical to Q; for example, we can construct an instance of the database where customer 100 has 10 orders including o1, where deleting o1 changes the output of Q. We note that the definition offered by Agrawal et al.
drops the having clause in the above query; thus their definition reduces to the instance-independent semantics.

4.2 Privacy Guarantees

As noted in §1, there are known privacy breaches implied by the instance-dependent semantics. One class of breaches is what are called negative disclosures. Consider a query Q that finds the subset of tuples satisfying a predicate P in a table T. Suppose an adversary is aware that a tuple t exists in the database. If the output of query Q does not include the tuple t, the adversary can infer that tuple t does not satisfy the predicate P. We now provide a different example of a privacy breach implied by the instance-dependent semantics.

Example 4.4 Suppose the customer table in the TPC-H database has a credit rating attribute. Suppose that in the current instance of the database, customer John Doe has a credit rating of 700. Consider the following queries, Q1: select sum(creditrating - 700) from customer and Q2: select sum(creditrating - 700) from customer where c_custname <> 'John Doe'. By checking if the results of the two queries are equal, an adversary can learn that John Doe's credit rating is 700. However, the tuple corresponding to John Doe is not accessed by either Q1 or Q2. Thus our auditing semantics would fail to detect the above attack.

The question arises whether we can precisely characterize the privacy guarantee that the instance-dependent approach yields. Let us re-examine Example 4.4. The attack in Example 4.4 essentially requires knowing the credit rating value a priori: if we change the credit rating of John Doe to, say, 600, query Q1 accesses the corresponding tuple and is therefore flagged as unsafe. Thus, if the adversary does not know the value of John Doe's credit rating upfront, then by issuing queries Q1 and Q2, he is taking a risk of being detected by the audit system. Based on the above intuition, we define the notion of a risk-free attack below.
As in the definition of perfect privacy [7], we assume that the adversary views the database as a probability distribution (over instances) obtained from independent tuple probabilities. We first formalize the notion of an attack. Fix a database instance D and a tuple t ∈ T of interest. For this discussion we assume (without loss of generality) that there is one sensitive attribute A in table T. Suppose the adversary has knowledge of all the data in the database except tuple t. Since we are assuming that the tuple probabilities for the adversary are independent, and since t is not critical to T − t, the adversary's knowledge preserves perfect privacy of the value of t.A [7]. The adversary would like to learn some non-trivial boolean property of t.A; we call a boolean property non-trivial if there is at least one value in the domain of t.A that makes it true and at least one value that makes it false. In Example 4.4, the attacker would like to know whether John Doe's credit rating is 700 (or not). We assume that the only interaction the adversary has with the data is running queries and examining their results. The adversary issues a set of queries {Q1, ..., Qn} to the database and computes a boolean function f on their results. Intuitively, the pair <{Q1, ..., Qn}, f> constitutes an attack if the overall computation reveals some non-trivial property of t.A. We formalize this intuition below.

DEFINITION 4.5. Consider the pair <{Q1, ..., Qn}, f> where {Q1, ..., Qn} is a set of queries and f is a boolean function. Define the following function h over the domain of t.A. Given value α in the domain of t.A:

1. Create database instance Dα, obtained from the instance D by changing the value of t.A to α
2. Compute h(α) = f(Q1(Dα), ..., Qn(Dα))

The pair <{Q1, ..., Qn}, f> is defined to be an attack if h is some non-trivial boolean property of t.A.

Example 4.4 illustrates an example of an attack. The goal of an attack is to cheat the auditing system.
In other words, it should infer a non-trivial boolean property of a tuple by running a set of queries that the auditing system deems to be safe.

DEFINITION 4.6. An attack <{Q1, ..., Qn}, f> is said to be:

1. Successful for a database instance Dα (as defined in Definition 4.5) if each Qi ∈ {Q1, ..., Qn} is safe with respect to t in Dα.
2. Risk-free with respect to an auditing system if it is successful for all database instances Dα.

Example 4.7 The attack described in Example 4.4 is successful in a database instance where the credit rating of John Doe is 700.

The above example illustrates a key difference between the instance-dependent semantics and the instance-independent semantics (based on perfect privacy and weak syntactic suspiciousness), which does not permit successful attacks. However, as noted above, the attack in Example 4.4 is not a risk-free attack. If the credit rating of John Doe is not 700, then query Q1 would be flagged as unsafe. In fact, we can show that for our auditing system that uses the instance-dependent semantics, no risk-free attacks are possible. As a special case, it follows that negative disclosures are not risk-free.

THEOREM 4.8. No attack is risk-free with respect to the single-tuple auditing semantics introduced in Definition 4.1.

PROOF. We provide a proof sketch. Suppose that the tuple of interest is t ∈ T. Suppose to the contrary that we have a risk-free attack <{Q1, ..., Qn}, f>. From Definition 4.6, it follows that each Qi could essentially be equivalently rewritten to only go over the view T − t (formally, each Qi is conditionally valid [9] with respect to the view T − t). This contradicts the fact that <{Q1, ..., Qn}, f> is an attack.

In this section, we present a novel characterization of the privacy guarantee that the instance-dependent approach yields. While it is a weaker guarantee than what the instance-independent approach yields, we think it is interesting given that it does not restrict the set of queries that can be feasibly audited. As mentioned in the introduction, an auditing system that cannot handle arbitrary queries is fundamentally incomplete. At the same time, there are stronger notions of privacy that detect all successful attacks, but it is not clear how we can support them efficiently for complex queries. Understanding the spectrum in between is an interesting direction for future work.
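The interplay between Example 4.4 and Definition 4.1 can be sketched end to end. The following hypothetical Python/SQLite snippet uses a simplified customer table (the column names, and the use of c_custname in place of a key for the differential, are illustrative only): the attack succeeds when the rating is 700, but becomes detectable when it is not.

```python
import sqlite3

# A sketch of Example 4.4 under the instance-dependent semantics of
# Definition 4.1, on a simplified customer table. A query is "safe" if it
# equals its differential: the same query with John Doe's tuple excluded.
def run(conn, sql):
    return conn.execute(sql).fetchall()

def is_safe(conn, q, q_diff):
    return run(conn, q) == run(conn, q_diff)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_custname TEXT, creditrating INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("John Doe", 700), ("Jane Roe", 650)])

q1 = "SELECT sum(creditrating - 700) FROM customer"
q2 = q1 + " WHERE c_custname <> 'John Doe'"
q1_diff = q1 + " WHERE c_custname <> 'John Doe'"      # Q1 over customer - t
q2_diff = q2 + " AND c_custname <> 'John Doe'"        # Q2 over customer - t

# The attack succeeds: both queries are safe (John Doe contributes
# 700 - 700 = 0 to Q1), yet Q1 == Q2 reveals the rating is exactly 700.
print(is_safe(conn, q1, q1_diff), is_safe(conn, q2, q2_diff),
      run(conn, q1) == run(conn, q2))  # True True True

# But the attack is not risk-free: had the rating been 600, Q1 would
# differ from its differential and be flagged unsafe.
conn.execute("UPDATE customer SET creditrating = 600"
             " WHERE c_custname = 'John Doe'")
print(is_safe(conn, q1, q1_diff))  # False
```

The final print illustrates Theorem 4.8 in miniature: on some instance in the family Dα, at least one of the attack queries fails the audit.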
4.3 Baseline Query Auditing Algorithm

We note that Definition 4.1 lends itself to a feasible implementation for arbitrary SQL queries: run the query and its differential and check if the results are the same. This baseline algorithm is illustrated in Algorithm 4.1. As described in §4.2, this algorithm disallows risk-free attacks.

Algorithm 4.1 Baseline Query Auditing Algorithm QUERYAUDIT
Input: (1) Query workload QW of query statements with corresponding user ids, (2) Current state of database D, (3) Tuple t
Output: Subset of QW that is unsafe
1: For each query Q ∈ QW
2:   Check if Q(D) and Q−t(D) are equal by executing both
3:   If (not equal) report that Q is unsafe

The baseline algorithm can be inefficient, especially given a workload of complex queries. As discussed in §1, even though auditing is typically an offline process, the efficiency of auditing is important. We next present a more efficient algorithm for query auditing.

5. AUDIT OPTIMIZER

As discussed in §1, even though auditing is typically an offline process, the efficiency of auditing is important. For instance, auditing may be invoked as a response to an information breach in an attempt to narrow down the cause of the breach, in which case the efficiency of auditing is critical. The baseline implementation outlined in Algorithm 4.1 involves the execution of the query and its differential and a check to examine if the results are identical, which can be quite expensive. Efficient auditing is the focus of this section. We first discuss optimizations for simple classes of queries such as select-project-join (SPJ) queries and then discuss an audit optimization framework for general queries.

5.1 Optimization Techniques For Simple Queries

Based on previous work on incremental view maintenance [10], we can design various optimizations for special classes of queries such as select-project-join (SPJ) queries.
We first discuss a technique, which we term substitution, based on previous work [4] for auditing SPJ queries without self-joins. Consider a single tuple t in a table T. Suppose we want to check if a self-join-free SPJ query Q accessed t. The substitution technique creates a new query Q' from Q by substituting table T with a table T' which has exactly the single tuple t in it. This is accomplished by adding a suitable predicate to Q. Checking whether Q accessed t is equivalent to checking if the output of Q' is empty. The following example illustrates this technique.

Example 5.1 Consider the following query Q on the TPC-H schema: select * from orders where o_orderdate > [date]. Suppose the tuple of interest is a specific sensitive order, say o_orderkey = 5000. The substitution approach would check if the output of the following query Q' is empty: select * from orders where o_orderdate > [date] and o_orderkey = 5000. This check can be very efficiently implemented if there is an index on the o_orderkey column. Sometimes, we can check if the modified query Q' is empty without executing it, for instance if Q contains the predicate o_orderkey < 3000 and we are interested in o_orderkey = 5000. This technique has been termed static pruning [11].

We can extend the above technique to handle simple grouping and aggregation. However, just as with previous work on incremental view maintenance, there is no simple extension to more complex queries.

5.2 Rule Based Framework

In this section, we present a novel rule-based framework that is extensible to the whole of SQL and that attempts to audit queries without any query execution. We motivate our framework with a simple example.

Example 5.2 Consider the following workload of two queries:

1. select * from orders where o_orderdate > [date1]
2. select * from orders where o_orderdate > [date2]

Suppose that the tuple of interest is a specific order tuple t. Suppose that we have already verified that the first query is safe.
Since the second query is subsumed by the first, we can infer that the second query is also safe without executing it.

Example 5.2 suggests that we can use the result of auditing one query in auditing future queries. We propose a rule-based framework, similar to what is used in rule-based query optimizers [12, 13], to capture the optimization

illustrated in Example 5.2. (We defer a discussion of related work on answering queries using views to §8.) The idea is that if a query Q does not access tuple t, then Q is equal to its differential on the current database instance. We incorporate this knowledge in the form of a rule. Unlike traditional optimizer rules (e.g., the join commutativity rule), which are valid for all database instances, the new rules are valid only on the current database instance. Hence, we term them instance equivalence rules. We represent the rules using the algebraic representation of a query in the form of a logical plan.

DEFINITION 5.3. An instance equivalence rule is an ordered pair of logical plans, with a left-hand side (LHS) and a right-hand side (RHS), whose results are equal on the current database instance.

While leveraging instance equivalence rules is reminiscent of semantic query optimization [14, 15], we note that there are important differences. We model the rules algebraically, which lets us easily integrate them with traditional rules as well as represent equivalences between complex query plans (in contrast to semantic query optimization, which typically relies on exposing soft constraints as semantic rules). Moreover, instance equivalences have not been previously leveraged for data auditing, where the previously audited queries provide a natural source from which to derive such rules.

Example 5.4 Consider the first query discussed in Example 5.2, which is safe with respect to the tuple t mentioned in the same example. This fact is represented as the instance equivalence rule shown in Figure 2. The rule shows that the selection predicate (denoted by σ1) on the original orders table produces the same result as on its differential; note that the table in the RHS of the rule is (orders) − t.

[Figure 2: Instance Equivalence Rule For Select Query]
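The rule of Example 5.4 can be checked concretely on a toy instance. The sketch below, using sqlite3 with a hypothetical two-column stand-in for the orders table (the key values and date literal are invented for illustration), shows the two sides of an instance equivalence rule agreeing on the current instance.

```python
import sqlite3

# A minimal sqlite3 sketch of the instance equivalence rule of Example 5.4:
# on the current instance, the selection sigma1 over orders equals the same
# selection over the differential table orders - t (because the earlier
# audit found the query safe). Schema, key values and the date are
# hypothetical stand-ins for the TPC-H orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (o_orderkey INTEGER, o_orderdate TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(5000, "1994-06-01"), (6000, "1996-03-15")])

t_key = 5000  # the audited tuple t, which does NOT satisfy sigma1 here
lhs = "SELECT * FROM orders WHERE o_orderdate > '1995-01-01'"
rhs = lhs + f" AND o_orderkey <> {t_key}"   # same selection over orders - t

lhs_rows = conn.execute(lhs).fetchall()
rhs_rows = conn.execute(rhs).fetchall()
# The two plans agree on this instance, so (lhs, rhs) is a valid instance
# equivalence rule and can be reused when auditing later queries.
print(lhs_rows == rhs_rows)  # True
```

In the paper's framework the rule is stored as a pair of logical plans rather than SQL strings; the SQL form here is only meant to make the instance-level equality tangible.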
Broadly, we have both the standard transformation rules (like those used in a query optimizer) and the instance equivalence rules on the current instance of the database. For a given query, we use these rules to try to reach its differential. If we succeed, we know that the query and its differential are equal on the current database instance, without performing any query execution. If we fail, we fall back to query execution (perhaps an optimized form, as discussed in Section 5.1). We illustrate this approach through an example.

Example 5.5 Consider the second query in Example 5.2. Using the instance equivalence rule in Figure 2, we can start from the logical plan corresponding to the second query and reach the logical plan of its differential. The sequence of transformations is shown in Figure 3. The selection predicate for the second query is denoted by σ2. The individual steps in the derivation are as follows.

1. Rewrite the original plan as a predicate over the LHS of the instance equivalence rule.
2. Replace the LHS of the instance equivalence rule with the RHS.
3. Collapse the predicates using a rule that eliminates redundant selection predicates, obtaining the target differential plan.

Figure 3: Inferring Equality of a Query and its Differential for a Select Query

We can see from Example 5.5 that we match against the LHS of a rule in order to use the rule. Once we find a match, we consider applying the rule by replacing the matching portion with the RHS. A plan P′ is said to be reachable from a plan P if we can transform P into P′ by matching and applying one or more rules. We now define the formal problem addressed by the transformation-based audit optimizer.
Reachability Problem: Given a set of transformation rules R, a set of instance equivalence rules R′, an input plan P, and the plan P′ of its differential, check whether P′ is reachable from P through a sequence of rules in R ∪ R′.

Our framework is extensible. Just as commercial optimizers can handle arbitrary SQL using transformation rules, we can easily extend our framework to handle complex SQL queries by adding suitable transformation and instance equivalence rules. Rule matching and application are performed in a manner similar to rule-based query optimizers, both for standard transformation rules and for instance equivalence rules. The LHS is treated as a pattern and matched exactly. Alternatively, we check whether any part of the plan under consideration is subsumed by the LHS. For subsumption, we use view-matching logic. For instance, in Example 5.5 the first step of the rewrite is based on view matching. View matching as implemented in commercial optimizers is restricted to the class of materializable views. In our implementation, we also match against more complex views, as we illustrate below.

Example 5.6 Consider the following two queries, similar to the previous example but including a nested subquery.

1. select * from orders where o_orderdate > <date1> and o_totalprice > (select avg(o_totalprice) from orders)
2. select * from orders where o_orderdate > <date2> and o_totalprice > (select avg(o_totalprice) from orders)

Let the tuple of interest t be a specific order tuple. Suppose that the first query is safe. The corresponding instance equivalence rule is shown in Figure 4. The LHS of the rule uses an Apply operator [16, 17] to represent the subquery. We normalize the plan by pulling the selection above the Apply (we identify the selection above the Apply with the Apply itself). Similar to the previous example, we can again establish the equality of the second query and its differential, as shown in Figure 5.
An interesting point to note is that the first step leverages a more sophisticated form of

Figure 4: Instance Equivalence rule for a nested subquery

Figure 5: Inferring Equality of a Query and its Differential for a Nested Subquery

view matching than in Example 5.5. We reason that an Apply operator is subsumed by another if its parent selection is subsumed and its children are identical. We note that this goes beyond existing view-matching techniques in commercial systems, which do not handle subqueries. The rest of the derivation is similar to the previous example. We finally illustrate an example of a rewriting that involves multiple applications of instance equivalence rules.

Example 5.7 The instance equivalence rules obtained from prior executions are shown in Figure 6. We observe that there are two of them. The rule involving the max aggregation corresponds to the fact that the tuple of interest does not have the maximum value in the table.

Figure 6: Instance Equivalence Rules

Figure 7 illustrates how we can leverage the above two rules to derive the safety of a complex query involving a nested subquery. Note that in Figure 7 we move away from the normalized form of the Apply operator by pushing the selection down. We then find that both the left and right children of the Apply operator are subsumed by the LHS of some instance equivalence rule. Applying both rules yields the target plan of the differential query.

5.3 Implementation

In this section we briefly discuss how the proposed rule-based framework can be implemented.
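Before turning to the architectural options, here is a hedged sketch of the core reachability check itself: a bounded breadth-first search over plans, applying both kinds of rules uniformly. Plans are encoded as nested tuples, and the rules below (a subsumption rewrite, the instance rule of Figure 2, and a redundant-selection collapse) reenact the derivation of Example 5.5; all names are illustrative assumptions, not the paper's code.

```python
from collections import deque

# Plans as hashable nested tuples: (op, arg, (child, child, ...)).
# Transformation rules are valid everywhere; instance equivalence rules
# only on the current instance. The search treats both uniformly.

def rewrites(plan, lhs, rhs):
    """Yield every plan obtained by replacing one occurrence of lhs."""
    if plan == lhs:
        yield rhs
    op, arg, kids = plan
    for i, child in enumerate(kids):
        for new_child in rewrites(child, lhs, rhs):
            yield (op, arg, kids[:i] + (new_child,) + kids[i + 1:])

def reachable(start, target, rules, limit=10_000):
    """Breadth-first search over plans; bounded to keep overhead in check."""
    seen, frontier = {start}, deque([start])
    while frontier and len(seen) < limit:
        plan = frontier.popleft()
        if plan == target:
            return True
        for lhs, rhs in rules:
            for nxt in rewrites(plan, lhs, rhs):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return False

orders = ("table", "orders", ())
diff = ("diff", "t", (orders,))
# Instance rule r1 (Figure 2): s1(orders) == s1(orders - t).
r1 = (("select", "s1", (orders,)), ("select", "s1", (diff,)))
# Subsumption rewrite (step 1 of Example 5.5): s2 == s2 over s1.
subsume = (("select", "s2", (orders,)),
           ("select", "s2", (("select", "s1", (orders,)),)))
# Redundant-selection collapse (step 3): s2 over s1 collapses to s2.
collapse = (("select", "s2", (("select", "s1", (diff,)),)),
            ("select", "s2", (diff,)))

q2 = ("select", "s2", (orders,))
target = ("select", "s2", (diff,))
print(reachable(q2, target, [subsume, r1, collapse]))  # True
```

The `limit` parameter plays the role of the timeout discussed later: if the search budget is exhausted without reaching the differential plan, the auditor falls back to query execution.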
If we modify the DBMS code, we can consider the framework to be a special mode of the query optimizer (the audit optimizer mode), in which we modify a traditional rule-based query optimizer to accept as additional input the instance equivalence rules and the original logical plan, and to check whether the differential plan is reachable using its rule engine. We would in addition need to augment the view-matching rules to match more complex operator trees, as discussed above.

Figure 7: Inferring equality using multiple instance equivalence rules

Alternatively, we can build a client-based solution in which we implement a rule engine to manage the set of original transformation rules as well as the instance equivalence rules. Note that we only need the rule application and matching logic of a query optimizer; in particular, we do not need the cardinality estimation and cost estimation modules. In addition, we can leverage the database server for performing predicate subsumption and view matching using the hypothetical views API provided by most commercial database systems [18]. To check whether a view V1 is subsumed by V2, we create a hypothetical view (which is just a metadata entry in the catalog and does not contain any data), optimize the query corresponding to V1, and check whether it can be answered using the hypothetical view V2.
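For intuition only, a drastically simplified stand-in for the subsumption test: the hypothetical-view API handles rich view definitions, but for the single-column range predicates used in this paper's examples the containment check reduces to comparing constants. This toy function is an assumption-laden illustration, not the server's actual logic.

```python
# Subsumption for views of the form "col > constant": view v2 contains
# view v1 iff every row with col > c1 also has col > c2, i.e. c1 >= c2.
# ISO date strings compare correctly as plain strings.

def subsumes(v1, v2):
    """Return True if view v1 = (col, c1) is subsumed by v2 = (col, c2)."""
    col1, c1 = v1
    col2, c2 = v2
    return col1 == col2 and c1 >= c2

# The query with the later ship date is contained in the earlier one.
print(subsumes(("l_shipdate", "1995-06-01"), ("l_shipdate", "1995-01-01")))  # True
print(subsumes(("l_shipdate", "1995-01-01"), ("l_shipdate", "1995-06-01")))  # False
```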
Algorithm 5.1 OPTQUERYAUDIT
Input: Query-only workload QW of query statements with corresponding user ids; tuple t in table T
Output: Subset of QW that is unsafe
1: Let D be the database state corresponding to QW
2: Let I be the set of instance equivalence rules
3: For each query Q in QW:
4:   Let P and P′ denote the logical plans for Q(D) and Q−t(D)
5:   Check if P′ is reachable from P using I
6:   If not, check if Q(D) and Q−t(D) are equal
7:   If equal, augment I with a new instance equivalence rule
8:   If not equal, report that Q is unsafe

As the number of instance equivalence rules increases, the overhead of checking reachability using the rule engine also increases. We can leverage techniques that are typically used to control the overheads of query optimization, such as timeouts: we run the rule-based framework until a particular timeout expires, and if we have not yet reached the target plan, we resort to query execution. We defer a more thorough study of how the overheads of the rule-based framework can be tuned to future work. Another issue to consider is the maintenance of the instance equivalence rules under updates. Recall that these rules are valid only for a particular database instance. Currently, as a first step, we use simple syntactic checks, such as invalidating only those rules that refer to the table being updated. More advanced techniques for maintaining the instance equivalence rules in the presence of updates are an interesting direction for future work.

We present the optimized algorithm for single tuple auditing, which augments the baseline approach with the rule-based framework, in Algorithm 5.1. The key differences from the original algorithm are as follows. In Steps 4-5, we use the rule-based framework to check whether the differential plan is reachable from the original plan. If so, we skip the execution for this query. Otherwise, we generate a new instance equivalence rule (which is added to the set I) if the query turns out to be safe.

5.4 Reordering Queries

We note that in Algorithm 5.1, we do not change the order in which queries are processed. In this section we present an optimization that reorders the queries in the workload to further improve efficiency. We motivate this optimization with an example.

Example 5.8 Consider the following workload of three queries, presented in order.

1. select * from orders where o_orderdate > <date1> and o_totalprice > (select avg(o_totalprice) from orders)
2. select * from orders where o_orderdate > <date2> and o_totalprice > (select avg(o_totalprice) from orders)
3. select * from orders where o_orderdate > <date3> and o_totalprice > (select avg(o_totalprice) from orders)

As discussed in Example 5.6, we can use instance equivalence rules to avoid any query execution for the second query. However, we would still need to execute the third query, since it is not subsumed by any of the previous queries. Thus, we would save only the execution corresponding to the second query. Suppose instead that we reorder the queries in the workload so that we audit the third query first. If the third query is safe, we can then infer the same for both remaining queries, thus saving two query executions. We formally define the Query Reordering Problem below.
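The reordering idea in Example 5.8 can be sketched as building a subsumption DAG over the workload and auditing queries in topological order, so that the most general query is executed first. The toy date bounds and query names below are hypothetical; Python's standard `graphlib` provides the topological sort.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Queries as name -> lower bound of their o_orderdate range predicate
# (illustrative constants). Query A subsumes query B when B's predicate
# range lies inside A's, i.e. B's bound is at least A's bound.
queries = {"q1": "1995-01-01", "q2": "1995-03-01", "q3": "1994-06-01"}

def subsumed_by(b, a):
    # b (col > date_b) is subsumed by a (col > date_a) iff date_b >= date_a
    return a != b and queries[b] >= queries[a]

# graphlib wants, for each node, the set of its predecessors. Making a
# query depend on every query that subsumes it means subsuming (more
# general) queries are audited first.
deps = {q: {p for p in queries if subsumed_by(q, p)} for q in queries}
order = list(TopologicalSorter(deps).static_order())
print(order)  # the most general query, q3, comes first
```

If q3 is audited first and found safe, the instance equivalence rule it yields covers q1 and q2, matching the two saved executions in the example.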
Query Reordering Problem: Given a query-only workload, find the permutation of the sequence of queries that leads to the minimal number of executions for single tuple auditing.

We solve the above problem by creating a subsumption graph: a directed acyclic graph (DAG) with one node per query in the workload and an edge from node Q1 to node Q2 if Q2 is subsumed by Q1. We process queries in the order yielded by a topological sort of the subsumption graph. The rest of the algorithm is identical to Algorithm 5.1.

6. AUDIT EXPRESSIONS

We now briefly discuss how we extend our solution for single tuple auditing to more general audit expressions. Similar to previous work, our audit expressions are expressed as forbidden views. We begin by considering forbidden views expressed as a predicate over a single table. If the predicate is of the form id = value, then the problem reduces to single tuple auditing.

Example 6.1 Consider the following view over the Patients(patientID, disease) table in a health-care database:

select * from Patients where disease = 'cancer'

The view expresses the intuition that information about cancer patients is sensitive.

We also support a limited class of joins as forbidden views. Our goal in extending forbidden views to joins is to express simple predicates on set-valued attributes. For instance, an extension of Example 6.1 would be a database that stores information about patients in a table called Patients but stores the diseases in a separate table Diseases, to account for the fact that a patient might suffer from multiple diseases over time. In this case, we can think of a patient as having a set-valued attribute containing all diseases they have suffered from over time.
More formally, we allow the following class of forbidden views:

select * from universal-table where condition-list

The term universal table refers to the join of a set of tables: we begin with a key table (e.g., Patients) that intuitively captures the atomic attributes, and join in other tables (e.g., Diseases) through foreign-key lookups that essentially add the set-valued attributes. The condition list consists of simple predicates that do not involve subqueries. We illustrate with the following example.

Example 6.2 Consider the Patients-Diseases example above. We can write the view

select * from Patients P join Diseases D on P.patientID = D.patientID where P.zipcode = <zip> and D.disease = 'cancer'

to capture our desire to hide information about cancer patients in the zip code <zip>. Although our forbidden views are not as general as the class considered by previous work, they cover most of the examples previously considered [4, 5].

We now briefly discuss how our auditing semantics extends to cover the above class of forbidden views. As noted above, we think of the forbidden view as expressing a boolean predicate on a tuple containing a set-valued attribute. For single tuple auditing, we rewrite a query to exclude the tuple. Given a forbidden view, we similarly rewrite the query to exclude any tuples in the universal table that belong to the forbidden view. The rewritten query is called its differential with respect to the forbidden view. A query is deemed safe with respect to the forbidden view if it has the same result as its differential on the current instance of the database. We illustrate with an example.

Example 6.3 We continue with the forbidden view in Example 6.2.
Consider the query Q:

select * from Patients where patientID = 'Alice'

The differential of Q is:

select * from Patients P where patientID = 'Alice' and not (zipcode = <zip> and exists (select * from Diseases D where P.patientID = D.patientID and D.disease = 'cancer'))

If Alice happens to live in the zip code <zip> and has suffered from cancer, then the differential produces a different result. Otherwise, Q is deemed safe with respect to the forbidden view.
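The rewrite in Example 6.3 can be run end-to-end; here is a minimal sketch using `sqlite3`. The schema, the sample rows, and the zip-code constant `'12345'` are all invented placeholders for illustration, since the paper's own example elides the concrete zip code.

```python
import sqlite3

# Runnable sketch of differential rewriting for the forbidden view of
# Example 6.2 (cancer patients in a given zip code). Data is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Patients(patientID TEXT PRIMARY KEY, zipcode TEXT);
    CREATE TABLE Diseases(patientID TEXT, disease TEXT);
    INSERT INTO Patients VALUES ('Alice','12345'), ('Bob','99999');
    INSERT INTO Diseases VALUES ('Alice','cancer'), ('Bob','flu');
""")

query = "SELECT * FROM Patients WHERE patientID = ?"
differential = """
    SELECT * FROM Patients P WHERE patientID = ?
      AND NOT (P.zipcode = '12345' AND EXISTS
               (SELECT * FROM Diseases D
                WHERE D.patientID = P.patientID AND D.disease = 'cancer'))
"""

def is_safe(pid):
    q = con.execute(query, (pid,)).fetchall()
    d = con.execute(differential, (pid,)).fetchall()
    return q == d  # safe iff the differential returns the same result

print(is_safe("Alice"))  # False: Alice matches the forbidden view
print(is_safe("Bob"))    # True
```

The query on Alice is flagged unsafe because removing the forbidden tuples changes its result, exactly the audit criterion defined above.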

In general, each table referred to in a query is rewritten to exclude the forbidden sets. We omit the details of the rewriting, which are straightforward. Interestingly, both our privacy guarantees and our audit optimization techniques extend in a straightforward manner to cover the class of forbidden views above. We assume that the adversary views the database as a probability distribution (over instances) obtained from independent universal-table tuple probabilities. The elements in a set-valued attribute need not be independent. For instance, in Example 6.2, we do not assume that the diseases afflicting a patient are independent of one another; but there is independence between the universal records (patient records in Example 6.2). With this adversary model, it is not difficult to extend the definitions of an attack and a risk-free attack to tuples with set-valued attributes. Just as in Theorem 4.8, we can show that the audit semantics proposed above for forbidden views guarantees that no attack is risk-free.

THEOREM 6.4. No attack is risk-free with respect to the forbidden-view auditing semantics defined above.

Similarly, the baseline algorithm and audit optimization techniques also extend. Instead of an expression of the form T − t for a table T and tuple t, we have an expression of the form T − σp(T), where the predicate p is derived from the forbidden view. With that change, the plan transformation proceeds as in Section 5. For instance, the instance equivalence rule r1: σ1(T) ≡ σ1(T − t) in Figure 2 would instead be the instance equivalence rule r1′: σ1(T) ≡ σ1(T − σp(T)). The rule r1′ can be used in the same way as r1 in Example 5.5.

6.1 Relationship to Previous Work

There are two strains of previous work we now comment on. Our approach to forbidden views outlined above differs from the (instance-dependent) semantics adopted by Agrawal et al. [4]. The semantics proposed by Agrawal et al.
[4] deems a query safe with respect to a forbidden view if they share no indispensable tuples. The following example illustrates how our semantics differs.

Example 6.5 Consider the forbidden view expressed in Example 6.2 and the query

select * from Diseases D where D.disease = 'heart attack' and D.patientID = 'Alice'

Suppose that patient Alice lives in the zip code <zip> and has suffered from both a heart attack and cancer. Alice's disease record corresponding to the heart attack is not indispensable to the forbidden view. Thus, by the semantics proposed by Agrawal et al. [4], the query would be deemed safe. Under our definition, on the other hand, the query would be deemed unsafe. Essentially, our semantics for forbidden views assumes that universal-table tuples are independent but the elements within a single set-valued attribute (e.g., heart attack and cancer, which are both correlated with smoking) are not necessarily independent. Our semantics yields the privacy guarantees discussed above. This differs from the approach proposed by Agrawal et al. [4], where privacy guarantees are not discussed.

The second strain of related previous work is the recent work by Fabbri et al. [11], which focuses on auditing an authorization policy in an instance-dependent approach. As with authorization policies, the auditing semantics presented in this paper also performs query rewriting. However, our input is a forbidden view, not an authorization policy. We can think of our semantics, intuitively, as deriving an authorization policy based on the complement of the forbidden view in order to rewrite the query. However, we formally analyze the privacy guarantees yielded by such an approach. We also propose a generic optimization framework applicable to arbitrary queries; in contrast, the optimizations proposed by Fabbri et al. [11] focus mostly on SPJ queries.
7. EXPERIMENTAL EVALUATION

In this section, we present an experimental evaluation of the techniques presented in Section 5 for reducing the cost of computing differentials for complex queries. We focus primarily on the offline component of the auditing tool. The goals of the experimental study are to measure the impact of our optimization techniques as a function of (1) the data size, (2) the number of queries failing the audit, (3) the selectivity of the audit expression, and (4) the fraction of updates in the workload. We first discuss some details of our experimental setup and then present our results and summarize their implications.

7.1 Implementation and Experimental Setup

Our auditing tool is built as a client application over Microsoft SQL Server 2008. The online part of the audit relies on the auditing infrastructure of SQL Server [19]. The audit optimizer is built as part of the client application (as described in Section 5.3). The current prototype has the basic set of transformation rules (covering joins, selections, aggregates, group-by, etc.) as well as view-matching rules for Apply operator trees. As discussed previously, we build on the hypothetical view API of Microsoft SQL Server 2008 for checking subsumption. We also leverage a client-side parser tool to parse SQL queries into logical plans and to derive instance equivalence rules for queries that are safe. We use the TPC-H benchmark database [20] for our experiments. We use the 1GB version for most of our experiments but also present results on the 100MB and 10GB versions when we examine the sensitivity of the optimization techniques to data size. For our query workload, we choose the following stored procedure on the TPC-H schema, which obtains information about high-priced orders (defined as orders whose price is close to that of the highest-priced order) that were shipped after a certain date. The procedure is based on query 17 in the benchmark.
SELECT * FROM orders, lineitem
WHERE l_orderkey = o_orderkey
  AND l_shipdate > $1
  AND o_totalprice > 0.9 * (SELECT max(o_totalprice) FROM orders)

The query workload is generated using 30 random instances of the above stored procedure. We first consider a scenario for single tuple auditing where we audit a specific order of a particular customer (which could represent sensitive information), specified by a predicate on the o_orderkey column. In other words, we are interested in finding the specific instances of the stored procedure that accessed the information corresponding to an order of a particular customer. For our experiments on audit expressions, we audit for orders in a particular time window, specified by a predicate on the o_orderdate column. We choose the above stored procedure because it tests the performance of our optimizations on a complex query containing aggregation and a subquery. Moreover, stored procedures are commonly used in databases, and it is important to test the performance of our techniques in such scenarios.

7.2 Comparing Different Alternatives

In this experiment, we first compare how the different techniques discussed in Section 5 perform for the basic case of single tuple auditing when all the queries are safe (we revisit this assumption in a later

experiment). We ensure this by auditing for a customer order that is not among the high-priced ones. We present results for the following techniques.

1. Baseline: The baseline approach that checks whether the query and its differential are equal for every query in the workload.
2. Opt: The technique (described in Algorithm 5.1) that leverages previously safe queries by deriving and exploiting instance equivalence rules.
3. Opt-Reorder: The technique (described in Section 5.4) that in addition reorders the sequence of queries to best exploit the instance equivalence rules.

Figure 8: (a) Comparison of different alternatives (b) Varying Data Size

Figure 8(a) shows (1) the total number of query executions and (2) the total execution time in seconds. The graph indicates that Opt-Reorder reduces the number of executions by a factor of 30 compared to the Baseline approach. Note that in this experiment all queries are safe; Opt-Reorder therefore executes only the query with the largest predicate range of l_shipdate first, and the remaining queries are then inferred to be equivalent to their corresponding differential queries using the instance equivalence rules. However, since Opt-Reorder still executes the most expensive query, the corresponding ratios for execution times are smaller (though still significant). The graph indicates that the optimization techniques proposed in this paper have the potential to reduce the time taken for single tuple auditing by an order of magnitude.

7.3 Varying Data Size

In this experiment, we repeat the previous experiment while varying the original database size. We present results for three versions of the TPC-H database (100MB, 1GB and 10GB).
Since the number of executions remains the same for each technique (independent of the database size), we only report the speedup in execution time over the baseline algorithm. The results in Figure 8(b) illustrate the following points. First, for small databases Opt-Reorder can be worse than Opt. This is because the relative overhead of computing the subsumption graph (recall that our implementation leverages the hypothetical views API for checking predicate/view subsumption) can be significant, particularly when query execution times are small. Second, as the database size increases, the relative overhead of computing the subsumption graph shrinks, and Opt-Reorder can provide significant reductions in auditing time compared to Opt.

7.4 Sensitivity to Audit Failure Rate

Both Opt and Opt-Reorder leverage instance equivalence rules derived from previously safe queries. Thus, an important parameter that could influence the performance of the proposed techniques is the audit failure rate, defined as the fraction of the queries that fail the audit. To vary this parameter we repeat the experiment in Section 7.2 but force the first k% of the queries to fail the audit. We note that this experiment is biased against Opt-Reorder (and not Opt), since the first few queries (after reordering) are typically the most expensive queries in the workload and we essentially force Opt-Reorder to execute them. The results in Figure 9(a) indicate that the proposed techniques work well even when the audit failure rate is high. For instance, the speedup obtained using Opt-Reorder is around 3 even when 20% of the queries fail the audit (a large fraction). The performance of Opt and Opt-Reorder is relatively close in this experiment because, as described above, the experiment is biased against Opt-Reorder.
7.5 Sensitivity to Audit Expression Selectivity

The results presented thus far have focused on the scenario of single tuple auditing. As discussed earlier, our optimization techniques naturally extend to support a larger class of audit expressions. In this section, we examine the sensitivity of the optimization techniques to the selectivity of the audit expression. Recall that the audit expression we consider selects the orders in a particular time window, specified by a predicate on the o_orderdate column. Figure 9(b) shows the results as the predicate range on the o_orderdate column is varied from a week to a year. The results show that the speedup obtained shrinks as the date range widens. This is expected because a larger date range increases the audit failure rate. However, the results show that even when the audit expression selects all orders made over a year, our optimization techniques still speed up execution over the baseline.

7.6 Sensitivity to Updates

Finally, we examine the sensitivity of Opt and Opt-Reorder to updates in the workload. Recall that both Opt and Opt-Reorder utilize the notion of instance equivalence rules. However, these

Figure 9: Varying (a) Audit Failure Rate (b) Audit Expression Selectivity

Figure 10: Effect of Updates

rules are only valid for a particular instance of the database and may need to be invalidated in the presence of updates. As discussed in Section 5, the rules can still be applied to the queries in a workload that fall between update statements. To study the effect of updates, we add an increasing number of update statements, uniformly spread through the workload of 30 instances of the stored procedure. Figure 10 shows the results for an audit expression that selects all orders made over a range of three months. For instance, the data point corresponding to 6% updates essentially splits the original 30-query sequence into 3 smaller subsequences of 10 queries each. The results indicate that Opt-Reorder yields a significant speedup even in the presence of a large fraction of updates (e.g., a speedup of a factor of 3 even when the percentage of updates is 15%). We find similar results when the updates are not uniformly distributed but skewed to appear among the last 10 queries of the original workload.

7.7 Summary

We now summarize our empirical results. Our experiments indicate that Opt-Reorder has the potential to significantly reduce the time taken to audit queries (especially for instances of a stored procedure). It is particularly effective when the database size is large and the audit failure fraction is low (which is likely to be common in practice).
Moreover, the performance improvements yielded by Opt-Reorder remain significant as input characteristics such as the data size, audit failure rate, audit expression selectivity, and the fraction of updates in the workload change. We defer a more thorough evaluation of our framework to future work.

8. RELATED WORK

Related work in the space of data auditing is discussed earlier in the paper. This section focuses mostly on work related to the audit optimizer. Our audit optimizer is architected like a rule-based optimizer [12, 13] and can be extended by adding transformation rules. In addition to regular transformation rules, the audit optimizer also leverages instance equivalence rules that are valid only for the current instance. While the notion of leveraging additional semantic rules has previously been studied in the area of semantic query optimization [14, 15], the main difference in our approach is that we model the rules algebraically (in the form of logical plans), which lets us easily integrate the instance equivalence rules with the standard transformation rules in the optimizer search space. Further, auditing previous queries is a novel and natural source of instance equivalence rules. Our techniques for efficiently performing an audit are also reminiscent of previous work on answering queries using views [21]. For instance, instead of using a rule-based framework, we could leverage previously audited queries via the following problem formulation: given a set of views corresponding to previously audited queries, we can avoid executing the current query and its differential if the query can be rewritten completely in terms of the views. We do not use this formulation for two reasons: (1) the problem of answering queries using views is known to be hard for complex queries and views [21], and (2) the view language supported by database systems is typically very limited.
Finally, the techniques proposed in Section 5.1 for simple queries are based on techniques for incremental view maintenance [10]. Most techniques for incremental view maintenance are restricted to simple classes of queries. Moreover, our rule-based framework is complementary in that it attempts to avoid query execution altogether.

9. CONCLUSIONS

In this paper we study the problem of auditing real-world queries, which typically leverage constructs such as subqueries. We establish that the instance-independent semantics is computationally incompatible with complex queries. We therefore revisit the instance-dependent approach and propose a semantics for auditing that is feasible for arbitrary SQL. We formally analyze the privacy guarantees yielded by our approach. We also propose a novel audit optimization framework that is general and attempts to perform auditing without any query execution. The key advantage of both the semantics and the optimization techniques is that they apply to arbitrarily complex SQL queries. Our initial experiments indicate that the audit optimizer can significantly reduce the time taken to audit complex queries. We view our work as an important first step in auditing complex SQL queries with privacy guarantees. Several interesting avenues remain for future work: (1) further improving the efficiency of auditing, (2) understanding the interplay between updates and queries, and (3) drilling down into unsafe queries.

REFERENCES

[1] Guardium.
[2] Lumigent AuditDB.
[3] Oracle Corporation. Oracle Audit Vault.
[4] R. Agrawal, R. J. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau, and R. Srikant. Auditing compliance with a Hippocratic database. In VLDB.
[5] R. Motwani, S. U. Nabar, and D. Thomas. Auditing SQL queries. In ICDE.
[6] A. Machanavajjhala and J. Gehrke. On the efficiency of checking perfect privacy. In PODS.
[7] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. JCSS, vol. 73, no. 3.
[8] T. S. Jayram, P. G. Kolaitis, and E. Vee. The containment problem for real conjunctive queries with inequalities. In PODS.
[9] S. Rizvi, A. O. Mendelzon, S. Sudarshan, and P. Roy. Extending query rewriting techniques for fine-grained access control. In SIGMOD.
[10] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Engineering Bulletin, vol. 18, pp. 3–18.
[11] D. Fabbri, K. LeFevre, and Q. Zhu. PolicyReplay: Misconfiguration-response queries for data breach reporting. In VLDB.
[12] G. Graefe. Volcano: An extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., vol. 6, no. 1.
[13] H. Pirahesh, T. Y. C. Leung, and W. Hasan. A rule engine for query transformation in Starburst and IBM DB2 C/S DBMS. In ICDE.
[14] M. Hammer and S. B. Zdonik. Knowledge-based query processing. In VLDB.
[15] J. J. King. QUIST: A system for semantic query optimization in relational databases. In VLDB.
[16] C. A. Galindo-Legaria and M. Joshi. Orthogonal optimization of subqueries and aggregation. In SIGMOD.
[17] R. Guravannavar, H. S. Ramanujam, and S. Sudarshan. Optimizing nested queries with parameter sort orders. In VLDB.
[18] Special issue on self-managing systems. IEEE Data Eng. Bull., vol. 29, no. 3.
[19] Microsoft Corporation. SQL Server 2008 Auditing.
[20] The TPC-H Benchmark.
[21] A. Y. Halevy. Answering queries using views: A survey. VLDB J., vol. 10, no. 4.
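The intuition behind auditing without query execution can be conveyed with a minimal sketch. This is illustrative only, not the paper's actual rule framework: it assumes both the query and the audit expression carry a simple range predicate over one numeric column, and certifies the query safe when the two ranges are provably disjoint, so no data access is needed. All names below are hypothetical.

```python
# Toy sketch of static "audit without execution": if a query's selection
# predicate and the audit (sensitive-data) predicate can never select the
# same row, the query is provably safe without running it.
# Illustrative only; real SQL auditing must handle far richer predicates.

from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval constraint on a single numeric column."""
    column: str
    lo: float
    hi: float


def provably_disjoint(query_pred: Range, audit_pred: Range) -> bool:
    """Return True if the two predicates cannot both hold for any row,
    i.e. the query provably avoided the sensitive data."""
    if query_pred.column != audit_pred.column:
        # Different columns: nothing can be concluded statically here.
        return False
    return query_pred.hi < audit_pred.lo or audit_pred.hi < query_pred.lo


# Example: the query read salaries below 50k, while the audit marks
# salaries of 100k and above as sensitive; the ranges are disjoint.
q = Range("salary", 0, 49_999)
s = Range("salary", 100_000, float("inf"))
print(provably_disjoint(q, s))  # prints True: safe, no execution needed
```

When the static check fails (overlapping or incomparable predicates), an auditor would fall back to the instance-dependent check that actually inspects the data, which is exactly the work the optimizer tries to avoid.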


More information

[Refer Slide Time: 05:10]

[Refer Slide Time: 05:10] Principles of Programming Languages Prof: S. Arun Kumar Department of Computer Science and Engineering Indian Institute of Technology Delhi Lecture no 7 Lecture Title: Syntactic Classes Welcome to lecture

More information

Differential privacy in health care analytics and medical research An interactive tutorial

Differential privacy in health care analytics and medical research An interactive tutorial Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012 Overview 1. Releasing medical data: What could

More information

Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology

Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology COS 597A: Principles of Database and Information Systems elational model elational model A formal (mathematical) model to represent objects (data/information), relationships between objects Constraints

More information

IMPLEMENTING FORENSIC READINESS USING PERFORMANCE MONITORING TOOLS

IMPLEMENTING FORENSIC READINESS USING PERFORMANCE MONITORING TOOLS Chapter 18 IMPLEMENTING FORENSIC READINESS USING PERFORMANCE MONITORING TOOLS Franscois van Staden and Hein Venter Abstract This paper proposes the use of monitoring tools to record data in support of

More information

The Recovery of a Schema Mapping: Bringing Exchanged Data Back

The Recovery of a Schema Mapping: Bringing Exchanged Data Back The Recovery of a Schema Mapping: Bringing Exchanged Data Back MARCELO ARENAS and JORGE PÉREZ Pontificia Universidad Católica de Chile and CRISTIAN RIVEROS R&M Tech Ingeniería y Servicios Limitada A schema

More information

Databases 2011 The Relational Model and SQL

Databases 2011 The Relational Model and SQL Databases 2011 Christian S. Jensen Computer Science, Aarhus University What is a Database? Main Entry: da ta base Pronunciation: \ˈdā-tə-ˌbās, ˈda- also ˈdä-\ Function: noun Date: circa 1962 : a usually

More information

Incorporating Evidence in Bayesian networks with the Select Operator

Incorporating Evidence in Bayesian networks with the Select Operator Incorporating Evidence in Bayesian networks with the Select Operator C.J. Butz and F. Fang Department of Computer Science, University of Regina Regina, Saskatchewan, Canada SAS 0A2 {butz, fang11fa}@cs.uregina.ca

More information

Security and Scalability in a Biomunicipal Network - Part 1

Security and Scalability in a Biomunicipal Network - Part 1 Simultaneous Scalability and Security for Data-Intensive Web Applications Amit Manjhi, Anastassia Ailamaki, Bruce M. Maggs, Todd C. Mowry, Christopher Olston, Anthony Tomasic Carnegie Mellon University

More information